Stonebraker's OS dump
What do Operating Systems do that force DBMS designers to start from scratch? Can the
OS take over some of this stuff?
Note: Some of these gripes have been fixed in commercial OSes, some in
research OSes. Many remain as artifacts of the distinction between OSes and DBMSs.
- Performance problems
- getting a page from the OS to user space costs a system call (a switch into the kernel) plus a copy
- Replacement policy control
- LRU/clock/etc. often ineffective (stochastic, based on non-DB workloads)
- we will see this in detail when we discuss buffer management
- DBMS knows access pattern in advance - should dictate policy
- This is a major OS/DBMS distinction!
- Prefetch: note that DBMS knows of multiple "orders" for a set of records, OS
only knows physical order
- Crash Recovery: requires page-level control of flushing (examples are out of date, but
- lack of clustering (translate: block granularity problem. Want segments.)
- multiple trees: (dir, file, database). Unify?
- put DBMS below file system (Inversion)? Perhaps integrate RSS and file system (SHORE)?
- process-per-user vs. server tradeoffs
- Convoys (also mentioned in System R retrospective). System R's solution? Not a
problem in a single-server model (why?)
- I/O process model
- Expensive messages?
- Some systems do this now & get good performance.
- OS locking granularity fixed (pages)
- byte-level recovery (fsck) doesn't help with transactions
- locking, recovery & buffer management interact
- e.g.: log pages must be written in order, so the DBMS must be able to flush
individual pages; locks can't be released until the commit is logged.
- if any of these is missing from OS, DBMS must implement all of them
- Note: one of the problems in "modular DB design". It's even worse than this --
Access Methods also tie in!
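The interaction between locking, recovery, and buffer management noted above can be sketched in a few lines. This is a toy model, not any real system's design: the class MiniWAL and all of its method and field names are hypothetical, "stable storage" is just a counter, and locks are exclusive page locks only. It shows the two ordering rules from the notes: a page may be flushed only after the log record describing its update is stable, and locks are released only after the commit record is logged.

```python
class MiniWAL:
    """Toy model: log, buffer pool, and lock table sharing one ordering rule."""

    def __init__(self):
        self.log = []            # in-memory log records: (lsn, txn, kind)
        self.flushed_lsn = 0     # highest LSN assumed to be on stable storage
        self.page_lsn = {}       # page id -> LSN of the last update to it
        self.disk_pages = set()  # pages that have been written back
        self.locks = {}          # page id -> transaction holding the lock

    def _append(self, txn, kind):
        lsn = len(self.log) + 1
        self.log.append((lsn, txn, kind))
        return lsn

    def update(self, txn, page):
        # Take (or re-use) an exclusive lock on the page before updating it.
        holder = self.locks.setdefault(page, txn)
        if holder != txn:
            raise RuntimeError(f"page {page} is locked by {holder}")
        self.page_lsn[page] = self._append(txn, f"update {page}")

    def flush_log(self, up_to_lsn):
        # Log pages go out in order: flushing LSN n covers all earlier LSNs.
        self.flushed_lsn = max(self.flushed_lsn, up_to_lsn)

    def flush_page(self, page):
        # Write-ahead rule: the log record for the update must be stable first.
        if self.page_lsn[page] > self.flushed_lsn:
            self.flush_log(self.page_lsn[page])
        self.disk_pages.add(page)

    def commit(self, txn):
        commit_lsn = self._append(txn, "commit")
        self.flush_log(commit_lsn)  # the commit record must be stable...
        for page in [p for p, t in self.locks.items() if t == txn]:
            del self.locks[page]    # ...before any lock is released
        return commit_lsn
```

Note that flush_page needs page-level control over what goes to disk and when, and commit needs to know what the log has flushed before it touches the lock table. If the OS supplied only one of the three services, the DBMS would still have to reimplement it to enforce these orderings.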
Should virtual memory (mmap) be used for the database?
- address space big enough?
- 2^32 bytes = 4 GB
- 2^64 bytes = 16,777,216 TB = 16,384 petabytes = 16 exabytes (1 exabyte ~ 10^18 bytes)
- Page table overhead (4 byte overhead per page?)
- 2 page faults per I/O, or pin page table in memory
- disk segments use start/offset notation for compactness
- Alternative: bind file chunks into address space
- cost of bind/unbind = cost of file open?
- Buffering & virtual memory?
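The address-space and page-table figures above can be checked with a little arithmetic. The sketch below assumes 4 KiB pages and a flat, single-level page table with one entry per mapped page; both are simplifying assumptions, and only the 4-byte entry size comes from the notes. The function names are hypothetical. Units are binary (1 GiB = 2^30 bytes), which is how the conversions in the notes work out.

```python
PAGE_SIZE = 4 * 1024   # assumption: 4 KiB pages
PTE_SIZE = 4           # the 4-byte-per-page figure from the notes

GiB = 2 ** 30
TiB = 2 ** 40
PiB = 2 ** 50
EiB = 2 ** 60


def address_space_bytes(bits):
    """Number of bytes addressable with the given pointer width."""
    return 2 ** bits


def page_table_bytes(mapped_bytes, page_size=PAGE_SIZE, pte_size=PTE_SIZE):
    """Size of a flat page table covering mapped_bytes of database."""
    pages = -(-mapped_bytes // page_size)  # ceiling division
    return pages * pte_size


if __name__ == "__main__":
    print("32-bit:", address_space_bytes(32) // GiB, "GiB")
    print("64-bit:", address_space_bytes(64) // EiB, "EiB")
    print("page table for a 1 TiB database:",
          page_table_bytes(1 * TiB) // GiB, "GiB")
```

Under these assumptions, mapping a 1 TiB database costs about 1 GiB of page-table entries. That is large enough that the page table itself may get paged out, which is where the "2 page faults per I/O, or pin page table in memory" tradeoff comes from.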