Readings in Database Systems, 3rd Edition

Stonebraker & Hellerstein, eds.

DB/OS Interface

Stonebraker's OS dump

What do operating systems do that forces DBMS designers to start from scratch? Can the OS take over some of this stuff?

Note: Some of these gripes have been fixed in commercial OSes, and some in research OSes. Many remain as artifacts of the distinction between OSes and DBMSs.

Buffer Pool

  • Performance problems
    • getting a page from the OS buffer cache into user space costs a system call (a user/kernel switch, possibly a process switch) plus a memory-to-memory copy
  • Replacement policy control
    • LRU/clock/etc. often ineffective (they are statistical guesses about future references, tuned for non-DB workloads)
    • we will see this in detail when we discuss buffer management
    • DBMS knows access pattern in advance - should dictate policy
    • This is a major OS/DBMS distinction!
  • Prefetch: note that DBMS knows of multiple "orders" for a set of records, OS only knows physical order
  • Crash Recovery: requires page-level control of flushing (examples are out of date, but point remains)
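To make the replacement-policy point concrete, here is a minimal sketch (not from the paper) that counts buffer hits when a relation slightly larger than the buffer pool is scanned repeatedly. The `simulate` function, the 5-page relation, and the 4-frame pool are all made up for illustration; MRU stands in for "a policy the DBMS might pick because it knows the scan will loop".

```python
from collections import OrderedDict

def simulate(policy, num_frames, accesses):
    """Count buffer hits for a page-reference string under 'LRU' or 'MRU'."""
    frames = OrderedDict()  # page ids, ordered from least to most recently used
    hits = 0
    for page in accesses:
        if page in frames:
            hits += 1
            frames.move_to_end(page)  # mark as most recently used
            continue
        if len(frames) >= num_frames:
            # LRU evicts the oldest entry; MRU evicts the newest one.
            frames.popitem(last=(policy == "MRU"))
        frames[page] = True
    return hits

# Hypothetical workload: scan a 5-page relation 4 times with only 4 frames.
scan = list(range(5)) * 4
lru_hits = simulate("LRU", 4, scan)
mru_hits = simulate("MRU", 4, scan)
print("LRU hits:", lru_hits)
print("MRU hits:", mru_hits)
```

Under LRU every reference misses, because each page is evicted just before it is needed again. A policy chosen with knowledge of the looping scan keeps most of the relation resident.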

File System

  • lack of clustering (translate: block granularity problem. Want segments.)
  • multiple trees: (dir, file, database). Unify?
  • put DBMS below file system (Inversion)? Perhaps integrate RSS and file system (SHORE)?

Scheduling, IPC

  • process-per-user vs. server tradeoffs
  • Convoys (also mentioned in System R retrospective).  System R's solution? Not a problem in a single-server model (why?)
  • I/O process model
    • Expensive messages?
    • Some systems do this now & get good performance.


Consistency Control

  • OS locking granularity fixed (pages)
  • byte-level recovery (fsck) doesn't help with transactions
  • locking, recovery & buffer management interact
    • e.g.: log pages must be written in order, so the DBMS must be able to flush individual pages; locks can't be released until the commit is logged.
    • if any of these is missing from OS, DBMS must implement all of them
    • Note: one of the problems in "modular DB design". It's even worse than this -- Access Methods also tie in!
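The interaction among locking, recovery, and buffer management can be sketched as a toy model. Everything here (the `MiniWAL` class, its fields, and the page and transaction names) is hypothetical and only illustrates the ordering constraints from the notes: the log record reaches stable storage before the data page it describes, and locks are released only after the commit record is logged.

```python
class MiniWAL:
    """Toy model: a log, a set of buffered pages, and a lock table."""

    def __init__(self):
        self.log = []          # log records as (lsn, txn, kind)
        self.flushed_lsn = 0   # highest LSN assumed to be on stable storage
        self.page_lsn = {}     # page id -> LSN of its latest update
        self.locks = {}        # page id -> transaction holding the lock
        self.disk = set()      # page ids that have been written out

    def _append(self, txn, kind):
        lsn = len(self.log) + 1
        self.log.append((lsn, txn, kind))
        return lsn

    def update(self, txn, page):
        self.locks[page] = txn
        self.page_lsn[page] = self._append(txn, "update")

    def flush_log(self, up_to_lsn):
        # Log pages go out in order, so the flushed mark only moves forward.
        self.flushed_lsn = max(self.flushed_lsn, up_to_lsn)

    def flush_page(self, page):
        # Write-ahead rule: the log record describing the page goes first.
        self.flush_log(self.page_lsn[page])
        self.disk.add(page)

    def commit(self, txn):
        lsn = self._append(txn, "commit")
        self.flush_log(lsn)  # the commit counts only once it is logged
        # Only now is it safe to release the transaction's locks.
        self.locks = {p: t for p, t in self.locks.items() if t != txn}

db = MiniWAL()
db.update("T1", "P7")
db.flush_page("P7")                 # forces the log out through LSN 1 first
flushed_before_commit = db.flushed_lsn
held_before_commit = dict(db.locks)
db.commit("T1")
```

Each method touches state that belongs to another component, which is why a DBMS that lacks any one of these services from the OS ends up implementing all three.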

Should virtual memory (mmap) be used for database?

  • address space big enough?
    • 2^32 bytes = 4 GB
    • 2^64 bytes = 16,777,216 TB = 16,384 petabytes = 16 exabytes (1 exabyte is about 10^18 bytes)
  • Page table overhead (4 byte overhead per page?)
    • 2 page faults per I/O, or pin page table in memory
    • disk segments use start/offset notation for compactness
  • Alternative: bind file chunks into address space
    • bookkeeping
    • cost of bind/unbind = cost of file open?
  • Buffering & virtual memory?
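The page-table overhead question can be worked through with a little arithmetic. This sketch assumes 4 KB pages and a flat (single-level) page table, neither of which is stated in the notes; the 4-byte entry size is the figure from the list above, and the 1 TB database is a made-up example.

```python
PAGE_SIZE = 4 * 1024   # assumed 4 KB pages (not stated in the notes)
PTE_BYTES = 4          # the "4 byte overhead per page" figure from the notes

def page_table_bytes(mapped_bytes, page_size=PAGE_SIZE, pte_bytes=PTE_BYTES):
    """Size of a flat page table covering mapped_bytes of address space."""
    pages = -(-mapped_bytes // page_size)  # ceiling division
    return pages * pte_bytes

GB = 2**30
table_4gb = page_table_bytes(4 * GB)      # a full 32-bit address space
table_1tb = page_table_bytes(1024 * GB)   # a hypothetical 1 TB database
print(table_4gb // 2**20, "MB of page table for 4 GB")
print(table_1tb // 2**30, "GB of page table for 1 TB")
```

With these assumptions the table is about one thousandth the size of the data it maps. That is small for 4 GB but large enough for big databases that the table itself may be paged, hence the choice between two page faults per I/O and pinning the table in memory.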

Annotated references!



1998, Joseph M. Hellerstein.  Last modified 08/18/98.
Feedback welcomed.