DB/OS Interface

Stonebraker's OS dump

What do Operating Systems do that force DBMS designers to start from scratch? Can the OS take over some of this stuff?

Note: Some of these gripes have been fixed in commercial OSes, some in research OSes. Many remain as artifacts of the distinction between OSes and DBMSs.

Buffer Pool

Performance problems

getting a page from the OS buffer pool into user space costs a system call (a user/kernel crossing, and a process switch on some OSes) plus a memory-to-memory copy

Replacement policy control

LRU/clock/etc. are often ineffective for DB workloads (they assume stochastic reference patterns, as observed in non-DB workloads)
we will see this in detail when we discuss buffer management
DBMS knows access pattern in advance - should dictate policy
This is a major OS/DBMS distinction!
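
A minimal sketch of why a fixed OS policy can hurt: a relation that is one page larger than the buffer pool is scanned repeatedly. The workload, pool size, and the function name count_misses are illustrative assumptions, not taken from the paper. LRU always evicts exactly the page that will be needed soonest; a DBMS that knows the scan will repeat could pick MRU instead.

```python
from collections import OrderedDict

def count_misses(accesses, capacity, policy):
    """Simulate a buffer pool and return the number of page misses.

    policy is "LRU" (evict least recently used) or
    "MRU" (evict most recently used).
    """
    pool = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for page in accesses:
        if page in pool:
            pool.move_to_end(page)  # mark as most recently used
            continue
        misses += 1
        if len(pool) >= capacity:
            # last=False pops the LRU entry, last=True pops the MRU entry
            pool.popitem(last=(policy == "MRU"))
        pool[page] = True
    return misses

# Hypothetical workload: scan a 5-page relation 4 times with 4 buffer frames.
scan = list(range(5)) * 4
lru_misses = count_misses(scan, capacity=4, policy="LRU")
mru_misses = count_misses(scan, capacity=4, policy="MRU")
print(lru_misses, mru_misses)  # 20 8
```

LRU misses on all 20 accesses, while MRU misses only 8 times. The OS cannot make this choice because it does not know the scan will loop.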

Prefetch: note that DBMS knows of multiple "orders" for a set of records, OS only knows physical order
Crash Recovery: requires page-level control of flushing (examples are out of date, but point remains)

File System

lack of clustering (translate: block granularity problem. Want segments.)
multiple trees: (dir, file, database). Unify?
put DBMS below file system (Inversion)? Perhaps integrate RSS and file system (SHORE)?

Scheduling, IPC

process-per-user vs. server tradeoffs
Convoys (also mentioned in System R retrospective). System R's solution? Not a problem in a single-server model (why?)
I/O process model

Expensive messages?
Some systems do this now & get good performance.

Consistency/Locking

OS locking granularity fixed (pages)
byte-level recovery (fsck) doesn't help with transactions
locking, recovery & buffer management interact

e.g.: must write log pages in order, so must be able to flush individual pages. can't release locks until commit is logged.
if any of these is missing from OS, DBMS must implement all of them
Note: one of the problems in "modular DB design". It's even worse than this -- Access Methods also tie in!
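
A rough sketch of the ordering constraint above, using ordinary files. The function name commit_update, the log record format, and the 4096-byte page size are assumptions made for illustration. Note that os.fsync forces the whole file, not a single page, which is exactly the coarse control the notes complain about; it stands in here for a page-level flush.

```python
import os
import tempfile

def commit_update(log_path, data_path, page_no, page_bytes, page_size=4096):
    """Write-ahead ordering: make the log durable before the data page."""
    # 1. Append the log records and force them to disk first.
    with open(log_path, "ab") as log:
        log.write(b"UPDATE %d\n" % page_no)
        log.write(b"COMMIT\n")
        log.flush()
        os.fsync(log.fileno())
    # 2. Only now is the commit durable; a lock manager (not modeled here)
    #    could release the transaction's locks at this point.
    # 3. The data page may be written at any later time.
    with open(data_path, "r+b") as data:
        data.seek(page_no * page_size)
        data.write(page_bytes)
        data.flush()
        os.fsync(data.fileno())

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "wal.log")
    data_path = os.path.join(tmp, "data.db")
    open(data_path, "wb").close()  # start with an empty data file
    commit_update(log_path, data_path, page_no=2, page_bytes=b"x" * 4096)
    with open(log_path, "rb") as f:
        log_contents = f.read()
    data_size = os.path.getsize(data_path)

print(log_contents, data_size)
```

Even in this toy version, the log write, the lock release point, and the data page flush are coupled, so none of them can be delegated to the OS independently.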

Should virtual memory (mmap) be used for database?

address space big enough?

2^32 bytes = 4 GB
2^64 bytes = 16,777,216 TB = 16,384 Petabytes = 16 Exabytes (1 Exabyte is about 10^18 bytes)
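
The arithmetic can be checked directly, assuming binary units (1 GB = 2^30 bytes, and so on); the variable names are illustrative.

```python
GIB, TIB, PIB, EIB = 2**30, 2**40, 2**50, 2**60

space32 = 2**32  # bytes addressable with 32-bit addresses
space64 = 2**64  # bytes addressable with 64-bit addresses

print(space32 // GIB)  # 4
print(space64 // TIB)  # 16777216
print(space64 // PIB)  # 16384
print(space64 // EIB)  # 16
```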

Page table overhead (4 byte overhead per page?)

2 page faults per I/O, or pin page table in memory
disk segments use start/offset notation for compactness
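
A back-of-the-envelope sketch of the page table cost, assuming a flat table with one 4-byte entry per page (the figure guessed at above) and 4 KiB pages (an assumption; the notes do not give a page size). The function name page_table_bytes is hypothetical.

```python
def page_table_bytes(db_bytes, page_size=4096, entry_bytes=4):
    """Bytes of page table needed to map db_bytes, one entry per page."""
    pages = -(-db_bytes // page_size)  # ceiling division
    return pages * entry_bytes

GIB = 2**30
for db_gib in (1, 100, 1024):
    overhead = page_table_bytes(db_gib * GIB)
    print(f"{db_gib:5d} GiB database -> {overhead / 2**20:.0f} MiB of page table")
```

Under these assumptions the table costs about 1 MiB per GiB mapped, so a large database either pins a sizeable table in memory or risks faulting on the table itself before faulting on the data page.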

Alternative: bind file chunks into address space

bookkeeping
cost of bind/unbind = cost of file open?
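
One way to picture binding a file chunk into the address space is mmap with an offset and length, so that only the chunk, not the whole file, is mapped. This is a sketch, not the mechanism the paper proposes; the chunk size, file layout, and the name read_chunk_via_binding are assumptions.

```python
import mmap
import os
import tempfile

# Offsets passed to mmap must be multiples of the allocation granularity.
CHUNK = mmap.ALLOCATIONGRANULARITY

def read_chunk_via_binding(path, chunk_no, nbytes):
    """Bind one chunk of the file, read from it, then unbind."""
    with open(path, "rb") as f:
        view = mmap.mmap(f.fileno(), CHUNK,
                         access=mmap.ACCESS_READ,
                         offset=chunk_no * CHUNK)  # bind
        try:
            return bytes(view[:nbytes])
        finally:
            view.close()  # unbind

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "segment.db")
    with open(path, "wb") as f:
        for fill in (b"A", b"B", b"C"):  # three chunks with distinct contents
            f.write(fill * CHUNK)
    first_bytes = read_chunk_via_binding(path, chunk_no=1, nbytes=4)

print(first_bytes)  # b'BBBB'
```

Each bind and unbind is a system call plus bookkeeping of which chunks are currently mapped, which is the cost the question above asks about.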

Buffering & virtual memory?

Annotated references!