redbookcover.gif (13955 bytes) Readings in Database Systems, 3rd Edition

Stonebraker & Hellerstein, eds.

POSTGRES Storage System

One More Concurrency Control Technique

Time Stamping: Bernstein TODS ’79 (see also paper in Stonebraker book)



Write TS Read TS


  • Every xact gets a unique timestamp at startup
  • on Read: OK if Xact TS > WTS. Install new RTS if xact TS > RTS.
  • on Write: OK if xact TS > MAX(RTS, WTS). Install new WTS.



  1. forces time-stamp order (tighter restriction than other schemes)
  2. cascaded aborts (no isolation)


Multi-version timestamping techniques:

  • Reed’s PhD, MIT ‘78
    • reads get the appropriate version
    • writes are a bit trickier – can be added if nobody read object between the new write and any "later" writes


Timestamping is not dead, but it is not popular, either.

POSTGRES Storage System

An extremely simple solution to the complex recovery problem.


  • WAL recovery code is really complicated (witness ARIES)
  • recovery code must be flawless
  • failures can be arbitrary – testing is hard to pull off

What’s wrong with this picture? 

          |     DBMS       |
               /       \
              /         \
           -----      -----
            DB         Log
           -----      -----

Alternative: A no-overwrite storage system.

  1. Time travel comes for free
  2. instantaneous recovery
  3. no crash recovery code


  • Life of a xact:
    • increment and grab current global XID
    • do processing
    • change status to committed in log (more on this below)
    • force data & log to stable storage (in that order!)
  • There is a log:
    • tail of log (oldest active xact to present) needs 2 bits per transaction to record state (committed, aborted, in progress)
    • body of log needs only 1 bit per xact (committed or aborted)
    • at 1 xact per second, 1 year of transactions fits in 4Mb log space!
    • Detail: if this is still too big, use a Bloom filter to represent aborted xacts (lossy compression)
    • with just a little NVRAM, the log essentially never needs forcing


Each tuple has a bunch of system fields:

  • OID: a database-wide unique ID across all time
  • Xmin: XID of inserter
  • Tmin: commit time of Xmin
  • Cmin: command ID of inserter
  • Xmax: XID of deleter (if any)
  • Tmax: commit time of Xmax (if any)
  • Cmax: command ID of deleter (if any)
  • PTR: pointer to chain of deltas


Updates work as follows:

  1. Xmax & Cmax set to updater’s XID
  2. new replacement tuple appended to DB with:
    • OID of old record
    • Xmin & Cmin = XID of updater
    • in fact, store this as delta off original tuple


Deleters simply set Xmax & Cmax to their XID

The first version of a record is called the Anchor Point, which has a chain of associated delta records

"Hopefully", delta records fit on the same page as their anchor point.

CC, Timestamps, Archiving:

If we actually got timestamps at xact start, we’d get timestamp ordering CC.

Instead, do 2PL, and get timestamp at commit time.

How to set Tmin and Tmax if you don’t have the commit time?

  • XID is taken as an oid from the TIME relation
  • at commit:
    • update your appropriate TIME tuple with the wall-clock time.
    • then force data pages to stable storage, change status to committed in tail of log
  • 3 levels of archiving
    1. no archive: old versions not needed
    2. light archive: old versions not to be accessed often
    3. heavy archive: old versions to be accessed regularly
  • on first access to a tuple from a "heavy archive" relation, you update the OIDs in Tmin and Tmax with values from the TIME relation

Time Travel

Allows queries over a table as of some wall-clock time in the past.

Rewrite queries to handle the system fields in tuples

Reading a Record: get record, follow delta chain until you’ve got the appropriate version constructed.

Indexes all live on disk, and are updated in place (overwrites here)


  • historical data can be forced to archive via the vacuum cleaner
    1. write archive record(s)
    2. write new anchor record
    3. reclaim space of old anchor/deltas
  • crash during vacuum?
    • indexes may lose archive records: this will be discovered at runtime and fixed via a Seq. Scan
    • duplicate records may be forced to archive: OK because POSTGRES doesn’t do multisets
  • Can build R-trees over lifetime intervals of data on archive

Performance Study vs. WAL


  • records fit on a single page
  • deltas live on the same page as anchors
  • single-record xacts
  • update-only workload (?!)

NVRAM required to make POSTGRES compete on this benchmark.

The Real Story

  • Illustra never claimed to be a TP competitor
  • Informix replaced Illustra’s no-overwrite storage manager with Informix’s WAL
  • List of Problems (to be discussed in class)
  • Tuple differencing never implemented
  • R-trees over archive not used
  • Tie-ins with LFS. What makes LFS more viable (if anything??)

1998, Joseph M. Hellerstein.  Last modified 08/18/98.
Feedback welcomed.