Readings in Database Systems, 3rd Edition

Stonebraker & Hellerstein, eds.

Distributed Transactions & Replication

Transaction Management in R*

Unravels details of logging & messages sent.


  • update in place, WAL
  • batched force of log records

Desired Characteristics

  • guaranteed xact atomicity
  • ability to "forget" outcome of commit ASAP
  • minimal log writes & message traffic
  • optimized performance in no-failure case
  • exploitation of completely or partially R/O xacts
  • maximize ability to perform unilateral abort

In order to minimize logging and comm:

  • rare failures do not deserve extra overhead in normal processing
  • hierarchical commit better than 2P

Normal Processing (2PC)

[Figure omitted: Coordinator Log and Subordinate Log during normal 2PC processing]

Since subords force their abort (and commit) records before ACKing, they never need to ask the coord about the final outcome.
Rule: never need to ask about something you used to know; log before ACKing.
    Guarantees atomicity.

Total cost:
     subords: 2 forced log-writes (prepare/commit), 2 messages (YES/ACK)
     coord: 1 forced log write (commit), 1 async log write (end), 2 messages/subord (prepare/commit)
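
The cost tally above can be checked with a small simulation of one no-failure commit. This is a minimal sketch, not R* code: the `Site` class, `run_commit`, and `cost` are hypothetical names introduced here, and each log write is just recorded as forced or unforced.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    log: list = field(default_factory=list)   # (record, forced) pairs
    sent: list = field(default_factory=list)  # names of messages this site sent

    def write(self, record, forced):
        self.log.append((record, forced))

    def send(self, msg):
        self.sent.append(msg)


def run_commit(n_subords):
    """Simulate one no-failure 2PC commit; return (coord, subords)."""
    coord = Site("coord")
    subs = [Site(f"sub{i}") for i in range(n_subords)]

    # Phase 1: coord sends PREPARE; each subord logs before voting.
    for s in subs:
        coord.send("PREPARE")
        s.write("prepare", forced=True)
        s.send("YES")

    # All votes are YES: forcing the commit record is the commit point.
    coord.write("commit", forced=True)

    # Phase 2: coord sends COMMIT; each subord logs before ACKing.
    for s in subs:
        coord.send("COMMIT")
        s.write("commit", forced=True)
        s.send("ACK")

    # After all ACKs the coord can forget the xact.
    coord.write("end", forced=False)
    return coord, subs


def cost(site):
    """Return (forced writes, unforced writes, messages sent) for a site."""
    forced = sum(1 for _, f in site.log if f)
    return forced, len(site.log) - forced, len(site.sent)


coord, subs = run_commit(3)
print(cost(coord))    # coordinator totals for 3 subords
print(cost(subs[0]))  # per-subordinate totals
```

With 3 subords the coordinator sends 6 messages (2 per subord), matching the per-subord count in the tally.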

2PC & Failures

Recovery process per site handles xacts committing at crash, as well as incoming recovery messages.

  1. on restart, read log and accumulate committing xact info in main mem
  2. if you discover a local xact in the prepared state, contact coord to find out what to do
  3. if you discover a local xact that didn’t get prepared, UNDO it, write abort record, and forget
  4. if a local xact was committing (i.e. this is the coord), then send out COMMIT msgs to subords that haven’t ACKed. Similar for aborting.
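
The restart steps above can be sketched as a single pass over the log. This is a simplified illustration under assumed conventions: the `(xact_id, role, record)` log format, the record names, and the function `plan_recovery` are hypothetical, not taken from the paper.

```python
def plan_recovery(log):
    """Scan a site's log after a crash and decide what to do per xact.

    `log` is a list of (xact_id, role, record) tuples in write order, where
    role is "coord" or "subord" and record is one of
    "update", "prepare", "commit", "abort", "end" (assumed names).
    """
    last = {}
    for xact_id, role, record in log:
        last[xact_id] = (role, record)  # keep the most recent record per xact

    plan = {}
    for xact_id, (role, record) in last.items():
        if record == "update":
            # Never got prepared: safe to abort unilaterally.
            plan[xact_id] = "undo, write abort record, forget"
        elif role == "subord" and record == "prepare":
            # In doubt: only the coord knows the outcome.
            plan[xact_id] = "ask coord for outcome"
        elif role == "coord" and record in ("commit", "abort"):
            # Decision was logged but no end record: finish phase 2.
            plan[xact_id] = f"resend {record.upper()} to subords that have not ACKed"
        # An end record, or a subord's own commit/abort record, needs no
        # further protocol work in this sketch.
    return plan


log = [
    ("T1", "subord", "update"), ("T1", "subord", "prepare"),
    ("T2", "subord", "update"),
    ("T3", "coord", "commit"),
    ("T4", "coord", "commit"), ("T4", "coord", "end"),
]
for xact_id, action in plan_recovery(log).items():
    print(xact_id, "->", action)
```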

Upon discovering a failure elsewhere:

    If a coord discovers that a subord is unreachable...
          while waiting for its vote: coord aborts xact as usual
          while waiting for an ACK: coord gives xact to recovery manager
    If a subord discovers that a coord is unreachable...
          if it hasn’t sent a YES vote yet: abort ("unilateral abort")
          if it has sent a YES vote, subord gives xact to recovery manager

 If a recovery mgr receives an inquiry from a subord in prepared state
          if main mem info says xact is committing or aborting, send COMMIT/ABORT
          if main mem info says nothing...?

An Aside: Hierarchical 2PC

If you have a tree-shaped process graph.

     root (which talks to user) is a coordinator
     leaves are subordinates
     interior nodes are both
          after receiving PREPARE, propagate it to children
          vote after children; any NO below causes a NO vote
          after receiving a COMMIT message, force-write the commit log record, ACK to parent, and propagate to children. Similar for ABORT.
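
The voting rule for the tree can be sketched recursively. This is a minimal illustration that models only the vote phase (no logging or message delivery); `Node`, `prepare`, and `decide` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    local_vote: str = "YES"                 # this process's own vote
    children: list = field(default_factory=list)


def prepare(node):
    """Propagate PREPARE to the children, then vote for the whole subtree."""
    child_votes = [prepare(child) for child in node.children]  # children vote first
    if node.local_vote == "NO" or "NO" in child_votes:
        return "NO"                         # any NO below causes a NO vote
    return "YES"


def decide(root):
    """The root acts as coordinator and turns the combined vote into an outcome."""
    return "COMMIT" if prepare(root) == "YES" else "ABORT"


leaf_no = Node("c", local_vote="NO")
tree = Node("root", children=[Node("a", children=[leaf_no]), Node("b")])
print(decide(tree))            # a NO at leaf "c" propagates up to the root
leaf_no.local_vote = "YES"
print(decide(tree))            # all YES
```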


Presumed Abort
     recall... if main-mem info says nothing, coord says ABORT
     SO... coord can forget a xact immediately after deciding to abort it! (write abort record, THEN forget)
     abort can be async write
     no ACKS required from subords on ABORT
     no need to remember names of subords in abort record, nor write end record after abort
     if coord sees subord has failed, need not pass xact to recovery system; can just ABORT.
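
A minimal sketch of the coordinator side of Presumed Abort, assuming a hypothetical `PresumedAbortCoordinator` class whose main-memory table holds only committing xacts. It is meant to show why an abort can be forgotten immediately: an inquiry that finds no entry is answered with ABORT.

```python
class PresumedAbortCoordinator:
    def __init__(self):
        self.in_memory = {}  # xact_id -> subords that still owe an ACK
        self.log = []        # (xact_id, record, forced)

    def commit(self, xact_id, subords):
        self.log.append((xact_id, "commit", True))   # forced: the commit point
        self.in_memory[xact_id] = set(subords)       # must remember until ACKed

    def ack(self, xact_id, subord):
        waiting = self.in_memory[xact_id]
        waiting.discard(subord)
        if not waiting:                              # every subord has ACKed
            self.log.append((xact_id, "end", False))
            del self.in_memory[xact_id]

    def abort(self, xact_id):
        self.log.append((xact_id, "abort", False))   # async write is enough
        self.in_memory.pop(xact_id, None)            # forget right away, no ACKs

    def inquiry(self, xact_id):
        # No information in main memory means ABORT.
        return "COMMIT" if xact_id in self.in_memory else "ABORT"


coord = PresumedAbortCoordinator()
coord.commit("T1", ["s1", "s2"])
coord.abort("T2")
print(coord.inquiry("T1"), coord.inquiry("T2"), coord.inquiry("T9"))
```

Once every subord has ACKed a commit, the entry is dropped too. That is safe because a subord that has ACKed already knows the outcome and never asks again.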

Now, look at R/O xacts:
     subords who have only read send READ VOTEs instead of YES VOTEs, release locks, write no log records
     logic is: READ & YES = YES, READ & NO = NO, READ & READ = READ
     if all votes are READ, there’s no second phase
     commit record at coord includes only YES sites
     Tallying up the R/O work:
          nobody writes log records
          nonleaf processes send one message to children
          children send one message (to parent)
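
The vote logic above fits in a few lines. In this sketch, `combine` and `second_phase_sites` are hypothetical helper names; the second function returns the sites that still need to hear the outcome, which excludes READ voters.

```python
def combine(votes):
    """READ & YES = YES, READ & NO = NO, READ & READ = READ."""
    votes = list(votes)
    if "NO" in votes:
        return "NO"
    if "YES" in votes:
        return "YES"
    return "READ"


def second_phase_sites(votes_by_site):
    """Return (overall vote, sites that take part in the second phase)."""
    overall = combine(votes_by_site.values())
    if overall == "READ":
        return overall, []          # all READ: there is no second phase
    # READ voters have already released their locks and dropped out.
    yes_sites = sorted(s for s, v in votes_by_site.items() if v == "YES")
    return overall, yes_sites


print(second_phase_sites({"a": "READ", "b": "YES", "c": "READ"}))
print(second_phase_sites({"a": "READ", "b": "READ"}))
```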

Presumed Commit

Idea: let’s invert the logic above, since commit is the fast path:

     require ACK for ABORT, not COMMIT
     subords force abort records, not commit
     no information? Presume commit!


Problem: a naive inversion breaks atomicity in this scenario:

  1. subord prepares to commit
  2. coord crashes
  3. on restart, coord aborts transaction and forgets it
  4. subord asks about transaction, coord says "no info = commit!"
  5. subord commits, everyone else does not

     Fix: coord records names of subords on stable storage before allowing them to prepare ("collecting" record)
     then it can tell them about aborts on restart
     everything else analogous (mirror) to Presumed Abort
     Tallying up the R/O work:
          nonleaf writes collecting (forced) and commit (async)
          nonleaf sends one message to all children (PREPARE)
          children send one message (to parent)
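
A minimal sketch of how the collecting record lets a restarted coordinator abort correctly under Presumed Commit. The class `PresumedCommitCoordinator`, its log format, and the `"UNDECIDED"` reply are assumptions made for illustration only.

```python
class PresumedCommitCoordinator:
    def __init__(self, stable_log=None):
        # Stable storage survives a crash: (xact_id, record, subords).
        self.stable_log = list(stable_log or [])
        self.in_memory = {}  # lost on a crash

    def start_prepare(self, xact_id, subords):
        # Forced BEFORE any subord is allowed to prepare.
        self.stable_log.append((xact_id, "collecting", tuple(subords)))
        self.in_memory[xact_id] = "collecting"

    def commit(self, xact_id):
        self.stable_log.append((xact_id, "commit", ()))
        self.in_memory.pop(xact_id, None)   # forget at once: no info = commit

    def restart(self):
        """Return {xact_id: subords that must be told ABORT} after a crash."""
        collecting, finished = {}, set()
        for xact_id, record, subords in self.stable_log:
            if record == "collecting":
                collecting[xact_id] = subords
            else:
                finished.add(xact_id)
        to_abort = {x: s for x, s in collecting.items() if x not in finished}
        self.in_memory = {x: "aborting" for x in to_abort}
        return to_abort

    def inquiry(self, xact_id):
        state = self.in_memory.get(xact_id)
        if state is None:
            return "COMMIT"                 # no information: presume commit
        return "ABORT" if state == "aborting" else "UNDECIDED"


before_crash = PresumedCommitCoordinator()
before_crash.start_prepare("T1", ["s1", "s2"])   # crash hits before a decision
before_crash.start_prepare("T2", ["s3"])
before_crash.commit("T2")

after_crash = PresumedCommitCoordinator(before_crash.stable_log)
print(after_crash.restart())                     # T1's subords must hear ABORT
print(after_crash.inquiry("T1"), after_crash.inquiry("T2"))
```

Without the collecting record, the restarted coordinator would have no trace of T1 and would answer the prepared subord's inquiry with COMMIT.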

Performance analysis in paper:
     PA > 2PC (> = "beats")
     PA > PC for R/O transactions
     for xacts with only one write subord, PC > PA (equal log writes, PA needs an ACK from subord)
     for n-1 writing subords, PC >> PA (equal logging, but PA forces the subords' commit records n-1 times when PC does not. Also, PA sends n extra messages)
     choice between PA and PC could be made on a transaction-by-transaction basis


Gray, et al on Replication

The Upshot: Deadlock/reconciliation rates grow polynomially (up to cubically) with replication factor, not linearly. This gets even worse with disconnected operation (mobile computers).


  • eager replication (deadlocks) vs. lazy replication (reconciliations)
  • group (update anywhere) vs. master (primary copy)
  • scaleup pitfall: replication looks fine on small demos, dies when you scale up
  • system delusion: lots of inconsistent versions floating around, too hard to reconcile them all

 Important Observations

  • in group mode, every update generates an update at all nodes (i.e. nodes times more work). This generates actions^2 more work!
  • Eager Group replication:
    • deadlocks grow like nodes^3
    • deadlocks grow like actions^5 (where actions = # of actions per transaction)
    • mobile nodes cannot run when disconnected
  • Eager Master replication
    • Reduces deadlocks: like a single-site system with higher TPS
    • Still grow like actions^5
  • Lazy Group replication
    • transaction which would wait under eager needs to be reconciled here
      • Assume wait is rare. Deadlock is rare^2 (i.e. very unlikely)
      • so reconciliation is much more common than deadlock
    • reconciliations grow like TPS^2 × (actions × nodes)^3
  • Lazy Master replication
    • like RPC mechanism of previous paper (reads should send read-lock requests to master)
    • deadlock rate at master grows with (TPS × nodes)^2 × actions^4
    • mobile nodes can’t run while disconnected
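
To get a feel for these growth rates, the sketch below computes only relative growth factors: how many times worse a rate gets when nodes, actions, or TPS are scaled up. The exponents are the ones listed above; the function names are hypothetical, and the constant factors from the paper are deliberately left out.

```python
def eager_group_deadlock_growth(node_factor=1, action_factor=1):
    """Deadlocks grow like nodes^3 and like actions^5."""
    return node_factor ** 3 * action_factor ** 5


def lazy_group_reconciliation_growth(tps_factor=1, action_factor=1, node_factor=1):
    """Reconciliations grow like TPS^2 x (actions x nodes)^3."""
    return tps_factor ** 2 * (action_factor * node_factor) ** 3


def lazy_master_deadlock_growth(tps_factor=1, node_factor=1, action_factor=1):
    """Deadlocks at the master grow like (TPS x nodes)^2 x actions^4."""
    return (tps_factor * node_factor) ** 2 * action_factor ** 4


# Going from a small demo to ten times as many nodes:
print(eager_group_deadlock_growth(node_factor=10))        # 1000x more deadlocks
print(lazy_group_reconciliation_growth(node_factor=10))   # 1000x more reconciliations
print(lazy_master_deadlock_growth(node_factor=10))        # 100x more deadlocks
# Doubling the actions per transaction under eager replication:
print(eager_group_deadlock_growth(action_factor=2))       # 32x more deadlocks
```

This is the scaleup pitfall in numbers: a configuration that deadlocks rarely on a few nodes can deadlock a thousand times more often on ten times as many.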

 Intuitions for a Solution

  • checkbook example
  • Lotus Notes example: convergence semantics (with no new updates & connectivity, everybody will eventually get the same state). Uses timestamps.
  • lost update problem (more recent account update wins, old one is lost)
    • solution: commutative operations (e.g. increment/decrement)
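
A small sketch of the lost update problem and the commutative fix, using a made-up account balance. `last_writer_wins` and `apply_deltas` are hypothetical helpers; the timestamp rule stands in for the timestamp-based convergence mentioned above.

```python
def last_writer_wins(versions):
    """versions: list of (timestamp, balance); the newest timestamp wins."""
    return max(versions, key=lambda v: v[0])[1]


def apply_deltas(start, deltas):
    """Apply increments/decrements; the order does not matter."""
    balance = start
    for delta in deltas:
        balance += delta
    return balance


start = 100
# Two replicas update the same account while out of contact.
replica_a = (1, start + 50)   # deposit 50 at time 1  -> balance 150
replica_b = (2, start - 30)   # withdraw 30 at time 2 -> balance 70

# Timestamp reconciliation converges, but the deposit is lost.
print(last_writer_wins([replica_a, replica_b]))   # 70

# Shipping the operations instead of the values keeps both updates.
print(apply_deltas(start, [+50, -30]))            # 120
print(apply_deltas(start, [-30, +50]))            # 120
```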

Two-Tier Replication

  • The world consists of base nodes and mobile nodes
  • mobile nodes contain 2 versions of objects: a (maybe stale) master version, and a tentative version
  • 2 types of xacts:
    • base xacts work only on master data, involve at most 1 mobile node
    • tentative xacts work only on local tentative data. Only involve data mastered at base nodes or the local node (no other mobile nodes)
    • on reconnect:
      • tentative versions are removed
      • tentative xacts are rerun as real xacts
      • before committing the base xacts, an acceptance criterion is used to make sure the results are close enough to the original tentative versions
  • Features:
    • mobile nodes may make tentative updates
    • base xacts execute with single-copy serializability
    • xact is durable when the base xact completes
    • replicas at all sites converge to base system state
    • if all xacts commute, there are no reconciliations
  • Their "solution" to the dangers:
    • use lazy master with timestamps & commutativity to avoid high deadlock rates
    • use 2-tier replication to handle disconnected operation
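
The reconnect steps can be sketched as follows. This is a simplified, single-threaded illustration: `MobileNode`, `debit`, and the `non_negative` acceptance criterion are hypothetical, and the base system is modeled as a plain dictionary.

```python
class MobileNode:
    def __init__(self, master_copy):
        self.master = dict(master_copy)     # maybe-stale master version
        self.tentative = dict(master_copy)  # tentative version
        self.pending = []                   # (xact, accept, tentative result)

    def run_tentative(self, xact, accept):
        """Run a tentative xact against local tentative data only."""
        result = xact(self.tentative)
        self.pending.append((xact, accept, result))
        return result

    def reconnect(self, base):
        """Rerun tentative xacts as base xacts against the base data."""
        outcomes = []
        for xact, accept, tentative_result in self.pending:
            trial = dict(base)              # work on a copy until accepted
            base_result = xact(trial)
            if accept(tentative_result, base_result):
                base.clear()
                base.update(trial)          # base xact commits: now durable
                outcomes.append("committed")
            else:
                outcomes.append("rejected") # needs reconciliation
        self.pending = []
        self.master = dict(base)            # refresh the master version
        self.tentative = dict(base)         # old tentative versions are removed
        return outcomes


def debit(amount):
    def xact(db):
        db["balance"] -= amount
        return db["balance"]
    return xact


def non_negative(tentative_result, base_result):
    """Example acceptance criterion: the balance must not go negative."""
    return base_result >= 0


mobile = MobileNode({"balance": 100})
mobile.run_tentative(debit(30), non_negative)   # tentative balance 70
mobile.run_tentative(debit(40), non_negative)   # tentative balance 30

base = {"balance": 60}                          # changed while disconnected
print(mobile.reconnect(base), base)
```

Here the second debit passed locally against stale data but fails the acceptance criterion when rerun on the base copy, so it is rejected rather than silently applied.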

© 1998, Joseph M. Hellerstein.  Last modified 08/19/98.