Readings in Database Systems, 3rd Edition

Stonebraker & Hellerstein, eds.

Concurrency Control 4: Tree Locking Protocols

Kung and Robinson said we can do optimistic concurrency control on B-trees, since each lookup touches only h pages, and the tree will have about f^h pages (h = height of the tree, f = fanout).

  • Quibble 1: Everybody has already implemented locking, since it's a better general-purpose solution (and was invented first)
  • Quibble 2: The h pages are not chosen at random.

What can happen if you proceed optimistically?

Problem: Design a locking protocol that allows highly concurrent access to a B-tree.

Solution 1: release locks early (non-2PL!)  "Latches".  OK, but when?
Solution 1.1: Extra lock modes + "Safety analysis" (Bayer/Schkolnick)
Solution 1.2: Extra data structure details (Lehman/Yao)
Solution 1.3: Careful ordering of actions + "repositioning" logic (ARIES IM)

B-Link Trees (Lehman/Yao)

A super-high concurrency solution, at the expense of a little extra complexity in the data structure.

  • take a B+-tree (they call it a B*-tree)
  • add "high keys" to each page
  • add right-links to each page (Idea: think of two nodes with a right-link as one big node)
  • ensure that people search top-down, left-to-right
  • ensure that people insert bottom-up
  • Requires NO locking for read (!!)
  • "Lock coupling" for writes is rare (question: why is lock coupling so bad?)


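// Search for value v
current = root;
A = get(current);
while (current is not a leaf) {   // descend; scannode may hand back a right-link
        current = scannode(v, A);
        A = get(current);
}
while ((t = scannode(v, A)) == link pointer of A) {   // at the leaf level, keep moving right
        current = t;
        A = get(current);
}
if (v is in A) return(success);
else return(failure);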

Simple! Only trick is to have scannode know about high-keys and right-links.

(Footnote: The scannode(v,A) routine examines memory page A and finds the appropriate pointer for value v.  Note that it may return a right-link pointer instead of a child pointer.)
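
A minimal Python sketch of the search above, to make the role of high keys and right-links concrete. The in-memory `Node` layout (one high key per child in internal nodes, `INF` as the high key of the rightmost node on a level) is an assumption made for illustration, not the page format from the paper, and there is no disk I/O or concurrency here. The example tree is caught mid-split: leaf A has already split into A and B, but the parent has no entry for B yet.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional

INF = float("inf")

@dataclass
class Node:
    keys: List[float]                       # leaf: stored keys; internal: high key of each child
    children: List["Node"] = field(default_factory=list)  # empty for a leaf
    high_key: float = INF                   # largest key this node may hold
    right: Optional["Node"] = None          # right-link to the sibling on the same level

def scannode(v: float, node: Node) -> Optional[Node]:
    """Next node to visit for v: the right-link if v is past the high key,
    otherwise the covering child (None when a leaf already covers v)."""
    if v > node.high_key and node.right is not None:
        return node.right
    if not node.children:
        return None
    return node.children[bisect_left(node.keys, v)]

def search(root: Node, v: float) -> bool:
    node = root
    while node.children:                    # descend; may step right along the way
        node = scannode(v, node)
    while (t := scannode(v, node)) is not None:  # leaf level: chase right-links
        node = t
    return v in node.keys

# Leaf A = [5, 10, 15, 20] has split into A and B; the parent is not yet updated.
c = Node(keys=[30, 40])
b = Node(keys=[15, 20], high_key=20, right=c)
a = Node(keys=[5, 10], high_key=10, right=b)
root = Node(keys=[20, INF], children=[a, c])

found_15 = search(root, 15)   # parent sends us to A; the right-link leads to B
found_17 = search(root, 17)   # lands on B as well, but 17 is not stored
```

The reader never takes a lock: a stale parent pointer only costs one extra hop along a right-link.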


To insert, we first find a leaf node, keeping a stack of the rightmost node we visited at each level:

initialize stack;
current = root;
A = get(current);
while (current is not a leaf) {
        t = current;
        current = scannode(v, A);
        if (current is not the link pointer of A)
                push(t);   // remember the rightmost node visited on this level
        A = get(current);
}

When we get to the leaf level, we may need to search right for the appropriate leaf. The move_right procedure scans right across the bottom, with lock coupling (i.e. if you have to move right, first lock right neighbor, then release lock on current).

lock(current);
A = get(current);
move_right();   // lock-couple rightward until A is the leaf that should hold v
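
A small Python sketch of the lock coupling that move_right performs, assuming in-memory leaves that each carry their own `threading.Lock`. The `Leaf` class and the two-argument `move_right(node, v)` signature are hypothetical names chosen for this illustration; the pseudocode in these notes works on the implicit variables `current` and `A` instead.

```python
import threading
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Leaf:
    keys: List[int]
    high_key: float
    right: Optional["Leaf"] = None
    lock: threading.Lock = field(default_factory=threading.Lock)

def move_right(node: Leaf, v: float) -> Leaf:
    """Caller holds node.lock; returns the (still locked) leaf whose range covers v."""
    while v > node.high_key and node.right is not None:
        nxt = node.right
        nxt.lock.acquire()     # first lock the right neighbor...
        node.lock.release()    # ...then release the lock on the current node
        node = nxt
    return node

c = Leaf([30, 40], float("inf"))
b = Leaf([15, 20], 20, right=c)
a = Leaf([5, 10], 10, right=b)

a.lock.acquire()                       # writer arrives at A holding its lock
target = move_right(a, 17)             # 17 > high key of A, so couple over to B
held = (a.lock.locked(), b.lock.locked(), c.lock.locked())
target.lock.release()
```

At most two locks are held at any instant during the scan, and they are always taken left to right.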

Now, assuming the key/ptr pair is not already in the tree, we proceed to insert & possibly split:

Doinsertion:
if A is safe {
        insert new key/ptr pair on A;
        put(A, current);
        unlock(current);
} else { // gonna have to split
        u = allocate(1 new page for B);
        redistribute A over A and B;
        y = max value on A now;
        make high key of B equal old high key of A;
        make right-link of B equal old right-link of A;
        make high key of A equal y;
        make right-link of A point to B;
        put(B, u);        // write the new page first...
        put(A, current);  // ...then publish the split with a single write of A
        oldnode = current;
        new key/ptr pair = (y, u); // new high key of A, pointer to new page B
        current = pop(stack);
        lock(current);
        A = get(current);
        move_right(); // at this point we may have 3 locks: oldnode,
                      // and two at the parent level while moving right
        unlock(oldnode);
        goto Doinsertion;
}

Note the worst-case multiple locking here. (Improvement later proposed by Sagiv, and by Lanin & Shasha: unlock current after put(A, current)).
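
The split step can be sketched in Python as follows, assuming a dictionary `disk` that stands in for get/put on page ids. The `Page` and `split` names are hypothetical, and the sketch leaves out locking and the insertion of the new key itself so that only the high-key and right-link bookkeeping and the write order remain.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass(frozen=True)
class Page:
    keys: Tuple[int, ...]
    high_key: float
    right: Optional[int]          # page id of the right sibling, if any

def split(disk: Dict[int, Page], a_id: int, b_id: int) -> Tuple[int, int]:
    """Split page a_id into a_id and b_id; return the (key, ptr) pair for the parent."""
    a = disk[a_id]
    mid = (len(a.keys) + 1) // 2
    left, right = a.keys[:mid], a.keys[mid:]
    y = left[-1]                                     # max value on A after the split
    disk[b_id] = Page(right, a.high_key, a.right)    # put(B, u): B is not reachable yet
    disk[a_id] = Page(left, y, b_id)                 # put(A, current): one write publishes B
    return y, b_id

disk = {
    1: Page((5, 10, 15, 20), 20, 2),
    2: Page((30, 40), float("inf"), None),
}
separator = split(disk, 1, 3)
```

Between the two writes, readers still see the old, unsplit A; after the second write they see A with a right-link to B. No intermediate state is ever visible, which is the sense in which the split is atomic at that level.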


Deletion: just remove the key from the leaf. They punt on underflow: leaves are allowed to become empty and are never deleted (hence nothing is ever deleted from internal nodes). If you think your tree is too empty, then reorganize it offline.

Why does this all work?

  1. Deadlock-free: there is a total order on tree nodes (bottom-up, left-to-right), and the locking protocol always acquires locks in that order
  2. Correctness of Tree Modifications: By doing put(B, u) before put(A, current) in insert, we make node splitting atomic at each level. This means at any time the tree will appear consistent (though it may have "big" nodes)
  3. Correct Interaction: it is somewhat tricky to show that reader/writer and writer/writer pairs don’t step on each other. The worrisome case is when a reader reads node N, which is subsequently updated by a writer. In this case, the reader will subsequently detect relevant inserts by seeing (possibly lower in the tree) that she needs to keep looking right.

Interesting potential problem: Livelock.

What’s missing from this discussion???

Alternative techniques are used in ARIES: ARIES/KVL & ARIES/IM. Don’t require right-links, add a little more constraint on ordering of operations. On occasion need to "reposition" (i.e. find the appropriate spot on a level to continue.) Papers handle lots more details than Lehman/Yao, including degree-3 consistency, deletion, logging/recovery, savepoints.

Extensions for R-trees & GiSTs

A CS286 class project in ’94, published VLDB ’95 (Kornacker & Banks). Improved for GiST with concurrency, degree-3 consistency, deletion, savepoints in SIGMOD ’97 (Kornacker, Mohan & Hellerstein).

Main differences to focus on:

  • search traverses multiple nodes
  • keys no longer linearly ordered

Raises 2 questions:

  1. how do we detect a node has split?
    • this can’t be done using data values in an R-tree
  2. how do we limit extent to which we move right?
    • no way to know we’ve caught up with inserters

Idea: impose an ordering that has nothing to do with the data.

Each page gets a Node Sequence Number (NSN), like a timestamp.

On a page split, the new right sibling gets the original page's NSN and the left (original) page gets a new NSN; the parent's NSN is updated when the pointer to the new sibling is inserted.

Split detection: if child’s NSN is greater than the NSN in the parent entry, child has since split.

Limiting right-traversal: scan right only until reaching a node whose NSN is no greater than the one recorded in the parent entry.
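
The NSN rules can be sketched in Python under a few stated assumptions: NSNs come from a single increasing counter (`clock` here), the NSN a reader saw in the parent entry is passed in as `expected_nsn`, and everything runs single-threaded with no latching. The `Node`, `split` and `nodes_to_visit` names are hypothetical, not taken from the papers.

```python
import itertools
from dataclasses import dataclass
from typing import Iterator, List, Optional

@dataclass
class Node:
    name: str
    nsn: int
    right: Optional["Node"] = None

def split(node: Node, new_name: str, clock: Iterator[int]) -> Node:
    """New right sibling inherits the old NSN; the original node gets a fresh one."""
    sibling = Node(new_name, nsn=node.nsn, right=node.right)
    node.nsn = next(clock)
    node.right = sibling
    return sibling

def nodes_to_visit(child: Node, expected_nsn: int) -> List[str]:
    """Start at child; while the NSN is newer than expected, a split was missed, so go right."""
    visited = [child.name]
    node = child
    while node.nsn > expected_nsn and node.right is not None:
        node = node.right
        visited.append(node.name)
    return visited

clock = itertools.count(1)
m = Node("M", nsn=next(clock))
n = Node("N", nsn=next(clock), right=m)
expected = n.nsn                              # what the reader saw in the parent entry

before = nodes_to_visit(n, expected)          # no split yet
n2 = split(n, "N2", clock)
after_one = nodes_to_visit(n, expected)       # N split once
n3 = split(n2, "N3", clock)
after_two = nodes_to_visit(n, expected)       # the sibling N2 split as well
```

However many times N and its split-off siblings split, the rightmost piece always carries the original NSN, so the scan stops there and never wanders into the unrelated neighbor M.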

Some extra details:

  • degree-3 consistency especially tricky! Predicate locks attached to nodes.
  • The ARIES/IM ideas don’t work – link trees are basically required because "repositioning" doesn’t work.

1998, Joseph M. Hellerstein.  Last modified 08/18/98.