redbookcover.gif (13955 bytes) Readings in Database Systems, 3rd Edition

Stonebraker & Hellerstein, eds.


Relational Roots

Some Notes on Codd '70

  • The case for physical independence.
  • Notes on strict relational model that might surprise SQL users:
    1. all rows are distinct!
    2. columns identified by position (column names are for semantic info)
    3. "relationships": relations where columns are identified by name
    4. non-algebraic definition of join leads to a "plurality of joins"
  • Simple domains vs. non-simple domains ("complex objects").
  • Normalization: removing non-simple domains.  Normalization theory was a big research area in the 70's.  We will skip it here.  The material in an undergrad textbook is probably all you need to know.
    • OODBMSs use non-simple domains.  Reasons Codd gives to avoid non-simple domains:
      1. hard to represent [this problem popped up in the late 80's as OODB clustering]
      2. hard to translate to other systems [this problem popped up in the 90's as OODB bulk loading]
  • Support for modelling Generations of relations (time travel! -- this popped up in the POSTGRES storage manager, mid '80s.  We'll read about this down the road.)

Relational System Architecture

Databases are BIG pieces of software. Typically very hard to modularize. Lots of system design decisions at the macro and micro scale. We focus mostly on micro decisions in subsequent lectures. Here we focus on macro design.

Disk management choices:

    • file per relation
    • big file in file system
    • raw device

Process Model:

    • process per user
    • server
    • multi-server

Basic modules:

    • parser
    • query rewrite
    • optimizer
    • query executor
    • access methods
    • buffer manager
    • lock manager
    • log/recovery manager

Query Rewriter

    • Flattens views (why?)
    • may change query semantics (constraints, protection, etc.)

Optimizer

    • large space of equivalent relational plans
    • pick one that’s going to be "optimal" (?)
    • produces either an interpretable plan tree, or compiled code

Executor

    • modules to perform relation operations like joins, sorts, aggregations, etc.
    • calls Access Methods for operations on base and temporary relations

Access Methods

    • uniform relational interface (open, get next), a la INGRES AMI, System R's RSS
    • multiple implementations: heap, B-tree, extensible hashing

Buffer Manager

    • Intelligent user-level disk cache
    • must interact with transaction manager & lock manager

Lock Manager

    • must efficiently support lock table
    • System R architecture influential:
      • physical and logical locks treated uniformly
      • multiple granularity of locks
      • set intent locks at high levels
      • we will study this in more detail later (Gray)
    • deadlock handling: detection

Log/Recovery Manager

    • "before/after" log on values
    • checkpoint/restore facility for quick recovery
    • Redo/Undo on restore
    • Support soft crashes off disk, hard crashes off tape.
    • System R’s shadowing is too slow. Use Write-Ahead Logging! (WAL) More on this to come.
    • Hard to get right!

Notes on INGRES

What of this stuff is in today’s systems?

    • query rewrite, especially for views and integrity constraints
    • use of UNIX
    • catalogs as relations
    • idea of extensible access methods
    • thoughts on strange-length transactions
    • deferred updates
    • interpreted query plans (?)
    • simple, clear language is a good thing

Things it got wrong:

    • static trees & hashing
    • interpreted query plans (?)
    • file/relation
    • DECOMP and OVQP (note retrospection comments)
    • process(es)/user
    • locking: column-level (actually table). Deadlock avoidance.
    • logging/recovery
    • argued that clustering wasn’t important

 

Notes on System R

See the System R reunion notes for fun background and gossip.

More influential? A lot of the ideas remain in today’s systems:

    • optimizer remains, largely unchanged
    • RSS/RDS divide remains in many systems
    • SQL, cursors, duplicates, NULLs, etc.
      • the pros and cons of duplicates. Alternatives?
      • pros and cons of NULLs. Alternatives?
      • grouping and aggregation
    • updatable single-table views
    • begin/end xact at user level
    • savepoints and restore
    • catalogs as relations
    • flexible security (GRANT/REVOKE)
    • integrity constraints
    • triggers (!!)
    • clustering
    • compiled queries
    • B-trees
    • Nest-loop & sort-merge join, all joins 2-way
    • dual logs to support log failure

Stuff they got wrong:

    • shadow paging
    • predicate locking
    • SQL language
      • duplicate semantics
      • subqueries vs. joins
      • outer join
    • rejected hashing
 

© 1998, Joseph M. Hellerstein.  Last modified 08/18/98.
Feedback welcomed.