Relational Roots

Some Notes on Codd '70

The case for physical independence.
Notes on strict relational model that might surprise SQL users:

all rows are distinct!
columns identified by position (column names are for semantic info)
"relationships": relations where columns are identified by name
non-algebraic definition of join leads to a "plurality of joins"

Simple domains vs. non-simple domains ("complex objects").
Normalization: removing non-simple domains. Normalization theory was a big research area in the 70's. We will skip it here. The material in an undergrad textbook is probably all you need to know.

OODBMSs use non-simple domains. Reasons Codd gives to avoid non-simple domains:

hard to represent [this problem popped up in the late 80's as OODB clustering]
hard to translate to other systems [this problem popped up in the 90's as OODB bulk loading]

Support for modelling Generations of relations (time travel! -- this popped up in the POSTGRES storage manager, mid '80s. We'll read about this down the road.)

Relational System Architecture

Databases are BIG pieces of software. Typically very hard to modularize. Lots of system design decisions at the macro and micro scale. We focus mostly on micro decisions in subsequent lectures. Here we focus on macro design.

Disk management choices:

file per relation
big file in file system
raw device

Process Model:

process per user
server
multi-server

Basic modules:

parser
query rewrite
optimizer
query executor
access methods
buffer manager
lock manager
log/recovery manager

Query Rewriter

Flattens views (why?)
may change query semantics (constraints, protection, etc.)

Optimizer

large space of equivalent relational plans
pick one that’s going to be "optimal" (?)
produces either an interpretable plan tree, or compiled code

Executor

modules to perform relation operations like joins, sorts, aggregations, etc.
calls Access Methods for operations on base and temporary relations

Access Methods

uniform relational interface (open, get next), a la INGRES AMI, System R's RSS
multiple implementations: heap, B-tree, extensible hashing

Buffer Manager

Intelligent user-level disk cache
must interact with transaction manager & lock manager

Lock Manager

must efficiently support lock table
System R architecture influential:

physical and logical locks treated uniformly
multiple granularity of locks
set intent locks at high levels
we will study this in more detail later (Gray)

deadlock handling: detection

Log/Recovery Manager

"before/after" log on values
checkpoint/restore facility for quick recovery
Redo/Undo on restore
Support soft crashes off disk, hard crashes off tape.
System R’s shadowing is too slow. Use Write-Ahead Logging! (WAL) More on this to come.
Hard to get right!

Notes on INGRES

What of this stuff is in today’s systems?

query rewrite, especially for views and integrity constraints
use of UNIX
catalogs as relations
idea of extensible access methods
thoughts on strange-length transactions
deferred updates
interpreted query plans (?)
simple, clear language is a good thing

Things it got wrong:

static trees & hashing
interpreted query plans (?)
file/relation
DECOMP and OVQP (note retrospection comments)
process(es)/user
locking: column-level (actually table). Deadlock avoidance.
logging/recovery
argued that clustering wasn’t important

Notes on System R

See the System R reunion notes for fun background and gossip.

More influential? A lot of the ideas remain in today’s systems:

optimizer remains, largely unchanged
RSS/RDS divide remains in many systems
SQL, cursors, duplicates, NULLs, etc.

the pros and cons of duplicates. Alternatives?
pros and cons of NULLs. Alternatives?
grouping and aggregation

updatable single-table views
begin/end xact at user level
savepoints and restore
catalogs as relations
flexible security (GRANT/REVOKE)
integrity constraints
triggers (!!)
clustering
compiled queries
B-trees
Nest-loop & sort-merge join, all joins 2-way
dual logs to support log failure

Stuff they got wrong:

shadow paging
predicate locking
SQL language

duplicate semantics
subqueries vs. joins
outer join

rejected hashing