Relational Roots
Some Notes on Codd '70
- The case for physical independence.
- Notes on strict relational model that might surprise SQL users:
- all rows are distinct!
- columns identified by position (column names are for semantic info)
- "relationships": relations where columns are identified by name
- non-algebraic definition of join leads to a "plurality of joins"
- Simple domains vs. non-simple domains ("complex objects").
- Normalization: removing non-simple domains. Normalization theory was a big
research area in the 70's. We will skip it here. The material in an undergrad
textbook is probably all you need to know.
- OODBMSs use non-simple domains. Reasons Codd gives to avoid non-simple domains:
- hard to represent [this problem popped up in the late 80's as OODB clustering]
- hard to translate to other systems [this problem popped up in the 90's as OODB bulk
loading]
- Support for modelling Generations of relations (time travel! -- this popped up in the
POSTGRES storage manager, mid '80s. We'll read about this down the road.)
Relational System Architecture
Databases are BIG pieces of software. Typically very hard to modularize. Lots of system
design decisions at the macro and micro scale. We focus mostly on micro decisions in
subsequent lectures. Here we focus on macro design.
Disk management choices:
- file per relation
- big file in file system
- raw device
Process Model:
- process per user
- server
- multi-server
Basic modules:
- parser
- query rewrite
- optimizer
- query executor
- access methods
- buffer manager
- lock manager
- log/recovery manager
Query Rewriter
- Flattens views (why?)
- may change query semantics (constraints, protection, etc.)
Optimizer
- large space of equivalent relational plans
- pick one thats going to be "optimal" (?)
- produces either an interpretable plan tree, or compiled code
Executor
- modules to perform relation operations like joins, sorts, aggregations, etc.
- calls Access Methods for operations on base and temporary relations
Access Methods
- uniform relational interface (open, get next), a la INGRES AMI, System R's RSS
- multiple implementations: heap, B-tree, extensible hashing
Buffer Manager
- Intelligent user-level disk cache
- must interact with transaction manager & lock manager
Lock Manager
- must efficiently support lock table
- System R architecture influential:
- physical and logical locks treated uniformly
- multiple granularity of locks
- set intent locks at high levels
- we will study this in more detail later (Gray)
- deadlock handling: detection
Log/Recovery Manager
- "before/after" log on values
- checkpoint/restore facility for quick recovery
- Redo/Undo on restore
- Support soft crashes off disk, hard crashes off tape.
- System Rs shadowing is too slow. Use Write-Ahead Logging! (WAL) More on this to
come.
- Hard to get right!
Notes on INGRES
What of this stuff is in todays systems?
- query rewrite, especially for views and integrity constraints
- use of UNIX
- catalogs as relations
- idea of extensible access methods
- thoughts on strange-length transactions
- deferred updates
- interpreted query plans (?)
- simple, clear language is a good thing
Things it got wrong:
- static trees & hashing
- interpreted query plans (?)
- file/relation
- DECOMP and OVQP (note retrospection comments)
- process(es)/user
- locking: column-level (actually table). Deadlock avoidance.
- logging/recovery
- argued that clustering wasnt important
Notes on System R
See the System
R reunion notes for fun background and gossip.
More influential? A lot of the ideas remain in todays systems:
- optimizer remains, largely unchanged
- RSS/RDS divide remains in many systems
- SQL, cursors, duplicates, NULLs, etc.
- the pros and cons of duplicates. Alternatives?
- pros and cons of NULLs. Alternatives?
- grouping and aggregation
- updatable single-table views
- begin/end xact at user level
- savepoints and restore
- catalogs as relations
- flexible security (GRANT/REVOKE)
- integrity constraints
- triggers (!!)
- clustering
- compiled queries
- B-trees
- Nest-loop & sort-merge join, all joins 2-way
- dual logs to support log failure
Stuff they got wrong:
- shadow paging
- predicate locking
- SQL language
- duplicate semantics
- subqueries vs. joins
- outer join
- rejected hashing
|