Readings in Database Systems, 3rd Edition

Stonebraker & Hellerstein, eds.

Extensibility & Object-Relational Systems

Motivation & Politics

In the early 80’s, it became clear that Relational systems were not robust enough for non-administrative data-intensive applications of the day:

CAD/CAM
CASE
GIS
etc.

Two buzz-phrases began to emerge: "Object-Oriented" and "Extensible"

Much vision & politics ensued:

Various data models (NF2, ER, Functional, Semantic)
Object-Oriented DB System Manifesto (OO-ness)
Third-Generation DB System Manifesto (Extensibility)
Many query languages proposed

Systems were built, companies started, etc.

Today, the field has settled down into two arenas:

Persistent OO PL systems (e.g. EXODUS, ObjectStore, Objectivity, Versant, etc.)
Query-based systems with OO features (e.g. Starburst, Postgres, Illustra, Informix & ORacle "Universal Servers", DB/2 UDB)
Almost nobody does both well

Few people continue to argue in terms of the paradigms. (Exhaustion?)

Systems History

Three influential research systems:

Starburst (IBM Almaden)
POSTGRES (Berkeley)
EXODUS (Wisconsin)
Others include O2 (Altair), ORION (MCC), Iris (HP), Genesis (Texas)

We will focus on POSTGRES & EXODUS, discuss Starburst briefly.

EXODUS

EXODUS was intended to be both a persistent PL system and a query system

"Toolkit" extensibility

query processing engine never got built, though the EXODUS optimizer architecture was influential (Graefe & DeWitt)
Ended up focusing on OODB stuff, so we will discuss it more later.
SHORE (follow-on to EXODUS) is delivering on EXODUS promises, but rather late.

Persistent C++
Query Processing (w/ GIS features): Paradise
Extensible Optimizer: Opt++
Parallelism

We'll see Exodus/SHORE work on pointer swizzling, client-server caching

Object-Relational Systems

Informix's buzzphrase for Illustra and look-alikes (courtesy Roger Sippl). Didn't patent the name!

Query-based, extensible systems with some OO features like inheritance and OIDs.

SQL3 draft is very much like this.

Stonebraker’s application matrix:

Query	RDBMS	ORDBMS
No Query	File Sys.	OODBMS
	Simple Data	Complex Data

Argues that the upper right is growing, and will engulf upper left and lower right.

Hot Applications:

Web servers, full-text collections
Time-series data
"Asset Management"
GIS
image DB

Players:

Informix Universal Server (head of illustra, body of informix). Shipping now.
IBM DB2 UDB (head of Starburst, body of DB2). Extensibility features coming along.
UniSQL (Won Kim of ORION fame). Went out of business recently.
Oracle Universal Server (marketing-ware). Shipping now.
NCR (Teradata) just bought Wisconsin's Paradise ORDBMS (and DeWitt/Naughton/students)
Other big R vendors are late (Sybase, Tandem, etc.)

O vendors still not running queries

Overview: Things needed in an Object-Relational DBMS

(From the "Third-Generation Database System Manifesto")

The relational data model (as implemented!) is "semantically impoverished":

fixed set of base types (integer, float, etc)
only structuring allowed is normal-form relations
only operations are relational algebra, using comparators from base types

Instead, people want a "semantically rich" data model

extensible ADTs
complex types based on type constructors, and methods for those types
inheritance
"Object identity"

Zaniolo: GEM

One of the nicest, cleanest language papers, influential in the design of object-relational data models and query languages (starting with Postquel)

ref types, nested dots, implicit joins (implication about OIDs)
generalization hierarchies (converted to "collection inheritence")
set-valued attrs (non-1st-normal-form!)

Stonebraker: Extensible ADTs

Seminal paper.

Idea: you should be able to add new atomic types to the system, along with methods for the types, and new access methods.

Type is defined by:

storage size (can be variable)
input method
output method
any other methods a user wishes to provide

Then you could do standard relational processing over those types.

Example: add 2d spatial operators to RDBMS

Engineering issues (some from the paper, some not):

Parsing

must know about user-defined types and methods (table-driven)

Optimization

must be able to compute selectivity for user-defined predicates
must know about cost of user-defined methods, and consider predicate pullup
must know how to match user-defined predicates to fancy new access methods
must know whether user-defined join predicates can be evaluated by hash or merge

Execution:

must have dynamic linking support (24x7 operation)
methods called via "function pointers", "functors" or some similar construct

can slow down standard processing

support for "untrusted" functions (hot topic today)
support for "large objects"
caching for expensive methods (a la subquery caching)
extensible aggregations

3 functions: init, iter, end

Access Methods

Stonebraker: An access method is a generic object that provides

open, get-first, get-next, close, insert, delete
if it’s fancy, it takes SARGs and evaluates them quickly
needs to provide cost estimates to optimizer

Problems

integration with CC

can be solved by physical logging
can open up logging interface for AM-specific logging

integration with Buffer Manager

In practice, almost nobody used the Postgres Access method extensibility
GiST!

Buffer Management

Large objects require new schemes

Transactions

integration with access methods

Note: Extensible ADTs do not fundamentally change a relational system

they fit naturally with the relational model
as Stonebraker shows, they fit naturally into RDBMS query processing
for relational "believers" this is all you should need to solve all problems

POSTGRES

Stonebraker, Rowe, a few staff and many students, 1986-1994. Post-INGRES.

The Postgres Data Model

Co-opt the OO terminology
- class = relation
- instance = tuple
- object-id = tuple-id
- method = attribute or function of attributes
Support extensible ADTs
- extensible procedures using C functions
- binary operators, which interface to extensible AM
Support type constructors
- trick: use queries
  - columns can be parameterized Postquel functions (returns setof, or tuple)
  - queries can live in fields of a tuple (returns setof or tuple)
  - another exploitation of the view paradigm!
  - these derived objects can optionally be cached (never implemented)
  - nested-dots used to traverse complex object structures
  - leverages EXISTING techniques for relational processing. Cute!
- added array support directly
added class inheritance (gives method inheritance and collection hierarchies)

Implementation Details

originally written in LISP, then ran Lisp2C, resulting in a horrible built-in inheritance mechanism over C
only access methods added to Postgres were done "in house"

B-tree and R-tree early
linear hashing late
GiST added in the last couple years

"Fast Path" to AMs, as an alternative to "Persistent X"

never well-documented or used outside Berkeley

ADTs as described above
No Overwrite Storage, time travel, etc
Research project was "shut down" in 94. 2 Berkeley students did a major cleanup (remove lispisms, remove a number of theses), migrated to SQL, and released Postgres95. This was picked up by freeware hackers on the net, and now PostgreSQL seems to be the serious freeware db of choice (www.postgresql.org).

More Stuff

Postgres Rule System: Active Database support (we will skip this topic this year)
Shared-mem parallel version, with new optimization techniques
"Inversion" file system
PICASSO UI (Rowe & students)
Support for tertiary memory
Method Indexing
Partial Indexes
Expensive predicate optimization
Commercialized as Illustra
Concepts ported into Informix

Editorial Comments

Postgres code was a mess. Miraculous that it worked. More miraculous that Illustra salvaged it!
Data model was sloppy but clever. Burst some OO bubbles.

Roger King: "My cat is object-oriented"
Missed an important distinction between Class and Collection.

No-overwrite storage was not as bad as you think. Expect the concept to resurface.
In many ways, the most successful of the 3 projects

used to be written off as goofy research, but Illustra is now ahead of the game
ADT extensibility & dynamic linking are very useful – Illustra’s "DataBlades", Oracle's "Data Cartridges"
though not all aspects worked well, almost all were novel and influential
Many radical ideas in one system!

Starburst

Original goal: build a nice playpen for whatever comes next.

Extensible "in-house". Not by users!

No one survey paper seems to capture the work they did. Best bet: "Starburst Mid-Flight: As The Dust Clears", Haas, et al., TKDE 1990

Plumbing:

clean internal query representation (QGM). Key to Query Rewrite!
non-normalized catalogs for efficiency – normalized view for users
uniform record structure across RSS & RDS
a single expression evaluator for RSS & RDS
WAL instead of shadow pages
B+-tree compression
Buffer Pool Manager accepts hints from optimizer (a la DBMIN)

Extensibility features:

User-defined functions:

table expressions: queries or C functions
scalar functions
no dynamic linking

Rule-based query rewrite engine

a little rule system with QGM as "working memory"
conditions and actions are C functions that check and change QGM
some nifty rule control mechanisms (rule classes, rule budgets, multiple conflict res.)

Extensible access methods (as in POSTGRES)
"Attachments": routines to be automatically called before/after dealing with an access method

used by Starburst Rule System to generate transition logs
used to implement pre-computed joins (see below)
Complex objects implemented in a "wrapper" (SQL-XNF), translated down to Starburst SQL

"Proofs" of Extensibility:

added "signature" attachment to automatically tag tuples with some derived values

though writing the attachment was hard, hooking it in took only 1 day

added Outer Join

not so simple – required adding things in QGM, optimizer, and executor

IMS attachment

pre-computed joins using pointers, which are maintained
written by "outsiders" – Mike Carey & Beau Shekita from Wisconsin – in a summer visit to Almaden

Grammar-like rule-based query optimizer combined with System R bottom-up construction
2nd Rule System called "Alert", based on infinitely-running queries. Lightweight and pretty effective.
Persistent C++ support (SMRC)
Recursive query processing!
Lots of technology for query rewriting

Many of the Starburst folks took a "vacation" from research the last few years, and merged Starburst technology into DB2 UDB.

Editorial Comments

Probably the best code of the bunch
The world’s most effective query optimizer?
The DB2 experience good for all concerned
Attention paid to details, shows up in complex parts of system
A long-term advantage? Depends on market, IBM’s development culture...

� 1998, Joseph M. Hellerstein. Last modified 08/19/98.
Feedback welcomed.