redbookcover.gif (13955 bytes) Readings in Database Systems, 3rd Edition

Stonebraker & Hellerstein, eds.

Extensibility & Object-Relational Systems

Motivation & Politics

In the early 80’s, it became clear that Relational systems were not robust enough for non-administrative data-intensive applications of the day:

    • CAD/CAM
    • CASE
    • GIS
    • etc.

Two buzz-phrases began to emerge: "Object-Oriented" and "Extensible"

Much vision & politics ensued:

    • Various data models (NF2, ER, Functional, Semantic)
    • Object-Oriented DB System Manifesto (OO-ness)
    • Third-Generation DB System Manifesto (Extensibility)
    • Many query languages proposed

Systems were built, companies started, etc.

Today, the field has settled down into two arenas:

    • Persistent OO PL systems (e.g. EXODUS, ObjectStore, Objectivity, Versant, etc.)
    • Query-based systems with OO features (e.g. Starburst, Postgres, Illustra, Informix & ORacle "Universal Servers", DB/2 UDB)
    • Almost nobody does both well

Few people continue to argue in terms of the paradigms. (Exhaustion?)
 

Systems History

Three influential research systems:

    • Starburst (IBM Almaden)
    • POSTGRES (Berkeley)
    • EXODUS (Wisconsin)
    • Others include O2 (Altair), ORION (MCC), Iris (HP), Genesis (Texas)

We will focus on POSTGRES & EXODUS, discuss Starburst briefly.
 

EXODUS

    • EXODUS was intended to be both a persistent PL system and a query system
      • "Toolkit" extensibility
    • query processing engine never got built, though the EXODUS optimizer architecture was influential (Graefe & DeWitt)
    • Ended up focusing on OODB stuff, so we will discuss it more later.
    • SHORE (follow-on to EXODUS) is delivering on EXODUS promises, but rather late.
      • Persistent C++
      • Query Processing (w/ GIS features): Paradise
      • Extensible Optimizer: Opt++
      • Parallelism
    • We'll see Exodus/SHORE work on pointer swizzling, client-server caching

Object-Relational Systems

Informix's buzzphrase for Illustra and look-alikes (courtesy Roger Sippl).  Didn't patent the name!

Query-based, extensible systems with some OO features like inheritance and OIDs.

SQL3 draft is very much like this.

Stonebraker’s application matrix:
 

Query  RDBMS ORDBMS 
No Query File Sys.  OODBMS 
  Simple Data Complex Data

 

Argues that the upper right is growing, and will engulf upper left and lower right.

 

Hot Applications:

    • Web servers, full-text collections
    • Time-series data
    • "Asset Management"
    • GIS
    • image DB

Players:

    • Informix Universal Server (head of illustra, body of informix). Shipping now.
    • IBM DB2 UDB (head of Starburst, body of DB2). Extensibility features coming along.
    • UniSQL (Won Kim of ORION fame). Went out of business recently.
    • Oracle Universal Server (marketing-ware). Shipping now.
    • NCR (Teradata) just bought Wisconsin's Paradise ORDBMS (and DeWitt/Naughton/students)
    • Other big R vendors are late (Sybase, Tandem, etc.)

    • O vendors still not running queries

Overview: Things needed in an Object-Relational DBMS

(From the "Third-Generation Database System Manifesto")

The relational data model (as implemented!) is "semantically impoverished":

    • fixed set of base types (integer, float, etc)
    • only structuring allowed is normal-form relations
    • only operations are relational algebra, using comparators from base types

Instead, people want a "semantically rich" data model

    • extensible ADTs
    • complex types based on type constructors, and methods for those types
    • inheritance
    • "Object identity"

Zaniolo: GEM

One of the nicest, cleanest language papers, influential in the design of object-relational data models and query languages (starting with Postquel)

  • ref types, nested dots, implicit joins (implication about OIDs)
  • generalization hierarchies (converted to "collection inheritence")
  • set-valued attrs (non-1st-normal-form!)

Stonebraker: Extensible ADTs

Seminal paper.

Idea: you should be able to add new atomic types to the system, along with methods for the types, and new access methods.

Type is defined by:

    • storage size (can be variable)
    • input method
    • output method
    • any other methods a user wishes to provide

Then you could do standard relational processing over those types.

Example: add 2d spatial operators to RDBMS

Engineering issues (some from the paper, some not):

    • Parsing
      • must know about user-defined types and methods (table-driven)
    • Optimization
      • must be able to compute selectivity for user-defined predicates
      • must know about cost of user-defined methods, and consider predicate pullup
      • must know how to match user-defined predicates to fancy new access methods
      • must know whether user-defined join predicates can be evaluated by hash or merge
    • Execution:
      • must have dynamic linking support (24x7 operation)
      • methods called via "function pointers", "functors" or some similar construct
        • can slow down standard processing
      • support for "untrusted" functions (hot topic today)
      • support for "large objects"
      • caching for expensive methods (a la subquery caching)
      • extensible aggregations
        • 3 functions: init, iter, end
    • Access Methods
      • Stonebraker: An access method is a generic object that provides
        • open, get-first, get-next, close, insert, delete
        • if it’s fancy, it takes SARGs and evaluates them quickly
        • needs to provide cost estimates to optimizer
      • Problems
        • integration with CC
          • can be solved by physical logging
          • can open up logging interface for AM-specific logging
        • integration with Buffer Manager
      • In practice, almost nobody used the Postgres Access method extensibility
      • GiST!
    • Buffer Management
      • Large objects require new schemes
    • Transactions
      • integration with access methods

 Note: Extensible ADTs do not fundamentally change a relational system

    • they fit naturally with the relational model
    • as Stonebraker shows, they fit naturally into RDBMS query processing
    • for relational "believers" this is all you should need to solve all problems

POSTGRES

Stonebraker, Rowe, a few staff and many students, 1986-1994. Post-INGRES.

 The Postgres Data Model

  1. Co-opt the OO terminology
    • class = relation
    • instance = tuple
    • object-id = tuple-id
    • method = attribute or function of attributes
  2. Support extensible ADTs
    • extensible procedures using C functions
    • binary operators, which interface to extensible AM
  3. Support type constructors
    • trick: use queries
      • columns can be parameterized Postquel functions (returns setof, or tuple)
      • queries can live in fields of a tuple (returns setof or tuple)
      • another exploitation of the view paradigm!
      • these derived objects can optionally be cached (never implemented)
      • nested-dots used to traverse complex object structures
      • leverages EXISTING techniques for relational processing. Cute!
    • added array support directly
  4. added class inheritance (gives method inheritance and collection hierarchies)

Implementation Details

  • originally written in LISP, then ran Lisp2C, resulting in a horrible built-in inheritance mechanism over C
  • only access methods added to Postgres were done "in house"
    • B-tree and R-tree early
    • linear hashing late
    • GiST added in the last couple years
  • "Fast Path" to AMs, as an alternative to "Persistent X"
    • never well-documented or used outside Berkeley
  • ADTs as described above
  • No Overwrite Storage, time travel, etc
  • Research project was "shut down" in 94.  2 Berkeley students did a major cleanup (remove lispisms, remove a number of theses), migrated to SQL, and released Postgres95.  This was picked up by freeware hackers on the net, and now PostgreSQL seems to be the serious freeware db of choice (www.postgresql.org).

More Stuff

  • Postgres Rule System: Active Database support (we will skip this topic this year)
  • Shared-mem parallel version, with new optimization techniques
  • "Inversion" file system
  • PICASSO UI (Rowe & students)
  • Support for tertiary memory
  • Method Indexing
  • Partial Indexes
  • Expensive predicate optimization
  • Commercialized as Illustra
  • Concepts ported into Informix

Editorial Comments

  • Postgres code was a mess. Miraculous that it worked. More miraculous that Illustra salvaged it!
  • Data model was sloppy but clever. Burst some OO bubbles.
    • Roger King: "My cat is object-oriented"
    • Missed an important distinction between Class and Collection.
  • No-overwrite storage was not as bad as you think. Expect the concept to resurface.
  • In many ways, the most successful of the 3 projects
    • used to be written off as goofy research, but Illustra is now ahead of the game
    • ADT extensibility & dynamic linking are very useful – Illustra’s "DataBlades", Oracle's "Data Cartridges"
    • though not all aspects worked well, almost all were novel and influential
    • Many radical ideas in one system!

Starburst

Original goal: build a nice playpen for whatever comes next.

Extensible "in-house". Not by users!

No one survey paper seems to capture the work they did. Best bet: "Starburst Mid-Flight: As The Dust Clears", Haas, et al., TKDE 1990

Plumbing:

    • clean internal query representation (QGM). Key to Query Rewrite!
    • non-normalized catalogs for efficiency – normalized view for users
    • uniform record structure across RSS & RDS
    • a single expression evaluator for RSS & RDS
    • WAL instead of shadow pages
    • B+-tree compression
    • Buffer Pool Manager accepts hints from optimizer (a la DBMIN)

Extensibility features:

    • User-defined functions:
      • table expressions: queries or C functions
      • scalar functions
      • no dynamic linking
    • Rule-based query rewrite engine
      • a little rule system with QGM as "working memory"
      • conditions and actions are C functions that check and change QGM
      • some nifty rule control mechanisms (rule classes, rule budgets, multiple conflict res.)
    • Extensible access methods (as in POSTGRES)
    • "Attachments": routines to be automatically called before/after dealing with an access method
      • used by Starburst Rule System to generate transition logs
      • used to implement pre-computed joins (see below)
      • Complex objects implemented in a "wrapper" (SQL-XNF), translated down to Starburst SQL

"Proofs" of Extensibility:

    • added "signature" attachment to automatically tag tuples with some derived values
      • though writing the attachment was hard, hooking it in took only 1 day
    • added Outer Join
      • not so simple – required adding things in QGM, optimizer, and executor
    • IMS attachment
      • pre-computed joins using pointers, which are maintained
      • written by "outsiders" – Mike Carey & Beau Shekita from Wisconsin – in a summer visit to Almaden
    • Grammar-like rule-based query optimizer combined with System R bottom-up construction
    • 2nd Rule System called "Alert", based on infinitely-running queries. Lightweight and pretty effective.
    • Persistent C++ support (SMRC)
    • Recursive query processing!
    • Lots of technology for query rewriting

Many of the Starburst folks took a "vacation" from research the last few years, and merged Starburst technology into DB2 UDB.

Editorial Comments

    • Probably the best code of the bunch
    • The world’s most effective query optimizer?
    • The DB2 experience good for all concerned
    • Attention paid to details, shows up in complex parts of system
    • A long-term advantage? Depends on market, IBM’s development culture...
 

1998, Joseph M. Hellerstein.  Last modified 08/19/98.
Feedback welcomed.