Readings in Database Systems, 3rd Edition

Stonebraker & Hellerstein, eds.


Motivation & DBMS Architecture Overview

Why databases? Why DB research?

  • The technology trend angle: emphasis in CS research has shifted from computation to information management.
  • Evidence:

    • Hardware: high-performance computer companies have fallen on hard times (Thinking Machines, KSR, Cray, SGI?).  The exemplary success story in massive parallelism is Teradata (now sold by NCR), which has been around since the 70's.  "Shared-Nothing" architectures (sometimes called "clusters" or "NOW/COW/WOW" etc.) have succeeded largely in database-centric settings.
    • "Low-end" users: scramble to webspace reflects desire to give/receive info. Success of these efforts is questionable, and the disorganization will get worse as things grow.
    • "High-end" users: scientists, the biggest users of high-powered computation, now have data management problems that exceed their appetite for cycles
    • Other researchers: architecture, OS, theoreticians, AI are all moving this way.
    • PS: you will see all this in the job market!
  • The utilitarian angle: "Database: the boring part of accounting"? Not anymore! Interesting, world-changing apps:
    • digital libraries
    • digital "asset management" -- i.e., multimedia & entertainment
    • digital mapping & geo apps
    • scientific applications: earth science, DNA, molecular docking, experiment management, etc.
    • decision-support, data analysis & "mining"
    • your mom and pop care about this stuff!  (as do funding agencies, companies, etc.)
  • The intellectual angle:
    • Big, beautiful ideas: relational model & languages, concurrency control, query processing, etc.
    • Real, meaty systems work: the serious 24x7, high performance, complex systems engineering domain
    • Room for both kinds of contributions, separately and simultaneously
      • plenty of room to take an idea from theory to practice
    • lots of useful research left to do

The database-centric view of the CS research universe (take with a large grain of salt):

  • OS & Architecture are "finished": Ascendancy of Linux and FreeBSD. If Microsoft & Intel can mass-market these, it must be easy.
  • PL has become arcane
  • Theory is ... theoretical.
  • AI is that which cannot be done
  • etc...
  • while you may (should!) disagree with all the above in some respects, it is true that DB research is notably relevant and fertile these days. Lots of meaty problems remain that people care about.

An outline of ongoing database research:

  • Big: massive datasets
    • Tertiary storage: EOSDIS 1 TB/day, keep it all for 15 years
    • Parallelism: data parallelism is natural in a DBMS. How to do DB operations in parallel and balance load well? Example: WalMart (365 nodes, 6 TB online, a 4-billion-row table, 200 million updates daily, 4000 queries/day, 1500 users/week, 4 min decision-support response time with avg. 60000 rows out)
    • Data Analysis, Data Mining: given huge amounts of data, try to find interesting information in the data.  What is the "killer query"?
  • Wide: wide-scale distribution
    • World-Wide Web (a bad example)
    • Distributed databases for the 00's: autonomous "pay & play" databases
  • Complex: complex datatypes and their associated lookups
    • complex base types: geographic data, multimedia, scientific data, CAD data
    • complex objects
    • extensible query processing engines
    • indexing new data types
  • Old & hetero: the data integration problem
    • schema integration: trying to figure out how different schemas fit together. Hard!!!
    • DBMS integration: trying to semi-transparently glue different kinds of database systems together
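
To make the "Parallelism" bullet above concrete, here is a minimal sketch of data-parallel aggregation in a shared-nothing style. Everything in it is an assumption made for illustration: the (store_id, amount) row layout, the function names, and the use of threads as stand-ins for separate nodes. It is not how any particular parallel DBMS is implemented. Rows are hash-partitioned by key, each "node" aggregates only its own partition, and the partial results are merged. The returned per-node row counts show how evenly the load was spread.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_nodes):
    """Hash-partition (store_id, amount) rows across n_nodes by store_id."""
    parts = [[] for _ in range(n_nodes)]
    for store_id, amount in rows:
        # Integer keys: a simple modulo plays the role of the hash function.
        parts[store_id % n_nodes].append((store_id, amount))
    return parts


def local_sum(part):
    """Work done independently on one node: total amount per store."""
    totals = Counter()
    for store_id, amount in part:
        totals[store_id] += amount
    return totals


def parallel_sales_by_store(rows, n_nodes=4):
    """Return (total amount per store, number of rows handled per node)."""
    parts = partition(rows, n_nodes)
    # Threads stand in for nodes here; a real system would ship each
    # partition to a different machine with its own disk and memory.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_sum, parts))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values key by key
    load = [len(part) for part in parts]
    return dict(merged), load
```

A skewed key distribution (one store with most of the rows) would show up as one large entry in the load list, which is the load-balancing problem the bullet refers to.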

DBMS History

  • late 60's: network (CODASYL) & hierarchical (IMS) DBMS.
    • Low-level "record-at-a-time" DML, i.e. physical data structures are reflected in the DML (no data independence)
  • 1970: Codd's paper. The most influential paper in DB research. Set-at-a-time DML. Data independence. Allows schema and physical storage structures to change "under the covers". Truly important theory, led to "paradigm shift" in thinking and in practice.   (Papadimitriou: "as clear a paradigm shift as we can hope to find in computer science").  Codd won the Turing Award (1981).
  • early-to-mid-70's: raging debate between the two camps. "great debate" in 1975
  • mid 70's: 2 full-function (sort of) prototypes. Ancestors of essentially all today's commercial systems
    • Ingres: UCB 1974-77
      • a "pickup team", including Stonebraker & Wong. Early and pioneering. Begat Ingres Corp (CA), Sybase, MS SQL Server, Britton-Lee, Wang's PACE.
    • System R: IBM San Jose (now Almaden)
      • 15 PhDs. Begat IBM's SQL/DS & DB2, Oracle, HP's Allbase, Tandem's Non-Stop SQL. System R arguably got more stuff "right"
    • Both were viable starting points, proved practicality of relational approach. Beautiful example of theory -> practice!!
  • early 80's: commercialization of relational systems
  • mid 80's: SQL becomes "intergalactic standard".
    • DB2 becomes IBM's flagship product.
    • IMS "sunsetted"
  • today: network & hierarchical essentially dead (though commonly in use!)
    • relational is mainstream, not even sexy
    • SQL (& perhaps RDBMS) too flawed to last in current form.
      • semantically flawed in various ways (Date, 1985).
      • anemic
      • in an effort to fix it up, standards committees are making a mess
        • design by committee leads to kitchen sink
        • standards body as designers, rather than codifiers
        • leads to wasting time (Sybase) or irrelevance of standard (Informix & IBM shipping SQL3 before standardized)
    • various players in research, industry and both scrambling to standardize the "next thing"
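
The contrast between record-at-a-time and set-at-a-time DML in the history above can be sketched in a few lines. This is a toy illustration under stated assumptions: the employee data, the function names, and the use of Python's built-in sqlite3 module as the relational engine are all choices made here, not anything taken from CODASYL, IMS, Ingres, or System R. The first function walks records itself and so bakes the physical layout into the program. The second states what it wants and leaves the access path to the DBMS.

```python
import sqlite3

# Hypothetical toy data: (employee name, department, salary).
EMPLOYEES = [
    ("alice", "toys", 50),
    ("bob", "shoes", 40),
    ("carol", "toys", 70),
]


def record_at_a_time(records):
    """Navigate records one by one; the program encodes the access path."""
    total = 0
    for name, dept, salary in records:  # depends on the stored field order
        if dept == "toys":
            total += salary
    return total


def set_at_a_time(records):
    """State what is wanted; the DBMS decides how to find it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", records)
    # Adding an index changes the physical design but not the query text:
    # this is the data independence that the relational model provides.
    conn.execute("CREATE INDEX emp_dept ON emp (dept)")
    (total,) = conn.execute(
        "SELECT SUM(salary) FROM emp WHERE dept = ?", ("toys",)
    ).fetchone()
    conn.close()
    return total
```

Both functions return the same answer on the toy data. The difference is that reordering fields or adding an index would force a change to the first function but not to the query in the second.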

Modern DBMS taxonomy

Two axes:

  • Functionality: RDBMS, OODBMS, ORDBMS.
    • RDBMS: query in, data out.
      • simple data model: tables with rows and columns, simple data types.
      • widely standardized definitions, languages
      • clean mathematical foundation
    • OODBMS: term is somewhat nebulous. usually, a persistent programming environment
      • no queries (or only VERY simple ones).
      • data model comes from PL, includes lots of good OO stuff.
      • theoretical "foundations" after the fact, very complicated.
    • ORDBMS: term is getting better defined as products mature (Informix, IBM)
      • an attempt to provide best of both worlds: queries & rich data types.
      • query interface.
      • Rich data types with lots of OO features, esp. object identity, type-extensibility and inheritance.
      • Basic "outer" data type is relation, with extensible data types in the fields.
      • relational theory applies to outer operations only
  • Implementation:
    • Single-Site (i.e. traditional)
    • Parallel: lots of tightly-coupled machines solve one query together. A database supercomputer.
    • Distributed: geographically distributed machines, each "hosting" different data, participate in a more loosely coupled manner
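
As a rough illustration of the ORDBMS idea in the taxonomy above (a relation as the "outer" type, with extensible data types in the fields), the sketch below uses Python's sqlite3 module. SQLite is not an ORDBMS; it is used here only because its adapter, converter, and user-defined-function hooks allow a small imitation. The Point type, its "x;y" text encoding, the distance function, and the city table are all hypothetical names invented for this example.

```python
import math
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A user-defined type stored inside a relational column."""
    x: float
    y: float


def encode_point(p):
    """Adapter: how a Point is stored in the database (assumed 'x;y' text)."""
    return f"{p.x};{p.y}"


def decode_point(raw):
    """Converter: rebuild a Point from the stored bytes."""
    x, y = map(float, raw.decode().split(";"))
    return Point(x, y)


def distance(a, b):
    """User-defined function callable from SQL; sees the stored text form."""
    ax, ay = map(float, a.split(";"))
    bx, by = map(float, b.split(";"))
    return math.hypot(ax - bx, ay - by)


sqlite3.register_adapter(Point, encode_point)
sqlite3.register_converter("point", decode_point)


def cities_near(cities, origin, radius):
    """Return (name, Point) rows within radius of origin, ordered by name."""
    conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
    conn.create_function("distance", 2, distance)
    conn.execute("CREATE TABLE city (name TEXT, loc point)")
    conn.executemany("INSERT INTO city VALUES (?, ?)", cities)
    rows = conn.execute(
        "SELECT name, loc FROM city WHERE distance(loc, ?) <= ? ORDER BY name",
        (origin, radius),
    ).fetchall()
    conn.close()
    return rows
```

The outer operations (selection, ordering) are ordinary relational ones, while the meaning of the loc field and of distance comes from the type extension. In this imitation the engine cannot optimize or index on the new type, which is where real ORDBMS work on extensible query processing and indexing comes in.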
 

© 1998, Joseph M. Hellerstein.  Last modified 08/18/98.
Feedback welcomed.