Motivation & DBMS Architecture Overview
Why databases? Why DB research?
- The technology trend angle: emphasis in CS research has shifted from computation to
information management.
Evidence:
- Hardware: high-performance computer companies on hard times (Thinking Machines, KSR,
Cray, SGI?). The exemplary success story in massive parallelism: Teradata (now sold
by NCR). Been around since the 70's. "Shared-Nothing" (sometimes
called "clusters" or "NOW/COW/WOW" etc.) Successes have been
largely database-centric.
- "Low-end" users: scramble to webspace reflects desire to give/receive info.
Success of these efforts is questionable, and the disorganization will get worse as things
grow.
- "High-end" users: scientists, the biggest users of high-powered computation,
now have data management problems that exceed their appetite for cycles
- Other researchers: architecture, OS, theoreticians, AI are all moving this way.
- PS: you will see all this in the job market!
- The utilitarian angle: "Database: the boring part of accounting"? Not anymore!
Interesting, world-changing apps:
- digital libraries
- digital ``asset mgmt'' -- i.e., multimedia & entertainment
- digital mapping & geo apps
- scientific applications: earth science, DNA, molecular docking, experiment management,
etc.
- decision-support, data analysis & "mining"
- your mom and pop care about this stuff! (as do funding agencies, companies, etc.)
- The intellectual angle:
- Big, beautiful ideas: relational model & languages, concurrency control, query
processing, etc.
- Real, meaty systems work: the serious 24x7, high performance, complex systems
engineering domain
- Room for both kinds of contributions, separately and simultaneously
- plenty of room to take an idea from theory to practice
- lots of useful research left to do
The database-centric view of the CS research universe (take with a large grain of
salt):
- OS & Architecture are ``finished'': Ascendancy of Linux and FreeBSD. If Microsoft
& Intel can mass-market these, it must be easy.
- PL has become arcane
- Theory is ... theoretical.
- AI is that which cannot be done
- etc...
- while you may (should!) disagree with all the above in some respects, it is true that DB
research is notably relevant and fertile these days. Lots of meaty problems remain that
people care about.
An outline of ongoing database research:
- Big: massive datasets
- Tertiary storage: EOSDIS 1 Tb/day, keep it all for 15 years
- Parallelism: data parallelism is natural in a DBMS. How to do DB operations in parallel
and balance load well? WalMart (365 nodes, 6 Tb online, 4 billion row table, 200 million
updates daily, 4000 queries/day, 1500 users/week, 4 min DS response time w/ avg. 60000
rows out)
- Data Analysis, Data Mining: given huge amounts of data, try to find interesting
information in the data. What is the "killer query"?
- Wide: wide-scale distribution
- World-Wide Web (a bad example)
- Distributed databases for the 00's: autonomous "pay & play" databases
- Complex: complex datatypes and their associated lookups
- complex base types: geographic data, multimedia, scientific data, CAD data
- complex objects
- extensible query processing engines
- indexing new data types
- Old & hetero: the data integration problem
- schema integration: trying to figure out how different schemas fit together. Hard!!!
- DBMS integration: trying to semi-transparently glue different kinds of database systems
together
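A minimal sketch of the data parallelism mentioned under "Big" above, assuming a shared-nothing layout: rows are hash-partitioned across nodes, each node aggregates only its own partition, and a coordinator merges the partial results. The function names (partition, local_sum, parallel_group_sum) are hypothetical, and the thread pool is only a stand-in for separate machines; it does not model real load balancing or network cost.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import zlib


def partition(rows, n_nodes):
    """Hash-partition (key, value) rows so each key lands on exactly one node."""
    parts = [[] for _ in range(n_nodes)]
    for key, value in rows:
        # crc32 gives a deterministic hash, unlike Python's salted str hash.
        parts[zlib.crc32(key.encode()) % n_nodes].append((key, value))
    return parts


def local_sum(part):
    """Work done independently on one node: SUM(value) GROUP BY key."""
    totals = Counter()
    for key, value in part:
        totals[key] += value
    return totals


def parallel_group_sum(rows, n_nodes=4):
    """Run the per-node aggregates concurrently, then merge at a coordinator."""
    parts = partition(rows, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_sum, parts))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds per-key totals
    return dict(merged)


if __name__ == "__main__":
    sales = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
    print(parallel_group_sum(sales))
```

If one key is far more frequent than the others, its partition becomes the bottleneck; that skew is one face of the load-balancing question raised above.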
DBMS History
- late 60's: network (CODASYL) & hierarchical (IMS) DBMS.
- Low-level ``record-at-a-time'' DML, i.e. physical data structures reflected in DML (no
data independence)
- 1970: Codd's paper. The most influential paper in DB research. Set-at-a-time DML. Data
independence. Allows for schema and physical storage structures to change ``under the
covers''. Truly important theory, led to "paradigm shift" in thinking and in
practice. (Papadimitriou: "as clear a paradigm shift as we can hope to
find in computer science"). Turing award.
- early-to-mid-70's: raging debate between the two camps. "great debate" in 1975
- mid 70's: 2 full-function (sort of) prototypes. Ancestors of essentially all today's
commercial systems
- Ingres: UCB 1974-77
- a ``pickup team'', including Stonebraker & Wong. early and pioneering. begat Ingres
Corp (CA), Sybase, MS SQL Server, Britton-Lee, Wang's PACE.
- System R: IBM San Jose (now Almaden)
- 15 PhDs. begat IBM's SQL/DS & DB2, Oracle, HP's Allbase, Tandem's Non-Stop SQL.
System R arguably got more stuff ``right''
- Both were viable starting points, proved practicality of relational approach. Beautiful
example of theory -> practice!!
- early 80's: commercialization of relational systems
- mid 80's: SQL becomes ``intergalactic standard''.
- DB2 becomes IBM's flagship product.
- IMS ``sunsetted''
- today: network & hierarchical essentially dead (though commonly in use!)
- relational is mainstream, not even sexy
- SQL (& perhaps RDBMS) too flawed to last in current form.
- semantically flawed in various ways (Date, 1985).
- anemic
- in an effort to fix it up, standards committees are making a mess
- design by committee leads to kitchen sink
- standards body as designers, rather than codifiers
- leads to wasting time (Sybase) or irrelevance of standard (Informix & IBM shipping
SQL3 before standardized)
- various players in research, industry and both scrambling to standardize the "next
thing"
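The contrast in the history above between record-at-a-time and set-at-a-time DML can be illustrated with a small sketch. Everything here is a hypothetical toy: a nested dict stands in for the parent/child record chains of a hierarchical DBMS, and Python's built-in sqlite3 module stands in for a relational system. It is not how IMS or CODASYL systems were actually programmed.

```python
import sqlite3

# Hypothetical toy data: departments (parents) and employees (children).
hierarchy = {
    "toys": [("alice", 50), ("bob", 40)],
    "shoes": [("carol", 60)],
}


def navigational_lookup(min_salary):
    """Record-at-a-time: the program walks the physical structure itself."""
    result = []
    for dept, employees in hierarchy.items():   # visit each parent record
        for name, salary in employees:          # then each child record
            if salary >= min_salary:
                result.append((name, dept))
    return sorted(result)


def declarative_lookup(min_salary):
    """Set-at-a-time: state what is wanted; the system picks the access path."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO emp VALUES (?, ?, ?)",
        [(n, d, s) for d, emps in hierarchy.items() for n, s in emps],
    )
    # Data independence: adding an index changes physical storage,
    # but the query text below does not change.
    conn.execute("CREATE INDEX emp_salary ON emp (salary)")
    rows = conn.execute(
        "SELECT name, dept FROM emp WHERE salary >= ? ORDER BY name, dept",
        (min_salary,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(navigational_lookup(50))
    print(declarative_lookup(50))
```

If the dict were reorganized (say, keyed by employee instead of department), navigational_lookup would have to be rewritten, while the SQL in declarative_lookup would survive any change to the indexes or storage layout of emp.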
Modern DBMS taxonomy
Two axes:
- Functionality: RDBMS, OODBMS, ORDBMS.
- RDBMS: query in, data out.
- simple data model: tables with rows and columns, simple data types.
- widely standardized definitions, languages
- clean mathematical foundation
- OODBMS: term is somewhat nebulous. usually, a persistent programming environment
- no queries (or only VERY simple ones).
- data model comes from PL, includes lots of good OO stuff.
- theoretical "foundations" after the fact, very complicated.
- ORDBMS: term is getting better defined as products mature (Informix, IBM)
- an attempt to provide best of both worlds: queries & rich data types.
- query interface.
- Rich data types with lots of OO features, esp. object identity, type-extensibility and
inheritance.
- Basic ``outer'' data type is relation, with extensible data types in the fields.
- relational theory applies to outer operations only
- Implementation:
- Single-Site (i.e. traditional)
- Parallel: lots of tightly-coupled machines solve one query together. A database
supercomputer.
- Distributed: geographically distributed machines, each "hosting" different
data, participate in a more loosely coupled manner
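As a rough illustration of the ORDBMS idea in the taxonomy above (a relation as the ``outer'' type, with extensible types and user-defined functions in the fields), here is a sketch using Python's sqlite3 module. SQLite is not an ORDBMS; its adapter/converter hooks and create_function are used only as a stand-in. The Point type, the site table and dist_from_origin are all hypothetical names chosen for the example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """Hypothetical user-defined type stored inside a relational field."""
    x: float
    y: float


def adapt_point(p):
    # Python object -> stored representation (text "x;y").
    return f"{p.x};{p.y}"


def convert_point(raw):
    # Stored representation (bytes) -> Python object.
    x, y = raw.split(b";")
    return Point(float(x), float(y))


sqlite3.register_adapter(Point, adapt_point)
sqlite3.register_converter("point", convert_point)


def dist_from_origin(text):
    """User-defined function callable from SQL; sees the stored text form."""
    x, y = (float(v) for v in text.split(";"))
    return (x * x + y * y) ** 0.5


def nearby_sites(max_dist):
    conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
    conn.create_function("dist_from_origin", 1, dist_from_origin)
    conn.execute("CREATE TABLE site (name TEXT, loc point)")
    conn.executemany(
        "INSERT INTO site VALUES (?, ?)",
        [("a", Point(3.0, 4.0)), ("b", Point(30.0, 40.0))],
    )
    # The outer operations (selection, ordering) stay relational;
    # the type-specific logic lives in the registered function.
    rows = conn.execute(
        "SELECT name, loc FROM site WHERE dist_from_origin(loc) <= ? ORDER BY name",
        (max_dist,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(nearby_sites(10.0))
```

Note what the sketch leaves out: the query engine treats dist_from_origin as a black box, so it cannot use an index on loc. Indexing new data types and extensible query processing, listed earlier under "Complex", are about closing that gap.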