redbookcover.gif (13955 bytes) Readings in Database Systems, 3rd Edition

Stonebraker & Hellerstein, eds.

DB Machines, Parallel DBMS & Gamma


The database machine
(taken from Boral/DeWitt '83, "Database Machines: An Idea Whose Time has Passed?")

  • Processor Per Track (PPT): Examples: CASSM, RAP, RARES.  Too expensive.  Wait for bubble memory, charge-couple devices (CCDs)?  Still waiting...
  • PP Head: Examples: DBC, SURE.  Helps with selection queries.  Depends on parallel readout disks (read all heads at once).  This has become increasingly difficult over time ("settle" is increasingly tricky as disks get faster and smaller).
  • Off-the-disk DB Machines: Examples: DIRECT, RAP.2, RDBM, DBMAC, INFOPLEX. Precursors of today's parallel DBs.  A controller CPU, and special-purpose query processing CPUs with shared disks and memory.
    • 1981: Britton-Lee Database Machine
    • "There was 1 6Mhz Z8000 Database processor, up to 4 Z8000 communications processors. Up to 4 "Intellegent" disk controllers (they could optimize their own sector schedule, I think). I think you could put up to 16 .5Gig drives on the thing and probably 4-6 Megs of memory. About a year later we upped the DBP to 10Mhz (yeilding a mighty 1 mips). " -- Mike Ubell

All failed.  Why?

  • these don't help much with sort, join, etc.
  • special-purpose hardware is a losing proposition
    • prohibitively expensive (no economy of scale)
    • slow to evolve

Is it time to revisit all this??

  • IDISK @ Berkeley
  • Think more about this next lecture

Parallel DB 101

Performance metrics:

  • Speedup
  • Scaleup
    • Transaction scaleup: N times as many TPC-C’s for N machines
    • Batch scaleup: N times as big a query for N machines

2 kinds of parallelism

    • pipelined (most are short)
    • partition

3 barriers to linearity:

    • startup overheads
    • interference
    • skew

3 basic architectures

    • shared-memory
    • shared-disk
    • shared-nothing

DeWitt et al., GAMMA DB Machine

  • Gamma: "shared-nothing" multiprocessor system
  • Other shared-nothing systems: Bubba (MCC), Volcano (Colorado), Teradata, Tandem, Informix, IBM Parallel Edition
  • Version 1.0 (1985)
    • token ring connecting 20 VAX 11/750's.
    • Eight processors had a disk each.
    • Issues:
      • Token Ring packet size was 2K, so they used 2K disk blocks. Mistake.
      • Bottleneck at Unibus (1/20 the bandwidth of the network itself, slower than disk!)
      • Network interface also a bottleneck – only buffered 2 incoming packets
      • Only 2M RAM per processor, no virtual memory
  • Version 2.0 (1989)
    • Intel iPSC-2 hypercube. (32 386s, 8M RAM each, 1 333M disk each.)
    • networking provides 8 full duplex reliable channels at a time
    • small messages are datagrams, larger ones form a virtual circuit which goes away at EOT
    • Usual multiprocessor story: tuned for scientific apps
      • OS supports few heavyweight processes
      • Solution: write a new OS! (NOSE) – actually a thread package
      • SCSI controller transferred only 1K blocks
    • ported GAMMA 1.0 – some flaky porting stuff
    • Other complaints
      • want disk-to-memory DMA. As of now, 10% of cycles wasted copying from I/O buffer, and CPU constantly interrupted to do so (13 times per 8K block!)
  • "The Future", circa 1990: CM-5, Intel Touchstone Sigma Machine.
  • "The Future", circa 1995: COW/NOW/SHRIMP with high-speed LAN (Myrinet, ATM, Fast Ethernet, etc.)
  • "The Future", circa 2000???
  • How does this differ from a distributed dbms?
    • no notion of site autonomy
    • centralized schema
    • all queries start at "host"
  • Storage organization: all relations are "horizontally partitioned" across all disk drives in four ways:
    • round robin: default, used for all query outputs
    • hashed: randomize on key attributes
    • range partitioned (specified distribution), stored in range table for relation
    • range partitioned (uniform distribution – sort and then cut into equal pieces.)
    • Within a site, store however you wish.
    • indices created at all sites
    • A better idea: heat (Bubba?)
      • hotter relations get partitioned across more sites
      • why is this better?
  • "primary multiprocessor index" identifies where a tuple resides.

Query Execution

  • An operator tree constructed, each node assigned one or more operator processes at each site.
    • These operator processes never die
  • Hash-based algorithms only for join & grouping.
  • "Left-deep" trees (I call this "right-deep". Think of it as "building-deep")
    • pipelines at most 2 joins deep
    • this was done because of lack of RAM
    • "right-deep" (my "left-deep", or "probing-deep") is probably better
  • Life of a query:
    • parsed and optimized at host (Query Manager)
    • if a single-site query, send to that site for execution
    • otherwise sent to a dispatcher process (load control)
    • dispatcher process gives query to scheduler process
    • scheduler process passes pieces to operator processes at the sites
    • results sent back to scheduler who passes them to Query Manager for display
  • More detail:
    • the example query in the paper
    • split table used to route result tuples to appropriate processors. Three types:
      • Hashed
      • Range-Partitioned.
      • Round-Robin.
    • selection: uses indices, compiled predicates, prefetching.
    • join: basic idea is to use hybrid hash, with one bucket per processor.
      • tuples corresponding to each logical bucket should fit in the aggregate mem of participating processors
      • each logical bucket after the first is split across all disks (may be diskless processors)
    • aggregation: done piecewise. Groups accumulated at individual nodes for final result.
      • Question: can all aggs be done piecewise?
    • updates: as usual, unless it requires moving a tuple (when?)
    • control messages: 3x as many as operators in the plan tree
      • scheduler: Initiate
      • operator: ID of port to talk to
      • operator: Done
      • That’s all the coordination needed!!
  • CC: 2PL with bi-granularity locking (file & page). Centralized deadlock detection.
  • ARIES-based recovery, static assignment of processors to log sites.
  • Failure Management: Chained Declustering
    • nodes belong to relation clusters
    • relations are declustered within a relation cluster
    • backups are declustered "one disk off"
    • tolerates failure of a single disk or processor
    • discussion in paper vs. interleaved declustering
      • these tradeoffs have been beaten to death in RAID world
    • glosses over how indexes are handled in case of failure (another paper)
  • Simplified picture: Gamma gets parallelism by
    • Running multiple small queries in parallel at (hopefully) disjoint sites.
    • Running big queries over multiple sites --- idea here is:
      • Logically partition problem so that each subproblem is independent, and
      • Run the partitions in parallel, one per processor.
        • Examples: hash joins, sorting, ...
  • Performance results:
    • a side-benefit of declustering is that small relations mean fewer & smaller seeks
    • big queries get linear speedup!
      • not perfect, but amazingly close
    • pretty constant scaleup!
    • some other observations
      • hash-join goes faster if already partitioned on join attribute
        • but not much! Redistribution of tuples isn’t too expensive
      • as you add processors, you lose benefits of short-circuited messages, in addition to incurring slight overhead for the additional processes
  • Missing Research Issues (biggies!):
    • query optimization (query rewriting for subqueries, plan optimization)
    • load balancing: inter-query parallelism with intra-query parallelism
    • disk striping, reliability, etc.
    • admin utilities & database design
    • skew handling for non-standard data types

1998, Joseph M. Hellerstein.  Last modified 08/18/98.
Feedback welcomed.