DB Machines, Parallel DBMS & Gamma
The database machine
(taken from Boral/DeWitt '83, "Database Machines: An Idea Whose Time Has Passed?")
- Processor Per Track (PPT): Examples: CASSM, RAP, RARES. Too expensive. Wait
for bubble memory, charge-coupled devices (CCDs)? Still waiting...
- Processor Per Head (PPH): Examples: DBC, SURE. Helps with selection queries. Depends on
parallel readout disks (read all heads at once). This has become increasingly
difficult over time (head "settle" is increasingly tricky as disks get faster and denser).
- Off-the-disk DB Machines: Examples: DIRECT, RAP.2, RDBM, DBMAC, INFOPLEX. Precursors of
today's parallel DBs. A controller CPU, and special-purpose query processing CPUs
with shared disks and memory.
- 1981: Britton-Lee Database Machine
- "There was 1 6Mhz Z8000 Database processor, up to 4 Z8000 communications
processors. Up to 4 "Intellegent" disk controllers (they could optimize their
own sector schedule, I think). I think you could put up to 16 .5Gig drives on the thing
and probably 4-6 Megs of memory. About a year later we upped the DBP to 10Mhz (yeilding a
mighty 1 mips). " -- Mike Ubell
All failed. Why?
- these don't help much with sort, join, etc.
- special-purpose hardware is a losing proposition
- prohibitively expensive (no economy of scale)
- slow to evolve
Is it time to revisit all this??
- IDISK @ Berkeley
- Think more about this next lecture
Parallel DB 101
- Transaction scaleup: N times as many TPC-Cs for N machines
- Batch scaleup: N times as big a query for N machines (rough arithmetic sketched in code below)
2 kinds of parallelism
- pipelined (most pipelines are short)
- partitioned: split the data, run the same operator on every partition. This is the big win.
3 barriers to linearity:
- startup overheads
- interference (processors fighting over shared resources: interconnect, disks, locks)
- skew (the elapsed time of a step is the time of its slowest partition)
3 basic architectures:
- shared-memory
- shared-disk
- shared-nothing (Gamma's choice; scales furthest since nothing is a shared bottleneck)
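- A rough arithmetic sketch of the speedup/scaleup metrics above (my own toy example
in Python, not from the papers; the timings are made up):

    # speedup: same job on N times the hardware; ideal answer is N
    def speedup(t_on_1_node, t_on_n_nodes):
        return t_on_1_node / t_on_n_nodes

    # scaleup: N times the job on N times the hardware; ideal answer is 1.0
    def scaleup(t_small_job_1_node, t_big_job_n_nodes):
        return t_small_job_1_node / t_big_job_n_nodes

    # e.g. a batch query taking 100s on 1 node and 110s when data and nodes both
    # grow 8x has scaleup 100/110 ~= 0.91: close to linear, with the gap coming
    # from startup, interference, and skew.
    print(speedup(100.0, 15.0), scaleup(100.0, 110.0))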
DeWitt et al., GAMMA DB Machine
- Gamma: "shared-nothing" multiprocessor system
- Other shared-nothing systems: Bubba (MCC), Volcano (Colorado), Teradata, Tandem,
Informix, IBM Parallel Edition
- Version 1.0 (1985)
- token ring connecting 20 VAX 11/750's.
- Eight processors had a disk each.
- Token Ring packet size was 2K, so they used 2K disk blocks. Mistake.
- Bottleneck at Unibus (1/20 the bandwidth of the network itself, slower than disk!)
- Network interface was also a bottleneck: it only buffered 2 incoming packets
- Only 2M RAM per processor, no virtual memory
- Version 2.0 (1989)
- Intel iPSC-2 hypercube. (32 386s, 8M RAM each, 1 333M disk each.)
- networking provides 8 full duplex reliable channels at a time
- small messages are datagrams, larger ones form a virtual circuit which goes away at EOT
- Usual multiprocessor story: tuned for scientific apps
- OS supports few heavyweight processes
- Solution: write a new OS! (NOSE, actually a thread package)
- SCSI controller transferred only 1K blocks
- ported GAMMA 1.0; some flaky porting issues along the way
- Other complaints
- want disk-to-memory DMA. As of now, 10% of cycles wasted copying from I/O buffer, and
CPU constantly interrupted to do so (13 times per 8K block!)
- "The Future", circa 1990: CM-5, Intel Touchstone Sigma Machine.
- "The Future", circa 1995: COW/NOW/SHRIMP with high-speed LAN (Myrinet, ATM,
Fast Ethernet, etc.)
- "The Future", circa 2000???
- How does this differ from a distributed dbms?
- no notion of site autonomy
- centralized schema
- all queries start at "host"
- Storage organization: all relations are "horizontally partitioned" across all
disk drives, in one of four ways (sketched in code after this list):
- round robin: default, used for all query outputs
- hashed: randomize on key attributes
- range partitioned (specified distribution), stored in range table for relation
- range partitioned (uniform distribution sort and then cut into equal pieces.)
- Within a site, store however you wish.
- indices created at all sites
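- A minimal sketch of the partitioning options above (Python; the site count,
range-table values, and function names are mine, not Gamma's):

    import hashlib
    from bisect import bisect_left

    N_SITES = 8

    def round_robin_site(tuple_seq_no):
        # default: spread tuples evenly regardless of their content
        return tuple_seq_no % N_SITES

    def hashed_site(key):
        # randomize on the key attribute; an equality selection hits one site
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % N_SITES

    # range table kept in the catalog for the relation: upper bound of the key
    # range assigned to each site ("specified distribution"). The "uniform
    # distribution" option just builds this table by sorting the relation and
    # cutting it into equal pieces.
    RANGE_TABLE = [100, 200, 300, 400, 500, 600, 700, float("inf")]

    def range_site(key):
        # a range selection only has to visit the sites whose ranges overlap it
        return bisect_left(RANGE_TABLE, key)

    print(round_robin_site(42), hashed_site("smith"), range_site(250))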
- A better idea: heat (Bubba?)
- hotter relations get partitioned across more sites
- why is this better?
- "primary multiprocessor index" identifies where a tuple resides.
- An operator tree is constructed; each node is assigned one or more operator processes at each site
- These operator processes never die
- Hash-based algorithms only for join & grouping.
- "Left-deep" trees (I call this "right-deep". Think of it as
- pipelines at most 2 joins deep
- this was done because of lack of RAM
- "right-deep" (my "left-deep", or "probing-deep") is
- Life of a query:
- parsed and optimized at host (Query Manager)
- if a single-site query, send to that site for execution
- otherwise sent to a dispatcher process (load control)
- dispatcher process gives query to scheduler process
- scheduler process passes pieces to operator processes at the sites
- results sent back to scheduler who passes them to Query Manager for display
- More detail:
- the example query in the paper
- split table used to route result tuples to appropriate processors. Three types:
- selection: uses indices, compiled predicates, prefetching.
- join: basic idea is to use hybrid hash, with one bucket per processor.
- tuples corresponding to each logical bucket should fit in the aggregate memory of
the joining processors
- each logical bucket after the first is split across all disks (the processors doing
the join may be diskless)
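- A stripped-down sketch of the partitioning step (Python; it shows only the
"one bucket per processor" routing that the split tables do, and skips hybrid
hash's spilling of later buckets to disk):

    from collections import defaultdict

    N_PROC = 4   # hypothetical number of joining processors

    def split(rel, key):
        # the split table's job: route each tuple to the processor that owns
        # its hash bucket, so matching R and S tuples meet at the same node
        buckets = defaultdict(list)
        for t in rel:
            buckets[hash(t[key]) % N_PROC].append(t)
        return buckets

    def parallel_hash_join(R, S, key):
        r_parts, s_parts = split(R, key), split(S, key)
        out = []
        for p in range(N_PROC):            # each iteration = one processor's local join
            ht = defaultdict(list)
            for r in r_parts[p]:           # build on the local R fragment
                ht[r[key]].append(r)
            for s in s_parts[p]:           # probe with the local S fragment
                out.extend({**r, **s} for r in ht.get(s[key], []))
        return out

    R = [{"k": i, "a": i * 10} for i in range(6)]
    S = [{"k": i, "b": i * 100} for i in range(0, 6, 2)]
    print(parallel_hash_join(R, S, "k"))   # joins on k = 0, 2, 4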
- aggregation: done piecewise. Groups accumulated at individual nodes for final result.
- Question: can all aggs be done piecewise?
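- A small sketch of piecewise aggregation (Python; the site fragments and names
are made up). It also hints at the answer: aggregates like AVG decompose into
per-site SUM/COUNT partials, but a holistic one like exact MEDIAN does not:

    from collections import defaultdict

    def local_partials(fragment, group_key, val_key):
        # each site computes SUM and COUNT per group over its own fragment
        partial = defaultdict(lambda: [0, 0])
        for t in fragment:
            p = partial[t[group_key]]
            p[0] += t[val_key]
            p[1] += 1
        return partial

    def global_avg(all_partials):
        # one site merges the partials: AVG = total SUM / total COUNT
        total = defaultdict(lambda: [0, 0])
        for part in all_partials:
            for g, (s, c) in part.items():
                total[g][0] += s
                total[g][1] += c
        return {g: s / c for g, (s, c) in total.items()}

    site1 = [{"dept": "cs", "sal": 10}, {"dept": "ee", "sal": 20}]
    site2 = [{"dept": "cs", "sal": 30}]
    print(global_avg(local_partials(f, "dept", "sal") for f in (site1, site2)))
    # {'cs': 20.0, 'ee': 20.0}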
- updates: as usual, unless they require moving a tuple (when? when the update changes the partitioning attribute)
- control messages: 3x as many as operators in the plan tree
- scheduler: Initiate
- operator: ID of port to talk to
- operator: Done
- That's all the coordination needed!!
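- Toy accounting of that handshake (Python; the message contents are invented,
only the 3-messages-per-operator count is the point):

    def run_plan(operator_names):
        msgs = []
        for op in operator_names:
            msgs.append(("scheduler", op, "Initiate"))    # 1: start the operator
            msgs.append((op, "scheduler", "port 5000"))   # 2: ID of port to talk to
        for op in operator_names:
            msgs.append((op, "scheduler", "Done"))        # 3: finished
        return msgs

    plan = ["scan_R", "scan_S", "join", "store"]
    print(len(run_plan(plan)))   # 12 == 3 x 4 operators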
- CC: 2PL with bi-granularity locking (file & page). Centralized deadlock detection.
- ARIES-based recovery, static assignment of processors to log sites.
- Failure Management: Chained Declustering (placement sketched in code below)
- nodes belong to relation clusters
- relations are declustered within a relation cluster
- backups are declustered "one disk off"
- tolerates failure of a single disk or processor
- discussion in paper vs. interleaved declustering
- these tradeoffs have been beaten to death in RAID world
- glosses over how indexes are handled in case of failure (another paper)
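- A minimal placement/read-routing sketch of chained declustering (Python; the
cluster size is made up, and the paper's real payoff, rebalancing the failed
node's load along the chain, is left out):

    N = 4   # nodes in the relation cluster

    def placement(frag):
        # primary copy of fragment i on node i, backup "one disk off" on node i+1
        return {"primary": frag % N, "backup": (frag + 1) % N}

    def read_site(frag, failed=None):
        # any single disk/processor failure is survivable: use whichever copy is up
        p = placement(frag)
        if p["primary"] != failed:
            return p["primary"]
        if p["backup"] != failed:
            return p["backup"]
        raise RuntimeError("two failures in one cluster: fragment lost")

    print([placement(f) for f in range(N)])
    print(read_site(2, failed=2))   # node 3 serves fragment 2 from the backup copy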
- Simplified picture: Gamma gets parallelism by
- Running multiple small queries in parallel at (hopefully) disjoint sites.
- Running big queries over multiple sites --- idea here is:
- Logically partition problem so that each subproblem is independent, and
- Run the partitions in parallel, one per processor.
- Examples: hash joins, sorting, ...
- Performance results:
- a side-benefit of declustering: each disk holds only a small fragment of any relation, so scans do fewer & shorter seeks
- big queries get linear speedup!
- not perfect, but amazingly close
- pretty constant scaleup!
- some other observations
- hash-join goes faster if already partitioned on join attribute
- but not much! Redistribution of tuples isn't too expensive
- as you add processors, you lose benefits of short-circuited messages, in addition to
incurring slight overhead for the additional processes
- Missing Research Issues (biggies!):
- query optimization (query rewriting for subqueries, plan optimization)
- load balancing: mixing inter-query parallelism with intra-query parallelism
- disk striping, reliability, etc.
- admin utilities & database design
- skew handling for non-standard data types