Readings in Database Systems, 3rd Edition

Stonebraker & Hellerstein, eds.

DB Machines, Parallel DBMS & Gamma

History

The database machine
(taken from Boral/DeWitt '83, "Database Machines: An Idea Whose Time has Passed?")

Processor Per Track (PPT): Examples: CASSM, RAP, RARES. Too expensive. Wait for bubble memory, charge-couple devices (CCDs)? Still waiting...
PP Head: Examples: DBC, SURE. Helps with selection queries. Depends on parallel readout disks (read all heads at once). This has become increasingly difficult over time ("settle" is increasingly tricky as disks get faster and smaller).
Off-the-disk DB Machines: Examples: DIRECT, RAP.2, RDBM, DBMAC, INFOPLEX. Precursors of today's parallel DBs. A controller CPU, and special-purpose query processing CPUs with shared disks and memory.

1981: Britton-Lee Database Machine
"There was 1 6Mhz Z8000 Database processor, up to 4 Z8000 communications processors. Up to 4 "Intellegent" disk controllers (they could optimize their own sector schedule, I think). I think you could put up to 16 .5Gig drives on the thing and probably 4-6 Megs of memory. About a year later we upped the DBP to 10Mhz (yeilding a mighty 1 mips). " -- Mike Ubell

All failed. Why?

these don't help much with sort, join, etc.
special-purpose hardware is a losing proposition

prohibitively expensive (no economy of scale)
slow to evolve

Is it time to revisit all this??

IDISK @ Berkeley
Think more about this next lecture

Parallel DB 101

Performance metrics:

Speedup
Scaleup

Transaction scaleup: N times as many TPC-C’s for N machines
Batch scaleup: N times as big a query for N machines

2 kinds of parallelism

pipelined (most are short)
partition

3 barriers to linearity:

startup overheads
interference
skew

3 basic architectures

shared-memory
shared-disk
shared-nothing

DeWitt et al., GAMMA DB Machine

Gamma: "shared-nothing" multiprocessor system
Other shared-nothing systems: Bubba (MCC), Volcano (Colorado), Teradata, Tandem, Informix, IBM Parallel Edition
Version 1.0 (1985)

token ring connecting 20 VAX 11/750's.
Eight processors had a disk each.
Issues:

Token Ring packet size was 2K, so they used 2K disk blocks. Mistake.
Bottleneck at Unibus (1/20 the bandwidth of the network itself, slower than disk!)
Network interface also a bottleneck – only buffered 2 incoming packets
Only 2M RAM per processor, no virtual memory

Version 2.0 (1989)

Intel iPSC-2 hypercube. (32 386s, 8M RAM each, 1 333M disk each.)
networking provides 8 full duplex reliable channels at a time
small messages are datagrams, larger ones form a virtual circuit which goes away at EOT
Usual multiprocessor story: tuned for scientific apps

OS supports few heavyweight processes
Solution: write a new OS! (NOSE) – actually a thread package
SCSI controller transferred only 1K blocks

ported GAMMA 1.0 – some flaky porting stuff
Other complaints

want disk-to-memory DMA. As of now, 10% of cycles wasted copying from I/O buffer, and CPU constantly interrupted to do so (13 times per 8K block!)

"The Future", circa 1990: CM-5, Intel Touchstone Sigma Machine.
"The Future", circa 1995: COW/NOW/SHRIMP with high-speed LAN (Myrinet, ATM, Fast Ethernet, etc.)
"The Future", circa 2000???
How does this differ from a distributed dbms?

no notion of site autonomy
centralized schema
all queries start at "host"

Storage organization: all relations are "horizontally partitioned" across all disk drives in four ways:

round robin: default, used for all query outputs
hashed: randomize on key attributes
range partitioned (specified distribution), stored in range table for relation
range partitioned (uniform distribution – sort and then cut into equal pieces.)
Within a site, store however you wish.
indices created at all sites
A better idea: heat (Bubba?)

hotter relations get partitioned across more sites
why is this better?

"primary multiprocessor index" identifies where a tuple resides.

Query Execution

An operator tree constructed, each node assigned one or more operator processes at each site.

These operator processes never die

Hash-based algorithms only for join & grouping.
"Left-deep" trees (I call this "right-deep". Think of it as "building-deep")

pipelines at most 2 joins deep
this was done because of lack of RAM
"right-deep" (my "left-deep", or "probing-deep") is probably better

Life of a query:

parsed and optimized at host (Query Manager)
if a single-site query, send to that site for execution
otherwise sent to a dispatcher process (load control)
dispatcher process gives query to scheduler process
scheduler process passes pieces to operator processes at the sites
results sent back to scheduler who passes them to Query Manager for display

More detail:

the example query in the paper
split table used to route result tuples to appropriate processors. Three types:

Hashed
Range-Partitioned.
Round-Robin.

selection: uses indices, compiled predicates, prefetching.
join: basic idea is to use hybrid hash, with one bucket per processor.

tuples corresponding to each logical bucket should fit in the aggregate mem of participating processors
each logical bucket after the first is split across all disks (may be diskless processors)

aggregation: done piecewise. Groups accumulated at individual nodes for final result.

Question: can all aggs be done piecewise?

updates: as usual, unless it requires moving a tuple (when?)
control messages: 3x as many as operators in the plan tree

scheduler: Initiate
operator: ID of port to talk to
operator: Done
That’s all the coordination needed!!

CC: 2PL with bi-granularity locking (file & page). Centralized deadlock detection.
ARIES-based recovery, static assignment of processors to log sites.
Failure Management: Chained Declustering

nodes belong to relation clusters
relations are declustered within a relation cluster
backups are declustered "one disk off"
tolerates failure of a single disk or processor
discussion in paper vs. interleaved declustering

these tradeoffs have been beaten to death in RAID world

glosses over how indexes are handled in case of failure (another paper)

Simplified picture: Gamma gets parallelism by

Running multiple small queries in parallel at (hopefully) disjoint sites.
Running big queries over multiple sites --- idea here is:

Logically partition problem so that each subproblem is independent, and
Run the partitions in parallel, one per processor.

Examples: hash joins, sorting, ...

Performance results:

a side-benefit of declustering is that small relations mean fewer & smaller seeks
big queries get linear speedup!

not perfect, but amazingly close

pretty constant scaleup!
some other observations

hash-join goes faster if already partitioned on join attribute

but not much! Redistribution of tuples isn’t too expensive

as you add processors, you lose benefits of short-circuited messages, in addition to incurring slight overhead for the additional processes

Missing Research Issues (biggies!):

query optimization (query rewriting for subqueries, plan optimization)
load balancing: inter-query parallelism with intra-query parallelism
disk striping, reliability, etc.
admin utilities & database design
skew handling for non-standard data types