Mariposa: An Agoric Distributed DBMS for WANs

Theme: to scale DDBMS to 1000's of sites, need loose coupling. This can be done with an economic paradigm for resource allocation: query optimization (scheduling) and data placement.

Bad things about old DDBMSs:

Static data allocation
Single administrative structure
Homogeneity of hardware, software, networks

Goals of Mariposa:

Scale to 1000's of sites
Data Mobility
Local autonomy
Easily and autonomously configurable policies

Mechanism:

Economic (Agoric) paradigm (agora = marketplace)
Allows point-to-point decision-making to work in a more global context of resource utilization

removes need for centralized decision-making (e.g. global optimizer)
in essence, implicit aggregate information (i.e. feedback) is passed around the network, resulting in rational behavior

Some important cautionary notes:

Economics is just a useful metaphor for doing decentralized resource allocation.
This decentralized scheme will not be optimal. Goal is to scale at the expense of controlling optimality.
One can spend years arguing about policy, and trying to tune the system by changing policy.
Mariposa implemented mechanisms, but did very little research on policy.
Premature discussions of policy can be VERY distracting -- first put mechanism in place, THEN play!

Architecture and Life of a Query

Data layout: horizontally partitioned table fragments (logical or not), and replicas (of varying freshness)
3-tier architecture: clients, middleware, local site manager. Mariposa requires local sites to run Postgres.
Query is generated by application, with accompanying "bid curve" ($/Delay).
Query planning is multi-step, working (more or less) as follows:

optimizer (middleware) runs Selinger as if it were a local single-site query
fragmenter (middleware) breaks resulting query plan into pieces, arranges pieces in trivially parallelizable "strides" (which are NOT pipelined together, for no apparent reason)
broker (middleware) sends out Requests For Bid (RFBs) to sites that might be interested (more on this later)
bidder (local site) returns a "bid" for a piece of work, consisting of triple (Cost,Delay,Expiration)
coordinator (middleware) accepts bids, constructs a final plan, and informs local sites of their jobs
coordinator and local executors process the plan

Details of Bidding

Budget B(t) (bid curve) a non-increasing function of time
Broker "bids out" single-table subplans or joins of two fragments
Two protocols: the Long Protocol ("expensive bid") and the Short Protocol ("purchase order")

Long protocol is as described above
In short protocol, heuristics determine where to send work orders (e.g. scans at storage sites)

Choosing among bids:

they do this stride-by-stride, but sum of all strides delay must be <= B(D)
note that there's parallelism within a stride, so delay for a stride's bid collection is MAX of delays of the bids in the collection (whereas cost is SUM of costs)
ideally, want to pick the "lowest point below the bid curve" -- i.e. maximize difference = B(D) - C
greedy heuristic used in Mariposa:

start with minimal-delay bid collection for each stride.
for each longer-delay collection, compute cost gradiant: (decrease in cost)/(increase in time) if we switch to this collection
swap in maximum cost-gradiant alternative if difference increases. Recompute cost gradiants.
repeat until difference will not be improved

This is described as a "bottom-up" strategy. An alternative "top-down" strategy bids out large sub-pieces, which can in turn be busted up by the bidders and "subcontracted" into separate pieces

Discussion: backing off from the details, what is the plan space, what is "bidding" for, and what about other approaches?

Metadata

How does the broker find sites to get bids? Yellow Pages and advertisements.
Servers can register services (with prices) at a variety of yellow page servers. These are timestamped so that freshness can be considered in bidding.
Discussion of various costing policy issues: sale prices, coupons, bulk purchase contracts, etc.
Don't think any fancy economics got implemented in Mariposa

Costing

Simple: charge for CPU cycles and I/O bandwidth, translate to a delay via site-specific multiplicative factor
Scale delay or cost by current machine load: gives a simple (though crude) system-wide feedback for load-balancing. An example of economics taking the place of a centralized load leveler
Can set pricing per fragment (why?)
They talk about network bandwidth reservation, though this was never implemented (and is not clearly a win in the network world)

Storage

Sites can buy and sell fragments; access history required to judge value of fragments
To run a subquery, Mariposa requires the site to buy any fragments that aren't resident; this may require evicting (selling) other fragments. Note lack of pipelining causing a problem here.
No discussion here of copies; there was a great deal of thought about how sites could buy copies and contract for updates at varying rates. In turn, queries could reason about the age of copies. Since they built on Postgres, time-travel could allow for consistent (though perhaps dated) query results regardless of the copies chosen.
Fragments could get too big or too small, and splitting/coalescing fragments was seen as an important optimization issue. Either this could be determined by economics, or by a more direct mechanism.

Names & Nameservice

Mariposa used a different naming scheme than R*.
Really no reason to think it's better than R*'s

Status

Mariposa worked well enough for Jeff Sidell to run some TPC-D queries on half a dozen machines. The simple load-balancing pricing strategy easily adapted to changing workloads.
Running queries across the internet resulted in very unpredictable delays. This undercuts the economic model? See Franklin's "Query Scrambling" work tomorrow.
Cohera Corp

Commercializing Mariposa
Sssshh....wait 'til March 29 (www.cohera.com)

Discussion II: Why economics? When to use it, when to use more direct mechanisms? If economics is a metaphor for doing resource allocation, what are other plausible metaphors/mechanisms that would scale?