Benchmarking
Goals
- What's a benchmark for?
- What do you want to learn?
- How much are you prepared to invest?
  - programmer time
  - machine resources
  - management time
  - lost opportunity
- What are you prepared to give up?
- Side-benefits to the community...
Types of Benchmarks
- Generic benchmarks try to give a broad idea of price/performance
  across many applications.
- Application-specific benchmarks try to focus on a restricted
  class of workloads. Some of the main consortia defining these:
  - SPEC (System Performance Evaluation Cooperative): scientific workloads,
    designed to measure workstation performance
  - The Perfect Club: scientific workloads on exotic parallel architectures
  - TPC (Transaction Processing Performance Council): transaction processing
    and decision-support workloads, architecture-independent
Gray's criteria for a good app-specific benchmark:
- Relevant: It must measure the peak performance and price/performance
  of systems when performing typical operations within that problem domain.
- Portable: It should be easy to implement the benchmark on many different
  systems and architectures.
- Scaleable: The benchmark should apply to small and large computer
  systems. It should be possible to scale the benchmark up to larger systems,
  and to parallel computer systems, as computer performance and architecture
  evolve.
- Simple: The benchmark must be understandable, otherwise it will
  lack credibility.
A counter-example: MIPS (Millions of Instructions Per Second)
- irrelevant: doesn't translate directly to useful work (see the illustration
  after this list)
- not portable: Intel MIPS != Sun MIPS
- not scalable: how does it apply to multiprocessors?
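A hypothetical illustration (the numbers are invented for the point, not taken
from any measurement): machine A runs a query using 200 million instructions at
100 MIPS, finishing in 2.0 seconds; machine B needs 500 million instructions for
the same query because of a different instruction set and compiler, and runs at
200 MIPS, finishing in 2.5 seconds. The machine with the higher MIPS rating does
less useful work per second.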
The importance of benchmark acceptance
- Benchmarketing
- Wisconsin benchmark history: DeWitt vs. Ellison. The DeWitt clause.
- Anon, et al.
- Benchmark wars: escalating numbers lead to escalating tricks, wasted time
  ("just let us tune it as much as Vendor X did")
  - if you're doing your own, avoid this with "no reruns, no excuses"
Transaction Processing Performance Council
TPC arose in 1988 (headed by consultant Omri Serlin) as a consortium of 35
hardware/software companies.
- define benchmarks for TP and DSS, define cost/performance metrics (example
  after this list), provide official audits
- Today: TPC-C is the TP benchmark (succeeding TPC-A and TPC-B), TPC-D is the
  decision-support benchmark
- Focus on as real-world a scenario as possible:
  - performance from terminals to the server and back, i.e., all aspects of the
    system (as opposed to, say, SPEC)
  - note the overhead of doing this!
    - hard to set up
    - requires an auditor to report results
  - functionality as well as performance
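As an example of the metrics: TPC-C reports throughput as tpmC (new-order
transactions completed per minute under the mixed workload) and price/performance
as dollars per tpmC, i.e., the total priced system cost divided by the throughput.
With hypothetical numbers, a $500,000 configuration sustaining 25,000 tpmC would
be reported at $20/tpmC.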
Writing a benchmark
- try to measure core speed and potential bottlenecks
  - "micro-benchmarks", e.g. Sort
- try to expose functionalities, or the lack thereof
  - e.g. Wisconsin, TPC-D
  - can backfire for adaptive parts of a system: optimizers are notoriously
    hard to "benchmark"
- try to measure end-to-end performance on a real-ish workload
- a statistical note: don't use average performance! consider variance.
  if you must give one number, give 90th-percentile performance (see the
  sketch after this list).
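A minimal sketch in Python of the micro-benchmark and percentile points above;
the sort workload, input size, and repetition count are arbitrary illustration
choices, not part of any standard benchmark.

    # Sort micro-benchmark reporting median and 90th-percentile latency
    # instead of the mean, so a few slow runs can't hide in an average.
    import random
    import time

    def run_once(n):
        """Time one in-memory sort of n random integers."""
        data = [random.randint(0, 1_000_000) for _ in range(n)]
        start = time.perf_counter()
        data.sort()
        return time.perf_counter() - start

    def percentile(samples, p):
        """Nearest-rank p-th percentile (p in 0..100) of a list of numbers."""
        ordered = sorted(samples)
        k = min(len(ordered) - 1, int(round(p / 100.0 * (len(ordered) - 1))))
        return ordered[k]

    if __name__ == "__main__":
        latencies = [run_once(100_000) for _ in range(30)]
        print("median latency: %.4fs" % percentile(latencies, 50))
        print("90th percentile latency: %.4fs" % percentile(latencies, 90))

Reporting the 90th percentile makes occasional slow runs visible in the headline
number rather than letting them wash out in an average.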
DB Benchmarks to be aware of:
- Wisconsin (mostly for history)
- TPC-C
- TPC-D
- Set Query: Pat O'Neil's complex query benchmark (shows off Model 204)
- OO7: OODBMS benchmark from Wisconsin and some of the O-vendors
- Sequoia: app-specific benchmark for Earth Science from the Sequoia project
  at Berkeley (Stonebraker) and UCSB (earth scientists). Used as an
  early Object-Relational and GIS benchmark (mostly R-trees and user-defined
  functions, but also one transitive closure query (!)).
- Bucky: an Object-Relational benchmark from Wisconsin focusing on structured
  types (refs and nested sets)
- OR-1: a yet-to-be-completed Object-Relational benchmark from Wisconsin,
  Informix (Stonebraker), and IBM (Carey). Was supposed to fix Bucky
  with more focus on "what matters to customers".
- Gray's Benchmark Handbook now lives on the web.