Data Mining
Emerging hot area, linking DBMS with AI and statistics.
Vision: O Great Database! Tell me what I want to know.
Practice (c. 1998): a small collection of statistical "queries" (with wide open
parameters)
 Clustering/Segmentation: group together similar items (separating dissimilar items)
 Predictive Modeling: predict value of some attributes based on values of others (and
previous "training")
 e.g. classification: assign items to prespecified categories
 Dependency Modeling: model joint probability densities (relationships among columns)
 another example: Association Rules (e.g. find items that get purchased together)
 Data Summarization: find compact descriptions of subsets of data
 Change and Deviation Analysis: find significant changes in sequences of data (e.g. stock
data). Sequence-oriented.
Applications of Mining:
 Marketing
 Fraud detection
 Telecomm
 Science (astronomy, remote sensing, protein sequencing)
 NBA stats (IBM Advanced Scout)
DBMS spin: a taxonomy of queries:
Association Rules
Why look at this example of data mining?
 It's statistically "lite" (no background required)
 It has a relational query processing flavor
 Probably for both of these reasons, there have been a ton of follow-on papers (arguably
too many...)
"Basket data"
Association Rule: X ==> Y (X and Y are disjoint sets of items)
 confidence c: c% of transactions that contain X also contain Y (rule-specific)
 support s: s% of all transactions contain both X and Y (relative to all data)
Problem: Efficiently find all rules with support > minsup, confidence > minconf
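The two measures are easy to state directly. A minimal sketch (mine, not from the paper) computing support and confidence over the basket-data example used later in these notes:

```python
# Basket data: each transaction is a set of item IDs
# (these are the four transactions from the paper's example).
transactions = [
    {1, 3, 4},
    {2, 3, 5},
    {1, 2, 3, 5},
    {2, 5},
]

def support(itemset, transactions):
    """Fraction of all transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction also containing Y."""
    return support(x | y, transactions) / support(x, transactions)

# Rule {2} ==> {5}: support is 3/4, confidence is 3/3.
print(support({2, 5}, transactions))       # 0.75
print(confidence({2}, {5}, transactions))  # 1.0
```

Note that support is relative to all transactions, while confidence is conditional on the rule's antecedent, matching the definitions above.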
Anecdotes:
 diapers & beer
 Wendy’s & McDonalds (pb & j)
One algorithm from the paper: Apriori
L1 = {large 1-itemsets}
for (k = 2; Lk-1 != ∅; k++) {
    Ck = apriori-gen(Lk-1);  // candidate k-itemsets
    forall transactions t in the database {
        Ct = subset(Ck, t);  // candidates from Ck contained in t
        forall candidates c in Ct
            c.count++;
    }
    Lk = {c in Ck | c.count >= minsup}
}
Answer = ∪k Lk
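The loop above can be rendered as a compact, runnable Python sketch (my own rendering, not the paper's code; candidate generation is inlined rather than a separate apriori-gen call, and counts are absolute as in the paper's example):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset itemset: count} for all itemsets with count >= minsup."""
    # L1: count individual items, keep the large ones
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= minsup}
    answer = dict(Lk)
    k = 2
    while Lk:
        # candidate generation: join pairs of large (k-1)-itemsets,
        # prune candidates with a non-large (k-1)-subset
        prev = set(Lk)
        Ck = {a | b for a in prev for b in prev
              if len(a | b) == k
              and all(frozenset(s) in prev for s in combinations(a | b, k - 1))}
        # one pass over the database to count candidates
        counts = {c: 0 for c in Ck}
        for t in transactions:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= minsup}
        answer.update(Lk)
        k += 1
    return answer

# The paper's example database, minsup = 2
large = apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], minsup=2)
print(large[frozenset({2, 3, 5})])  # 2
```

Running it on the paper's four transactions reproduces the worked example below: the final large itemsets include {2 3 5} with count 2.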
apriori-gen(Lk-1) {
    // Intuition: every subset of a large itemset must be large.
    // So combine almost-matching pairs of large (k-1)-itemsets,
    // and prune out those with non-large (k-1)-subsets.
    join:
        insert into Ck
        select p.item1, …, p.itemk-1, q.itemk-1
        from Lk-1 p, Lk-1 q
        where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1;
    prune:
        // delete itemsets such that some (k-1)-subset is not in Lk-1
        forall itemsets c in Ck
            forall (k-1)-subsets s of c
                if (s not in Lk-1) delete c from Ck;
}
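The join-then-prune steps can be sketched directly in Python, assuming each (k-1)-itemset is kept as a sorted tuple so the "almost-matching" join condition applies:

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Given large (k-1)-itemsets as sorted tuples, return candidate k-itemsets."""
    k = len(next(iter(L_prev))) + 1
    Lset = set(L_prev)
    Ck = set()
    # join: p and q agree on their first k-2 items, and p's last item < q's last
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must itself be large
                if all(s in Lset for s in combinations(c, k - 1)):
                    Ck.add(c)
    return Ck

# L2 from the paper's example; only {2 3 5} survives join + prune
L2 = [(1, 3), (2, 3), (2, 5), (3, 5)]
print(apriori_gen(L2))  # {(2, 3, 5)}
```

For instance, (1, 3) and (3, 5) do not join (first items differ), while (2, 3) and (2, 5) join to (2, 3, 5), whose three 2-subsets are all in L2.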
Example from paper (minsup = 2)

Database:
  TID | Items
  ----+--------
  100 | 1 3 4
  200 | 2 3 5
  300 | 1 2 3 5
  400 | 2 5

L1:
  Itemset | Support
  --------+--------
  {1}     | 2
  {2}     | 3
  {3}     | 3
  {5}     | 3

C2:
  Itemset
  -------
  {1 2}
  {1 3}
  {1 5}
  {2 3}
  {2 5}
  {3 5}

L2:
  Itemset | Support
  --------+--------
  {1 3}   | 2
  {2 3}   | 2
  {2 5}   | 3
  {3 5}   | 2

L3:
  Itemset | Support
  --------+--------
  {2 3 5} | 2
Efficiently Implementing subset: The hash tree.
 Leaves contain a list of itemsets
 Internal nodes contain a hash table, with each bucket pointing to a child.
 Root is at depth 1.
 When inserting into an interior node at depth d, we choose the child by applying a hash
function to the d-th item in the itemset.
 To check all candidate subsets of transaction t:
 If at leaf, find which itemsets there are in t (i.e. naïve check)
 If we’ve reached an internal node by hashing on item i, hash on each item after i
in turn, and continue recursively for each.
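A toy version of the hash tree, following the structure above (the bucket count, leaf capacity, and modulus hash are my own choices, not the paper's):

```python
class HashTree:
    def __init__(self, nbuckets=3, max_leaf=2, depth=1):
        self.nbuckets, self.max_leaf, self.depth = nbuckets, max_leaf, depth
        self.itemsets = []    # used while this node is a leaf
        self.children = None  # bucket -> child, once this node is interior

    def _hash(self, item):
        return item % self.nbuckets

    def insert(self, itemset):
        """Insert a candidate itemset, given as a sorted tuple."""
        if self.children is None:
            self.itemsets.append(itemset)
            # split an overfull leaf, hashing on the depth-th item
            if len(self.itemsets) > self.max_leaf and self.depth <= len(itemset):
                old, self.itemsets, self.children = self.itemsets, [], {}
                for s in old:
                    self.insert(s)
        else:
            b = self._hash(itemset[self.depth - 1])
            child = self.children.setdefault(
                b, HashTree(self.nbuckets, self.max_leaf, self.depth + 1))
            child.insert(itemset)

    def subsets_in(self, t):
        """All stored itemsets contained in transaction t (a sorted tuple)."""
        found = set()
        if self.children is None:
            # at a leaf: naive check of each stored itemset against t
            found.update(s for s in self.itemsets if set(s) <= set(t))
        else:
            # interior node: hash on each item of t in turn and recurse
            for item in t:
                b = self._hash(item)
                if b in self.children:
                    found |= self.children[b].subsets_in(t)
        return found

tree = HashTree()
for c in [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]:
    tree.insert(c)
print(tree.subsets_in((2, 3, 5)))  # {(2, 3), (2, 5), (3, 5)}
```

This simplified traversal may revisit a bucket (the set dedups); the paper's version passes only the items after the hashed one down the recursion, which avoids that.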
Variation: AprioriTid generates a new "database" at each step k, with items
being <TID, all subsets of size k in that xact>, and scans that rather than the
original database next time around. Benefit: database gets fewer rows at each stage. Cost:
databases generated may be big.
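The transformed database can be sketched as follows (my own illustration: rows pair a TID with all of that transaction's k-subsets, as described above; the real AprioriTid keeps only candidate subsets):

```python
from itertools import combinations

def transform(db, k):
    """Map each TID to the set of k-subsets of its transaction."""
    return {tid: set(combinations(sorted(items), k))
            for tid, items in db.items()}

# The paper's example database at step k = 2
db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
print(transform(db, 2)[400])  # {(2, 5)}
```

At each step the rows shrink toward the surviving large itemsets, but a row for a wide transaction can blow up (a transaction of n items has C(n, k) k-subsets), which is the cost noted above.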
Hybrid: AprioriHybrid: run Apriori for a while, then switch to AprioriTid when the
generated DB would fit in memory.
Analysis shows that AprioriHybrid is fastest, beating old algorithms and matching
Apriori and AprioriTid when either one wins.
Later Work:
 If you have an IsA hierarchy, do association rules on it (you may not be able to say
that pampers ==> beer, but if you know pampers IsA diaper, and luvs IsA diaper, maybe
you’ll find that diapers ==> beer.)
 Parallel versions of this stuff.
 Quantitative association rules: "10% of married people between age 50 and 60 have
at least 2 cars."
 Online association rules: you should be able to change support and confidence
thresholds on the fly, and be able to incrementally change answers as new transactions
stream in (Christian Hidber, ICSI/UCB).
