Data Mining
Emerging hot area, linking DBMS with AI and stats.
Vision: O Great Database! Tell me what I want to know.
Practice (c. 1998): a small collection of statistical "queries" (with wide open
parameters)
- Clustering/Segmentation: group together similar items (separating dissimilar items)
- Predictive Modeling: predict value of some attributes based on values of others (and
previous "training")
- e.g. classification: assign items to pre-specified categories
- Dependency Modeling: model joint probability densities (relationships among columns)
- another example: Association Rules (e.g. find items that get purchased together)
- Data Summarization: find compact descriptions of subsets of data
- Change and Deviation Analysis: find significant changes in sequences of data (e.g. stock
data). Sequence-oriented.
Applications of Mining:
- Marketing
- Fraud detection
- Telecomm
- Science (astronomy, remote sensing, protein sequencing)
- NBA stats (IBM Advanced Scout)
DBMS spin: a taxonomy of queries:
Association Rules
Why look at this example of data mining?
- It's statistically "lite" (no background required)
- It has a relational query processing flavor
- Probably for both of these reasons, there have been a ton of follow-on papers (arguably
too many...)
"Basket data"
Association Rule: X ==> Y (X and Y are disjoint sets of items)
- confidence c: c% of transactions that contain X also contain Y (rule-specific)
- support s: s% of all transactions contain both X and Y (relative to all data)
Problem: Efficiently find all rules with support > minsup, confidence > minconf
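The two measures can be made concrete with a minimal sketch. It borrows the four transactions from the worked example later in these notes; the function names are illustrative, and both measures are returned as fractions rather than percentages. It assumes X occurs in at least one transaction, since confidence is otherwise undefined.

```python
# Basket data from the worked example in these notes (TIDs 100, 200, 300, 400).
transactions = [
    {1, 3, 4},
    {2, 3, 5},
    {1, 2, 3, 5},
    {2, 5},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions that contain X, the fraction that also contain Y."""
    x, y = set(x), set(y)
    return support(x | y, transactions) / support(x, transactions)

# Rule {2} ==> {5}: support is measured against all data, confidence only
# against the transactions that contain the left-hand side.
print(support({2, 5}, transactions))       # 0.75
print(confidence({2}, {5}, transactions))  # 1.0
```

Note that the support of a rule X ==> Y is the support of the itemset X ∪ Y, which is why the problem reduces to finding large itemsets first.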
Anecdotes:
- diapers & beer
- Wendy's & McDonald's (pb & j)
One algorithm from the paper: Apriori
L1 = {large 1-itemsets};
for (k = 2; Lk-1 != ∅; k++) {
    Ck = apriori-gen(Lk-1);   // candidate k-itemsets
    forall transactions t in the database {
        Ct = subset(Ck, t);   // candidates from Ck contained in t
        forall candidates c in Ct
            c.count++;
    }
    Lk = {c in Ck | c.count >= minsup}
}
Answer = ∪k Lk
apriori-gen(Lk-1) {
    // Intuition: every subset of a large itemset must be large.
    // So combine almost-matching pairs of large (k-1)-itemsets,
    // and prune out those with non-large (k-1)-subsets.
    join:
        insert into Ck
        select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
        from Lk-1 p, Lk-1 q
        where p.item1 = q.item1 and ... and p.itemk-2 = q.itemk-2
          and p.itemk-1 < q.itemk-1;
    prune:
        // delete itemsets such that some (k-1)-subset is not in Lk-1
        forall itemsets c in Ck
            forall (k-1)-subsets s of c
                if (s not in Lk-1) delete c from Ck;
}
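A minimal Python sketch of the join and prune steps, assuming each itemset is stored as a sorted tuple so that "first k-2 items" is simply a tuple prefix. The name apriori_gen and the tuple representation are choices made here for illustration.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Build candidate k-itemsets from the large (k-1)-itemsets.

    `large_prev` is a collection of sorted tuples, all of length k-1.
    """
    large_prev = set(large_prev)
    candidates = set()
    # Join: pair up itemsets that agree on their first k-2 items and
    # differ in the last one (p.last < q.last avoids duplicates).
    for p in large_prev:
        for q in large_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Prune: drop any candidate with a (k-1)-subset that is not large.
    return {
        c for c in candidates
        if all(s in large_prev for s in combinations(c, len(c) - 1))
    }

# Large 2-itemsets from the example in these notes.
L2 = {(1, 3), (2, 3), (2, 5), (3, 5)}
print(apriori_gen(L2))   # {(2, 3, 5)}
```

The quadratic double loop mirrors the self-join in the pseudocode; a real implementation would more likely group the itemsets by prefix first.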
Example from paper (minsup = 2)
Database:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

L1:
Itemset | Support
{1}     | 2
{2}     | 3
{3}     | 3
{5}     | 3

C2:
Itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

L2:
Itemset | Support
{1 3}   | 2
{2 3}   | 2
{2 5}   | 3
{3 5}   | 2

L3:
Itemset | Support
{2 3 5} | 2
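The example can be reproduced with a short, self-contained sketch of the whole loop. It makes several assumptions: itemsets are sorted tuples, minsup is an absolute count as in the example, candidate generation is inlined so the block runs on its own, and counting is done by a naive containment test rather than with the hash tree.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset (sorted tuple): count} for every large itemset."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup}
    answer = dict(large)
    k = 2
    while large:
        prev = set(large)
        # Join on a shared (k-2)-prefix, then prune by (k-1)-subsets.
        candidates = {
            p + (q[-1],)
            for p in prev for q in prev
            if p[:-1] == q[:-1] and p[-1] < q[-1]
        }
        candidates = {
            c for c in candidates
            if all(s in prev for s in combinations(c, k - 1))
        }
        # Naive counting: test every candidate against every transaction.
        counts = {
            c: sum(1 for t in transactions if t.issuperset(c))
            for c in candidates
        }
        large = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(large)
        k += 1
    return answer

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(db, minsup=2)
for itemset in sorted(result, key=lambda c: (len(c), c)):
    print(itemset, result[itemset])
```

The printed itemsets and counts should match the L1, L2 and L3 tables above. Item 4 never becomes large, so no candidate containing it is ever generated.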
Efficiently Implementing subset: The hash tree.
- Leaves contain a list of itemsets
- Internal nodes contain a hash table, with each bucket pointing to a child.
- Root is at depth 1.
- When inserting into interior node at depth d, we choose child by applying hash function
to dth item in itemset.
- To check all candidate subsets of transaction t:
- If at leaf, find which itemsets there are in t (i.e. naïve check)
- If we've reached an internal node by hashing on item i, hash on each item after i
in turn, and continue recursively for each.
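The rules above can be sketched in Python as follows. Everything not stated in the notes is an assumption made for illustration: the class name, the modulo hash, the fanout, and the policy of splitting a leaf once it exceeds leaf_capacity itemsets (leaves below depth k cannot split further). Itemsets and transactions are kept sorted, which is what makes "each item after i" well defined.

```python
class HashTree:
    """Hash tree over sorted k-itemsets (tuples). The root is at depth 1."""

    def __init__(self, k, fanout=3, leaf_capacity=2):
        self.k = k
        self.fanout = fanout                # buckets per interior node (assumed)
        self.leaf_capacity = leaf_capacity  # split threshold (assumed)
        self.root = []                      # leaf = list, interior node = dict

    def _bucket(self, item):
        return hash(item) % self.fanout

    def insert(self, itemset):
        self.root = self._insert(self.root, itemset, 1)

    def _insert(self, node, itemset, depth):
        if isinstance(node, dict):
            # Interior node at depth d: hash on the d-th item of the itemset.
            b = self._bucket(itemset[depth - 1])
            node[b] = self._insert(node.get(b, []), itemset, depth + 1)
            return node
        node.append(itemset)
        # Split an overfull leaf unless all k items have been hashed already.
        if len(node) > self.leaf_capacity and depth <= self.k:
            interior = {}
            for s in node:
                self._insert(interior, s, depth)
            return interior
        return node

    def subset(self, transaction):
        """Return the stored itemsets that are contained in `transaction`."""
        t = tuple(sorted(set(transaction)))
        found = set()
        self._visit(self.root, t, 0, frozenset(t), found)
        return found

    def _visit(self, node, t, start, t_set, found):
        if isinstance(node, dict):
            # Hash on each remaining item in turn and recurse on its bucket.
            for i in range(start, len(t)):
                child = node.get(self._bucket(t[i]))
                if child is not None:
                    self._visit(child, t, i + 1, t_set, found)
        else:
            # Leaf: naive containment check on each stored itemset.
            found.update(s for s in node if t_set.issuperset(s))

# Candidate 2-itemsets from the example in these notes.
candidates = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
tree = HashTree(k=2)
for c in candidates:
    tree.insert(c)
print(sorted(tree.subset({1, 3, 4})))   # candidates contained in transaction 100
```

A leaf can be reached along more than one path, so results are collected in a set to avoid counting a candidate twice for the same transaction.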
Variation: AprioriTid generates a new "database" at each step k, with entries
of the form <TID, all candidate k-itemsets contained in that xact>, and scans that
rather than the original database next time around. Benefit: database gets fewer rows
at each stage. Cost: databases generated may be big.
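One pass of this variation can be sketched as below, assuming sorted-tuple itemsets and a generated database held as a list of (TID, set of itemsets) pairs. The function name is illustrative. The membership test relies on the fact that a candidate k-itemset is contained in a transaction exactly when the two (k-1)-itemsets that were joined to produce it are both recorded for that transaction.

```python
def aprioritid_pass(prev_db, candidates):
    """One AprioriTid pass over the generated database.

    prev_db:    list of (tid, set of (k-1)-itemsets present in that xact)
    candidates: iterable of sorted k-tuples, e.g. the output of apriori-gen
    Returns (counts, next_db), where next_db has the same shape for size k.
    """
    counts = {c: 0 for c in candidates}
    next_db = []
    for tid, present in prev_db:
        # c minus its last item, and c minus its second-to-last item.
        in_t = {
            c for c in counts
            if c[:-1] in present and c[:-2] + c[-1:] in present
        }
        for c in in_t:
            counts[c] += 1
        if in_t:   # transactions that contain no candidate drop out
            next_db.append((tid, in_t))
    return counts, next_db

# Step 1 "database": the example transactions, each item as a 1-itemset.
db1 = [
    (100, {(1,), (3,), (4,)}),
    (200, {(2,), (3,), (5,)}),
    (300, {(1,), (2,), (3,), (5,)}),
    (400, {(2,), (5,)}),
]
C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
counts2, db2 = aprioritid_pass(db1, C2)
counts3, db3 = aprioritid_pass(db2, [(2, 3, 5)])
print(len(db1), len(db2), len(db3))   # rows remaining at each step
```

On this data the third generated database keeps only TIDs 200 and 300, which illustrates the benefit. The cost shows up in the second one, where TID 300 carries six 2-itemsets instead of its original four items.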
Hybrid: AprioriHybrid: run Apriori for a while, then switch to AprioriTid when the
generated DB would fit in mem.
Analysis shows that AprioriHybrid is fastest, beating old algorithms and matching
Apriori and AprioriTid when either one wins.
Later Work:
- If you have an IsA hierarchy, do association rules on it (you may not be able to say
that pampers -> beer, but if you know pampers IsA diaper, and luvs IsA diaper, maybe
you'll find that diapers->beer.)
- Parallel versions of this stuff.
- Quantitative association rules: "10% of married people between age 50 and 60 have
at least 2 cars."
- Online association rules: you should be able to change support and confidence
thresholds on the fly, and be able to incrementally change answers as new transactions
stream in (Christian Hidber, ICSI/UCB).
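For the IsA item above, one simple approach (a sketch, not necessarily what the follow-on papers do in detail) is to extend every transaction with the ancestors of its items and then mine as usual. The taxonomy, the baskets, and the function names below are hypothetical, and the taxonomy is assumed to be acyclic.

```python
def extend_with_ancestors(transaction, parent):
    """Add every ancestor of every item, following `parent` links upward."""
    extended = set(transaction)
    for item in transaction:
        while item in parent:   # assumes the IsA hierarchy has no cycles
            item = parent[item]
            extended.add(item)
    return extended

def count(itemset, db):
    """Number of transactions in `db` that contain every item in `itemset`."""
    return sum(1 for t in db if set(itemset) <= t)

# Hypothetical taxonomy and baskets, made up for illustration.
parent = {"pampers": "diapers", "luvs": "diapers"}
baskets = [
    {"pampers", "beer"},
    {"luvs", "beer"},
    {"luvs", "milk"},
    {"beer", "chips"},
]
extended = [extend_with_ancestors(b, parent) for b in baskets]

print(count({"pampers", "beer"}, extended))   # 1
print(count({"luvs", "beer"}, extended))      # 1
print(count({"diapers", "beer"}, extended))   # 2
```

With a minimum support count of 2, neither brand-level itemset is large, but the generalized itemset {diapers, beer} is.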