Readings in Database Systems, 3rd Edition

Stonebraker & Hellerstein, eds.


Data Mining

Emerging hot area, linking DBMS with AI and statistics.
Vision: O Great Database!  Tell me what I want to know.
Practice (c. 1998): a small collection of statistical "queries" (with wide open parameters)

  1. Clustering/Segmentation: group together similar items (separating dissimilar items)
  2. Predictive Modeling: predict value of some attributes based on values of others (and previous "training")
    • e.g. classification: assign items to pre-specified categories
  3. Dependency Modeling: model joint probability densities (relationships among columns)
    • another example: Association Rules (e.g. find items that get purchased together)
  4. Data Summarization: find compact descriptions of subsets of data
  5. Change and Deviation Analysis: find significant changes in sequences of data (e.g. stock data).  Sequence-oriented.

Applications of Mining:

  • Marketing
  • Fraud detection
  • Telecomm
  • Science (astronomy, remote sensing, protein sequencing)
  • NBA stats (IBM Advanced Scout)

DBMS spin: a taxonomy of queries:

    • find anomalies/points (traditional queries?)
    • find common stuff (trend analysis/aggregation/OLAP)
    • find pretty common stuff (mining)

    Is there some insight here?

Association Rules

Why look at this example of data mining?

  • It's statistically "lite" (no background required)
  • It has a relational query processing flavor
  • Probably for both of these reasons, there have been a ton of follow-on papers (arguably too many...)

 "Basket data"

Association Rule: X ==> Y (X and Y are disjoint sets of items)

    • confidence c: c% of transactions that contain X also contain Y (rule-specific)
    • support s: s% of all transactions contain both X and Y (relative to all data)
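The two measures can be made concrete with a minimal Python sketch. It assumes transactions are plain Python sets and reuses the four baskets from the paper's example further down in these notes; the function names `support` and `confidence` are illustrative, not taken from the paper. Both are returned as fractions rather than percentages.

```python
# The four transactions from the paper's example (TIDs 100-400).
transactions = [
    {1, 3, 4},
    {2, 3, 5},
    {1, 2, 3, 5},
    {2, 5},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction that also contain Y.

    Assumes X occurs in at least one transaction (otherwise this divides by zero).
    """
    x, y = set(x), set(y)
    return support(x | y, transactions) / support(x, transactions)

s = support({2, 5}, transactions)       # 3 of the 4 transactions contain both
c = confidence({2}, {5}, transactions)  # every transaction with 2 also has 5
```

Note that support is symmetric in X and Y while confidence is not: here {2} ==> {5} and {3} ==> {5} share item 5 on the right-hand side but have different confidence.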

Problem: Efficiently find all rules with support > minsup, confidence > minconf

Anecdotes:

    • diapers & beer
    • Wendy’s & McDonalds (pb & j)

 
One algorithm from the paper: Apriori
 

    L1 = {large 1-itemsets}

    for (k = 2; Lk-1 != ∅; k++) {
      Ck = apriori-gen(Lk-1); // candidate k-itemsets
      forall transactions t in the database {
        Ct = subset(Ck, t); // candidates from Ck contained in t
        forall candidates c in Ct
          c.count++;
      }
      Lk = {c in Ck | c.count >= minsup}
    }

    Answer = ∪k Lk

apriori-gen(Lk-1) {

    // Intuition: every subset of a large itemset must be large.
    // So combine almost-matching pairs of large (k-1)-itemsets,
    // and prune out those with non-large (k-1)-subsets.

join:

    insert into Ck
    select p.item1, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1;

prune:

    // delete itemsets such that some (k-1)-subset is not in Lk-1
    forall itemsets c in Ck
      forall (k-1)-subsets s of c
        if (s not in Lk-1) {
          delete c from Ck;
          break;
        }

}
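The pseudocode above can be sketched in runnable Python roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: itemsets are represented as sorted tuples, transactions as sets, minsup is an absolute count (as in the example below), and the subset step is a naive scan over all candidates rather than a hash tree.

```python
from itertools import combinations

def apriori_gen(prev_large):
    """Build candidate k-itemsets from the set of large (k-1)-itemsets."""
    prev = sorted(prev_large)
    candidates = set()
    # Join: pair itemsets that agree on everything but their last item.
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:  # p[-1] < q[-1] holds because prev is sorted
                candidates.add(p + (q[-1],))
    # Prune: drop candidates that have a (k-1)-subset which is not large.
    size = len(prev[0]) if prev else 0
    return {c for c in candidates
            if all(s in prev_large for s in combinations(c, size))}

def apriori(transactions, minsup):
    """Return {itemset: count} for every itemset with count >= minsup."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup}
    answer = dict(large)
    while large:
        candidates = apriori_gen(set(large))
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:  # naive stand-in for subset(Ck, t)
                if set(c) <= t:
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(large)
    return answer

baskets = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(baskets, minsup=2)
```

Sorting the (k-1)-itemsets lets the join compare each pair only once, which plays the role of the `p.itemk-1 < q.itemk-1` condition in the SQL formulation.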

Example from paper (minsup = 2 transactions)

Database:

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

L1:

Itemset  Support
{1}      2
{2}      3
{3}      3
{5}      3

C2:

Itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

L2:

Itemset  Support
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3:

Itemset
{2 3 5}

L3:

Itemset  Support
{2 3 5}  2

Efficiently Implementing subset: The hash tree.

  • Leaves contain a list of itemsets
  • Internal nodes contain a hash table, with each bucket pointing to a child.
  • Root is at depth 1.
  • When inserting into interior node at depth d, we choose child by applying hash function to dth item in itemset.
  • To check all candidate subsets of transaction t, start at the root and hash on every item in t. Then:
    • If at a leaf, find which of its itemsets are contained in t (i.e. naive check)
    • If we’ve reached an internal node by hashing on item i, hash on each item after i in turn, and continue recursively for each.
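The hash tree described in the bullets above could look roughly like the following Python sketch. Several details are assumptions made for illustration: the class name `HashTree`, the `fanout` and `leaf_capacity` parameters, the use of Python's built-in `hash` modulo the fanout, and the policy of splitting a leaf once it exceeds its capacity. Itemsets and transactions are sorted tuples.

```python
class HashTree:
    """Stores candidate k-itemsets; the root is at depth 1."""

    def __init__(self, k, fanout=3, leaf_capacity=2, depth=1):
        self.k, self.fanout, self.leaf_capacity = k, fanout, leaf_capacity
        self.depth = depth
        self.itemsets = []     # contents while this node is a leaf
        self.children = None   # bucket -> child once this node is interior

    def _bucket(self, item):
        return hash(item) % self.fanout

    def insert(self, itemset):
        if self.children is None:
            self.itemsets.append(itemset)
            # Split an overfull leaf, unless there is no d-th item left to hash on.
            if len(self.itemsets) > self.leaf_capacity and self.depth <= self.k:
                pending, self.itemsets, self.children = self.itemsets, [], {}
                for s in pending:
                    self.insert(s)
            return
        # Interior node at depth d: choose the child by hashing the d-th item.
        b = self._bucket(itemset[self.depth - 1])
        if b not in self.children:
            self.children[b] = HashTree(self.k, self.fanout,
                                        self.leaf_capacity, self.depth + 1)
        self.children[b].insert(itemset)

    def subset(self, t, start=0, found=None):
        """Return the stored itemsets contained in sorted transaction t."""
        if found is None:
            found = set()
        if self.children is None:
            # Leaf: naive containment check against the whole transaction.
            items = set(t)
            found.update(s for s in self.itemsets if items.issuperset(s))
            return found
        # Interior: hash on each remaining item, then recurse on the items after it.
        for i in range(start, len(t)):
            child = self.children.get(self._bucket(t[i]))
            if child is not None:
                child.subset(t, i + 1, found)
        return found

tree = HashTree(k=2)
for c in [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]:
    tree.insert(c)
matches = tree.subset((1, 3, 4))
```

Results are collected in a set because, with hash collisions, the same leaf can be reached along more than one path through the tree.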

 

Variation: AprioriTid generates a new "database" at each step k, with entries of the form <TID, all candidate k-itemsets contained in that transaction>, and scans that rather than the original database next time around. Benefit: the database gets fewer rows at each stage, since transactions that contain no candidates drop out. Cost: the generated databases may be big, especially in the early passes.
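A rough Python sketch of the AprioriTid idea follows, under the same assumptions as a plain Apriori sketch would make (itemsets as sorted tuples, minsup as an absolute count). The function names and the dict-of-sets representation of the generated "database" are illustrative choices, not the paper's data structures. The key point is that, after the first pass, containment is decided from the previous generated database rather than from the raw transactions.

```python
from itertools import combinations

def apriori_gen(prev_large):
    """Join and prune step, as in Apriori; itemsets are sorted tuples."""
    prev = sorted(prev_large)
    joined = {p + (q[-1],) for i, p in enumerate(prev)
              for q in prev[i + 1:] if p[:-1] == q[:-1]}
    size = len(prev[0]) if prev else 0
    return {c for c in joined
            if all(s in prev_large for s in combinations(c, size))}

def apriori_tid(transactions, minsup):
    """transactions: dict TID -> set of items. Returns {itemset: count}."""
    # Generated "database" for k = 1: each TID with its 1-itemsets.
    db = {tid: {(i,) for i in items} for tid, items in transactions.items()}
    counts = {}
    for entries in db.values():
        for c in entries:
            counts[c] = counts.get(c, 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup}
    answer = dict(large)
    while large:
        candidates = apriori_gen(set(large))
        counts = dict.fromkeys(candidates, 0)
        next_db = {}
        for tid, entries in db.items():
            # c is in the transaction iff both (k-1)-itemsets that were
            # joined to form c appear in this TID's entry from the last pass.
            present = {c for c in candidates
                       if c[:-1] in entries and c[:-2] + c[-1:] in entries}
            for c in present:
                counts[c] += 1
            if present:  # TIDs with no candidates drop out of the database
                next_db[tid] = present
        db = next_db
        large = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(large)
    return answer

result = apriori_tid({100: {1, 3, 4}, 200: {2, 3, 5},
                      300: {1, 2, 3, 5}, 400: {2, 5}}, minsup=2)
```

On the paper's four-transaction example, the generated database for k = 3 has only two rows (TIDs 200 and 300), while for k = 2 the row for TID 300 holds six itemsets, more than the four items in the original transaction.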

 

Hybrid: AprioriHybrid: run Apriori for the early passes, then switch to AprioriTid once the generated database would fit in memory.

 

Analysis shows that AprioriHybrid is fastest, beating the older algorithms and matching whichever of Apriori and AprioriTid wins in a given setting.

 

Later Work:

  • If you have an IsA hierarchy, do association rules on it (you may not be able to say that pampers -> beer, but if you know pampers IsA diaper, and luvs IsA diaper, maybe you’ll find that diapers->beer.)
  • Parallel versions of this stuff.
  • Quantitative association rules: "10% of married people between age 50 and 60 have at least 2 cars."
  • Online association rules: you should be able to change support and confidence thresholds on the fly, and be able to incrementally change answers as new transactions stream in (Christian Hidber, ICSI/UCB).
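To illustrate the IsA bullet above, here is one simple way to fold a hierarchy into the mining step: add each item's ancestors to its transaction before counting. This is a sketch of the idea only, not necessarily how the follow-on papers implement it. The `parent` map, the baskets, and the helper names are all made up for the example, and the hierarchy is assumed to have no cycles.

```python
# Hypothetical IsA hierarchy: child -> parent.
parent = {"pampers": "diapers", "luvs": "diapers"}

def with_ancestors(transaction, parent):
    """Return the transaction plus every ancestor of each of its items."""
    extended = set(transaction)
    for item in transaction:
        while item in parent:  # walk up the hierarchy to the root
            item = parent[item]
            extended.add(item)
    return extended

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

# Made-up basket data.
baskets = [
    {"pampers", "beer"},
    {"luvs", "beer"},
    {"pampers", "milk"},
    {"luvs", "beer", "chips"},
]
extended = [with_ancestors(t, parent) for t in baskets]

brand_counts = (count({"pampers", "beer"}, extended),
                count({"luvs", "beer"}, extended))
category_count = count({"diapers", "beer"}, extended)
```

With a minimum support of 3 transactions, neither brand-level pair qualifies on this data, but the category-level pair {diapers, beer} does.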
 

1998, Joseph M. Hellerstein.  Last modified 08/18/98.