Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 3 —
1
Chapter 3: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
◼ Basic Concepts
◼ Frequent Itemset Mining Methods
◼ Which Patterns Are Interesting?—Pattern
Evaluation Methods
◼ Summary
2
What Is Frequent Pattern Analysis?
◼ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
◼ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
◼ Motivation: Finding inherent regularities in data
◼ What products were often purchased together?— Beer and diapers?!
◼ What are the subsequent purchases after buying a PC?
◼ What kinds of DNA are sensitive to this new drug?
◼ Can we automatically classify web documents?
◼ Applications
◼ Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
3
Why Is Freq. Pattern Mining Important?
◼ Freq. pattern: An intrinsic and important property of
datasets
◼ Foundation for many essential data mining tasks
◼ Association, correlation, and causality analysis
◼ Sequential, structural (e.g., sub-graph) patterns
◼ Pattern analysis in spatiotemporal, multimedia, time-
series, and stream data
◼ Classification: discriminative, frequent pattern analysis
◼ Cluster analysis: frequent pattern-based clustering
◼ Data warehousing: iceberg cube and cube-gradient
◼ Semantic data compression: fascicles
◼ Broad applications
4
Basic Concepts: Frequent Patterns
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

(Venn diagram omitted: customers who buy beer, customers who buy diapers, and the customers who buy both)

◼ itemset: a set of one or more items
◼ k-itemset X = {x1, …, xk}
◼ (absolute) support, or support count, of X: frequency or number of occurrences of itemset X
◼ (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
◼ An itemset X is frequent if X’s support is no less than a minsup threshold
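◼ A minimal Python sketch (illustrative, not from the slides) that computes the absolute and relative support of an itemset over the table above:

# Example transactions from the table above
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset, transactions):
    """Return (absolute support count, relative support) of an itemset."""
    count = sum(1 for t in transactions if set(itemset) <= t)
    return count, count / len(transactions)

print(support({"Beer", "Diaper"}, transactions))  # (3, 0.6): frequent at minsup = 50%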
5
Basic Concepts: Association Rules
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

◼ Find all the rules X → Y with minimum support and confidence
  ◼ support, s: probability that a transaction contains X ∪ Y
  ◼ confidence, c: conditional probability that a transaction having X also contains Y
◼ Let minsup = 50%, minconf = 50%
  ◼ Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
  ◼ Association rules: (many more!)
    ◼ Beer → Diaper (60%, 100%)
    ◼ Diaper → Beer (60%, 75%)
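◼ Reusing the transactions list and support() sketch from the previous slide, rule support and confidence can be computed directly (again, only an illustration):

def rule_stats(X, Y):
    """Return (support, confidence) of the rule X → Y."""
    count_xy, s = support(X | Y, transactions)   # s = P(X ∪ Y)
    count_x, _ = support(X, transactions)
    return s, count_xy / count_x                 # confidence = P(Y | X)

print(rule_stats({"Beer"}, {"Diaper"}))   # (0.6, 1.0)  → Beer → Diaper (60%, 100%)
print(rule_stats({"Diaper"}, {"Beer"}))   # (0.6, 0.75) → Diaper → Beer (60%, 75%)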
6
Closed Patterns and Max-Patterns
◼ A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
◼ Solution: Mine closed patterns and max-patterns instead
◼ An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99)
◼ An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
◼ Closed pattern is a lossless compression of freq. patterns
◼ Reducing the # of patterns and rules
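◼ A small illustration (not from the text): given a mined table of frequent itemsets and their support counts, closedness and maximality follow directly from the definitions above:

def closed_and_max(freq):
    """freq: dict mapping frozenset → support count of each frequent itemset."""
    closed, maximal = [], []
    for X, sup in freq.items():
        supersets = [Y for Y in freq if X < Y]          # proper frequent supersets of X
        if not any(freq[Y] == sup for Y in supersets):
            closed.append(X)                            # no superset with the same support
        if not supersets:
            maximal.append(X)                           # no frequent superset at all
    return closed, maximal

freq = {frozenset("AB"): 3, frozenset("A"): 3, frozenset("B"): 4}
print(closed_and_max(freq))   # {A,B} and {B} are closed; only {A,B} is maximal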
7
Closed Patterns and Max-Patterns
◼ Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
◼ Min_sup = 1.
◼ What is the set of closed itemsets?
  ◼ <a1, …, a100>: 1
  ◼ <a1, …, a50>: 2
◼ What is the set of max-patterns?
  ◼ <a1, …, a100>: 1
◼ What is the set of all frequent patterns?
  ◼ !! (every non-empty subset of {a1, …, a100}, i.e., 2^100 − 1 patterns, far too many to enumerate)
8
Chapter 3: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
◼ Basic Concepts
◼ Frequent Itemset Mining Methods
◼ Which Patterns Are Interesting?—Pattern
Evaluation Methods
◼ Summary
9
Scalable Frequent Itemset Mining Methods
◼ Apriori: A Candidate Generation-and-Test
Approach
◼ Improving the Efficiency of Apriori
◼ FPGrowth: A Frequent Pattern-Growth Approach
◼ ECLAT: Frequent Pattern Mining with Vertical
Data Format
10
The Downward Closure Property and Scalable
Mining Methods
◼ The downward closure property of frequent patterns
◼ Any subset of a frequent itemset must be frequent
◼ If {beer, diaper, nuts} is frequent, so is {beer,
diaper}
◼ i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
◼ Scalable mining methods: Three major approaches
◼ Apriori (Agrawal & Srikant@VLDB’94)
◼ Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00)
◼ Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
11
Apriori: A Candidate Generation & Test Approach
◼ Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
◼ Method:
◼ Initially, scan DB once to get the frequent 1-itemsets
◼ Generate length (k+1) candidate itemsets from length k
frequent itemsets
◼ Test the candidates against DB
◼ Terminate when no frequent or candidate set can be
generated
12
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2 with counts: {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
L2: {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2

C3 (generated from L2): {B, C, E}
3rd scan → L3: {B, C, E}:2
13
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that
are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
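◼ A minimal Python sketch of the same loop (illustrative, not the textbook's code), including the downward-closure pruning of candidates:

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_sup}
    frequent, k = set(Lk), 1
    while Lk:
        # generate C(k+1) by joining Lk with itself, then prune any candidate
        # that has an infrequent k-subset (downward closure property)
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # scan the database to count the surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in Ck1}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent |= Lk
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_sup=2))   # includes {B, C, E}, matching L1–L3 of the earlier example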
14
Exercise: Transaction Data
Min support count: 2
(Exercise figures omitted)
15
16
Cont…
◼ Let S = {I1, I2, I5}
◼ Non-empty proper subsets are: {I1}, {I2}, {I5}, {I1, I2}, {I1, I5}, {I2, I5}
◼ Association rules: one candidate rule A → S − A for each non-empty proper subset A (rule listing omitted; see the sketch below)
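◼ A hedged sketch of this rule-generation step (the support table is a parameter here; the actual counts come from the mined data, which is not reproduced on these slides):

from itertools import combinations

def rules_from_itemset(S, support, min_conf):
    """support: dict mapping frozenset → support count from the mined frequent-itemset table."""
    S = frozenset(S)
    rules = []
    for r in range(1, len(S)):                        # every non-empty proper subset A of S
        for A in map(frozenset, combinations(S, r)):
            conf = support[S] / support[A]            # conf(A → S − A) = sup(S) / sup(A)
            if conf >= min_conf:
                rules.append((set(A), set(S - A), conf))
    return rules

# usage: rules_from_itemset({"I1", "I2", "I5"}, support_counts, min_conf=0.7)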
17
Challenges of Frequent Pattern Mining
◼ Challenges
  ◼ Huge number of candidates
  ◼ Huge data size
  ◼ Multiple scans of the transaction database
◼ Improving Apriori: general ideas
  ◼ Shrink the number of candidates
  ◼ Transaction reduction
  ◼ Reduce the number of passes over the transaction database
18
Bottlenecks with Apriori
◼ Uses a generate-and-test approach: generates candidate itemsets and tests if they are frequent
◼ Generation of candidate itemsets is expensive (in both space and time)
◼ Support counting is expensive
  ◼ Subset checking (computationally expensive)
  ◼ Multiple database scans (I/O)
19
Speeding up Apriori Algorithm
❖ Dynamic Hashing and Pruning
❖ Transaction Reduction
20
DHP: Reduce the Number of Candidates
◼ Hashing itemsets into corresponding buckets
◼ Can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1 – especially 2-itemsets
◼ While scanning the DB to determine L1, generate the candidate 2-itemsets of each transaction t ∈ T, hash them into the buckets of a hash table, and increase the bucket counts
◼ If an itemset’s bucket count is < min_sup, remove it from C2
21
DHP: Example
TID   List of item IDs
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

Hash function: h(x, y) = ((order of x)×10 + (order of y)) mod 7

Bucket address   Bucket count   Bucket contents
0                2              {I1, I4}, {I3, I5}
1                2              {I1, I5}, {I1, I5}
2                4              {I2, I3}, {I2, I3}, {I2, I3}, {I2, I3}
3                2              {I2, I4}, {I2, I4}
4                2              {I2, I5}, {I2, I5}
5                4              {I1, I2}, {I1, I2}, {I1, I2}, {I1, I2}
6                4              {I1, I3}, {I1, I3}, {I1, I3}, {I1, I3}
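◼ A small Python sketch (variable names are mine) that reproduces the bucket counts above using the slide's hash function:

from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]

def order(item):
    return int(item[1:])                       # "I3" → 3

def h(x, y):
    return (order(x) * 10 + order(y)) % 7      # hash function from the slide

bucket_count = [0] * 7
for t in transactions:
    for x, y in combinations(sorted(t, key=order), 2):   # 2-itemsets of t
        bucket_count[h(x, y)] += 1

print(bucket_count)   # [2, 2, 4, 2, 2, 4, 4], matching the bucket counts above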
22
How to Trim Candidate Itemsets
◼ In the k-th iteration, hash all “appearing” (k+1)-itemsets into a hash table and count the occurrences of each itemset in the corresponding bucket.
◼ In the (k+1)-th iteration, examine each candidate itemset to see whether its corresponding bucket count is at least the min. support (a necessary condition); see the sketch below.
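◼ Continuing the previous sketch (hypothetical helper, not from the slides), trimming keeps only the candidate 2-itemsets whose bucket count reaches min_sup; the survivors' exact supports must still be counted in the next scan:

def trim_c2(candidates, bucket_count, min_sup):
    """candidates: iterable of (x, y) pairs with x before y in item order."""
    return [(x, y) for (x, y) in candidates if bucket_count[h(x, y)] >= min_sup]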
23
Transaction Reduction
◼ Reduce the number of transactions considered in future iterations
◼ A transaction t that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset
◼ Delete t, or mark it so it is skipped in further scans (see the sketch below)
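◼ A minimal sketch of this filter (illustrative; frequent_k is assumed to hold the frequent k-itemsets as frozensets):

def reduce_transactions(transactions, frequent_k):
    # keep only transactions containing at least one frequent k-itemset;
    # the rest cannot contribute to any frequent (k+1)-itemset
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_k)]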
24
Frequent Pattern Growth (FP-Growth) Algorithm
◼ Allows frequent itemset discovery without candidate itemset
generation.
◼ Two step approach:
◼ Step 1: Build a compact data structure called the FP-tree
◼ Built using 2 passes over the data set.
◼ Step 2: Extract frequent itemsets directly from the FP-tree
◼ Traversal through FP-Tree
25
Step-1: FP-Tree Construction
◼ FP-Tree is constructed using 2 passes over the data set:
◼ Pass 1:
◼ Scan data and find support for each item.
◼ Discard infrequent items.
◼ Sort frequent items in decreasing order based on their support.
◼ For our example: a, b, c, d, e
◼ Use this order when building the FP-Tree, so common prefixes can be shared.
26
FP-Tree Construction (cont…)
◼ Pass 2: FP-Tree construction
◼ Read transaction 1: {a, b}
◼ Create 2 nodes a and b and the path null→a→b. Set counts of
a and b to 1.
◼ Read transaction 2: {b, c, d}
◼ Create 3 nodes for b, c and d and the path null→b→c→d. Set
counts to 1.
◼ Note that although transactions 1 and 2 share b, the paths are disjoint because they do not share a common prefix. Add the link between the two b nodes.
◼ Read transaction 3: {a, c, d, e}
◼ It shares the common prefix item a with transaction 1, so the paths for transactions 1 and 3 overlap and the frequency count of node a is incremented by 1. Add links between the c nodes and between the d nodes.
◼ Continue until all transactions are mapped to a path in the FP-tree.
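◼ A compact sketch of the two passes described above (class and field names are mine, not the book's):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                      # item → child FPNode

def build_fp_tree(transactions, min_sup):
    # Pass 1: count item supports and keep only the frequent items
    support = defaultdict(int)
    for t in transactions:
        for i in t:
            support[i] += 1
    support = {i: s for i, s in support.items() if s >= min_sup}

    root = FPNode(None, None)
    node_links = defaultdict(list)              # header table: item → nodes holding it
    # Pass 2: insert each transaction's frequent items in decreasing-support order
    for t in transactions:
        path = sorted((i for i in t if i in support),
                      key=lambda i: (-support[i], i))
        node = root
        for i in path:
            if i not in node.children:
                node.children[i] = FPNode(i, node)
                node_links[i].append(node.children[i])   # link nodes carrying the same item
            node = node.children[i]
            node.count += 1                     # shared prefixes just increment counts
    return root, node_links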
27
Example: FP-Tree Construction
min_sup = 3
(FP-tree construction figure omitted)
28
Exercise VIP
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E
29