Data Mining: Concepts and Techniques (3rd ed.)
— Chapter 5 —
Jiawei Han, Micheline Kamber, and Jian Pei
Chapter 5: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Basic Concepts
Frequent Itemset Mining Methods
Which Patterns Are Interesting?—Pattern
Evaluation Methods
Summary
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
Freq. pattern: An intrinsic and important property of
datasets
Foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia,
time-series, and stream data
Classification: discriminative, frequent pattern
analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
Basic Concepts: Frequent Patterns

Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

itemset: A set of one or more items
k-itemset: X = {x1, …, xk}
(absolute) support, or support count, of X: the frequency, or number of occurrences, of itemset X
(relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a minsup threshold
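As an illustration (a minimal sketch, not from the slides), support can be computed directly from the table above:

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset, db):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

X = {"Beer", "Diaper"}
abs_sup = support(X, transactions)
rel_sup = abs_sup / len(transactions)
print(abs_sup, rel_sup)   # 3 and 0.6
# X is frequent for any minsup threshold of 60% or less
```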
Basic Concepts: Association Rules

Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Find all the rules X → Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more!):
Beer → Diaper (60%, 100%)
Diaper → Beer (60%, 75%)
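Extending the same sketch (it reuses transactions and support() from the block above), rule support and confidence are just support ratios:

```python
def confidence(X, Y, db):
    """conf(X -> Y) = support(X U Y) / support(X)."""
    return support(X | Y, db) / support(X, db)

# Reusing transactions and support() from the previous sketch:
print(support({"Beer", "Diaper"}, transactions) / len(transactions))  # s = 0.6
print(confidence({"Beer"}, {"Diaper"}, transactions))    # c = 3/3 = 1.00
print(confidence({"Diaper"}, {"Beer"}, transactions))    # c = 3/4 = 0.75
```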
Closed Patterns and Max-Patterns

A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
A closed pattern is a lossless compression of frequent patterns
Reduces the # of patterns and rules
Closed Patterns and Max-Patterns

Exercise: DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
What is the set of closed itemsets?
<a1, …, a100>: 1
<a1, …, a50>: 2
What is the set of max-patterns?
<a1, …, a100>: 1
What is the set of all frequent patterns?
Every non-empty subset of {a1, …, a100}: 2^100 − 1 patterns, far too many to enumerate!
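A brute-force check of this exercise (a hedged sketch; it relies on the standard fact that closed itemsets are exactly the intersections of non-empty groups of transactions, which sidesteps enumerating all 2^100 − 1 subsets):

```python
from itertools import combinations

def closed_itemsets(db):
    # Every closed itemset is the intersection of the transactions that
    # contain it, so intersecting transaction groups finds them all.
    closed = {}
    for r in range(1, len(db) + 1):
        for group in combinations(db, r):
            inter = frozenset.intersection(*group)
            if inter:
                closed[inter] = sum(1 for t in db if inter <= t)
    return closed

db = [frozenset(f"a{i}" for i in range(1, 101)),   # <a1, ..., a100>
      frozenset(f"a{i}" for i in range(1, 51))]    # <a1, ..., a50>
for s, sup in closed_itemsets(db).items():
    print(f"{len(s)}-itemset with support {sup}")
# -> a 100-itemset with support 1 and a 50-itemset with support 2
```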
Computational Complexity of Frequent Itemset Mining

How many itemsets may potentially be generated in the worst case?
The number of frequent itemsets to be generated is sensitive to the minsup threshold
When minsup is low, there can exist an exponential number of frequent itemsets
The worst case: M^N, where M = # distinct items and N = max transaction length
The worst-case complexity vs. the expected probability
Ex. Suppose Walmart sells 10^4 kinds of products
The chance of picking up one particular product: 10^-4
The chance of picking up a particular set of 10 products: ~10^-40
What is the chance that this particular set of 10 products is frequent, appearing 10^3 times in 10^9 transactions?
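As a back-of-envelope answer (a hedged sketch: the 10^-40 figure is the slide's independence assumption, and the binomial count is approximated by a Poisson distribution), the chance is vanishingly small:

```python
import math

p = 1e-40       # slide's chance that one transaction contains the 10-itemset
n = 10**9       # transactions
k = 10**3       # required support
lam = n * p     # expected support = 1e-31, far below 1000

# For lam << k the Poisson tail is dominated by its first term:
#   P(X >= k) <= exp(-lam) * lam**k / k!
log10_tail = (-lam + k * math.log(lam) - math.lgamma(k + 1)) / math.log(10)
print(f"expected support ~ {lam:.0e}")                 # ~1e-31
print(f"log10 P(support >= {k}) <= {log10_tail:.0f}")  # ~ -33568: effectively zero
```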
Chapter 5: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Basic Concepts
Frequent Itemset Mining Methods
Which Patterns Are Interesting?—Pattern
Evaluation Methods
Summary
Scalable Frequent Itemset Mining Methods

Apriori: A Candidate Generation-and-Test Approach
Improving the Efficiency of Apriori
FPGrowth: A Frequent Pattern-Growth Approach
ECLAT: Frequent Pattern Mining with Vertical Data Format
The Downward Closure Property and Scalable Mining Methods

The downward closure property of frequent patterns:
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods: Three major approaches
Apriori (Agrawal & Srikant @VLDB'94)
Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD'00)
Vertical data format approach (Charm—Zaki & Hsiao @SDM'02)
Apriori: A Candidate Generation & Test Approach

Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB'94, Mannila, et al. @KDD'94)
Method:
Initially, scan DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated
The Apriori Algorithm—An Example

Sup_min = 2

Database TDB
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

C1 (after 1st scan): {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3
C2: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
C2 (after 2nd scan): {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
L2: {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2
C3: {B, C, E}
L3 (after 3rd scan): {B, C, E}:2
The Apriori Algorithm (Pseudo-Code)

Ck: Candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
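A direct Python transcription of this pseudo-code (a minimal sketch, not the book's code; candidate generation uses the self-join-and-prune scheme that the next slide works through):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: returns {frequent itemset (frozenset): support}."""
    db = [frozenset(t) for t in transactions]
    # Scan 1: frequent 1-itemsets
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    freq = {s: c for s, c in counts.items() if c >= min_sup}
    all_freq, k = dict(freq), 1
    while freq:
        # Generate C(k+1): self-join Lk, then prune by downward closure
        cands = set()
        for a, b in combinations(freq, 2):
            u = a | b
            if len(u) == k + 1 and all(frozenset(s) in freq
                                       for s in combinations(u, k)):
                cands.add(u)
        # One DB scan counts every surviving candidate
        counts = {c: sum(1 for t in db if c <= t) for c in cands}
        freq = {s: n for s, n in counts.items() if n >= min_sup}
        all_freq.update(freq)
        k += 1
    return all_freq

# The example above: min_sup = 2 recovers L1, L2, and L3 = {B,C,E}:2
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for s, n in sorted(apriori(tdb, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(s), n)
```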
Implementation of Apriori

How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of candidate generation:
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
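The same join-and-prune step in isolation, reproducing this worked example (a self-contained sketch; encoding itemsets as sorted tuples makes the shared (k−1)-prefix of the self-join explicit):

```python
from itertools import combinations

def apriori_gen(Lk):
    """Join two k-itemsets sharing their first k-1 items, then prune every
    (k+1)-candidate that has an infrequent k-subset (downward closure)."""
    Lk = sorted(Lk)
    k = len(Lk[0])
    joined = set()
    for a, b in combinations(Lk, 2):
        if a[:k-1] == b[:k-1]:                # self-join on the (k-1)-prefix
            joined.add(tuple(sorted(set(a) | set(b))))
    frequent = set(Lk)
    return {c for c in joined
            if all(s in frequent for s in combinations(c, k))}

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3))   # {('a','b','c','d')}: acde is pruned (ade not in L3)
```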
How to Count Supports of Candidates?

Why is counting supports of candidates a problem?
The total number of candidates can be very huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a transaction
Counting Supports of Candidates Using Hash Tree

[Figure: hash tree over candidate 3-itemsets (145, 124, 457, 125, 458, 159, 136, 345, 356, 357, 689, 367, 368, 234, 567) with hash branches 1,4,7 / 2,5,8 / 3,6,9; the subset function matches transaction 1 2 3 5 6 by recursively expanding 1+2356, 12+356, 13+56, …]
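A full hash tree takes more code than fits on a slide; as a hedged stand-in, the sketch below does the subset function's job with a flat hash set (the hash tree prunes this same enumeration by hashing items level by level, so a transaction only visits the buckets it can reach):

```python
from itertools import combinations
from collections import defaultdict

def count_supports(transactions, candidates, k):
    """Increment the count of every k-candidate contained in each transaction."""
    cand_set = set(candidates)            # flat stand-in for the hash tree
    counts = defaultdict(int)
    for t in transactions:
        for subset in combinations(sorted(t), k):
            if subset in cand_set:
                counts[subset] += 1
    return counts

# A few of the figure's candidate 3-itemsets, probed with transaction 1 2 3 5 6:
candidates = [(1,2,4), (1,2,5), (1,3,6), (2,3,4), (3,5,6), (3,5,7), (3,6,7)]
print(count_supports([(1, 2, 3, 5, 6)], candidates, 3))
# -> {(1,2,5): 1, (1,3,6): 1, (3,5,6): 1}
```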
Scalable Frequent Itemset Mining Methods

Apriori: A Candidate Generation-and-Test Approach
Improving the Efficiency of Apriori
FPGrowth: A Frequent Pattern-Growth Approach
ECLAT: Frequent Pattern Mining with Vertical Data Format
Mining Closed Frequent Patterns and Max-Patterns
Further Improvement of the Apriori Method
Major computational challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for
candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
Scalable Frequent Itemset Mining Methods

Apriori: A Candidate Generation-and-Test Approach
Improving the Efficiency of Apriori
FPGrowth: A Frequent Pattern-Growth Approach
ECLAT: Frequent Pattern Mining with Vertical Data Format
Mining Closed Frequent Patterns and Max-Patterns
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation

Bottlenecks of the Apriori approach
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)
Depth-first search
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using local frequent items only
"abc" is a frequent pattern
Get all transactions having "abc", i.e., project DB on abc: DB|abc
"d" is a local frequent item in DB|abc → "abcd" is a frequent pattern
Construct FP-tree from a Transaction Database

min_support = 3

TID | Items bought             | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o, w}       | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order to get the f-list
3. Scan DB again, construct the FP-tree

Header table: f:4, c:4, a:3, b:3, m:3, p:3
F-list = f-c-a-b-m-p

[Figure: FP-tree with root {}; paths f:4–c:3–a:3–m:2–p:2, f–c–a–b:1–m:1, f–b:1, and c:1–b:1–p:1; node-links from the header table connect the occurrences of each item]
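A minimal FP-tree construction sketch in Python (hedged: FPNode and build_fptree are our names, not the book's; ties in the f-list, such as f and c both having count 4, are broken alphabetically here, which changes the tree shape but not the mining result):

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

def build_fptree(transactions, min_support):
    # Pass 1: count items; keep frequent ones in descending-frequency order
    freq = defaultdict(int)
    for t in transactions:
        for item in set(t):
            freq[item] += 1
    flist = [i for i, c in sorted(freq.items(), key=lambda x: (-x[1], x[0]))
             if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}
    root, header = FPNode(None, None), defaultdict(list)
    # Pass 2: insert each transaction's frequent items in f-list order,
    # sharing prefixes and linking every new node into the header table
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)   # node-link
            child.count += 1
            node = child
    return root, header, flist

db = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
root, header, flist = build_fptree(db, 3)
print(flist)   # ['c', 'f', 'a', 'b', 'm', 'p'] (slide's f-list, f/c tie swapped)
print({i: sum(n.count for n in header[i]) for i in flist})  # f:4 c:4 a:3 b:3 m:3 p:3
```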
Partition Patterns and Databases

Frequent patterns can be partitioned into subsets according to the f-list
F-list = f-c-a-b-m-p
Patterns containing p
Patterns having m but no p
…
Patterns having c but none of a, b, m, or p
Pattern f
Completeness and non-redundancy
Find Patterns Having p From p's Conditional Database

Starting at the frequent-item header table in the FP-tree
Traverse the FP-tree by following the node-links of each frequent item p
Accumulate all transformed prefix paths of item p to form p's conditional pattern base

[Figure: the FP-tree and header table from the previous slide]

Conditional pattern bases:
item | cond. pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1
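Continuing the build_fptree sketch above (same assumptions, reusing its FPNode class and the header variable from its demo), a conditional pattern base falls out of walking each node-link back toward the root:

```python
def conditional_pattern_base(item, header):
    """For each occurrence of `item`, collect its prefix path toward the
    root, weighted by that occurrence's count."""
    base = []
    for node in header[item]:          # follow the node-links
        path, p = [], node.parent
        while p.item is not None:      # stop at the root
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
    return base

print(conditional_pattern_base('m', header))
# -> [(['c','f','a'], 2), (['c','f','a','b'], 1)]
#    i.e. the slide's fca:2, fcab:1 with f and c swapped by our tie-break
```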
From Conditional Pattern-Bases to Conditional FP-Trees

For each pattern-base:
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3 (b is dropped: its local count of 1 is below min_support)
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-Tree

m-conditional FP-tree: {} → f:3 → c:3 → a:3
Cond. pattern base of "am": (fc:3); am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of "cm": (f:3); cm-conditional FP-tree: {} → f:3
Cond. pattern base of "cam": (f:3); cam-conditional FP-tree: {} → f:3
A Special Case: Single Prefix Path in FP-Tree

Suppose a (conditional) FP-tree T has a shared single prefix path P
Mining can be decomposed into two parts:
Reduction of the single prefix path into one node
Concatenation of the mining results of the two parts

[Figure: a tree whose prefix path {}–a1:n1–a2:n2–a3:n3 branches into subtrees b1:m1, C1:k1, C2:k2, C3:k3 is decomposed into the single-path part {}–a1:n1–a2:n2–a3:n3 plus a branching part r1 with children b1:m1, C1:k1, C2:k2, C3:k3]
Benefits of the FP-tree Structure

Completeness
Preserves complete information for frequent pattern mining
Never breaks a long pattern of any transaction
Compactness
Reduces irrelevant info—infrequent items are gone
Items in frequency-descending order: the more frequently occurring, the more likely to be shared
Never larger than the original database (not counting node-links and the count fields)
The Frequent Pattern Growth Mining Method

Idea: Frequent pattern growth
Recursively grow frequent patterns by pattern and database partition
Method (a full sketch follows below):
For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path—a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
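An end-to-end recursive miner (a hedged sketch reusing build_fptree and conditional_pattern_base from the earlier blocks; the single-path shortcut described above is omitted, since plain recursion already yields correct results):

```python
def fpgrowth(transactions, min_support, suffix=()):
    """Recursively mine all frequent patterns: {pattern (tuple): support}."""
    root, header, flist = build_fptree(transactions, min_support)
    patterns = {}
    for item in reversed(flist):                 # least-frequent end first
        pattern = (item,) + suffix
        patterns[pattern] = sum(n.count for n in header[item])
        # Materialize the conditional database: each prefix path repeated
        # `count` times, then recurse on it with the grown suffix
        cond_db = []
        for path, count in conditional_pattern_base(item, header):
            cond_db.extend([path] * count)
        patterns.update(fpgrowth(cond_db, min_support, pattern))
    return patterns

db = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
for pat, sup in sorted(fpgrowth(db, 3).items(), key=lambda x: (len(x[0]), x[0])):
    print("".join(sorted(pat)), sup)
# the m-related output (m, fm, cm, am, fcm, fam, cam, fcam, all with support 3)
# matches the m-patterns listed on the conditional FP-tree slide above
```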
Scaling FP-growth by Database Projection

What if the FP-tree cannot fit in memory?
DB projection:
First partition the database into a set of projected DBs
Then construct and mine an FP-tree for each projected DB
Parallel projection vs. partition projection techniques
Parallel projection
Project the DB in parallel for each frequent item
Parallel projection is space costly
All the partitions can be processed in parallel
Partition projection
Partition the DB based on the ordered frequent items
Pass the unprocessed parts on to the subsequent partitions