
Data Mining:

Concepts and Techniques


(3rd ed.)

— Chapter 3 —

1
Chapter 3: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods

◼ Basic Concepts

◼ Frequent Itemset Mining Methods

◼ Which Patterns Are Interesting?—Pattern

Evaluation Methods

◼ Summary

2
What Is Frequent Pattern Analysis?
◼ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
◼ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
◼ Motivation: Finding inherent regularities in data
◼ What products were often purchased together?— Beer and diapers?!
◼ What are the subsequent purchases after buying a PC?
◼ What kinds of DNA are sensitive to this new drug?
◼ Can we automatically classify web documents?
◼ Applications
◼ Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
3
Why Is Freq. Pattern Mining Important?

◼ Freq. pattern: An intrinsic and important property of datasets

◼ Foundation for many essential data mining tasks

  ◼ Association, correlation, and causality analysis

  ◼ Sequential, structural (e.g., sub-graph) patterns

  ◼ Pattern analysis in spatiotemporal, multimedia, time-series, and stream data


◼ Classification: discriminative, frequent pattern analysis

◼ Cluster analysis: frequent pattern-based clustering

◼ Data warehousing: iceberg cube and cube-gradient

◼ Semantic data compression: fascicles

◼ Broad applications

4
Basic Concepts: Frequent Patterns

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

◼ itemset: A set of one or more items
◼ k-itemset X = {x1, …, xk}
◼ (absolute) support, or support count, of X: frequency or number of occurrences of an itemset X
◼ (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
◼ An itemset X is frequent if X’s support is no less than a minsup threshold

5
Basic Concepts: Association Rules
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

◼ Find all the rules X → Y with minimum support and confidence
  ◼ support, s: probability that a transaction contains X ∪ Y
  ◼ confidence, c: conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3

◼ Association rules: (many more!)
  ◼ Beer → Diaper (60%, 100%)
  ◼ Diaper → Beer (60%, 75%)
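To make the definitions concrete, here is a minimal Python sketch (illustrative only, not from the slides) that computes support and confidence for the two rules above from the five-transaction table:

```python
# Computing support and confidence for X -> Y from the table above.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Relative support: fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """Confidence of rule X -> Y: support(X ∪ Y) / support(X)."""
    return support(x | y) / support(x)

print(support({"Beer", "Diaper"}))        # 0.6  (60%)
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  (100%)
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 (75%)
```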
6
Closed Patterns and Max-Patterns
◼ A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
◼ Solution: Mine closed patterns and max-patterns instead
◼ An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99)
◼ An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
◼ Closed pattern is a lossless compression of freq. patterns
◼ Reducing the # of patterns and rules
7
Closed Patterns and Max-Patterns
◼ Exercise. DB = {<a1, …, a100>, < a1, …, a50>}
◼ Min_sup = 1.
◼ What is the set of closed itemsets?
◼ <a1, …, a100>: 1
◼ <a1, …, a50>: 2
◼ What is the set of max-patterns?
◼ <a1, …, a100>: 1
◼ What is the set of all patterns?
◼ !!
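A scaled-down Python sketch (hypothetical: it uses {a1, …, a6} and {a1, …, a3} in place of 100 and 50 items so that brute-force enumeration stays feasible) that checks which frequent itemsets are closed and which are maximal:

```python
from itertools import combinations

# Scaled-down analog of the exercise: two transactions, min_sup = 1.
db = [frozenset(f"a{i}" for i in range(1, 7)),
      frozenset(f"a{i}" for i in range(1, 4))]
min_sup = 1
items = sorted(set().union(*db))

def sup(x):
    return sum(x <= t for t in db)

# All frequent itemsets (brute force -- only feasible for tiny examples).
freq = {frozenset(c): sup(frozenset(c))
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if sup(frozenset(c)) >= min_sup}

# Closed: no proper superset with the same support.
closed = [x for x in freq
          if not any(x < y and freq[y] == freq[x] for y in freq)]
# Maximal: no frequent proper superset at all.
maximal = [x for x in freq if not any(x < y for y in freq)]

print(len(freq))   # 2^6 - 1 = 63 frequent itemsets
print(closed)      # two closed itemsets: {a1,a2,a3} (sup 2) and {a1..a6} (sup 1)
print(maximal)     # one max-pattern: {a1..a6}
```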
8
Chapter 3: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods

◼ Basic Concepts

◼ Frequent Itemset Mining Methods

◼ Which Patterns Are Interesting?—Pattern

Evaluation Methods

◼ Summary

9
Scalable Frequent Itemset Mining Methods

◼ Apriori: A Candidate Generation-and-Test

Approach

◼ Improving the Efficiency of Apriori

◼ FPGrowth: A Frequent Pattern-Growth Approach

◼ ECLAT: Frequent Pattern Mining with Vertical

Data Format
10
The Downward Closure Property and Scalable
Mining Methods
◼ The downward closure property of frequent patterns
  ◼ Any subset of a frequent itemset must be frequent
  ◼ If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  ◼ i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}

◼ Scalable mining methods: Three major approaches
  ◼ Apriori (Agrawal & Srikant @VLDB’94)
  ◼ Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
  ◼ Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
11
Apriori: A Candidate Generation & Test Approach

◼ Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’94)
◼ Method:
◼ Initially, scan DB once to get frequent 1-itemset
◼ Generate length (k+1) candidate itemsets from length k
frequent itemsets
◼ Test the candidates against DB
◼ Terminate when no frequent or candidate set can be
generated

12
The Apriori Algorithm—An Example
Sup_min = 2

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1:   {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (prune {D}):  {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1):    {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2:   {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
L2:              {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2

C3 (from L2):    {B, C, E}
3rd scan → L3:   {B, C, E}:2
13
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk: Frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
14
Exercise: Transaction Data

Min support count: 2

[Transaction table and worked Apriori iterations were shown as figures on the original slides]

15

16
Cont…

◼ Let S = {I1, I2, I5}

◼ Non-empty proper subsets are: {I1}, {I2}, {I5}, {I1, I2}, {I1, I5}, {I2, I5}

◼ Association rules (each non-empty proper subset A of S yields a candidate rule A → S − A):
  ◼ {I1} → {I2, I5}
  ◼ {I2} → {I1, I5}
  ◼ {I5} → {I1, I2}
  ◼ {I1, I2} → {I5}
  ◼ {I1, I5} → {I2}
  ◼ {I2, I5} → {I1}
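A minimal Python sketch of how the candidate rules and their confidences are derived from S and its subsets. The support counts below are placeholders (the exercise’s transaction table is not reproduced here); substitute the counts obtained from the actual database:

```python
from itertools import combinations

S = frozenset({"I1", "I2", "I5"})

# Hypothetical support counts, for illustration only.
sup = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, S: 2,
}

# Every non-empty proper subset A of S yields a candidate rule A -> (S - A)
# with confidence sup(S) / sup(A).
for k in range(1, len(S)):
    for a in combinations(sorted(S), k):
        A = frozenset(a)
        conf = sup[S] / sup[A]
        print(f"{set(A)} -> {set(S - A)}  confidence = {conf:.0%}")
```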

17
Challenges of Frequent Pattern Mining

◼Challenges
◼Huge number of candidates

◼Huge data size

◼Multiple scans of transaction database

◼Improving Apriori: general ideas

◼Shrink number of candidates

◼Transaction reduction

◼Reduce passes of transaction database scans

18
Bottlenecks with Apriori
◼ Uses a generate-and-test approach – generates candidate itemsets and tests if they are frequent
◼ Generation of candidate itemsets is expensive (in both space and time)
◼ Support counting is expensive
  ◼ Subset checking (computationally expensive)
  ◼ Multiple database scans (I/O)

19
Speeding up Apriori Algorithm

❖Dynamic Hashing and Pruning


❖Transaction Reduction

20
DHP: Reduce the Number of Candidates

◼ Hash itemsets into their corresponding buckets

◼ Can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1 – especially 2-itemsets

◼ While scanning the DB to derive L1, generate the 2-itemsets of each transaction t ∈ T and hash them into the buckets of a hash table, incrementing the corresponding bucket counts

◼ If a candidate 2-itemset hashes to a bucket whose count is below min_sup, remove it from C2
21
DHP: Example
TID   List of item IDs
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

Hash function: h(x, y) = ((order of x) × 10 + (order of y)) mod 7

Bucket address:  0    1    2    3    4    5    6
Bucket count:    2    2    4    2    2    4    4
Bucket contents:
  Bucket 0: {I1, I4}, {I3, I5}
  Bucket 1: {I1, I5}, {I1, I5}
  Bucket 2: {I2, I3}, {I2, I3}, {I2, I3}, {I2, I3}
  Bucket 3: {I2, I4}, {I2, I4}
  Bucket 4: {I2, I5}, {I2, I5}
  Bucket 5: {I1, I2}, {I1, I2}, {I1, I2}, {I1, I2}
  Bucket 6: {I1, I3}, {I1, I3}, {I1, I3}, {I1, I3}
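A small Python sketch of this hashing pass (illustrative, not the textbook’s code): it hashes every 2-itemset of every transaction with the slide’s hash function and counts bucket occupancy, reproducing the bucket counts above. The min_sup of 3 used at the end is a hypothetical value chosen just to show the pruning step:

```python
from itertools import combinations

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
    ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3"],
]

def order(item):
    return int(item[1:])          # "I3" -> 3

def h(x, y):
    # Hash function from the slide: h(x, y) = ((order of x)*10 + (order of y)) mod 7
    return (order(x) * 10 + order(y)) % 7

bucket_count = [0] * 7
buckets = [[] for _ in range(7)]
for t in transactions:
    for x, y in combinations(sorted(t, key=order), 2):
        b = h(x, y)
        bucket_count[b] += 1
        buckets[b].append((x, y))

print(bucket_count)   # [2, 2, 4, 2, 2, 4, 4]

# With a hypothetical min_sup = 3, only 2-itemsets hashing to a bucket with
# count >= 3 can possibly be frequent (a necessary, not sufficient, condition).
min_sup = 3
candidates = {p for b in range(7) if bucket_count[b] >= min_sup for p in buckets[b]}
print(sorted(candidates))   # [('I1', 'I2'), ('I1', 'I3'), ('I2', 'I3')]
```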

22
How to Trim Candidate Itemsets
◼ In the k-th iteration, hash all “appearing” (k+1)-itemsets into a hash table and count all the occurrences of an itemset in the corresponding bucket.
◼ In the (k+1)-th iteration, examine each candidate itemset to see if its corresponding bucket count is above the min. support (a necessary condition).

23
Transaction Reduction
◼ Reduce no. of transactions for future iterations
◼ A transaction t that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets
◼ Delete/mark t from further consideration
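A one-function Python sketch (illustrative) of this reduction step between iterations:

```python
def reduce_transactions(transactions, frequent_k_itemsets):
    """Keep only transactions that still contain at least one frequent k-itemset;
    the others cannot contribute to any frequent (k+1)-itemset."""
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_k_itemsets)]
```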

24
Frequent Pattern Growth (FP-Growth) Algorithm

◼ Allows frequent itemset discovery without candidate itemset

generation.

◼ Two step approach:

◼ Step 1: Build a compact data structure called the FP-tree

◼ Built using 2 passes over the data-set.

◼ Step 2: Extract frequent itemsets directly from the FP-tree

◼ Traversal through FP-Tree

25
Step-1: FP-Tree Construction

◼ FP-Tree is constructed using 2 passes over the dataset:

◼ Pass 1:

◼ Scan data and find support for each item.

◼ Discard infrequent items.

◼ Sort frequent items in decreasing order based on their support.

◼ For our example: a, b, c, d, e

◼ Use this order when building the FP-Tree, so common prefixes can be shared.

26
FP-Tree Construction (cont…)

◼ Pass 2: FP-Tree construction

◼ Read transaction 1: {a, b}
  ◼ Create 2 nodes a and b and the path null→a→b. Set counts of a and b to 1.

◼ Read transaction 2: {b, c, d}
  ◼ Create 3 nodes for b, c and d and the path null→b→c→d. Set counts to 1.
  ◼ Note that although transactions 1 and 2 share b, the paths are disjoint as they don’t share a common prefix. Add the link between the b’s.

◼ Read transaction 3: {a, c, d, e}
  ◼ It shares the common prefix item a with transaction 1, so the paths for transactions 1 and 3 will overlap and the frequency count for node a will be incremented by 1. Add links between the c’s and d’s.

◼ Continue until all transactions are mapped to a path in the FP-tree.
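A compact Python sketch (an illustrative implementation, not the textbook’s code) of this two-pass FP-tree construction; the usage at the end applies it to the small four-transaction database from the exercise at the end of this chapter, with a min_sup of 2:

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}            # item -> FPNode

def build_fp_tree(transactions, min_sup):
    # Pass 1: count item supports, discard infrequent items,
    # and fix a global order (decreasing support).
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    frequent = {i: s for i, s in support.items() if s >= min_sup}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))

    # Pass 2: insert each transaction (filtered and re-sorted) as a path,
    # sharing common prefixes and incrementing counts along the way.
    root = FPNode(None, None)
    header = defaultdict(list)        # item -> list of nodes (node-links)
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, order

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
root, header, order = build_fp_tree(db, 2)
print(order)                                          # ['B', 'C', 'E', 'A']
print({item: len(nodes) for item, nodes in header.items()})
```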
27
Example: FP-Tree Construction

min_sup = 3

[FP-tree construction example shown as a figure on the original slide]
28
Exercise VIP

Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E

29
