Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 3 —
1
Chapter 3: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
◼ Basic Concepts
◼ Frequent Itemset Mining Methods
◼ Which Patterns Are Interesting?—Pattern
Evaluation Methods
◼ Summary
2
What Is Frequent Pattern Analysis?
◼ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
◼ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
◼ Motivation: Finding inherent regularities in data
◼ What products were often purchased together?— Beer and diapers?!
◼ What are the subsequent purchases after buying a PC?
◼ What kinds of DNA are sensitive to this new drug?
◼ Can we automatically classify web documents?
◼ Applications
◼ Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
3
Why Is Freq. Pattern Mining Important?
◼ Freq. pattern: An intrinsic and important property of
datasets
◼ Foundation for many essential data mining tasks
◼ Association, correlation, and causality analysis
◼ Sequential, structural (e.g., sub-graph) patterns
◼ Pattern analysis in spatiotemporal, multimedia, time-
series, and stream data
◼ Classification: discriminative, frequent pattern analysis
◼ Cluster analysis: frequent pattern-based clustering
◼ Data warehousing: iceberg cube and cube-gradient
◼ Semantic data compression: fascicles
◼ Broad applications
4
Basic Concepts: Frequent Patterns
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

(Venn diagram omitted: customers who buy beer, customers who buy diapers, and the customers who buy both)

◼ itemset: a set of one or more items
◼ k-itemset X = {x1, …, xk}
◼ (absolute) support, or support count, of X: frequency or number of occurrences of itemset X
◼ (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
◼ An itemset X is frequent if X’s support is no less than a minsup threshold
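◼ A minimal Python sketch (illustrative, not from the slides) that computes the absolute and relative support of an itemset over the table above:

# Example transactions from the table above
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset, transactions):
    """Return (absolute support count, relative support) of an itemset."""
    count = sum(1 for t in transactions if set(itemset) <= t)
    return count, count / len(transactions)

print(support({"Beer", "Diaper"}, transactions))  # (3, 0.6): frequent at minsup = 50%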
5
Basic Concepts: Association Rules
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

◼ Find all the rules X → Y with minimum support and confidence
  ◼ support, s: probability that a transaction contains X ∪ Y
  ◼ confidence, c: conditional probability that a transaction having X also contains Y
◼ Let minsup = 50%, minconf = 50%
  ◼ Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
  ◼ Association rules: (many more!)
    ◼ Beer → Diaper (60%, 100%)
    ◼ Diaper → Beer (60%, 75%)
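◼ Reusing the transactions list and support() sketch from the previous slide, rule support and confidence can be computed directly (again, only an illustration):

def rule_stats(X, Y):
    """Return (support, confidence) of the rule X → Y."""
    count_xy, s = support(X | Y, transactions)   # s = P(X ∪ Y)
    count_x, _ = support(X, transactions)
    return s, count_xy / count_x                 # confidence = P(Y | X)

print(rule_stats({"Beer"}, {"Diaper"}))   # (0.6, 1.0)  → Beer → Diaper (60%, 100%)
print(rule_stats({"Diaper"}, {"Beer"}))   # (0.6, 0.75) → Diaper → Beer (60%, 75%)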
6
Closed Patterns and Max-Patterns
◼ A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
◼ Solution: Mine closed patterns and max-patterns instead
◼ An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99)
◼ An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
◼ Closed pattern is a lossless compression of freq. patterns
◼ Reducing the # of patterns and rules
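◼ A small illustration (not from the text): given a mined table of frequent itemsets and their support counts, closedness and maximality follow directly from the definitions above:

def closed_and_max(freq):
    """freq: dict mapping frozenset → support count of each frequent itemset."""
    closed, maximal = [], []
    for X, sup in freq.items():
        supersets = [Y for Y in freq if X < Y]          # proper frequent supersets of X
        if not any(freq[Y] == sup for Y in supersets):
            closed.append(X)                            # no superset with the same support
        if not supersets:
            maximal.append(X)                           # no frequent superset at all
    return closed, maximal

freq = {frozenset("AB"): 3, frozenset("A"): 3, frozenset("B"): 4}
print(closed_and_max(freq))   # {A,B} and {B} are closed; only {A,B} is maximal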
7
Closed Patterns and Max-Patterns
◼ Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
◼ Min_sup = 1.
◼ What is the set of closed itemsets?
  ◼ <a1, …, a100>: 1
  ◼ <a1, …, a50>: 2
◼ What is the set of max-patterns?
  ◼ <a1, …, a100>: 1
◼ What is the set of all frequent patterns?
  ◼ !! (every non-empty subset of {a1, …, a100}, i.e., 2^100 − 1 patterns, far too many to enumerate)
8
Chapter 3: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
◼ Basic Concepts
◼ Frequent Itemset Mining Methods
◼ Which Patterns Are Interesting?—Pattern
Evaluation Methods
◼ Summary
9
Scalable Frequent Itemset Mining Methods
◼ Apriori: A Candidate Generation-and-Test
Approach
◼ Improving the Efficiency of Apriori
◼ FPGrowth: A Frequent Pattern-Growth Approach
◼ ECLAT: Frequent Pattern Mining with Vertical
Data Format
10
The Downward Closure Property and Scalable
Mining Methods
◼ The downward closure property of frequent patterns
◼ Any subset of a frequent itemset must be frequent
◼ If {beer, diaper, nuts} is frequent, so is {beer,
diaper}
◼ i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
◼ Scalable mining methods: Three major approaches
◼ Apriori (Agrawal & Srikant@VLDB’94)
◼ Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00)
◼ Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
11
Apriori: A Candidate Generation & Test Approach
◼ Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
◼ Method:
◼ Initially, scan DB once to get the frequent 1-itemsets
◼ Generate length (k+1) candidate itemsets from length k
frequent itemsets
◼ Test the candidates against DB
◼ Terminate when no frequent or candidate set can be
generated
12
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2 with counts: {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
L2: {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2

C3 (generated from L2): {B, C, E}
3rd scan → L3: {B, C, E}:2
13
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that
are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
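◼ A minimal Python sketch of the same loop (illustrative, not the textbook's code), including the downward-closure pruning of candidates:

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_sup}
    frequent, k = set(Lk), 1
    while Lk:
        # generate C(k+1) by joining Lk with itself, then prune any candidate
        # that has an infrequent k-subset (downward closure property)
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # scan the database to count the surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in Ck1}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent |= Lk
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_sup=2))   # includes {B, C, E}, matching L1–L3 of the earlier example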
14
Exercise: Transaction Data
Min support count: 2
(Exercise figures omitted)
15
16
Cont…
◼ Let S = {I1, I2, I5}
◼ Non-empty proper subsets are: {I1}, {I2}, {I5}, {I1, I2}, {I1, I5}, {I2, I5}
◼ Association rules: one candidate rule A → S − A for each non-empty proper subset A (rule listing omitted; see the sketch below)
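◼ A hedged sketch of this rule-generation step (the support table is a parameter here; the actual counts come from the mined data, which is not reproduced on these slides):

from itertools import combinations

def rules_from_itemset(S, support, min_conf):
    """support: dict mapping frozenset → support count from the mined frequent-itemset table."""
    S = frozenset(S)
    rules = []
    for r in range(1, len(S)):                        # every non-empty proper subset A of S
        for A in map(frozenset, combinations(S, r)):
            conf = support[S] / support[A]            # conf(A → S − A) = sup(S) / sup(A)
            if conf >= min_conf:
                rules.append((set(A), set(S - A), conf))
    return rules

# usage: rules_from_itemset({"I1", "I2", "I5"}, support_counts, min_conf=0.7)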
17
Challenges of Frequent Pattern Mining
◼ Challenges
  ◼ Huge number of candidates
  ◼ Huge data size
  ◼ Multiple scans of the transaction database
◼ Improving Apriori: general ideas
  ◼ Shrink the number of candidates
  ◼ Transaction reduction
  ◼ Reduce the number of passes over the transaction database
18
Bottlenecks with Apriori
◼ Uses a generate-and-test approach: generates candidate itemsets and tests if they are frequent
◼ Generation of candidate itemsets is expensive (in both space and time)
◼ Support counting is expensive
  ◼ Subset checking (computationally expensive)
  ◼ Multiple database scans (I/O)
19
Speeding up Apriori Algorithm
❖ Dynamic Hashing and Pruning
❖ Transaction Reduction
20
DHP: Reduce the Number of Candidates
◼ Hashing itemsets into corresponding buckets
◼ Can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1 – especially 2-itemsets
◼ While scanning the DB to determine L1, generate the candidate 2-itemsets of each transaction t ∈ T, hash them into the buckets of a hash table, and increase the bucket counts
◼ If an itemset’s bucket count is < min_sup, remove it from C2
21
DHP: Example
TID   List of item IDs
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

Hash function: h(x, y) = ((order of x)×10 + (order of y)) mod 7

Bucket address   Bucket count   Bucket contents
0                2              {I1, I4}, {I3, I5}
1                2              {I1, I5}, {I1, I5}
2                4              {I2, I3}, {I2, I3}, {I2, I3}, {I2, I3}
3                2              {I2, I4}, {I2, I4}
4                2              {I2, I5}, {I2, I5}
5                4              {I1, I2}, {I1, I2}, {I1, I2}, {I1, I2}
6                4              {I1, I3}, {I1, I3}, {I1, I3}, {I1, I3}
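◼ A small Python sketch (variable names are mine) that reproduces the bucket counts above using the slide's hash function:

from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]

def order(item):
    return int(item[1:])                       # "I3" → 3

def h(x, y):
    return (order(x) * 10 + order(y)) % 7      # hash function from the slide

bucket_count = [0] * 7
for t in transactions:
    for x, y in combinations(sorted(t, key=order), 2):   # 2-itemsets of t
        bucket_count[h(x, y)] += 1

print(bucket_count)   # [2, 2, 4, 2, 2, 4, 4], matching the bucket counts above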
22
How to Trim Candidate Itemsets
◼ In the k-th iteration, hash all “appearing” (k+1)-itemsets into a hash table and count the occurrences of each itemset in the corresponding bucket.
◼ In the (k+1)-th iteration, examine each candidate itemset to see whether its corresponding bucket count is at least the min. support (a necessary condition); see the sketch below.
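◼ Continuing the previous sketch (hypothetical helper, not from the slides), trimming keeps only the candidate 2-itemsets whose bucket count reaches min_sup; the survivors' exact supports must still be counted in the next scan:

def trim_c2(candidates, bucket_count, min_sup):
    """candidates: iterable of (x, y) pairs with x before y in item order."""
    return [(x, y) for (x, y) in candidates if bucket_count[h(x, y)] >= min_sup]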
23
Transaction Reduction
◼ Reduce the number of transactions considered in future iterations
◼ A transaction t that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset
◼ Delete t, or mark it so it is skipped in further scans (see the sketch below)
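◼ A minimal sketch of this filter (illustrative; frequent_k is assumed to hold the frequent k-itemsets as frozensets):

def reduce_transactions(transactions, frequent_k):
    # keep only transactions containing at least one frequent k-itemset;
    # the rest cannot contribute to any frequent (k+1)-itemset
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_k)]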
24
Frequent Pattern Growth (FP-Growth) Algorithm
◼ Allows frequent itemset discovery without candidate itemset
generation.
◼ Two step approach:
◼ Step 1: Build a compact data structure called the FP-tree
◼ Built using 2 passes over the data set.
◼ Step 2: Extract frequent itemsets directly from the FP-tree
◼ Traversal through FP-Tree
25
Step-1: FP-Tree Construction
◼ FP-Tree is constructed using 2 passes over the data set:
◼ Pass 1:
◼ Scan data and find support for each item.
◼ Discard infrequent items.
◼ Sort frequent items in decreasing order based on their support.
◼ For our example: a, b, c, d, e
◼ Use this order when building the FP-Tree, so common prefixes can be shared.
26
FP-Tree Construction (cont…)
◼ Pass 2: FP-Tree construction
◼ Read transaction 1: {a, b}
◼ Create 2 nodes a and b and the path null→a→b. Set counts of
a and b to 1.
◼ Read transaction 2: {b, c, d}
◼ Create 3 nodes for b, c and d and the path null→b→c→d. Set
counts to 1.
◼ Note that although transactions 1 and 2 share b, the paths are disjoint because they do not share a common prefix. Add the link between the two b nodes.
◼ Read transaction 3: {a, c, d, e}
◼ It shares the common prefix item a with transaction 1, so the paths for transactions 1 and 3 overlap and the frequency count of node a is incremented by 1. Add links between the c nodes and between the d nodes.
◼ Continue until all transactions are mapped to a path in the FP-tree.
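◼ A compact sketch of the two passes described above (class and field names are mine, not the book's):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                      # item → child FPNode

def build_fp_tree(transactions, min_sup):
    # Pass 1: count item supports and keep only the frequent items
    support = defaultdict(int)
    for t in transactions:
        for i in t:
            support[i] += 1
    support = {i: s for i, s in support.items() if s >= min_sup}

    root = FPNode(None, None)
    node_links = defaultdict(list)              # header table: item → nodes holding it
    # Pass 2: insert each transaction's frequent items in decreasing-support order
    for t in transactions:
        path = sorted((i for i in t if i in support),
                      key=lambda i: (-support[i], i))
        node = root
        for i in path:
            if i not in node.children:
                node.children[i] = FPNode(i, node)
                node_links[i].append(node.children[i])   # link nodes carrying the same item
            node = node.children[i]
            node.count += 1                     # shared prefixes just increment counts
    return root, node_links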
27
Example: FP-Tree Construction
min_sup = 3
(FP-tree construction figure omitted)
28
Exercise VIP
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E
29