DWDM - Unit - IV
Mining Frequent Patterns, Associations and Correlations

Unit outline:
1. Frequent Pattern Analysis
2. Kinds of Association Rules
3. Mining Methods
4. Association Mining to Correlation Analysis
5. Constraint-Based Association Mining
1. What is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set is called a
frequent pattern.
Motivation: finding inherent regularities in data
– What products were often purchased together? Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
Example of a frequent itemset: {milk, bread}
Example

Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F

Itemset X = {x1, …, xk}. Find all the rules X ⇒ Y with minimum
support and confidence:
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction
having X also contains Y

Let supmin = 50%, confmin = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (60%, 100%)
D ⇒ A (60%, 75%)
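The support and confidence values above can be checked directly. A minimal sketch over the five transactions of the table (itemsets as Python sets; counts kept as integers so the results are exact):

```python
# The five transactions from the example table above
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def support_count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """support(X) = fraction of transactions that contain X."""
    return support_count(itemset) / len(transactions)

def confidence(x, y):
    """confidence(X => Y) = support(X u Y) / support(X)."""
    return support_count(x | y) / support_count(x)

print(support({"A", "D"}))       # 0.6  -> rule support 60%
print(confidence({"A"}, {"D"}))  # 1.0  -> A => D holds with 100% confidence
print(confidence({"D"}, {"A"}))  # 0.75 -> D => A holds with 75% confidence
```

The printed values reproduce the two rules on the slide: A ⇒ D (60%, 100%) and D ⇒ A (60%, 75%).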
Classification of Frequent Pattern Mining Ways
Market basket analysis is just one form of frequent pattern mining.
Frequent pattern mining can be classified in various ways, based on the
following criteria:
a) Based on the completeness of patterns to be mined
– Complete set of frequent itemsets
– Closed frequent itemsets and maximal frequent itemsets
– Constrained frequent itemsets, approximate frequent itemsets
– Near-match frequent itemsets, top-k frequent itemsets
b) Based on the levels of abstraction involved in the rule set
buys(X, “computer”) ⇒ buys(X, “HP printer”)
buys(X, “laptop_computer”) ⇒ buys(X, “HP printer”)
c) Based on the number of data dimensions involved in the rule
Single-dimensional: buys(X, “computer”) ⇒ buys(X, “antivirus software”)
Multidimensional: age(X, “30-39”) ∧ income(X, “42K-48K”) ⇒ buys(X, “HDTV”)
d) Based on the types of values handled in the rule
Boolean association rules and quantitative association rules

Applications: catalog design, sale campaign analysis,
web log (click-stream) analysis, DNA sequence analysis.
2. SCALABLE METHODS FOR MINING FREQUENT PATTERNS
Example (final Apriori step): the 3rd scan generates candidate set
C3 = {{B, C, E}}; its support count is 2, so L3 = {{B, C, E}: 2}.

Ex-2: Construct the Apriori frequent itemsets with minimum support
count = 2 (data set shown on the slide).
Ex-3: Find the frequent patterns with minimum support = 50%.
1. Finding Frequent Itemsets Using Candidate
Generation (Apriori)
Apriori property: all nonempty subsets of a frequent
itemset must also be frequent.
A two-step process is followed, consisting of join and
prune actions.
Disadvantages
– Multiple scans of the transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
– Assumes the transaction database is memory resident
– Requires many database scans
Goals for improving Apriori
– Reduce passes of transaction database scans
– Shrink the number of candidates
– Facilitate support counting of candidates
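The join and prune steps above can be sketched as follows; a minimal, unoptimized level-wise Apriori (absolute support counts, no hash-tree or other speedups):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: returns {frozenset(itemset): support_count}."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items in one DB scan
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: union frequent (k-1)-itemsets that differ in one item
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    candidates.add(union)
        # Prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        # Scan the database to count each surviving candidate
        frequent = {}
        for c in candidates:
            sup = sum(c <= t for t in transactions)
            if sup >= min_sup:
                frequent[c] = sup
        result.update(frequent)
        k += 1
    return result

data = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
        {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
result = apriori(data, 3)  # min support count 3 = 60% of 5 transactions
```

On the earlier five-transaction table this returns exactly the frequent patterns listed there: A:3, B:3, D:4, E:3 and {A, D}:3. Note how each level still requires a full database scan, which is the main disadvantage listed above.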
Applications
3. Mining Frequent Patterns Without Candidate
Generation – FP-Growth Algorithm
Proposed by Han, Pei & Yin in 2000.
FP-growth follows three steps:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Sort the frequent items in frequency-descending order (the f-list).
3. Scan the DB again and construct the FP-tree.
Ex-1: AllElectronics branch transactional data, using the
FP-growth algorithm with minimum support count = 2.
First, count how many times each item occurs in the table
(the support count of each item).
Next, construct the FP-tree by inserting each transaction with its
items in priority (frequency-descending f-list) order:
Step 1: T100 (I2, I1, I5): null → I2:1 → I1:1 → I5:1
Step 2: T200 (I2, I4): I2 becomes 2; new branch I2 → I4:1
Step 3: T300 (I2, I3): I2 becomes 3; new branch I2 → I3:1
Step 4: T400 (I2, I1, I4): I2:4, I1:2; new branch I1 → I4:1
Step 5: T500 (I1, I3): new branch from the root: I1:1 → I3:1
Step 6: T600 (I2, I3): I2:5; the I2 → I3 node becomes I3:2
Step 7: T700 (I1, I3): the root branch becomes I1:2 → I3:2
Step 8: T800 (I2, I1, I3, I5): I2:6, I1:3; new nodes I3:1 and
I5:1 along the I2 → I1 path
Step 9: T900 (I2, I1, I3): I2:7, I1:4; the I2 → I1 → I3 node
becomes I3:2

Final FP-tree:
null
├── I2:7
│   ├── I1:4
│   │   ├── I5:1
│   │   ├── I4:1
│   │   └── I3:2 ── I5:1
│   ├── I4:1
│   └── I3:2
└── I1:2
    └── I3:2
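The tree-building steps above can be sketched in code. This is a minimal FP-tree constructor (no header table or node-links, which the full FP-growth mining phase would also need); `paths` is only a helper for inspecting the resulting branches:

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.count = item, 1
        self.parent, self.children = parent, {}

def build_fp_tree(transactions, min_sup):
    """Count items, order each transaction by descending frequency,
    then insert it as a path, sharing common prefixes."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    # f-list: frequent items in frequency-descending order (ties: by name)
    flist = sorted((i for i in counts if counts[i] >= min_sup),
                   key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(flist)}
    root = Node(None, None)
    root.count = 0
    for t in transactions:
        # Keep only frequent items, in f-list priority order
        path = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in path:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root

def paths(node, prefix=()):
    """Yield (path-from-root, count) for every tree node."""
    for child in node.children.values():
        yield prefix + (child.item,), child.count
        yield from paths(child, prefix + (child.item,))

db = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
      ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
      ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
branch = dict(paths(build_fp_tree(db, 2)))
```

With the nine AllElectronics transactions, `branch` reproduces the final tree above, e.g. the root branch I2 carries count 7 and the I2 → I1 node carries count 4.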
Ex-2: A database has five transactions. Let min_sup = 60% and
min_conf = 80%; with five transactions this corresponds to a minimum
support count of 3. Use the FP-growth algorithm to discover the
frequent itemsets without candidate generation.
Advantages
1. No candidate generation
2. No candidate test
3. No repeated scan of the entire database
4. Never breaks a long pattern of any transaction
5. Reduces irrelevant information: infrequent items are removed
6. Keeps items in frequency-descending order
7. The FP-tree is never larger than the original database
8. Preserves complete information for frequent
pattern mining
4. Vertical Data Format (ECLAT)
The vertical data format approach was proposed by Zaki (ECLAT, 2000);
CHARM, by Zaki & Hsiao, is a related algorithm for closed itemsets.
It mines the transformed data set by TID_set intersections,
based on the Apriori property and additional optimization
techniques such as diffsets.
Step 1: Transform the transaction data set D into vertical data
format, i.e., the TID_set of each 1-itemset.
Step 2: Intersect the TID_sets of the frequent 1-itemsets to obtain
the 2-itemsets and their support counts.
Step 3: Intersect the TID_sets of the frequent 2-itemsets to obtain
the 3-itemsets and their support counts.
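The three steps above can be sketched as a depth-first ECLAT-style search, where the support of a k-itemset is the size of the intersection of its members' TID_sets (no diffset optimization; plain intersections only):

```python
def to_vertical(horizontal):
    """Transform {tid: set_of_items} into vertical format {item: set_of_tids}."""
    vertical = {}
    for tid, items in horizontal.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical

def eclat(prefix, items, min_sup, out):
    """DFS over itemsets; support(prefix + {item}) = |intersection of TID_sets|."""
    while items:
        item, tids = items.pop()
        if len(tids) >= min_sup:
            out[prefix + (item,)] = len(tids)
            # Candidate extensions intersect with the remaining TID_sets
            suffix = [(other, tids & other_tids) for other, other_tids in items]
            eclat(prefix + (item,), suffix, min_sup, out)

db = {10: {"A", "B", "D"}, 20: {"A", "C", "D"}, 30: {"A", "D", "E"},
      40: {"B", "E", "F"}, 50: {"B", "C", "D", "E", "F"}}
frequent = {}
eclat((), sorted(to_vertical(db).items()), 3, frequent)  # min support count = 3
```

On the five-transaction example this finds the same frequent patterns as before (A:3, B:3, D:4, E:3 and {D, A}:3), but every support is computed by a set intersection instead of a database scan.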
Mining Closed Frequent Itemsets
Frequent itemset mining may generate a huge number of frequent
itemsets, especially when the min_sup threshold is set low or when
there exist long patterns in the data set. For example, a single
transaction such as TID 10 = {A, B, C, D, E} alone contributes
2^5 − 1 = 31 nonempty frequent subsets when min_sup = 1. Mining only
the closed frequent itemsets (those with no proper superset of the
same support) preserves complete support information while producing
far fewer itemsets.
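The definitions can be made concrete: X is closed if no proper superset has the same support, and maximal if no proper superset is frequent at all. A minimal check over an already-mined `{itemset: support}` dictionary (using the frequent patterns of the earlier A–E example):

```python
def closed_and_maximal(freq):
    """freq: {frozenset: support_count} -> (closed, maximal) sub-dicts."""
    closed, maximal = {}, {}
    for s, sup in freq.items():
        frequent_supersets = [t for t in freq if s < t]  # proper supersets
        if all(freq[t] < sup for t in frequent_supersets):
            closed[s] = sup       # no superset matches s's support
        if not frequent_supersets:
            maximal[s] = sup      # no frequent superset at all
    return closed, maximal

freq = {frozenset({"A"}): 3, frozenset({"B"}): 3, frozenset({"D"}): 4,
        frozenset({"E"}): 3, frozenset({"A", "D"}): 3}
closed, maximal = closed_and_maximal(freq)
# {A} is NOT closed: its superset {A, D} has the same support (3),
# so reporting {A, D}:3 alone already implies {A}:3.
```

Here {D} is closed but not maximal (its superset {A, D} is frequent, though with smaller support), while {B}, {E} and {A, D} are both closed and maximal.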
Mining various kinds of association rules consists of:
1. Mining multilevel association rules
2. Mining multidimensional association rules
3. Mining multidimensional association rules using static
discretization of quantitative attributes
4. Mining quantitative association rules (ARCS)
Example-1: Mining Association Rules

Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X ⇒ Y with minimum support and confidence:
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction
having X also contains Y
(Venn diagram: customers who buy beer, customers who buy diapers,
and customers who buy both.)

Let minsup = 50%, minconf = 50%
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3,
{Beer, Diaper}:3
Association rules (many more exist):
Beer ⇒ Diaper (60%, 100%)
Diaper ⇒ Beer (60%, 75%)
Example-2: Mining Association Rules
For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
1. Mining Multi-Level Association Rules
A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts.
Data can be generalized by replacing low-level concepts within the data by their
higher-level concepts, or ancestors, from a concept hierarchy.
Multilevel association rules are mined using variations of level-wise
search. Three variations are:
i. Using uniform minimum support for all levels (referred to as uniform
support): the same minimum support threshold is used when mining at each
level of abstraction.
ii. Using reduced minimum support at lower levels (referred to as reduced
support): each level of abstraction has its own minimum support threshold.
This is a better approach than uniform support, since more itemsets are
generated at the lower levels.
iii. Using item- or group-based minimum support (referred to as group-based
support): because users or experts often have insight as to which groups are
more important than others, it is sometimes more desirable to set up user-
specific, item-based, or group-based minimum support thresholds when mining
multilevel rules.
Problem: when mining frequent itemsets, the support of a low-level item
depends on its ancestor, and a rule on the low-level item may be redundant
with respect to the corresponding rule on its ancestor. E.g., laptop
computer has the ancestor computer:
buys(X, “laptop computer”) ⇒ buys(X, “HP printer”) [support = 8%, confidence = 70%]
buys(X, “IBM laptop computer”) ⇒ buys(X, “HP printer”) [support = 2%, confidence = 72%]
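The reason low-level rules carry smaller support can be shown with a small sketch. The concept hierarchy below is hypothetical (the item names are illustrative, not from the source data); each transaction counts toward its items and all of their ancestors:

```python
# Hypothetical concept hierarchy: item -> parent (more general) concept
hierarchy = {
    "IBM laptop computer": "laptop computer",
    "Dell laptop computer": "laptop computer",
    "laptop computer": "computer",
    "desktop computer": "computer",
}

def ancestors(item):
    """All higher-level concepts of `item` along the hierarchy."""
    chain = []
    while item in hierarchy:
        item = hierarchy[item]
        chain.append(item)
    return chain

def multilevel_counts(transactions):
    """Count each transaction toward its items and all their ancestors."""
    counts = {}
    for t in transactions:
        concepts = set()
        for item in t:
            concepts.add(item)
            concepts.update(ancestors(item))
        for c in concepts:
            counts[c] = counts.get(c, 0) + 1
    return counts

db = [{"IBM laptop computer"}, {"Dell laptop computer"}, {"desktop computer"}]
print(multilevel_counts(db))
# "computer" gets count 3, "laptop computer" 2, each leaf item only 1
```

Support can only grow (or stay equal) when moving up the hierarchy, which is exactly why the "IBM laptop computer" rule above has 2% support while its ancestor's rule has 8%, and why reduced support thresholds are used at lower levels.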
2. Mining Multi-Dimensional Association Rules
Multidimensional association rules involve more than one
dimension or predicate (e.g., rules relating what a
customer buys as well as the customer's age).
Multidimensional association rules with no repeated predicates
are called interdimensional association rules.
Multidimensional association rules with repeated predicates, i.e.,
containing multiple occurrences of some predicate, are called
hybrid-dimensional association rules.
a) Inter-dimension association rules (no repeated predicates):
age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
Fig: A 2-D grid for tuples representing customers who purchase high-definition TVs