
UNIT – IV

 Mining Frequent Patterns, Associations and Correlations: The
objective is the discovery of interesting associations and
correlations among itemsets in transactional and relational
databases.

 The discovery of frequent patterns, association, and correlation
relationships among huge amounts of data is useful in selective
marketing, decision analysis, and business management.

 A popular area of application is market basket analysis, which
studies the buying habits of customers by searching for sets of
items that are frequently purchased together or in sequence.
 This unit consists of:
1. What is frequent pattern analysis?
2. Efficient and scalable frequent itemset mining methods
3. Mining various kinds of association rules
4. From association mining to correlation analysis
5. Constraint-based association mining
Fig: Mining Frequent Patterns, Association and Correlations — the five
topics of this unit: (1) frequent pattern analysis, (2) mining methods,
(3) kinds of association rules, (4) association mining to correlation
analysis, (5) constraint-based association mining.
1. What is Frequent Pattern Analysis?
 Frequent Pattern: A pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set is called a
frequent pattern.

 First proposed by Agrawal, Imielinski, and Swami in the context of
frequent itemsets and association rule mining.

 Motivation:
– Finding inherent regularities in data
– What products were often purchased together? - Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?

Fig: Market Basket Analysis (e.g., milk and bread purchased together)


 Why is Frequent Pattern Mining Important?
 It discloses an intrinsic and important property of data sets
 It forms the foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential, structural (Ex: sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series, and
stream data

– Classification: Associative classification


– Cluster analysis: Frequent pattern-based clustering
– Data warehousing: Iceberg cube and cube-gradient
– Semantic data compression: fascicles (a fascicle is a bundle or cluster)
– Broad applications
 A transaction DB is a collection of sets of items
(transactions).
 An itemset is a set of items.
 An association rule is an implication of the form X=>Y,
with X and Y itemsets.

 Support count (SC) of an itemset X is the number of
transactions that contain X.
 Support of X (also the frequency of X) = SC(X) / SC({ }), where
SC({ }) is the total number of transactions, since every transaction
contains the empty itemset.
 Support of an association rule X => Y equals the support of X U Y.
 Confidence of an association rule X => Y =
Support(X => Y) / Support(X) = SC(X U Y) / SC(X).
(These definitions are illustrated in the sketch below.)
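To make the definitions concrete, here is a minimal Python sketch (the
helper names and the tiny transaction database are illustrative, not from
the text) that computes support and confidence exactly as defined above.

```python
def support_count(db, itemset):
    """SC(X): number of transactions that contain every item of X."""
    items = set(itemset)
    return sum(1 for t in db if items <= set(t))

def support(db, itemset):
    """Support of X = SC(X) / total number of transactions."""
    return support_count(db, itemset) / len(db)

def confidence(db, lhs, rhs):
    """Confidence of lhs => rhs = SC(lhs U rhs) / SC(lhs)."""
    return support_count(db, set(lhs) | set(rhs)) / support_count(db, lhs)

# Hypothetical transaction database (each transaction is a set of items).
db = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"}, {"milk", "butter"}]
print(support(db, {"milk", "bread"}))       # 2/4 = 0.5
print(confidence(db, {"milk"}, {"bread"}))  # 2/3 ≈ 0.67
```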
Association rule mining - Two-step Process

1. Find all frequent itemsets: by definition, each of these itemsets
occurs at least as frequently as a predetermined minimum support
count, min_sup.

2. Generate strong association rules from the frequent itemsets: by
definition, these rules must satisfy minimum support and minimum
confidence.
Example

Transaction-id | Items bought
10             | A, B, D
20             | A, C, D
30             | A, D, E
40             | B, E, F
50             | B, C, D, E, F

 Itemset X = {x1, …, xk}
 Find all the rules X => Y with minimum support and confidence
– support, s: probability that a transaction contains X U Y
– confidence, c: conditional probability that a transaction having X
also contains Y
 Let supmin = 50%, confmin = 50%
 Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
 Association rules:
A => D (support 60%, confidence 100%)
D => A (support 60%, confidence 75%)
Classification of Frequent Pattern Mining Ways
 Market basket analysis is just one form of frequent pattern mining.
 Frequent pattern mining can be classified in various ways, based on the
following criteria:
a) Based on the completeness of patterns to be mined
Complete set of Frequent Itemsets
Closed Frequent Itemsets and Maximal Frequent Itemsets
Constrained Frequent Itemsets, Approximate FI Sets
Near-match frequent itemsets, Top-k Frequent Itemsets
b) Based on the levels of abstraction involved in the rule set
buys(X, "computer") => buys(X, "HP printer")
buys(X, "laptop computer") => buys(X, "HP printer")
c) Based on the number of data dimensions involved in the rule
buys(X, "computer") => buys(X, "antivirus software")
age(X, "30-39") ^ income(X, "42K-48K") => buys(X, "HDTV")
Classification of Frequent Pattern Mining Ways (contd.)
d) Based on the types of values handled in the rule
 Boolean association rules and quantitative association rules
e) Based on the kinds of rules to be mined
 Association rules, correlation rules, strong gradient relationships
 Eg: The average sales of Sony_Digital_Camera increase over 16%
when sold together with Sony_Laptop_Computer
f) Based on the kinds of patterns to be mined
 Frequent itemset mining
 Sequential pattern mining
 Structured pattern mining
Advantages
 Helps increase business in the marketplace.
 Useful in stock market analysis.
 Both Apriori and FP-growth aim to find the complete set of
frequent patterns.
 FP-growth is more efficient and scalable than Apriori with respect
to prolific and long patterns.
Applications

 Basket data analysis
 Cross-marketing
 Catalog design
 Sale campaign analysis
 Web log (click stream) analysis
 DNA sequence analysis
2. SCALABLE METHODS FOR MINING FREQUENT PATTERNS

 A pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set is called a frequent
pattern.
 Many efficient and scalable algorithms have been
developed for frequent itemset mining from which
association and correlation rules can be derived.

 The downward closure property of frequent patterns
– Any subset of a frequent itemset must also be frequent
– If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e.,
every transaction having {beer, diaper, nuts} also contains
{beer, diaper} (see the pruning sketch below)
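As a small illustration of how this property is used for pruning, the
hypothetical helper below rejects a candidate k-itemset as soon as one of
its (k-1)-subsets is not frequent (a sketch only; the names are mine, not
from the text).

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Return True if any (k-1)-subset of `candidate` is not in L_{k-1}.

    By the downward-closure property, such a candidate cannot be frequent
    and can be pruned without counting its support.
    """
    k = len(candidate)
    return any(frozenset(s) not in frequent_prev
               for s in combinations(candidate, k - 1))

# Hypothetical L2 (frequent 2-itemsets).
L2 = {frozenset({"beer", "diaper"}), frozenset({"beer", "nuts"}),
      frozenset({"diaper", "nuts"})}
print(has_infrequent_subset(("beer", "diaper", "nuts"), L2))   # False -> keep
print(has_infrequent_subset(("beer", "diaper", "milk"), L2))   # True  -> prune
```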
 Efficient and scalable frequent itemset mining methods
a) Finding Frequent Itemsets Using Candidate Generation
b) Generating Association Rules from Frequent Itemsets
c) Improving the Efficiency of Apriori
d) Mining Frequent Itemsets without Candidate Generation
e) Mining Frequent Itemsets Using Vertical Data Format
f) Mining Closed Frequent Itemsets

 Scalable mining methods: three major approaches
1. Apriori algorithm – Agrawal & Srikant
2. Frequent pattern growth (FP-growth) – Han, Pei & Yin
3. Vertical data format approach (CHARM) – Zaki & Hsiao

1 - Apriori Algorithm: a seminal algorithm for mining frequent
itemsets for Boolean association rules.

 Proposed by R. Agrawal and R. Srikant in 1994.
 Apriori uses a "bottom-up" approach, where frequent subsets are
extended one item at a time (a step known as candidate generation),
and groups of candidates are tested against the data.

 The Apriori algorithm is an efficient algorithm for finding all
frequent itemsets.
 The Apriori algorithm implements a level-wise search using the
frequent itemset (Apriori) property.
 It is used to find the frequent itemsets of the given data sets with
candidate generation.
 The algorithm terminates when no further successful extensions
are found.
 Apriori uses breadth-first search and a hash tree structure to count
candidate itemsets efficiently.
 It is an iterative, level-wise search, where k-itemsets are used to
explore (k+1)-itemsets.
 The Apriori algorithm can be further optimized (see "Improving
the Efficiency of Apriori" below).
 There are many interestingness measures for association rules.

 Apriori property: all nonempty subsets of a frequent itemset
must also be frequent.
Ex: 1 – Find the frequent patterns with minimum support count = 2 (Supmin = 2)

Database TDB
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
           L1: {A}:2, {B}:3, {C}:3, {E}:3        ({D} is dropped: support 1 < 2)

C2 (join L1 with itself): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
           L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (join L2 with itself, then prune): {B,C,E}
3rd scan → L3: {B,C,E}:2
Ex: 2 – Construct the Apriori frequent itemsets with minimum support count = 2.
(Data set and worked solution given as figures.)
Ex: 3 – Find the frequent patterns with minimum support = 50% (i.e., support count 2)

Database D
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
         L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
         L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}
Scan D → L3: {2 3 5}:2
The Apriori algorithm consists of:

1. Finding frequent itemsets using candidate generation.
2. Generating association rules from frequent itemsets.
3. Improving the efficiency of Apriori.
1. Finding Frequent Itemsets Using Candidate
Generation
 Apriori property: All nonempty subsets of a frequent
itemset must also be frequent.
 A two-step process is followed, consisting of join and
prune actions.

1. The join step: to find Lk, a set of candidate k-itemsets Ck is
generated by joining Lk-1 with itself.

2. The prune step: Ck is a superset of Lk; its members may or may
not be frequent, but all of the frequent k-itemsets are included in Ck.
Candidates having an infrequent (k-1)-subset are pruned using the
Apriori property. A sketch of both steps follows.
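The following Python sketch illustrates the join and prune steps together
with the surrounding level-wise loop. It is an illustrative sketch, not the
textbook pseudocode: itemsets and transactions are represented as
frozensets, and the helper names are mine.

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join L_{k-1} with itself, then prune candidates that have an
    infrequent (k-1)-subset (Apriori property)."""
    prev = set(L_prev)                         # frozensets of size k-1
    prev_list = sorted(prev, key=sorted)
    candidates = set()
    # Join step: keep unions of two (k-1)-itemsets that have exactly k items.
    for i in range(len(prev_list)):
        for j in range(i + 1, len(prev_list)):
            union = prev_list[i] | prev_list[j]
            if len(union) == k:
                candidates.add(union)
    # Prune step: every (k-1)-subset of a surviving candidate must be frequent.
    return {c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

def apriori(db, min_sup_count):
    """Level-wise search: use L_k to explore C_{k+1} until no candidates survive."""
    db = [frozenset(t) for t in db]
    items = {frozenset([i]) for t in db for i in t}
    L = {c for c in items if sum(c <= t for t in db) >= min_sup_count}
    k, all_frequent = 2, set(L)
    while L:
        Ck = apriori_gen(L, k)
        L = {c for c in Ck if sum(c <= t for t in db) >= min_sup_count}
        all_frequent |= L
        k += 1
    return all_frequent

# Hypothetical run on the TDB of Example 1 (minimum support count = 2).
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(sorted(sorted(s) for s in apriori(db, 2)))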
2. Generating Association Rules from
Frequent Itemsets
Association rule mining consists of first finding frequent itemsets
(sets of items, such as A and B, satisfying a minimum support
threshold, i.e., a percentage of the task-relevant tuples), from which
strong association rules of the form A => B are generated.

 These rules also satisfy a minimum confidence threshold (a
prespecified probability of satisfying B under the condition that A
is satisfied).
 A rule is an implication expression of the form X => Y, where X
and Y are itemsets. Eg: {Milk, Diaper} => {Beer}
 Strong association rules satisfy both minimum support and
minimum confidence.
 Confidence(A => B) = P(B|A) = Support count(A U B) /
Support count(A)

 Association rules can be generated as follows (see the sketch below):
1. For each frequent itemset l, generate all nonempty proper subsets of l.
2. For every nonempty proper subset s of l, output the rule "s => (l - s)"
if support count(l) / support count(s) >= min_conf, where min_conf is
the minimum confidence threshold.
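A sketch of this rule-generation step in Python, assuming the support
counts of all frequent itemsets are already available (the hard-coded
counts below are consistent with the confidences in the AllElectronics
example that follows; the function and variable names are illustrative).

```python
from itertools import combinations

def generate_rules(support_counts, min_conf):
    """For each frequent itemset l, emit every rule s => (l - s) whose
    confidence = SC(l) / SC(s) meets the minimum confidence threshold."""
    rules = []
    for l, sc_l in support_counts.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in combinations(l, r):
                s = frozenset(s)
                conf = sc_l / support_counts[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# Hypothetical support counts (consistent with the example below).
support_counts = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}
for lhs, rhs, conf in generate_rules(support_counts, 0.7):
    print(lhs, "=>", rhs, f"(confidence {conf:.0%})")
```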
 Generating association rules. Ex: The transactional data for All
Electronics.
 Suppose the data contain the frequent itemset l = {I1, I2, I5}. What
are the association rules that can be generated from l? The
nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2},
and {I5}.
 The resulting association rules are as shown below, each listed
with its confidence:
• I1 ^ I2 => I5, confidence = 2/4 = 50%
• I1 ^ I5 => I2, confidence = 2/2 = 100%
• I2 ^ I5 => I1, confidence = 2/2 = 100%
• I1 => I2 ^ I5, confidence = 2/6 = 33%
• I2 => I1 ^ I5, confidence = 2/7 = 29%
• I5 => I1 ^ I2, confidence = 2/2 = 100%
 If the minimum confidence threshold is, say, 70%, then only the
second, third, and last rules above are output, because these are the
only ones generated that are strong.
3. Improving the Efficiency of Apriori

Many variations of the Apriori algorithm have been proposed to
improve its efficiency:
i. Hash-based technique: hashing itemsets into corresponding
buckets (sketched below).
ii. Transaction reduction: reducing the number of transactions
scanned in future iterations.
iii. Partitioning: partitioning the data to find candidate itemsets.
iv. Sampling: mining on a subset of the given data.
v. Dynamic itemset counting: adding candidate itemsets at
different points during a scan.
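As one illustration, a minimal sketch of the hash-based technique (i):
while the first scan counts 1-itemsets, all 2-itemsets of each transaction
are hashed into buckets, and a candidate 2-itemset whose bucket count is
below the minimum support count cannot be frequent and is pruned early.
The bucket count and hash function here are arbitrary choices, not part of
the original algorithm description.

```python
from itertools import combinations

def hash_bucket_counts(db, num_buckets=8):
    """Scan once: hash every 2-itemset of every transaction into a bucket."""
    buckets = [0] * num_buckets
    for t in db:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % num_buckets] += 1
    return buckets

def prune_c2(candidates, buckets, min_sup_count, num_buckets=8):
    """A candidate 2-itemset whose bucket count is below min_sup_count
    cannot be frequent, so it is pruned before the second full scan."""
    return [c for c in candidates
            if buckets[hash(tuple(sorted(c))) % num_buckets] >= min_sup_count]

# Hypothetical data: the TDB from Example 1, minimum support count = 2.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
buckets = hash_bucket_counts(db)
print(prune_c2([("A", "B"), ("B", "E")], buckets, min_sup_count=2))
```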
Advantages
 Uses large itemset property.
 Easily parallelized.
 Easy to implement.

Disadvantages
 Multiple scans of the transaction database.
 Huge number of candidates.
 Tedious workload of support counting for candidates.
 Assumes the transaction database is memory resident.
 Requires many database scans.
Ideas for improvement
 Reduce the number of passes over the transaction database.
 Shrink the number of candidates.
 Facilitate the support counting of candidates.
Applications

 Basket data analysis
 Cross-marketing
 Catalog design
 Sale campaign analysis
 Web log (click stream) analysis
 DNA sequence analysis
3-Mining Frequent Patterns Without Candidate
Generation – FP Growth Algorithm
 It was proposed by Han, Pei & Yin in 2000.
 Frequent pattern growth (FP-growth) is a method of mining
frequent itemsets without candidate generation.
 It constructs a highly compact data structure (an FP-tree) to
compress the original transaction database.
 It focuses on frequent pattern (fragment) growth, which avoids
costly candidate generation, resulting in greater efficiency.
 It follows the frequent pattern growth approach: create the root
of the tree, labeled "null", and grow branches by inserting each
transaction as a prefix path of the tree.
 The FP-growth algorithm adopts a divide-and-conquer strategy.
 It avoids generating large candidate sets, leading to a good
performance gain.
 It compresses the database of frequent items into a frequent-
pattern tree (FP-tree), which retains the itemset association
information.

 FP-growth proceeds as follows (sketched in code below):
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Sort the frequent items in descending order of frequency (the f-list).
3. Scan the DB again and construct the FP-tree.
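A condensed Python sketch of these three steps, plus a helper that derives
an item's conditional pattern base from the finished tree. The classes and
names are simplified placeholders under the assumptions of the example
below (AllElectronics data, minimum support count 2); this is not a full
FP-growth implementation.

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(db, min_sup_count):
    # Step 1: scan the DB once and find the frequent 1-itemsets.
    counts = Counter(item for t in db for item in t)
    freq = {i: c for i, c in counts.items() if c >= min_sup_count}
    root = FPNode(None, None)
    header = defaultdict(list)            # item -> list of nodes (node-links)
    for t in db:
        # Step 2: keep frequent items, sorted by descending frequency (f-list order).
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        # Step 3: insert the ordered transaction into the tree, sharing prefixes.
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header

def conditional_pattern_base(item, header):
    """Prefix paths ending at `item`, each carrying that node's count."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

# Hypothetical run on the AllElectronics transactions (min support count = 2).
db = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
      {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
root, header = build_fp_tree(db, 2)
print(conditional_pattern_base("I5", header))   # [(['I2','I1'], 1), (['I2','I1','I3'], 1)]
```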
Eg: 1. AllElectronics branch transactional data, mined with the
FP-growth algorithm, minimum support count = 2.

Count how many times each item occurs in the table
(item counts: I2:7, I1:6, I3:6, I4:2, I5:2).
Order the items of each transaction by descending frequency (priority)

TID  | ITEMS        | ORDERED (by priority)
T100 | I1,I2,I5     | I2,I1,I5
T200 | I2,I4        | I2,I4
T300 | I2,I3        | I2,I3
T400 | I1,I2,I4     | I2,I1,I4
T500 | I1,I3        | I1,I3
T600 | I2,I3        | I2,I3
T700 | I1,I3        | I1,I3
T800 | I1,I2,I3,I5  | I2,I1,I3,I5
T900 | I1,I2,I3     | I2,I1,I3
Step 1: Create the root of the tree, labeled "null".
Insert T100 (ordered: I2, I1, I5) as the path null → I2:1 → I1:1 → I5:1.
Step 2: Scan the DB for the second transaction, T200 (ordered: I2, I4).
The prefix I2 is shared, so its count becomes I2:2, and a new child I4:1
is added under I2 (the existing branch I1:1 → I5:1 is unchanged).
Step 3: Scan the DB for the third transaction, T300 (ordered: I2, I3).
I2's count becomes I2:3 and a new child I3:1 is added under I2.
Step 4: Scan the DB for the fourth transaction, T400 (ordered: I2, I1, I4).
The shared prefix counts become I2:4 and I1:2, and a new child I4:1 is
added under I1.
Step 5: Scan the DB for the fifth transaction, T500 (ordered: I1, I3).
There is no shared prefix with I2, so a new branch I1:1 → I3:1 is created
directly under the root.
Step 6: Scan the DB for the sixth transaction, T600 (ordered: I2, I3).
I2's count becomes I2:5 and the I3 child of I2 becomes I3:2.
Step 7: Scan the DB for the seventh transaction, T700 (ordered: I1, I3).
The root-level branch becomes I1:2 → I3:2.
Step 8: Scan the DB for the eighth transaction, T800 (ordered: I2, I1, I3, I5).
The counts along the shared prefix are incremented (I2:6, I1:3) and the
path is extended with I3 and I5.
Step 9: Scan the DB for the ninth transaction, T900 (ordered: I2, I1, I3),
incrementing the counts along its path.

Fig: The completed FP-tree registers compressed frequent-pattern information.
Fig: Mining the FP-tree by creating conditional (sub-)pattern bases.
E.g. 3: A database has five transactions. Let min_sup = 60% and
min_conf = 80%; mine it with the FP-growth algorithm (minimum
support count = 3).

Support: 60% of 5 transactions = (60 × 5) / 100 = 3 transactions.

Confidence: first identify the possible subsets S of each frequent
itemset L and form rules S => (L - S); then calculate each rule's
confidence:

Confidence(A => B) = (# of tuples containing both A and B) /
                     (# of tuples containing A)
Fig: The FP-growth algorithm for discovering frequent itemsets
without candidate generation.
Advantages
1. No candidate generation.
2. No candidate testing.
3. No repeated scans of the entire database.
4. Never breaks a long pattern of any transaction.
5. Reduces irrelevant information: infrequent items are removed.
6. Items are kept in frequency-descending order.
7. The FP-tree is never larger than the original database.
8. Preserves complete information for frequent pattern mining.
4- Vertical Data Format (ECLAT)
 The vertical data format approach (CHARM) was proposed by Zaki &
Hsiao in 2000.

 Mining frequent itemsets using the vertical data format (ECLAT, i.e.,
Equivalence CLass Transformation) is a method that transforms a given
data set of transactions from the horizontal data format of TID-itemset
into the vertical data format of item-TID_set.

 Frequent itemsets can also be mined efficiently using the vertical data
format, which is the essence of the ECLAT algorithm.

 Data can also be presented in item-TID_set format (that is,
{item : TID_set}), where item is an item name and TID_set is the set of
identifiers of the transactions containing the item; this format is known
as the vertical data format.
Eg: 1. AllElectronics branch transactional data, mined with ECLAT,
minimum support count = 2.

 It mines the transformed data set by TID_set intersections, based on
the Apriori property and additional optimization techniques such as
diffsets.

Step 1: Transform the transaction data set D into the vertical data
format of 1-itemsets (item → TID_set).
Step 2: Intersect the TID_sets of frequent 1-itemsets to obtain the
frequent 2-itemsets.
Step 3: Intersect the TID_sets of frequent 2-itemsets to obtain the
frequent 3-itemsets. (The tables for each step are given as figures;
a small sketch of the idea follows.)
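A minimal Python sketch of this transformation and of support computation
by TID_set intersection. The data and names are illustrative; only frequent
2-itemsets are shown, but longer itemsets are obtained the same way by
intersecting further.

```python
from collections import defaultdict
from itertools import combinations

def to_vertical(db):
    """Horizontal (TID -> itemset) format to vertical (item -> TID_set) format."""
    vertical = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            vertical[item].add(tid)
    return vertical

def frequent_pairs(vertical, min_sup_count):
    """Support of a 2-itemset = size of the intersection of its TID_sets."""
    out = {}
    for a, b in combinations(sorted(vertical), 2):
        tids = vertical[a] & vertical[b]
        if len(tids) >= min_sup_count:
            out[(a, b)] = tids
    return out

# Hypothetical horizontal database (TID -> items), minimum support count = 2.
db = {"T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"}, "T300": {"I2", "I3"},
      "T400": {"I1", "I2", "I4"}, "T500": {"I1", "I3"}}
vertical = to_vertical(db)
print(frequent_pairs(vertical, 2))
```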
Mining Closed Frequent Itemsets
 Frequent itemset mining may generate a huge number of frequent
itemsets, especially when the min_sup threshold is set low or when
there exist long patterns in the data set.

 Closed frequent itemsets: An itemset X is closed in a data set S if
there exists no proper super-itemset Y such that Y has the same
support count as X in S. X is a closed frequent itemset in S if X is
both closed and frequent in S.

 Maximal frequent itemsets: An itemset X is a maximal frequent
itemset (or max-itemset) in set S if X is frequent, and there exists
no super-itemset Y such that X ⊂ Y and Y is frequent in S.

 Example (two transactions {a1, ..., a100} and {a1, ..., a50}, min_sup = 1):
Closed frequent itemsets C = {{a1, a2, ..., a100} : 1; {a1, a2, ..., a50} : 2}
Maximal frequent itemsets M = {{a1, a2, ..., a100} : 1}

 It is more desirable in most cases to mine the set of closed frequent
itemsets rather than the set of all frequent itemsets, since the closed
set is much smaller yet loses no support information (see the sketch
below).
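Given the supports of all frequent itemsets, closed and maximal itemsets
follow directly from the definitions. The sketch below is illustrative
only; the tiny `freq` dictionary is made up.

```python
def closed_and_maximal(freq):
    """`freq` maps frozenset -> support count for every frequent itemset.

    Closed:  no proper superset has the same support.
    Maximal: no proper superset is frequent at all.
    """
    closed, maximal = set(), set()
    for x, sup_x in freq.items():
        supersets = [y for y in freq if x < y]
        if all(freq[y] != sup_x for y in supersets):
            closed.add(x)
        if not supersets:
            maximal.add(x)
    return closed, maximal

# Hypothetical supports, in the spirit of the {a1..a100}:1, {a1..a50}:2 example.
freq = {frozenset({"a"}): 3, frozenset({"a", "b"}): 3, frozenset({"a", "b", "c"}): 1}
closed, maximal = closed_and_maximal(freq)
print(closed)    # {a,b} and {a,b,c} are closed; {a} is not ({a,b} has the same support)
print(maximal)   # only {a,b,c} is maximal
```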
Example (mining max-patterns)

Tid | Items
10  | A, B, C, D, E
20  | B, C, D, E
30  | A, C, D, F

 1st scan: find frequent items
– A, B, C, D, E
 2nd scan: find support for the potential max-patterns
– AB, AC, AD, AE, ABCDE
– BC, BD, BE, BCDE
– CD, CE, DE, CDE
 Since BCDE is a max-pattern, there is no need to check BCD, BDE,
CDE in a later scan.
 Techniques for mining closed frequent itemsets include item merging,
sub-itemset pruning, and item skipping, as well as efficient subset
checking of generated itemsets in a pattern-tree.
5. Mining Various Kinds of Association Rules
 Multilevel association rules involve concepts at different levels
of abstraction.
 Multidimensional association rules involve more than one
dimension or predicate (e.g., rules relating what a customer buys
as well as the customer’s age).
 Quantitative association rules involve numeric attributes that
have an implicit ordering among values (e.g., age).

It consists of:
1. Mining Multilevel Association Rules
2. Mining Multidimensional Association Rules
3. Mining Multidimensional Association Rules - Static Discretization of
Quantitative Attributes
4. Mining Quantitative Association Rules - ARCS

Example-1: Mining Association Rules

Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

 Find all the rules X => Y with minimum support and confidence
– Support, s: probability that a transaction contains X U Y
– Confidence, c: conditional probability that a transaction having X
also contains Y
 Let minsup = 50%, minconf = 50%
 Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
 Association rules (many more exist):
Beer => Diaper (support 60%, confidence 100%)
Diaper => Beer (support 60%, confidence 75%)
Example-2: Mining Association Rules (min. support 50%, min. confidence 50%)

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A, C}           | 50%

 For the rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
1. Mining Multi-Level Association Rules
 A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher level, more general concepts.
 Data can be generalized by replacing low-level concepts within the data by their
higher-level concepts, or ancestors, from a concept hierarchy (a small sketch follows).
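A small sketch of generalization with a concept hierarchy: each low-level
item is replaced by its ancestor before (or while) mining, so higher-level
itemsets such as {computer, printer} accumulate the support of all their
descendants. The hierarchy dictionary below is hypothetical.

```python
# Hypothetical concept hierarchy: low-level item -> higher-level ancestor.
hierarchy = {
    "IBM laptop computer": "laptop computer",
    "laptop computer": "computer",
    "desktop computer": "computer",
    "HP printer": "printer",
}

def generalize(item, level_up=1):
    """Climb the hierarchy `level_up` steps (stay put if no ancestor exists)."""
    for _ in range(level_up):
        item = hierarchy.get(item, item)
    return item

def generalize_db(db, level_up=1):
    return [{generalize(i, level_up) for i in t} for t in db]

db = [{"IBM laptop computer", "HP printer"}, {"desktop computer", "HP printer"}]
print(generalize_db(db, level_up=1))  # [{'laptop computer', 'printer'}, {'computer', 'printer'}]
print(generalize_db(db, level_up=2))  # [{'computer', 'printer'}, {'computer', 'printer'}]
```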
The multilevel association rule mining is done based on variations at level wise
search.
Three variation are:
i. Using uniform minimum support for all levels (referred to as uniform
support): The same minimum support threshold is used when mining at each
level of abstraction.
ii. Using reduced minimum support at lower levels (referred to as reduced
support): Each level of abstraction has its own minimum support threshold.
This is a better approach than uniform support, because more itemsets can
be generated at the lower levels.
iii. Using item or group-based minimum support (referred to as group-based
support): Because users or experts often have insight as to which groups are
more important than others, it is sometimes more desirable to set up user-
specific, item, or group based minimal support thresholds when mining
multilevel rules.

Problem (redundant multilevel rules): a low-level item is a descendant of its
ancestor (e.g., "laptop computer" has the ancestor "computer"), so a rule at a
lower level may carry little new information beyond its ancestor rule. For example:
 buys(X, “laptop computer”) => buys(X, “HP printer”) [support = 8%, confidence = 70%]
 buys(X, “IBM laptop computer”) => buys(X, “HP printer”) [support = 2%, confidence = 72%]
2. Mining Multi-Dimensional Association
 Multidimensional association rules involve more than one
dimension or predicate (Ex: Rules relating what a
customer buys as well as the customer’s age.).
 Multidimensional association rules with no repeated predicates
are called interdimensional association rules.
 Mine multidimensional association rules with repeated predicates,
which contain multiple occurrences of some predicates, these
rules are called Hybrid-dimensional association rules.

 Categorical attributes: finite number of possible values, no ordering
among the values; handled with the data cube approach.
 Quantitative attributes: numeric, with an implicit ordering among
values; handled with discretization, clustering, and gradient approaches.
 Types of multi-dimensional rules:
1. Single-dimensional / intra-dimensional rules:
 buys(X, “milk”) => buys(X, “bread”)
2. Multi-dimensional / inter-dimensional rules: >= 2 dimensions or predicates
a) Inter-dimension association rules (no repeated predicates):
 age(X, ”19-25”) ^ occupation(X, “student”) => buys(X, “coke”)
b) Hybrid-dimension association rules (repeated predicates):
 age(X, ”19-25”) ^ buys(X, “popcorn”) => buys(X, “coke”)
3. Hybrid-dimensional rule:
age(X, “20…29”) ^ buys(X, “laptop”) => buys(X, “HP printer”)
3. Mining Multidimensional Associations - Static Discretization

 Quantitative attributes, in this case, are discretized before mining
using predefined concept hierarchies or data discretization
techniques, where numeric values are replaced by interval labels.
4. Mining Quantitative Association Rules - ARCS
 Quantitative association rules are multidimensional association rules
in which the numeric attributes are dynamically discretized during
the mining process so as to satisfy some mining criteria, such as
maximizing the confidence or compactness of the rules mined.
 Techniques can be categorized by how numerical attributes, such as
age or salary, are treated.
 ARCS (Association Rule Clustering System):
 ARCS uses equal-width binning, where the bin size for each
quantitative attribute is input by the user. Common binning strategies
(the first two are sketched below) are:
1. Equal-width binning: the interval size of each bin is the same.
2. Equal-frequency binning: each bin has approximately the same
number of tuples assigned to it.
3. Clustering-based binning: clustering is performed on the
quantitative attribute to group neighboring points (judged based on
various distance measures) into the same bin.
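A small sketch of equal-width and equal-frequency binning for a numeric
attribute such as age (pure Python; the bin count and data values are
made up for illustration).

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal size."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    edges = [lo + i * width for i in range(k + 1)]
    edges[-1] = hi                      # guard against floating-point drift
    return edges

def equal_frequency_bins(values, k):
    """Choose edges so that each bin holds roughly the same number of tuples."""
    s = sorted(values)
    n = len(s)
    return [s[0]] + [s[(i * n) // k] for i in range(1, k)] + [s[-1]]

ages = [23, 25, 29, 31, 34, 35, 38, 41, 45, 52, 58, 63]
print(equal_width_bins(ages, 4))        # [23.0, 33.0, 43.0, 53.0, 63]
print(equal_frequency_bins(ages, 4))    # 3 tuples per bin: [23, 31, 38, 52, 63]
```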
Ex: Clustering the association rules. Given the rules:
age(X, 34) ^ income(X, “31K…40K”) => buys(X, “HDTV”)
age(X, 35) ^ income(X, “31K…40K”) => buys(X, “HDTV”)
age(X, 34) ^ income(X, “41K…50K”) => buys(X, “HDTV”)
age(X, 35) ^ income(X, “41K…50K”) => buys(X, “HDTV”)
these rules cluster together in the grid and can be combined into the single rule
age(X, “34…35”) ^ income(X, “31K…50K”) => buys(X, “HDTV”).

Fig: A 2-D grid for tuples representing customers who purchase high-definition TVs