
Data Mining and Predictive Modeling
Lecture 9: Association Rule Mining, Apriori Algorithm
Association Rule Mining
• It is an important data mining model studied extensively by the
database and data mining community.
• Assume all data are categorical.
• Initially used for market basket analysis to find how items purchased
by customers are related, e.g.:

Bread → Milk [sup = 5%, conf = 100%]

(Read: 5% of all transactions contain both bread and milk, and every
transaction containing bread also contains milk.)
Frequent Pattern Analysis
• Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set.
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the
context of frequent itemsets and association rule mining.
• Motivation: Finding inherent regularities in data
• What products were often purchased together?
• What are the subsequent purchases after buying a PC?
• Can we automatically classify web documents?
• Applications:
• Basket data analysis, cross-marketing, catalog design, sale campaign analysis
Market Basket Analysis
Why Frequent Pattern Mining?
• Freq. pattern: An intrinsic and important property of datasets
• Foundation for many essential data mining tasks
• Association, correlation, and causality analysis
• Sequential, structural (e.g., sub-graph) patterns
• Pattern analysis in spatiotemporal, multimedia, time-series, and
stream data
• Classification: frequent pattern analysis
• Cluster analysis: frequent pattern-based clustering
• Data warehousing: iceberg cube and cube-gradient
• Semantic data compression
• Broad applications
Basic Concepts: Frequent Patterns

• I = {i1, i2, …, im}: a set of items.
• Transaction t: a set of items such that t ⊆ I.
• Transaction Database T: a set of transactions T = {t1, t2, …, tn}.
Transaction Data: Supermarket
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
• Concepts:
• An item: an item/article in a basket
• I: the set of all items sold in the store
• A transaction: the items purchased in one basket; it may have a TID
(transaction ID)
• A transactional dataset: A set of transactions
The Model: Rules
• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
• An association rule is an implication of the form
  X → Y, where X, Y ⊂ I and X ∩ Y = ∅.
• An itemset is a set of items.
  • E.g., X = {milk, bread, cereal} is an itemset.
• A k-itemset is an itemset with k items.
  • E.g., {milk, bread, cereal} is a 3-itemset.
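
To make these definitions concrete, here is a minimal Python sketch (the representation is our choice, not the slides'): itemsets as frozensets, transactions as plain sets.

# Itemsets as frozensets: hashable, so they can later serve as
# dictionary keys for support counting.
itemset = frozenset({"milk", "bread", "cereal"})   # a 3-itemset
transaction = {"milk", "bread", "cereal", "eggs"}

print(len(itemset))                   # k = 3
print(itemset.issubset(transaction))  # True: t contains X, since X ⊆ t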
Example

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

(Figure: Venn diagram of customers who buy beer, diaper, or both.)

• itemset: A set of one or more items
• k-itemset X = {x1, …, xk}
• (absolute) support, or support count, of X: frequency of occurrence of
the itemset X
• (relative) support, s: the fraction of transactions that contain X
(i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup
threshold
Association Rules

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
  • support, s: probability that a transaction contains X ∪ Y
  • confidence, c: conditional probability that a transaction containing
    X also contains Y
• Let minsup = 50%, minconf = 50%
  • Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
  • Association rules (among many more):
    • Beer → Diaper (support 60%, confidence 100%)
    • Diaper → Beer (support 60%, confidence 75%)
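
As an illustration (code and names are ours), a minimal Python sketch that computes the support and confidence values above from the transaction table:

# The five transactions from the slide (TIDs 10-50).
db = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(X):
    # Fraction of transactions that contain every item in X.
    return sum(X <= t for t in db) / len(db)

def confidence(X, Y):
    # Conditional probability P(Y | X) = sup(X ∪ Y) / sup(X).
    return support(X | Y) / support(X)

print(support({"Beer"} | {"Diaper"}))    # 0.6  -> 60% support
print(confidence({"Beer"}, {"Diaper"}))  # 1.0  -> Beer → Diaper, 100%
print(confidence({"Diaper"}, {"Beer"}))  # 0.75 -> Diaper → Beer, 75%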
Frequent Itemset Mining Methods
• Apriori: A Candidate Generation-and-Test Approach
• Improving the Efficiency of Apriori
• FPGrowth: A Frequent Pattern-Growth Approach
• Frequent Pattern Mining with Vertical Data Format

Apriori Algorithm
• Proposed by R. Agrawal and R. Srikant in 1994.
• The algorithm is so named because it uses prior knowledge of the
frequent itemset properties.
• Apriori employs an iterative approach known as a level-wise search,
where k-itemsets are used to explore (k+1)-itemsets.
Apriori Method
• Method:
• Initially, scan DB once to get frequent 1-itemset
• Generate length (k+1) candidate itemsets from length k frequent
itemsets
• Terminate when no frequent or candidate set can be generated
• To improve the efficiency of the level-wise generation of frequent
itemsets, the Apriori property is used to reduce the search space:
• All non-empty subsets of a frequent itemset must also be frequent
• If {beer, diaper, nuts} is frequent, so is {beer, diaper}
Apriori - Example

Let minsup = 2 (absolute support count).

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1:
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1 ({D} pruned, sup < 2):
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2 ({A, B} and {A, E} pruned):
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (generated from L2): {B, C, E}

3rd scan → L3:
Itemset     sup
{B, C, E}   2
Apriori - Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
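
Below is a self-contained Python sketch of this level-wise loop (naming and representation are ours; an absolute support count is assumed). It reproduces the four-transaction example above:

from itertools import combinations

def apriori(db, min_count):
    # db: list of transactions (sets of items); min_count: absolute support.
    # Returns a dict mapping each frequent itemset (frozenset) to its count.
    counts = {}
    for t in db:                      # one scan for the 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_count}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # Generate (k+1)-candidates from Lk, pruning any candidate
        # with an infrequent k-subset (the Apriori property).
        Ck1 = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(
                    frozenset(s) in Lk for s in combinations(u, k)
                ):
                    Ck1.add(u)
        counts = {c: 0 for c in Ck1}
        for t in db:                  # one scan per level
            for c in Ck1:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_count}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(apriori(db, 2))   # includes frozenset({'B','C','E'}) with count 2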
Apriori - Implementation
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4 = {abcd}
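
A compact sketch of these two steps on this example (we join by set union rather than the textbook prefix-based join; after pruning the result is identical):

from itertools import combinations

def gen_candidates(Lk, k):
    # Step 1 (self-join): unions of two frequent k-itemsets of size k+1.
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    # Step 2 (prune): keep only candidates whose k-subsets are all in Lk.
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
C4 = gen_candidates(L3, 3)
print(["".join(sorted(c)) for c in C4])  # ['abcd']; acde pruned (ade not in L3)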
Generating Association Rules from Frequent Itemsets
• Method
  • For each frequent itemset l, generate all nonempty proper subsets of l
  • For every nonempty subset s of l, output the rule s → (l − s) if
    support_count(l) / support_count(s) >= min_conf
• Because the rules are generated from frequent itemsets, each one
automatically satisfies the minimum support threshold
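
A Python sketch of this procedure (names are ours), demonstrated on the frequent itemset {B, C, E} and the support counts from the four-transaction example earlier:

from itertools import combinations

def gen_rules(l, support_count, min_conf):
    # For every nonempty proper subset s of l, emit s -> (l - s)
    # when support_count(l) / support_count(s) >= min_conf.
    rules = []
    for r in range(1, len(l)):
        for subset in combinations(l, r):
            s = frozenset(subset)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

sc = {frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
      frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
      frozenset("BCE"): 2}
for lhs, rhs, conf in gen_rules(frozenset("BCE"), sc, 0.7):
    print(lhs, "->", rhs, f"conf={conf:.2f}")
# Prints {B, C} -> {E} and {C, E} -> {B}, each with conf = 1.00.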
Generating Association Rules - Example
• Example: given a transaction database in which X = {I1, I2, I5} is a
frequent itemset, generate the association rules from X.
• The nonempty proper subsets of X are:
  {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}
• Each subset s yields the candidate rule s → (X − s):
  {I1, I2} → {I5}, {I1, I5} → {I2}, {I2, I5} → {I1},
  {I1} → {I2, I5}, {I2} → {I1, I5}, {I5} → {I1, I2}
• If the minimum confidence threshold is, say, 70%, then only the second,
third, and last rules are output, because these are the only ones
generated that are strong
Improving the Efficiency of Apriori
• Major computational challenges
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
• Improving Apriori: general ideas
• Reduce passes of transaction database scans
• Shrink number of candidates
• Facilitate support counting of candidates
Scan Database Only Twice
• Scan 1: partition the database and find the local frequent patterns in
each partition; any globally frequent itemset must be frequent in at
least one partition.
• Scan 2: count the union of the local frequent patterns against the full
database to consolidate the global frequent patterns.
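
A Python sketch of this two-scan scheme (partition count and names are illustrative; it assumes an apriori(db, min_count) function like the sketch above):

def two_scan_frequent(db, min_sup_frac, n_parts=4):
    # Scan 1: mine each partition locally at the same relative threshold.
    # The union of local frequent itemsets is a superset of the global ones.
    size = max(1, len(db) // n_parts)
    candidates = set()
    for i in range(0, len(db), size):
        part = db[i:i + size]
        local_min = max(1, int(min_sup_frac * len(part)))
        candidates |= set(apriori(part, local_min))
    # Scan 2: count the candidates once against the full database.
    global_min = min_sup_frac * len(db)
    return {c for c in candidates
            if sum(c <= t for t in db) >= global_min}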
Hash-Based Technique: Reduce the Number of Candidates
• A k-itemset whose corresponding hashing bucket count is below the
threshold cannot be frequent
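
A sketch of this hashing idea for 2-itemsets, in the spirit of the DHP technique (bucket count and hash function are illustrative choices): while performing the first scan, every pair in each transaction is hashed into a small bucket array; a pair whose bucket total falls below the threshold cannot be frequent, so it can be dropped from C2 before counting. Collisions only inflate bucket counts, so the filter never wrongly discards a frequent pair.

from itertools import combinations

def hash_filter_pairs(db, min_count, n_buckets=7):
    # First scan: hash every 2-itemset occurrence into a bucket.
    buckets = [0] * n_buckets
    def h(pair):
        return hash(frozenset(pair)) % n_buckets
    for t in db:
        for pair in combinations(sorted(t), 2):
            buckets[h(pair)] += 1
    # A pair can be frequent only if its bucket reached min_count.
    return lambda pair: buckets[h(pair)] >= min_count

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
maybe_frequent = hash_filter_pairs(db, 2)
print(maybe_frequent(("A", "C")))  # True: {A, C} occurs twice, so its
                                   # bucket count is at least 2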
Example
• Consider the following database with five transactions. Let
min_sup = 60% and min_conf = 80%.
• Find all the frequent itemsets using the Apriori method.
• List all the strong association rules matching the following metarule.
Thank You
