COS10022 – DATA SCIENCE PRINCIPLES
Dr Pei-Wei Tsai (Lecturer, Unit Convenor)
[email protected], EN508d
Week 06
Association Rules
COS10022 – Data Science Principles
Outline
Overview
Apriori Algorithm
Evaluation of Candidate Rules
Examples
Application of Association Rules
Validation and Testing
Diagnostic
3
Overview
• Association rules are commonly known as Market Basket Analysis.
• An unsupervised learning method.
• A descriptive, not predictive, method.
• Used to discover interesting hidden
relationships in a large dataset.
• The disclosed relationships are represented
as rules or frequent itemsets.
• Commonly used for mining transactions in databases.
4
Overview
• Some questions that association rules can answer:
• Which products tend to be purchased together?
• Of those customers who are similar to this person, what products do they tend to buy?
• Of those customers who have purchased this product, what other similar products do they tend to view or purchase?
The overall goal is to analyze customer buying habits by finding associations and correlations between the different items that customers place in their shopping baskets.
5
Overview
The general logic behind association rules:
1. A large collection of transactions (depicted as
three stacks of receipts, in which each transaction
consists of one or more items).
2. Association rules go through the items being
purchased to see what items are frequently
bought together and to discover a list of rules that
describe the purchasing behavior.
3. The rules suggest that when cereal is purchased,
90% of the time milk is purchased; when bread is
purchased, 40% of the time milk is purchased also;
when milk is purchased, 23% of the time cereal is
also purchased.
6
Overview
Rules
• Each rule is in the form X → Y
• This means that when item X is observed, item Y is also observed.
Itemset
• A collection of items or individual entities that contain
some kind of relationship.
• An itemset containing k items is called a k-itemset.
• k-itemset = {item 1, item 2,…,item k}
• Examples:
• A set of retail items purchased together in one transaction.
• A set of hyperlinks clicked on by one user in a single session.
8
Overview
Apriori Algorithm
• One of the most fundamental algorithms for generating association rules.
• One major component of Apriori is support.
• Given an itemset L, the support of L is the percentage of transactions that contain L.
• If 80% of all transactions contain itemset {bread}, then the support of {bread} is 0.8.
• If 60% of all transactions contain itemset {bread, butter}, then the support of {bread, butter} is 0.6.
Frequent Itemset
• Items that appear together “often enough”, i.e. the itemset meets the minimum support criterion.
• If the minimum support is set at 0.7, {bread} is considered a frequent itemset, whereas {bread, butter} is not (a short code sketch of computing support follows below).
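As a minimal illustration (not from the slides), support can be computed directly from a list of transactions; the five transactions below are made up so that {bread} has support 0.8 and {bread, butter} has support 0.6:

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter"},
    {"bread"},
    {"milk"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

print(support({"bread"}, transactions))            # 0.8
print(support({"bread", "butter"}, transactions))  # 0.6

With a minimum support of 0.7, only {bread} would be reported as frequent, matching the example above.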
9
Overview
Apriori Property
• Also called the downward closure property.
• If an itemset is considered frequent, then any subset of that frequent itemset must also be frequent.
• If 60% of the transactions contain {bread, jam}, then at least 60% of all the transactions will contain {bread} or {jam}.
• If the support of {bread, jam} is 0.6, the support of {bread} or {jam} is at least 0.6.
• If itemset {B, C, D} is frequent, then all the subsets of this itemset, shaded in the figure, must also be frequent itemsets.
10
Overview
Association Rules
• An implication expression of the form X → Y, where X and Y are non-overlapping itemsets.
• E.g. {Milk, Diaper} → {Beer}
• Generating association rules:
• Step 1: Find frequent itemsets whose occurrences exceed a predefined minimum support threshold.
• Step 2: Derive association rules from those frequent itemsets (with the constraint of a minimum confidence threshold).
Rule Evaluation Metrics
• Support (s): the number of transactions that contain both X and Y out of the total number of transactions.
• E.g. a support of 2% means that 2% of all the transactions under analysis show that {Milk, Diaper} and {Beer} are purchased together.
• Confidence (c): the number of transactions that contain both X and Y out of the total number of transactions that contain X.
• E.g. a confidence of 60% means that 60% of customers who purchased {Milk, Diaper} also bought {Beer}.
A short sketch of computing these two metrics directly from transactions follows below.
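A small sketch (hypothetical transactions, illustrative only) of counting support and confidence for a rule X → Y:

transactions = [
    {"Milk", "Diaper", "Beer"},
    {"Milk", "Diaper"},
    {"Milk", "Bread"},
    {"Diaper", "Beer"},
    {"Milk", "Diaper", "Beer", "Bread"},
]

def rule_metrics(X, Y, transactions):
    X, Y = set(X), set(Y)
    n_total = len(transactions)
    n_x = sum(1 for t in transactions if X <= t)          # transactions containing X
    n_xy = sum(1 for t in transactions if (X | Y) <= t)   # transactions containing both X and Y
    return n_xy / n_total, n_xy / n_x                     # (support, confidence)

print(rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions))  # (0.4, 0.666...)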
11
Outline
Overview
Apriori Algorithm
Evaluation of Candidate Rules
Examples
Application of Association Rules
Validation and Testing
Diagnostic
12
Apriori Algorithm
Creating Frequent Sets
• Apriori employs an iterative approach known as a level-wise search, where
k-itemsets are used to explore (k+1)-itemsets.
• First, the set of frequent 1-itemsets is found by scanning the database to
accumulate the count for each item, and collecting those items that satisfy
minimum support. The resulting set is denoted L1.
• Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3,
and so on, until no more frequent k-itemsets can be found.
• The finding of each Lk requires one full scan of the database.
13
Example 1
(5 transactions; 6 types of items)   MinSupp = 3/5 = 0.6

L1: Items (1-itemsets)
Bread: 4, Coke: 2, Milk: 4, Beer: 3, Diaper: 4, Eggs: 1

L2: Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs
{Bread, Milk}: 3, {Bread, Beer}: 2, {Bread, Diaper}: 3, {Milk, Beer}: 2, {Milk, Diaper}: 3, {Beer, Diaper}: 3

L3: Triplets (3-itemsets)
{Bread, Milk, Diaper}: 2, {Bread, Milk, Beer}: 1, {Bread, Diaper, Beer}: 2
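These counts can be reproduced with a short Python sketch; the five transactions below are an assumed basket that is consistent with the counts in the tables above:

from itertools import combinations
from collections import Counter

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def itemset_counts(transactions, k):
    # Count every k-itemset that occurs in at least one transaction.
    counts = Counter()
    for t in transactions:
        for combo in combinations(sorted(t), k):
            counts[combo] += 1
    return counts

for k in (1, 2, 3):
    print(k, itemset_counts(transactions, k))
# With MinSupp = 0.6 (count >= 3), the frequent itemsets are the ones with count >= 3 above.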
14
Apriori Algorithm
Creating Frequent Sets
• Let’s define:
Ck as a candidate itemset of size k
Lk as a frequent itemset of size k
• Main steps of each iteration are:
1. Find the frequent itemset Lk-1 (starting from L1).
2. Join step: Ck is generated by joining Lk-1 with itself (cartesian product Lk-1 x Lk-1).
3. Prune step (Apriori Property): any candidate whose (k−1)-subset is not frequent cannot itself be frequent, and is therefore removed from Ck.
4. The frequent set Lk is then obtained from the candidates in Ck that satisfy minimum support (a code sketch of the join and prune steps is given below).
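A minimal sketch of the join and prune steps (the helper name apriori_gen is illustrative; itemsets are represented as frozensets):

from itertools import combinations

def apriori_gen(L_prev, k):
    # Join step: union pairs of frequent (k-1)-itemsets that give a k-itemset.
    candidates = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    # Prune step (Apriori property): drop candidates with an infrequent (k-1)-subset.
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

L2 = {frozenset(p) for p in [("Bread", "Milk"), ("Bread", "Diaper"),
                             ("Milk", "Diaper"), ("Beer", "Diaper")]}
print(apriori_gen(L2, 3))  # only {Bread, Milk, Diaper} survives the prune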
15
Apriori Algorithm
Illustrating Apriori Principle
• Any subset of a
frequent itemset must
also be frequent.
• Itemsets that do not
meet the minimum
support threshold are
pruned away.
16
Apriori Algorithm
Pseudo Code
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
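A minimal runnable Python version of this pseudo code (a sketch, assuming transactions are given as sets of items and min_support is a fraction):

from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    def frequent(candidates):
        # Keep candidates whose support meets min_support.
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= min_support}
    L = frequent({frozenset([item]) for t in transactions for item in t})  # L1
    all_frequent, k = set(L), 2
    while L:
        # Candidates generated from L(k-1): join, then prune by the Apriori property.
        C = {a | b for a in L for b in L if len(a | b) == k}
        C = {c for c in C if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = frequent(C)          # candidates in Ck with min_support
        all_frequent |= L
        k += 1
    return all_frequent          # union of all Lk

transactions = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"},
                {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]
print(apriori(transactions, 0.6))  # reproduces the frequent itemsets of Example 1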
17
Example 2
18
Outline
Overview
Apriori Algorithm
Evaluation of Candidate Rules
Examples
Application of Association Rules
Validation and Testing
Diagnostic
19
Evaluation of Candidate Rules
The process of creating association rules is two-staged.
• First, a set of candidate rules based on frequent itemsets is generated.
• If {Bread, Egg, Milk, Butter} is the frequent itemset, candidate rules will look like:
• {Egg, Milk, Butter} → {Bread}
• {Bread, Milk, Butter} → {Egg}
• {Bread, Egg} → {Milk, Butter}
• Etc. (a short sketch of enumerating such candidate rules is given after this list)
• Second, the appropriateness of these candidate rules is evaluated using:
• Confidence
• Lift
• Leverage
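A short sketch of enumerating candidate rules from one frequent itemset (every non-empty proper subset becomes an antecedent; item names are illustrative):

from itertools import combinations

def candidate_rules(frequent_itemset):
    items = set(frequent_itemset)
    rules = []
    for size in range(1, len(items)):                       # antecedent sizes 1 .. k-1
        for antecedent in combinations(sorted(items), size):
            rules.append((set(antecedent), items - set(antecedent)))
    return rules

for X, Y in candidate_rules({"Bread", "Egg", "Milk", "Butter"}):
    print(X, "->", Y)   # 14 candidate rules for a 4-itemset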
20
Evaluation of Candidate Rules
Confidence
• The measure of certainty or trustworthiness associated with each discovered rule.
• Mathematically, the percentage of transactions that contain both X and Y out of all transactions that contain X.

Confidence(X → Y) = Support(X + Y) / Support(X)

• E.g. if {bread, eggs, milk} has a support of 0.15 and {bread, eggs} also has a support of 0.15, the confidence of the rule {bread, eggs} → {milk} is 1.
• This means that 100% of the time a customer buys bread and eggs, milk is bought as well. The rule is therefore correct for 100% of the transactions containing bread and eggs.
21
Evaluation of Candidate Rules
Confidence
• Example from the figure: {Toothbrush} → {Milk}, Confidence = 10/(10+4) ≈ 0.7
• A relationship may be thought of as interesting when the algorithm identifies the relationship with a measure of confidence greater than or equal to the predefined threshold (i.e. the minimum confidence).
• Problem with confidence:
• Given a rule X → Y, confidence considers only the antecedent (X) and the co-occurrence of X and Y.
• It cannot tell whether a rule reflects a true implication between X and Y or whether the rule is purely coincidental.
22
https://towardsdatascience.com/association-rules-2-aa9a77241654
Evaluation of Candidate Rules
Lift
• Measures how many times more often X and Y occur together than expected if they were statistically independent of each other.
• A measure of how X and Y are really related rather than coincidentally happening together.

Lift(X → Y) = Support(X + Y) / (Support(X) × Support(Y))

• Lift = 1 if X and Y are statistically independent.
• Lift > 1 indicates the degree of usefulness of the rule.
• A larger value of lift suggests a greater strength of the association between X and Y.
23
Evaluation of Candidate Rules
Lift
• E.g. Assuming 1000 transactions,
• If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400 of the transactions, then
Lift(milk → eggs) = 0.3 / (0.5 * 0.4) = 1.5
• If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400 of the transactions, then
Lift(milk → bread) = 0.4/(0.5*0.4) = 2.0
• Therefore it can be concluded that milk and bread have a stronger association
than milk and eggs.
24
Evaluation of Candidate Rules
Leverage
• Measures the difference between the probability of X and Y appearing together in the dataset and what would be expected if X and Y were statistically independent of each other.

Leverage(X → Y) = Support(X + Y) − Support(X) × Support(Y)

• Leverage = 0 if X and Y are statistically independent.
• Leverage > 0 indicates the degree of relationship between X and Y.
• A larger leverage value indicates a stronger relationship between X and Y.
25
Evaluation of Candidate Rules
Leverage
• E.g. Assuming 1000 transactions,
• If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400 of the
transactions, then Leverage(milk → eggs) = 0.3 - 0.5*0.4 = 0.1
• If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400 of the
transactions, then Leverage (milk → bread) = 0.4 - 0.5*0.4 = 0.2
• It again confirms that milk and bread have a stronger association
than milk and eggs.
26
Summary
• Assuming 1000 transactions,
• If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400 of the transactions.
• If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400 of the transactions.
Measure      {Milk} → {Eggs}            {Milk} → {Bread}
Confidence   0.3 / 0.5 = 0.6            0.4 / 0.5 = 0.8
Lift         0.3 / (0.5 × 0.4) = 1.5    0.4 / (0.5 × 0.4) = 2
Leverage     0.3 − 0.5 × 0.4 = 0.1      0.4 − 0.5 × 0.4 = 0.2
             Smaller                    Greater
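The table can be verified with a few lines of Python (supports taken directly from the example above; results agree up to floating-point rounding):

def evaluate(s_xy, s_x, s_y):
    confidence = s_xy / s_x
    lift = s_xy / (s_x * s_y)
    leverage = s_xy - s_x * s_y
    return confidence, lift, leverage

print(evaluate(0.3, 0.5, 0.4))  # {Milk} -> {Eggs}:  approximately (0.6, 1.5, 0.1)
print(evaluate(0.4, 0.5, 0.4))  # {Milk} -> {Bread}: approximately (0.8, 2.0, 0.2)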
27
Evaluation of Candidate Rules
• Confidence is able to identify trustworthy rules, but it cannot tell whether a rule is coincidental.
• Measures such as lift and leverage not only ensure that interesting rules are identified but also filter out the coincidental rules.
• Together, support, confidence, lift and leverage ensure the discovery of interesting and strong rules from the sample dataset.
28
Outline
Overview
Apriori Algorithm
Evaluation of Candidate Rules
Examples
Application of Association Rules
Validation and Testing
Diagnostic
29
Example 3 (Min_Supp Count = 2; Min_Conf = 50%)
Five market basket transactions with items labelled as I1, I2, I3 and so on.
1. Candidate list generation, C1: start by creating all individual items, called candidates, and calculate their support counts.
2. Create frequent 1-itemsets, L1: remove candidates that fail the min_supp count (i.e. I4 and I5). No supersets of an infrequent itemset need to be generated and tested.
30
Min_Supp Count = 2; Min_Conf = 50%
3. Generate the second candidate list, C2: C2 is generated by L1 cross join L1.
4. Create frequent 2-itemsets, L2: remove candidates that fail the min_supp count.
5. Generate the third candidate list, C3: C3 is generated by L2 cross join L2.
6. The support count for {I1, I2, I3} fails min_supp: {I1, I2} appears in 2 transactions together and {I2, I3} appears in 3 transactions together, but {I1, I2, I3} appears in only 1 transaction together. Association rule mining is completed and there will be no C4 candidate list.
31
7. Generate candidate rules: take the last non-empty frequent itemsets (i.e. L2 = {I1, I2}, {I2, I3}) and use all the non-empty subsets of each frequent itemset to generate association rules.
8. Calculate the confidence of each candidate rule: Confidence(X → Y) = Support(X + Y) / Support(X).
9. Strong association rules: the green-ticked rules in the figure are all strong rules satisfying min_conf ≥ 50% and min_supp count = 2.
32
Example 4
• Consider a database, D, consisting of 9 transactions:

TID   List of Items
T100  I1, I2, I5
T101  I2, I4
T102  I2, I3
T103  I1, I2, I4
T104  I1, I3
T105  I2, I3
T106  I1, I3
T107  I1, I2, I3, I5
T108  I1, I2, I3

• Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
• Let the minimum confidence required be 70%.
• We first have to find the frequent itemsets using the Apriori algorithm.
• Then, association rules will be generated using min. support and min. confidence.
33
Step 1: Generating 1-itemset Frequent Pattern
Scan D to count each candidate and collect C1; then compare the candidate support counts with the minimum support count (2) to obtain L1:
C1: {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2
L1: {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2

• In the first iteration of the algorithm, each item is a member of the set of candidates.
• The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.
34
Step 2: Generating 2-itemset Frequent Pattern
Generate C2 candidates from L1 (cartesian product L1 x L1):
C2: {I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5}
Scan D for the count of each candidate:
{I1,I2}: 4, {I1,I3}: 4, {I1,I4}: 1, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2, {I3,I4}: 0, {I3,I5}: 1, {I4,I5}: 0
Compare the candidate support counts with the minimum support count (2):
L2: {I1,I2}: 4, {I1,I3}: 4, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2
35
Step 2: Generating 2-itemset Frequent Pattern [Cont.]
• To discover the set of frequent 2-itemsets, L2 , the algorithm uses L1 Join L1 to
generate a candidate set of 2-itemsets, C2.
• Next, the transactions in D are scanned and the support count for each candidate
itemset in C2 is accumulated (as shown in the middle table).
• The set of frequent 2-itemsets, L2 , is then determined, consisting of those
candidate 2-itemsets in C2 having minimum support.
• Note: We haven’t used Apriori Property yet.
36
Step 3: Generating 3-itemset Frequent Pattern
Generate C3 candidates from L2 (L2 x L2), scan D for the count of each candidate, and compare the candidate support counts with the minimum support count:
C3: {I1, I2, I3}: 2, {I1, I2, I5}: 2
L3: {I1, I2, I3}: 2, {I1, I2, I5}: 2
• The generation of the set of candidate 3-itemsets, C3 , involves the use of the Apriori Property.
• In order to find C3, we compute L2 Join L2.
• C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
• Now, Join step is complete and Prune step will be used to reduce the size of C3. Prune step helps
to avoid heavy computation due to large Ck.
37
Step 3: Generating 3-itemset Frequent Pattern [Cont.]
C3 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
• Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How?
• For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3.
• Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}.
• BUT {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus we have to remove {I2, I3, I5} from C3.
• Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the join result for pruning.
• Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
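The prune check described here can be written directly; L2 is as listed in Step 2, and the helper name is illustrative:

from itertools import combinations

L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}

def survives_prune(candidate, L_prev):
    # A 3-itemset is kept only if every 2-item subset of it is frequent (Apriori property).
    return all(frozenset(s) in L_prev for s in combinations(candidate, 2))

for c in [("I1", "I2", "I3"), ("I1", "I2", "I5"), ("I1", "I3", "I5"),
          ("I2", "I3", "I4"), ("I2", "I3", "I5"), ("I2", "I4", "I5")]:
    print(set(c), survives_prune(c, L2))   # only the first two candidates survive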
38
Step 4: Generating 4-itemset Frequent Pattern
• The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent.
• Thus, C4 = ∅, and the algorithm terminates, having found all the frequent itemsets. This completes our Apriori algorithm.
• What's next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).
39
Step 5: Generating Association Rules from Frequent Itemsets
• Back to the example: list all frequent itemsets, L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
• Let's take the 3-itemset {I1, I2, I5}.
• All its non-empty subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
• The resulting candidate rules are:
• {I1,I2} → {I5}
• {I1,I5} → {I2}
• {I2,I5} → {I1}
• {I5} → {I1,I2}
• {I2} → {I1,I5}
• {I1} → {I2,I5}
40
Step 5: Generating Association Rules from Frequent Itemsets [Cont.]
• Let the minimum confidence threshold be, say, 70%.
• The resulting association rules are shown below, each listed with its confidence, where Confidence(X → Y) = Support(X + Y) / Support(X).
• R1: I1 ^ I2 → I5
• Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%
• R1 is Rejected.
• R2: I1 ^ I5 → I2
• Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%
• R2 is Selected.
• R3: I2 ^ I5 → I1
• Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%
• R3 is Selected.
41
Step 5: Generating Association Rules from Frequent Itemsets
[Cont.]
• R4: I1 → I2 ^ I5
• Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%
• R4 is Rejected.
• R5: I2 → I1 ^ I5
• Confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%
• R5 is Rejected.
• R6: I5 → I1 ^ I2
• Confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%
• R6 is Selected.
In this way, we have found three strong association rules.
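As a sketch, the six confidence checks above can be reproduced from the transactions in database D (support counts follow the sc{...} notation used above):

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}, {"I1", "I3"},
     {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]

def sc(itemset):
    # Support count: number of transactions containing the itemset.
    return sum(1 for t in D if set(itemset) <= t)

rules = [({"I1", "I2"}, {"I5"}), ({"I1", "I5"}, {"I2"}), ({"I2", "I5"}, {"I1"}),
         ({"I1"}, {"I2", "I5"}), ({"I2"}, {"I1", "I5"}), ({"I5"}, {"I1", "I2"})]

for X, Y in rules:
    conf = sc(X | Y) / sc(X)
    verdict = "Selected" if conf >= 0.7 else "Rejected"
    print(X, "->", Y, f"confidence = {conf:.0%}", verdict)
# R2, R3 and R6 are selected, matching the result above.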
42
Outline
Overview
Apriori Algorithm
Evaluation of Candidate Rules
Examples
Application of Association Rules
Validation and Testing
Diagnostic
43
Applications of Association Rules
The term market basket analysis refers to a specific implementation of
association rules.
• For better merchandising – products to include/exclude from
inventory each month
• Placement of products
• Cross-selling
• Promotional programs—multiple product purchase incentives
managed through a loyalty card program
44
Applications of Association Rules
• Input: the simple point-of-sale transaction data
• Output: Most frequent affinities among items
• Example: according to the transaction data…
"Customers who bought a laptop computer and virus protection software also bought an extended service plan 70 percent of the time."
• How do you use such a pattern/knowledge?
• Put the items next to each other for ease of finding.
• Promote the items as a package (do not put one on sale if the other(s) are on sale).
• Place items far apart from each other so that the customer must walk the aisles to search for them, and by doing so potentially sees and buys other items.
45
Applications of Association Rules
Recommender systems – Amazon, Netflix:
• Clickstream analysis from web usage log files
• Website visitors to page X click on links A,B,C more than on links D,E,F
In medicine:
• relationships between symptoms and illnesses;
• diagnosis and patient characteristics and treatments (to be used in medical DSS);
• genes and their functions (to be used in genomics projects).
46
Outline
Overview
Apriori Algorithm
Evaluation of Candidate Rules
Examples
Application of Association Rules
Validation and Testing
Diagnostic
47
Validation and Testing
• The frequent itemsets and high-confidence rules are found using pre-specified minimum support and minimum confidence levels.
• Measures like lift and/or leverage then ensure that interesting rules are identified rather than coincidental ones.
• However, some of the remaining rules may be considered subjectively uninteresting because they don't yield unexpected profitable actions.
• E.g. rules like {paper} → {pencil} are not interesting/meaningful.
• Incorporating subjective knowledge requires domain experts.
• Incorporating subjective knowledge requires domain experts
• Good rules provide valuable insights for institutions to improve their business operations
48
Outline
Overview
Apriori Algorithm
Evaluation of Candidate Rules
Examples
Application of Association Rules
Validation and Testing
Diagnostic
49
Diagnostics
• Although the Apriori algorithm is easy to understand and implement, some
of the rules generated are uninteresting or practically useless.
• Additionally, some of the rules may be generated due to coincidental
relationships between the variables.
• Measures like confidence, lift, and leverage should be used along with
human insights to address this problem.
50
Diagnostics
• Another problem with association rules is that, in Phases 3 and 4 of the Data Analytics Lifecycle, the team must specify the minimum support prior to the model execution, which may lead to too many or too few rules.
• In related research, a variant of the algorithm can use a predefined target range for the number of rules so that the algorithm can adjust the minimum support accordingly.
• The algorithm requires scans of the entire database to obtain the result. Accordingly, as the database grows, it takes more time to compute in each run.
51
Diagnostics
Approaches to improve Apriori’s efficiency:
Partitioning:
• Any itemset that is potentially frequent in a transaction database must be frequent in at least one of the partitions of the
transaction database.
Sampling:
• This extracts a subset of the data with a lower support threshold and uses the subset to perform association rule mining.
Transaction reduction:
• A transaction that does not contain frequent k-itemsets is useless in subsequent scans and therefore can be ignored.
Hash-based itemset counting:
• If the corresponding hashing bucket count of a k-itemset is below a certain threshold, the k-itemset cannot be frequent.
Dynamic itemset counting:
• Only add new candidate itemsets when all of their subsets are estimated to be frequent.
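As a small illustration of transaction reduction only (a sketch, not taken from the slides), transactions that contain no frequent k-itemset can be skipped in later scans:

def reduce_transactions(transactions, frequent_k_itemsets):
    # Keep only transactions that contain at least one frequent k-itemset;
    # the rest cannot contribute to any (k+1)-itemset count.
    return [t for t in transactions if any(s <= t for s in frequent_k_itemsets)]

L2 = [{"Bread", "Milk"}, {"Bread", "Diaper"}]
transactions = [{"Bread", "Milk", "Diaper"}, {"Coke", "Eggs"}, {"Bread", "Diaper"}]
print(reduce_transactions(transactions, L2))  # the {Coke, Eggs} transaction is dropped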
52
References
• EMC Education Services. (2015). Data Science and Big Data Analytics:
Discovering, Analyzing, Visualizing and Presenting Data. Wiley.
• Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and
techniques. Elsevier.
• DataCamp, ‘Market Basket Analysis using R’, retrieved from https://www.datacamp.com/community/tutorials/market-basket-analysis-r
53