Data Mining Assignment
Association Analysis (Solution)
University of Transportation Ho Chi Minh City
August 21, 2025
Instructions
Please answer the following questions clearly and show all your work. For questions requiring
calculations, define the formulas you are using before applying them.
Question 1
Consider the data set shown in Table 1 below.
Table 1: Example of market basket transactions.
Customer ID Transaction ID Items Bought
1 0001 {a, d, e}
1 0024 {a, b, c, e}
2 0012 {a, b, d, e}
2 0031 {a, c, d, e}
3 0015 {b, c, e}
3 0022 {b, d, e}
4 0029 {c, d}
4 0040 {a, b, c}
5 0033 {a, d, e}
5 0038 {a, b, e}
a) Compute the support for itemsets {e}, {b, d}, and {b, d, e} by treating each
transaction ID as a market basket.
Formula: The support of an itemset X is the fraction of transactions that contain X:
s(X) = (number of transactions containing X) / (total number of transactions)
There are a total of 10 transactions.
• The itemset {e} appears in 8 transactions (0001, 0024, 0012, 0031, 0015, 0022, 0033, 0038).
s({e}) = 8/10 = 0.8
• The itemset {b, d} appears in 2 transactions (0012, 0022).
s({b, d}) = 2/10 = 0.2
• The itemset {b, d, e} appears in 2 transactions (0012, 0022).
s({b, d, e}) = 2/10 = 0.2
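To make the counting concrete, here is a minimal Python sketch (the variable and function names are illustrative, not part of the assignment) that reproduces the three transaction-level supports from Table 1:

transactions = {
    "0001": {"a", "d", "e"},      "0024": {"a", "b", "c", "e"},
    "0012": {"a", "b", "d", "e"}, "0031": {"a", "c", "d", "e"},
    "0015": {"b", "c", "e"},      "0022": {"b", "d", "e"},
    "0029": {"c", "d"},           "0040": {"a", "b", "c"},
    "0033": {"a", "d", "e"},      "0038": {"a", "b", "e"},
}

def support(itemset, baskets):
    # Fraction of baskets that contain every item of the itemset.
    return sum(itemset <= basket for basket in baskets.values()) / len(baskets)

print(support({"e"}, transactions))            # 0.8
print(support({"b", "d"}, transactions))       # 0.2
print(support({"b", "d", "e"}, transactions))  # 0.2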
b) Use the results in part (a) to compute the confidence for the association rules
{b, d} → {e} and {e} → {b, d}. Is confidence a symmetric measure?
Formula: The confidence of a rule X → Y is the conditional probability of seeing Y
given that we have seen X.
c(X → Y) = s(X ∪ Y) / s(X)
• For the rule {b, d} → {e}:
c({b, d} → {e}) = s({b, d, e}) / s({b, d}) = 0.2 / 0.2 = 1.0 (100%)
• For the rule {e} → {b, d}:
c({e} → {b, d}) = s({b, d, e}) / s({e}) = 0.2 / 0.8 = 0.25 (25%)
No, confidence is not a symmetric measure: as the calculations show, c({b, d} → {e}) ≠ c({e} → {b, d}).
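Continuing the sketch from part (a), and reusing its transactions dictionary and support() helper, confidence follows directly from the formula above:

def confidence(lhs, rhs, baskets):
    # c(lhs -> rhs) = s(lhs ∪ rhs) / s(lhs)
    return support(lhs | rhs, baskets) / support(lhs, baskets)

print(confidence({"b", "d"}, {"e"}, transactions))  # 1.0
print(confidence({"e"}, {"b", "d"}, transactions))  # 0.25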
c) Repeat part (a) by treating each customer ID as a market basket.
First, we create a market basket for each customer, containing the union of all items they
bought. There are 5 customers.
• Customer 1: {a, b, c, d, e}
• Customer 2: {a, b, c, d, e}
• Customer 3: {b, c, d, e}
• Customer 4: {a, b, c, d}
• Customer 5: {a, b, d, e}
• The itemset {e} appears in 4 customer baskets (1, 2, 3, 5).
s({e}) = 4/5 = 0.8
• The itemset {b, d} appears in all 5 customer baskets.
s({b, d}) = 5/5 = 1.0
• The itemset {b, d, e} appears in 4 customer baskets (1, 2, 3, 5).
s({b, d, e}) = 4/5 = 0.8
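The same sketch extends to customer-level baskets; the mapping below is read off Table 1, and each basket is the union of a customer's transactions:

customer_of = {  # transaction ID -> customer ID, from Table 1
    "0001": 1, "0024": 1, "0012": 2, "0031": 2, "0015": 3,
    "0022": 3, "0029": 4, "0040": 4, "0033": 5, "0038": 5,
}
customer_baskets = {}
for tid, items in transactions.items():
    # Each customer's basket is the union of all items in their transactions.
    customer_baskets.setdefault(customer_of[tid], set()).update(items)

print(support({"e"}, customer_baskets))            # 0.8
print(support({"b", "d"}, customer_baskets))       # 1.0
print(support({"b", "d", "e"}, customer_baskets))  # 0.8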
d) Use the results in part (c) to compute the confidence for the association rules
{b, d} → {e} and {e} → {b, d}.
• For the rule {b, d} → {e}:
c({b, d} → {e}) = s({b, d, e}) / s({b, d}) = 0.8 / 1.0 = 0.8 (80%)
• For the rule {e} → {b, d}:
c({e} → {b, d}) = s({b, d, e}) / s({e}) = 0.8 / 0.8 = 1.0 (100%)
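Applying the confidence() helper from part (b) to the customer-level baskets reproduces both values:

print(confidence({"b", "d"}, {"e"}, customer_baskets))  # 0.8
print(confidence({"e"}, {"b", "d"}, customer_baskets))  # 1.0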
e) Letting (s1, c1) be the transaction-based support and confidence from parts (a)-(b),
and (s2, c2) the customer-based values from parts (c)-(d), discuss whether there are
any relationships between them.
There is no guaranteed relationship between (s1, c1) and (s2, c2). Aggregating transactions
by customer fundamentally changes the data set: it reduces the number of baskets (from 10
to 5) and enlarges each basket to the union of a customer's purchases. As parts (a)-(d)
show, this can make support and confidence increase (s({b, d}) rose from 0.2 to 1.0),
decrease (c({b, d} → {e}) fell from 1.0 to 0.8), or stay the same (s({e}) remained 0.8),
depending on the purchasing patterns of the customers.
Question 2
Suppose the Apriori algorithm is applied to the data set shown below with minsup = 30% (3
transactions).
Table 2: Example of market basket transactions.
Transaction ID Items Bought
1 {a, b, d}
2 {b, c, d}
3 {a, b, d, e}
4 {a, c, d, e}
5 {b, d, e}
6 {c, d}
7 {a, b, c}
8 {a, d, e}
9 {a, c, d}
10 {b, d}
a) Draw an itemset lattice representing the data set. Label each node with F
(Frequent), I (Infrequent after counting), or N (Not a candidate/Pruned).
Step 1: Find Frequent 1-Itemsets (L1 )
• {a}: 6, {b}: 6, {c}: 5, {d}: 9, {e}: 4. All are frequent.
• L1 = {{a}, {b}, {c}, {d}, {e}}
Step 2: Generate and Prune 2-Itemsets (C2 → L2 )
• Candidates C2 : {ab}, {ac}, {ad}, {ae}, {bc}, {bd}, {be}, {cd}, {ce}, {de}
• Count supports: {ab}:3, {ac}:3, {ad}:5, {ae}:3, {bc}:3, {bd}:6, {be}:3, {cd}:5,
{ce}:2, {de}:4
• L2 = {{a, b}, {a, c}, {a, d}, {a, e}, {b, c}, {b, d}, {b, e}, {c, d}, {d, e}}. Itemset {c,e} is
infrequent.
Step 3: Generate and Prune 3-Itemsets (C3 → L3 )
• Candidates generated from L2 : {abc}, {abd}, {abe}, {acd}, {ace}, {ade}, {bcd},
{bce}, {bde}, {cde}
• Pruning step: {ace}, {bce}, and {cde} are pruned because their subset {ce} is not in L2.
• Remaining candidates to count: {abc}, {abd}, {abe}, {acd}, {ade}, {bcd}, {bde}
• Count supports: {abc}:1, {abd}:3, {abe}:1, {acd}:3, {ade}:3, {bcd}:2, {bde}:3
• L3 = {{a, b, d}, {a, c, d}, {a, d, e}, {b, d, e}}.
Step 4: Generate and Prune 4-Itemsets (C4 → L4)
• Joining frequent 3-itemsets that share two common items yields three candidates:
{a, b, c, d} (from {a, b, d} and {a, c, d}), {a, b, d, e} (from {a, b, d} and {a, d, e}),
and {a, c, d, e} (from {a, c, d} and {a, d, e}).
• {a, b, c, d} is pruned because its subsets {a, b, c} and {b, c, d} are not in L3.
• {a, b, d, e} is pruned because its subset {a, b, e} is not in L3.
• {a, c, d, e} is pruned because its subsets {a, c, e} and {c, d, e} are not in L3.
• Every candidate is pruned before counting, so no frequent 4-itemsets are found and
the algorithm terminates.
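The level-wise procedure can also be expressed as a short Python sketch (helper names are illustrative, not part of the assignment); it generates candidates by joining frequent k-itemsets that share a (k-1)-prefix, prunes candidates that have an infrequent subset, and prints the frequent itemsets found at each level:

from itertools import combinations

transactions = [  # Table 2; minsup = 3 out of 10 baskets
    {"a", "b", "d"}, {"b", "c", "d"}, {"a", "b", "d", "e"}, {"a", "c", "d", "e"},
    {"b", "d", "e"}, {"c", "d"}, {"a", "b", "c"}, {"a", "d", "e"},
    {"a", "c", "d"}, {"b", "d"},
]
minsup = 3

def support_count(candidates):
    # Number of transactions containing each candidate itemset.
    return {c: sum(1 for t in transactions if set(c) <= t) for c in candidates}

items = sorted({item for t in transactions for item in t})
counts = support_count([(item,) for item in items])
level = sorted(c for c, n in counts.items() if n >= minsup)  # frequent 1-itemsets
k = 1
while level:
    print(f"L{k}:", level)
    frequent = set(level)
    # Join step: merge two frequent k-itemsets that share their first k-1 items.
    candidates = [tuple(sorted(set(a) | set(b)))
                  for a, b in combinations(level, 2) if a[:-1] == b[:-1]]
    # Prune step (Apriori principle): drop candidates with an infrequent k-subset.
    candidates = [c for c in candidates
                  if all(s in frequent for s in combinations(c, k))]
    counts = support_count(candidates)
    level = sorted(c for c, n in counts.items() if n >= minsup)
    k += 1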
Lattice node labels, level by level (F = frequent, I = infrequent after support counting, N = pruned or never generated as a candidate):
• null: F
• 1-itemsets: {a} F, {b} F, {c} F, {d} F, {e} F
• 2-itemsets: {a,b} F, {a,c} F, {a,d} F, {a,e} F, {b,c} F, {b,d} F, {b,e} F, {c,d} F, {c,e} I, {d,e} F
• 3-itemsets: {a,b,c} I, {a,b,d} F, {a,b,e} I, {a,c,d} F, {a,c,e} N, {a,d,e} F, {b,c,d} I, {b,c,e} N, {b,d,e} F, {c,d,e} N
• 4-itemsets: {a,b,c,d} N, {a,b,c,e} N, {a,b,d,e} N, {a,c,d,e} N, {b,c,d,e} N
• 5-itemset: {a,b,c,d,e} N
b) What is the percentage of frequent itemsets (with respect to all itemsets in
the lattice)?
There are 2^5 = 32 itemsets in the lattice. Counting the nodes labeled 'F' (including the null set), there are 19 frequent itemsets.
Percentage of frequent itemsets = 19/32 = 59.375%
c) What is the pruning ratio of the Apriori algorithm on this data set?
The pruning ratio is the percentage of itemsets in the lattice that were never counted because they were pruned (or never generated as candidates). These are the nodes labeled 'N'; there are 9 such nodes.
Pruning ratio = 9/32 = 28.125%
d) What is the false alarm rate?
The false alarm rate is the percentage of candidate itemsets that are found to be infrequent
after support counting. These are the nodes labeled ’I’. There are 4 such nodes.
False alarm rate = 4/32 = 12.5%
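As a quick arithmetic check of parts (b) through (d), taking the node counts above (19 F, 4 I, 9 N) as given:

total_itemsets = 2 ** 5             # all subsets of {a, b, c, d, e}, null set included
n_frequent, n_infrequent, n_pruned = 19, 4, 9
print(n_frequent / total_itemsets)    # 0.59375
print(n_pruned / total_itemsets)      # 0.28125
print(n_infrequent / total_itemsets)  # 0.125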
Question 3
The Apriori algorithm uses a hash tree data structure to efficiently count the support of candi-
date itemsets. Consider the hash tree for candidate 3-itemsets shown in Figure 6.2.
a) Given a transaction that contains items {1, 3, 4, 5, 8}, which of the hash tree
leaf nodes will be visited when finding the candidates of the transaction?
To find the candidate 3-itemsets contained within the transaction {1, 3, 4, 5, 8}, we must
generate all possible 3-item subsets from the transaction and traverse the hash tree for
each one. The subsets are: {1,3,4}, {1,3,5}, {1,3,8}, {1,4,5}, {1,4,8}, {1,5,8}, {3,4,5},
{3,4,8}, {3,5,8}, {4,5,8}.
By tracing each of these subsets through the hash tree structure provided in the problem,
we find that the leaf nodes visited are L1, L3, L5, L9, and L11.
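The ten 3-item subsets can be enumerated mechanically; the small Python sketch below only lists them, since mapping each subset to a leaf depends on the hash functions of Figure 6.2:

from itertools import combinations

transaction = [1, 3, 4, 5, 8]
for subset in combinations(transaction, 3):
    print(subset)  # the ten 3-item subsets listed above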
b) Use the visited leaf nodes in part (a) to determine the candidate itemsets that
are contained in the transaction {1, 3, 4, 5, 8}.
Of the candidate itemsets stored in these visited leaf nodes, the following are also subsets of the transaction:
• L1 contains {1,4,5}
• L5 contains {1,5,8}
• L9 contains {4,5,8}
Therefore, the candidates contained in the transaction are {1, 4, 5}, {1, 5, 8}, and {4,
5, 8}.
Question 4
Answer the following questions using the data sets shown in Figure 6.6. We will apply the Apriori
algorithm to extract frequent itemsets with minsup = 10% (i.e., itemsets must be contained in
at least 1000 transactions).
a) Which data set(s) will produce the most number of frequent itemsets?
Answer: Data set (e). This dataset has dense vertical patterns, indicating that many
items are frequently purchased together across many transactions. This high correlation
will lead to the generation of a very long frequent itemset along with all of its subsets,
resulting in the largest total number of frequent itemsets.
b) Which data set(s) will produce the fewest number of frequent itemsets?
Answer: Data set (d). This dataset appears to have very sparse and random-looking
patterns. It is unlikely that any combination of items will meet the 10% support threshold,
meaning it will likely produce no frequent itemsets beyond the 1-itemsets (if any).
c) Which data set(s) will produce the longest frequent itemset?
Answer: Data set (e). For the same reason as in part (a), the strong vertical correla-
tions in this dataset suggest that a large number of items are consistently bought together,
which will form a single, very long frequent itemset.
d) Which data set(s) will produce frequent itemsets with highest maximum sup-
port?
Answer: Data set (b). This dataset shows some items that appear in very long hor-
izontal blocks, meaning these specific items are present in a large, contiguous block of
transactions. This structure will lead to certain items having a very high support count,
likely the highest maximum support among all datasets.
e) Which data set(s) will produce frequent itemsets containing items with wide-
varying support levels?
Answer: Data set (e). This dataset shows some items appearing very frequently
(dense vertical bars) while others appear much less frequently (sparse dots). This mix
of common and rare items will result in frequent itemsets containing items with a wide
range of individual support levels (e.g., from less than 20% to more than 70%).