Association Rules in Data Mining

Uploaded by Soma Srinivas

2.4 Generating Association Rules from Frequent Itemsets

Generating association rules from frequent itemsets is a crucial step in data mining that helps
us derive actionable insights from the data. Here’s a detailed discussion on how to generate
these rules, including an example for clarity.

1. Overview of Association Rules

Association rules are used to find relationships between different items in a dataset. Each rule
is of the form:

Antecedent → Consequent

Where:

 Antecedent (or Left-hand side) is a set of items.
 Consequent (or Right-hand side) is another set of items.

The goal is to discover rules that show a strong relationship between items. To measure the
strength of these rules, we use several metrics:

 Support: The proportion of transactions that contain both the antecedent and the
consequent.
 Confidence: The proportion of transactions containing the antecedent that also
contain the consequent.
 Lift: The ratio of the observed support to the expected support if the items were
independent.
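These three metrics can be written directly as small functions. The sketch below is a minimal illustration; the five-transaction list is hypothetical, chosen only so the numbers are easy to check by hand:

```python
# Hypothetical transactions, for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "diapers", "beer"},
    {"bread", "diapers"},
    {"milk", "bread", "diapers"},
    {"milk", "bread", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent): joint support over antecedent support."""
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's baseline support."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(round(support({"milk", "bread"}, transactions), 4))       # 0.6
print(round(confidence({"milk"}, {"bread"}, transactions), 4))  # 0.75
print(round(lift({"milk"}, {"bread"}, transactions), 4))        # 0.9375
```

A lift below 1, as here, means the two items co-occur slightly less often than they would if purchased independently.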

2. Steps for Generating Association Rules

1. Find Frequent Itemsets: Identify all the itemsets that appear frequently in the
dataset, i.e., those that meet a minimum support threshold.
2. Generate Rules from Frequent Itemsets: For each frequent itemset, generate all
possible rules and evaluate them using metrics such as confidence and lift.
3. Prune Rules: Discard rules that do not meet the minimum confidence threshold or
other criteria.
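The three steps above can be sketched end to end. This brute-force version enumerates every itemset, so it only suits tiny datasets; the transactions and thresholds are hypothetical:

```python
from itertools import combinations

# Hypothetical toy data and thresholds, for illustration only.
transactions = [{"A", "B"}, {"A", "B", "C"}, {"B", "C"}, {"A", "C"}, {"A", "B", "C"}]
MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.6

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

# Step 1: find frequent itemsets (brute force over all item combinations).
items = sorted(set().union(*transactions))
frequent = {
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(c) >= MIN_SUPPORT
}

# Steps 2-3: generate every antecedent -> consequent split,
# keeping only rules that meet the confidence threshold.
rules = []
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for ante in combinations(itemset, k):
            ante = frozenset(ante)
            conf = support(itemset) / support(ante)
            if conf >= MIN_CONFIDENCE:
                rules.append((set(ante), set(itemset - ante), conf))

for a, c, conf in sorted(rules, key=lambda r: -r[2]):
    print(f"{a} -> {c}  (confidence {conf:.2f})")
```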

3. Example

Let’s walk through an example using a hypothetical transaction dataset. Consider a dataset
with the following transactions:

Transaction ID Items
1 {Milk, Bread}
2 {Milk, Diapers, Beer}
3 {Bread, Diapers}
4 {Milk, Bread, Diapers}
5 {Milk, Bread, Beer}
Assume we have already mined the frequent itemsets and found that the frequent itemsets
are:

 {Milk, Bread} with support = 60%
 {Milk, Diapers} with support = 40%
 {Bread, Diapers} with support = 40%
 {Milk} with support = 80%
 {Bread} with support = 80%
 {Diapers} with support = 60%

Let’s generate association rules from these frequent itemsets.

Example 1: Generating Rules from Frequent Itemset {Milk, Bread}

1. Generate Rules: From the itemset {Milk, Bread}, we can generate the following
rules:
o Rule 1: {Milk} → {Bread}
o Rule 2: {Bread} → {Milk}
2. Calculate Metrics:

For Rule 1: {Milk} → {Bread}

o Support: Support of {Milk, Bread} = 60% (since 3 out of 5 transactions
contain both Milk and Bread)
o Confidence: Confidence of {Milk} → {Bread} = (Support of {Milk, Bread}) /
(Support of {Milk}) = 60% / 80% = 75%
o Lift: Lift = Confidence of {Milk} → {Bread} / (Support of {Bread}) = 75% /
80% = 0.9375

For Rule 2: {Bread} → {Milk}

o Support: Same as Rule 1 = 60%
o Confidence: Confidence of {Bread} → {Milk} = (Support of {Milk, Bread}) /
(Support of {Bread}) = 60% / 80% = 75%
o Lift: Same as Rule 1 = 0.9375

Both rules have fairly high confidence (75%), but the lift is slightly below 1, which
suggests Milk and Bread co-occur slightly less often than they would if purchased
independently.
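The arithmetic for Rule 1 can be double-checked in a few lines, starting from the supports stated above:

```python
# Re-deriving Rule 1's metrics from the stated supports.
support_milk_bread = 0.60
support_milk = 0.80
support_bread = 0.80

confidence_rule1 = support_milk_bread / support_milk  # {Milk} -> {Bread}
lift_rule1 = confidence_rule1 / support_bread

print(round(confidence_rule1, 4))  # 0.75
print(round(lift_rule1, 4))        # 0.9375
```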

Example 2: Generating Rules from Frequent Itemset {Milk, Diapers}

1. Generate Rules: From the itemset {Milk, Diapers}, we can generate:
o Rule 1: {Milk} → {Diapers}
o Rule 2: {Diapers} → {Milk}
2. Calculate Metrics:

For Rule 1: {Milk} → {Diapers}

o Support: Support of {Milk, Diapers} = 40%
o Confidence: Confidence of {Milk} → {Diapers} = (Support of {Milk,
Diapers}) / (Support of {Milk}) = 40% / 80% = 50%
o Lift: Lift = Confidence of {Milk} → {Diapers} / (Support of {Diapers}) =
50% / 60% = 0.8333

For Rule 2: {Diapers} → {Milk}

o Support: Same as Rule 1 = 40%
o Confidence: Confidence of {Diapers} → {Milk} = (Support of {Milk,
Diapers}) / (Support of {Diapers}) = 40% / 60% = 66.67%
o Lift: Lift = Confidence of {Diapers} → {Milk} / (Support of {Milk}) =
66.67% / 80% = 0.8333

Here, the confidence for {Diapers} → {Milk} (66.67%) is higher than for
{Milk} → {Diapers} (50%), which may indicate a stronger relationship in that
direction, although the lift of 0.8333 shows the two items still co-occur less
often than independence would predict.

4. Practical Considerations

 Minimum Support and Confidence: Set thresholds for support and confidence to
filter out insignificant rules.
 Complexity: Generating and evaluating rules for very large itemsets can be
computationally expensive.
 Interpretation: Ensure that the rules make practical sense in the business context.
Lift values above 1 indicate a positive association between antecedent and consequent.

5. Conclusion

Generating association rules from frequent itemsets helps in uncovering valuable patterns and
relationships within the data. By evaluating the rules based on support, confidence, and lift,
one can derive meaningful insights that can inform decision-making processes in various
domains, from retail to healthcare.

2. Frequent Itemset Mining Methods

2.1 Apriori Algorithm


 Principle: Uses the "apriori property" that all subsets of a frequent itemset must also
be frequent.
 Steps:
1. Generate Frequent 1-itemsets: Scan the dataset and count item frequencies.
2. Generate Candidate Itemsets: Use frequent itemsets to generate larger
candidate itemsets.
3. Prune Candidates: Remove candidate itemsets that are not frequent.
4. Repeat: Continue until no more frequent itemsets can be found.
 Example: {A, B, C} is considered as a candidate for the next iteration only if all of
its subsets {A, B}, {A, C}, and {B, C} are frequent; if any subset is infrequent,
{A, B, C} can be pruned without counting it.
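The steps above can be sketched as a simplified Apriori: join frequent (k−1)-itemsets into k-item candidates, prune by the apriori property, then count supports. The dataset in the usage example is hypothetical:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Simplified Apriori sketch: returns {itemset: support} for all frequent itemsets."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Step 1: frequent 1-itemsets.
    frequent = {}
    for i in set().union(*transactions):
        s = support(frozenset([i]))
        if s >= min_support:
            frequent[frozenset([i])] = s
    result = dict(frequent)

    k = 2
    while frequent:
        # Step 2: join frequent (k-1)-itemsets into k-item candidates.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Step 3 (apriori property): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        # Step 4: keep candidates meeting the threshold, then repeat.
        frequent = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                frequent[c] = s
        result.update(frequent)
        k += 1
    return result

# Hypothetical usage:
transactions = [{"A", "B"}, {"A", "B", "C"}, {"B", "C"}, {"A", "C"}, {"A", "B", "C"}]
for itemset, sup in sorted(apriori(transactions, 0.4).items(), key=lambda kv: -kv[1]):
    print(set(itemset), sup)
```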

2.2 Finding Frequent Itemsets by Confined Candidate Generation

 Concept: Reduces the search space by generating candidate (k+1)-itemsets only from
frequent k-itemsets and discarding any candidate that has an infrequent subset.
 Efficiency Improvement: Reduces computational overhead and memory usage.

2.3 FPGrowth (Frequent Pattern Growth)

 Concept: Avoids candidate generation by using a compact data structure called FP-
tree (Frequent Pattern Tree).
 Steps:
1. Construct FP-Tree: Scan the dataset and build the FP-tree by inserting
transactions.
2. Mine FP-Tree: Extract frequent itemsets by traversing the FP-tree and
generating patterns.
 Advantages: Generally more efficient than Apriori, particularly for large datasets.
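A minimal sketch of the tree-construction step only (the recursive mining pass is omitted for brevity): items in each transaction are reordered by global frequency so that transactions sharing a prefix share tree nodes. The dataset is hypothetical:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    """Build an FP-tree: drop infrequent items, order the rest by descending
    global frequency, then insert each transaction along shared prefixes."""
    counts = Counter(item for t in transactions for item in t)
    kept = {i for i, c in counts.items() if c >= min_support_count}
    root = FPNode(None, None)
    for t in transactions:
        # Sort kept items by frequency (ties broken alphabetically).
        ordered = sorted((i for i in t if i in kept),
                         key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

# Hypothetical usage: the shared "bread" prefix is stored once, with a count.
transactions = [{"bread", "milk"}, {"bread", "diapers"}, {"bread", "milk", "diapers"}]
root = build_fp_tree(transactions, min_support_count=2)
print({item: node.count for item, node in root.children.items()})  # {'bread': 3}
```

This compression of shared prefixes is why FP-Growth avoids the repeated dataset scans that Apriori's candidate counting requires.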

2.4 Generating Association Rules from Frequent Itemsets

 Steps:
1. Generate Rules: For each frequent itemset, generate rules by splitting the
itemset into antecedents and consequents.
2. Calculate Metrics: Compute support, confidence, and lift to evaluate the
strength and usefulness of each rule.
 Example: From the frequent itemset {A, B, C}, generate rules like {A, B} -> {C}.
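Splitting an itemset into antecedent/consequent pairs is just subset enumeration; a sketch for the {A, B, C} example:

```python
from itertools import combinations

def candidate_rules(itemset):
    """Every non-empty proper subset becomes an antecedent;
    the remaining items form the consequent."""
    itemset = frozenset(itemset)
    for k in range(1, len(itemset)):
        for ante in combinations(sorted(itemset), k):
            yield set(ante), set(itemset) - set(ante)

for a, c in candidate_rules({"A", "B", "C"}):
    print(f"{sorted(a)} -> {sorted(c)}")
```

A 3-itemset yields 6 candidate rules; each would then be scored with support, confidence, and lift as described above.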

2.5 Improving the Efficiency of Apriori

 Techniques:
1. Transaction Reduction: Remove transactions that cannot contain any frequent
itemsets from subsequent scans.
2. Itemset Pruning: Use the fact that if an itemset is infrequent, all its supersets
will be infrequent.
3. Partitioning: Partition the database into smaller chunks to reduce
computation.
 Example: Implementing a hash-based technique to reduce candidate itemsets.
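Transaction reduction (technique 1) can be sketched as follows: a transaction with fewer than two frequent items cannot contribute to any frequent 2-itemset, so it can be dropped before the next scan. The transactions and threshold below are hypothetical:

```python
from collections import Counter

# Hypothetical transactions and threshold, for illustration only.
transactions = [{"A", "B"}, {"A", "B", "C"}, {"D"}, {"A", "C"}, {"E", "F"}]
MIN_SUPPORT_COUNT = 2

# Pass 1: count single items and find the frequent ones.
counts = Counter(item for t in transactions for item in t)
frequent_items = {i for i, c in counts.items() if c >= MIN_SUPPORT_COUNT}

# Transaction reduction: keep only transactions that could still hold a
# frequent 2-itemset, and strip their infrequent items while we're at it.
reduced = [t & frequent_items for t in transactions
           if len(t & frequent_items) >= 2]

print(len(transactions), "->", len(reduced))  # 5 -> 3
```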
3. From Association Analysis to Correlation Analysis

Association analysis and correlation analysis are both important techniques in data mining
and statistics, but they serve different purposes and are used in different contexts. Here’s a
detailed discussion on how these analyses relate and transition from one to the other, with
examples for clarity.

1. Overview of Association Analysis

Association Analysis primarily focuses on identifying relationships and patterns within
transactional or categorical data. It is often used to discover frequent itemsets and derive
association rules that reveal how items are related. The metrics used in association analysis
include:

 Support: Measures how often an itemset appears in the dataset.
 Confidence: Indicates how often the consequent of the rule appears when the
antecedent is present.
 Lift: Measures how much more likely the consequent is to be present when the
antecedent is present, compared to the expected likelihood if the antecedent and
consequent were independent.

Example: In a retail dataset, if customers who buy bread often buy milk, an association rule
might be Bread → Milk, with metrics such as support = 40%, confidence = 75%, and
lift = 1.2.

2. Overview of Correlation Analysis

Correlation Analysis examines the relationship between two continuous variables to
determine whether they move together and the strength of their relationship. Unlike
association analysis, which works with categorical data and focuses on itemsets, correlation
analysis uses statistical measures to quantify relationships between continuous variables. Key
metrics include:

 Pearson Correlation Coefficient (r): Measures the linear relationship between two
variables. Values range from -1 to 1, where -1 indicates a perfect negative linear
relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear
relationship.
 Spearman's Rank Correlation: Measures the strength and direction of the
association between two ranked variables. It’s useful for non-linear relationships.
 Kendall’s Tau: Another rank-based measure of correlation, useful for small sample
sizes and when dealing with ordinal data.

Example: In a dataset of students’ study hours and exam scores, Pearson correlation might
reveal a strong positive correlation (r = 0.85), indicating that more study hours are associated
with higher exam scores.
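Pearson's r can be computed directly from its definition (covariance over the product of standard deviations). The study-hours data below are made up for illustration and do not reproduce the r = 0.85 figure above:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Hypothetical data: study hours vs. exam scores.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 60, 68, 74, 83, 88]
print(round(pearson_r(hours, scores), 3))
```

In practice one would typically use a library routine (e.g. `scipy.stats.pearsonr`), which also reports a significance level.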

3. Transition from Association to Correlation Analysis

While association analysis is used to find relationships between categorical items or events,
correlation analysis is used to quantify the strength and direction of relationships between
continuous variables. Here’s how you can transition from association analysis to correlation
analysis:

1. Data Preparation:
o Association Analysis: Deals with categorical data and is often used in market
basket analysis, where the goal is to find patterns like {bread, milk} →
{butter}.
o Correlation Analysis: Requires continuous data and is used to explore how
changes in one continuous variable affect another, such as how temperature
affects ice cream sales.
2. Finding Relationships:
o Association Analysis: Uses rules and metrics like support, confidence, and lift
to describe how often items are purchased together and how strong the
relationship is.
o Correlation Analysis: Uses statistical measures to describe how changes in
one variable are associated with changes in another variable.
3. Applications:
o Association Analysis: Used for recommendations (e.g., suggesting products),
understanding consumer behavior, and identifying common itemsets.
o Correlation Analysis: Used for predicting values, understanding trends, and
exploring possible relationships (bearing in mind that correlation alone does
not establish causation).

4. Example of Transition

Let’s illustrate with an example involving a retail store:

Dataset: Transactions from a store including items bought and customer demographics like
age and income.

1. Association Analysis:
o Frequent Itemsets: {Milk, Bread}, {Diapers, Beer}
o Association Rule: {Milk} → {Bread}, with support = 60%, confidence =
75%, and lift = 0.9375.
2. Correlation Analysis:
o Data: Suppose we have continuous data on customer age and their total
spending.
o Pearson Correlation: Calculate the Pearson correlation coefficient between
age and total spending.
 If the correlation coefficient is 0.65, it suggests a moderate positive
relationship, meaning as customers get older, their spending tends to
increase.

Combined Insight:

 Association Analysis helps identify that people who buy milk often buy bread.
 Correlation Analysis might reveal that older customers spend more, possibly leading
to targeted marketing strategies.

5. Practical Considerations

 Data Type: Association analysis is suited for categorical data, while correlation
analysis is used for continuous data.
 Objective: Use association analysis to find patterns and relationships in categorical
data and correlation analysis to understand relationships between continuous
variables.
 Interpretation: Association rules provide actionable insights for categorical patterns
(e.g., “Customers who buy milk often buy bread”), while correlation analysis helps
understand continuous relationships (e.g., “Older customers tend to spend more”).

6. Conclusion

Transitioning from association analysis to correlation analysis involves understanding the
type of data and the goal of the analysis. While association analysis is used for discovering
relationships between categorical items, correlation analysis quantifies the strength and
direction of relationships between continuous variables. Both techniques provide valuable
insights that can be used in various applications, from market basket analysis to
understanding customer behavior and trends.
