MLDAP Mod3
K-Means is an unsupervised machine learning algorithm used to partition a dataset into K distinct, non-overlapping
subsets (clusters). The goal of K-means is to minimize the intra-cluster variance, meaning that data points within the
same cluster are as similar as possible, and data points in different clusters are as dissimilar as possible.
Key Concepts
1. Cluster Centroids: Each cluster is represented by its centroid, which is the mean of the points in that
cluster.
2. Euclidean Distance: K-Means uses the Euclidean distance between data points and cluster centroids to
assign each point to the closest cluster.
3. Convergence: The algorithm continues iterating through the process of assigning points to the nearest
centroid and then recalculating the centroids until no points change clusters (convergence).
Algorithm Steps
1. Initialization:
o Choose K initial centroids randomly from the data points.
2. Assignment Step:
o For each data point, calculate the distance to each centroid. Assign each point to the cluster
corresponding to the closest centroid.
3. Update Step:
o Recompute the centroids by calculating the mean of all points assigned to each cluster.
4. Repeat:
o Repeat the assignment and update steps until the centroids no longer change, indicating
convergence.
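These four steps can be sketched directly in NumPy. The snippet below is a minimal, illustrative implementation (the function name kmeans_numpy and its defaults are hypothetical, not from any library); in practice scikit-learn's KMeans, used in the example that follows, is preferred.
import numpy as np

def kmeans_numpy(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: choose K initial centroids randomly from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        clusters = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points
        # (assumes no cluster becomes empty)
        new_centroids = np.array([X[clusters == j].mean(axis=0) for j in range(k)])
        # Convergence: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, clusters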
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data with 4 centers
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Fit K-Means (the assignment and update steps run internally until convergence)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Plot the clustered points
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', s=30)

# Plot centroids
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X')
plt.title("K-Means Clustering with 4 Clusters")
plt.show()
Explanation of Code:
1. Data Generation: We create synthetic data using make_blobs with 4 centers to mimic a real-world clustering problem.
2. KMeans: The KMeans class from sklearn is used to cluster the data into 4 clusters.
3. Visualization: The clusters are visualized using matplotlib, with each cluster shown in a different color.
The centroids are marked with red X’s.
Choosing the Number of Clusters
One of the challenges in K-Means is selecting the correct number of clusters, K. There are several methods to determine the optimal K:
1. Elbow Method:
o Plot the sum of squared errors (SSE) for different values of K and look for an "elbow," where the
SSE starts to decrease at a slower rate. This "elbow" is a good estimate of the optimal K.
# Elbow method: compute SSE (inertia) for a range of K values
sse = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    sse.append(kmeans.inertia_)
plt.plot(K_range, sse, marker='o')
plt.xlabel("Number of clusters K")
plt.ylabel("SSE")
plt.title("Elbow Method")
plt.show()
2. Silhouette Score:
o This method measures how similar an object is to its own cluster compared to other clusters. A
higher silhouette score indicates better-defined clusters.
from sklearn.metrics import silhouette_score
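A small sketch of using the silhouette score to compare candidate values of K (assuming X is the synthetic dataset created above):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare silhouette scores for several candidate values of K
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"K={k}: silhouette score = {silhouette_score(X, labels):.3f}")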
K-Means: Limitations
Sensitive to Initialization: The final result can depend on the initial choice of centroids. This can lead to
suboptimal clustering.
o Solution: Use k-means++, which selects initial centroids in a smarter way.
Number of Clusters: K-Means requires you to specify the number of clusters in advance.
Non-spherical Clusters: K-Means assumes that clusters are spherical and of similar size, which may not
work well for data with non-convex or elongated clusters.
Outliers: K-Means is sensitive to outliers, as they can disproportionately affect the mean.
Hierarchical Clustering is an unsupervised machine learning algorithm used to group data points into clusters based
on their similarity. Unlike K-Means, which requires the number of clusters K to be specified beforehand,
hierarchical clustering doesn’t need a predefined number of clusters. It builds a hierarchy of clusters, which can be
visualized in a dendrogram.
1. Agglomerative (Bottom-up): This is the most common method. It starts with each data point as its own
cluster and then iteratively merges the closest clusters until all points belong to one cluster.
2. Divisive (Top-down): This starts with one large cluster and recursively splits it into smaller clusters.
We'll focus on Agglomerative Hierarchical Clustering in this explanation.
Linkage Criteria
The linkage criterion determines how the distance between two clusters is calculated. The most commonly used linkage methods are:
Single linkage: the minimum distance between any point in one cluster and any point in the other.
Complete linkage: the maximum distance between any point in one cluster and any point in the other.
Average linkage: the average distance over all pairs of points across the two clusters.
Ward's linkage: the merge that produces the smallest increase in total within-cluster variance.
The main mathematical concept in hierarchical clustering is the distance matrix, which contains the pairwise
distances between every pair of points (or clusters). When two clusters are merged, the new distance between the
merged cluster and other clusters must be computed based on the linkage criterion.
Agglomerative Clustering: Dendrogram
A dendrogram is a tree-like diagram that records the merges (or splits in divisive clustering). It shows the hierarchy
of clusters, with the vertical axis representing the distance at which clusters were merged. The shorter the vertical
lines, the closer the clusters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate synthetic 2D data with 4 centers
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Agglomerative clustering with Ward's linkage (minimizing variance)
labels = AgglomerativeClustering(n_clusters=4, linkage='ward').fit_predict(X)
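A minimal continuation of the snippet above (using the same X and labels) that plots the clusters and the dendrogram described in the explanation below:
# Plot the clustered points, color-coded by cluster
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
plt.title("Agglomerative Clustering (Ward linkage)")
plt.show()

# Build the linkage matrix and plot the dendrogram
Z = linkage(X, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(Z)
plt.title("Dendrogram")
plt.show()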
Explanation of Code:
1. Data Generation: make_blobs creates synthetic 2D data with 4 centers to simulate a clustering problem.
2. Agglomerative Clustering: The AgglomerativeClustering class from sklearn is used to perform the
clustering. The linkage='ward' parameter specifies the use of Ward's method (minimizing variance).
3. Data Visualization: The clustered data is plotted using matplotlib, where each cluster is color-coded.
4. Dendrogram: The linkage function from scipy.cluster.hierarchy is used to generate a linkage matrix,
which is then plotted to create a dendrogram.
Hierarchical clustering doesn't require the number of clusters to be pre-specified. However, in practice, we often use
the dendrogram to choose an optimal cutoff. The cutoff is typically chosen by looking for the largest vertical
distance without any horizontal lines crossing it. This corresponds to the number of clusters at that level.
Horizontal Cut: By drawing a horizontal line through the dendrogram at a particular level, we can
determine how many clusters to form.
Advantages of Hierarchical Clustering
1. No Need to Specify Number of Clusters: Hierarchical clustering can be visualized with a dendrogram,
making it easy to choose the number of clusters.
2. Flexibility: It can capture complex cluster structures, including non-spherical clusters.
3. Hierarchy: Hierarchical clustering provides a full hierarchy of clusters, which is useful for understanding
the relationships between different data points.
Applications of Hierarchical Clustering
1. Gene Expression Analysis: Hierarchical clustering is commonly used in bioinformatics to group genes
with similar expression patterns.
2. Customer Segmentation: In marketing, hierarchical clustering helps segment customers based on
purchasing behavior or demographics.
3. Document Clustering: It can be used for text mining to group similar documents together.
Evaluation of Clustering Results
Clustering is an unsupervised machine learning technique, which means there are no predefined labels or ground
truth to compare the results against. Therefore, evaluating the performance of clustering algorithms can be more
challenging than evaluating supervised learning models.
However, several evaluation techniques have been developed to assess clustering results. These methods generally
fall into two categories:
1. Internal Evaluation: Measures that evaluate the quality of clusters based on the data itself, without
requiring external labels.
2. External Evaluation: Measures that compare the clustering results to external, predefined labels (ground
truth).
Internal evaluation metrics do not rely on external labels and focus solely on the structure of the data and the
clustering solution. The aim is to evaluate how well the clusters fit the data.
The Silhouette Score measures how similar each data point is to its own cluster (cohesion) compared to other
clusters (separation). A higher silhouette score indicates better-defined clusters.
The overall silhouette score is the average silhouette score of all points in the dataset.
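For a single point i, with a(i) the mean distance to points in its own cluster and b(i) the mean distance to points in the nearest other cluster, the silhouette value is commonly defined as:
s(i) = (b(i) − a(i)) / max(a(i), b(i))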
Range: [-1, 1]
o A score close to 1 indicates well-separated clusters.
o A score close to 0 indicates overlapping clusters.
o A score close to -1 indicates that points are wrongly assigned to clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
# Fit KMeans
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
# Calculate silhouette score
score = silhouette_score(X, labels)
print(f"Silhouette Score: {score:.3f}")
The Dunn Index measures the ratio of the minimum distance between clusters to the maximum intra-cluster
distance. The higher the Dunn Index, the better the clustering result.
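In symbols (a common formulation, with δ(C_i, C_j) the distance between clusters i and j and Δ(C_k) the diameter, i.e. maximum intra-cluster distance, of cluster k):
D = min_{i≠j} δ(C_i, C_j) / max_k Δ(C_k)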
Interpretation: a higher Dunn Index means the clusters are compact (small intra-cluster distances) and well separated (large inter-cluster distances).
The Davies-Bouldin Index (DBI) is a measure of the average similarity ratio of each cluster with the cluster that is
most similar to it. A lower DBI indicates better clustering.
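A standard formulation (with s_i the average distance of points in cluster i to its centroid and d_ij the distance between the centroids of clusters i and j) is:
DBI = (1/k) Σ_{i=1..k} max_{j≠i} (s_i + s_j) / d_ij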
from sklearn.metrics import davies_bouldin_score
# Fit KMeans
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
dbi = davies_bouldin_score(X, labels)
print(f"Davies-Bouldin Index: {dbi:.3f}")
The Inertia (or Within-cluster Sum of Squares) is a measure of how compact the clusters are. It is defined as the
sum of squared distances between each data point and its assigned cluster center.
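In symbols, for points x_i and their assigned cluster centers c_(i):
Inertia = Σ_{i=1..n} ||x_i − c_(i)||²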
# Get inertia
inertia = kmeans.inertia_
print(f"Inertia (Within-cluster Sum of Squares): {inertia}")
External evaluation metrics assess clustering quality by comparing the clustering output to a ground truth. These
measures require labeled data.
The Adjusted Rand Index (ARI) is a measure of the similarity between two data clusterings, adjusted for chance. It
ranges from -1 (completely different) to 1 (perfectly matching). A score of 0 indicates random clustering.
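A minimal sketch using scikit-learn's adjusted_rand_score; the synthetic data below is illustrative, with the labels produced by make_blobs serving as the ground truth:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data whose generating labels act as ground truth
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

ari = adjusted_rand_score(y_true, labels)
print(f"Adjusted Rand Index: {ari:.3f}")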
Normalized Mutual Information (NMI) measures the amount of information shared between the clustering and
the true labels. It’s normalized to range between 0 (no mutual information) and 1 (perfect agreement).
Python Code for NMI
from sklearn.metrics import normalized_mutual_info_score
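Continuing with the illustrative y_true and labels from the ARI sketch above:
# Compare the clustering to the ground-truth labels
nmi = normalized_mutual_info_score(y_true, labels)
print(f"Normalized Mutual Information: {nmi:.3f}")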
The Fowlkes-Mallows Index (FMI) measures the geometric mean of precision and recall for clustering. It evaluates
how well the clustering matches the true labels.
Where:
TP is the number of true positives (pairs of points that are in the same cluster and have the same true label).
FP is the number of false positives (pairs of points that are in the same cluster but have different true labels).
FN is the number of false negatives (pairs of points that are in different clusters but have the same true label).
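In terms of these pair counts, the index is commonly written as:
FMI = TP / sqrt((TP + FP)(TP + FN))
scikit-learn provides this metric as sklearn.metrics.fowlkes_mallows_score(y_true, labels).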
Principal Component Analysis (PCA) is a widely-used dimensionality reduction technique that transforms a
dataset into a new coordinate system where the axes are ordered by the variance of the data. The first axis (or
component) represents the direction of maximum variance, the second axis is orthogonal to the first and represents
the second-highest variance, and so on. PCA is commonly used in fields like data science, machine learning, image
processing, and biology to reduce the dimensionality of data while preserving as much information as possible.
Sort the eigenvalues in descending order. The eigenvectors corresponding to the largest eigenvalues are the most
important components (principal components), as they explain the most variance in the data.
Mathematical Example: Solving PCA
Consider the small 2D dataset used in the Python example below: (2, 3), (3, 4), (4, 5), (5, 6). Center the data, compute the covariance matrix, find its eigenvalues and eigenvectors, and project the centered data onto the eigenvector with the largest eigenvalue.
This gives us the transformed data along the first principal component.
Now, we have reduced the dimensionality of the original data from 2D to 1D, capturing most of the variance with
just one principal component.
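These steps can also be carried out directly with NumPy. The sketch below is illustrative, reusing the same sample data as the scikit-learn example that follows:
import numpy as np

# Same sample data as the sklearn example below
X = np.array([[2, 3], [3, 4], [4, 5], [5, 6]], dtype=float)

# Step 1: center the data
X_centered = X - X.mean(axis=0)
# Step 2: covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)
# Step 3: eigen-decomposition, sorted by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# Step 4: project onto the first principal component (2D -> 1D)
X_proj = X_centered @ eigvecs[:, :1]

print("Explained variance ratio:", eigvals / eigvals.sum())
print("Projected data:\n", X_proj)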
PCA in Python
import numpy as np
from sklearn.decomposition import PCA

# Sample data
X = np.array([[2, 3], [3, 4], [4, 5], [5, 6]])
# Fit PCA and project onto the first principal component
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X)

# Display results
print("Principal Components:")
print(pca.components_)
print("Projected Data:")
print(X_pca)
Applications of PCA
1. Dimensionality Reduction: PCA is used to reduce the number of features in a dataset, making it easier to
visualize and analyze.
2. Data Preprocessing: PCA is used as a preprocessing step in machine learning to reduce overfitting and
improve computational efficiency.
3. Image Compression: PCA is widely used in image compression, where high-dimensional data (e.g., pixel
values) are reduced to fewer dimensions.
4. Anomaly Detection: PCA can highlight outliers or anomalies by identifying points that deviate
significantly from the principal components.
Advantages:
Reduces Dimensionality: PCA helps reduce the number of features while retaining the most important
variance.
Improves Computational Efficiency: Less data to process means faster computations.
Removes Correlations: PCA removes correlated features by projecting data into uncorrelated axes.
Disadvantages:
Loss of Interpretability: The new features (principal components) may not be easily interpretable.
Assumes Linear Relationships: PCA assumes linearity and may not capture complex, nonlinear
relationships.
Sensitivity to Outliers: PCA can be sensitive to outliers in the data.
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a supervised machine learning technique used for dimensionality reduction
and classification. LDA is particularly effective when the data is categorized into distinct classes, and the goal is to
reduce the dimensionality of the data while maintaining the class separability as much as possible.
Unlike Principal Component Analysis (PCA), which is unsupervised and focuses on maximizing variance, LDA
aims to maximize the separability between different classes in the dataset. This is achieved by finding a linear
combination of features that best separates the classes.
1. Class Separability: LDA seeks to find the axes that best separate the data points belonging to different
classes.
2. Fisher's Criterion: LDA uses Fisher's criterion to evaluate the separability between the classes. It seeks to
maximize the ratio of between-class variance to within-class variance.
3. Dimensionality Reduction: Like PCA, LDA reduces the dimensionality of the dataset, but it also
considers the class labels while doing so.
4. Linear Decision Boundaries: LDA works by finding linear decision boundaries between different classes.
Once the eigenvectors and eigenvalues are computed, the eigenvectors are sorted in decreasing order of the
eigenvalues. The eigenvectors corresponding to the largest eigenvalues are selected to form a lower-dimensional
space for the transformed data.
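In symbols, Fisher's criterion (standard formulation, using the between-class scatter matrix S_B and the within-class scatter matrix S_W) is:
J(w) = (wᵀ S_B w) / (wᵀ S_W w)
The optimal projection directions w are the leading eigenvectors of S_W⁻¹ S_B, which is the eigenvalue problem referred to above.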
Example Problem:
Using the small dataset from the Python example below (class C1: (2, 3), (3, 4); class C2: (4, 5), (5, 6)):
Step 1: Compute the Mean Vectors. The mean vector of C1 is (2.5, 3.5) and the mean vector of C2 is (4.5, 5.5). The remaining steps (building the within-class and between-class scatter matrices and solving the eigenvalue problem) follow the procedure described above.
LDA in Python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Sample data
X = np.array([[2, 3], [3, 4], [4, 5], [5, 6]])  # Features
y = np.array([0, 0, 1, 1])                      # Class labels (0: C1, 1: C2)

# Standardize the features, then fit LDA (with 2 classes, at most 1 component can be kept)
X_scaled = StandardScaler().fit_transform(X)
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X_scaled, y)

# Display results
print("Transformed Data:")
print(X_lda)
Applications of LDA
Advantages:
Supervised Dimensionality Reduction: LDA uses the class labels to find projections that preserve class separability, unlike PCA.
Improved Class Separation: By maximizing the ratio of between-class to within-class variance, the reduced representation is well suited for classification.
Disadvantages:
Assumption of Normality: LDA assumes that data for each class is normally distributed
and that the covariance matrices of the classes are identical.
Linear Boundaries: LDA only works well when the classes can be separated by a linear
decision boundary. If classes are not linearly separable, LDA may not perform well.
t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a non-linear dimensionality reduction technique used mainly to visualize high-dimensional data in 2D or 3D. The algorithm proceeds as follows:
1. Calculate Pairwise Distances: Compute pairwise distances between all points in the high-dimensional
space.
2. Compute Affinities: Calculate the probability distributions pij for all pairs of points in the high-
dimensional space using a Gaussian distribution.
3. Low-Dimensional Mapping: Initialize the low-dimensional representation of the data (usually random)
and calculate the initial affinities qij in the low-dimensional space using a t-distribution.
4. Minimize KL Divergence: Use gradient descent to minimize the KL divergence between the high-dimensional and low-dimensional probability distributions (the objective minimized is shown after this list).
5. Visualize the Results: Once the optimization converges, the low-dimensional representation can be used
for visualization, typically in 2D or 3D.
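In symbols, the cost function minimized in step 4 is the KL divergence between the pairwise affinities P in the original space and Q in the embedding (the standard t-SNE objective, stated here for reference):
C = KL(P || Q) = Σ_{i≠j} p_ij log(p_ij / q_ij)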
As a thought experiment, consider a small set of 2D points and the task of reducing them to 1D with t-SNE. We would compute the pairwise affinities p_ij in the original space, initialize a random 1D embedding, compute the affinities q_ij with the t-distribution, and run gradient descent on the KL divergence. In practice, this optimization is complex, so most users rely on existing libraries (such as scikit-learn) to apply t-SNE to larger datasets.
t-SNE in Python
We can easily apply t-SNE in Python using the scikit-learn library. Here is an example using a simple dataset:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris

# Load the Iris dataset (150 samples, 4 features, 3 classes)
X, y = load_iris(return_X_y=True)
# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, n_iter=300)
X_tsne = tsne.fit_transform(X)
# Visualize the 2D projection, colored by class
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.title("t-SNE projection of the Iris dataset")
plt.show()
In this code:
We load the Iris dataset, a popular dataset with 4 features and 150 samples from 3 classes.
We apply t-SNE with 2 components (2D projection), using a perplexity of 30 and 300 iterations.
The resulting 2D projection is visualized using matplotlib, with different colors representing different
classes.
Applications of t-SNE
1. Data Visualization: t-SNE is mainly used for visualizing high-dimensional datasets. It is especially useful
when trying to understand the relationships between data points in clusters.
2. Exploratory Data Analysis: It helps in understanding the structure of the data, identifying potential
patterns or anomalies.
3. Feature Engineering: t-SNE can be used to reduce the dimensionality of data before applying other
machine learning algorithms, especially when dealing with complex and high-dimensional data.
4. Deep Learning: t-SNE can be applied to visualize the activations of deep neural networks or to explore the
embedding of data in lower-dimensional spaces.
Advantages:
Captures Non-linear Relationships: Unlike PCA, t-SNE is capable of capturing complex, non-linear
relationships in the data.
Preserves Local Structure: It preserves the local structure of the data well, making it excellent for
clustering and visualization of data points that are close to each other in the high-dimensional space.
Effective for Visualization: It is particularly effective for visualizing high-dimensional datasets in 2D or
3D.
Disadvantages:
Computationally Expensive: t-SNE can be computationally intensive, especially for large datasets, as it
requires pairwise distance computations and iterative optimization.
Sensitivity to Hyperparameters: The choice of perplexity and learning rate can significantly affect the
results. The wrong choice may lead to poor visualizations.
Doesn't Preserve Global Structure: While t-SNE excels at preserving local structure, it does not always
preserve the global structure of the data, such as the relative distances between clusters.
Stochastic: The results of t-SNE can vary slightly due to its probabilistic nature and the random
initialization of points in the low-dimensional space.
Association Rule Mining
An association rule is a relationship between two or more items in a dataset. The rule is typically in the form: A⇒B
Components of a Rule:
Antecedent (Left-Hand Side, LHS): The condition or the items that precede the arrow (e.g., A in A ⇒ B).
Consequent (Right-Hand Side, RHS): The items that follow the arrow (e.g., B in A ⇒ B).
Key Metrics:
1. Support: The support of an itemset is the proportion of transactions in the dataset that contain the itemset.
2. Confidence: Confidence is a measure of the likelihood that the rule is correct. It represents the probability
that item B is purchased when item A is purchased.
3. Lift: Lift compares the observed frequency of A and B together to the frequency expected if A and B were
independent. A lift value greater than 1 indicates that the two items appear together more often than
expected, suggesting a meaningful association.
4. Conviction: This is another metric used to evaluate the rule's strength. It measures the degree to which the
occurrence of A (on the LHS) ensures the occurrence of B (on the RHS), considering how often B would
occur without A.
3. Steps in Association Rule Mining:
The process of mining association rules typically involves two main phases:
The first step is to identify frequent itemsets, which are groups of items that appear together in a dataset
above a certain frequency (threshold, called minimum support).
Apriori Algorithm and FP-Growth Algorithm are popular methods used for this purpose.
Apriori Algorithm:
o Starts with a single item and progressively explores combinations of larger itemsets.
o The key principle: If an itemset is frequent, all of its subsets must also be frequent.
o This is done in iterations, where in each iteration, the algorithm generates candidate itemsets and
counts their support.
FP-Growth Algorithm:
o A more efficient alternative to Apriori that uses a compact tree structure called the FP-tree to
represent frequent itemsets. It avoids candidate generation like Apriori, making it faster for larger
datasets.
Once the frequent itemsets are identified, we generate rules from them. This is done by finding all possible
combinations of items in the frequent itemsets and calculating their confidence.
The rules are then filtered using the minimum confidence threshold to keep only those that meet the
required confidence level.
4. Apriori Algorithm:
1. Generate Candidate Itemsets: Start with itemsets of size 1, count the support for each item, and remove those below the minimum support threshold.
2. Extend Frequent Itemsets: Merge the frequent itemsets from the previous pass to form larger candidate itemsets, count their support, and again prune those below the minimum support threshold. Repeat until no new frequent itemsets are found.
3. Generate Rules: Once we have the frequent itemsets, generate all possible rules. Calculate the confidence
and discard rules that do not meet the minimum confidence threshold.
Example of Apriori:
Suppose in a retail store, you analyze the transactions and find the following:
o 50% of transactions contain both bread and butter.
o 60% of transactions contain bread.
o 40% of transactions contain butter.
You could generate an association rule: bread ⇒ butter, and calculate its support, confidence, and lift.
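Plugging these numbers into the metric definitions above: Support(bread ⇒ butter) = 0.50; Confidence(bread ⇒ butter) = 0.50 / 0.60 ≈ 0.83; Lift = 0.83 / 0.40 ≈ 2.08, which is well above 1 and indicates a strong positive association between bread and butter.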
5. FP-Growth Algorithm:
The FP-Growth (Frequent Pattern Growth) algorithm avoids candidate generation and is faster than Apriori for
large datasets. Here's how it works:
1. Construct FP-Tree:
o The first step is to build a compressed FP-tree using the frequency of individual items.
o In this tree, each path represents a frequent itemset.
2. Recursive Mining:
o The tree is then recursively mined by creating conditional FP-trees for each frequent itemset.
o Frequent patterns are mined directly from the tree without generating candidates.
Applications of Association Rule Mining
Market Basket Analysis: This is the classic application, where retailers analyze items bought together by
customers to determine product bundling strategies.
Recommendation Systems: Association rules are used to suggest products to customers based on their
past behaviors.
Medical Data Mining: In healthcare, association rule mining can uncover relationships between diseases
and symptoms.
Web Usage Mining: Analyzing web access patterns to understand how users navigate a website.
Challenges in Association Rule Mining
Scalability: As datasets grow larger, the computation for finding frequent itemsets and generating rules
becomes more intensive.
High-dimensional data: In some domains (e.g., healthcare), there are too many variables or items, making
it difficult to find meaningful rules.
Interpretability: The generated rules can sometimes be too complex or unintuitive, making it hard to
derive actionable insights.
8. Advanced Techniques:
Multi-level Association Mining: Rules are mined at different levels of abstraction (e.g., item categories,
sub-categories).
Temporal Association Rules: Used to find rules that hold over time (e.g., sales patterns across seasons).
Quantitative Association Rules: These involve numerical values (e.g., the number of items bought in a
transaction) and may require different metrics like correlation coefficients.
Apriori Algorithm
The Apriori Algorithm is one of the most popular algorithms used for Association Rule Mining. It works by
identifying frequent itemsets in a transactional database and generating rules that have high support and confidence.
Key Concepts in Apriori Algorithm
1. Generate candidate itemsets of length 1 (single items) and check their support.
2. Iteratively generate larger itemsets (of length 2, 3, etc.) by merging frequent itemsets from the previous
step.
3. Prune itemsets: Any itemset that does not meet the minimum support threshold is discarded.
4. Generate rules: Once frequent itemsets are found, generate association rules from them by calculating
confidence.
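As a practical sketch, these steps can be run end-to-end with the mlxtend library (an assumption: mlxtend is not used elsewhere in these notes and must be installed separately; the transactions and thresholds below are illustrative):
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy transactions (illustrative only)
transactions = [
    ['Milk', 'Bread', 'Butter'],
    ['Milk', 'Bread'],
    ['Bread', 'Butter'],
    ['Milk', 'Cheese'],
    ['Bread', 'Butter', 'Cheese'],
]

# One-hot encode the transactions
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# Frequent itemsets with minimum support 40%
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)

# Rules filtered by minimum confidence 60%
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])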
Example:
We’ll apply the Apriori algorithm step-by-step to find frequent itemsets and generate association rules.
Step 1: Find Frequent 1-Itemsets (Single Items)
Transactions: 7
Now, compare these supports to the minimum support threshold (50%). Items with support greater than or equal to
50% are considered frequent.
Frequent 1-Itemsets:
A (Support = 0.714)
B (Support = 0.857)
C (Support = 0.857)
D (Support = 0.571)
E (Support = 0.571)
Now we combine frequent 1-itemsets to generate candidate 2-itemsets. We then calculate the support for each pair.
Candidate 2-Itemsets: every pair of frequent 1-itemsets is generated as a candidate and its support is computed; the pairs that meet the 50% threshold are listed under Frequent Itemsets below.
Candidate 3-Itemsets: none of the candidate 3-itemsets meet the minimum support threshold, so we stop here.
We now generate the association rules from the frequent itemsets (those that passed the support threshold).
For example:
(A, B) ⇒ C:
Support of (A, B, C) = 2/7 = 0.286
Confidence = Support(A, B, C) / Support(A, B) = 0.286 / 0.571 = 0.5 (the itemset (A, B, C) is not frequent, so this rule is discarded)
(A) ⇒ B:
Support of (A, B) = 0.571
Confidence = 0.571 / 0.714 = 0.8 (This is a strong rule with high confidence)
Frequent Itemsets:
1-itemsets: A, B, C, D, E
2-itemsets: (A, B), (A, C), (B, C), (B, D), (C, E)
Association Rules: for example, A ⇒ B (support = 0.571, confidence = 0.8) qualifies as a strong rule.
Market Basket Analysis
The goal of market basket analysis is to identify frequent itemsets (items that are often bought together) and
generate association rules that can help businesses make data-driven decisions on product placement, promotions,
and marketing strategies.
1. Prepare the Data: Organize the transactional data, where each transaction contains a set of items
purchased together.
2. Generate Frequent Itemsets: Use an algorithm (like Apriori) to find frequent itemsets that meet the
minimum support threshold.
3. Generate Association Rules: From the frequent itemsets, generate association rules and evaluate them
using metrics like confidence and lift.
4. Analyze and Interpret Results: Identify meaningful associations to apply in marketing strategies.
Solved Example:
Let's work through a Market Basket Analysis example with a small dataset of transactions. We’ll use the Apriori
Algorithm to find frequent itemsets and generate association rules.
Dataset:
Items that have support greater than or equal to the minimum support threshold of 40% are:
Candidate 2-Itemsets:
(Milk, Bread)
(Milk, Butter)
(Milk, Cheese)
(Bread, Butter)
(Bread, Cheese)
(Butter, Cheese)
Now, let's generate association rules with confidence and lift for each frequent itemset.
This rule means that if a customer buys Milk, there's an 80% chance they will also buy Bread. However, the lift value of 0.934 is slightly below 1, meaning Milk and Bread appear together slightly less often than would be expected if the two items were independent.
This rule means that if a customer buys Bread, there's a 66.7% chance they will also buy Milk. The lift value is the same (slightly below 1), again indicating a mildly negative association rather than a positive one.
This rule means that if a customer buys Milk, there's a 60% chance they will also buy Butter. The lift value of 1.05 is slightly above 1, indicating a mild positive association: Milk and Butter are bought together somewhat more often than expected by chance.
This rule means that if a customer buys Bread, there's a 50% chance they will also buy Butter. The lift value is slightly less than 1, indicating a weak, slightly negative association.
From the generated rules, we can draw conclusions to help inform business decisions:
Milk and Bread are frequently bought together, with a high confidence of 80%. This suggests that these
items could be placed near each other in the store or offered in promotions as a bundle.
Milk and Butter have a mild positive correlation (lift = 1.05). This indicates that they are bought together slightly more often than expected by chance, so a promotion or discount that bundles these items could increase sales.
Bread and Butter are also frequently bought together, but their lift is slightly below 1, so the association is weaker. The store might consider cross-promoting these items, but less aggressively than Milk and Butter.
1. Support:
Support measures the frequency or occurrence of an itemset (or combination of items) in the dataset. It helps to
identify the most commonly occurring itemsets.
Formula:
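The standard definition, consistent with the example below:
Support(A) = (number of transactions containing itemset A) / (total number of transactions)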
Example:
If we have 100 transactions and 20 of them contain itemset A (e.g., Milk and Bread), the support for A is: 0.2
High Support: Indicates that the itemset occurs frequently and is important.
Low Support: May indicate a less important or less frequent itemset.
Support helps in pruning the rules and itemsets that are less interesting. We often set a minimum support
threshold (e.g., 0.2 or 20%) to eliminate itemsets that appear too infrequently to be of practical use.
Rules with low support are less likely to be significant, as they only occur in a small subset of transactions.
2. Confidence:
Confidence measures the likelihood that item B is bought when item A is bought. It is a conditional probability of
B given A.
Formula:
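The standard definition, using the terms explained below:
Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)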
Where:
Support (A ∪ B) is the support of the rule A⇒B, i.e., the proportion of transactions that contain both A and B.
Support(A) is the support of item A, i.e., the proportion of transactions that contain A.
Example:
If 30 transactions contain both Milk and Bread, and 50 transactions contain Milk: Confidence (Milk ⇒ Bread) = 30/50 = 0.6
High Confidence: Means that A⇒B is a strong rule, as there is a high probability that B will be purchased
if A is bought.
Low Confidence: Indicates a weak rule, where buying A does not necessarily lead to buying B.
Confidence provides actionable insights. A rule with high confidence (e.g., 80%) means the antecedent (A)
strongly influences the consequent (B), making the rule valuable for targeted marketing or product
placement.
3. Lift:
Lift compares the observed support of the rule with the expected support if A and B were independent. It is used to
measure how much more likely B is purchased when A is purchased, relative to what would be expected by chance.
Formula:
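The standard definition, expressed in terms of the metrics above:
Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B) = Support(A ∪ B) / (Support(A) × Support(B))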
Where: Support(A ∪ B) is the observed support of A and B together, and Support(A) × Support(B) is the support that would be expected if A and B were independent.
Example:
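For instance, reusing the illustrative numbers from the Confidence example above and assuming Support(Bread) = 0.40: Lift(Milk ⇒ Bread) = 0.60 / 0.40 = 1.5, i.e., Bread is bought 1.5 times more often when Milk is bought than would be expected if the two were independent.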
Lift > 1: Indicates a positive association, meaning A and B are likely to be bought together more often than
expected by chance (they are correlated).
Lift = 1: Means A and B are independent (no correlation).
Lift < 1: Indicates a negative association, meaning A and B are less likely to be bought together than
expected by chance (they are inversely correlated).
Why Lift is Important:
Lift provides a normalized measure of the strength of the relationship between A and B. A lift > 1 suggests that A and B have a positive relationship, which is valuable in market basket analysis to identify items that complement each other.
4. Conviction:
Conviction is an alternative metric to Lift for evaluating the strength of association rules. It compares how often A would occur without B if the two items were independent with how often A actually occurs without B; a high value means the rule A ⇒ B is violated less often than chance alone would predict.
Formula:
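The standard definition, using the terms explained below:
Conviction(A ⇒ B) = (1 − Support(B)) / (1 − Confidence(A ⇒ B))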
Where: Support(B) is the support of the consequent B, and Confidence(A ⇒ B) is the confidence of the rule.
Example:
Support(Bread) = 0.40
Confidence(Milk ⇒ Bread) = 0.60
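Completing the calculation with these values: Conviction(Milk ⇒ Bread) = (1 − 0.40) / (1 − 0.60) = 0.60 / 0.40 = 1.5, meaning the rule would be wrong 1.5 times more often if Milk and Bread were independent than it actually is.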
Conviction is useful when you want to assess the “surprise factor” of an association. Higher conviction
values indicate a rule is more likely to hold true than expected.
5. Leverage:
Leverage measures how much more likely the rule A⇒B is to occur compared to what would be expected if A and
B were independent.
Formula:
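The standard definition, consistent with the worked example below:
Leverage(A ⇒ B) = Support(A ∪ B) − (Support(A) × Support(B))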
Example:
If:
Support(Milk) = 0.50
Support(Bread) = 0.40
Support(Milk ∪ Bread) = 0.30
Leverage (Milk⇒Bread)=0.30−(0.50×0.40)=0.30−0.20=0.10
Leverage helps quantify the "extra" occurrence of item B when item A is bought, beyond what would be expected by chance.