MLDAP Mod3

The document provides a comprehensive overview of K-Means and Hierarchical Clustering, detailing their algorithms, implementations, and evaluation methods. K-Means is an unsupervised learning algorithm that partitions data into K clusters, while Hierarchical Clustering builds a hierarchy of clusters without needing a predefined number of clusters. Additionally, it discusses various evaluation techniques for clustering results, including internal and external metrics.

K-Means Clustering: A Detailed Overview

What is K-Means Clustering?

K-Means is an unsupervised machine learning algorithm used to partition a dataset into K distinct, non-overlapping
subsets (clusters). The goal of K-means is to minimize the intra-cluster variance, meaning that data points within the
same cluster are as similar as possible, and data points in different clusters are as dissimilar as possible.

Key Concepts

1. Cluster Centroids: Each cluster is represented by its centroid, which is the mean of the points in that
cluster.
2. Euclidean Distance: K-Means uses the Euclidean distance between data points and cluster centroids to
assign each point to the closest cluster.
3. Convergence: The algorithm continues iterating through the process of assigning points to the nearest
centroid and then recalculating the centroids until no points change clusters (convergence).

Steps in K-Means Clustering

1. Initialization:
o Choose K initial centroids randomly from the data points.
2. Assignment Step:
o For each data point, calculate the distance to each centroid. Assign each point to the cluster
corresponding to the closest centroid.
3. Update Step:
o Recompute the centroids by calculating the mean of all points assigned to each cluster.
4. Repeat:
o Repeat the assignment and update steps until the centroids no longer change, indicating
convergence.

Mathematics behind K-Means
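
K-Means minimizes the within-cluster sum of squared distances. Using standard notation, with μ_k the centroid (mean) of cluster C_k, the objective is:

J = Σ_{k=1..K} Σ_{x_i ∈ C_k} ||x_i − μ_k||²

The assignment and update steps below each reduce (or leave unchanged) this objective, which is why the algorithm converges.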

K-Means Pseudo Code


Here’s a high-level pseudo-code for K-Means:

# Initialize centroids randomly from data points
centroids = initialize_centroids(data, K)

# Repeat until convergence
while not converged:
    # Assignment step
    clusters = assign_points_to_clusters(data, centroids)

    # Update step
    new_centroids = update_centroids(clusters)

    # Check for convergence (if centroids don't change)
    if has_converged(centroids, new_centroids):
        break
    else:
        centroids = new_centroids
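
To make these steps concrete, here is a minimal from-scratch sketch in NumPy (illustrative only; it assumes no cluster ever becomes empty and is not the scikit-learn implementation used below):

import numpy as np

def kmeans(data, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick K distinct data points as the starting centroids
    centroids = data[rng.choice(len(data), size=K, replace=False)]
    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for every point (Euclidean distance)
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([data[labels == k].mean(axis=0) for k in range(K)])
        # Convergence check: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels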

Python Implementation of K-Means

Let’s implement K-Means clustering in Python using the scikit-learn library.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Step 1: Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Step 2: Apply KMeans algorithm
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)

# Step 3: Visualize the results
plt.scatter(X[:, 0], X[:, 1], s=30, c=kmeans.labels_, cmap='viridis')

# Plot centroids
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X')
plt.title("K-Means Clustering with 4 Clusters")
plt.show()

Explanation of the Code:

1. Data Generation: We create synthetic data using make_blobs with 4 centers to mimic a real-world
clustering problem.
2. KMeans: The KMeans class from sklearn is used to cluster the data into 4 clusters.
3. Visualization: The clusters are visualized using matplotlib, with each cluster shown in a different color.
The centroids are marked with red X’s.

Choosing the Right Number of Clusters (K)

One of the challenges in K-Means is selecting the correct number of clusters, K. There are several methods to
determine the optimal K:
1. Elbow Method:
o Plot the sum of squared errors (SSE) for different values of K and look for an "elbow," where the
SSE starts to decrease at a slower rate. This is a good estimate of the optimal K.

# Elbow method
sse = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    sse.append(kmeans.inertia_)

plt.plot(K_range, sse, marker='o')
plt.title("Elbow Method for Optimal K")
plt.xlabel("Number of Clusters")
plt.ylabel("SSE (Inertia)")
plt.show()

2. Silhouette Score:
o This method measures how similar an object is to its own cluster compared to other clusters. A
higher silhouette score indicates better-defined clusters.

from sklearn.metrics import silhouette_score

sil_score = silhouette_score(X, kmeans.labels_)


print(f"Silhouette Score: {sil_score}")

K-Means: Limitations

 Sensitive to Initialization: The final result can depend on the initial choice of centroids. This can lead to
suboptimal clustering.
o Solution: Use k-means++, which selects initial centroids in a smarter way.
 Number of Clusters: K-Means requires you to specify the number of clusters in advance.
 Non-spherical Clusters: K-Means assumes that clusters are spherical and of similar size, which may not
work well for data with non-convex or elongated clusters.
 Outliers: K-Means is sensitive to outliers, as they can disproportionately affect the mean.

Hierarchical Clustering: Detailed Overview

What is Hierarchical Clustering?

Hierarchical Clustering is an unsupervised machine learning algorithm used to group data points into clusters based
on their similarity. Unlike K-Means, which requires the number of clusters K to be specified beforehand,
hierarchical clustering doesn’t need a predefined number of clusters. It builds a hierarchy of clusters, which can be
visualized in a dendrogram.

There are two main types of hierarchical clustering:

1. Agglomerative (Bottom-up): This is the most common method. It starts with each data point as its own
cluster and then iteratively merges the closest clusters until all points belong to one cluster.
2. Divisive (Top-down): This starts with one large cluster and recursively splits it into smaller clusters.
We'll focus on Agglomerative Hierarchical Clustering in this explanation.

How Agglomerative Hierarchical Clustering Works

The algorithm follows these steps:

1. Initialize: Start by treating each data point as a single cluster.


2. Compute the Distance Matrix: Calculate the pairwise distances between all data points or clusters.
Common distance metrics include Euclidean distance, Manhattan distance, etc.
3. Merge the Closest Clusters: At each step, find the two closest clusters and merge them.
4. Update the Distance Matrix: After merging two clusters, update the distance matrix to reflect the new
clusters.
5. Repeat: Continue merging the closest clusters until all points belong to a single cluster.

Linkage Criteria

The linkage criterion determines how the distance between two clusters is calculated. The most commonly used
linkage methods are:

 Single linkage: the minimum distance between any point in one cluster and any point in the other.
 Complete linkage: the maximum distance between points in the two clusters.
 Average linkage: the average of all pairwise distances between points in the two clusters.
 Ward's linkage: merges the pair of clusters that gives the smallest increase in total within-cluster variance.

Mathematics Behind Agglomerative Clustering

The main mathematical concept in hierarchical clustering is the distance matrix, which contains the pairwise
distances between every pair of points (or clusters). When two clusters are merged, the new distance between the
merged cluster and other clusters must be computed based on the linkage criterion.
Agglomerative Clustering: Dendrogram

A dendrogram is a tree-like diagram that records the merges (or splits in divisive clustering). It shows the hierarchy
of clusters, with the vertical axis representing the distance at which clusters were merged. The shorter the vertical
lines, the closer the clusters.

Python Code for Hierarchical Clustering (using scikit-learn)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage

# Step 1: Create synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Step 2: Apply Agglomerative Clustering
# (Ward linkage uses Euclidean distances by definition, so no distance metric needs to be passed)
agg_clustering = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels = agg_clustering.fit_predict(X)

# Step 3: Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title("Agglomerative Clustering")
plt.show()

# Step 4: Create a Dendrogram to visualize the clustering hierarchy
Z = linkage(X, method='ward')  # Use Ward's method to minimize variance
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title("Dendrogram for Agglomerative Clustering")
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.show()

Explanation of Code:

1. Data Generation: make_blobs creates synthetic 2D data with 4 centers to simulate a clustering problem.
2. Agglomerative Clustering: The AgglomerativeClustering class from sklearn is used to perform the
clustering. The linkage='ward' parameter specifies the use of Ward's method (minimizing variance).
3. Data Visualization: The clustered data is plotted using matplotlib, where each cluster is color-coded.
4. Dendrogram: The linkage function from scipy.cluster.hierarchy is used to generate a linkage matrix,
which is then plotted to create a dendrogram.

Choosing the Number of Clusters

Hierarchical clustering doesn't require the number of clusters to be pre-specified. However, in practice, we often use
the dendrogram to choose an optimal cutoff. The cutoff is typically chosen by looking for the largest vertical
distance without any horizontal lines crossing it. This corresponds to the number of clusters at that level.

 Horizontal Cut: By drawing a horizontal line through the dendrogram at a particular level, we can
determine how many clusters to form.

from scipy.cluster.hierarchy import fcluster

# Define a threshold for the cut (e.g., distance = 7)
clusters = fcluster(Z, t=7, criterion='distance')

# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis')
plt.title("Agglomerative Clustering with Cut at Distance = 7")
plt.show()

Advantages of Hierarchical Clustering

1. No Need to Specify Number of Clusters: Hierarchical clustering can be visualized with a dendrogram,
making it easy to choose the number of clusters.
2. Flexibility: It can capture complex cluster structures, including non-spherical clusters.
3. Hierarchy: Hierarchical clustering provides a full hierarchy of clusters, which is useful for understanding
the relationships between different data points.

Disadvantages of Hierarchical Clustering

1. Computational Complexity: The algorithm is computationally expensive, with a time complexity of
O(n^2), making it slow for large datasets.
2. Sensitive to Noise and Outliers: Like many clustering algorithms, hierarchical clustering is sensitive to
noisy data and outliers.
3. No Reversal of Merges: Once clusters are merged, they cannot be split again. This can be problematic if a
merge is incorrect.

Applications of Hierarchical Clustering

1. Gene Expression Analysis: Hierarchical clustering is commonly used in bioinformatics to group genes
with similar expression patterns.
2. Customer Segmentation: In marketing, hierarchical clustering helps segment customers based on
purchasing behavior or demographics.
3. Document Clustering: It can be used for text mining to group similar documents together.
Evaluation of Clustering Results

Clustering is an unsupervised machine learning technique, which means there are no predefined labels or ground
truth to compare the results against. Therefore, evaluating the performance of clustering algorithms can be more
challenging than evaluating supervised learning models.

However, several evaluation techniques have been developed to assess clustering results. These methods generally
fall into two categories:

1. Internal Evaluation: Measures that evaluate the quality of clusters based on the data itself, without
requiring external labels.
2. External Evaluation: Measures that compare the clustering results to external, predefined labels (ground
truth).

1. Internal Evaluation of Clustering Results

Internal evaluation metrics do not rely on external labels and focus solely on the structure of the data and the
clustering solution. The aim is to evaluate how well the clusters fit the data.

1.1. Silhouette Score

The Silhouette Score measures how similar each data point is to its own cluster (cohesion) compared to other
clusters (separation). A higher silhouette score indicates better-defined clusters.

The silhouette score for a single point is calculated as:
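
s(i) = (b(i) − a(i)) / max(a(i), b(i))

where a(i) is the mean distance from point i to the other points in its own cluster (cohesion), and b(i) is the smallest mean distance from point i to the points of any other cluster (separation).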

The overall silhouette score is the average silhouette score of all points in the dataset.

 Range: [-1, 1]
o A score close to 1 indicates well-separated clusters.
o A score close to 0 indicates overlapping clusters.
o A score close to -1 indicates that points are wrongly assigned to clusters.

Python Code for Silhouette Score


from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Example data: X (synthetic blobs used here purely for illustration; substitute your own data points)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit KMeans
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X)

# Calculate silhouette score
score = silhouette_score(X, labels)
print(f"Silhouette Score: {score}")

1.2. Dunn Index

The Dunn Index measures the ratio of the minimum distance between clusters to the maximum intra-cluster
distance. The higher the Dunn Index, the better the clustering result.

Interpretation:

 A higher Dunn Index indicates well-separated and compact clusters.
 A lower Dunn Index suggests overlapping or poorly separated clusters.
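
In symbols, Dunn = (minimum distance between points in different clusters) / (maximum distance between points within the same cluster). scikit-learn does not ship a Dunn Index function, so a minimal NumPy/SciPy sketch (the helper name dunn_index is our own) could look like this:

import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    # Group the points of each cluster
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Largest intra-cluster distance (cluster "diameter")
    max_diameter = max(cdist(c, c).max() for c in clusters)
    # Smallest distance between points belonging to different clusters
    min_separation = min(
        cdist(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return min_separation / max_diameter

# Example usage with the X and labels from the silhouette example above:
# print(f"Dunn Index: {dunn_index(X, labels)}")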

1.3. Davies-Bouldin Index

The Davies-Bouldin Index (DBI) is a measure of the average similarity ratio of each cluster with the cluster that is
most similar to it. A lower DBI indicates better clustering.

The formula for the Davies-Bouldin index is:
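
DBI = (1/k) Σ_i max_{j ≠ i} (s_i + s_j) / d_ij

where k is the number of clusters, s_i is the average distance of the points in cluster i to its centroid, and d_ij is the distance between the centroids of clusters i and j.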

Python Code for Davies-Bouldin Index


from sklearn.metrics import davies_bouldin_score

# Fit KMeans
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X)

# Calculate Davies-Bouldin score
dbi = davies_bouldin_score(X, labels)
print(f"Davies-Bouldin Index: {dbi}")

1.4. Inertia (Within-Cluster Sum of Squares)

The Inertia (or Within-cluster Sum of Squares) is a measure of how compact the clusters are. It is defined as the
sum of squared distances between each data point and its assigned cluster center.
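
Inertia = Σ_i ||x_i − μ_c(i)||², where μ_c(i) is the centroid of the cluster to which point x_i is assigned.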

Python Code for Inertia


# Fit KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

# Get inertia
inertia = kmeans.inertia_
print(f"Inertia (Within-cluster Sum of Squares): {inertia}")

2. External Evaluation of Clustering Results

External evaluation metrics assess clustering quality by comparing the clustering output to a ground truth. These
measures require labeled data.

2.1. Adjusted Rand Index (ARI)

The Adjusted Rand Index (ARI) is a measure of the similarity between two data clusterings, adjusted for chance. It
ranges from -1 (completely different) to 1 (perfectly matching). A score of 0 indicates random clustering.
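
In general form, ARI = (RI − E[RI]) / (max(RI) − E[RI]), i.e., the Rand Index (RI) corrected for the value it would take under random labelings.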

Python Code for Adjusted Rand Index


from sklearn.metrics import adjusted_rand_score
# Assume true_labels holds the ground-truth labels and labels comes from the clustering algorithm
ari = adjusted_rand_score(true_labels, labels)
print(f"Adjusted Rand Index (ARI): {ari}")

2.2. Normalized Mutual Information (NMI)

Normalized Mutual Information (NMI) measures the amount of information shared between the clustering and
the true labels. It’s normalized to range between 0 (no mutual information) and 1 (perfect agreement).
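
In general form, NMI(U, V) = MI(U, V) / mean(H(U), H(V)), where MI is the mutual information between the two labelings and H(U), H(V) are their entropies (the mean is typically the arithmetic or geometric mean of the two).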
Python Code for NMI
from sklearn.metrics import normalized_mutual_info_score

# Compute Normalized Mutual Information
nmi = normalized_mutual_info_score(true_labels, labels)
print(f"Normalized Mutual Information (NMI): {nmi}")

2.3. Fowlkes-Mallows Index (FMI)

The Fowlkes-Mallows Index (FMI) measures the geometric mean of precision and recall for clustering. It evaluates
how well the clustering matches the true labels.
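
FMI = TP / √((TP + FP) × (TP + FN))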

Where:

 TP is the number of true positives (pairs of points that are in the same cluster and have the same true label).
 FP is the number of false positives (pairs of points that are in the same cluster but have different true labels).
 FN is the number of false negatives (pairs of points that are in different clusters but have the same true label).

Python Code for FMI


from sklearn.metrics import fowlkes_mallows_score

# Compute Fowlkes-Mallows Index
fmi = fowlkes_mallows_score(true_labels, labels)
print(f"Fowlkes-Mallows Index (FMI): {fmi}")

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a widely-used dimensionality reduction technique that transforms a
dataset into a new coordinate system where the axes are ordered by the variance of the data. The first axis (or
component) represents the direction of maximum variance, the second axis is orthogonal to the first and represents
the second-highest variance, and so on. PCA is commonly used in fields like data science, machine learning, image
processing, and biology to reduce the dimensionality of data while preserving as much information as possible.

Key Concepts in PCA


1. Dimensionality Reduction: PCA reduces the number of variables (features) in the dataset by projecting
the data onto a lower-dimensional space while retaining as much variance (information) as possible.
2. Variance: Variance in a dataset measures how much the data points spread out. PCA seeks to find the
directions where data has the highest variance.
3. Covariance Matrix: PCA relies on the covariance matrix of the data to understand how the variables in the
dataset vary with each other.

Steps Involved in PCA

Step 1: Standardize the Data

Center each feature by subtracting its mean and divide by its standard deviation, so that all features are on a comparable scale.

Step 2: Compute the Covariance Matrix

Compute the covariance matrix of the standardized data; it describes how the features vary together.

Step 3: Compute Eigenvalues and Eigenvectors

Compute the eigenvalues and eigenvectors of the covariance matrix; each eigenvector is a candidate principal component, and its eigenvalue measures the variance along that direction.


Step 4: Sort Eigenvalues and Eigenvectors

Sort the eigenvalues in descending order. The eigenvectors corresponding to the largest eigenvalues are the most
important components (principal components), as they explain the most variance in the data.

Step 5: Project Data onto Principal Components

Multiply the standardized data by the selected eigenvectors to obtain its representation in the lower-dimensional space.
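
In matrix form, with X_std denoting the standardized (zero-mean) data matrix and W_k the matrix whose columns are the top k eigenvectors:

C = (1 / (n − 1)) X_std^T X_std,   C v_i = λ_i v_i,   Z = X_std W_k

where Z is the projected data used in place of the original features.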

Mathematical Example: Solving PCA

Let’s go through a simple, worked-out example.

Example Problem

Consider a small 2-D dataset with four samples: (2, 3), (3, 4), (4, 5), (5, 6) (the same data used in the Python example below).

Step 1: Standardize the Data


Step 2: Compute the Covariance Matrix

Step 3: Compute Eigenvalues and Eigenvectors

Step 4: Sort Eigenvalues and Eigenvectors

Step 5: Project the Data

Finally, we project the data onto the first principal component:

This gives us the transformed data along the first principal component.

Sample    Projected Value
1         -3.0
2         -1.0
3          1.0
4          3.0

Now, we have reduced the dimensionality of the original data from 2D to 1D, capturing most of the variance with
just one principal component.

PCA in Python

Here’s how you can perform PCA using Python’s scikit-learn:


import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample data
X = np.array([[2, 3], [3, 4], [4, 5], [5, 6]])

# Step 1: Standardize the data
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Step 2: Apply PCA
pca = PCA(n_components=1)  # Reduce to 1 dimension
X_pca = pca.fit_transform(X_std)

# Display results
print("Principal Components:")
print(pca.components_)

print("Projected Data:")
print(X_pca)

Applications of PCA

1. Dimensionality Reduction: PCA is used to reduce the number of features in a dataset, making it easier to
visualize and analyze.
2. Data Preprocessing: PCA is used as a preprocessing step in machine learning to reduce overfitting and
improve computational efficiency.
3. Image Compression: PCA is widely used in image compression, where high-dimensional data (e.g., pixel
values) are reduced to fewer dimensions.
4. Anomaly Detection: PCA can highlight outliers or anomalies by identifying points that deviate
significantly from the principal components.

Advantages and Disadvantages of PCA

Advantages:

 Reduces Dimensionality: PCA helps reduce the number of features while retaining the most important
variance.
 Improves Computational Efficiency: Less data to process means faster computations.
 Removes Correlations: PCA removes correlated features by projecting data into uncorrelated axes.

Disadvantages:

 Loss of Interpretability: The new features (principal components) may not be easily interpretable.
 Assumes Linear Relationships: PCA assumes linearity and may not capture complex, nonlinear
relationships.
 Sensitivity to Outliers: PCA can be sensitive to outliers in the data.
Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised machine learning technique used for dimensionality reduction
and classification. LDA is particularly effective when the data is categorized into distinct classes, and the goal is to
reduce the dimensionality of the data while maintaining the class separability as much as possible.

Unlike Principal Component Analysis (PCA), which is unsupervised and focuses on maximizing variance, LDA
aims to maximize the separability between different classes in the dataset. This is achieved by finding a linear
combination of features that best separates the classes.

Key Concepts in LDA

1. Class Separability: LDA seeks to find the axes that best separate the data points belonging to different
classes.
2. Fisher's Criterion: LDA uses Fisher's criterion to evaluate the separability between the classes. It seeks to
maximize the ratio of between-class variance to within-class variance.
3. Dimensionality Reduction: Like PCA, LDA reduces the dimensionality of the dataset, but it also
considers the class labels while doing so.
4. Linear Decision Boundaries: LDA works by finding linear decision boundaries between different classes.

Steps in Linear Discriminant Analysis (LDA)

Step 1: Compute the Mean Vectors

For each class, compute the mean vector of the samples belonging to that class.

Step 2: Compute the Scatter Matrices

There are two scatter matrices in LDA:

 The within-class scatter matrix S_W, which sums, over all classes, the scatter of the samples around their own class mean.
 The between-class scatter matrix S_B, which measures how far the class means are spread out around the overall mean.


Step 3: Compute the Eigenvalues and Eigenvectors
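
These are the eigenvalues and eigenvectors of S_W^-1 S_B; equivalently, each projection direction w maximizes Fisher's criterion J(w) = (w^T S_B w) / (w^T S_W w).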

Step 4: Sort the Eigenvectors by Eigenvalues

Once the eigenvectors and eigenvalues are computed, the eigenvectors are sorted in decreasing order of the
eigenvalues. The eigenvectors corresponding to the largest eigenvalues are selected to form a lower-dimensional
space for the transformed data.

Step 5: Project the Data onto the New Subspace

Mathematical Example of LDA

Let's consider a dataset with two classes.

Example Problem:
Step 1: Compute the Mean Vectors

Step 2: Compute the Scatter Matrices

Step 3: Solve for Eigenvalues and Eigenvectors

Step 4: Project the Data onto the New Subspace

LDA in Python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Sample data
X = np.array([[2, 3], [3, 4], [4, 5], [5, 6]]) # Features
y = np.array([0, 0, 1, 1]) # Class labels (0: C1, 1: C2)

# Standardize the data (not strictly required for LDA; kept for consistency with the PCA example)
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Apply LDA
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X_std, y)

# Display results
print("Transformed Data:")
print(X_lda)

Applications of LDA

1. Dimensionality Reduction: LDA reduces the dimensionality of data while preserving class separability, making it useful for feature extraction.
2. Classification: LDA is commonly used for classification tasks, especially when the classes are well-separated.
3. Face Recognition: In face recognition tasks, LDA can be used to reduce the dimensionality of facial images and classify them based on distinct facial features.

Advantages and Disadvantages of LDA

Advantages:

 Class separability: LDA explicitly focuses on maximizing class separability.
 Efficient: LDA can be computationally efficient when the number of features is large, as it reduces the data to a lower-dimensional subspace.
 Better classification: LDA often performs better than PCA for classification tasks because it incorporates class information.

Disadvantages:

 Assumption of Normality: LDA assumes that data for each class is normally distributed
and that the covariance matrices of the classes are identical.
 Linear Boundaries: LDA only works well when the classes can be separated by a linear
decision boundary. If classes are not linearly separable, LDA may not perform well.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction technique primarily used for the visualization of high-dimensional data. It is especially useful in cases where the data contains complex, non-linear structures that other dimensionality reduction methods like PCA cannot capture effectively. t-SNE works by preserving the local structure of the data (i.e., points that are close in high-dimensional space remain close in the low-dimensional projection) while also attempting to separate points that are far apart in high-dimensional space.

Key Concepts in t-SNE

1. High-Dimensional to Low-Dimensional Mapping: t-SNE maps data from a high-dimensional space to a lower-dimensional space (typically 2D or 3D) for visualization, while attempting to preserve the pairwise distances between points as much as possible.
2. Probabilistic Approach: t-SNE is a probabilistic method. It models the similarity
between data points as probabilities:
o In high-dimensional space, it calculates the probability that a data point is a
neighbor of another point based on their distance.
o In low-dimensional space, it tries to match these probabilities, so that nearby
points in high-dimensional space remain nearby in the lower-dimensional
projection.
3. Perplexity: The perplexity is a hyperparameter in t-SNE that balances the influence of
local and global data structure. It determines the number of nearest neighbors for each
point.

Mathematical Foundation of t-SNE

Step 1: Compute Pairwise Affinities in High Dimensional Space

Step 2: Compute Pairwise Affinities in Low-Dimensional Space


Step 3: Minimize the Kullback-Leibler Divergence
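
Using standard t-SNE notation (x_i are the high-dimensional points, y_i their low-dimensional counterparts, and σ_i the bandwidth chosen to match the perplexity), these three steps correspond to:

p_{j|i} = exp(−||x_i − x_j||² / 2σ_i²) / Σ_{k≠i} exp(−||x_i − x_k||² / 2σ_i²),   with p_ij = (p_{j|i} + p_{i|j}) / (2n)

q_ij = (1 + ||y_i − y_j||²)⁻¹ / Σ_{k≠l} (1 + ||y_k − y_l||²)⁻¹

KL(P || Q) = Σ_{i≠j} p_ij log(p_ij / q_ij), which is minimized with gradient descent over the y_i.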

Steps Involved in t-SNE

1. Calculate Pairwise Distances: Compute pairwise distances between all points in the high-dimensional
space.
2. Compute Affinities: Calculate the probability distributions pij for all pairs of points in the high-
dimensional space using a Gaussian distribution.
3. Low-Dimensional Mapping: Initialize the low-dimensional representation of the data (usually random)
and calculate the initial affinities qij in the low-dimensional space using a t-distribution.
4. Minimize KL Divergence: Use gradient descent to minimize the KL divergence between the high-
dimensional and low-dimensional probability distributions.
5. Visualize the Results: Once the optimization converges, the low-dimensional representation can be used
for visualization, typically in 2D or 3D.

Mathematical Example of t-SNE

Let’s consider a simple 2D example where we have the following points in a high-dimensional
space:

We will apply t-SNE to reduce the dimensionality to 1D for simplicity. The steps would be:

1. Compute pairwise distances between points in the high-dimensional space.
2. Compute the affinities using a Gaussian kernel.
3. Initialize the low-dimensional representation with random values.
4. Optimize using gradient descent to minimize KL divergence.

In practice, this optimization is complex, so most users use existing libraries (such as scikit-learn) to apply t-
SNE to larger datasets.
t-SNE in Python

We can easily apply t-SNE in Python using the scikit-learn library. Here is an example using a simple dataset:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, n_iter=300)
X_tsne = tsne.fit_transform(X)

# Visualize the results
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')
plt.legend(*scatter.legend_elements(), title="Classes")
plt.title("t-SNE visualization of Iris dataset")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.show()

In this code:

 We load the Iris dataset, a popular dataset with 4 features and 150 samples from 3 classes.
 We apply t-SNE with 2 components (2D projection), using a perplexity of 30 and 300 iterations.
 The resulting 2D projection is visualized using matplotlib, with different colors representing different
classes.

Applications of t-SNE

1. Data Visualization: t-SNE is mainly used for visualizing high-dimensional datasets. It is especially useful
when trying to understand the relationships between data points in clusters.
2. Exploratory Data Analysis: It helps in understanding the structure of the data, identifying potential
patterns or anomalies.
3. Feature Engineering: t-SNE can be used to reduce the dimensionality of data before applying other
machine learning algorithms, especially when dealing with complex and high-dimensional data.
4. Deep Learning: t-SNE can be applied to visualize the activations of deep neural networks or to explore the
embedding of data in lower-dimensional spaces.

Advantages and Disadvantages of t-SNE

Advantages:

 Captures Non-linear Relationships: Unlike PCA, t-SNE is capable of capturing complex, non-linear
relationships in the data.
 Preserves Local Structure: It preserves the local structure of the data well, making it excellent for
clustering and visualization of data points that are close to each other in the high-dimensional space.
 Effective for Visualization: It is particularly effective for visualizing high-dimensional datasets in 2D or
3D.
Disadvantages:

 Computationally Expensive: t-SNE can be computationally intensive, especially for large datasets, as it
requires pairwise distance computations and iterative optimization.
 Sensitivity to Hyperparameters: The choice of perplexity and learning rate can significantly affect the
results. The wrong choice may lead to poor visualizations.
 Doesn't Preserve Global Structure: While t-SNE excels at preserving local structure, it does not always
preserve the global structure of the data, such as the relative distances between clusters.
 Stochastic: The results of t-SNE can vary slightly due to its probabilistic nature and the random
initialization of points in the low-dimensional space.

Association Rule Mining


Association Rule Mining (ARM) is a machine learning technique used to discover interesting relationships
(associations) between variables in large datasets. It’s primarily used in market basket analysis but has applications
across various domains such as recommendation systems, medical diagnostics, and web usage mining.

1. What is an Association Rule?

An association rule is a relationship between two or more items in a dataset. The rule is typically in the form: A⇒B

This reads as "If item A occurs, then item B is likely to occur."

Components of a Rule:

 Antecedent (Left-Hand Side, LHS): The condition or the items that precede the arrow (e.g., A in A ⇒ B).
 Consequent (Right-Hand Side, RHS): The items that follow the arrow (e.g., B in A ⇒ B).

2. Key Metrics in Association Rule Mining:

To evaluate the strength of an association rule, we use certain metrics:

1. Support: The support of an itemset is the proportion of transactions in the dataset that contain the itemset.

2. Confidence: Confidence is a measure of the likelihood that the rule is correct. It represents the probability
that item B is purchased when item A is purchased.

3. Lift: Lift compares the observed frequency of A and B together to the frequency expected if A and B were
independent. A lift value greater than 1 indicates that the two items appear together more often than
expected, suggesting a meaningful association.

4. Conviction: This is another metric used to evaluate the rule's strength. It measures the degree to which the
occurrence of A (on the LHS) ensures the occurrence of B (on the RHS), considering how often B would
occur without A.
3. Steps in Association Rule Mining:

The process of mining association rules typically involves two main phases:

Phase 1: Frequent Itemset Generation

 The first step is to identify frequent itemsets, which are groups of items that appear together in a dataset
above a certain frequency (threshold, called minimum support).

 Apriori Algorithm and FP-Growth Algorithm are popular methods used for this purpose.

 Apriori Algorithm:
o Starts with a single item and progressively explores combinations of larger itemsets.

o The key principle: If an itemset is frequent, all of its subsets must also be frequent.

o This is done in iterations, where in each iteration, the algorithm generates candidate itemsets and
counts their support.
 FP-Growth Algorithm:
o A more efficient alternative to Apriori that uses a compact tree structure called the FP-tree to
represent frequent itemsets. It avoids candidate generation like Apriori, making it faster for larger
datasets.

Phase 2: Rule Generation

 Once the frequent itemsets are identified, we generate rules from them. This is done by finding all possible
combinations of items in the frequent itemsets and calculating their confidence.

 The rules are then filtered using the minimum confidence threshold to keep only those that meet the
required confidence level.

4. Apriori Algorithm:

Here's a high-level breakdown of how the Apriori algorithm works:

1. Generate Candidate Itemsets: Start with itemsets of size 1, count the support for each item, and remove
those below the minimum support threshold.

2. Iterative Process: For each iteration:


o Generate itemsets of increasing size (2-itemsets, 3-itemsets, etc.).
o Check if these itemsets meet the minimum support threshold.

3. Generate Rules: Once we have the frequent itemsets, generate all possible rules. Calculate the confidence
and discard rules that do not meet the minimum confidence threshold.

Example of Apriori:

 Suppose in a retail store, you analyze the transactions and find the following:
o 50% of transactions contain both bread and butter.
o 60% of transactions contain bread.
o 40% of transactions contain butter.
 You could generate an association rule: bread ⇒ butter, and calculate its support, confidence, and lift.
5. FP-Growth Algorithm:

The FP-Growth (Frequent Pattern Growth) algorithm avoids candidate generation and is faster than Apriori for
large datasets. Here's how it works:

1. Construct FP-Tree:
o The first step is to build a compressed FP-tree using the frequency of individual items.
o In this tree, each path represents a frequent itemset.
2. Recursive Mining:
o The tree is then recursively mined by creating conditional FP-trees for each frequent itemset.
o Frequent patterns are mined directly from the tree without generating candidates.

6. Applications of Association Rule Mining:

 Market Basket Analysis: This is the classic application, where retailers analyze items bought together by
customers to determine product bundling strategies.
 Recommendation Systems: Association rules are used to suggest products to customers based on their
past behaviors.
 Medical Data Mining: In healthcare, association rule mining can uncover relationships between diseases
and symptoms.
 Web Usage Mining: Analyzing web access patterns to understand how users navigate a website.

7. Challenges in Association Rule Mining:

 Scalability: As datasets grow larger, the computation for finding frequent itemsets and generating rules
becomes more intensive.
 High-dimensional data: In some domains (e.g., healthcare), there are too many variables or items, making
it difficult to find meaningful rules.
 Interpretability: The generated rules can sometimes be too complex or unintuitive, making it hard to
derive actionable insights.

8. Advanced Techniques:

 Multi-level Association Mining: Rules are mined at different levels of abstraction (e.g., item categories,
sub-categories).
 Temporal Association Rules: Used to find rules that hold over time (e.g., sales patterns across seasons).
 Quantitative Association Rules: These involve numerical values (e.g., the number of items bought in a
transaction) and may require different metrics like correlation coefficients.

Apriori Algorithm

The Apriori Algorithm is one of the most popular algorithms used for Association Rule Mining. It works by
identifying frequent itemsets in a transactional database and generating rules that have high support and confidence.
Key Concepts in Apriori Algorithm

Steps Involved in the Apriori Algorithm:

1. Generate candidate itemsets of length 1 (single items) and check their support.
2. Iteratively generate larger itemsets (of length 2, 3, etc.) by merging frequent itemsets from the previous
step.
3. Prune itemsets: Any itemset that does not meet the minimum support threshold is discarded.
4. Generate rules: Once frequent itemsets are found, generate association rules from them by calculating
confidence.
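
As a rough sketch (plain Python, illustrative only), the candidate-generation and pruning loop can be written as follows; the transactions are taken from the worked example below, and rule generation is omitted:

from itertools import combinations

# Transactions from the worked example below
transactions = [
    {'A', 'B', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'},
    {'A', 'B', 'C', 'D', 'E'}, {'B', 'C', 'D'}, {'A', 'C', 'E'}, {'A', 'B', 'C', 'D'},
]
min_support = 0.5
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / n

# Step 1: frequent 1-itemsets
items = sorted({item for t in transactions for item in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
all_frequent = set(frequent)

# Steps 2-3: grow candidates level by level, pruning and checking support
k = 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [
        c for c in candidates
        # Apriori pruning: every (k-1)-subset of a frequent itemset must itself be frequent
        if all(frozenset(sub) in all_frequent for sub in combinations(c, k - 1))
        and support(c) >= min_support
    ]
    all_frequent.update(frequent)
    k += 1

for itemset in sorted(all_frequent, key=len):
    print(sorted(itemset), round(support(itemset), 3))

Running this on the example data reproduces the frequent 1- and 2-itemsets derived by hand below.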

Apriori Algorithm: Step-by-Step Explanation

Example:

Consider the following dataset of transactions from a retail store:

Transaction ID    Items Bought
T1                {A, B, D}
T2                {B, C, E}
T3                {A, B, C, E}
T4                {A, B, C, D, E}
T5                {B, C, D}
T6                {A, C, E}
T7                {A, B, C, D}

Minimum Support = 50%


Minimum Confidence = 70%

We’ll apply the Apriori algorithm step-by-step to find frequent itemsets and generate association rules.
Step 1: Find Frequent 1-Itemsets (Single Items)

We start by counting the support for individual items.

Transactions: 7

 Support for A: Appears in T1, T3, T4, T6, T7 → 5/7 = 0.714


 Support for B: Appears in T1, T2, T3, T4, T5, T7 → 6/7 = 0.857
 Support for C: Appears in T2, T3, T4, T5, T6, T7 → 6/7 = 0.857
 Support for D: Appears in T1, T4, T5, T7 → 4/7 = 0.571
 Support for E: Appears in T2, T3, T4, T6 → 4/7 = 0.571

Now, compare these supports to the minimum support threshold (50%). Items with support greater than or equal to
50% are considered frequent.

Frequent 1-Itemsets:

 A (Support = 0.714)
 B (Support = 0.857)
 C (Support = 0.857)
 D (Support = 0.571)
 E (Support = 0.571)

These are our frequent 1-itemsets.

Step 2: Generate Candidate 2-Itemsets

Now we combine frequent 1-itemsets to generate candidate 2-itemsets. We then calculate the support for each pair.

Candidate 2-Itemsets:

 (A, B), (A, C), (A, D), (A, E)


 (B, C), (B, D), (B, E)
 (C, D), (C, E)
 (D, E)

Support for Candidate 2-Itemsets:

 (A, B): Appears in T1, T3, T4, T7 → 4/7 = 0.571


 (A, C): Appears in T3, T4, T6, T7 → 4/7 = 0.571
 (A, D): Appears in T1, T4, T7 → 3/7 = 0.429 (Not frequent)
 (A, E): Appears in T3, T4, T6 → 3/7 = 0.429 (Not frequent)
 (B, C): Appears in T2, T3, T4, T5, T7 → 5/7 = 0.714
 (B, D): Appears in T1, T4, T5, T7 → 4/7 = 0.571
 (B, E): Appears in T2, T3, T4 → 3/7 = 0.429 (Not frequent)
 (C, D): Appears in T4, T5, T7 → 3/7 = 0.429 (Not frequent)
 (C, E): Appears in T2, T3, T4, T6 → 4/7 = 0.571
 (D, E): Appears in T4 → 1/7 = 0.143 (Not frequent)

Frequent 2-Itemsets (Support >= 50%):

 (A, B) (Support = 0.571)


 (A, C) (Support = 0.571)
 (B, C) (Support = 0.714)
 (B, D) (Support = 0.571)
 (C, E) (Support = 0.571)

Step 3: Generate Candidate 3-Itemsets

Now we generate candidate 3-itemsets by combining the frequent 2-itemsets.

Candidate 3-Itemsets:

 (A, B, C), (A, B, D), (A, C, E), (B, C, D), (B, C, E)

Support for Candidate 3-Itemsets:

 (A, B, C): Appears in T3, T4 → 2/7 = 0.286 (Not frequent)


 (A, B, D): Appears in T1, T4, T7 → 3/7 = 0.429 (Not frequent)
 (A, C, E): Appears in T3, T4 → 2/7 = 0.286 (Not frequent)
 (B, C, D): Appears in T4, T7 → 2/7 = 0.286 (Not frequent)
 (B, C, E): Appears in T3, T4 → 2/7 = 0.286 (Not frequent)

None of the 3-itemsets meet the minimum support threshold, so we stop here. (Strictly, Apriori's pruning step would only generate (A, B, C) as a candidate, since each of the other 3-itemsets contains an infrequent 2-itemset subset; they are listed above only to make the support counts explicit.)

Step 4: Generate Association Rules

We now generate the association rules from the frequent itemsets (those that passed the support threshold).

For example:

 (A, B) ⇒ C:
Support of (A, B, C) = 2/7 = 0.286
Confidence = Support(A, B, C) / Support(A, B) = 0.286 / 0.571 = 0.5 (below the 70% minimum confidence, so the rule is discarded)
 (A) ⇒ B:
Support of (A, B) = 0.571
Confidence = Support(A, B) / Support(A) = 0.571 / 0.714 = 0.8 (a strong rule with high confidence)

Final Frequent Itemsets and Rules:

Frequent Itemsets:

 1-itemsets: A, B, C, D, E
 2-itemsets: (A, B), (A, C), (B, C), (B, D), (C, E)

Association Rules (examples meeting the 70% confidence threshold):

 (A) ⇒ (B) with Confidence = 0.571 / 0.714 ≈ 0.80
 (B) ⇒ (C) with Confidence = 0.714 / 0.857 ≈ 0.83

Market Basket Analysis
Market Basket Analysis (MBA) is a data mining technique used to uncover associations and relationships between
different items bought together by customers in retail transactions. It is a type of Association Rule Mining and is
used extensively in areas like recommendation systems, cross-selling, and product bundling.

The goal of market basket analysis is to identify frequent itemsets (items that are often bought together) and
generate association rules that can help businesses make data-driven decisions on product placement, promotions,
and marketing strategies.

Key Concepts in Market Basket Analysis:

1. Transaction: A record of items purchased by a customer in a single transaction.


2. Itemset: A set of items that appear together in a transaction.
3. Frequent Itemset: An itemset that appears in a transaction dataset with support above a certain threshold.
4. Association Rule: A rule in the form of A⇒B "if item A is purchased, then item B is likely to be
purchased."

Steps in Market Basket Analysis:

1. Prepare the Data: Organize the transactional data, where each transaction contains a set of items
purchased together.
2. Generate Frequent Itemsets: Use an algorithm (like Apriori) to find frequent itemsets that meet the
minimum support threshold.
3. Generate Association Rules: From the frequent itemsets, generate association rules and evaluate them
using metrics like confidence and lift.
4. Analyze and Interpret Results: Identify meaningful associations to apply in marketing strategies.

Solved Example:

Let's work through a Market Basket Analysis example with a small dataset of transactions. We’ll use the Apriori
Algorithm to find frequent itemsets and generate association rules.
Dataset:

Consider the following transactions in a retail store:

Transaction ID    Items Purchased
T1                {Milk, Bread, Butter}
T2                {Milk, Bread}
T3                {Milk, Butter}
T4                {Bread, Butter}
T5                {Milk, Bread, Butter, Cheese}
T6                {Bread, Cheese}
T7                {Milk, Bread, Cheese}

Minimum Support = 40% (0.4)


Minimum Confidence = 60% (0.6)

Step 1: Calculate Support for 1-Itemsets

The total number of transactions is 7.

Now, let’s calculate the support for each individual item:

 Milk: Appears in T1, T2, T3, T5, T7 → 5/7 = 0.714


 Bread: Appears in T1, T2, T4, T5, T6, T7 → 6/7 = 0.857
 Butter: Appears in T1, T3, T4, T5 → 4/7 = 0.571
 Cheese: Appears in T5, T6, T7 → 3/7 = 0.429

Items that have support greater than or equal to the minimum support threshold of 40% are:

 Milk (Support = 0.714)


 Bread (Support = 0.857)
 Butter (Support = 0.571)
 Cheese (Support = 0.429)

Step 2: Generate Candidate 2-Itemsets

Now, let’s generate candidate 2-itemsets and calculate their support.

Candidate 2-Itemsets:

 (Milk, Bread)
 (Milk, Butter)
 (Milk, Cheese)
 (Bread, Butter)
 (Bread, Cheese)
 (Butter, Cheese)

Support for 2-Itemsets:

 (Milk, Bread): Appears in T1, T2, T5, T7 → 4/7 = 0.571


 (Milk, Butter): Appears in T1, T3, T5 → 3/7 = 0.429
 (Milk, Cheese): Appears in T5, T7 → 2/7 = 0.286 (Not frequent)
 (Bread, Butter): Appears in T1, T4, T5 → 3/7 = 0.429
 (Bread, Cheese): Appears in T5, T6, T7 → 3/7 = 0.429
 (Butter, Cheese): Appears in T5 → 1/7 = 0.143 (Not frequent)

Frequent 2-itemsets (support ≥ 0.4):

 (Milk, Bread) (Support = 0.571)


 (Milk, Butter) (Support = 0.429)
 (Bread, Butter) (Support = 0.429)
 (Bread, Cheese) (Support = 0.429)

Step 3: Generate Association Rules from Frequent 2-Itemsets

Now, let's generate association rules with confidence and lift for each frequent itemset.

1. Rule 1: Milk ⇒ Bread


o Support(Milk, Bread) = 0.571
o Confidence(Milk ⇒ Bread) = Support(Milk, Bread) / Support(Milk) = 0.571 / 0.714 = 0.8
o Lift(Milk ⇒ Bread) = Confidence(Milk ⇒ Bread) / Support(Bread) = 0.8 / 0.857 = 0.934

This rule means that if a customer buys Milk, there's an 80% chance they will also buy Bread. However, the lift value of about 0.93 is slightly below 1, which means Milk and Bread appear together slightly less often than would be expected if the two items were independent.

2. Rule 2: Bread ⇒ Milk


o Support(Milk, Bread) = 0.571
o Confidence(Bread ⇒ Milk) = Support(Milk, Bread) / Support(Bread) = 0.571 / 0.857 = 0.667
o Lift(Bread ⇒ Milk) = Confidence(Bread ⇒ Milk) / Support(Milk) = 0.667 / 0.714 = 0.933

This rule means that if a customer buys Bread, there's a 66.7% chance they will also buy Milk. The lift value is again slightly below 1, indicating a mild negative association rather than a positive one.

3. Rule 3: Milk ⇒ Butter


o Support(Milk, Butter) = 0.429
o Confidence(Milk ⇒ Butter) = Support(Milk, Butter) / Support(Milk) = 0.429 / 0.714 = 0.6
o Lift(Milk ⇒ Butter) = Confidence(Milk ⇒ Butter) / Support(Butter) = 0.6 / 0.571 = 1.05

This rule means that if a customer buys Milk, there's a 60% chance they will also buy Butter. The lift value of 1.05 is just above 1, indicating a slight positive association: Milk and Butter are bought together a little more often than expected by chance.

4. Rule 4: Bread ⇒ Butter


o Support(Bread, Butter) = 0.429
o Confidence(Bread ⇒ Butter) = Support(Bread, Butter) / Support(Bread) = 0.429 / 0.857 = 0.5
o Lift(Bread ⇒ Butter) = Confidence(Bread ⇒ Butter) / Support(Butter) = 0.5 / 0.571 = 0.874

This rule means that if a customer buys Bread, there's a 50% chance they will also buy Butter. The confidence is below the 60% minimum, and the lift is slightly less than 1, indicating a weak negative association, so this rule would normally be discarded.

Step 4: Interpret the Results

From the generated rules, we can draw conclusions to help inform business decisions:
 Milk and Bread are frequently bought together, with a high confidence of 80%. This suggests that these items could be placed near each other in the store or offered in promotions as a bundle, even though their lift is close to 1.
 Milk and Butter show a slight positive association (lift = 1.05), so a promotion or discount that bundles these items could increase sales.
 Bread and Butter are bought together less reliably (confidence of 50% and lift below 1), so cross-promoting these items is a weaker opportunity than Milk and Butter.

Evaluation Metrics in Association Rule Mining


In Association Rule Mining (ARM), after generating frequent itemsets and the corresponding rules, we need to
evaluate these rules to determine their significance and usefulness. Various evaluation metrics are used to quantify
the strength, relevance, and interestingness of association rules. The most commonly used metrics include Support,
Confidence, Lift, Conviction, and others.

Let’s go into each of these metrics in detail:

1. Support:

Support measures the frequency or occurrence of an itemset (or combination of items) in the dataset. It helps to
identify the most commonly occurring itemsets.

Formula:
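
Support(A) = (number of transactions containing A) / (total number of transactions)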

Example:

If we have 100 transactions and 20 of them contain itemset A (e.g., Milk and Bread), the support for A is 20/100 = 0.2.

 High Support: Indicates that the itemset occurs frequently and is important.
 Low Support: May indicate a less important or less frequent itemset.

Why Support is Important:

 Support helps in pruning the rules and itemsets that are less interesting. We often set a minimum support
threshold (e.g., 0.2 or 20%) to eliminate itemsets that appear too infrequently to be of practical use.
 Rules with low support are less likely to be significant, as they only occur in a small subset of transactions.

2. Confidence:

Confidence measures the likelihood that item B is bought when item A is bought. It is a conditional probability of
B given A.

Formula:
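
Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)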

Where:

 Support (A ∪ B) is the support of the rule A⇒B, i.e., the number of transactions that contain both A and B.
 Support(A) is the support of item A, i.e., the number of transactions that contain A.

Example:

If 30 transactions contain Milk and Bread, and 50 transactions contain Milk: Confidence (Milk ⇒ Bread) = 30/50 = 0.6

 High Confidence: Means that A⇒B is a strong rule, as there is a high probability that B will be purchased
if A is bought.
 Low Confidence: Indicates a weak rule, where buying A does not necessarily lead to buying B.

Why Confidence is Important:

 Confidence provides actionable insights. A rule with high confidence (e.g., 80%) means the antecedent (A)
strongly influences the consequent (B), making the rule valuable for targeted marketing or product
placement.

3. Lift:

Lift compares the observed support of the rule with the expected support if A and B were independent. It is used to
measure how much more likely B is purchased when A is purchased, relative to what would be expected by chance.

Formula:
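
Lift(A ⇒ B) = Support(A ∪ B) / (Support(A) × Support(B))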

Where:

 Support (A ∪ B) is the support of the rule A⇒B.


 Support (A) is the support of A.
 Support (B) is the support of B.

Example:

Let’s say we have:

 Support(Milk) = 0.50 (50% of transactions contain Milk)


 Support(Bread) = 0.40 (40% of transactions contain Bread)
 Support(Milk ∪ Bread) = 0.30 (30% of transactions contain both Milk and Bread)

The Lift of the rule Milk ⇒ Bread is:

Lift (Milk ⇒ Bread) = 0.30 / (0.50 × 0.40) = 1.5

 Lift > 1: Indicates a positive association, meaning A and B are likely to be bought together more often than
expected by chance (they are correlated).
 Lift = 1: Means A and B are independent (no correlation).
 Lift < 1: Indicates a negative association, meaning A and B are less likely to be bought together than
expected by chance (they are inversely correlated).
Why Lift is Important:

 Lift provides a more normalized measure of the strength of the relationship between A and B. A lift > 1 suggests that A and B have a strong positive relationship, which is valuable in market basket analysis to identify items that complement each other.

4. Conviction:

Conviction is an alternative metric to Lift for evaluating the strength of association rules. It measures how much more often A would be expected to occur without B if A and B were independent, compared with how often the rule A ⇒ B actually fails (i.e., A occurs without B).

Formula:
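
Conviction(A ⇒ B) = (1 − Support(B)) / (1 − Confidence(A ⇒ B))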

Where:

 Support (B) is the support of item B.


 Confidence (A ⇒ B) is the confidence of the rule.

Example:

For the rule Milk⇒Bread:

 Support(Bread) = 0.40
 Confidence(Milk ⇒ Bread) = 0.60

The Conviction of the rule Milk ⇒ Bread is:

Conviction (Milk ⇒ Bread) = (1 − 0.40) / (1 − 0.60) = 0.60 / 0.40 = 1.5

 Conviction = 1: Indicates that A and B are independent (no strong relationship).
 Conviction < 1: Suggests a weak or negative relationship.
 Conviction > 1: Indicates a stronger rule; when A is bought, B follows more often than its baseline frequency would suggest, so the rule fails less often than it would under independence.

Why Conviction is Important:

 Conviction is useful when you want to assess the “surprise factor” of an association. Higher conviction
values indicate a rule is more likely to hold true than expected.

5. Leverage:

Leverage measures how much more likely the rule A⇒B is to occur compared to what would be expected if A and
B were independent.

Formula:
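Leverage(A ⇒ B) = Support(A ∪ B) − Support(A) × Support(B)
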
Example:

If:

 Support(Milk) = 0.50
 Support(Bread) = 0.40
 Support(Milk ∪ Bread) = 0.30

The Leverage of the rule Milk ⇒ Bread is:

Leverage (Milk ⇒ Bread) = 0.30 − (0.50 × 0.40) = 0.30 − 0.20 = 0.10

 Leverage > 0: Indicates a positive association between A and B.


 Leverage = 0: Indicates that A and B are independent (no correlation).
 Leverage < 0: Indicates a negative association.

Why Leverage is Important:

 Leverage helps quantify the "extra" occurrence of item B when item A is bought, beyond what would be expected by chance.
