
Introduction to Unsupervised Learning

Yashil Sukurdeep
July 2, 2024

1 Unsupervised Learning: Fundamentals


Unsupervised learning is a type of machine learning where models are trained
on unlabelled data, i.e., data without labeled responses. The goal is therefore to train the
model so that it learns patterns and structure from the
input data without any explicit instructions on what to predict. Unsupervised
learning is important because it helps to discover hidden patterns or intrinsic
structures in data, reduce the complexity of data, and prepare data for super-
vised learning tasks.

Unsupervised learning is used for:


• Clustering: Grouping data points into clusters based on their similarity.
• Association: Discovering rules that describe large portions of data.

• Dimensionality Reduction: Reducing the number of random variables under consideration.
Some real-life use cases of unsupervised learning include:
• Customer segmentation in marketing.

• Anomaly detection in fraud detection.


• Grouping news articles by topic.

2 Clustering
Clustering is the process of dividing a dataset into groups, where the members of
each group are similar in some way. The goal of clustering is to identify structure
in an unlabeled dataset. There are several methods used for clustering, each
with its own approach to grouping data points.

2.1 Centroid-Based Clustering
Centroid-based clustering methods partition the data into clusters by assign-
ing each data point to the cluster corresponding to the nearest centroid. The
centroids are then updated iteratively to minimize the distance between data
points and their respective centroids.

The K-Means clustering algorithm is an example of a centroid-based clustering method. Given (unlabelled) data points $\{x_1, \dots, x_n\} \subset \mathbb{R}^d$, it works as follows:

1. Initialization: Choose $K$ initial centroids $\{c_1, \dots, c_K\}$ randomly from the dataset. At this stage, each centroid is one of the data points themselves. Let the cluster corresponding to centroid $c_j$ be denoted by $C_j$ for $j = 1, \dots, K$.

2. Assignment step: Assign each data point to the cluster corresponding to the nearest centroid. Mathematically, for each data point $x_i$, find the centroid $c_j$ that minimizes the squared Euclidean distance:
$$c_j = \operatorname*{argmin}_{\ell = 1, \dots, K} \|x_i - c_\ell\|^2, \qquad \text{where } \|x_i - c_\ell\|^2 = \sum_{m=1}^{d} (x_{im} - c_{\ell m})^2,$$
and assign data point $x_i$ to cluster $C_j$.


3. Update step: Recalculate the centroid of each cluster by taking the mean of all data points assigned to that cluster. For each cluster $C_j$, update its centroid as follows:
$$c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$

4. Repeat: Repeat the assignment and update steps until the centroids no
longer change or the change is below a certain threshold.

The algorithm aims to minimize the within-cluster sum of squares (WCSS), which is defined as:
$$\mathrm{WCSS} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \|x_i - c_j\|^2$$

This objective function ensures that the data points within each cluster are as
close as possible to the centroid, leading to compact and well-separated clusters.
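To make the steps above concrete, here is a minimal NumPy sketch of the algorithm. The function name k_means and its default arguments are illustrative choices, not part of these notes:

# Minimal K-means sketch (illustrative; not from the notes).
import numpy as np

def k_means(X, K, n_iters=100, tol=1e-6, seed=0):
    """Cluster the rows of X (shape n x d) into K clusters."""
    rng = np.random.default_rng(seed)
    # Initialization: pick K data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (empty clusters keep their previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(K)
        ])
        # Repeat until the centroids stop moving (up to a tolerance).
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    # Final assignment and within-cluster sum of squares (WCSS).
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    wcss = np.sum((X - centroids[labels]) ** 2)
    return labels, centroids, wcss

Running k_means(X, K) on a data matrix X of shape (n, d) returns the cluster assignments, the final centroids, and the WCSS of the resulting clustering.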

Figure 1: Illustration of K-means clustering with the Iris dataset. The first 3
features of the dataset were selected for visualization. The plot on the right
displays the clusters assigned by the K-means algorithm. The plot on the left
displays the data points colored by their true labels.

An important consideration when running the K-means algorithm is how to choose the best value of K. This can be achieved with the Elbow method, which consists of computing the 'quality' of your clusters, as defined by an objective function (such as the WCSS) or a performance metric (such as the Silhouette score), for a range of values of K, and then choosing the value of K at the 'elbow' of the graph of the objective function versus K, i.e., the point beyond which increasing K yields only marginal improvement:

Figure 2: Elbow method for choosing ‘best’ value of K for the K-means algo-
rithm.
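As a sketch of the Elbow method in practice, the WCSS can be computed for a range of values of K with scikit-learn's KMeans, assuming scikit-learn is available (the placeholder data and range of K below are illustrative):

# Elbow method sketch: compute the WCSS for K = 1, ..., 10.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 3)           # placeholder data; use your own dataset
wcss_values = []
for K in range(1, 11):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    wcss_values.append(km.inertia_)   # inertia_ is the WCSS of the fitted model
# Plot wcss_values against K and look for the 'elbow' in the curve.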

2.2 Hierarchical Clustering
Hierarchical clustering is another clustering technique which builds a hierarchy
of clusters either by merging small clusters into larger ones (agglomerative) or
by splitting larger clusters into smaller ones (divisive). This method is often
visualized using a dendrogram.

Figure 3: Illustration of Hierarchical Clustering. On the left, we display the data to be clustered, with the Euclidean distance used as the distance metric. On the right, we display the hierarchical clustering dendrogram obtained.
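As an illustrative sketch (assuming SciPy and Matplotlib are available), an agglomerative clustering and a dendrogram like the one in Figure 3 can be produced as follows:

# Agglomerative (bottom-up) hierarchical clustering with a dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.rand(30, 2)                           # placeholder data
Z = linkage(X, method="ward", metric="euclidean")   # merge small clusters into larger ones
dendrogram(Z)                                       # visualize the hierarchy of merges
plt.show()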

2.3 Distribution-Based Clustering


Distribution-based clustering assumes that the data points are generated from a
mixture of several distributions. The goal is to identify the parameters of these
distributions.

Example: An example of this method is Gaussian Mixture Models (GMM).
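As a short sketch, a GMM can be fit with scikit-learn's GaussianMixture; the number of components and the placeholder data below are illustrative, not from the notes:

# Distribution-based clustering with a Gaussian Mixture Model.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(200, 2)                               # placeholder data
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)                                  # most likely component per point
print(gmm.means_, gmm.covariances_)                      # fitted distribution parameters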

2.4 Density-Based Clustering


Density-based clustering groups data points that are closely packed together,
marking as outliers the points that lie alone in low-density regions.

Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular algorithm for this type of clustering.
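A minimal sketch using scikit-learn's DBSCAN (the eps and min_samples values below are illustrative, not tuned recommendations):

# Density-based clustering with DBSCAN; points labeled -1 are outliers.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)                     # placeholder data
db = DBSCAN(eps=0.1, min_samples=5).fit(X)
labels = db.labels_                            # the label -1 marks low-density outliers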

2.5 Evaluation Metrics for Clustering
To evaluate the performance of clustering algorithms, several metrics can be
used. The Silhouette score, in particular, is a powerful metric.

• Silhouette Score: Measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation). The Silhouette score $s(x_i)$ for a data point $x_i$ is defined as:
$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max(a(x_i), b(x_i))},$$
where $a(x_i)$ is the average distance from the point $x_i$ to the other points in the same cluster, and $b(x_i)$ is the average distance from the point $x_i$ to the points in the nearest cluster to which $x_i$ does not belong.

The average of the Silhouette scores of the data points is then used as the overall metric for the quality of the clusters obtained across the entire dataset:
$$S = \frac{1}{n} \sum_{i=1}^{n} s(x_i).$$
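In practice, this average Silhouette score can be computed with scikit-learn's silhouette_score, as in the following sketch (the placeholder data and the number of clusters are illustrative):

# Average Silhouette score for a K-means clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 3)                                        # placeholder data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
S = silhouette_score(X, labels)   # mean of s(x_i) over all points; closer to 1 is better
print(S)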

3 Dimensionality Reduction
Dimensionality reduction is a process used in machine learning and statistics to
reduce the number of features or variables under consideration. This technique
simplifies the dataset by decreasing its dimensions without losing significant
information. It is essential for handling high-dimensional data efficiently and
is used to improve the performance of machine learning models, reduce storage
space, and decrease computation time.

3.1 Principal Component Analysis (PCA)


Principal Component Analysis (PCA) is a statistical technique that transforms
the data into a set of orthogonal components (i.e., axes which are at right angles
to each other), which are ordered by the amount of variance they explain. The
first few components capture most of the variability in the data, allowing for a
reduction in the number of dimensions while preserving the essential patterns.
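A short sketch of PCA with scikit-learn, keeping the first two principal components (the choice of two components and the placeholder data are illustrative):

# PCA: project the data onto the top orthogonal components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 5)                     # placeholder high-dimensional data
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)                   # data projected onto the top 2 components
print(pca.explained_variance_ratio_)           # fraction of variance explained by each component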

3.2 Linear Discriminant Analysis (LDA)


Linear Discriminant Analysis (LDA) is a technique used for both dimensionality
reduction and classification. It projects the data onto a lower-dimensional space
with the goal of maximizing the separation between multiple classes. LDA is
particularly useful when the classes are well-separated.
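A short sketch of LDA for dimensionality reduction with scikit-learn; note that, unlike the other techniques in these notes, LDA requires class labels y (the placeholder data below is illustrative):

# LDA: project the data onto directions that maximize class separation.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.rand(150, 4)                     # placeholder features
y = np.random.randint(0, 3, size=150)          # placeholder class labels (3 classes)
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
X_proj = lda.transform(X)                      # lower-dimensional projection of the data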

Figure 4: Illustration of Principal Component Analysis (PCA).

Figure 5: Illustration of Linear Discriminant Analysis (LDA) using the Iris dataset. The different colors represent the three species of Iris flowers: setosa (navy), versicolor (turquoise), and virginica (dark orange). The plot shows how LDA projects the data onto a lower-dimensional space (with axes $X_1$ and $X_2$) with the goal of maximizing the separation between these classes.
