CE 345 - Introduction to Machine Learning
Lecture #10
The Unsupervised Learning - Clustering (cont’d)
Dr. M. Çağkan Uludağlı
Clustering
2
Clustering
● A parametric clustering method: K-Means.
○ The number of classes (K) has to be defined.
● Another parametric clustering method: Mixture of Gaussians.
○ Clusters are represented as Gaussian distributions, and the training set is assumed to be a mixture of Gaussians.
○ The number of clusters has to be defined.
● For distributions that are not well separated, parametric approaches are likely to fail, so we seek more flexible approaches.
3
Clustering
● Partition-Based (Centroid-Based) Clustering:
○ Example: k-Means, k-Medoids
○ Best for: Spherical clusters and fast segmentation.
● Hierarchical (Connectivity-Based) Clustering:
○ Example: Agglomerative, Divisive, BIRCH
○ Best for: Small to medium datasets, interpretable hierarchy.
● Fuzzy Clustering
○ Example: Fuzzy c-Means (also known as fuzzy k-means)
○ Best for: Overlapping clusters where points can belong to multiple clusters.
● Search-Based (Heuristic-Based) Clustering
○ Example: J-Means, Global k-Means
○ Best for: Complex objective functions and cases where global optimization is needed.
● Graph-Based Clustering
○ Example: Chameleon, CACTUS
○ Best for: Non-linearly separable clusters, network and social data, and data with connectivity-based structures.
● Grid-Based Clustering:
○ Example: STING, CLIQUE
○ Best for: Spatial or high-dimensional data.
● Density-Based Clustering:
○ Example: DBSCAN, OPTICS
○ Best for: Non-spherical clusters and data with noise or outliers.
● Model-Based Clustering:
○ Example: Gaussian Mixture Models
○ Best for: Elliptical clusters and soft clustering, overlapping clusters.
● Affinity Propagation
○ Best for: Clustering without predefined k and data with complex structures based on similarity.
4
Hierarchical Clustering
5
Hierarchical Clustering
● Produces a set of nested clusters organized as a hierarchical tree.
● Can be visualized as a dendrogram, which is a tree-like diagram that records the sequence of merges or splits.
6
Hierarchical Clustering
● In hierarchical clustering, we do not have to define a particular number of
clusters.
○ Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level.
○ Each connected component at that level forms a cluster (see the sketch below).
● The method is popular in the life sciences, since nested clusters correspond to taxonomies.
● Example: it is common in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction (i.e., the history of the evolution of a species or group), etc.).
7
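Cutting a dendrogram at a chosen level can be sketched with SciPy (a minimal example, not from the slides; the toy data, the average linkage, and the chosen cluster counts are illustrative assumptions):

```python
# A sketch of "cutting" a dendrogram at different levels with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two well-separated 2-D blobs of 20 points each (toy data).
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="average")                   # build the full hierarchy once
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # cut so that 2 clusters remain
labels_4 = fcluster(Z, t=4, criterion="maxclust")  # cut lower in the tree: 4 clusters
print(labels_2)
print(labels_4)
```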
Hierarchical Clustering
There are two main types of hierarchical clustering:
● Agglomerative:
○ Start with the data points as individual clusters.
○ At each step, merge the closest pair of clusters until only one cluster (or k clusters) remains.
● Divisive:
○ Start with one, all-inclusive cluster.
○ At each step, split a cluster until each cluster contains a single point (or there are k clusters).
8
Agglomerative Clustering
9
Agglomerative Clustering
It is a straightforward algorithm:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat until only a single cluster remains:
a. Merge the two closest clusters.
b. Update the proximity matrix.
10
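A minimal sketch of these steps, assuming SciPy (the slides themselves do not prescribe a library): linkage performs the repeated merging of the two closest clusters, and dendrogram records the resulting merge sequence.

```python
# A sketch of the agglomerative procedure using SciPy's hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))              # 10 points; each starts as its own cluster

proximity = pdist(X, metric="euclidean")  # step 1: (condensed) proximity matrix
Z = linkage(proximity, method="single")   # steps 2-3: repeatedly merge the closest clusters

# Each row of Z records one merge: (cluster i, cluster j, distance, size of new cluster).
print(Z)
tree = dendrogram(Z, no_plot=True)        # merge sequence without plotting
print(tree["ivl"])                        # leaf order along the dendrogram
```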
Agglomerative Clustering
● The samples assigned to one cluster should be more similar to the other samples in that cluster than to the samples in other clusters.
● Hierarchical algorithms use a similarity/proximity matrix, which is a symmetric matrix of pairwise similarities/proximities between individual data points.
● Different similarity measures can be defined based on some distance criterion (e.g., the Euclidean distance between two samples).
11
Agglomerative Clustering
● Start with clusters of individual points and an initial proximity matrix:
12
Agglomerative Clustering
● After some merging steps, we have some clusters:
13
Agglomerative Clustering
● We want to merge the two closest clusters (C2 and C5) and update the
proximity matrix.
14
Agglomerative Clustering
● After merging, how do we update the
proximity matrix?
● By using an inter-cluster similarity
measure and a linkage method.
15
Agglomerative Clustering: Inter-Cluster Similarity
The most commonly used inter-cluster similarity metrics are:
● Euclidean Distance.
● Manhattan Distance.
● Minkowski Distance.
● Mahalanobis Distance.
● Cosine Similarity.
● Other metrics.
16
Agglomerative Clustering: Linkage
The most commonly used linkage methods are:
● MIN (Single) Distance.
● MAX (Complete) Distance.
● Group Average.
● Distance Between Centroids.
● Ward’s Minimum Variance.
● Other methods driven by different objective functions.
17
Agglomerative Clustering: Linkage
● MIN (Single) Distance:
18
Agglomerative Clustering: Linkage
● MAX (Complete) Distance:
19
Agglomerative Clustering: Linkage
● Group Average:
20
Agglomerative Clustering: Linkage
● Distance Between Centroids:
21
Agglomerative Clustering: Linkage Comparison
A comparison of the characteristics of the different linkage methods for hierarchical clustering on toy datasets, using Python and the scikit-learn library, is given here:
https://scikit-learn.org/stable/auto_examples/cluster/plot_linkage_comparison.html
22
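In the same spirit as the linked example, the linkage choice can be switched in scikit-learn as sketched below (the make_moons dataset and the parameter values are illustrative assumptions):

```python
# Comparing linkage methods with scikit-learn's AgglomerativeClustering.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

for linkage_method in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage_method).fit_predict(X)
    print(linkage_method, labels[:10])
```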
Agglomerative Clustering
● The key operation of a hierarchical algorithm is the computation of the proximity between two clusters.
● To find the nearest clusters, one can use any of the inter-cluster similarity measures and linkage methods listed above.
23
Agglomerative Clustering: Example #1
● This example shows hierarchical clustering of the cities in Italy using intercity road distances (with Euclidean distance & MIN (Single) linkage).
24
Agglomerative Clustering: Example #1
25
Agglomerative Clustering: Example #1
26
Agglomerative Clustering: Example #1
27
Agglomerative Clustering: Example #1
28
Divisive Clustering
29
Divisive Clustering
For divisive hierarchical clustering, you first build a Minimum Spanning Tree (MST):
1. Start with any point and include it in the tree.
2. Repeatedly look for the closest pair of points (p, q) such that p is in the current tree but q is not.
3. Add q to the tree and put an edge between p and q.
4. Repeat this process until every point is connected.
30
Divisive Clustering
Divisive clustering using the MST:
1. Start with the complete MST.
2. Repeat until only singleton (one-member) clusters remain:
a. Create a new cluster by breaking the link corresponding to the largest distance (smallest similarity).
31
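A minimal sketch of this MST-based divisive procedure, assuming SciPy and scikit-learn; the toy blobs and the choice of k are illustrative, not from the slides.

```python
# Divisive clustering via an MST: break the k-1 largest links, keep the components.
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)
D = pairwise_distances(X)                    # full pairwise distance matrix
mst = minimum_spanning_tree(D).toarray()     # MST edge weights (0 means "no edge")

k = 3                                        # desired number of clusters (assumed)
edges = np.argwhere(mst > 0)
weights = mst[edges[:, 0], edges[:, 1]]
for i, j in edges[np.argsort(weights)[::-1][:k - 1]]:
    mst[i, j] = 0.0                          # break the largest-distance links

# The remaining connected components of the pruned MST are the clusters.
n_clusters_found, labels = connected_components(mst, directed=False)
print(n_clusters_found, labels)
```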
Hierarchical Clustering: A Summary
32
Hierarchical Clustering: A Summary
● These methods use a distance matrix as the clustering criterion.
● They do not require the number of clusters (K) as an input.
● They use one of several approaches for measuring inter-cluster similarity and for linkage.
● Once a decision is made to combine two clusters, it cannot be undone.
● No cost function is directly minimized.
33
Comparison of Clustering Methods
The comparison can be found here: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
34
Clustering Evaluation Metrics
35
Clustering Evaluation Metrics
● Metrics that require the ground-truth labels:
○ Rand Index & Adjusted Rand Index (ARI)
○ Mutual Information & Normalized Mutual Information (NMI)
○ Homogeneity, Completeness & V-measure
○ Fowlkes-Mallows Index (FMI)
● Metrics that do not require the ground-truth labels:
○ Silhouette Coefficient
○ Calinski-Harabasz Index
○ Davies-Bouldin Index
● Metrics that require the ground-truth labels and are represented in matrix form:
○ Contingency Matrix
○ Pair Confusion Matrix
36
Adjusted Rand Index (ARI)
● The Rand Index computes a similarity measure between two clusterings by
considering all pairs of samples and counting pairs that are assigned in the
same or different clusters in the predicted and true clusterings.
● The raw RI score is then “adjusted for chance” into the Adjusted Rand Index
(ARI) score using the following scheme:
ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)
● The ARI is thus ensured to have a value close to 0.0 for random labeling
independently of the number of clusters and samples, and exactly 1.0 when
the clusterings are identical. The ARI is bounded below by -0.5 for especially
discordant clusterings.
● ARI is a symmetric measure.
37
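A quick check of this behaviour with scikit-learn (a minimal sketch; the label vectors are illustrative, and rand_score assumes scikit-learn ≥ 0.24):

```python
# RI vs. ARI on a small illustrative example.
from sklearn.metrics import adjusted_rand_score, rand_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print(rand_score(labels_true, labels_pred))            # raw RI, in [0, 1]
print(adjusted_rand_score(labels_true, labels_pred))   # ARI, adjusted for chance
print(adjusted_rand_score(labels_pred, labels_true))   # symmetry: same ARI value
```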
Adjusted Rand Index (ARI): Advantages
● Interpretability: The RI is proportional to the number of sample pairs whose
labels are the same in both predicted and true labels, or are different in both.
● Uniform random label assignments have an ARI score close to 0.0 for
any value of number of clusters and number of samples.
● Bounded range: Lower values indicate different labelings, similar clusterings
have a high RI or ARI, 1.0 is the perfect match score. The score range is [0, 1]
for the RI and [-0.5, 1] for the ARI.
● No assumption is made on the cluster structure: The RI or ARI can be
used to compare all kinds of clustering algorithms.
38
Adjusted Rand Index (ARI): Disadvantages
● The RI or ARI require knowledge of the ground truth classes, which is
almost never available in practice or requires manual assignment by human
annotators (as in the supervised learning setting).
● The RI is often close to 1.0 even if the clusterings themselves differ
significantly.
○ This can be understood by interpreting the Rand index as the accuracy of element-pair labeling resulting from the clusterings: in practice, the majority of element pairs are assigned the “different” pair label under both the predicted and the ground-truth clusterings, which yields a high proportion of agreeing pair labels and, consequently, a high score.
39
Normalized Mutual Information (NMI)
● The Mutual Information (MI) is a measure of the similarity between two labels
of the same data.
● Where |Ui| is the number of samples in cluster Ui and |Vj| is the number of samples in cluster Vj, the Mutual Information between clusterings U and V is given as:
MI(U, V) = Σi Σj (|Ui ∩ Vj| / N) · log( N · |Ui ∩ Vj| / (|Ui| · |Vj|) ), where N is the total number of samples.
● This metric is independent of the absolute values of the labels: a permutation
of the class or cluster label values won’t change the score value in any way.
● This metric is furthermore symmetric: switching U with V will return the same
score value.
40
Normalized Mutual Information (NMI)
● Normalized Mutual Information (NMI) is a normalization of the Mutual
Information (MI) score to scale the results between 0 (no mutual information)
and 1 (perfect correlation).
● In this function, mutual information is normalized by some generalized mean
of true labels and predicted labels, defined by the averaging method.
● NMI is also calculated as:
NMI(Y, C) = 2 · I(Y; C) / (H(Y) + H(C)),
where Y = class labels, C = cluster labels, H(·) = entropy and I(Y; C) = mutual information between Y and C.
41
Normalized Mutual Information (NMI)
● NMI is a good measure for determining the quality of clustering.
● It is an external measure because we need the class labels of the instances
to determine the NMI.
● Since it is normalized, we can measure and compare the NMI between clusterings having different numbers of clusters (see the sketch below).
42
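A minimal sketch with scikit-learn (label vectors are illustrative); note the permutation invariance mentioned above: relabeling the clusters leaves the score unchanged.

```python
# NMI on a small illustrative example.
from sklearn.metrics import normalized_mutual_info_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 1, 0, 0, 2]   # cluster ids permuted, one point split off

print(normalized_mutual_info_score(labels_true, labels_pred))
# The default normalization uses the arithmetic mean of the two entropies;
# the average_method parameter selects other generalized means.
```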
Homogeneity, Completeness and V-measure
● Given the knowledge of the ground truth class assignments of the samples, it
is possible to define some intuitive metric using conditional entropy analysis.
● In particular, Rosenberg and Hirschberg (2007) define the following two
desirable objectives for any cluster assignment:
○ Homogeneity: each cluster contains only members of a single class.
○ Completeness: all members of a given class are assigned to the same cluster.
● Both scores are bounded below by 0.0 and above by 1.0 (higher is better).
43
Homogeneity, Completeness and V-measure
● Their harmonic mean is called the V-measure, and it is computed as:
v = 2 · (homogeneity · completeness) / (homogeneity + completeness)
● This score is identical to the NMI score computed with the arithmetic averaging method.
44
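All three scores can be computed in one call with scikit-learn (a minimal sketch; the label vectors are illustrative):

```python
# Homogeneity, completeness and V-measure in one call.
from sklearn.metrics import homogeneity_completeness_v_measure

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

h, c, v = homogeneity_completeness_v_measure(labels_true, labels_pred)
print(h, c, v)   # v is the harmonic mean of h and c
```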
Homogeneity, Completeness and V-measure: Advantages
● Bounded scores: 0.0 is as bad as it can be, 1.0 is a perfect score.
● Intuitive interpretation: a clustering with a bad V-measure can be qualitatively analyzed in terms of homogeneity and completeness to get a better feel for what ‘kind’ of mistakes the assignment makes.
● No assumption is made on the cluster structure: the scores can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with the results of spectral clustering algorithms, which can find clusters with “folded” shapes.
45
Homogeneity, Completeness and V-measure: Disadvantages
● The previously introduced metrics are not normalized with regards to
random labeling: this means that depending on the number of samples,
clusters and ground truth classes, a completely random labeling will not
always yield the same values for homogeneity, completeness and hence
v-measure. In particular random labeling won’t yield zero scores especially
when the number of clusters is large.
● These metrics require the knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
46
Silhouette Coefficient
● If the ground truth labels are not known, evaluation must be performed using
the model itself. The Silhouette Coefficient is an example of such an
evaluation, where a higher Silhouette Coefficient score relates to a model with
better defined clusters.
● The Silhouette Coefficient is defined for each sample and is composed of two
scores:
○ a: The mean distance between a sample and all other points in the same cluster.
○ b: The mean distance between a sample and all other points in the next nearest cluster.
● The Silhouette Coefficient s for a single sample is then given as:
s = (b − a) / max(a, b)
● The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.
47
Silhouette Coefficient
● In normal usage, the Silhouette Coefficient is applied to the results of a cluster
analysis.
● The score is bounded between -1 for incorrect clustering and +1 for highly
dense clustering.
● Scores around zero indicate overlapping clusters.
● The score is higher when clusters are dense and well separated, which
relates to a standard concept of a cluster.
● The Silhouette Coefficient is generally higher for convex clusters than other
concepts of clusters, such as density based clusters like those obtained
through DBSCAN.
48
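A minimal sketch with scikit-learn, comparing silhouette scores for different cluster counts on toy data (k-means, make_blobs and the k values are illustrative assumptions):

```python
# Using the Silhouette Coefficient to compare different numbers of clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for k in (2, 3, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # higher = better-defined clusters
```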
Calinski-Harabasz Index
● If the ground truth labels are not known, the Calinski-Harabasz index - also
known as the Variance Ratio Criterion - can be used to evaluate the model,
where a higher Calinski-Harabasz score relates to a model with better defined
clusters.
● It is the ratio of the sum of between-cluster dispersions to the sum of within-cluster dispersions over all clusters.
● The dispersion is defined as a sum of squared distances.
49
Calinski-Harabasz Index
50
Calinski-Harabasz Index
● In normal usage, the Calinski-Harabasz index is applied to the results of a
cluster analysis.
● The score is higher when clusters are dense and well separated, which
relates to a standard concept of a cluster.
● The score is fast to compute.
● The Calinski-Harabasz index is generally higher for convex clusters than
other concepts of clusters, such as density based clusters like those obtained
through DBSCAN.
51
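A minimal sketch with scikit-learn (toy data and k-means labels are illustrative); the score needs only the data and the predicted labels, not the ground truth:

```python
# Calinski-Harabasz (Variance Ratio Criterion) score of a k-means result.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(calinski_harabasz_score(X, labels))   # higher = denser, better-separated clusters
```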
Davies-Bouldin Index
● If the ground truth labels are not known, the Davies-Bouldin index can be
used to evaluate the model, where a lower Davies-Bouldin index relates to a
model with better separation between the clusters.
● This index signifies the average ‘similarity’ between clusters, where the
similarity is a measure that compares the distance between clusters with the
size of the clusters themselves.
● Zero is the lowest possible score. Values closer to zero indicate a better
partition.
52
Davies-Bouldin Index
53
Davies-Bouldin Index
● In normal usage, the Davies-Bouldin index is applied to the results of a cluster
analysis.
● The computation of Davies-Bouldin is simpler than that of Silhouette scores.
● The index is solely based on quantities and features inherent to the dataset
as its computation only uses point-wise distances.
● The Davies-Bouldin Index is generally higher for convex clusters than other
concepts of clusters, such as density based clusters like those obtained from
DBSCAN.
● The usage of centroid distance limits the distance metric to Euclidean space.
54
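A minimal sketch with scikit-learn (toy data and k-means labels are illustrative):

```python
# Davies-Bouldin index of a k-means result (lower is better, 0 is the minimum).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))
```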
Contingency Matrix
● Contingency matrix reports the intersection cardinality for every true/predicted
cluster pair.
● The contingency matrix provides sufficient statistics for all clustering metrics
where the samples are independent and identically distributed (iid) and one
doesn’t need to account for some instances not being clustered.
● A confusion matrix for classification is a square contingency matrix in which the order of the rows and columns corresponds to a list of classes.
55
Contingency Matrix
● If the ground-truth classes are x = ["a", "a", "a", "b", "b", "b"], and the predicted clusters are y = [0, 0, 1, 1, 2, 2], then the contingency matrix will be:
[[2, 1, 0],
[0, 1, 2]]
● The first row of the matrix indicates that there are three samples whose true class is “a”. Of them, two are in predicted cluster 0, one is in 1, and none is in 2.
● The second row indicates that there are three samples whose true class is
“b”. Of them, none is in predicted cluster 0, one is in 1 and two are in 2.
56
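The example above can be reproduced with scikit-learn (a minimal sketch using the same toy labels as the slide):

```python
# Contingency matrix for the example above.
from sklearn.metrics.cluster import contingency_matrix

x = ["a", "a", "a", "b", "b", "b"]   # ground-truth classes
y = [0, 0, 1, 1, 2, 2]               # predicted clusters

print(contingency_matrix(x, y))
# [[2 1 0]
#  [0 1 2]]
```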
Contingency Matrix
● It allows us to examine the spread of each true cluster across the predicted clusters, and vice versa.
● The calculated contingency matrix is typically utilized in the calculation of a
similarity statistic (like the others listed in this document) between the two
clusterings.
● It is easy to interpret for a small number of clusters, but becomes very hard to
interpret for a large number of clusters.
● It doesn’t give a single metric to use as an objective for clustering
optimisation.
57
Pair Confusion Matrix
The pair confusion matrix is a 2x2 similarity matrix between two clusterings computed by
considering all pairs of samples and counting pairs that are assigned into the same or
into different clusters under the true and predicted clusterings:
● C00: number of pairs with both clusterings having the samples not clustered together.
● C10: number of pairs with the true label clustering having the samples clustered
together but the other clustering not having the samples clustered together.
● C01: number of pairs with the true label clustering not having the samples clustered
together but the other clustering having the samples clustered together.
● C11: number of pairs with both clusterings having the samples clustered together.
58
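A minimal sketch with scikit-learn (label vectors are illustrative); note that the matrix counts ordered pairs, so its entries sum to n·(n−1):

```python
# Pair confusion matrix for a small illustrative example.
from sklearn.metrics.cluster import pair_confusion_matrix

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

C = pair_confusion_matrix(labels_true, labels_pred)
print(C)
# C[0, 0] = TN pairs, C[1, 0] = FN pairs, C[0, 1] = FP pairs, C[1, 1] = TP pairs
```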
Pair Confusion Matrix
● Considering a pair of samples that is clustered together to be a positive pair, then, as in binary classification, the count of true negatives is C00, of false negatives is C10, of true positives is C11, and of false positives is C01.
● Perfectly matching labelings have all non-zero entries on the diagonal
regardless of actual label values.
● Labelings that assign all class members to the same clusters are complete but may not always be pure, hence they are penalized and have some off-diagonal non-zero entries.
● The matrix is not symmetric.
● If class members are completely split across different clusters, the assignment is totally incomplete, hence the matrix has all-zero diagonal entries.
59
Clustering Evaluation Metrics
60
Algorithm Selection Cheatsheet
61
Summary
● Hierarchical Clustering
○ Agglomerative
○ Divisive
● Comparison of Clustering Methods
● Clustering Evaluation Metrics
62
End of Lecture #10
63
References
● Lecture Slides by Ethem Alpaydın, Introduction to Machine Learning, 3rd Edt. (MIT Press,
2014).
● Lecture Slides by Yalın Baştanlar, Introduction to Machine Learning course, IZTECH CS
Department, 2012.
● Machine Learning Flashcards, Chris Albon, https://machinelearningflashcards.com/
● Gan, G., Ma, C., & Wu, J. (2007). Data clustering: theory, algorithms, and applications.
Society for Industrial and Applied Mathematics.
● Giordani, P., Ferraro, M. B., & Martella, F. (2020). An Introduction to Clustering with R. Springer Singapore.
● Northeastern University, CS6140 Machine Learning, SPRING 2015 Lecture Notes by Bilal
Ahmed, https://course.ccs.neu.edu/cs6140sp15/
● Rosenberg, A., & Hirschberg, J. (2007). V-Measure: A conditional entropy-based external cluster evaluation measure.
● 2.3. Clustering, Unsupervised Learning, User Guide,
https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
64