Clustering metrics
Clustering metrics
• Clustering metrics are quantitative measures used to evaluate the quality of
clustering algorithms and the resulting clusters.
• Internal Evaluation Metrics: These metrics evaluate the quality of clusters without
any external reference. They measure the compactness of data points within the
same cluster and the separation between different clusters.
• Such as: Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index (Variance Ratio
Criterion), Dunn Index
• External Evaluation Metrics: These metrics require a ground truth or a reference
set of labels to compare the clustering results against. They measure the
agreement between the generated clusters and the true classes in the reference
set.
• Such as: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Fowlkes-Mallows
Index
Silhouette Score
• Silhouette Coefficient or silhouette score is a metric used to calculate
the goodness of a clustering technique. Its value ranges from -1 to 1.
• 1: Means clusters are well apart from each other and clearly
distinguished (Best Case)
• 0: Means clusters are indifferent, or we can say that the distance
between clusters is not significant (Overlapping Clusters)
• -1: Means clusters are assigned in the wrong way (Worst Case)
It is computed as:
• The silhouette value is a measure of how similar an object is to its
own cluster (cohesion) compared to other clusters (separation).
• Silhouette Score = (b-a)/max(a,b)
where
• a= average intra-cluster distance i.e the average distance between
each point within a cluster
• b= average inter-cluster distance i.e the average distance between all
clusters
Example:
Datapoints Cluster Label
A1 C1
A2 C1
A3 C2
A4 C2
Datapoint A1 A2 A3 A4
A1 0 0.10 0.65 0.55
A2 0.10 0 0.70 0.60
A3 0.65 0.70 0 0.30
A4 0.55 0.60 0.30 0
For point A1:
a= 0.1/1=0.1
b= (0.65+0.55)/2=0.6
Silhouette Score = (b-a)/max(a,b)
(0.6-0.1)/ 0.6= 0.833
For point A2:
a= 0.1/1=0.1
b= (0.70+0.60)/2=0.65
Silhouette Score = (b-a)/max(a,b)
(0.65-0.1)/ 0.65= 0.846
For point A3:
a= 0.30/1=0.30
b= (0.65+0.70)/2=0.675
Silhouette Score = (b-a)/max(a,b)
(0.675-0.30)/ 0.675= 0.555
For point A4:
a= 0.30/1=0.30
b= (0.55+0.60)/2=0.575
Silhouette Score = (b-a)/max(a,b)
(0.575-0.30)/ 0.575= 0.478
Point A1 and A2 are lying cluster C1 , so for computing Silhouette Score
for cluster C1,
(0.833+0.846)/2= 1.679/2= 0.839
Point A3 and A4 are lying cluster C2 , so for computing Silhouette Score
for cluster C2,
(0.555+0.478)/2= 1.033/2= 0.5165
Silhouette Score/Coefficient for overall clustering problem is:
(0.839+0.5165)/2= 0.6775 or 0.678
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np
data = np.array([[3, 7], [9, 8], [5, 4], [2, 6], [5,6]])
labels = np.array([0, 0, 1, 1, 1])
Silhouette_Score = silhouette_score(data, labels)
print("The average silhouette score is:", Silhouette_Score)
Davies-Bouldin Index
• The Davies-Bouldin index (DBI) is a metric for assessing the separation
and compactness of clusters.
• It is based on the idea that good clusters are those that have low
within-cluster variation and high between-cluster separation.
• The minimum score is zero, with lower values indicating better
clustering.
• Lower the DB index value, better is the clustering. It also has a
drawback. A good value reported by this method does not
imply the best information retrieval.
Where,
Example
• Data Point 1: (2, 3)
• Data Point 2: (3, 2)
• Data Point 3: (8, 8)
• Data Point 4: (9, 7)
• Data Point 5: (15, 14)
• Data Point 6: (16, 13)
• Let's say we want to perform K-means clustering on this dataset with
K = 2. After clustering, the data points are divided into two clusters:
• Cluster 1: {Data Point 1, Data Point 2} Cluster 2: {Data Point 3, Data
Point 4, Data Point 5, Data Point 6}
• The centroids of these clusters are approximately:
• Centroid of Cluster 1: (2.5, 2.5) Centroid of Cluster 2: (12, 10.5)
• Now, let's calculate the pairwise distances between the centroids of
each cluster:
• Now, let's calculate the pairwise distances between the centroids of each
cluster:
• Distance between Cluster 1 centroid and Cluster 2 centroid
• sqrt((12 - 2.5)^2 + (10.5 - 2.5)^2) ≈ 11.64
• Next, for each cluster, let's calculate the average distance of its data points
to its centroid:
• For Cluster 1:
• Average distance = [(2.5 - 2)^2 + (2.5 - 3)^2]^0.5 ≈ 0.71 (For all datapoints
in C1)
• For Cluster 2:
• Average distance = [(12 - 8)^2 + (10.5 - 8)^2 + (12 - 15)^2 + (10.5 -
14)^2]^0.5 ≈ 4.56 (For all datapoints in C2)
• Now, calculate the Davies-Bouldin Index for each cluster:
• For Cluster 1:
• DB1 = 0.71 / 11.64 ≈ 0.061
• For Cluster 2:
• DB2 = 4.56 / 11.64 ≈ 0.392
• Finally, calculate the overall Davies-Bouldin Index for this clustering
solution:
• DBI = (DB1 + DB2) / 2 ≈ (0.061 + 0.392) / 2 ≈ 0.227
• In this example, the Davies-Bouldin Index for this clustering solution is
approximately 0.227. Lower values of the Davies-Bouldin Index indicate
better-defined clusters, where data points within clusters are close to each
other and far from points in other clusters.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import
davies_bouldin_score
data1 = np.array([[6, 8], [9, 5], [5, 4], [2,
6], [5,6], [3,4]])
labels1 = np.array([0, 0, 1, 1, 1, 0])
davies_bouldin_index =
davies_bouldin_score(data1, labels1)
print("Davies-Bouldin Index:",
davies_bouldin_index)
Dunn index:
• A metric for evaluating clustering algorithms, is an internal evaluation scheme,
where the result is based on the clustered data itself.
• Like all other such indices, the aim of this Dunn index to identify sets of clusters
that are compact, with a small variance between members of the cluster, and well
separated, where the means of different clusters are sufficiently far apart, as
compared to the within cluster variance.
• Higher the Dunn index value, better is the clustering.
• The number of clusters that maximizes Dunn index is taken as the optimal number
of clusters k.
Where,
• A higher Dunn Index value indicates better clustering, as it implies
that the clusters are compact and well-separated. Let's go through a
numerical example to illustrate the concept.
• Consider a dataset with 8 data points in a 2D space:
• Data Point 1: (2, 3) Data Point 2: (3, 2) Data Point 3: (4, 3) Data Point
4: (8, 8) Data Point 5: (9, 7) Data Point 6: (10, 8) Data Point 7: (15, 14)
Data Point 8: (16, 13)
• Let's say we want to perform K-means clustering on this dataset with
K = 2. After clustering, the data points are divided into two clusters:
• Cluster 1: {Data Point 1, Data Point 2, Data Point 3} Cluster 2: {Data
Point 4, Data Point 5, Data Point 6, Data Point 7, Data Point 8}
• Now, let's calculate the inter-cluster distances (minimum distance
between any two points from different clusters) and intra-cluster
distances (maximum distance between any two points within the
same cluster):
• Inter-cluster distance:
• Between Cluster 1 and Cluster 2:
• Min inter-cluster distance = distance(Data Point 1, Data Point 4) = sqrt((8 - 2)^2 + (8 - 3)^2) ≈
8.25
• Intra-cluster distances:
• For Cluster 1:
• Max intra-cluster distance = max(distance(Data Point 1, Data Point 2), distance(Data Point 1,
Data Point 3), distance(Data Point 2, Data Point 3)) = max(sqrt(2), sqrt(1), sqrt(1)) = sqrt(2) ≈
1.41
• For Cluster 2:
• Max intra-cluster distance = max(distance(Data Point 4, Data Point 5), distance(Data Point 4,
Data Point 6), distance(Data Point 4, Data Point 7), distance(Data Point 4, Data Point 8),
distance(Data Point 5, Data Point 6), distance(Data Point 5, Data Point 7), distance(Data Point
5, Data Point 8), distance(Data Point 6, Data Point 7), distance(Data Point 6, Data Point 8),
distance(Data Point 7, Data Point 8)) = max(sqrt(1), sqrt(2), sqrt(2), sqrt(5), sqrt(2), sqrt(2),
sqrt(5), sqrt(2), sqrt(5), sqrt(2)) = sqrt(5) ≈ 2.24
• Now, calculate the Dunn Index:
• Dunn Index = Min inter-cluster distance / Max intra-cluster distance ≈
8.25 / 2.24 ≈ 3.68
• In this example, the Dunn Index for this clustering solution is
approximately 3.68. A higher Dunn Index indicates that the clusters
are well-separated and compact within themselves, which is desirable
for a good clustering solution.
Example
• Let's work through a numerical example to calculate the Dunn Index for a simple set of clusters.
Suppose you have the following data points and their corresponding clusters:
Data Points:
• A(1, 2)
• B(2, 2)
• C(5, 8)
• D(6, 8)
• E(10, 12)
• F(11, 12)
Clusters:
• Cluster 1: {A, B}
• Cluster 2: {C, D}
• Cluster 3: {E, F}
• Distance between Cluster 1 and Cluster 2 (d(1, 2)):
• d(A, C) = sqrt((1-5)^2 + (2-8)^2) = sqrt(32) = 4√2
• d(A, D) = sqrt((1-6)^2 + (2-8)^2) = sqrt(41) ≈ 6.40
• d(B, C) = sqrt((2-5)^2 + (2-8)^2) = sqrt(29) ≈ 5.39
• d(B, D) = sqrt((2-6)^2 + (2-8)^2) = sqrt(36) = 6
• So, d(1, 2) = 4√2
• Distance between Cluster 1 and Cluster 3 (d(1, 3)):
• d(A, E) = sqrt((1-10)^2 + (2-12)^2) = sqrt(170) ≈ 13.04
• d(A, F) = sqrt((1-11)^2 + (2-12)^2) = sqrt(200) ≈ 14.14
• d(B, E) = sqrt((2-10)^2 + (2-12)^2) = sqrt(164) ≈ 12.81
• d(B, F) = sqrt((2-11)^2 + (2-12)^2) = sqrt(194) ≈ 13.93
• So, d(1, 3) = 12.81
• Distance between Cluster 2 and Cluster 3 (d(2, 3)):
• d(C, E) = sqrt((5-10)^2 + (8-12)^2) = sqrt(20) ≈ 4.47
• d(C, F) = sqrt((5-11)^2 + (8-12)^2) = sqrt(32) ≈ 5.66
• d(D, E) = sqrt((6-10)^2 + (8-12)^2) = sqrt(16) = 4
• d(D, F) = sqrt((6-11)^2 + (8-12)^2) = sqrt(29) ≈ 5.39
• So, d(2, 3) = 4.47
• Now, let's calculate the minimum inter-cluster distance and the
maximum intra-cluster distance:
• Minimum Inter-cluster Distance: min(d(1, 2), d(1, 3), d(2, 3)) =
min(4√2, 12.81, 4.47) = 4√2
• Maximum Intra-cluster Distance: max(max(d(A, B), d(C, D), d(E, F))) =
max(6.40, 6.40, 14.14) = 14.14
• Finally, we can calculate the Dunn Index:
• Dunn Index = min(d(1, 2), d(1, 3), d(2, 3)) / max(max(d(A, B), d(C, D),
d(E, F))) Dunn Index = (4√2) / 14.14 ≈ 0.2828
• So, the Dunn Index for these clusters is approximately 0.2828.
Data Point Coordinates Cluster
1 (2, 3) A
2 (2, 5) A
3 (3, 8) B
4 (6, 5) B
5 (8, 8) C
6 (9, 6) C
7 (10, 2) A
8 (12, 4) A
9 (15, 7) C
10 (17, 5) C
Drawbacks of Dunn index:
• As the number of clusters and dimensionality of the data
increase, the computational cost also increases.
Adjusted Rand Index (ARI)
• It measures the similarity between the true class labels and the
clusters generated by a clustering algorithm while accounting for
chance agreement.
• The ARI produces a score between -1 and 1, where higher values
indicate better agreement between the predicted clusters and the
true labels.
• Contigency table- m x n
• m= number of clusters by c1 algo.
• n= number of clusters by c2 algo (Ground Truth)
To use the Adjusted Rand Index for evaluating clustering, follow these steps:
Perform Clustering: Apply a clustering algorithm to your dataset to create clusters of data points. This
could be an algorithm like k-means, hierarchical clustering, DBSCAN, or any other clustering technique.
Obtain True Class Labels: If you have access to ground truth labels or class assignments for your data,
this is ideal. These true labels represent the actual groups or categories that your data points belong to.
Example
• Suppose you have two clustering results for a dataset with 100 data points:
• Create a contingency table that shows the number of data points in
common between the two clusterings. Rows represent clusters in
Clustering A, and columns represent clusters in Clustering B. The table
might look like this:
• | Cluster 1 | Cluster 2 |
• -----------------------------------
• Cluster 1 | 30 | 20 |
• -----------------------------------
• Cluster 2 | 40 | 10 |
• -----------------------------------
• Calculate the sum of squares for the rows and columns of the
contingency table. Let's denote these as SSR (Sum of Squares for
Rows) and SSC (Sum of Squares for Columns):
• SSR = (30^2 + 20^2) + (40^2 + 10^2) = 900 + 4000 = 4900
• SSC = (30^2 + 40^2) + (20^2 + 10^2) = 2500 + 500 = 3000
• Now Calculate ARI = [ (RI - Expected_RI) ] / [ max(RI) - Expected_RI ) ]
• The Rand Index (RI) is calculated as (SSR + SSC) / [2 * (N choose 2)], where
N is the total number of data points (100 in this case).
• RI = (4900 + 3000) / [2 * (100 choose 2)] = 7900 / [2 * 4950] ≈ 0.799
• Expected_RI = (SSR * SSC) / [2 * (N choose 2)^2] = (4900 * 3000) / [2 *
4950^2] ≈ 0.109
• ARI = [ (0.799 - 0.109) ] / [ 1 - 0.109 ] ≈ 0.691 / 0.891 ≈ 0.775
• So, the Adjusted Rand Index (ARI) for the given clustering A and B is
approximately 0.775. This indicates a moderate agreement between the
two clusterings, where a higher ARI value would indicate better agreement.
Normalized Mutual Information (NMI)
• Normalized Mutual Information (NMI) is a normalization of the
Mutual Information (MI) score to scale the results between 0 (no
mutual information) and 1 (perfect correlation).
NMI
• NMI is a good measure for determining the
quality of clustering.
• It is an external measure because we need the
class labels of the instances to determine the
NMI.
• Since it’s normalized we can measure and
compare the NMI between different
clusterings having different number of
clusters.
Normalized Mutual Information
• Normalized Mutual Information:
2 × 𝐼(𝑌; 𝐶)
𝑁𝑀𝐼 𝑌, 𝐶 =
𝐻 𝑌 +𝐻 𝐶
• where,
1) Y = class labels
2) C = cluster labels
3) H(.) = Entropy
4) I(Y;C) = Mutual Information b/w Y and C Note: All logs are base-2.
Calculating NMI for Clustering
• Assume m=3 classes and k=2 clusters
Cluster-1 (C=1) Cluster-2 (C=2)
Class-1 (Y=1) Class-2 (Y=2) Class-3 (Y=3)
H(Y) = Entropy of Class Labels
• P(Y=1) = 5/20 = ¼
• P(Y=2) = 10/20 = ½
• P(Y=3) = 5/20 =¼
1 1 1 1 1 1
• H(Y) = − log − log − log = 1.5
4 4 4 4 2 2
This is calculated for the entire dataset and can be
calculated prior to clustering, as it will not change
depending on the clustering output.
H(C) = Entropy of Cluster Labels
• P(C=1) = 10/20 = 1/2
• P(C=2) = 10/20 = ½
1 1 1 1
• H(Y) =− log − log =1
2 2 2 2
This will be calculated every time the clustering
changes. You can see from the figure that the
clusters are balanced (have equal number of
instances).
I(Y;C)= Mutual Information
• Mutual information is given as:
– 𝐼 𝑌; 𝐶 = 𝐻 𝑌 − 𝐻 𝑌 𝐶
– We already know H(Y)
– H(Y|C) is the entropy of class labels within each
cluster, how do we calculate this??
Mutual Information tells us the reduction in the
entropy of class labels that we get if we know the
cluster labels. (Similar to Information gain in
deicison trees)
H(Y|C): conditional entropy of class labels for
clustering C
• Consider Cluster-1:
– P(Y=1|C=1)=3/10 (three triangles in cluster-1)
– P(Y=2|C=1)=3/10 (three rectangles in cluster-1)
– P(Y=3|C=1)=4/10 (four stars in cluster-1)
– Calculate conditional entropy as:
𝐻 𝑌 𝐶 = 1 = −𝑃(𝐶 = 1) ∑ 𝑃 𝑌 = 𝑦 𝐶 = 1 log(𝑃 𝑌 = 𝑦 𝐶 = 1 )
Y∈ {1,2,3}
1 3 3 3 3 4 4
=− 2 × [ log + log( )+ log( )] = 0.7855
10 10 10 10 10 10
H(Y|C): conditional entropy of class labels for
clustering C
• Now, consider Cluster-2:
– P(Y=1|C=2)=2/10 (two triangles in cluster-2)
– P(Y=2|C=2)=7/10 (seven rectangles in cluster-2)
– P(Y=3|C=2)=1/10 (one star in cluster-2)
– Calculate conditional entropy as:
𝐻 𝑌 𝐶 = 2 = −𝑃(𝐶 = 2) ∑ 𝑃 𝑌 = 𝑦 𝐶 = 2 log(𝑃 𝑌 = 𝑦 𝐶 = 2 )
Y∈ {1,2,3}
1 2 2 7 7 1 1
=− 2 × [ log + log( )+ log( )] = 0.5784
10 10 10 10 10 10
I(Y;C)
• Finally the mutual information is:
𝐼 𝑌; 𝐶 = 𝐻 𝑌 − 𝐻 𝑌 𝐶
= 1.5 − 0.7855 + 0.5784
= 0.1361
The NMI is therefore,
2 × 𝐼(𝑌; 𝐶)
𝑁𝑀𝐼 𝑌, 𝐶 =
𝐻 𝑌 +𝐻 𝐶
2 × 0.1361
𝑁𝑀𝐼 𝑌, 𝐶 = = 0.1089
1.5 + 1
Calculate NMI for the following
Homogeneity
• A clustering result satisfies homogeneity if all of its clusters contain
only data points which are members of a single class.
• This metric is independent of the absolute values of the labels: a
permutation of the class or cluster label values won’t change the
score value in any way.
• Syntax : sklearn.metrics.homogeneity_score(labels_true,
labels_pred)
• The Metric is not symmetric,
switching label_true with label_pred will return
the completeness_score.
Parameters :
•labels_true:<int array, shape = [n_samples]> : It accept the ground truth
class labels to be used as a reference.
•labels_pred: <array-like of shape (n_samples,)>: It accepts the cluster labels
to evaluate.
Returns:
homogeneity:<float>: Its return the score between 0.0 and 1.0 stands for
perfectly homogeneous labeling.
Completeness score
• This score is complementary to the previous one. Its purpose is to
provide a piece of information about the assignment of samples
belonging to the same class.
• More precisely, a good clustering algorithm should assign all samples
with the same true label to the same cluster.
Completeness portrays the closeness of the clustering algorithm to this (completeness_score)
perfection.
This metric is autonomous of the outright values of the labels. A permutation of the cluster
label values won’t change the score value in any way.
sklearn.metrics.completeness_score()
Syntax: sklearn.metrics.completeness_score(labels_true, labels_pred)
•labels_true:<int array, shape = [n_samples]>: It accepts the
ground truth class labels to be used as a reference.
•labels_pred: <array-like of shape (n_samples,)>: It accepts
the cluster labels to evaluate.
Returns: completeness score between 0.0 and 1.0. 1.0
stands for perfectly completeness labeling.
V-Measure
One of the primary disadvantages of any clustering technique is that it is difficult to evaluate its
performance. To tackle this problem, the metric of V-Measure was developed. The calculation of
the V-Measure first requires the calculation of two terms:-
1.Homogeneity: A perfectly homogeneous clustering is one where each cluster has data-points
belonging to the same class label. Homogeneity describes the closeness of the clustering
algorithm to this perfection.
2.Completeness: A perfectly complete clustering is one where all data-points belonging to the
same class are clustered into the same cluster. Completeness describes the closeness of the
clustering algorithm to this perfection.
sklearn.metrics.v_measure_score(labels_true, labels_pred, *, beta=1.0)
The V-measure is the harmonic mean between homogeneity and completeness:
v = (1 + beta) * homogeneity * completeness
/ (beta * homogeneity + completeness)
Fowlkes-Mallows Score
• The Fowlkes-Mallows Score is an evaluation metric to evaluate the
similarity among clustering's obtained after applying different
clustering algorithms.
• Although technically it is used to quantify the similarity between two
clustering's, it is typically used to evaluate the clustering performance
of a clustering algorithm by assuming the second clustering to be the
ground-truth i.e. the observed data and assuming it to be the perfect
clustering.
The Fowlkes-Mallows index (FMI) is defined as the geometric mean between of the precision
and recall:
FMI = TP / sqrt((TP + FP) * (TP + FN))
• Where TP is the number of True Positive (i.e. the number of pair of points that belongs in the
same clusters in both labels_true and labels_pred),
• FP is the number of False Positive (i.e. the number of pair of points that belongs in the same
clusters in labels_true and not in labels_pred)
• and FN is the number of False Negative (i.e the number of pair of points that belongs in the
same clusters in labels_pred and not in labels_True).
The score ranges from 0 to 1. A high value indicates a good similarity between two clusters.
Adjusted Mutual Information (AMI)
• Adjusted Mutual Information (AMI) is an adjustment of the Mutual
Information (MI) score to account for chance.
• It accounts for the fact that the MI is generally higher for two
clusterings with a larger number of clusters, regardless of whether
there is actually more information shared.
• For two clusterings U and V, the AMI is given as:
• AMI(U, V) = [MI(U, V) - E(MI(U, V))] / [avg(H(U), H(V)) - E(MI(U, V))]
Thank you!