0% found this document useful (0 votes)

84 views54 pages

UNIT 3-Clustering Metrics

Clustering metrics are quantitative measures used to evaluate clustering algorithms, categorized into internal and external evaluation metrics. Internal metrics assess cluster quality without external references, while external metrics compare clustering results against a ground truth. Key metrics include Silhouette Score, Davies-Bouldin Index, and Dunn Index, each providing insights into cluster compactness and separation.

Uploaded by

Anshaj Shukla

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

84 views54 pages

UNIT 3-Clustering Metrics

Uploaded by

Anshaj Shukla

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 54

Clustering metrics

Clustering metrics
• Clustering metrics are quantitative measures used to evaluate the quality of
clustering algorithms and the resulting clusters.
• Internal Evaluation Metrics: These metrics evaluate the quality of clusters without
any external reference. They measure the compactness of data points within the
same cluster and the separation between different clusters.
• Such as: Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index (Variance Ratio
Criterion), Dunn Index
• External Evaluation Metrics: These metrics require a ground truth or a reference
set of labels to compare the clustering results against. They measure the
agreement between the generated clusters and the true classes in the reference
set.
• Such as: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Fowlkes-Mallows
Index
Silhouette Score
• Silhouette Coefficient or silhouette score is a metric used to calculate
the goodness of a clustering technique. Its value ranges from -1 to 1.
• 1: Means clusters are well apart from each other and clearly
distinguished (Best Case)
• 0: Means clusters are indifferent, or we can say that the distance
between clusters is not significant (Overlapping Clusters)
• -1: Means clusters are assigned in the wrong way (Worst Case)
It is computed as:
• The silhouette value is a measure of how similar an object is to its
own cluster (cohesion) compared to other clusters (separation).
• Silhouette Score = (b-a)/max(a,b)
where
• a= average intra-cluster distance i.e the average distance between
each point within a cluster
• b= average inter-cluster distance i.e the average distance between all
clusters
Example:
Datapoints Cluster Label
A1 C1
A2 C1
A3 C2
A4 C2

Datapoint A1 A2 A3 A4
A1 0 0.10 0.65 0.55
A2 0.10 0 0.70 0.60
A3 0.65 0.70 0 0.30
A4 0.55 0.60 0.30 0
For point A1:
a= 0.1/1=0.1
b= (0.65+0.55)/2=0.6
Silhouette Score = (b-a)/max(a,b)
(0.6-0.1)/ 0.6= 0.833

For point A2:

a= 0.1/1=0.1
b= (0.70+0.60)/2=0.65
Silhouette Score = (b-a)/max(a,b)
(0.65-0.1)/ 0.65= 0.846

For point A3:

a= 0.30/1=0.30
b= (0.65+0.70)/2=0.675
Silhouette Score = (b-a)/max(a,b)
(0.675-0.30)/ 0.675= 0.555

For point A4:

a= 0.30/1=0.30
b= (0.55+0.60)/2=0.575
Silhouette Score = (b-a)/max(a,b)
(0.575-0.30)/ 0.575= 0.478
Point A1 and A2 are lying cluster C1 , so for computing Silhouette Score
for cluster C1,
(0.833+0.846)/2= 1.679/2= 0.839

Point A3 and A4 are lying cluster C2 , so for computing Silhouette Score

for cluster C2,
(0.555+0.478)/2= 1.033/2= 0.5165

Silhouette Score/Coefficient for overall clustering problem is:

(0.839+0.5165)/2= 0.6775 or 0.678
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np
data = np.array([[3, 7], [9, 8], [5, 4], [2, 6], [5,6]])
labels = np.array([0, 0, 1, 1, 1])
Silhouette_Score = silhouette_score(data, labels)
print("The average silhouette score is:", Silhouette_Score)
Davies-Bouldin Index

• The Davies-Bouldin index (DBI) is a metric for assessing the separation

and compactness of clusters.
• It is based on the idea that good clusters are those that have low
within-cluster variation and high between-cluster separation.
• The minimum score is zero, with lower values indicating better
clustering.
• Lower the DB index value, better is the clustering. It also has a
drawback. A good value reported by this method does not
imply the best information retrieval.
Where,
Example
• Data Point 1: (2, 3)
• Data Point 2: (3, 2)
• Data Point 3: (8, 8)
• Data Point 4: (9, 7)
• Data Point 5: (15, 14)
• Data Point 6: (16, 13)
• Let's say we want to perform K-means clustering on this dataset with
K = 2. After clustering, the data points are divided into two clusters:

• Cluster 1: {Data Point 1, Data Point 2} Cluster 2: {Data Point 3, Data

Point 4, Data Point 5, Data Point 6}

• The centroids of these clusters are approximately:

• Centroid of Cluster 1: (2.5, 2.5) Centroid of Cluster 2: (12, 10.5)

• Now, let's calculate the pairwise distances between the centroids of

each cluster:
• Now, let's calculate the pairwise distances between the centroids of each
cluster:
• Distance between Cluster 1 centroid and Cluster 2 centroid
• sqrt((12 - 2.5)^2 + (10.5 - 2.5)^2) ≈ 11.64
• Next, for each cluster, let's calculate the average distance of its data points
to its centroid:
• For Cluster 1:
• Average distance = [(2.5 - 2)^2 + (2.5 - 3)^2]^0.5 ≈ 0.71 (For all datapoints
in C1)
• For Cluster 2:
• Average distance = [(12 - 8)^2 + (10.5 - 8)^2 + (12 - 15)^2 + (10.5 -
14)^2]^0.5 ≈ 4.56 (For all datapoints in C2)
• Now, calculate the Davies-Bouldin Index for each cluster:
• For Cluster 1:
• DB1 = 0.71 / 11.64 ≈ 0.061
• For Cluster 2:
• DB2 = 4.56 / 11.64 ≈ 0.392
• Finally, calculate the overall Davies-Bouldin Index for this clustering
solution:
• DBI = (DB1 + DB2) / 2 ≈ (0.061 + 0.392) / 2 ≈ 0.227
• In this example, the Davies-Bouldin Index for this clustering solution is
approximately 0.227. Lower values of the Davies-Bouldin Index indicate
better-defined clusters, where data points within clusters are close to each
other and far from points in other clusters.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import
davies_bouldin_score
data1 = np.array([[6, 8], [9, 5], [5, 4], [2,
6], [5,6], [3,4]])
labels1 = np.array([0, 0, 1, 1, 1, 0])
davies_bouldin_index =
davies_bouldin_score(data1, labels1)
print("Davies-Bouldin Index:",
davies_bouldin_index)
Dunn index:
• A metric for evaluating clustering algorithms, is an internal evaluation scheme,
where the result is based on the clustered data itself.
• Like all other such indices, the aim of this Dunn index to identify sets of clusters
that are compact, with a small variance between members of the cluster, and well
separated, where the means of different clusters are sufficiently far apart, as
compared to the within cluster variance.
• Higher the Dunn index value, better is the clustering.
• The number of clusters that maximizes Dunn index is taken as the optimal number
of clusters k.
Where,
• A higher Dunn Index value indicates better clustering, as it implies
that the clusters are compact and well-separated. Let's go through a
numerical example to illustrate the concept.
• Consider a dataset with 8 data points in a 2D space:

• Data Point 1: (2, 3) Data Point 2: (3, 2) Data Point 3: (4, 3) Data Point
4: (8, 8) Data Point 5: (9, 7) Data Point 6: (10, 8) Data Point 7: (15, 14)
Data Point 8: (16, 13)
• Let's say we want to perform K-means clustering on this dataset with
K = 2. After clustering, the data points are divided into two clusters:
• Cluster 1: {Data Point 1, Data Point 2, Data Point 3} Cluster 2: {Data
Point 4, Data Point 5, Data Point 6, Data Point 7, Data Point 8}
• Now, let's calculate the inter-cluster distances (minimum distance
between any two points from different clusters) and intra-cluster
distances (maximum distance between any two points within the
same cluster):
• Inter-cluster distance:
• Between Cluster 1 and Cluster 2:
• Min inter-cluster distance = distance(Data Point 1, Data Point 4) = sqrt((8 - 2)^2 + (8 - 3)^2) ≈
8.25
• Intra-cluster distances:
• For Cluster 1:
• Max intra-cluster distance = max(distance(Data Point 1, Data Point 2), distance(Data Point 1,
Data Point 3), distance(Data Point 2, Data Point 3)) = max(sqrt(2), sqrt(1), sqrt(1)) = sqrt(2) ≈
1.41
• For Cluster 2:
• Max intra-cluster distance = max(distance(Data Point 4, Data Point 5), distance(Data Point 4,
Data Point 6), distance(Data Point 4, Data Point 7), distance(Data Point 4, Data Point 8),
distance(Data Point 5, Data Point 6), distance(Data Point 5, Data Point 7), distance(Data Point
5, Data Point 8), distance(Data Point 6, Data Point 7), distance(Data Point 6, Data Point 8),
distance(Data Point 7, Data Point 8)) = max(sqrt(1), sqrt(2), sqrt(2), sqrt(5), sqrt(2), sqrt(2),
sqrt(5), sqrt(2), sqrt(5), sqrt(2)) = sqrt(5) ≈ 2.24
• Now, calculate the Dunn Index:
• Dunn Index = Min inter-cluster distance / Max intra-cluster distance ≈
8.25 / 2.24 ≈ 3.68
• In this example, the Dunn Index for this clustering solution is
approximately 3.68. A higher Dunn Index indicates that the clusters
are well-separated and compact within themselves, which is desirable
for a good clustering solution.
Example
• Let's work through a numerical example to calculate the Dunn Index for a simple set of clusters.
Suppose you have the following data points and their corresponding clusters:
Data Points:
• A(1, 2)
• B(2, 2)
• C(5, 8)
• D(6, 8)
• E(10, 12)
• F(11, 12)
Clusters:
• Cluster 1: {A, B}
• Cluster 2: {C, D}
• Cluster 3: {E, F}
• Distance between Cluster 1 and Cluster 2 (d(1, 2)):
• d(A, C) = sqrt((1-5)^2 + (2-8)^2) = sqrt(32) = 4√2
• d(A, D) = sqrt((1-6)^2 + (2-8)^2) = sqrt(41) ≈ 6.40
• d(B, C) = sqrt((2-5)^2 + (2-8)^2) = sqrt(29) ≈ 5.39
• d(B, D) = sqrt((2-6)^2 + (2-8)^2) = sqrt(36) = 6
• So, d(1, 2) = 4√2
• Distance between Cluster 1 and Cluster 3 (d(1, 3)):
• d(A, E) = sqrt((1-10)^2 + (2-12)^2) = sqrt(170) ≈ 13.04
• d(A, F) = sqrt((1-11)^2 + (2-12)^2) = sqrt(200) ≈ 14.14
• d(B, E) = sqrt((2-10)^2 + (2-12)^2) = sqrt(164) ≈ 12.81
• d(B, F) = sqrt((2-11)^2 + (2-12)^2) = sqrt(194) ≈ 13.93
• So, d(1, 3) = 12.81
• Distance between Cluster 2 and Cluster 3 (d(2, 3)):
• d(C, E) = sqrt((5-10)^2 + (8-12)^2) = sqrt(20) ≈ 4.47
• d(C, F) = sqrt((5-11)^2 + (8-12)^2) = sqrt(32) ≈ 5.66
• d(D, E) = sqrt((6-10)^2 + (8-12)^2) = sqrt(16) = 4
• d(D, F) = sqrt((6-11)^2 + (8-12)^2) = sqrt(29) ≈ 5.39
• So, d(2, 3) = 4.47
• Now, let's calculate the minimum inter-cluster distance and the
maximum intra-cluster distance:
• Minimum Inter-cluster Distance: min(d(1, 2), d(1, 3), d(2, 3)) =
min(4√2, 12.81, 4.47) = 4√2
• Maximum Intra-cluster Distance: max(max(d(A, B), d(C, D), d(E, F))) =
max(6.40, 6.40, 14.14) = 14.14
• Finally, we can calculate the Dunn Index:
• Dunn Index = min(d(1, 2), d(1, 3), d(2, 3)) / max(max(d(A, B), d(C, D),
d(E, F))) Dunn Index = (4√2) / 14.14 ≈ 0.2828
• So, the Dunn Index for these clusters is approximately 0.2828.
Data Point Coordinates Cluster
1 (2, 3) A
2 (2, 5) A
3 (3, 8) B
4 (6, 5) B
5 (8, 8) C
6 (9, 6) C
7 (10, 2) A
8 (12, 4) A
9 (15, 7) C
10 (17, 5) C
Drawbacks of Dunn index:
• As the number of clusters and dimensionality of the data
increase, the computational cost also increases.
Adjusted Rand Index (ARI)
• It measures the similarity between the true class labels and the
clusters generated by a clustering algorithm while accounting for
chance agreement.
• The ARI produces a score between -1 and 1, where higher values
indicate better agreement between the predicted clusters and the
true labels.
• Contigency table- m x n
• m= number of clusters by c1 algo.
• n= number of clusters by c2 algo (Ground Truth)
To use the Adjusted Rand Index for evaluating clustering, follow these steps:

Perform Clustering: Apply a clustering algorithm to your dataset to create clusters of data points. This
could be an algorithm like k-means, hierarchical clustering, DBSCAN, or any other clustering technique.

Obtain True Class Labels: If you have access to ground truth labels or class assignments for your data,
this is ideal. These true labels represent the actual groups or categories that your data points belong to.
Example
• Suppose you have two clustering results for a dataset with 100 data points:
• Create a contingency table that shows the number of data points in
common between the two clusterings. Rows represent clusters in
Clustering A, and columns represent clusters in Clustering B. The table
might look like this:
• | Cluster 1 | Cluster 2 |
• -----------------------------------
• Cluster 1 | 30 | 20 |
• -----------------------------------
• Cluster 2 | 40 | 10 |
• -----------------------------------
• Calculate the sum of squares for the rows and columns of the
contingency table. Let's denote these as SSR (Sum of Squares for
Rows) and SSC (Sum of Squares for Columns):
• SSR = (30^2 + 20^2) + (40^2 + 10^2) = 900 + 4000 = 4900
• SSC = (30^2 + 40^2) + (20^2 + 10^2) = 2500 + 500 = 3000
• Now Calculate ARI = [ (RI - Expected_RI) ] / [ max(RI) - Expected_RI ) ]
• The Rand Index (RI) is calculated as (SSR + SSC) / [2 * (N choose 2)], where
N is the total number of data points (100 in this case).
• RI = (4900 + 3000) / [2 * (100 choose 2)] = 7900 / [2 * 4950] ≈ 0.799
• Expected_RI = (SSR * SSC) / [2 * (N choose 2)^2] = (4900 * 3000) / [2 *
4950^2] ≈ 0.109

• ARI = [ (0.799 - 0.109) ] / [ 1 - 0.109 ] ≈ 0.691 / 0.891 ≈ 0.775

• So, the Adjusted Rand Index (ARI) for the given clustering A and B is
approximately 0.775. This indicates a moderate agreement between the
two clusterings, where a higher ARI value would indicate better agreement.
Normalized Mutual Information (NMI)
• Normalized Mutual Information (NMI) is a normalization of the
Mutual Information (MI) score to scale the results between 0 (no
mutual information) and 1 (perfect correlation).
NMI
• NMI is a good measure for determining the
quality of clustering.
• It is an external measure because we need the
class labels of the instances to determine the
NMI.
• Since it’s normalized we can measure and
compare the NMI between different
clusterings having different number of
clusters.
Normalized Mutual Information
• Normalized Mutual Information:
2 × 𝐼(𝑌; 𝐶)
𝑁𝑀𝐼 𝑌, 𝐶 =
𝐻 𝑌 +𝐻 𝐶
• where,
1) Y = class labels
2) C = cluster labels
3) H(.) = Entropy
4) I(Y;C) = Mutual Information b/w Y and C Note: All logs are base-2.
Calculating NMI for Clustering
• Assume m=3 classes and k=2 clusters

Cluster-1 (C=1) Cluster-2 (C=2)

Class-1 (Y=1) Class-2 (Y=2) Class-3 (Y=3)

H(Y) = Entropy of Class Labels
• P(Y=1) = 5/20 = ¼
• P(Y=2) = 10/20 = ½
• P(Y=3) = 5/20 =¼
1 1 1 1 1 1
• H(Y) = − log − log − log = 1.5
4 4 4 4 2 2

This is calculated for the entire dataset and can be

calculated prior to clustering, as it will not change
depending on the clustering output.
H(C) = Entropy of Cluster Labels
• P(C=1) = 10/20 = 1/2
• P(C=2) = 10/20 = ½
1 1 1 1
• H(Y) =− log − log =1
2 2 2 2

This will be calculated every time the clustering

changes. You can see from the figure that the
clusters are balanced (have equal number of
instances).
I(Y;C)= Mutual Information
• Mutual information is given as:
– 𝐼 𝑌; 𝐶 = 𝐻 𝑌 − 𝐻 𝑌 𝐶
– We already know H(Y)
– H(Y|C) is the entropy of class labels within each
cluster, how do we calculate this??

Mutual Information tells us the reduction in the

entropy of class labels that we get if we know the
cluster labels. (Similar to Information gain in
deicison trees)
H(Y|C): conditional entropy of class labels for
clustering C
• Consider Cluster-1:
– P(Y=1|C=1)=3/10 (three triangles in cluster-1)
– P(Y=2|C=1)=3/10 (three rectangles in cluster-1)
– P(Y=3|C=1)=4/10 (four stars in cluster-1)
– Calculate conditional entropy as:

𝐻 𝑌 𝐶 = 1 = −𝑃(𝐶 = 1) ∑ 𝑃 𝑌 = 𝑦 𝐶 = 1 log(𝑃 𝑌 = 𝑦 𝐶 = 1 )
Y∈ {1,2,3}
1 3 3 3 3 4 4
=− 2 × [ log + log( )+ log( )] = 0.7855
10 10 10 10 10 10
H(Y|C): conditional entropy of class labels for
clustering C
• Now, consider Cluster-2:
– P(Y=1|C=2)=2/10 (two triangles in cluster-2)
– P(Y=2|C=2)=7/10 (seven rectangles in cluster-2)
– P(Y=3|C=2)=1/10 (one star in cluster-2)
– Calculate conditional entropy as:

𝐻 𝑌 𝐶 = 2 = −𝑃(𝐶 = 2) ∑ 𝑃 𝑌 = 𝑦 𝐶 = 2 log(𝑃 𝑌 = 𝑦 𝐶 = 2 )
Y∈ {1,2,3}
1 2 2 7 7 1 1
=− 2 × [ log + log( )+ log( )] = 0.5784
10 10 10 10 10 10
I(Y;C)
• Finally the mutual information is:
𝐼 𝑌; 𝐶 = 𝐻 𝑌 − 𝐻 𝑌 𝐶
= 1.5 − 0.7855 + 0.5784
= 0.1361

The NMI is therefore,

2 × 𝐼(𝑌; 𝐶)
𝑁𝑀𝐼 𝑌, 𝐶 =
𝐻 𝑌 +𝐻 𝐶

2 × 0.1361
𝑁𝑀𝐼 𝑌, 𝐶 = = 0.1089
1.5 + 1
Calculate NMI for the following
Homogeneity
• A clustering result satisfies homogeneity if all of its clusters contain
only data points which are members of a single class.
• This metric is independent of the absolute values of the labels: a
permutation of the class or cluster label values won’t change the
score value in any way.
• Syntax : sklearn.metrics.homogeneity_score(labels_true,
labels_pred)
• The Metric is not symmetric,
switching label_true with label_pred will return
the completeness_score.
Parameters :
•labels_true:<int array, shape = [n_samples]> : It accept the ground truth
class labels to be used as a reference.
•labels_pred: <array-like of shape (n_samples,)>: It accepts the cluster labels
to evaluate.
Returns:
homogeneity:<float>: Its return the score between 0.0 and 1.0 stands for
perfectly homogeneous labeling.
Completeness score
• This score is complementary to the previous one. Its purpose is to
provide a piece of information about the assignment of samples
belonging to the same class.
• More precisely, a good clustering algorithm should assign all samples
with the same true label to the same cluster.
Completeness portrays the closeness of the clustering algorithm to this (completeness_score)
perfection.
This metric is autonomous of the outright values of the labels. A permutation of the cluster
label values won’t change the score value in any way.

sklearn.metrics.completeness_score()
Syntax: sklearn.metrics.completeness_score(labels_true, labels_pred)

•labels_true:<int array, shape = [n_samples]>: It accepts the

ground truth class labels to be used as a reference.
•labels_pred: <array-like of shape (n_samples,)>: It accepts
the cluster labels to evaluate.
Returns: completeness score between 0.0 and 1.0. 1.0
stands for perfectly completeness labeling.
V-Measure
One of the primary disadvantages of any clustering technique is that it is difficult to evaluate its
performance. To tackle this problem, the metric of V-Measure was developed. The calculation of
the V-Measure first requires the calculation of two terms:-

1.Homogeneity: A perfectly homogeneous clustering is one where each cluster has data-points
belonging to the same class label. Homogeneity describes the closeness of the clustering
algorithm to this perfection.
2.Completeness: A perfectly complete clustering is one where all data-points belonging to the
same class are clustered into the same cluster. Completeness describes the closeness of the
clustering algorithm to this perfection.

sklearn.metrics.v_measure_score(labels_true, labels_pred, *, beta=1.0)

The V-measure is the harmonic mean between homogeneity and completeness:

v = (1 + beta) * homogeneity * completeness

/ (beta * homogeneity + completeness)
Fowlkes-Mallows Score
• The Fowlkes-Mallows Score is an evaluation metric to evaluate the
similarity among clustering's obtained after applying different
clustering algorithms.
• Although technically it is used to quantify the similarity between two
clustering's, it is typically used to evaluate the clustering performance
of a clustering algorithm by assuming the second clustering to be the
ground-truth i.e. the observed data and assuming it to be the perfect
clustering.
The Fowlkes-Mallows index (FMI) is defined as the geometric mean between of the precision
and recall:

FMI = TP / sqrt((TP + FP) * (TP + FN))

• Where TP is the number of True Positive (i.e. the number of pair of points that belongs in the
same clusters in both labels_true and labels_pred),
• FP is the number of False Positive (i.e. the number of pair of points that belongs in the same
clusters in labels_true and not in labels_pred)
• and FN is the number of False Negative (i.e the number of pair of points that belongs in the
same clusters in labels_pred and not in labels_True).

The score ranges from 0 to 1. A high value indicates a good similarity between two clusters.
Adjusted Mutual Information (AMI)
• Adjusted Mutual Information (AMI) is an adjustment of the Mutual
Information (MI) score to account for chance.
• It accounts for the fact that the MI is generally higher for two
clusterings with a larger number of clusters, regardless of whether
there is actually more information shared.
• For two clusterings U and V, the AMI is given as:

• AMI(U, V) = [MI(U, V) - E(MI(U, V))] / [avg(H(U), H(V)) - E(MI(U, V))]

Thank you!

UNIT 3-Clustering Metrics
No ratings yet
UNIT 3-Clustering Metrics
59 pages
Unit 3
No ratings yet
Unit 3
130 pages
5 - Clustering
No ratings yet
5 - Clustering
13 pages
CH-6 DM Clustering
No ratings yet
CH-6 DM Clustering
28 pages
Lecture 18 K Means Clustering
No ratings yet
Lecture 18 K Means Clustering
77 pages
Clustering FinancialData
No ratings yet
Clustering FinancialData
38 pages
3.k-Metoids and Hierarchical Updated
No ratings yet
3.k-Metoids and Hierarchical Updated
50 pages
K-Means Clustering Guide
No ratings yet
K-Means Clustering Guide
51 pages
ML Module 4 2022 1 PDF
No ratings yet
ML Module 4 2022 1 PDF
31 pages
Clustering & Association Mining Basics
No ratings yet
Clustering & Association Mining Basics
50 pages
Unit - 4 DWDM
No ratings yet
Unit - 4 DWDM
27 pages
ML 3 PPT Unit 3
No ratings yet
ML 3 PPT Unit 3
51 pages
K-means Clustering Explained
No ratings yet
K-means Clustering Explained
38 pages
Spatial Data Mining: Clustering Techniques
No ratings yet
Spatial Data Mining: Clustering Techniques
56 pages
Lect 4
No ratings yet
Lect 4
34 pages
MachineLearning Unit IV
No ratings yet
MachineLearning Unit IV
51 pages
Model 3
No ratings yet
Model 3
31 pages
Clustering
No ratings yet
Clustering
55 pages
Lecture 9 Kmean-V3
No ratings yet
Lecture 9 Kmean-V3
52 pages
Intro to Cluster Analysis
No ratings yet
Intro to Cluster Analysis
90 pages
Chapter 4
No ratings yet
Chapter 4
30 pages
Clustering Techniques in Data Mining
No ratings yet
Clustering Techniques in Data Mining
56 pages
Exp5 - Unsupervised Learning
No ratings yet
Exp5 - Unsupervised Learning
13 pages
Clustering - Introduction J Evaluation Metrics
No ratings yet
Clustering - Introduction J Evaluation Metrics
19 pages
PART2
No ratings yet
PART2
61 pages
Clustering Analysis
No ratings yet
Clustering Analysis
30 pages
DBSCAN
No ratings yet
DBSCAN
27 pages
Unit 4 Machine Learning
No ratings yet
Unit 4 Machine Learning
12 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
27 pages
ML - 8
No ratings yet
ML - 8
70 pages
K-Means Clustering Explained
No ratings yet
K-Means Clustering Explained
8 pages
DBSCAN Clustering Explained: Key Concepts
No ratings yet
DBSCAN Clustering Explained: Key Concepts
5 pages
K-Means Clustering Guide
No ratings yet
K-Means Clustering Guide
26 pages
Clustering TNP
No ratings yet
Clustering TNP
53 pages
Using ChatGPT for Cluster Diagrams
No ratings yet
Using ChatGPT for Cluster Diagrams
4 pages
ML Unit-5
No ratings yet
ML Unit-5
21 pages
MLT Unit 3 Notes
No ratings yet
MLT Unit 3 Notes
19 pages
DBSCAN
No ratings yet
DBSCAN
29 pages
K-Means Clustering for CS Students
No ratings yet
K-Means Clustering for CS Students
30 pages
08 K-Means
No ratings yet
08 K-Means
19 pages
DB Scan
No ratings yet
DB Scan
7 pages
Unit 4
No ratings yet
Unit 4
19 pages
DBSCAN - Introduction in Machine Learning.
No ratings yet
DBSCAN - Introduction in Machine Learning.
3 pages
ML-Lab Programs - VTU
No ratings yet
ML-Lab Programs - VTU
5 pages
Performance Metrics
No ratings yet
Performance Metrics
16 pages
AI-AG-Day-2-28th Feb 2023
No ratings yet
AI-AG-Day-2-28th Feb 2023
44 pages
Unit-V Clustering Part 1
No ratings yet
Unit-V Clustering Part 1
26 pages
AI Chapter 3 Part 5
No ratings yet
AI Chapter 3 Part 5
30 pages
Datamining-Lect5 - Clustering. The K-Means Algorithm. Hierarchical Clustering. The DBSCAN Algorithm. Clustering Evaluation
No ratings yet
Datamining-Lect5 - Clustering. The K-Means Algorithm. Hierarchical Clustering. The DBSCAN Algorithm. Clustering Evaluation
110 pages
Lecture8 Unsupervised Learning
No ratings yet
Lecture8 Unsupervised Learning
58 pages
UNIT-6 DBSCAN Clustering
No ratings yet
UNIT-6 DBSCAN Clustering
6 pages
Silhouette (Clustering) : Method
No ratings yet
Silhouette (Clustering) : Method
7 pages
K-Means Clustering Steps Explained
No ratings yet
K-Means Clustering Steps Explained
54 pages
Machine Learning Unit-4
No ratings yet
Machine Learning Unit-4
24 pages
Dbscan
No ratings yet
Dbscan
18 pages
Chapter 5 Clustering
No ratings yet
Chapter 5 Clustering
40 pages
Lecture 1 (UNIT 1)
No ratings yet
Lecture 1 (UNIT 1)
68 pages
Neural Network Clustering Guide
No ratings yet
Neural Network Clustering Guide
168 pages
Unit 6
No ratings yet
Unit 6
26 pages
Effective Group Discussions
No ratings yet
Effective Group Discussions
11 pages
Fuzzy Logic and Ant Colony Optimization For Resource Allocation in Cloud Computing
No ratings yet
Fuzzy Logic and Ant Colony Optimization For Resource Allocation in Cloud Computing
8 pages
JEE (Main) AcknowledgementPage
No ratings yet
JEE (Main) AcknowledgementPage
1 page
HALOALKANE
No ratings yet
HALOALKANE
16 pages
BCA Semester VI Data Mining Module 4 (Presentation Kind of N
No ratings yet
BCA Semester VI Data Mining Module 4 (Presentation Kind of N
56 pages
Debrah Artificial Intelligence Building
No ratings yet
Debrah Artificial Intelligence Building
64 pages
2009 - Clustering Techniques For Financial Diversification
No ratings yet
2009 - Clustering Techniques For Financial Diversification
6 pages
Artificial Intelligence As A General-Purpose Technology - Mapping Innovation Diffusion Via Patent Ana (1) 2
No ratings yet
Artificial Intelligence As A General-Purpose Technology - Mapping Innovation Diffusion Via Patent Ana (1) 2
51 pages
Machine Learning (BCS-055) QUS & ANS
No ratings yet
Machine Learning (BCS-055) QUS & ANS
29 pages
Data Prep for Maize Yield Model
No ratings yet
Data Prep for Maize Yield Model
7 pages
Unit 3-Fuzzy Clustering
No ratings yet
Unit 3-Fuzzy Clustering
34 pages
BigData ML
No ratings yet
BigData ML
10 pages
Exploring Potential of State-of-the-Art Speaker Diarization Frameworks For Multilingual Multi-Speaker Conversational Audio
No ratings yet
Exploring Potential of State-of-the-Art Speaker Diarization Frameworks For Multilingual Multi-Speaker Conversational Audio
6 pages
Key Trends in Indian Bank Analytics
100% (1)
Key Trends in Indian Bank Analytics
41 pages
Soccerment TheClusteringProject ENG 20220615 PDF
100% (1)
Soccerment TheClusteringProject ENG 20220615 PDF
158 pages
AI for Deforestation Detection
No ratings yet
AI for Deforestation Detection
3 pages
CSIT Student Project Report
No ratings yet
CSIT Student Project Report
45 pages
Birchwood MDS Brochure (2) - Min
No ratings yet
Birchwood MDS Brochure (2) - Min
18 pages
Chapter 4 Group Technology Algorithms
No ratings yet
Chapter 4 Group Technology Algorithms
30 pages
A Benchmark of Batch-Effect Correction Methods For Single-Cell RNA Sequencing Data
No ratings yet
A Benchmark of Batch-Effect Correction Methods For Single-Cell RNA Sequencing Data
32 pages
Research On Segmenting E-Commerce Customer Through An Improved K-Medoids Clustering Algorithm
No ratings yet
Research On Segmenting E-Commerce Customer Through An Improved K-Medoids Clustering Algorithm
10 pages
Medical Image Segmentation A Review of Recent Tech
No ratings yet
Medical Image Segmentation A Review of Recent Tech
12 pages
Is Zc415 (Data Mining BITS-WILP)
No ratings yet
Is Zc415 (Data Mining BITS-WILP)
4 pages
Data Warehousing & Mining Course Plan
No ratings yet
Data Warehousing & Mining Course Plan
3 pages
Data Analytics Complete Notes
No ratings yet
Data Analytics Complete Notes
33 pages
Irs Unit III
No ratings yet
Irs Unit III
74 pages
K Metoids
No ratings yet
K Metoids
18 pages
Machine Learning Algorithms in Bipedal Robot Control
No ratings yet
Machine Learning Algorithms in Bipedal Robot Control
16 pages
Lecture 1
100% (1)
Lecture 1
43 pages
DNA and RNA Structure FAQs
No ratings yet
DNA and RNA Structure FAQs
11,493 pages
Data Warehousing & Data Mining Syllabus Subject Code:56055 L:4 T/P/D:0 Credits:4 Int. Marks:25 Ext. Marks:75 Total Marks:100
No ratings yet
Data Warehousing & Data Mining Syllabus Subject Code:56055 L:4 T/P/D:0 Credits:4 Int. Marks:25 Ext. Marks:75 Total Marks:100
52 pages
Ai Pre-Board - 1 Examination
No ratings yet
Ai Pre-Board - 1 Examination
4 pages
BDS 1101-BSD 3101-Cat 1
No ratings yet
BDS 1101-BSD 3101-Cat 1
2 pages
Documentation of Mini Project..
No ratings yet
Documentation of Mini Project..
55 pages

UNIT 3-Clustering Metrics

Uploaded by

UNIT 3-Clustering Metrics

Uploaded by

Clustering metrics

For point A2:

For point A3:

For point A4:

Point A3 and A4 are lying cluster C2 , so for computing Silhouette Score

Silhouette Score/Coefficient for overall clustering problem is:

• The Davies-Bouldin index (DBI) is a metric for assessing the separation

• Cluster 1: {Data Point 1, Data Point 2} Cluster 2: {Data Point 3, Data

• The centroids of these clusters are approximately:

• Centroid of Cluster 1: (2.5, 2.5) Centroid of Cluster 2: (12, 10.5)

• Now, let's calculate the pairwise distances between the centroids of

• ARI = [ (0.799 - 0.109) ] / [ 1 - 0.109 ] ≈ 0.691 / 0.891 ≈ 0.775

Cluster-1 (C=1) Cluster-2 (C=2)

Class-1 (Y=1) Class-2 (Y=2) Class-3 (Y=3)

This is calculated for the entire dataset and can be

This will be calculated every time the clustering

Mutual Information tells us the reduction in the

The NMI is therefore,

•labels_true:<int array, shape = [n_samples]>: It accepts the

sklearn.metrics.v_measure_score(labels_true, labels_pred, *, beta=1.0)

v = (1 + beta) * homogeneity * completeness

FMI = TP / sqrt((TP + FP) * (TP + FN))

• AMI(U, V) = [MI(U, V) - E(MI(U, V))] / [avg(H(U), H(V)) - E(MI(U, V))]

You might also like