Clustering
Vivek Saxena

[Introductory figure: overview of data mining tasks (clustering, predictive modeling, association rules, anomaly detection), illustrated on the example data set below.]

Tid | Refund | Marital Status | Taxable Income | Cheat
1   | Yes | Single   | 125K | No
2   | No  | Married  | 100K | No
3   | No  | Single   | 70K  | No
4   | Yes | Married  | 120K | No
5   | No  | Divorced | 95K  | Yes
6   | No  | Married  | 60K  | No
7   | Yes | Divorced | 220K | No
8   | No  | Single   | 85K  | Yes
9   | No  | Married  | 75K  | No
10  | No  | Single   | 90K  | Yes
11  | No  | Married  | 60K  | No
12  | Yes | Divorced | 220K | No
13  | No  | Single   | 85K  | Yes
14  | No  | Married  | 75K  | No
15  | No  | Single   | 90K  | Yes
What is not Cluster Analysis?
• Simple segmentation
– Dividing students into different registration groups alphabetically, by last name
• Results of a query
– Groupings are a result of an external specification
– Clustering is a grouping of objects based on the data
• Supervised classification
– Have class label information
• Association Analysis
– Local vs. global connections
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or
unrelated to) the objects in other groups
[Figure: two groups of points; intra-cluster distances are minimized, inter-cluster distances are maximized.]
Applications?
• Biology
– Creating a mathematical taxonomy (hierarchical classification) of all living things: kingdom, phylum, class, order, family, genus, species.
• Information retrieval
– Grouping the web pages returned by a search engine in response to a query.
• Climate
– Finding patterns in the atmospheric pressure of polar regions and in areas of the ocean that have a significant impact on land climate.
[Figure: 20 example points and different possible ways of clustering them.]
A data set can be viewed as a collection of data objects. Data objects are described by a number of attributes that capture the basic characteristics of the object. It is important to identify the type of each attribute in order to check that it is consistent with a given ML/DM technique. On the basis of their attributes, data are classified into two types:
Quantitative Data
❏ Quantitative data is statistical and is typically structured, that is, defined and more rigid.
❏ It is represented by numbers and has most of the properties of numbers.
❏ It can be integer-valued or continuous, and is more suitable for data analysis.
❏ Interval and ratio attributes are collectively referred to as quantitative data.
❏ Interval attributes are those for which the differences between values are meaningful. Examples: calendar dates, exam grades.
❏ Ratio attributes are those for which both differences and ratios are meaningful. Examples: length of a table, electrical current.
Attribute Type | Description | Examples | Operations
Nominal (categorical) | Values are just different names | zip codes, employee ID numbers | mode, entropy
Characteristics of Data Sets
As the field of data mining develops, a great variety of data sets becomes available. We begin with three main kinds of data set that cover many of those used in practice.
• Record Data: The data set is a collection of records, each of which has a fixed set of attributes. It is stored either in flat files or in a relational database; the database is the more convenient place to find records. Example: market basket data.
• Graph-Based Data: It is often more powerful and convenient to represent data in the form of a graph. Graphs capture the relationships among data objects, and the data objects themselves are represented as nodes. Examples: web pages, road networks.
• Ordered Data: Data in which the relationships between attributes involve an order in time or space. There are various types of ordered data, such as sequential data, genetic sequence data, time series, and spatial data.
Record Data
• If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional space,
where each dimension represents a distinct attribute
• Document data is one example: each document becomes a term vector of word counts.

           | team | coach | play | ball | score | game | win | lost | timeout | season
Document 1 |  3   |   0   |  5   |  0   |   2   |  6   |  0  |  2   |    0    |   2
Document 2 |  0   |   7   |  0   |  2   |   1   |  0   |  0  |  3   |    0    |   0
Document 3 |  0   |   1   |  0   |  0   |   1   |  2   |  2  |  0   |    3    |   0
Transaction Data
• A special type of data, where
– Each transaction involves a set of items.
– For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitute a transaction, while
the individual products that were purchased are the items.
– One can represent transaction data as record data (see the sketch after the table below).
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
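As a rough sketch of that last point, the transaction table above can be flattened into a binary record-data matrix: one row per transaction, one 0/1 column per item. pandas is assumed here purely for convenience.

```python
# Sketch: converting the transaction table above into a binary (one-hot)
# record-data representation. Item names follow the table.
import pandas as pd

transactions = {
    1: ["Bread", "Coke", "Milk"],
    2: ["Beer", "Bread"],
    3: ["Beer", "Coke", "Diaper", "Milk"],
    4: ["Beer", "Bread", "Diaper", "Milk"],
    5: ["Coke", "Diaper", "Milk"],
}

items = sorted({item for basket in transactions.values() for item in basket})
records = pd.DataFrame(
    [[int(item in basket) for item in items] for basket in transactions.values()],
    index=transactions.keys(), columns=items,
)
print(records)  # each row is a transaction, each column a 0/1 item indicator
```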
Graph Data
[Figure: a small example graph with nodes and weighted edges.]
[Figure: average monthly temperature of land and ocean, an example of ordered (spatio-temporal) data.]
Data Quality
Measurement Quality Assessment
Precision: the closeness of repeated measurements (of the same quantity) to one another.
Duplicate Data: A data set may contain data objects that are duplicates of one another. Duplicate data unnecessarily increases the number of data objects in the data set. Deduplication is the process that deals with these issues. To detect and eliminate such duplicates we may proceed as follows:
1. If two data objects actually represent a single object, the values of the corresponding attributes may differ, and these inconsistent values must be resolved.
2. Do not combine data objects that are similar but not duplicates, such as two distinct people with identical names.
General Issues Related to the Application
Timeliness: This is one of the most important issues. As soon as data is collected it starts to age, and after some time it can be very difficult to work with, as in the case of web browsing data. If the data is out of date, then so are the models and patterns that are based on it.
Relevance: The available data must contain the information necessary for the application. Make sure the data objects are relevant; a common problem is sampling bias. Help from a domain expert is essential.
Data Preprocessing Techniques
1. Data preprocessing is applied to resolve some of the issues in a data set and to provide data that is more suitable for analysis.
2. The overall goal of data preprocessing is to improve the subsequent pattern analysis with respect to time, cost, and quality.
3. There are a number of strategies and techniques that are interrelated in complex ways. We discuss some of the main ideas and approaches in a way that makes it easy to point out the relationships among them.
Dimensionality Reduction
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
• Techniques (a small PCA sketch follows below):
– Principal Component Analysis (PCA)
– Singular Value Decomposition (SVD)
– Others: supervised and non-linear techniques
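A minimal sketch of the first listed technique, assuming scikit-learn is available; the data here is synthetic and only illustrates reducing many attributes to a few components.

```python
# Sketch: reducing a synthetic high-dimensional data set to 2 principal
# components before clustering or other analysis.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # 200 objects, 50 attributes

pca = PCA(n_components=2)               # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # fraction of variance captured per component
```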
Measures of Similarity
● Similarity and dissimilarity are important because they are used by a number of data mining techniques, such as clustering, nearest-neighbor classification, and anomaly detection.
● In many cases the data is first transformed into a similarity (proximity) space, and the analysis is then performed in that space.
● The term proximity is used to refer to either similarity or dissimilarity.
● This includes measures such as correlation and Euclidean distance, which are useful for dense data such as time series or two-dimensional points, as well as the Jaccard and cosine similarity measures, which are useful for sparse data.
Similarity
● A numerical measure of the degree to which two objects are alike.
● It is higher for pairs of objects that are more alike.
● It is non-negative and is often between 0 (no similarity) and 1 (completely similar).
Dissimilarity
● A numerical measure of the degree to which two objects are different.
● It is lower when the objects are more alike.
● The minimum dissimilarity is 0, while the maximum dissimilarity varies.
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity
between two objects x and y, with respect to a single, simple
attribute.
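The per-attribute definitions behind such a table are, in the usual formulation, roughly the following. This is a hedged sketch; in particular, the similarity used for interval/ratio attributes (here 1/(1+d)) is only one common choice.

```python
# Sketch of standard proximity definitions for a single simple attribute.
def proximity(x, y, kind, n_levels=None):
    """Return (similarity, dissimilarity) for one simple attribute."""
    if kind == "nominal":
        d = 0.0 if x == y else 1.0
        return 1.0 - d, d
    if kind == "ordinal":                      # values mapped to 0 .. n_levels-1
        d = abs(x - y) / (n_levels - 1)
        return 1.0 - d, d
    if kind in ("interval", "ratio"):
        d = abs(x - y)
        return 1.0 / (1.0 + d), d              # one common similarity choice
    raise ValueError(kind)

print(proximity("red", "blue", "nominal"))     # (0.0, 1.0)
print(proximity(1, 3, "ordinal", n_levels=5))  # (0.5, 0.5)
print(proximity(2.0, 5.0, "interval"))         # (0.25, 3.0)
```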
Euclidean Distance

point | x | y
p1    | 0 | 2
p2    | 2 | 0
p3    | 3 | 1
p4    | 5 | 1

Distance Matrix:
   | p1    | p2    | p3    | p4
p1 | 0     | 2.828 | 3.162 | 5.099
p2 | 2.828 | 0     | 1.414 | 3.162
p3 | 3.162 | 1.414 | 0     | 2
p4 | 5.099 | 3.162 | 2     | 0
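A small sketch reproducing the distance matrix above from the point coordinates, using the usual formula dist(x, y) = sqrt(sum_k (x_k - y_k)^2); scipy is assumed only for convenience.

```python
# Sketch verifying the Euclidean distance matrix for p1 .. p4.
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4
D = cdist(points, points, metric="euclidean")
print(np.round(D, 3))
# rows/cols correspond to p1..p4, e.g. dist(p1, p2) = 2.828 as in the table above
```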
Minkowski Distance
• r = 1: city block (Manhattan, L1 norm) distance.
• r = 2: Euclidean (L2 norm) distance.
• r → ∞: "supremum" (Lmax norm, L∞ norm) distance. This is the maximum difference between any component of the vectors.
• Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
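A short sketch of the Minkowski family, dist(x, y) = (sum_k |x_k - y_k|^r)^(1/r), evaluated for r = 1, r = 2 and the r → ∞ limit; p1 and p4 from the Euclidean example are reused as the two vectors.

```python
# Sketch: Minkowski distances for r = 1, 2 and the supremum (Chebyshev) limit.
import numpy as np
from scipy.spatial import distance

x, y = np.array([0, 2]), np.array([5, 1])
print(distance.minkowski(x, y, p=1))   # 6.0   (L1, city block)
print(distance.minkowski(x, y, p=2))   # 5.099 (L2, Euclidean)
print(distance.chebyshev(x, y))        # 5.0   (L_max, supremum)
```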
Mahalanobis Distance

mahalanobis(x, y) = (x - y)^T Sigma^{-1} (x - y)

where Sigma is the covariance matrix of the input data.
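A hedged sketch of the formula above in numpy; the sample X and its covariance matrix are made up purely for illustration.

```python
# Sketch: Mahalanobis distance using the covariance matrix estimated from data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[0.3, 0.2], [0.2, 0.3]], size=500)

Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x, y):
    d = x - y
    return d @ Sigma_inv @ d            # (x - y)^T Sigma^{-1} (x - y), as defined above

print(mahalanobis(np.array([0.5, 0.5]), np.array([0.0, 0.0])))
```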
Similarity Between Binary Vectors: SMC versus Jaccard
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1
f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)
SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
= (0+7) / (2+1+0+7) = 0.7
J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
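A small sketch reproducing the SMC and Jaccard values computed above.

```python
# Sketch: Simple Matching Coefficient and Jaccard for the two binary vectors.
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 0.7
jaccard = f11 / (f01 + f10 + f11)             # 0.0
print(smc, jaccard)
```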
Cosine Similarity
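The usual definition is cos(x, y) = (x . y) / (||x|| ||y||). A small sketch on two document-style term-count vectors; the values are illustrative only.

```python
# Sketch: cosine similarity between two term-count vectors.
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # ~0.3150
```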
Visually Evaluating Correlation
[Figure: scatter plots showing data with correlation values ranging from -1 to 1.]
Drawback of Correlation (non-linear cases)
• y_i = x_i^2
• mean(x) = 0, mean(y) = 4
• std(x) = 2.16, std(y) = 3.74
• The correlation between x and y is 0, even though y is completely determined by x.
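A small sketch of this drawback; the x values below are an assumption, chosen to be consistent with the means and standard deviations quoted above.

```python
# Sketch: y = x^2 is perfectly determined by x, yet their linear correlation is 0.
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2
print(x.mean(), y.mean())              # 0.0  4.0
print(x.std(ddof=1), y.std(ddof=1))    # ~2.16  ~3.74
print(np.corrcoef(x, y)[0, 1])         # 0.0
```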
Mode
The mode is the value that appears the most times in a data set, i.e., the value with the highest frequency. A data set may have no mode at all, if every value appears the same number of times. If two values tie for the highest frequency the data set is bimodal, if three values tie it is trimodal, and in general a data set with n modes is multimodal.
Measures of Spread/Dispersion
Measures of spread describe the variability within your data.
Standard Deviation
● The standard deviation measures the average distance between each value and the mean.
● It shows how the data is spread out around the mean.
● A low standard deviation indicates that the data points tend to be close to the mean of the data set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
Mean Deviation / Mean Absolute Deviation
It is the average of the absolute differences between each value in a set of values and the mean of all the values in that set.
Variance
The variance is the average of the squared distances between each value and the mean; equivalently, it is the square of the standard deviation.
Range
The range is one of the simplest descriptive statistics: it is the difference between the lowest and the highest value.
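A short sketch computing these four measures for a small made-up sample.

```python
# Sketch: standard deviation, variance, mean absolute deviation, and range.
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()
std = data.std()                          # standard deviation (population form)
variance = data.var()                     # square of the standard deviation
mad = np.abs(data - mean).mean()          # mean absolute deviation
value_range = data.max() - data.min()     # range
print(mean, std, variance, mad, value_range)   # 5.0 2.0 4.0 1.5 7
```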
Skewness
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. Its value may be positive, negative, or undefined. The tail of the curve depends on the skewness of the distribution: when the distribution is left-skewed, the tail on the curve's left-hand side is longer and the mean is less than the mode (this is also called negative skewness); when the distribution is right-skewed, the opposite holds.
Types of Clusterings
• Partitional clustering
– A division of the data objects into non-overlapping subsets (clusters).
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree.
[Figures: a partitional clustering of four example points, and a hierarchical clustering of the same points with its dendrogram.]
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or conceptual clusters
• Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster
than to any point not in the cluster.
3 well-separated clusters
Types of Clusters: Center-Based
• Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to
the center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most
“representative” point of a cluster
4 center-based clusters
Types of Clusters: Contiguity-Based
• Contiguous clusters (nearest neighbor or transitive)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
8 contiguous clusters
Types of Clusters: Density-Based
• Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and
when noise and outliers are present.
6 density-based clusters
Types of Clusters: Conceptual Clusters
• Shared-property or conceptual clusters
– A cluster is a set of points that share some common property or represent a particular concept.
2 overlapping circles
Types of Clusters: Objective Function
• Hierarchical clustering
• Density-based clustering
K-means Clustering
• Basic algorithm: select K points as the initial centroids; repeat (form K clusters by assigning each point to its closest centroid, then recompute the centroid of each cluster) until the centroids do not change.
[Figures: K-means with K = 3 on example 2-D data, showing the cluster assignments over successive iterations.]
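A minimal sketch of the algorithm just described, assuming scikit-learn; the blob data is synthetic, and init="random" mimics the random initial centroids shown in the figures.

```python
# Sketch: K-means with K=3 on synthetic 2-D points.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.cluster_centers_)   # final centroids
print(km.inertia_)           # SSE of the resulting clustering
```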
K-means Clustering – Details
[Figures: original points and the K-means iterations that follow from one choice of initial centroids.]
Importance of Choosing Initial Centroids …
[Figures: K-means iterations 1-5 from a different choice of initial centroids on the same data.]
Problems with Selecting Initial Points
• If there are K 'real' clusters, the chance of selecting one centroid from each cluster is small.
– The chance is relatively small when K is large.
– If the clusters are the same size, n, then the probability is K! n^K / (Kn)^K = K!/K^K; for K = 10 this is about 0.00036.
10 Clusters Example
[Figures: K-means iterations (1-4) on a data set containing 10 clusters arranged in pairs, under two different initializations:]
• Starting with two initial centroids in one cluster of each pair of clusters.
• Starting with some pairs of clusters having three initial centroids, while others have only one.
Solutions to the Initial Centroids Problem
• Multiple runs: run several times and select the clustering with minimum SSE.
– Helps, but probability is not on your side.
• Sample the data and use hierarchical clustering to determine initial centroids.
• Select more than K initial centroids and then select among these initial centroids.
– Select the most widely separated ones.
• Postprocessing.
• Generate a larger number of clusters and then perform a hierarchical clustering.
• Bisecting K-means
– Not as susceptible to initialization issues.
K-means++
• A smarter way to choose the initial centroids: pick the first centroid at random, then pick each subsequent centroid from the remaining points with probability proportional to its squared distance from the closest centroid already chosen.
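A one-line way to use this initialization, assuming scikit-learn (where it is the init="k-means++" option).

```python
# Sketch: K-means with k-means++ initialization on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=10, random_state=0)
km = KMeans(n_clusters=10, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)   # spread-out initial centroids typically avoid bad local minima
```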
[Figure: a one-dimensional example in which one of the K-means clusters ends up empty.]
Handling Empty Clusters
• Basic K-means algorithm can yield empty clusters
• Several strategies
– Choose the point that is farthest from current centroid
– Choose a point from the cluster with the highest SSE
– If there are several empty clusters, the above can be
repeated several times.
Updating Centers Incrementally
• In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid. An alternative is to update the centroids incrementally, after each individual assignment.
• Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high
SSE
– Merge clusters that are ‘close’ and that have relatively
low SSE
Bisecting K-means
• A variant of K-means that can produce a partitional or a hierarchical clustering: start with one all-inclusive cluster and repeatedly split a chosen cluster (for example, the one with the largest SSE) into two using basic 2-means, until K clusters are obtained.
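A hedged sketch of this splitting loop; always splitting the cluster with the largest SSE is one common variant. scikit-learn's 2-means is reused for each split, and the data is synthetic.

```python
# Sketch: bisecting K-means by repeatedly 2-means-splitting the worst cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def bisecting_kmeans(X, k):
    clusters = [np.arange(len(X))]                 # list of index arrays into X
    while len(clusters) < k:
        sses = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sses)))   # split the cluster with largest SSE
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for c in bisecting_kmeans(X, 4):
    print(len(c))   # sizes of the 4 resulting clusters
```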
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree, which can be visualized as a dendrogram.
[Figure: six example points and the corresponding dendrogram.]
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
Two Main Types of Hierarchical Clustering
– Agglomerative:
• Start with the points as individual clusters.
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
– Divisive:
• Start with one, all-inclusive cluster.
• At each step, split a cluster until each cluster contains an individual point (or there are k clusters).
Intermediate Situation
• After some merging steps, we have some clusters (C1 ... C5) and a proximity matrix with one row/column per current cluster.
[Figure: the current clusters C1-C5, their proximity matrix, and the underlying points p1 ... p12.]
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1-C5 with C2 and C5 about to be merged, and the corresponding proximity matrix.]
After Merging
• The question is: "How do we update the proximity matrix?" The rows and columns for C2 and C5 are replaced by a single row/column for the merged cluster C2 U C5, whose proximities to C1, C3, and C4 still have to be defined.
[Figure: the updated proximity matrix with the merged cluster C2 U C5; its new entries are marked "?".]
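A minimal sketch of this agglomerative procedure using scipy, whose linkage function performs exactly this merge-and-update loop; the 12 random points stand in for p1 ... p12, and the linkage method selects how merged-cluster proximities are recomputed (the definitions discussed next).

```python
# Sketch: agglomerative hierarchical clustering with scipy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))                      # 12 points like p1 ... p12

Z = linkage(X, method="single")                   # MIN / single link merges
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) draws the hierarchy (needs matplotlib)
```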
How to Define Inter-Cluster Similarity
[Figure: proximity matrix over points p1 ... p5; the highlighted entries differ depending on the chosen definition.]
• MIN (single link)
• MAX (complete link)
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward's Method uses squared error
MIN or Single Link
• Proximity of two clusters is based on the two
closest points in the different clusters
– Determined by one pair of points, i.e., by one link in the
proximity graph
• Example:
[Figure: distance matrix for six example points, used in the clustering figures below.]
Hierarchical Clustering: MIN
[Figure: the original points, the two nested single-link (MIN) clusters, and the corresponding dendrogram.]
Hierarchical Clustering: MAX
• Proximity of two clusters is based on the two most distant points in the different clusters.
[Figure: the original points, the nested complete-link (MAX) clusters, and the corresponding dendrogram.]
Hierarchical Clustering: Group Average
• Proximity of two clusters is the average of the pairwise proximities between all points in the two clusters:
proximity(Cluster_i, Cluster_j) = ( Σ_{p_i in Cluster_i, p_j in Cluster_j} proximity(p_i, p_j) ) / ( |Cluster_i| × |Cluster_j| )
• Need to use average connectivity for scalability, since total proximity favors large clusters.
[Figure: the original points, the nested group-average clusters, and the corresponding dendrogram.]
• Strengths
– Less susceptible to noise and outliers
• Limitations
– Biased towards globular clusters
Cluster Similarity: Ward’s Method
• Similarity of two clusters is based on the increase in squared
error when two clusters are merged
– Similar to group average if distance between points is distance squared
[Figure: comparison of the clusterings produced by Ward's Method and Group Average on the same example points.]
MST: Divisive Hierarchical Clustering
• Build MST (Minimum Spanning Tree)
– Start with a tree that consists of any point
– In successive steps, look for the closest pair of points (p, q) such that one point
(p) is in the current tree but the other (q) is not
– Add q to the tree and put an edge between p and q
MST: Divisive Hierarchical Clustering
• Use MST for constructing hierarchy of clusters
Hierarchical Clustering: Time and Space Requirements
• O(N^2) space, since it uses the proximity matrix (N is the number of points).
• O(N^3) time in many cases: there are N steps, and at each step a proximity matrix of size N^2 must be searched and updated.
DBSCAN
• DBSCAN is a density-based algorithm. Density is the number of points within a specified radius (Eps).
– A core point has at least a specified number of points (MinPts) within Eps.
– A border point is not a core point, but is in the neighborhood of a core point.
– A noise point is any point that is not a core point or a border point.
[Figure: DBSCAN core (point A), border, and noise points, with MinPts = 7.]
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
DBSCAN: Core, Border and Noise Points
[Figure: an example point set with its core, border, and noise points identified.]
When DBSCAN Works Well
• Resistant to noise
• Can handle clusters of different shapes and sizes
DBSCAN: Determining EPS and MinPts
• The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance.
• Noise points have their kth nearest neighbor at a farther distance.
• So, plot the sorted distance of every point to its kth nearest neighbor and pick Eps where the curve rises sharply (see the sketch below).
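A hedged sketch of this procedure, assuming scikit-learn; the knee of the sorted k-distance curve is normally read off a plot, so the quantile used below is only a stand-in.

```python
# Sketch: choose Eps from k-th nearest-neighbor distances, then run DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
k = 4

dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
kth = np.sort(dists[:, -1])          # distance of each point to its k-th neighbor, sorted
eps = kth[int(0.95 * len(kth))]      # crude stand-in for reading the knee off the plot

labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X)
print(set(labels))                   # cluster ids; -1 marks noise points
```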
When DBSCAN Does NOT Work Well
[Figure: original points, and DBSCAN results with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92).]
• Varying densities
• High-dimensional data
Hard (Crisp) vs Soft (Fuzzy) Clustering
• For soft clustering, allow a point to belong to more than one cluster.
• For K-means, generalize the objective function:
SSE = Σ_{j=1..k} Σ_{i=1..m} w_ij · dist(x_i, c_j)^2, with the restriction Σ_{j=1..k} w_ij = 1 for every point x_i (k is the number of clusters).
• Example: for a single point x at position 2 and two centroids at 1 and 5,
SSE(x) = w_x1 (2 - 1)^2 + w_x2 (5 - 2)^2 = w_x1 + 9 w_x2.
Fuzzy C-Means
• There will be no empty cluster and no cluster will contain all of the points.
• Objective function: SSE = Σ_{j=1..k} Σ_{i=1..m} w_ij^p · dist(x_i, c_j)^2, with Σ_{j=1..k} w_ij = 1.
• w_ij: the weight with which object x_i belongs to cluster c_j.
• p: the fuzzifier (p > 1, typically 2 or more); it is a power for the weight, not a superscript, and it controls how "fuzzy" the clustering is.
• For the same one-point example: SSE(x) = w_x1^p (2 - 1)^2 + w_x2^p (5 - 2)^2.
[Figures: example clusterings of 2-D points in the unit square, including K-means and complete-link results.]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing
whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g.,
to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without
reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine
which is better.
5. Determining the ‘correct’ number of clusters.
Using Similarity Matrix for Cluster Validation
• Order the similarity matrix with respect to the cluster labels and inspect it visually; well-separated clusters produce sharp diagonal blocks.
[Figure: points forming three well-separated clusters and the corresponding reordered similarity matrix.]
• Clusters in random data are not so crisp.
[Figures: reordered similarity matrices for clusterings of random data produced by DBSCAN, K-means, and complete link.]
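A small sketch of the technique, assuming scikit-learn: compute a clustering, reorder the points by cluster label, and inspect the resulting similarity matrix; well-separated clusters appear as bright diagonal blocks.

```python
# Sketch: cluster-ordered similarity matrix for visual cluster validation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

order = np.argsort(labels)            # group points by their cluster label
D = pairwise_distances(X[order])
S = 1 - D / D.max()                   # turn distances into similarities in [0, 1]
print(S.shape)
# import matplotlib.pyplot as plt; plt.imshow(S); plt.show()  # block structure
```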
Internal Measures: SSE
• Clusters in more complicated figures aren't well separated.
• Internal index: used to measure the goodness of a clustering structure without reference to external information.
– SSE
• SSE is good for comparing two clusterings or two clusters (average SSE).
• Can also be used to estimate the number of clusters (see the sketch below).
[Figure: an example data set and the corresponding SSE versus number of clusters (K) curve.]
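A short sketch of using SSE (scikit-learn's inertia_) to estimate the number of clusters on synthetic data with 5 true clusters; the SSE drops sharply until K reaches the true number and then flattens.

```python
# Sketch: SSE versus K ("elbow" curve) for estimating the number of clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)
for k in range(1, 11):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))
```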
Internal Measures: SSE
• SSE curve for a more complicated data set.
[Figures: the data set and its SSE curve; a histogram of counts versus SSE.]
Statistical Framework for Correlation
[Figure: two example data sets used to illustrate the statistical framework for correlation.]
Internal Measures: Cohesion and Separation
• Cluster cohesion can be measured by the within-cluster sum of squares (SSE):
WSS = Σ_i Σ_{x in C_i} (x - m_i)^2
• Cluster separation is measured by the between-cluster sum of squares:
BSS = Σ_i |C_i| (m - m_i)^2
where m is the overall mean and m_i is the mean of cluster C_i.
• Example (points 1, 2, 4, 5 on a line, overall mean m = 3):
– K = 1 cluster: WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10, BSS = 4 × (3-3)^2 = 0, total = 10.
– K = 2 clusters {1, 2} and {4, 5} with means m1 = 1.5 and m2 = 4.5: WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1, BSS = 2 × (3-1.5)^2 + 2 × (4.5-3)^2 = 9, total = 10.
– BSS + WSS = constant (the total sum of squares).
[Figure: the four points with the overall mean m and the cluster means m1, m2 marked, illustrating cohesion and separation.]
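A small sketch verifying the example above, in particular that BSS + WSS stays constant.

```python
# Sketch: WSS (cohesion) and BSS (separation) for the 1-D example.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                                        # overall mean = 3

def wss_bss(clusters):
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (c.mean() - m) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([x]))                # K=1: WSS=10, BSS=0
print(wss_bss([x[:2], x[2:]]))     # K=2: WSS=1,  BSS=9  (total stays 10)
```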
Internal Measures: Silhouette Coefficient
• The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as for clusters and clusterings.
• For an individual point i:
– Calculate a = average distance of i to the points in its own cluster.
– Calculate b = min (average distance of i to the points in another cluster), taken over the other clusters.
– The silhouette coefficient for the point is then s = (b - a) / max(a, b).
– It is typically between 0 and 1; the closer to 1, the better.
[Figure: point i with the distances used to calculate a (within its cluster) and b (to another cluster).]
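A minimal sketch, assuming scikit-learn, of computing the silhouette coefficient per point and averaged over a clustering.

```python
# Sketch: silhouette coefficient for a K-means clustering of synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_samples(X, labels)[:5])   # s = (b - a) / max(a, b) per point
print(silhouette_score(X, labels))         # mean silhouette over all points
```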
CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
scikit-learn clustering documentation: https://scikit-learn.org/stable/modules/clustering.html