
Introduction to Clustering

Vivek Saxena, Sc-F, DTRL, DRDO

Module name : Probability Theory and Pattern Recognition
Topic name : Introduction to Clustering

Online Training & Certification Course on AI & ML
Defence Institute of Advanced Technology
Outline of Presentation
• Data Types, Quality Check, Pre-processing
• Measures of similarity
• K-Means
• Hierarchical clustering
• DBSCAN
• Fuzzy C Mean
• Clustering Evaluation
ML Tasks

(Figure: a record data table with attributes Tid, Refund, Marital Status, Taxable Income and Cheat, surrounded by the main machine-learning tasks that use such data: Clustering, Predictive Modeling, Association Rules and Anomaly Detection.)
What is not Cluster Analysis?

• Simple segmentation
– Dividing students into different registration groups alphabetically, by last name

• Results of a query
– Groupings are a result of an external specification
– Clustering is a grouping of objects based on the data

• Supervised classification
– Have class label information

• Association Analysis
– Local vs. global connections
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or
unrelated to) the objects in other groups
(Figure: intra-cluster distances are minimized, inter-cluster distances are maximized.)
Applications?

• Biology
– Creating a mathematical taxonomy (hierarchical classification) of all living things: kingdom, phylum, class, order, family, genus, species.

• Information retrieval
– Grouping the web pages returned as search results by a search engine

• Climate
– Finding patterns in atmospheric pressure of polar regions and areas of the ocean that have a significant impact on land climate

• Psychology & medicine
– Finding different types of depression, detecting patterns in the spatial/temporal distribution of a disease

• Business
– Current and potential customer segmentation

• Geo-Science
Notion of a Cluster can be Ambiguous

(Figure: the same twenty points grouped in different ways – how many clusters? Two clusters, four clusters, or six clusters.)

Data Types

A data set can be viewed as a collection of data objects. Data objects are described by a number of attributes that capture the basic characteristics of the object. It is important to identify the type of each attribute in order to check its consistency with a ML/DM technique. On the basis of their attributes, data are classified into two types:

1. Qualitative or Categorical Data

2. Quantitative or Numeric Data
Qualitative Data

❏ Qualitative data is defined as data that approximates and characterizes.
❏ Nominal and ordinal attributes are collectively known as qualitative attributes.
❏ Nominal values provide only enough information to distinguish one object from another. Examples are zip code, eye color, gender.
❏ Ordinal attributes provide enough information to order the objects.
❏ It is descriptive and collective in nature. It can take binary as well as discrete values.
❏ The transformations that can be applied to these data types are one-to-one mappings and order-preserving changes of values.
Quantitative Data

❏ Quantitative data is statistical and typically structured, meaning defined and more rigid.
❏ It is represented by numbers and has most of the properties of numbers.
❏ It can be integer-valued or continuous, and is more suitable for data analysis.
❏ Interval and ratio attributes are collectively referred to as quantitative data.
❏ Interval attributes are those whose differences between values are meaningful. Examples are calendar dates, exam grades, etc.
❏ Ratio attributes are those for which both differences and ratios are meaningful. Examples are the length of a table, electrical current, etc.
Attribute Type | Description | Examples | Operations

Categorical (Qualitative):
• Nominal – Nominal attribute values only distinguish. (=, ≠). Examples: zip codes, employee ID numbers, eye color, sex: {male, female}. Operations: mode, entropy, contingency correlation, χ² test.
• Ordinal – Ordinal attribute values also order objects. (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests.

Numeric (Quantitative):
• Interval – For interval attributes, differences between values are meaningful. (+, -). Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation, Pearson's correlation, t and F tests.
• Ratio – For ratio variables, both differences and ratios are meaningful. (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current. Operations: geometric mean, harmonic mean, percent variation.

This categorization of attributes is due to S. S. Stevens.


Example:

(Figure: a data set of cars having quantitative as well as qualitative data.)
Characteristics of Data Sets

As the field of data mining develops, a great variety of data sets becomes available. We first proceed with the three main characteristics that apply to many data sets.

❏ Dimensionality: The number of attributes that the objects in the data set possess.
❏ Sparsity: Storing only the non-zero values of an asymmetric-feature data set. It is significant with respect to computation time and storage.
❏ Resolution: Data can be obtained at different resolution levels, with different properties at each level.
Types of Data Sets

• Record Data: The data set is a collection of records, each of which has a fixed set of attributes. It is stored either in flat files or in a relational database; the database is a more convenient place to find records. Example: market basket data.
• Graph-Based Data: It is more powerful and convenient to represent data in the form of a graph. It captures relationships among data objects, and the data objects are represented as graphs. Examples: web pages, road networks.
• Ordered Data: Data in which relationships between attributes involve order in time or space. There are various types of ordered data, such as sequential data, genetic sequences, time series and spatial data.
Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         |  70K           | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       |  95K           | Yes
 6  | No     | Married        |  60K           | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         |  85K           | Yes
 9  | No     | Married        |  75K           | No
10  | No     | Single         |  90K           | Yes
Data Matrix

• If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional space,
where each dimension represents a distinct attribute

• Such a data set can be represented by an m by n matrix, where there


are m rows, one for each object, and n columns, one for each attribute

Projection of x Load | Projection of y Load | Distance | Load | Thickness
10.23                | 5.27                 | 15.22    | 2.7  | 1.2
12.65                | 6.25                 | 16.22    | 2.2  | 1.1
Document Data
• Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the corresponding term
occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0     5     0      2     6     0     2      0       2
Document 2     0     7     0     2      1     0     0     3      0       0
Document 3     0     1     0     0      1     2     2     0      3       0
Transaction Data
• A special type of data, where
– Each transaction involves a set of items.
– For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitute a transaction, while
the individual products that were purchased are the items.
– One can represent transaction data as record data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data

• Examples: a generic graph, a molecule, and web pages

(Figures: a generic graph with weighted edges; the benzene molecule, C6H6; a set of linked web pages.)
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data

(Figure: average monthly temperature of land and ocean.)
Data Quality

1. Data quality is a comparison between the actual state of a particular set of data and a desired state, with the desired data typically being referred to as "fit for use".
2. It describes the usefulness, accuracy, validity, latency, and correctness of data for its application.
3. Preventing data quality problems is not always an option, so the focus should be on:
   ○ The detection and correction of data quality problems.
   ○ The use of algorithms that can tolerate poor data quality.
   ○ Some of the specific aspects of data quality are measurement and data collection issues as well as some application-related issues.
Measurement Quality Assessment

Precision: The closeness of repeated measurements of the same quantity to one another.

Bias: A systematic variation of measurements from the quantity being measured. It is the difference between the mean of the measurements and the true quantity.

Accuracy: The closeness of measurements to the true value of the quantity being measured. It depends on precision and bias.

Outliers: Data objects whose characteristics are different from most of the other data objects in the data set.
Missing Values: Information that is not present in one or more attributes of a data set. Missing values are often encoded as NaN or left blank. In this case probabilities cannot be calculated and analysis may be difficult or impossible. They can be handled in several ways:

1. A simple and effective way is to delete the data object or attribute.
2. Estimate the missing values using the remaining values. If the attribute is continuous, take the average value of the neighbours; otherwise take the most commonly occurring attribute value.
3. Ignore the missing values during the analysis. Many classification schemes can be modified to work with missing values.
Inconsistent Values: Data can contain inconsistent values. It is important to detect inconsistencies and, if possible, correct them. The correction of an inconsistency requires additional or redundant information. For example, a negative value for a person's height is inconsistent.

Duplicate Data: A data set may contain duplicate data objects, which unnecessarily increase the number of data objects in the data set. Deduplication is the process that deals with these issues. To detect and eliminate such duplicates from a dataset we may follow:

1. If two data objects represent a single object, the values of corresponding attributes may differ, and these inconsistent values must be resolved.
2. Do not combine data objects that are similar but not duplicates, such as two distinct people with the same name.
General Issues Related to Application

Timeliness: This is the most important issue. As data is collected it gets older as time passes, and sometimes it becomes very difficult to work with, as in the case of web browsing data. If the data is out of date, then so are the models and patterns that are based on it.

Relevance: The available data must contain the information necessary for the application. Make sure the data objects are relevant. A common problem is sampling bias; the help of a domain expert is essential.

Knowledge about the Data: The documentation of the dataset must be known in all its different aspects. The quality of this documentation can either help or retard the subsequent analysis.
Data Preprocessing

1. Data preprocessing is applied to resolve issues in a data set and provide data that is more suitable for analysis.
2. The overall goal of data preprocessing is to improve pattern analysis with respect to time, cost and quality.
3. There are a number of strategies and techniques that are interrelated in complex ways. We will discuss some of the main ideas and approaches, which makes it easier to point out the relationships among them.
Data Preprocessing Techniques

1. Data Cleaning: The data can have many irrelevant and missing parts. Data cleaning is done to handle this.
Data cleaning involves handling missing data, noisy data, etc.

❏ Missing Data: This situation arises when some data is missing in the data set. It can be handled in various ways, such as ignoring the tuples or filling in the missing values.
❏ Noisy Data: Noisy data is meaningless data that can't be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled by binning methods, regression and clustering.

2. Data Transformation: This step is taken in order to transform the data into forms appropriate for the mining process. This involves:

❏ Normalization: Done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
❏ Attribute Selection: In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
❏ Discretization: Done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
❏ Concept Hierarchy Generation: Attributes are converted from a low level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".

3. Data Reduction: Data mining is a technique used to handle huge amounts of data, and analysis becomes harder when working with such volumes. To get around this, we use data reduction techniques.
Curse of Dimensionality

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies

• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

(Figure: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points, as a function of the number of dimensions.)
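A minimal sketch of the experiment described above, assuming 500 uniform random points and Euclidean distance; it prints the relative gap (max − min)/min of pairwise distances for a few dimensionalities, which shrinks as the dimensionality grows.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (2, 10, 50, 200):
    points = rng.random((500, dim))      # 500 random points in [0, 1]^dim
    d = pdist(points)                    # all pairwise Euclidean distances
    print(dim, (d.max() - d.min()) / d.min())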
Dimensionality Reduction:
Dimensionality reduction is the process of reducing the number of variables under consideration by obtaining a set of principal variables. Approaches can be divided into feature selection and feature extraction.

❏ Feature Selection: Feature selection approaches try to find a subset of the input variables. The three strategies are: the filter strategy (e.g. information gain), the wrapper strategy (e.g. search guided by accuracy), and the embedded strategy (selected features are added or removed while building the model, based on prediction errors).

Data analysis such as regression or classification can often be done more accurately in the reduced space than in the original space.
Dimensionality Reduction
• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data mining
algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise

• Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
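As a brief illustration of the PCA technique listed above, here is a hedged sketch using scikit-learn; the data matrix X is a synthetic stand-in, not from the slides.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).random((100, 10))   # 100 objects, 10 attributes (synthetic)
pca = PCA(n_components=2)                        # keep the 2 principal components
X_reduced = pca.fit_transform(X)                 # project the data onto them
print(X_reduced.shape, pca.explained_variance_ratio_)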
Measures of Similarity

● Similarity and dissimilarity are important because they are used by a number of data mining techniques, such as clustering, nearest neighbor classification, and anomaly detection.
● They transform the data into a similarity space in which analysis is then performed.
● The term proximity is used to refer to either similarity or dissimilarity.
● This includes measures such as correlation and Euclidean distance, which are useful for dense data such as time series or two-dimensional points, as well as the Jaccard and cosine similarity measures, which are useful for sparse data.
Similarity
● A numerical measure of the degree to which two objects are alike.
● It is higher for pairs of objects that are more alike.
● It is non-negative and often falls between 0 (no similarity) and 1 (completely similar).

Dissimilarity
● A numerical measure of the degree to which two objects are different.
● It is lower when objects are more alike.
● The minimum dissimilarity is 0, while the maximum dissimilarity value varies.
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity
between two objects x and y, with respect to a single, simple
attribute.
Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix:
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
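A small sketch reproducing the distance matrix above with SciPy; the point coordinates are taken from the slide.

import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1..p4 from the slide
dist_matrix = cdist(points, points, metric='euclidean')
print(np.round(dist_matrix, 3))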
Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance; its formula is given below.

• Here r is a parameter, n is the number of dimensions (attributes), and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
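The standard form of the distance referred to above, in the notation of the slide:

dist(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} |x_k - y_k|^{r} \right)^{1/r}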
Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L 1 norm) distance.


– A common example of this for binary vectors is the Hamming distance,
which is just the number of bits that are different between two binary
vectors

• r = 2. Euclidean distance

• r → ∞. "supremum" (L_max norm, L_∞ norm) distance.
– This is the maximum difference between any component of the vectors

• Do not confuse r with n, i.e., all these distances are defined for all
numbers of dimensions.
Mahalanobis Distance

mahalanobis(x, y) = (x − y)^T Σ^{-1} (x − y)

where Σ is the covariance matrix.

For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.

Mahalanobis Distance example

Covariance matrix:
Σ = [ 0.3  0.2
      0.2  0.3 ]

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A,B) = 5
Mahal(A,C) = 4
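A quick sketch verifying the example above with NumPy; note that, matching the slide's formula, no square root is taken (the squared form is used).

import numpy as np

cov = np.array([[0.3, 0.2], [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)

def mahalanobis_sq(x, y):
    # (x - y)^T Sigma^{-1} (x - y), as defined on the slide (squared form)
    d = np.asarray(x) - np.asarray(y)
    return float(d @ cov_inv @ d)

A, B, C = (0.5, 0.5), (0, 1), (1.5, 1.5)
print(mahalanobis_sq(A, B))   # 5.0
print(mahalanobis_sq(A, C))   # 4.0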
Similarity Between Binary Vectors

• Common situation is that objects, x and y, have only binary attributes

• Compute similarities using the following quantities


f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)

J = number of 11 matches / number of non-zero attributes


= (f11) / (f01 + f10 + f11)
SMC versus Jaccard: Example

x= 1000000000
y= 0000001001
f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)
SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
= (0+7) / (2+1+0+7) = 0.7
J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
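A small sketch computing both coefficients for the vectors above (plain NumPy, no special library assumed):

import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 0.7
jaccard = f11 / (f01 + f10 + f11)             # 0.0
print(smc, jaccard)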
Cosine Similarity

• If d1 and d2 are two document vectors, then


cos( d1, d2 ) = <d1,d2> / ||d1|| ||d2|| ,
where <d1,d2> indicates inner product or vector dot product
of vectors, d1 and d2, and || d || is the length of vector d.
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
|| d2 || = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
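The same computation as a short sketch:

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos_sim, 4))   # 0.315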
Extended Jaccard Coefficient (Tanimoto)

• Variation of Jaccard for continuous or


count attributes
– Reduces to Jaccard for binary attributes
Correlation measures the linear relationship between objects
Visually Evaluating Correlation

Scatter plots
showing the
similarity from
–1 to 1.
Drawback of Correlation (non-linear cases)

• x = (-3, -2, -1, 0, 1, 2, 3)
• y = (9, 4, 1, 0, 1, 4, 9)

  y_i = x_i^2

• mean(x) = 0, mean(y) = 4
• std(x) = 2.16, std(y) = 3.74

• corr = ((-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5)) / (6 * 2.16 * 3.74) = 0

So the correlation is zero even though y is perfectly (but non-linearly) determined by x.

VIF can be used for a multicollinearity test.
Entropy
• For
– a variable (event), X,
– with n possible values (outcomes), x1, x2 …, xn
– each outcome having probability, p1, p2 …, pn
– the entropy of X , H(X), is given by
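The standard definition referred to above, stated here for completeness:

H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i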

• Entropy is between 0 and log2n and is measured in bits


– Thus, entropy is a measure of how many bits it takes to represent an observation
of X on average
Entropy Examples
• For a coin with probability p of heads and
probability q = 1 – p of tails

– For p= 0.5, q = 0.5 (fair coin) H = 1


– For p = 1 or q = 1, H = 0

• What is the entropy of a fair four-sided die?
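A short sketch answering these examples numerically; the four-sided-die value is included as an illustration, not taken from the original slide.

import numpy as np

def entropy(probs):
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]                  # 0 * log(0) is treated as 0
    return float(-(probs * np.log2(probs)).sum())

print(entropy([0.5, 0.5]))          # fair coin -> 1.0 bit
print(entropy([1.0]))               # certain outcome -> 0.0 bits
print(entropy([0.25] * 4))          # fair four-sided die -> 2.0 bits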


Summary Statistics
Statistics is a branch of mathematics that deals with the collection, organization and interpretation of data.

Descriptive (summary) statistics describe or characterize data in such a way that none of the original information is lost or distorted.

They are useful for some of the main purposes:

1. To provide basic information about variables in a dataset
2. To provide statistical relationships between variables
3. To indicate the error associated with results and graphical output
Types of Summary Statistics
Descriptive statistics is divided into two categories: measures of central tendency and measures of variability (spread).

Measures of Central Tendency

Central tendency refers to the idea that there is one number that best summarizes the entire set of measurements, a number that is in some way "central" to the set. There are three main central tendency measures:
1. Mean (Average)
2. Median
3. Mode
Mean (Average)
The mean or average is a central tendency of the data, i.e. a number around which the whole data set is spread out. In a way, it is a single number that can estimate the value of the whole dataset.

Median
The median is the value which divides the data into two equal parts, i.e. the number of terms on its right side is the same as the number of terms on its left side when the data is arranged in either ascending or descending order.

Mode
The mode is the term appearing the maximum number of times in the dataset, i.e. the term with the highest frequency. There could be a dataset with no mode at all, where all values appear the same number of times. If two terms share the maximum frequency the dataset is bimodal, if three terms share it the dataset is trimodal, and for n modes the dataset is multimodal.
Measures of Spread/Dispersion
Measures of Spread refers to the idea of variability within your data.

Standard Deviation
● Standard deviation is the measurement of average distance between
each quantity and mean.
● It shows how data is spread out from mean.
● A low standard deviation indicates that the data points tend to be close
to the mean of the data set, while a high standard deviation indicates
that the data points are spread out over a wider range of values.

Mean Deviation / Mean Absolute Deviation
The average of the absolute differences between each value in a set of values and the average of all the values of that set.

Variance
Variance is the average of the squared distances between each quantity and the mean; that is, it is the square of the standard deviation.

Range
The range is one of the simplest techniques of descriptive statistics. It is the difference between the lowest and highest value.
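A brief sketch computing these summary statistics for a small sample; the sample itself is illustrative, not from the slides.

import statistics as st
import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]              # illustrative sample

print("mean    ", st.mean(data))
print("median  ", st.median(data))
print("mode    ", st.mode(data))
print("std dev ", st.pstdev(data))           # population standard deviation
print("variance", st.pvariance(data))
print("MAD     ", np.mean(np.abs(np.array(data) - st.mean(data))))
print("range   ", max(data) - min(data))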
Skewness
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. Its value may be positive, negative, or undefined.

The tail of the curve depends on the skewness of the distribution. When the distribution is left-skewed, the tail on the curve's left-hand side is longer and the mean is less than the mode; this situation is also called negative skewness. When the distribution is right-skewed it is vice versa.

There are two methods to calculate the skewness coefficient:

1. Pearson's first coefficient of skewness (mode skewness) = (Mean - Mode) / S.D.

2. Pearson's second coefficient of skewness (median skewness) = 3(Mean - Median) / S.D.

where S.D. = standard deviation
Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed (profusion of outliers) or light-tailed (lack of outliers) relative to a normal distribution. There are three types of kurtosis:

1. Mesokurtic: has kurtosis similar to that of the normal distribution, whose excess kurtosis is zero.
2. Leptokurtic: the curve is more peaked than a mesokurtic curve.
3. Platykurtic: the curve of the distribution is less peaked than a mesokurtic curve; the tails of such a distribution are thinner.

Note: the main difference between skewness and kurtosis is that the first refers to the asymmetry of a distribution, while the second refers to the heaviness of its tails relative to a normal distribution.
Types of Clustering

• A clustering is a set of clusters


• Important distinction between hierarchical and
partitional sets of clusters
• Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset

• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
Partitional Clustering

(Figure: original points and a partitional clustering of them.)

Hierarchical Clustering

(Figures: a traditional hierarchical clustering of points p1–p4 with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram.)


Other Distinctions Between Sets of Clusters

• Exclusive versus non-exclusive


– In non-exclusive clusterings, points may belong to multiple clusters.
– Can represent multiple classes or ‘border’ points
• Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some weight
between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics
• Partial versus complete dataset clustering
– In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities
Types of Clusters
• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density-based clusters

• Property or Conceptual

• Described by an Objective Function


Types of Clusters: Well-Separated

• Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster
than to any point not in the cluster.

3 well-separated clusters
Types of Clusters: Center-Based

• Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to
the center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most
“representative” point of a cluster

4 center-based clusters
Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or


Transitive)
– A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.

8 contiguous clusters
Types of Clusters: Density-Based

• Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and
when noise and outliers are present.

6 density-based clusters
Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual


Clusters
– Finds clusters that share some common property or
represent a particular concept.
.

2 Overlapping Circles
Types of Clusters: Objective Function

• Clusters Defined by an Objective Function


– Finds clusters that minimize or maximize an objective function.
– Enumerate all possible ways of dividing the points into clusters
and evaluate the `goodness' of each potential set of clusters by
using the given objective function. (NP Hard)
– Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
– A variation of the global objective function approach is to fit the
data to a parameterized model.
• Parameters for the model are determined from the data.
Approaches for clustering
Map Clustering Problem to a Different Problem

• Map the clustering problem to a different domain and solve a


related problem in that domain
– Proximity matrix defines a weighted graph, where the nodes are the
points being clustered, and the weighted edges represent the
proximities between points

– Clustering is equivalent to breaking the graph into connected


components, one for each cluster.

– Want to minimize the edge weight between clusters and maximize


the edge weight within clusters
Characteristics of the Input Data Are Important
• Type of proximity or density measure
– Central to clustering
– Depends on data and application

• Data characteristics that affect proximity and/or density are


– Dimensionality
• Sparseness
– Attribute type
– Special relationships in the data
• For example, autocorrelation
– Distribution of the data

• Noise and Outliers


– Often interfere with the operation of the clustering algorithm
Clustering Algorithms
• K-means and its variants

• Hierarchical clustering

• Density-based clustering
K-means Clustering

• Partitional clustering approach


• Number of clusters, K, must be specified a priori
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest
centroid
• The basic algorithm is very simple
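The basic algorithm the slide refers to can be sketched as follows; this is a minimal illustration of the standard iteration (Lloyd's algorithm), not the lecturer's exact code.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]     # 1. pick K initial centroids at random
    for _ in range(n_iters):
        # 2. assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each centroid as the mean of its assigned points
        #    (empty clusters are not handled here; see the later slide on empty clusters)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):            # 4. stop when centroids no longer change
            break
        centroids = new_centroids
    return labels, centroids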
Example of K-means Clustering

(Figures: the cluster assignments and centroid positions at iterations 1 through 6, converging to the final clustering.)
K-means Clustering – Details

• Initial centroids are often chosen randomly.


– Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
• K-means will converge for common similarity measures
mentioned above.
• Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to ‘Until relatively few
points change clusters’
• Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Evaluating K-means Clusters by objective function

• Most common measure is Sum of Squared Error (SSE)
  – For each point, the error is the distance to the nearest cluster centre
  – To get SSE, we square these errors and sum them:

      SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist²(m_i, x)

  – x is a data point in cluster C_i and m_i is the representative point (centre/mean) of cluster C_i
    • One can show that m_i corresponds to the center (mean) of the cluster
  – Given two sets of clusters, we prefer the one with the smallest error
  – One easy way to reduce SSE is to increase K, the number of clusters
    • A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
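In scikit-learn this quantity is exposed directly: after fitting a KMeans model, the attribute inertia_ holds the SSE of the fitted clustering. A brief sketch with synthetic data:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((200, 2))          # synthetic 2-D data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)                                      # sum of squared errors (SSE)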
Two different K-means Clusterings

(Figures: the original points, an optimal clustering, and a sub-optimal clustering produced from a different centroid initialization.)


Limitations of K-means
• K-means has problems when clusters are of
differing
– Sizes
– Densities
– Non-globular shapes

• K-means has problems when the data contains


outliers.
Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)


Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)


Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)


Overcoming K-means Limitations

Original Points K-means Clusters


One solution is to use many clusters and later merge nearby clusters: find parts of clusters, then put them together.
Overcoming K-means Limitations

Original Points K-means Clusters


Overcoming K-means Limitations

Original Points K-means Clusters


Importance of Choosing Initial Centroids

(Figures: for one choice of initial centroids, the cluster assignments and centroid positions at iterations 1 through 6, converging to a good clustering.)
Importance of Choosing Initial Centroids …

(Figures: for a different choice of initial centroids, the assignments at iterations 1 through 5 converge to a different, poorer clustering.)
Problems with Selecting Initial Points

• If there are K 'real' clusters then the chance of selecting one centroid from each cluster is small.
  – The chance is relatively small when K is large
  – If the clusters are the same size, n, then the probability is

      P = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids)
        = K! n^K / (Kn)^K = K! / K^K

  – For example, if K = 10, then probability = 10!/10^10 ≈ 0.00036
  – Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't
  – Consider an example of five pairs of clusters
10 Clusters Example

(Figure: the final clustering after iteration 4.)

Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example

(Figures: the cluster assignments at iterations 1 through 4.)

Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example

(Figure: the final clustering after iteration 4.)

Starting with some pairs of clusters having three initial centroids, while others have only one.
10 Clusters Example

(Figures: the cluster assignments at iterations 1 through 4.)

Starting with some pairs of clusters having three initial centroids, while others have only one.
Strategies for the Initial Centroids Problem
• Multiple runs and select cluster with min. SSE
– Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial
centroids
• Select more than k initial centroids and then select among these
initial centroids
– Select most widely separated
• Postprocessing
• Generate a larger number of clusters and then perform a
hierarchical clustering
• Bisecting K-means
– Not as susceptible to initialization issues
K-means++

• This approach can be slower than random initialization, but very consistently produces better results in terms of SSE
  – The k-means++ algorithm guarantees an approximation ratio O(log k) in expectation, where k is the number of centers
• To select a set of initial centroids, C, perform the following
  1. Select an initial point at random to be the first centroid
  2. For k – 1 steps
  3.   For each of the N points, x_i, 1 ≤ i ≤ N, find the minimum squared distance to the currently selected centroids, C_1, …, C_j, 1 ≤ j < k, i.e., min_j dist²(x_i, C_j)
  4.   Randomly select a new centroid by choosing a point with probability proportional to min_j dist²(x_i, C_j)
  5. End For
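A compact sketch of this initialization in NumPy, as an illustration of the steps above rather than the lecturer's code; in practice scikit-learn's KMeans uses init='k-means++' by default.

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]                  # step 1: first centroid at random
    for _ in range(k - 1):                                  # step 2: choose k-1 more centroids
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=2), axis=1)       # step 3: min squared distance to chosen centroids
        probs = d2 / d2.sum()                               # step 4: probability proportional to d^2
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)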
K-means additional issues: Empty Clusters

• K-means can yield empty clusters

(Figure: a one-dimensional example in which, after the centroids are updated, every point is closer to one of the other two centroids, leaving one centroid with no assigned points – an empty cluster.)
Handling Empty Clusters
• Basic K-means algorithm can yield empty clusters

• Several strategies
– Choose the point that is farthest from current centroid
– Choose a point from the cluster with the highest SSE
– If there are several empty clusters, the above can be
repeated several times.
Updating Centers Incrementally
• In the basic K-means algorithm, centroids are updated
after all points are assigned to a centroid

• An alternative is to update the centroids after each


assignment (incremental approach)
– Each assignment updates zero or two centroids
– More expensive
– Introduces an order dependency
– Never get an empty cluster
– Can use “weights” to change the impact
Pre-processing and Post-processing
• Pre-processing
– Normalize the data
– Eliminate outliers

• Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high
SSE
– Merge clusters that are ‘close’ and that have relatively
low SSE
Bisecting K-means

• Bisecting K-means algorithm


– Variant of K-means that can produce a partitional or a hierarchical clustering
Bisecting K-means Example
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits

(Figure: six points with their nested clusters and the corresponding dendrogram.)
Strengths of Hierarchical Clustering
• Do not have to assume any particular number
of clusters
– Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level

• They may correspond to meaningful


taxonomies
– Example in biological sciences (e.g., animal
kingdom)
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) left

– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains an individual point (or there are
k clusters)

• Traditional hierarchical algorithms use a similarity or distance matrix


– Merge or split one cluster at a time
Agglomerative Clustering Algorithm

• Most popular hierarchical clustering technique

• Basic algorithm is straightforward


1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
• Key operation is the computation of the proximity of two
clusters
– Different approaches to defining the distance between clusters
distinguish the different algorithms
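A brief sketch of agglomerative clustering with SciPy, illustrating the algorithm above on synthetic points; the linkage choice here, 'single', is one of the inter-cluster distance definitions discussed on the following slides.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).random((20, 2))       # synthetic points
Z = linkage(X, method='single')                     # MIN / single-link agglomerative clustering
labels = fcluster(Z, t=3, criterion='maxclust')     # cut the dendrogram into 3 clusters
dendrogram(Z)                                       # plot of the merge sequence (requires matplotlib)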
Starting Situation
• Start with clusters of individual points and a proximity matrix

(Figure: points p1, p2, p3, … as singleton clusters, together with their proximity matrix.)
Intermediate Situation
• After some merging steps, we have some clusters

(Figure: clusters C1–C5 and the corresponding proximity matrix.)
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

(Figure: clusters C1–C5 with C2 and C5 about to be merged, and the proximity matrix to be updated.)
After Merging
• The question is "How do we update the proximity matrix?"

(Figure: the proximity matrix after merging C2 and C5, with the entries for the new cluster C2 ∪ C5 marked as unknown.)
How to Define Inter-Cluster Distance

(Figure: two clusters of points and the proximity matrix; which entries define their similarity?)

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward's Method uses squared error
MIN or Single Link
• Proximity of two clusters is based on the two
closest points in the different clusters
– Determined by one pair of points, i.e., by one link in the
proximity graph
• Example:
Distance Matrix:
Hierarchical Clustering: MIN

(Figure: nested single-link clusters of six points and the corresponding dendrogram.)


Strength of MIN

Original Points Six Clusters

• Can handle non-elliptical shapes


Limitations of MIN

(Figures: the original points split into two clusters and into three clusters.)

• Sensitive to noise and outliers
MAX or Complete Linkage
• Proximity of two clusters is based on the two most distant points
in the different clusters
– Determined by all pairs of points in the two clusters

Distance Matrix:
Hierarchical Clustering: MAX

(Figure: nested complete-link clusters of six points and the corresponding dendrogram.)


Strength of MAX

Original Points Two Clusters

• Less susceptible to noise and outliers


Limitations of MAX

Original Points Two Clusters

• Tends to break large clusters


• Biased towards or favour globular clusters
Group Average
• Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

  proximity(Cluster_i, Cluster_j) = ( Σ_{p_i ∈ Cluster_i, p_j ∈ Cluster_j} proximity(p_i, p_j) ) / ( |Cluster_i| × |Cluster_j| )

• Need to use average connectivity for scalability, since total proximity favors large clusters

Distance Matrix:
Hierarchical Clustering: Group Average

(Figure: nested group-average clusters of six points and the corresponding dendrogram.)


Hierarchical Clustering: Group Average
• Compromise between Single and Complete Link

• Strengths
– Less susceptible to noise and outliers

• Limitations
– Biased towards globular clusters
Cluster Similarity: Ward’s Method
• Similarity of two clusters is based on the increase in squared
error when two clusters are merged
– Similar to group average if distance between points is distance squared

• Less susceptible to noise and outliers

• Biased towards globular clusters

• Hierarchical analogue of K-means


– Can be used to initialize K-means
Hierarchical Clustering: Comparison

(Figure: the nested clusters produced on the same six points by MIN, MAX, Group Average and Ward's Method.)
MST: Divisive Hierarchical Clustering
• Build MST (Minimum Spanning Tree)
– Start with a tree that consists of any point
– In successive steps, look for the closest pair of points (p, q) such that one point
(p) is in the current tree but the other (q) is not
– Add q to the tree and put an edge between p and q
MST: Divisive Hierarchical Clustering
• Use MST for constructing hierarchy of clusters
Hierarchical Clustering: Time and Space requirements

• O(N²) space since it uses the proximity matrix.
  – N is the number of points.

• O(N³) time in many cases
  – There are N steps and at each step the proximity matrix, of size N², must be updated and searched
  – Complexity can be reduced to O(N² log(N)) time with some cleverness
Hierarchical Clustering: Problems and Limitations

• Once a decision is made to combine two clusters, it cannot be


undone

• No global objective function is directly minimized

• Different schemes have problems with one or more of the


following:
– Sensitivity to noise and outliers
– Difficulty handling clusters of different sizes and non-globular shapes
– Breaking large clusters
DBSCAN

• DBSCAN is a density-based algorithm.


– Density = number of points within a specified radius (Eps)

– A point is a core point if it has at least a specified number of points


(MinPts) within Eps
• These are points that are at the interior of a cluster
• Counts the point itself

– A border point is not a core point, but is in the neighborhood of a core point

– A noise point is any point that is not a core point or a border point
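A short sketch of DBSCAN with scikit-learn; Eps and MinPts map onto the eps and min_samples parameters, and the data here is synthetic.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).random((200, 2))           # synthetic 2-D data
db = DBSCAN(eps=0.1, min_samples=4).fit(X)
labels = db.labels_                                      # cluster index per point, -1 = noise
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True                # core points found by the algorithm
print(set(labels), core_mask.sum(), "core points")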
DBSCAN: Core(point A), Border, and Noise Points

MinPts = 7
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
DBSCAN: Core, Border and Noise Points

Original Points Point types: core,


border and noise
Eps = 10, MinPts = 4
When DBSCAN Works Well

Original Points Clusters

• Resistant to Noise
• Can handle clusters of different shapes and sizes
DBSCAN: Determining EPS and MinPts

• Idea is that for points in a cluster, their kth nearest neighbors are at
roughly the same distance
• Noise points have the kth nearest neighbor at farther distance
• So, plot sorted distance of every point to its kth nearest neighbor
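A hedged sketch of the k-distance plot idea described above, using scikit-learn's NearestNeighbors with k = MinPts = 4 on synthetic data; the "knee" of the sorted curve suggests a value for Eps.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).random((200, 2))            # synthetic data
k = 4
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)          # +1 because each point is its own nearest neighbour
dist, _ = nn.kneighbors(X)                               # distances to the nearest neighbours
kth = np.sort(dist[:, -1])                               # sorted distance to the k-th other neighbour
plt.plot(kth)
plt.ylabel(f"{k}-th nearest neighbour distance")
plt.show()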
When DBSCAN Does NOT Work Well

(MinPts=4, Eps=9.75).

Original Points

• Varying densities
• High-dimensional data (MinPts=4, Eps=9.92)
Hard (Crisp) vs Soft (Fuzzy) Clustering
• Hard (Crisp) vs. Soft (Fuzzy) clustering
  – For soft clustering, allow a point to belong to more than one cluster
  – For K-means, generalize the objective function:

      SSE = Σ_{j=1}^{k} Σ_{i=1}^{m} w_ij dist²(x_i, c_j),   with Σ_{j=1}^{k} w_ij = 1

      w_ij: weight with which object x_i belongs to cluster c_j

  – To minimize SSE, repeat the following steps:
    • Fix the centroids c_j and determine the w_ij (cluster assignment)
    • Fix the w_ij and recompute the centroids c_j
  – Hard clustering: w_ij ∈ {0, 1}
Soft (Fuzzy) Clustering: Estimating Weights

(Example: centroids c1 = 1 and c2 = 5 on a line, with a point x at 2.)

SSE(x) = w_x1 (2 − 1)² + w_x2 (5 − 2)² = w_x1 + 9 w_x2

SSE(x) is minimized when w_x1 = 1, w_x2 = 0
Fuzzy C Mean

The membership value of the kth data point in the ith class is defined by the partition matrix, with the restriction that the memberships of each data point over all classes sum to 1, where n is the number of classes.

Fuzzy C Mean
There will be no empty class and no class will contain all the points.

After initializing the partition matrix, the cluster centers are calculated.

Fuzzy C Mean
Define an objective function for the fuzzy C-partition.

Fuzzy C Mean
Update the partition matrix.

m' = 2 or more (the fuzzification exponent)
Fuzzy C-means

• Objective function (p: fuzzifier, p > 1):

    SSE = Σ_{j=1}^{k} Σ_{i=1}^{m} w_ij^p dist²(x_i, c_j),   with Σ_{j=1}^{k} w_ij = 1

  • w_ij: weight with which object x_i belongs to cluster c_j
  • p is a power for the weight, not a superscript, and controls how "fuzzy" the clustering is

  – To minimize the objective function, repeat the following:
    • Fix the centroids c_j and determine the w_ij
    • Fix the w_ij and recompute the centroids c_j

  – Fuzzy c-means clustering: w_ij ∈ [0, 1]
Bezdek, James C. Pattern recognition with fuzzy objective function algorithms . Kluwer Academic Publishers, 1981.
Fuzzy C-means

(Example: centroids c1 = 1 and c2 = 5 on a line, with a point x at 2 and p = 2.)

SSE(x) = w_x1² (2 − 1)² + w_x2² (5 − 2)² = w_x1² + 9 w_x2²

SSE(x) is minimized when w_x1 = 0.9, w_x2 = 0.1
Cluster Validity
• For supervised classification we have a variety of measures
to evaluate how good our model is
– Accuracy, precision, recall

• For cluster analysis, the analogous question is how to


evaluate the “goodness” of the resulting clusters?

• But “clusters are in the eye of the beholder”!

• Then why do we want to evaluate them?


– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
Clusters found in Random Data

(Figures: the same set of uniformly random points "clustered" by DBSCAN, K-means and Complete Link – each algorithm reports clusters even though the data has no real cluster structure.)
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing
whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g.,
to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without
reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine
which is better.
5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the


entire clustering or just individual clusters.
Measures of Cluster Validity
• Numerical measures that are applied to judge various aspects of cluster validity,
are classified into the following three types.
– Supervised (External Index): Used to measure the extent to which cluster labels
match externally supplied class labels.
• Entropy
– Unsupervised (Internal Index): Used to measure the goodness of a clustering structure without respect to external information, in terms of cluster cohesion and cluster separation
• Sum of Squared Error (SSE)
– Relative Index: Used to compare two different clusterings or clusters.
• Often an external or internal index is used for this function, e.g., SSE or entropy
• Sometimes these are referred to as criteria instead of indices
– However, sometimes criterion is the general strategy and index is the numerical measure
that implements the criterion.
Measuring Cluster Validity Via Correlation
• Two matrices
– Proximity Matrix
– Ideal Similarity Matrix
• One row and one column for each data point
• An entry is 1 if the associated pair of points belong to the same cluster
• An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the two matrices
– Since the matrices are symmetric, only the correlation between
n(n-1) / 2 entries needs to be calculated.
• High correlation indicates that points that belong to the same
cluster are close to each other.
• Not a good measure for some density or contiguity based
clusters.
Measuring Cluster Validity Via Correlation
• Correlation of ideal similarity and proximity matrices for the K-means clusterings of the following two data sets.

(Figures: two data sets – one with three well-separated clusters, one of random points.)

Corr = -0.9235          Corr = -0.5810


Using Similarity Matrix for Cluster Validation

• Order the similarity matrix with respect to cluster labels and inspect visually.

(Figure: well-separated clusters and the corresponding similarity matrix, which shows a clear block structure along the diagonal.)
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp

(Figures: similarity matrices, ordered by cluster label, for the clusterings of random data produced by DBSCAN, K-means and Complete Link.)
Internal Measures: SSE
• Clusters in more complicated figures aren’t well separated
• Internal Index: Used to measure the goodness of a clustering structure without
respect to external information
– SSE
• SSE is good for comparing two clusterings or two clusters (average SSE).
• Can also be used to estimate the number of clusters
(Figure: a data set of ten well-separated clusters and the curve of SSE versus the number of clusters K.)
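A sketch of how such an SSE-versus-K ("elbow") curve can be produced with scikit-learn on synthetic data; the knee in the curve suggests a value of K.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=10, random_state=0)   # synthetic data with 10 clusters
ks = range(2, 31)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(list(ks), sse, marker='o')
plt.xlabel("K"); plt.ylabel("SSE")
plt.show()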
Internal Measures: SSE
• SSE curve for a more complicated data set

(Figure: a data set with irregular clusters and the SSE of the clusters found using K-means.)


Statistical Framework for SSE
• Example (K-Means)
– Compare SSE of 0.005 against three clusters in random runs
– Histogram shows SSE of three clusters in 500 random runs of data points
of size 100 distributed over the range 0.2 – 0.8 for x and y values

(Figure: the data points and a histogram of the SSE values from the 500 random runs; an SSE of 0.005 lies far below this distribution, so the three clusters are unlikely to be due to chance.)
Statistical Framework for Correlation

• Correlation of ideal similarity and proximity matrices for the K-means clusterings of the following two data sets.

(Figures: the same two data sets as before – well-separated clusters and random points.)

Corr = -0.9235          Corr = -0.5810


Internal Measures: Cohesion and Separation
• Cluster Cohesion: Measures how closely related the objects in a cluster are
  – Example: SSE
• Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
• Example: Squared Error
  – Cohesion is measured by the within-cluster sum of squares (SSE):

      WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²

  – Separation is measured by the between-cluster sum of squares:

      BSS = Σ_i |C_i| (m − m_i)²

  – where |C_i| is the size of cluster i, m_i is the mean of cluster i, and m is the overall mean


Internal Measures: Cohesion and Separation

• Example: SSE
  – BSS + WSS = constant

(Example: the one-dimensional data points 1, 2, 4, 5 with overall mean m = 3; for K = 2 the cluster means are m1 = 1.5 and m2 = 4.5.)

K=1 cluster:
  WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
  BSS = 4 × (3 − 3)² = 0
  Total = 10 + 0 = 10

K=2 clusters:
  WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
  BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
  Total = 1 + 9 = 10
Internal Measures: Cohesion and Separation
• A proximity graph based approach can also be used for
cohesion and separation.
– Cluster cohesion is the sum of the weight of all links within a cluster.
– Cluster separation is the sum of the weights between nodes in the cluster
and nodes outside the cluster.

cohesion separation
Internal Measures: Silhouette Coefficient
• Silhouette coefficient combines ideas of both cohesion and separation, but
for individual points, as well as clusters and clusterings
• For an individual point, i
– Calculate a = average distance of i to the points in its cluster
– Calculate b = min (average distance of i to points in another cluster)
– The silhouette coefficient for a point is then given by
    s = (b – a) / max(a, b)

(Figure: point i with the distances used to calculate a, within its own cluster, and b, to the nearest other cluster.)
– Typically between 0 and 1.
– The closer to 1 the better.

• Can calculate the average silhouette coefficient for a cluster or a clustering


External Measures of Cluster Validity: Entropy and Purity
Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult
and frustrating part of cluster analysis.

Without a strong effort in this direction, cluster analysis will


remain a black art accessible only to those true believers who
have experience and great courage.”

“Algorithms for Clustering Data” Jain and Dubes


Reference Material
• “INTRODUCTION TO DATA MINING” BY PANGNING TAN, VIPIN
KUMAR
• “FUZZY LOGIC WITH ENGINEERING APPLICATIONS” BY TIMOTHY J.
ROSS

CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview

https://scikit-learn.org/stable/modules/clustering.html



Epilogue
• Acknowledgements
• Contact details :
– E-MAIL : [email protected]
– M : 9718718889



Thank You!

