UNIT-V: Cluster Analysis
Cluster Analysis: Basic Concepts and Algorithms
Overview: What Is Cluster Analysis? Different Types of Clustering, Different Types of Clusters
K-means: The Basic K-means Algorithm, K-means Additional Issues, Bisecting K-means, Strengths and Weaknesses
Agglomerative Hierarchical Clustering: Basic Agglomerative Hierarchical Clustering Algorithm
DBSCAN: Traditional Density (Center-Based Approach), DBSCAN Algorithm, Strengths and Weaknesses
Cluster Analysis
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the
characteristics found in the data and grouping
similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data
distribution
As a preprocessing step for other algorithms
Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to their
house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
Clustering Applications and
Multidisciplinary Efforts
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering feature
spaces
Detect spatial clusters or for other spatial mining
tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar
access patterns
Quality - Good Clustering
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both
the similarity measure used by the method and its
implementation
The quality of a clustering method is also measured
by its ability to discover some or all of the hidden
patterns
Different Types of Clustering
Hierarchical versus Partitional
Exclusive versus Overlapping versus Fuzzy
Complete versus Partial
Different Types of Clusters
Well-separated Cluster
Prototype-Based cluster
Graph-Based Cluster
Density-Based Cluster
Shared-Property or Conceptual Clusters
Different Types of Clusters
Well-separated clusters : Each point is
closer to all of the points in its cluster than
to any point in another cluster.
Prototype-based clusters : Each point is closer
to the center of its cluster than to the center of
any other cluster.
Graph-based clusters: Each point is closer
to at least one point in its cluster than to
any point in another cluster.
Density-based clusters: Clusters are
regions of high density separated by
regions of low density.
Shared-property (conceptual) clusters: Points in a cluster share some general property that derives from the entire set of points.
Requirements of Clustering in Data
Mining
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Major Clustering Approaches
Partitioning approach:
Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
using some criterion
Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches
Grid-based approach:
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Model-based:
A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: pCluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the
Distance between Clusters
Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min{ dis(t_ip, t_jq) }
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max{ dis(t_ip, t_jq) }
Average: average distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = avg{ dis(t_ip, t_jq) }
Centroid: distance between the centroids of two clusters, i.e., dis(K_i, K_j) = dis(C_i, C_j)
Medoid: distance between the medoids of two clusters, i.e., dis(K_i, K_j) = dis(M_i, M_j), where a medoid is one chosen, centrally located object in the cluster
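As an illustration, here is a minimal NumPy sketch (an assumed helper, not part of the original slides) that computes the single-link, complete-link, average, and centroid distances between two clusters given as arrays of points; the medoid variant would first pick the most centrally located object of each cluster.

import numpy as np

def cluster_distances(Ki, Kj):
    # Pairwise Euclidean distances between every point in Ki and every point in Kj
    d = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)
    return {
        "single": d.min(),      # closest pair of points
        "complete": d.max(),    # farthest pair of points
        "average": d.mean(),    # mean over all pairs
        "centroid": np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0)),
    }

# Example usage with two small 2-D clusters
Ki = np.array([[1.0, 1.0], [2.0, 1.0]])
Kj = np.array([[5.0, 4.0], [6.0, 5.0]])
print(cluster_distances(Ki, Kj))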
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
Centroid: the "middle" of a cluster
$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
Radius: square root of the average distance from any point of the cluster to its centroid
$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
Diameter: square root of the average mean squared distance between all pairs of points in the cluster
$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
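A small NumPy sketch of these three quantities for a cluster given as an array of points (an illustrative helper, not from the slides):

import numpy as np

def cluster_summary(points):
    # points: array of shape (N, d); rows are the objects of the cluster
    N = len(points)
    centroid = points.mean(axis=0)                            # C_m
    radius = np.sqrt(((points - centroid) ** 2).sum() / N)    # R_m
    diff = points[:, None, :] - points[None, :, :]            # all pairs t_ip - t_jq
    diameter = np.sqrt((diff ** 2).sum() / (N * (N - 1)))     # D_m
    return centroid, radius, diameter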
K-means
Partitioning Algorithms: Basic
Concept
Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized
Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67): Each cluster is represented by the
center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the
objects in the cluster
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in
four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when no new assignments are made
Basic K-means algorithm
1: Select K points as initial centroids.
2: repeat
3: Form K clusters by assigning each point to
its closest centroid.
4: Recompute the centroid of each cluster.
5: until Centroids do not change
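A minimal NumPy sketch of this loop (function and variable names are illustrative, not from the slides; empty clusters are not handled here, see "K-means: Additional Issues" later):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1: Select K points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 3: Form K clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4: Recompute the centroid of each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 5: Stop when centroids do not change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids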
The K-Means Clustering Method: Example
[Figure: K-means iterations with K = 2. Arbitrarily choose K objects as initial cluster centers; assign each object to the cluster with the most similar center; update the cluster means; reassign the objects. Repeat until no reassignment occurs.]
[Table of notation (not reproduced)]
K-Means Method
Strength: Relatively efficient: O(tkn), where n is # objects, k is
# clusters, and t is # iterations. Normally, k, t << n.
Weaknesses
Applicable only when a mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable for discovering clusters with non-convex shapes
Example
Cluster the following eight points (with (x,
y) representing locations) into three
clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7,
5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5,
8) and A7(1, 2).
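A short scikit-learn sketch of this example (assuming scikit-learn is available); the init array supplies the three given starting centers:

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]])   # A1..A8
init = np.array([[2, 10], [5, 8], [1, 2]])            # A1, A4, A7

km = KMeans(n_clusters=3, init=init, n_init=1).fit(points)
print(km.labels_)           # cluster index assigned to A1..A8
print(km.cluster_centers_)  # final centroids

Working the iterations by hand gives the clusters {A1, A4, A8}, {A3, A5, A6}, and {A2, A7}.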
K Means
Advantages
1. Easy to implement.
2. With a large number of variables, K-Means
may be computationally faster than
hierarchical clustering (if K is small)
Disadvantages
1. Difficult to predict the number of clusters
(K-Value).
2. Can converge on local minima
3. Sensitive to outliers
K-means: Additional Issues
Handling Empty Clusters:
One of the problems with the basic K-means algorithm is that empty clusters can be obtained if no points are allocated to a cluster during the assignment step
One approach is to choose, as the replacement centroid, the point that is farthest away from any current centroid
Another approach is to choose the replacement centroid from the cluster that has the highest SSE
Reducing the SSE with
Postprocessing
Two strategies that decrease the total SSE by increasing the number of clusters are the following:
Split a cluster: The cluster with the largest SSE is chosen and split
Introduce a new cluster centroid: Often the point that is farthest from any cluster center is chosen; this is easy to determine if we keep track of the SSE contributed by each point. Alternatively, choose randomly from all points or from the points with the highest SSE
Two strategies that decrease the number of clusters,
while trying to minimize the increase in total SSE,
are the following:
Disperse a cluster: removing the centroid that
corresponds to the cluster and reassigning the
points to other clusters
Merge two clusters: The clusters with the closest centroids are typically chosen; alternatively, merge the two clusters whose merger results in the smallest increase in total SSE
Updating Centroids Incrementally: the centroids
can be updated incrementally, after each
assignment of a point to a cluster
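All of these strategies rely on the SSE contributed by each cluster; a minimal sketch of that computation (an illustrative helper, not from the slides):

import numpy as np

def sse_per_cluster(X, labels, centroids):
    # Sum of squared errors contributed by each cluster
    return np.array([((X[labels == j] - c) ** 2).sum()
                     for j, c in enumerate(centroids)])

The cluster with the largest SSE is the natural candidate to split, and the pair of clusters whose merge increases the total SSE least is the natural candidate to merge.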
Bisecting K-means
The bisecting K-means algorithm is a
straightforward extension of the basic K-
means algorithm .
To obtain K clusters: split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.
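A sketch of this idea, built on the kmeans function sketched earlier (the split criterion here is largest SSE; choosing the largest cluster is another common option):

import numpy as np

def bisecting_kmeans(X, K, trials=5):
    clusters = [X]                      # start with a single cluster of all points
    while len(clusters) < K:
        # Select the cluster with the largest SSE to split
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        # Try several 2-means splits and keep the one with the lowest SSE
        best = None
        for seed in range(trials):
            labels, cents = kmeans(target, 2, seed=seed)
            sse = sum(((target[labels == j] - cents[j]) ** 2).sum() for j in range(2))
            if best is None or sse < best[0]:
                best = (sse, labels)
        clusters += [target[best[1] == 0], target[best[1] == 1]]
    return clusters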
[Figure: bisecting K-means algorithm and example (not reproduced)]
Hierarchical Clustering
Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
[Figure: agglomerative clustering (AGNES) merges a, b, c, d, e step by step (ab, de, cde, abcde), while divisive clustering (DIANA) proceeds in the reverse order, Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g.,
Splus
Use the Single-Link method and the dissimilarity
matrix.
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
[Figure: three scatter plots showing successive single-link merges]
Dendrogram: Shows How the Clusters are Merged
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
Defining Proximity between Clusters
The definition of cluster proximity is what differentiates the various agglomerative hierarchical techniques.
Agglomerative hierarchical clustering techniques,
such as MIN, MAX, and Group Average, come from a
graph-based view of clusters.
MIN defines cluster proximity as the proximity
between the closest two points that are in different
clusters
MAX takes the proximity between the farthest two
points in different clusters to be the cluster proximity
The group average technique defines cluster proximity as the average of the pairwise proximities of all pairs of points from different clusters
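A brief SciPy sketch of these variants (MIN corresponds to method='single', MAX to 'complete', and group average to 'average'; 'ward', discussed later, is also available); the data here is an arbitrary toy set:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 1], [5, 5], [5, 6], [3, 4]])

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                    # merge history (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, labels)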
[Figures: step-by-step agglomerative clustering example, with dendrograms, the data plot, and the complete-link variant (not reproduced)]
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g.,
Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own
[Figure: three scatter plots showing successive divisive splits]
Ward's Method and Centroid Methods
In Ward's method, the proximity between two clusters is defined as the increase in the squared error that results when the two clusters are merged
Although Ward's method may seem somewhat distinct from other hierarchical techniques, it can be shown mathematically that it is very similar to the group average method when the proximity between two points is taken to be the square of the distance between them
Centroid methods calculate the proximity between
two clusters by calculating the distance between
the centroids of clusters, similar to K-means.
Strengths and Weaknesses
These algorithms can produce better-quality clusters
Agglomerative hierarchical clustering algorithms are expensive in terms of their computational and storage requirements
The fact that all merges are final can also cause trouble for noisy, high-dimensional data, such as document data
These two problems can be addressed to some degree by first partially clustering the data using another technique, such as K-means
Density-Based Clustering Methods
Clustering based on density (local cluster criterion),
such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al (SIGMOD’99).
DENCLUE: Hinneburg & D. Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-
based)
Density-Based Clustering: Basic
Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-
neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) <= Eps}
Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
p belongs to NEps(q), and
q satisfies the core point condition: |NEps(q)| >= MinPts
[Figure: p directly density-reachable from core point q, with MinPts = 5 and Eps = 1 cm]
Classification of Points According to
Center-Based Density
Core points: These points are in the interior of a density-based cluster. A point is a core point if the number of points within a given neighborhood around the point, as determined by the distance function and a user-specified distance parameter Eps, exceeds a certain threshold MinPts, which is also a user-specified parameter.
Border points: A border point is not a core point,
but falls within the neighborhood of a core point. A
border point can fall within the neighborhoods of
several core points.
Noise points: A noise point is any point that is
neither a core point nor a border point.
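A minimal NumPy sketch of this classification (an illustrative helper, not from the slides; Eps and MinPts are the user-specified parameters, and here the neighborhood count includes the point itself):

import numpy as np

def classify_points(X, eps, min_pts):
    # Pairwise distances and Eps-neighborhood membership
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = d <= eps
    core = neighbors.sum(axis=1) >= min_pts               # core point condition
    # Border: not core, but inside the neighborhood of some core point
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    return np.where(core, "core", np.where(border, "border", "noise"))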
Density-Reachable and Density-Connected
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that p(i+1) is directly density-reachable from p(i)
Density-connected:
A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: a chain of points from q to p illustrating density-reachability, and a point o from which both p and q are density-reachable]
DBSCAN: Density Based Spatial
Clustering of Applications with Noise
Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
Discovers clusters of arbitrary shape in spatial
databases with noise
[Figure: core, border, and outlier (noise) points for Eps = 1 cm and MinPts = 5]
[Figure: DBSCAN example (not reproduced)]
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t.
Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-
reachable from p and DBSCAN visits the next
point of the database.
Continue the process until all of the points have
been processed.
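A short sketch using scikit-learn's DBSCAN (eps and min_samples correspond to Eps and MinPts; the label -1 marks noise points); the two-moons data is just an example of a non-convex shape that K-means cannot separate:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids found; -1 denotes noise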
Strengths and Weaknesses
Because DBSCAN uses a density-based definition of a cluster, it is relatively resistant to noise and can handle clusters of arbitrary shapes and sizes
DBSCAN can find many clusters that could not be
found using K-means
DBSCAN has trouble when the clusters have widely
varying densities. It also has trouble with high-
dimensional data because density is more difficult
to define for such data.
DBSCAN can be expensive when the computation
of nearest neighbors requires computing all
pairwise proximities, as is usually the case for
high-dimensional data.
Outlier
What are outliers?
A set of objects that are considerably dissimilar from the remainder of the data
Example: in sports, Michael Jordan, Wayne Gretzky, ...
Problem: Define and find outliers in large data sets
Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis