
Workshop on Statistics and Machine Learning in Practice

Topic: Unsupervised learning (clustering)

STAT&ML LAB (https://www.ctanujit.org/statml-lab.html)

by
Swarup Chattopadhyay, ISI, Kolkata
What is clustering?
• The organization of unlabeled data into similarity groups called clusters.

• A cluster is a collection of data items that are "similar" to one another and "dissimilar" to the data items in other clusters.
  • high intra-class similarity
  • low inter-class similarity

• More informally, finding natural groupings among objects.


What is a natural grouping among these objects?

Clustering is subjective

(Figure: example groupings of the same objects: Simpson's family, school employees, females, males.)

Computer vision application: image segmentation
(Figure: image segmentation example.)
Data Representations for Clustering
• Input data to algorithm is usually a vector (also called a “tuple” or “record”)

• Types of data
• Numerical
• Categorical
• Boolean
• Example: Clinical Sample Data
• Age (numerical)
• Weight (numerical)
• Gender (categorical)
• Diseased? (boolean)

• Must also include a method for computing the similarity of, or distance between, vectors (an encoding sketch for the clinical example follows).
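As a small illustration (not from the slides; the field values and encoding choices are hypothetical), one way such a mixed-type clinical record might be turned into a numeric vector is sketched below: numerical fields are kept as-is, the categorical field is one-hot encoded, and the boolean flag becomes 0/1.

```python
import numpy as np

# Hypothetical clinical record with mixed attribute types.
record = {"age": 54, "weight": 71.2, "gender": "female", "diseased": True}

def encode(rec):
    # Numerical features are used as-is; the categorical "gender" is
    # one-hot encoded; the boolean flag becomes 0/1.
    gender_onehot = [1.0, 0.0] if rec["gender"] == "female" else [0.0, 1.0]
    return np.array([rec["age"], rec["weight"], *gender_onehot, float(rec["diseased"])])

print(encode(record))   # -> [54.  71.2  1.  0.  1.]
```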
What do we need for clustering?
• Proximity measure, either
  • Similarity measure s(X_i, X_k): large if X_i, X_k are similar
  • Dissimilarity (or distance) measure d(X_i, X_k): small if X_i, X_k are similar
  • (similar pair: large s, small d; dissimilar pair: large d, small s)

• Criterion function to evaluate a clustering

• Algorithm to compute the clustering
  • For example, by optimizing the criterion function
Distance (dissimilarity) measures
• Euclidean distance
  $d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} (x_{ik} - x_{jk})^2}$
  • Translation invariant

• Manhattan (city block) distance
  $d(x_i, x_j) = \sum_{k=1}^{d} |x_{ik} - x_{jk}|$
  • Approximation to Euclidean distance
  • Cheaper to compute

• They are special cases of the Minkowski distance:
  $d_p(x_i, x_j) = \left( \sum_{k=1}^{d} |x_{ik} - x_{jk}|^p \right)^{1/p}$
  • p is a positive integer
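A minimal numerical sketch of these formulas (assuming NumPy; the two vectors are made up for illustration):

```python
import numpy as np

def minkowski(x_i, x_j, p):
    """Minkowski distance between two vectors; p=2 gives Euclidean, p=1 gives Manhattan."""
    return float(np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p))

x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([4.0, 0.0, 3.0])

print(minkowski(x_i, x_j, p=2))   # Euclidean (L2): sqrt(9 + 4 + 0) ≈ 3.606
print(minkowski(x_i, x_j, p=1))   # Manhattan (L1): 3 + 2 + 0 = 5.0
```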
Cluster evaluation (a hard problem)

• Intra-cluster cohesion (compactness):

• Cohesion measures how near the data points in a cluster are to the cluster centroid.
• Sum of squared error (SSE) is a commonly used measure.

• Inter-cluster separation (isolation):

• Separation means that different cluster centroids should be far away from one another.

• In most applications, expert judgments are still the key


How many clusters?

• Possible approaches
• Fix the number of clusters to k
• Find the best clustering according to the criterion function
Clustering techniques
• Partitioning methods
  • k-Means algorithm [1957, 1967]
  • k-Medoids algorithm
  • k-Modes [1998]
  • Fuzzy c-means algorithm [1999]
• Hierarchical methods
  • Divisive: DIANA [1990]
  • Agglomerative: AGNES [1990], BIRCH [1996], CURE [1998], ROCK [1999], Chameleon [1999]
• Density-based methods
  • STING [1997], DBSCAN [1996], CLIQUE [1998], DENCLUE [1998], OPTICS [1999], Wave Cluster [1998]
• Graph-based methods
  • MST Clustering [1999], OPOSSUM [2000], SNN Similarity Clustering [2001, 2003]
• Model-based clustering
  • EM Algorithm [1977], Auto class [1996], COBWEB [1987], ANN Clustering [1982, 1989]
Clustering techniques

• In this lecture, we shall try to cover only the following clustering techniques:
  • Partitioning
    • k-Means algorithm
    • PAM (k-Medoids algorithm)
  • Hierarchical
    • Divisive algorithm
    • Agglomerative algorithm
Hierarchical clustering
• Consider flat clustering: for some data, hierarchical clustering is more appropriate than 'flat' clustering.
• Hierarchical clustering example: biological taxonomy.
Types of hierarchical clustering
Divisive (top down) clustering
• Starts with all data points in one cluster, the root, then
• Splits the root into a set of child clusters. Each child cluster is recursively divided
further
• stops when only singleton clusters of individual data points remain, i.e., each
cluster with only a single point
Agglomerative (bottom up) clustering
• The dendrogram is built from the bottom level by
• merging the most similar (or nearest) pair of clusters
• stopping when all the data points are merged into a single cluster (i.e., the root
cluster).
• The number of dendrograms with n leaves = $(2n-3)!/[2^{\,n-2}\,(n-2)!]$
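As an illustrative sketch (not part of the slides), the agglomerative procedure just described can be run with SciPy's hierarchy module; the toy data below are made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy 2-D data: two loose groups of points.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.1]])

# Agglomerative clustering: repeatedly merge the two nearest clusters.
# "average" linkage uses the mean pairwise distance between clusters.
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram to obtain 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)           # e.g. [1 1 1 2 2 2]

# dendrogram(Z) would draw the tree (requires matplotlib).
```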
Summary of Hierarchical Clustering Methods
• No need to specify the number of clusters in advance.

• The hierarchical nature maps nicely onto human intuition for some domains.

• They do not scale well: time complexity of at least O(n²), where n is the total number of objects.

• As with any heuristic search algorithm, local optima are a problem.

• Interpretation of results is (very) subjective.


K-Means clustering

• K-means (MacQueen, 1967) is a partition clustering algorithm

• Let the set of data points be $D = \{X_1, X_2, \ldots, X_n\}$, where $X_i = (X_{i1}, X_{i2}, \ldots, X_{ir})$ is a vector in $\mathbb{R}^r$, and r is the number of dimensions.

• The k-means algorithm partitions the given data into k clusters:


• Each cluster has a cluster center, called centroid.
• k is specified by the user
K-Means clustering

• Given k, the k-means algorithm works as follows:

1. Choose k (random) data points (seeds) to be the initial centroids (cluster centers)
2. Assign each data point to the closest centroid
3. Re-compute the centroids using the current cluster memberships
4. If a convergence criterion is not met, repeat steps 2 and 3
K-Means clustering
Input: D is a dataset containing n objects, k is the number of cluster
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.
2. For each of the objects in D do
• Compute the distance between the current object and the k cluster centroids
• Assign the current object to the cluster whose centroid is closest.
3. Compute the “cluster centers” of each cluster. These become the
new cluster centroids.
4. Repeat step 2-3 until the convergence criterion is satisfied
5. Stop
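A minimal NumPy sketch of these steps is shown below; it is illustrative only (empty clusters and other corner cases are not handled) and uses centroid stability as the convergence criterion:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following the steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each centroid as the mean of its current members
        # (an empty cluster would need special handling in real code).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat until the centroids stop changing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)   # roughly [0, 0] and [5, 5]
```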
K-means convergence (stopping) criterion
• no (or minimum) re-assignments of data points to different clusters, or
• no (or minimum) change of centroids, or
• minimum decrease in the sum of squared error (SSE),

  $SSE = \sum_{j=1}^{k} \sum_{X \in C_j} d(X, m_j)^2$

  where
  • $C_j$ is the j-th cluster,
  • $m_j$ is the centroid of cluster $C_j$ (the mean vector of all the data points in $C_j$),
  • $d(X, m_j)$ is the (Euclidean) distance between data point X and centroid $m_j$.
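The SSE above translates directly into code; a small sketch (assuming NumPy, with the label array indexing into an array of centroids):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    diffs = X - centroids[labels]      # vector from each point to its own centroid
    return float(np.sum(diffs ** 2))
```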
K-means Clustering: Steps 1-5
(Figure sequence: three centroids k1, k2, and k3 are initialized on a 2-D scatter plot; points are assigned to the nearest centroid, the centroids are re-computed, and the assignment/update steps repeat until the clusters stabilize.)
Comments on k-Means algorithm (k?)
• We may not have an idea about the possible number of clusters for high-dimensional data, or for data that cannot easily be scatter-plotted.
• Normally k ≪ n, and a common heuristic is k ≈ √n.
• There is no principled way to know what the value of k ought to be. We may try successive values of k, starting from 2, using two criteria (a short code sketch follows):
  • The Elbow Method
    • Calculate the Within-Cluster Sum of Squared Errors (WSS) for different values of k, and choose the k at which the decrease in WSS first starts to level off (the "elbow").
  • The Silhouette Method
    • Compute the average silhouette for different values of k; a high value is desirable and indicates that points are placed in the correct clusters.
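An illustrative sketch of both criteria using scikit-learn (an assumption, not part of the slides; the synthetic three-blob data set is made up): `inertia_` gives the WSS of the fitted model, and `silhouette_score` gives the average silhouette.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, (100, 2)) for c in ([0, 0], [6, 0], [3, 6])])

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss = km.inertia_                       # within-cluster sum of squared errors (WSS)
    sil = silhouette_score(X, km.labels_)   # mean silhouette over all points
    print(f"k={k}  WSS={wss:.1f}  silhouette={sil:.3f}")

# Elbow method: pick the k where the decrease in WSS starts to level off.
# Silhouette method: pick the k with the highest average silhouette (here, typically k = 3).
```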
Comments on k-Means algorithm
Distance Measurement:
• To assign a point to the closest centroid, we need a proximity measure that should
quantify the notion of “closest” for the objects under clustering.

• Usually, Euclidean distance (L2 norm) is the best measure when the objects are points in an n-dimensional Euclidean space.
• Another measure, namely cosine similarity, is more appropriate when the objects are documents (a small illustration follows).
• Further, there may be other types of proximity measures that are appropriate in the context of specific applications.
  • For example, Manhattan distance (L1 norm), the Jaccard measure, etc.
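As a small illustration of cosine similarity on documents (not from the slides; scikit-learn and the toy sentences are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["clustering groups similar documents together",
        "k-means clustering of text documents",
        "the stock market fell sharply today"]

tfidf = TfidfVectorizer().fit_transform(docs)   # sparse document-term matrix
print(cosine_similarity(tfidf).round(2))
# The two clustering-related documents are much more similar to each other
# than either is to the unrelated news sentence.
```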
Comments on k-Means algorithm
Distance Measurement:
• Thus, in the context of different measures, the sum-of-squared error (i.e., the objective function / convergence criterion) of a clustering can be stated as follows.
• Data in Euclidean space (L2 norm):

  $SSE = \sum_{j=1}^{k} \sum_{X \in C_j} d(X, m_j)^2$

• Data in Euclidean space (L1 norm):
  • The Manhattan distance (L1 norm) is used as the proximity measure, and the objective is to minimize the sum-of-absolute error, denoted SAE and defined as

  $SAE = \sum_{j=1}^{k} \sum_{X \in C_j} |d(X, m_j)|$
Comments on k-Means algorithm

Note:

• When SSE (L2 norm) is used as the objective function and the objective is to minimize it, the cluster centroid is the mean value of the objects in the cluster.

• When the objective function is defined as SAE (L1 norm), minimizing the objective function implies that the cluster centroid is the median of the cluster.
Comments on k-Means algorithm
We know

  $SSE = \sum_{j=1}^{k} \sum_{x \in C_j} d(x, m_j)^2 = \sum_{j=1}^{k} \sum_{x \in C_j} (m_j - x)^2$

To minimize SSE, we set

  $\frac{\partial (SSE)}{\partial m_j} = 0$

Thus,

  $\frac{\partial}{\partial m_j} \sum_{j=1}^{k} \sum_{x \in C_j} (m_j - x)^2 = 0$
Comments on k-Means algorithm
Or,

  $\sum_{x \in C_j} 2\,(m_j - x) = 0$

Or,

  $n_j \cdot m_j = \sum_{x \in C_j} x$, i.e., $m_j = \frac{1}{n_j} \sum_{x \in C_j} x$

• Thus, the best centroid for minimizing the SSE of a cluster is the mean of the objects in the cluster.
• In a similar way, the best centroid for minimizing the SAE of a cluster is the median of the objects in the cluster.
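A quick numerical check of this result (a sketch, not from the slides): for a one-dimensional cluster with an outlier, a grid search over candidate centroid values confirms that the SSE is minimized at the mean and the SAE at the median.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])    # one-dimensional "cluster" with an outlier
candidates = np.linspace(0, 100, 10001)      # candidate centroid values m_j

sse_best = min((np.sum((x - m) ** 2), m) for m in candidates)
sae_best = min((np.sum(np.abs(x - m)), m) for m in candidates)

print(sse_best[1], x.mean())      # SSE minimizer ≈ 22.0, the mean
print(sae_best[1], np.median(x))  # SAE minimizer ≈ 3.0, the median
```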
Why use K-means?
Strengths:
• Simple: easy to understand and to implement

• Efficient: Time complexity: O(tkn),


• where n is the number of data points,
• k is the number of clusters, and
• t is the number of iterations.
• Since both k and t are typically small, k-means is considered a linear algorithm.

• K-means is the most popular clustering algorithm.

• Note that: it terminates at a local optimum if SSE is used.


Weaknesses of K-means
• The algorithm is only applicable if the mean is defined.
• For example, k-Means does not work on categorical data because mean
cannot be defined.
• The user needs to specify k.
• Unable to handle noisy data and outliers
• Outliers are data points that are very far away from other data points.
• Outliers could be errors in the data recording or some special data points with
very different values.
• Not suitable for discovering clusters with non-convex shapes
Outliers
(Figure: outliers distort the centroids, producing undesirable clusters instead of the ideal clusters.)

Sensitivity to initial seeds
(Figure: different initial seeds can lead to different final clusterings.)

Special data structures
• The k-means algorithm is not suitable for discovering clusters with non-convex shapes.
(Figure: two natural non-convex clusters vs. the clusters found by k-means.)
Different variants of k-means algorithm
• There are quite a few variants of the k-Means algorithm. They can differ in the procedure for selecting the initial k means, the calculation of proximity, and the strategy for computing cluster means.
A few variants of the k-Means algorithm include:
• Bisecting k-Means (addressing the issue of the initial choice of cluster means).
  • M. Steinbach, G. Karypis and V. Kumar, "A comparison of document clustering techniques", Proceedings of the KDD Workshop on Text Mining, 2000.
• Mean of clusters (proposing various strategies to define means and variants of means).
  • B. Zhang, "Generalised k-Harmonic Means – Dynamic weighting of data in unsupervised learning", Technical report, HP Labs, 2000.
  • A. D. Chaturvedi, P. E. Green, J. D. Carroll, "k-Modes clustering", Journal of Classification, Vol. 18, pp. 35-36, 2001.
  • D. Pelleg, A. Moore, "x-Means: Extending k-Means with efficient estimation of the number of clusters", 17th International Conference on Machine Learning, 2000.
The K-Medoids Clustering Method
• Now, we shall study a variant of the partitioning approach called the k-Medoids algorithm.
• Medoids are similar in concept to means or centroids, but medoids are always
restricted to be members of the data set. Medoids are most commonly used on data
when a mean or centroid cannot be defined, such as graphs.

• Motivation: We have learnt that the k-Means algorithm is sensitive to outliers because an object with an "extremely large value" may substantially distort the distribution. The effect is particularly exacerbated by the use of the SSE (sum-of-squared error) objective function. The k-Medoids algorithm aims to diminish the effect of outliers.
Basic concepts: K-Medoids Clustering
• The basic concept of this algorithm is to select an object as a cluster center (one representative object per cluster) instead of taking the mean value of the objects in a cluster (as in the k-Means algorithm).
  • We call this cluster representative a cluster medoid, or simply a medoid.
• Initially, it selects a random set of k objects as the set of medoids.
  • Then, at each step, all objects that are not currently medoids are examined one by one to see if they should become medoids.
• The sum-of-absolute error (SAE) function is used as the objective function.
• The procedure terminates when there is no change in SAE between successive iterations (i.e., there is no change in the medoids). A minimal sketch of this swap procedure follows.
• This k-Medoids algorithm is also known as PAM (Partitioning Around Medoids).
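A naive brute-force sketch of the swap procedure described above (illustrative only; it uses Manhattan distance for the SAE-style objective, which is an assumption, and real PAM implementations are considerably more efficient):

```python
import numpy as np

def pam(X, k, max_iter=50, seed=0):
    """Naive PAM sketch: medoids must be actual data points; the objective is the
    total distance (here Manhattan) of every point to its nearest medoid."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))

    def total_cost(idx):
        # L1 (Manhattan) distance of every point to every candidate medoid.
        d = np.abs(X[:, None, :] - X[idx][None, :, :]).sum(axis=2)
        return d.min(axis=1).sum()

    best = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                       # try swapping each current medoid...
            for o in range(len(X)):              # ...with each non-medoid object
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o
                cost = total_cost(trial)
                if cost < best:                  # keep the swap only if SAE decreases
                    best, medoids, improved = cost, trial, True
        if not improved:                         # stop when no swap improves the SAE
            break
    return medoids

# Usage sketch: returns the indices of the k representative objects (medoids) in X.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
print(pam(X, k=2))
```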
PAM (Partitioning around Medoids)

(a) Clusters with $o_1$, $o_2$, $o_3$, and $o_4$ as medoids. (b) Clusters after swapping $o_4$ and $o_5$ ($o_5$ becomes the new medoid).

The k-Medoid method is more robust than k-Means in the presence of outliers,
because a medoid is less influenced by outliers than a mean.
Thank You
