Machine Learning
Clustering and k-Means Algorithm
Lecture – 7
Instructor: Qamar Askari
Outline
• Supervised vs. Unsupervised Learning
• Clustering
• K-Means algorithm
• Random initialization
• Choosing K – Elbow Method
• Implementation of K-Means Algorithm in Python
Supervised learning
Training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$ (each example comes with a label $y$)
Unsupervised learning
Training set: $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$ (no labels are given)
Clustering
• It is the task of identifying subgroups in the data such that data points
in the same subgroup (cluster) are very similar while data points in
different clusters are very different.
Example applications: market segmentation, social network analysis
K-means algorithm
An example on board
K-Means Algorithm
• 1. Initialize:
• Choose K random data points from the data set to represent the initial centers of the K partitions
• 2. Calculate the group memberships:
$$b_i^t = \begin{cases} 1 & \text{if } \lVert x^t - m_i \rVert = \min_j \lVert x^t - m_j \rVert \\ 0 & \text{otherwise} \end{cases}$$
• 3. Update the centroids:
$$m_i = \frac{\sum_t b_i^t \, x^t}{\sum_t b_i^t}$$
• 4. Repeat steps 2 and 3 until the centroids converge to stable values
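A minimal NumPy sketch of these four steps (the data set, K = 3, and the iteration cap are illustrative assumptions, not part of the lecture):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize: choose K random data points as the initial centers m_i
    m = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # 2. Group memberships: b_i^t = 1 for the nearest center, 0 otherwise
        dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update centroids: mean of the points assigned to each cluster
        new_m = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else m[i]
                          for i in range(K)])
        # 4. Repeat steps 2 and 3 until the centers converge to stable values
        if np.allclose(new_m, m):
            break
        m = new_m
    return m, labels

X = np.random.rand(200, 2)          # placeholder data
centers, labels = kmeans(X, K=3)
```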
An empirical study
Initial centres
Ref: K. Javed et al., "The behavior of K-Means: An empirical study," International Conference on Electrical Engineering, 2008.
Application – Image color quantization
Figure: example image to be quantized.
• Look at the picture above: it does not contain all $256^3$ possible colors.
• Suppose we want to represent it using even fewer colors.
• K-means has an application to this problem, called color quantization.
Figures: original image (37,859 shades of color) and K-means quantized results for K = 2, 10, 15, and 20.
Figure: color quantization example for k = 2, 3, and 10.
Reference: Bishop 2006.
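A hedged sketch of color quantization with K-means (the file name, K value, and use of scikit-learn/Pillow are illustrative assumptions):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load an image and treat every pixel as a point in 3-D RGB space
img = np.asarray(Image.open("photo.jpg").convert("RGB"), dtype=np.float64) / 255.0
h, w, _ = img.shape
pixels = img.reshape(-1, 3)

# Cluster the pixel colors into K groups
K = 10
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)

# Replace every pixel by its cluster centroid: the image now uses only K colors
quantized = km.cluster_centers_[km.labels_].reshape(h, w, 3)
Image.fromarray((quantized * 255).astype(np.uint8)).save("photo_quantized.jpg")
```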
K-means for non-separated clusters
Example: T-shirt sizing (figure: weight vs. height).
K-Means and Globular/Non-Globular Structures
K-means assumes roughly globular (spherical) clusters, so it handles globular structures well but struggles on non-globular ones.
Figure: a globular structure vs. a non-globular structure.
Figure: K-means applied to a non-globular structure.
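A small sketch illustrating this limitation on scikit-learn's synthetic "two moons" data (an assumed stand-in for the lecture's non-globular figure):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaved, non-globular clusters
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Compare the K-means partition with the true moon labels
# (labels are arbitrary, so check both orientations)
agreement = max(np.mean(labels == y_true), np.mean(labels != y_true))
print(f"Fraction of points matching the true moons: {agreement:.2f}")
```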
K-Means is sensitive to outliers
Handling Outliers
1. Remove/ignore the outlier(s). Outliers can be identified from the distance of a point to its assigned centroid (see the sketch below).
2. Assign the outlier(s) to an extra cluster of their own.
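A sketch of option 1, flagging outliers by their distance to the assigned centroid (the data and the 3-standard-deviation threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)  # placeholder data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance of every point to the centroid of its own cluster
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points whose distance is unusually large
threshold = dists.mean() + 3 * dists.std()
outliers = np.where(dists > threshold)[0]
print("Possible outliers:", outliers)
```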
Random initialization
Should have $K < m$ (fewer clusters than training examples).
Randomly pick $K$ training examples.
Set $\mu_1, \ldots, \mu_K$ equal to these $K$ examples.
Local optima
Depending on the initial centroids, K-means can converge to different local optima of the cost function, some of which correspond to poor clusterings.
Random initialization
For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get $c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K$.
    Compute the cost function (distortion/WCSS) $J$.
}
Pick the clustering that gave the lowest cost $J$.
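A sketch of this procedure in Python (scikit-learn's KMeans with n_init=1 and init="random" stands in for a single randomly initialized run; the data and the number of restarts are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)            # placeholder data

best_km, best_cost = None, np.inf
for i in range(100):
    # Randomly initialize and run K-means once
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=i).fit(X)
    # inertia_ is the distortion/WCSS cost of the resulting clustering
    if km.inertia_ < best_cost:
        best_cost, best_km = km.inertia_, km

# Keep the clustering with the lowest cost
labels, centroids = best_km.labels_, best_km.cluster_centers_
```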
What is the right value of K?
Choosing the value of K
Elbow method:
Plot the cost function (distortion) against the number of clusters K and pick the K at the "elbow", where the decrease in cost slows down sharply. Sometimes the curve shows no clear elbow and the choice of K remains ambiguous.
Figure: cost function vs. K (no. of clusters), with and without a clear elbow.
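A sketch of the elbow method (the data set and the range K = 1..8 are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)  # placeholder data

# Within-cluster sum of squares (inertia) for each candidate K
ks = range(1, 9)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, costs, "o-")
plt.xlabel("K (no. of clusters)")
plt.ylabel("Cost function (WCSS)")
plt.show()
```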
Choosing the value of K
Sometimes, you’re running K-means to get clusters to use for some
later/downstream purpose. Evaluate K-means based on a metric for
how well it performs for that later purpose.
E.g. T-shirt sizing: figures of weight vs. height, clustered with two different choices of K.
Implementation of K-Means Algorithm
Discussion from Google Colab
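The Colab notebook itself is not reproduced here; below is a minimal, assumed usage sketch with scikit-learn's KMeans on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 ground-truth clusters (placeholder for the Colab data)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Centroids:\n", km.cluster_centers_)
print("First 10 cluster assignments:", km.labels_[:10])
print("Distortion (WCSS):", km.inertia_)
```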