
Workshop on Statistics and Machine Learning in Practice

Topic: Unsupervised learning (clustering)

STAT&ML LAB (https://www.ctanujit.org/statml-lab.html)

by
Swarup Chattopadhyay, ISI, Kolkata
What is clustering?
• The organization of unlabeled data into similarity groups called clusters.

• A cluster is a collection of data items that are "similar" to one another and "dissimilar" to the data items in other clusters.
  • high intra-class similarity
  • low inter-class similarity

• More informally, finding natural groupings among objects.


What is a natural grouping among these objects?

Clustering is subjective

(Figure: example groupings of the same objects: Simpson's family, school employees, females, males.)

Computer vision application: image segmentation
(Figure: image segmentation example.)
Data Representations for Clustering
• Input data to algorithm is usually a vector (also called a “tuple” or “record”)

• Types of data
• Numerical
• Categorical
• Boolean
• Example: Clinical Sample Data
• Age (numerical)
• Weight (numerical)
• Gender (categorical)
• Diseased? (boolean)

• Must also include a method for computing the similarity of, or distance between, vectors (an encoding sketch for the clinical example follows).
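As a small illustration (not from the slides; the field values and encoding choices are hypothetical), one way such a mixed-type clinical record might be turned into a numeric vector is sketched below: numerical fields are kept as-is, the categorical field is one-hot encoded, and the boolean flag becomes 0/1.

```python
import numpy as np

# Hypothetical clinical record with mixed attribute types.
record = {"age": 54, "weight": 71.2, "gender": "female", "diseased": True}

def encode(rec):
    # Numerical features are used as-is; the categorical "gender" is
    # one-hot encoded; the boolean flag becomes 0/1.
    gender_onehot = [1.0, 0.0] if rec["gender"] == "female" else [0.0, 1.0]
    return np.array([rec["age"], rec["weight"], *gender_onehot, float(rec["diseased"])])

print(encode(record))   # -> [54.  71.2  1.  0.  1.]
```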
What do we need for clustering?
• Proximity measure, either
  • Similarity measure s(X_i, X_k): large if X_i, X_k are similar
  • Dissimilarity (or distance) measure d(X_i, X_k): small if X_i, X_k are similar
  • (similar pair: large s, small d; dissimilar pair: large d, small s)

• Criterion function to evaluate a clustering

• Algorithm to compute the clustering
  • For example, by optimizing the criterion function
Distance (dissimilarity) measures
• Euclidean distance
  $d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} (x_{ik} - x_{jk})^2}$
  • Translation invariant

• Manhattan (city block) distance
  $d(x_i, x_j) = \sum_{k=1}^{d} |x_{ik} - x_{jk}|$
  • Approximation to Euclidean distance
  • Cheaper to compute

• They are special cases of the Minkowski distance:
  $d_p(x_i, x_j) = \left( \sum_{k=1}^{d} |x_{ik} - x_{jk}|^p \right)^{1/p}$
  • p is a positive integer
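A minimal numerical sketch of these formulas (assuming NumPy; the two vectors are made up for illustration):

```python
import numpy as np

def minkowski(x_i, x_j, p):
    """Minkowski distance between two vectors; p=2 gives Euclidean, p=1 gives Manhattan."""
    return float(np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p))

x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([4.0, 0.0, 3.0])

print(minkowski(x_i, x_j, p=2))   # Euclidean (L2): sqrt(9 + 4 + 0) ≈ 3.606
print(minkowski(x_i, x_j, p=1))   # Manhattan (L1): 3 + 2 + 0 = 5.0
```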
Cluster evaluation (a hard problem)

• Intra-cluster cohesion (compactness):

• Cohesion measures how near the data points in a cluster are to the cluster centroid.
• Sum of squared error (SSE) is a commonly used measure.

• Inter-cluster separation (isolation):

• Separation means that different cluster centroids should be far away from one another.

• In most applications, expert judgments are still the key


How many clusters?

• Possible approaches
• Fix the number of clusters to k
• Find the best clustering according to the criterion function
Clustering techniques
• Partitioning methods
  • k-Means algorithm [1957, 1967]
  • k-Medoids algorithm
  • k-Modes [1998]
  • Fuzzy c-means algorithm [1999]
• Hierarchical methods
  • Divisive: DIANA [1990]
  • Agglomerative: AGNES [1990], BIRCH [1996], CURE [1998], ROCK [1999], Chameleon [1999]
• Density-based methods
  • STING [1997], DBSCAN [1996], CLIQUE [1998], DENCLUE [1998], OPTICS [1999], Wave Cluster [1998]
• Graph-based methods
  • MST Clustering [1999], OPOSSUM [2000], SNN Similarity Clustering [2001, 2003]
• Model-based clustering
  • EM Algorithm [1977], Auto class [1996], COBWEB [1987], ANN Clustering [1982, 1989]
Clustering techniques

• In this lecture, we shall try to cover only the following clustering techniques:
  • Partitioning
    • k-Means algorithm
    • PAM (k-Medoids algorithm)
  • Hierarchical
    • Divisive algorithm
    • Agglomerative algorithm
Hierarchical clustering
• Consider flat clustering: for some data, hierarchical clustering is more appropriate than 'flat' clustering.
• Hierarchical clustering example: biological taxonomy.
Types of hierarchical clustering
Divisive (top down) clustering
• Starts with all data points in one cluster, the root, then
• Splits the root into a set of child clusters. Each child cluster is recursively divided
further
• stops when only singleton clusters of individual data points remain, i.e., each
cluster with only a single point
Agglomerative (bottom up) clustering
• The dendrogram is built from the bottom level by
• merging the most similar (or nearest) pair of clusters
• stopping when all the data points are merged into a single cluster (i.e., the root
cluster).
• The number of dendrograms with n leaves = $(2n-3)!/[2^{\,n-2}\,(n-2)!]$
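As an illustrative sketch (not part of the slides), the agglomerative procedure just described can be run with SciPy's hierarchy module; the toy data below are made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy 2-D data: two loose groups of points.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.1]])

# Agglomerative clustering: repeatedly merge the two nearest clusters.
# "average" linkage uses the mean pairwise distance between clusters.
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram to obtain 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)           # e.g. [1 1 1 2 2 2]

# dendrogram(Z) would draw the tree (requires matplotlib).
```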
Summary of Hierarchical Clustering Methods
• No need to specify the number of clusters in advance.

• The hierarchical nature maps nicely onto human intuition for some domains.

• They do not scale well: time complexity of at least O(n²), where n is the total number of objects.

• As with any heuristic search algorithm, local optima are a problem.

• Interpretation of results is (very) subjective.


K-Means clustering

• K-means (MacQueen, 1967) is a partition clustering algorithm

• Let the set of data points be $D = \{X_1, X_2, \ldots, X_n\}$, where $X_i = (X_{i1}, X_{i2}, \ldots, X_{ir})$ is a vector in $\mathbb{R}^r$, and r is the number of dimensions.

• The k-means algorithm partitions the given data into k clusters:


• Each cluster has a cluster center, called centroid.
• k is specified by the user
K-Means clustering

• Given k, the k-means algorithm works as follows:

1. Choose k (random) data points (seeds) to be the initial centroids (cluster centers)
2. Assign each data point to the closest centroid
3. Re-compute the centroids using the current cluster memberships
4. If a convergence criterion is not met, repeat steps 2 and 3
K-Means clustering
Input: D is a dataset containing n objects, k is the number of cluster
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.
2. For each of the objects in D do
• Compute the distance between the current object and the k cluster centroids
• Assign the current object to the cluster whose centroid is closest.
3. Compute the “cluster centers” of each cluster. These become the
new cluster centroids.
4. Repeat step 2-3 until the convergence criterion is satisfied
5. Stop
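A minimal NumPy sketch of these steps is shown below; it is illustrative only (empty clusters and other corner cases are not handled) and uses centroid stability as the convergence criterion:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following the steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each centroid as the mean of its current members
        # (an empty cluster would need special handling in real code).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat until the centroids stop changing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)   # roughly [0, 0] and [5, 5]
```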
K-means convergence (stopping) criterion
• no (or minimum) re-assignments of data points to different clusters, or
• no (or minimum) change of centroids, or
• minimum decrease in the sum of squared error (SSE),

  $SSE = \sum_{j=1}^{k} \sum_{X \in C_j} d(X, m_j)^2$

  where
  • $C_j$ is the j-th cluster,
  • $m_j$ is the centroid of cluster $C_j$ (the mean vector of all the data points in $C_j$),
  • $d(X, m_j)$ is the (Euclidean) distance between data point X and centroid $m_j$.
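The SSE above translates directly into code; a small sketch (assuming NumPy, with the label array indexing into an array of centroids):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    diffs = X - centroids[labels]      # vector from each point to its own centroid
    return float(np.sum(diffs ** 2))
```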
K-means Clustering: Steps 1-5
(Figure sequence: three centroids k1, k2, and k3 are initialized on a 2-D scatter plot; points are assigned to the nearest centroid, the centroids are re-computed, and the assignment/update steps repeat until the clusters stabilize.)
Comments on k-Means algorithm (k?)
• We may not have an idea about the possible number of clusters for high-dimensional data, or for data that cannot easily be scatter-plotted.
• Normally k ≪ n, and a common heuristic is k ≈ √n.
• There is no principled way to know what the value of k ought to be. We may try successive values of k, starting from 2, using two criteria (a short code sketch follows):
  • The Elbow Method
    • Calculate the Within-Cluster Sum of Squared Errors (WSS) for different values of k, and choose the k at which the decrease in WSS first starts to level off (the "elbow").
  • The Silhouette Method
    • Compute the average silhouette for different values of k; a high value is desirable and indicates that points are placed in the correct clusters.
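An illustrative sketch of both criteria using scikit-learn (an assumption, not part of the slides; the synthetic three-blob data set is made up): `inertia_` gives the WSS of the fitted model, and `silhouette_score` gives the average silhouette.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, (100, 2)) for c in ([0, 0], [6, 0], [3, 6])])

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss = km.inertia_                       # within-cluster sum of squared errors (WSS)
    sil = silhouette_score(X, km.labels_)   # mean silhouette over all points
    print(f"k={k}  WSS={wss:.1f}  silhouette={sil:.3f}")

# Elbow method: pick the k where the decrease in WSS starts to level off.
# Silhouette method: pick the k with the highest average silhouette (here, typically k = 3).
```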
Comments on k-Means algorithm
Distance Measurement:
• To assign a point to the closest centroid, we need a proximity measure that should
quantify the notion of “closest” for the objects under clustering.

• Usually, Euclidean distance (L2 norm) is the best measure when the objects are points in an n-dimensional Euclidean space.
• Another measure, namely cosine similarity, is more appropriate when the objects are documents (a small illustration follows).
• Further, there may be other types of proximity measures that are appropriate in the context of specific applications.
  • For example, Manhattan distance (L1 norm), the Jaccard measure, etc.
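As a small illustration of cosine similarity on documents (not from the slides; scikit-learn and the toy sentences are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["clustering groups similar documents together",
        "k-means clustering of text documents",
        "the stock market fell sharply today"]

tfidf = TfidfVectorizer().fit_transform(docs)   # sparse document-term matrix
print(cosine_similarity(tfidf).round(2))
# The two clustering-related documents are much more similar to each other
# than either is to the unrelated news sentence.
```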
Comments on k-Means algorithm
Distance Measurement:
• Thus, in the context of different measures, the sum-of-squared error (i.e., the objective function / convergence criterion) of a clustering can be stated as follows.
• Data in Euclidean space (L2 norm):

  $SSE = \sum_{j=1}^{k} \sum_{X \in C_j} d(X, m_j)^2$

• Data in Euclidean space (L1 norm):
  • The Manhattan distance (L1 norm) is used as the proximity measure, and the objective is to minimize the sum-of-absolute error, denoted SAE and defined as

  $SAE = \sum_{j=1}^{k} \sum_{X \in C_j} |d(X, m_j)|$
Comments on k-Means algorithm

Note:

• When SSE (L2 norm) is used as the objective function and the objective is to minimize it, the cluster centroid is the mean value of the objects in the cluster.

• When the objective function is defined as SAE (L1 norm), minimizing the objective function implies that the cluster centroid is the median of the cluster.
Comments on k-Means algorithm
We know

  $SSE = \sum_{j=1}^{k} \sum_{x \in C_j} d(x, m_j)^2 = \sum_{j=1}^{k} \sum_{x \in C_j} (m_j - x)^2$

To minimize SSE, we set

  $\frac{\partial (SSE)}{\partial m_j} = 0$

Thus,

  $\frac{\partial}{\partial m_j} \sum_{j=1}^{k} \sum_{x \in C_j} (m_j - x)^2 = 0$
Comments on k-Means algorithm
Or,

  $\sum_{x \in C_j} 2\,(m_j - x) = 0$

Or,

  $n_j \cdot m_j = \sum_{x \in C_j} x$, i.e., $m_j = \frac{1}{n_j} \sum_{x \in C_j} x$

• Thus, the best centroid for minimizing the SSE of a cluster is the mean of the objects in the cluster.
• In a similar way, the best centroid for minimizing the SAE of a cluster is the median of the objects in the cluster.
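A quick numerical check of this result (a sketch, not from the slides): for a one-dimensional cluster with an outlier, a grid search over candidate centroid values confirms that the SSE is minimized at the mean and the SAE at the median.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])    # one-dimensional "cluster" with an outlier
candidates = np.linspace(0, 100, 10001)      # candidate centroid values m_j

sse_best = min((np.sum((x - m) ** 2), m) for m in candidates)
sae_best = min((np.sum(np.abs(x - m)), m) for m in candidates)

print(sse_best[1], x.mean())      # SSE minimizer ≈ 22.0, the mean
print(sae_best[1], np.median(x))  # SAE minimizer ≈ 3.0, the median
```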
Why use K-means?
Strengths:
• Simple: easy to understand and to implement

• Efficient: Time complexity: O(tkn),


• where n is the number of data points,
• k is the number of clusters, and
• t is the number of iterations.
• Since both k and t are typically small, k-means is considered a linear algorithm.

• K-means is the most popular clustering algorithm.

• Note that: it terminates at a local optimum if SSE is used.


Weaknesses of K-means
• The algorithm is only applicable if the mean is defined.
• For example, k-Means does not work on categorical data because mean
cannot be defined.
• The user needs to specify k.
• Unable to handle noisy data and outliers
• Outliers are data points that are very far away from other data points.
• Outliers could be errors in the data recording or some special data points with
very different values.
• Not suitable for discovering clusters with non-convex shapes
Outliers
(Figure: outliers distort the centroids, producing undesirable clusters instead of the ideal clusters.)

Sensitivity to initial seeds
(Figure: different initial seeds can lead to different final clusterings.)

Special data structures
• The k-means algorithm is not suitable for discovering clusters with non-convex shapes.
(Figure: two natural non-convex clusters vs. the clusters found by k-means.)
Different variants of k-means algorithm
• There are quite a few variants of the k-Means algorithm. They can differ in the procedure for selecting the initial k means, the calculation of proximity, and the strategy for computing cluster means.
A few variants of the k-Means algorithm include:
• Bisecting k-Means (addressing the issue of the initial choice of cluster means).
  • M. Steinbach, G. Karypis and V. Kumar, "A comparison of document clustering techniques", Proceedings of the KDD Workshop on Text Mining, 2000.
• Mean of clusters (proposing various strategies to define means and variants of means).
  • B. Zhang, "Generalised k-Harmonic Means – Dynamic weighting of data in unsupervised learning", Technical report, HP Labs, 2000.
  • A. D. Chaturvedi, P. E. Green, J. D. Carroll, "k-Modes clustering", Journal of Classification, Vol. 18, pp. 35-36, 2001.
  • D. Pelleg, A. Moore, "x-Means: Extending k-Means with efficient estimation of the number of clusters", 17th International Conference on Machine Learning, 2000.
The K-Medoids Clustering Method
• Now, we shall study a variant of the partitioning approach called the k-Medoids algorithm.
• Medoids are similar in concept to means or centroids, but medoids are always
restricted to be members of the data set. Medoids are most commonly used on data
when a mean or centroid cannot be defined, such as graphs.

• Motivation: We have learnt that the k-Means algorithm is sensitive to outliers because an object with an "extremely large value" may substantially distort the distribution. The effect is particularly exacerbated by the use of the SSE (sum-of-squared error) objective function. The k-Medoids algorithm aims to diminish the effect of outliers.
Basic concepts: K-Medoids Clustering
• The basic concept of this algorithm is to select an object as a cluster center (one representative object per cluster) instead of taking the mean value of the objects in a cluster (as in the k-Means algorithm).
  • We call this cluster representative a cluster medoid, or simply a medoid.
• Initially, it selects a random set of k objects as the set of medoids.
  • Then, at each step, all objects that are not currently medoids are examined one by one to see if they should become medoids.
• The sum-of-absolute error (SAE) function is used as the objective function.
• The procedure terminates when there is no change in SAE between successive iterations (i.e., there is no change in the medoids). A minimal sketch of this swap procedure follows.
• This k-Medoids algorithm is also known as PAM (Partitioning Around Medoids).
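A naive brute-force sketch of the swap procedure described above (illustrative only; it uses Manhattan distance for the SAE-style objective, which is an assumption, and real PAM implementations are considerably more efficient):

```python
import numpy as np

def pam(X, k, max_iter=50, seed=0):
    """Naive PAM sketch: medoids must be actual data points; the objective is the
    total distance (here Manhattan) of every point to its nearest medoid."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))

    def total_cost(idx):
        # L1 (Manhattan) distance of every point to every candidate medoid.
        d = np.abs(X[:, None, :] - X[idx][None, :, :]).sum(axis=2)
        return d.min(axis=1).sum()

    best = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):                       # try swapping each current medoid...
            for o in range(len(X)):              # ...with each non-medoid object
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o
                cost = total_cost(trial)
                if cost < best:                  # keep the swap only if SAE decreases
                    best, medoids, improved = cost, trial, True
        if not improved:                         # stop when no swap improves the SAE
            break
    return medoids

# Usage sketch: returns the indices of the k representative objects (medoids) in X.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
print(pam(X, k=2))
```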
PAM (Partitioning around Medoids)

(a) Clusters with $o_1$, $o_2$, $o_3$, and $o_4$ as medoids. (b) Clusters after swapping $o_4$ and $o_5$ ($o_5$ becomes the new medoid).

The k-Medoid method is more robust than k-Means in the presence of outliers,
because a medoid is less influenced by outliers than a mean.
Thank You
