Clustering Algorithms
Dalya Baron (Tel Aviv University)
XXX Winter School, November 2018
Clustering
[Scatter plot of unlabeled data in the Feature 1–Feature 2 plane; on the following frames, two groups are highlighted as cluster #1 and cluster #2]
Why should we look for clusters?
K-means
Input: measured features, and the number of clusters, k. The algorithm will classify all the objects in the sample into k clusters.
[Scatter plot of the input data in the Feature 1–Feature 2 plane]
K-means
(I) The algorithm randomly places k points that represent the centroids of the clusters.
The algorithm then performs several iterations; in each of them:
(II) The algorithm associates each object with a single cluster, according to its distance from the cluster centroids.
(III) The algorithm recalculates each cluster centroid according to the objects that are associated with it.
[Animation in the Feature 1–Feature 2 plane:
1. Two centroids are randomly placed.
2. Each object is associated with the closest cluster centroid (Euclidean distance).
3. New cluster centroids are computed using the average location of the cluster members.
4. The objects are re-associated with the closest cluster centroid.
5. The process stops when the objects that are associated with a given cluster do not change.]
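A minimal sketch of this loop in plain NumPy (the two-blob dataset and all parameter values are illustrative; in practice one would typically use scikit-learn's KMeans):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain-NumPy sketch of the K-means loop described above."""
    rng = np.random.default_rng(seed)
    # (I) initial centroids: k randomly chosen examples
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # (II) associate each object with the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # stop when the cluster assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # (III) recompute each centroid as the mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Illustrative 2D data: two blobs in the (Feature 1, Feature 2) plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, (100, 2)),
               rng.normal([3, 3], 0.5, (100, 2))])
labels, centroids = kmeans(X, k=2)
```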
The anatomy of K-means
Internal choices and/or internal cost function:
(I) Initial centroids are randomly selected from the set of examples.
(II) The global cost function that is minimized by K-means:
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
where \mu_j is the centroid of cluster C_j, the inner sum runs over the cluster members, and \lVert \cdot \rVert is the Euclidean distance.
[Illustration: k=3, and two different random placements of the initial centroids]
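Because the minimization starts from random centroids, different initializations can converge to different local minima of J. A sketch with scikit-learn on synthetic data (n_init=1 forces a single random start per seed; real applications usually use many restarts):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative dataset with three blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Two different random placements of the k=3 initial centroids
# may converge to different local minima of the cost function J.
for seed in (0, 1):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  J = {km.inertia_:.1f}")
```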
The anatomy of K-means
Input dataset: a list of objects with measured features.
For which datasets should we use K-means?
[Two example datasets shown side by side in the Feature 1–Feature 2 plane]
The anatomy of K-means
Input dataset: a list of objects with measured features.
What happens when we have an outlier in the dataset?
[Scatter plot in the Feature 1–Feature 2 plane with a single point marked: outlier!]
The anatomy of K-means
Input dataset: a list of objects with measured features.
What happens when the features have different physical units? How can we avoid this?
[Two panels: the input dataset, and the K-means output]
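A common remedy (one option among several, sketched here with scikit-learn) is to standardize each feature to zero mean and unit variance before clustering, so that no single unit dominates the Euclidean distance:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative features with very different scales/units.
rng = np.random.default_rng(0)
feature_1 = rng.normal(0.0, 1.0, 200)      # e.g. of order unity
feature_2 = rng.normal(0.0, 1000.0, 200)   # e.g. a thousand times larger
X = np.column_stack([feature_1, feature_2])

# Without scaling, the Euclidean distance is dominated by feature_2.
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
```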
The anatomy of K-means
Hyper-parameters: the number of clusters, k.
Can we find the optimal k using the cost function?
[Clustering results for k=2, k=3, and k=5, and a plot of the minimal cost function versus the number of clusters; the "elbow" of the curve marks a reasonable choice of k]
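A sketch of this elbow diagnostic on synthetic data: fit K-means for a range of k and plot the converged cost function (exposed by scikit-learn as inertia_):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

ks = range(1, 10)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
         for k in ks]

plt.plot(list(ks), costs, "o-")
plt.xlabel("Number of clusters")
plt.ylabel("Minimal cost function")  # the 'elbow' of this curve suggests k
plt.show()
```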
Questions?
Hierarchical Clustering
or, how to visualize complicated similarity measures
Correa-Gallego+ 2016
Hierarchical Clustering
Input: measured features, or a distance matrix that represents the pair-wise distances between the objects. We must also specify a linkage method.
Initialization: each object is a cluster of size 1.
Next: the algorithm merges the two closest clusters into a single cluster. Then, the algorithm re-calculates the distance of the newly-formed cluster to all the rest.
The process stops when all the objects are merged into a single cluster.
[Animation: scatter plot in the Feature 1–Feature 2 plane, alongside the dendrogram that grows with each merge (merge distance on the vertical axis)]
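A minimal sketch of this agglomerative procedure with SciPy (synthetic data; the 'average' linkage here is an illustrative choice, explained on the next slides):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# Start from singleton clusters and repeatedly merge the two closest
# clusters; Z encodes the full merge history (the dendrogram).
Z = linkage(X, method="average", metric="euclidean")

dendrogram(Z)            # merge distance on the vertical axis
plt.ylabel("distance")
plt.show()
```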
The anatomy of Hierarchical Clustering
Internal choices and/or internal cost function:
The linkage method is used to define the distance between two clusters. Methods include: single (minimal pair-wise distance), complete (maximal), average, etc.
[The same dataset and its dendrogram under single, complete, and average linkage]
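A sketch of how the linkage choice enters in SciPy (same kind of synthetic data as above):

```python
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# The linkage rule defines the distance between two clusters A and B:
#   single   : the minimal pair-wise distance between members of A and B
#   complete : the maximal pair-wise distance
#   average  : the average pair-wise distance
linkages = {m: linkage(X, method=m) for m in ("single", "complete", "average")}
```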
The anatomy of Hierarchical Clustering
Hyper-parameters: clusters are defined beneath a distance threshold d. Alternatively, we can select a threshold d that corresponds to the desired number of clusters, k.
[Dendrogram with the cut at threshold d marked]
The anatomy of Hierarchical Clustering
Hyper-parameters: clusters are defined beneath a threshold d. Alternatively, we can select a threshold d that corresponds to the desired number of clusters, k. We can use the resulting dendrogram to choose a "good" threshold:
[Dendrogram with a candidate threshold marked]
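Both ways of cutting the dendrogram are available in SciPy's fcluster (the threshold and k below are illustrative):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
Z = linkage(X, method="average")

# Cut the dendrogram beneath a distance threshold d ...
labels_by_d = fcluster(Z, t=5.0, criterion="distance")  # d = 5.0 is illustrative
# ... or ask directly for the desired number of clusters k.
labels_by_k = fcluster(Z, t=3, criterion="maxclust")
```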
The anatomy of Hierarchical Clustering
Input dataset: can either be a list of objects with measured properties, or a distance matrix that represents pair-wise distances between objects.
What happens if we have an outlier in the dataset?
The anatomy of Hierarchical Clustering
Input dataset: can either be a list of objects with measured properties, or a distance matrix that represents pair-wise distances between objects.
What happens if the dataset does not have clear clusters?
[Dendrogram of a dataset without clear clusters]
The anatomy of Hierarchical Clustering
Input dataset: can either be a list of objects with measured properties, or a distance matrix that represents pair-wise distances between objects.
Different linkage methods are helpful with different datasets.
[Example clusterings with single linkage, complete linkage, and average linkage]
Hierarchical Clustering in Astronomy
"Statistics, Data Mining, and Machine Learning in Astronomy", by Ivezić, Connolly, VanderPlas, and Gray (2013).
Visualizing similarity matrices with Hierarchical Clustering
Input: 10,000 emission line spectra, covering the wavelength range 300-700 nm. There are ~90 emission lines in each spectrum, with an average SNR of 2-4.
[Two example spectra: normalized flux versus wavelength (nm)]
Visualizing similarity matrices with Hierarchical Clustering
We compute a correlation matrix of all the observed wavelengths.
[Correlation matrix: correlation coefficient for each pair of wavelengths (nm)]
Visualizing similarity matrices with Hierarchical Clustering
We convert the correlation matrix to a distance matrix, and build a dendrogram.
Visualizing similarity matrices with Hierarchical Clustering
We reorder the correlation matrix (the wavelengths) according to the resulting dendrogram.
[The correlation matrix with its axes reordered by the dendrogram]
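A sketch of the full pipeline on stand-in data (random "spectra" in place of the real ones; d = 1 - r is one common, but not the only, correlation-to-distance conversion):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

# Stand-in for the real spectra: rows = objects, columns = wavelengths.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(1000, 90))

# Correlation matrix of the observed wavelengths (the columns).
corr = np.corrcoef(spectra, rowvar=False)

# Convert correlations to distances, build the dendrogram,
# and reorder the matrix rows/columns by the dendrogram leaves.
dist = 1.0 - corr
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
order = leaves_list(Z)
corr_reordered = corr[np.ix_(order, order)]
```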
Visualizing similarity matrices with Hierarchical Clustering
de Souza et al. 2015
Questions?
Gaussian Mixture Models
See: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py
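A minimal scikit-learn sketch on synthetic data (the linked example above explores the different covariance_type options in detail):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit a mixture of three Gaussians; unlike K-means, each component has
# its own covariance, and each object gets soft membership probabilities.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)         # hard assignments
probs = gmm.predict_proba(X)    # soft (probabilistic) assignments
```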
Questions?