Chapter 12:
Unsupervised Learning
Jeremy Selva
Introduction
This chapter introduces a diverse set of unsupervised learning techniques:
Principal Components Analysis
Clustering
12.1 The Challenge of Unsupervised Learning
Unsupervised learning is often performed as part of an exploratory data analysis.
Given a set of $p$ features $X_1, X_2, \ldots, X_p$ measured on $n$ observations, the goal is to find interesting groups based on these features.
Interesting groups include
Finding subgroups among the variables
Finding subgroups among the observations
However, there are no clear criteria to determine whether a group is interesting or not, as this is subjective.
Image from https://www.sciencedirect.com/science/article/pii/S1532046415001380
12.2 Principal Components Analysis (PCA)
When we have a large set of (preferably correlated) features $X_1, X_2, \ldots, X_p$, PCA helps to summarise them into a smaller number of representative variables $Z_1, Z_2, \ldots, Z_M$, where $M < p$, that explain most of the variability in the original set. $Z_1, Z_2, \ldots, Z_M$ are also called principal components.
12.2.1 What Are Principal Components?
Given an $n \times p$ data set $X$, with all $p$ variables centred to have mean 0, a principal component $Z_j$, for $1 \le j \le M$, must be expressed in terms of the features $X_1, X_2, \ldots, X_p$ and loadings $\phi_{1j}, \phi_{2j}, \ldots, \phi_{pj}$ as the linear combination
$$Z_j = \phi_{1j} X_1 + \phi_{2j} X_2 + \ldots + \phi_{pj} X_p$$
where the squared loadings sum to one, i.e. $\sum_{k=1}^{p} \phi_{kj}^2 = 1$.
12.2.1 What Are Principal Components?
These $p$ loadings make up the principal component loading vector (direction), $\phi_j = (\phi_{1j}, \phi_{2j}, \ldots, \phi_{pj})^T$.
Images from TileStats youtube video
12.2.1 What Are Principal Components?
The loading vector $\phi_j$ is used to calculate the scores $z_{ij}$, for $1 \le i \le n$ and $1 \le j \le M$, for each principal component $Z_j$:
$$Z_j = (z_{1j}, z_{2j}, \ldots, z_{ij}, \ldots, z_{nj})^T$$
where $z_{ij}$ is the sum of the products of the loadings and the individual data values:
$$z_{ij} = \sum_{k=1}^{p} \phi_{kj} x_{ik} = \phi_{1j} x_{i1} + \phi_{2j} x_{i2} + \ldots + \phi_{pj} x_{ip}$$
Images from TileStats youtube video
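As a minimal sketch (not part of the original slides), the loadings and scores above can be obtained in R with prcomp(); the built-in USArrests data set is used here purely as an assumed example.

# Loadings (phi) and scores (z) from prcomp(); USArrests is a built-in R data set.
pr_out <- prcomp(USArrests, center = TRUE, scale. = TRUE)

phi <- pr_out$rotation  # p x p matrix of loadings phi_kj
z   <- pr_out$x         # n x p matrix of scores z_ij

# Each loading vector has squared loadings summing to one
colSums(phi^2)

# A score is the linear combination of the (centred, scaled) data values with
# the loadings, e.g. the first observation's score on the first principal component:
x_scaled <- scale(USArrests)
sum(phi[, 1] * x_scaled[1, ])  # matches z[1, 1]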
12.2.1 What Are Principal Components?
Each principal component loading vector $\phi_j = (\phi_{1j}, \phi_{2j}, \ldots, \phi_{pj})^T$ is optimised such that
$$\underset{\phi_{1j}, \ldots, \phi_{pj}}{\text{maximize}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=1}^{p} \phi_{kj} x_{ik} \right)^2 \right\} \quad \text{subject to} \quad \sum_{k=1}^{p} \phi_{kj}^2 = 1$$
where $\sum_{k=1}^{p} \phi_{kj} x_{ik} = z_{ij}$ are the scores of the principal component $Z_j$.
As the mean of the scores $z_{ij}$ is 0, $\frac{1}{n} \sum_{i=1}^{n} (z_{ij} - 0)^2$ is the sample variance of the scores of the principal component $Z_j$.
Images from TileStats youtube video
12.2.2 Another Interpretation
Principal components provide low-dimensional linear surfaces that are closest to the n observations.
The first principal component loading vector $\phi_1 = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})^T$ defines the line in $p$-dimensional space that is closest to the $n$ observations.
12.2.2 Another Interpretation
The first two principal component loading vectors $\phi_1$ and $\phi_2$ span the 2D plane in $p$-dimensional space that best fits the $n$ observations.
12.2.2 Another Interpretation
By a special property of the loading matrix ($\phi^{-1} = \phi^T$, i.e. the loading matrix is orthogonal), when $M = \min(p, n - 1)$ it is possible to express each data value exactly in terms of the principal component scores and loadings: $x_{ij} = \sum_{m=1}^{M} z_{im} \phi_{jm}$.
12.2.2 Another Interpretation
However, for $M < \min(p, n - 1)$ we only have the approximation $x_{ij} \approx \sum_{m=1}^{M} z_{im} \phi_{jm}$.
The approximation error $x_{ij} - \sum_{m=1}^{M} z_{im} \phi_{jm}$ gets smaller as $M$ gets closer to $\min(p, n - 1)$.
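A minimal sketch of this (again assuming the built-in USArrests data, not from the slides): reconstruct the scaled data from the first $M$ principal components and watch the approximation error shrink as $M$ grows.

# The mean squared error of the M-dimensional approximation shrinks to zero
# as M approaches min(p, n - 1).
pr_out   <- prcomp(USArrests, center = TRUE, scale. = TRUE)
x_scaled <- scale(USArrests)

for (M in 1:4) {
  x_approx <- pr_out$x[, 1:M, drop = FALSE] %*% t(pr_out$rotation[, 1:M, drop = FALSE])
  cat("M =", M, " MSE of approximation:", mean((x_scaled - x_approx)^2), "\n")
}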
12.2.3 The Proportion of Variance Explained
We would like to know how much of the information in a given data set is lost by projecting the observations onto the first few principal components. We first define some quantities.
The total variance present in a data set (assuming that the variables have mean 0) is
$$\sum_{j=1}^{p} \text{Var}(X_j) = \sum_{j=1}^{p} \left( \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - 0)^2 \right)$$
The variance explained by the $m$th principal component is the variance of its scores (which also have mean 0):
$$\frac{1}{n} \sum_{i=1}^{n} z_{im}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2$$
12.2.3 The Proportion of Variance Explained
Images from TileStats youtube video
12.2.3 The Proportion of Variance Explained
The proportion of variance explained ($PVE$) by the $m$th principal component is
$$PVE_m = \frac{\text{Variance explained by the } m\text{th principal component}}{\text{Total variance present in the data set}} = \frac{\frac{1}{n} \sum_{i=1}^{n} z_{im}^2}{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2}$$
Images from TileStats youtube video
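As a minimal sketch (again assuming the built-in USArrests data), $PVE_m$ can be computed directly from prcomp() output, since the squared standard deviations of the scores are proportional to the variance explained by each component.

pr_out <- prcomp(USArrests, center = TRUE, scale. = TRUE)

pr_var <- pr_out$sdev^2        # variance of the scores of each principal component
pve    <- pr_var / sum(pr_var) # PVE_m for m = 1, ..., p
pve
cumsum(pve)                    # cumulative PVE (interpreted later as an R^2)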
12.2.3 The Proportion of Variance Explained
The variance of the data can also be decomposed into the variance of the first $M$ principal components plus the mean squared error of this $M$-dimensional approximation:
$$\underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2}_{\text{Var. of data}} = \underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{m=1}^{M} z_{im}^2}_{\text{Var. of first } M \text{ PCs}} + \underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \left( x_{ij} - \sum_{m=1}^{M} z_{im} \phi_{jm} \right)^2}_{\text{MSE of } M\text{-dimensional approximation}}$$
Images from TileStats youtube video
12.2.3 The Proportion of Variance Explained
$$\underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2}_{\text{Var. of data}} = \underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{m=1}^{M} z_{im}^2}_{\text{Var. of first } M \text{ PCs}} + \underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \left( x_{ij} - \sum_{m=1}^{M} z_{im} \phi_{jm} \right)^2}_{\text{MSE of } M\text{-dimensional approximation}}$$
Bring the MSE of the $M$-dimensional approximation, $\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \left( x_{ij} - \sum_{m=1}^{M} z_{im} \phi_{jm} \right)^2$, to the left hand side:
$$\underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2}_{\text{Var. of data}} - \underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \left( x_{ij} - \sum_{m=1}^{M} z_{im} \phi_{jm} \right)^2}_{\text{MSE of } M\text{-dimensional approximation}} = \underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{m=1}^{M} z_{im}^2}_{\text{Var. of first } M \text{ PCs}}$$
12.2.3 The Proportion of Variance Explained
$$\underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2}_{\text{Var. of data}} - \underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \left( x_{ij} - \sum_{m=1}^{M} z_{im} \phi_{jm} \right)^2}_{\text{MSE of } M\text{-dimensional approximation}} = \underbrace{\frac{1}{n} \sum_{i=1}^{n} \sum_{m=1}^{M} z_{im}^2}_{\text{Var. of first } M \text{ PCs}}$$
Divide both sides by the variance of the data, $\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2 \ne 0$:
$$1 - \frac{\sum_{i=1}^{n} \sum_{j=1}^{p} \left( x_{ij} - \sum_{m=1}^{M} z_{im} \phi_{jm} \right)^2}{\sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2} = \frac{\frac{1}{n} \sum_{i=1}^{n} \sum_{m=1}^{M} z_{im}^2}{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2} = \sum_{m=1}^{M} \frac{\frac{1}{n} \sum_{i=1}^{n} z_{im}^2}{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2}$$
Recall that
$$PVE_m = \frac{\text{Variance explained by the } m\text{th principal component}}{\text{Total variance present in the data set}} = \frac{\frac{1}{n} \sum_{i=1}^{n} z_{im}^2}{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2}$$
12.2.3 The Proportion of Variance Explained
We have
$$1 - \frac{\sum_{i=1}^{n} \sum_{j=1}^{p} \left( x_{ij} - \sum_{m=1}^{M} z_{im} \phi_{jm} \right)^2}{\sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2} = \sum_{m=1}^{M} \frac{\frac{1}{n} \sum_{i=1}^{n} z_{im}^2}{\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2} = \sum_{m=1}^{M} PVE_m$$
Using $TSS$ as the total sum of squared elements of $X$, and $RSS$ as the residual sum of squares of the $M$-dimensional approximation given by the $M$ principal components,
$$\sum_{m=1}^{M} PVE_m = 1 - \frac{RSS}{TSS} = R^2$$
We can interpret the cumulative $PVE$ as the $R^2$ of the approximation for $X$ given by the first $M$ principal components.
12.2.4 More on PCA
Why do we need to scale the variables to have standard deviation one? Because otherwise, variables measured on larger scales (and hence with larger variances) would dominate the principal components. A minimal sketch of this effect is shown below.
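The sketch below assumes the built-in USArrests data (not from the slides): without scaling, the variable with the largest raw variance dominates the first loading vector.

# Assault has a much larger variance than the other variables simply
# because of the units it is measured in.
apply(USArrests, 2, var)

# Without scaling, the first principal component loads almost entirely on Assault
prcomp(USArrests, scale. = FALSE)$rotation[, 1]

# After scaling each variable to standard deviation one, the variables
# contribute more comparably to the first principal component
prcomp(USArrests, scale. = TRUE)$rotation[, 1]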
12.2.4 More on PCA
Each principal component loading vector is unique, up to a sign flip.
Given two computed versions $\phi$ and $\phi'$ of the same loading vector (for example, from two different software packages), $\phi'_{jm} = k \, \phi_{jm}$ where $k = 1$ or $-1$.
12.2.4 More on PCA
How many principal components are enough? This is typically decided using a scree plot, though the choice can be subjective (see the sketch below).
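A minimal sketch of a scree plot in base R (assuming the USArrests example used earlier):

pr_out <- prcomp(USArrests, center = TRUE, scale. = TRUE)
pve    <- pr_out$sdev^2 / sum(pr_out$sdev^2)

# Scree plot (left) and cumulative PVE (right); look for an "elbow"
par(mfrow = c(1, 2))
plot(pve, type = "b", xlab = "Principal Component",
     ylab = "Proportion of Variance Explained", ylim = c(0, 1))
plot(cumsum(pve), type = "b", xlab = "Principal Component",
     ylab = "Cumulative PVE", ylim = c(0, 1))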
12.2.4 More on PCA
Image from Nature's Points of Significance Series Principal component analysis
12.2.4 More on PCA
Image from Ten quick tips for effective
dimensionality reduction
12.3 Missing Values and Matrix Completion
It is possible for datasets to have missing values. Unfortunately, the statistical learning methods that we
have seen in this book cannot handle missing values.
We could remove observations with missing values, but this is wasteful.
Another way is to set each missing $x_{ij}$ to be the mean of the $j$th column; a minimal sketch of this is shown below.
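The helper below is hypothetical (not from the slides) and sketches this column-mean strategy; x is assumed to be a numeric matrix containing NA entries.

# Replace each missing value with the mean of its column
mean_impute <- function(x) {
  xhat  <- x
  xbar  <- colMeans(x, na.rm = TRUE)
  index <- which(is.na(x), arr.ind = TRUE)  # row/column positions of the NAs
  xhat[index] <- xbar[index[, "col"]]       # fill each NA with its column mean
  xhat
}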
Although this is a common and convenient strategy, often we can do better by exploiting the correlation
between the variables by using principal components.
12.3 Missing Values and Matrix Completion
(Panel tabs: Full Data, Missing Data, Initialisation, 2a function, Iteration 1, Iteration 1 cont.)
Full Data:
data <- data.frame(
  SBP = c(-3, -1, -1, 1, 1, 3),
  DBP = c(-4, -2, 0, 0, 2, 4))
  SBP   DBP
-3.00 -4.00
-1.00 -2.00
-1.00  0.00
 1.00  0.00
 1.00  2.00
 3.00  4.00
12.3 Missing Values and Matrix Completion
iter <- 1
rel_err <- mss_old_pca
thresh <- 1e-2
while (rel_err > thresh) {
  iter <- iter + 1
  # Step 2(a): approximate the current completed data with principal components
  Xapp <- fit_pca(initialisation_data_pca)
  # Step 2(b): replace the missing entries with their approximated values
  initialisation_data_pca[ismiss] <- Xapp[ismiss]
  # Step 2(c): measure the mean squared error over the observed entries
  mss_pca <- mean(((missing_data - Xapp)[!ismiss])^2)
  rel_err <- mss_old_pca - mss_pca
  mss_old_pca <- mss_pca
  cat("Iter:", iter, "\n",
      "Rel. Err:", rel_err, "\n")
}
#> Iter: 2
#> Rel. Err: 0.05686009
#> Iter: 3
#> Rel. Err: 0.2107907
#> Iter: 4
#> Rel. Err: 0.3222623
#> Iter: 5
#> Rel. Err: 0.2031161
#> Iter: 6
#> Rel. Err: 0.08375675
#> Iter: 7
#> Rel. Err: 0.03220227
#> Iter: 8
#> Rel. Err: 0.01423225
#> Iter: 9
#> Rel. Err: 0.00830349
12.3 Missing Values and Matrix Completion
Actual Data Imputed Data
SBP DBP SBP DBP
-3.00 -4.00 -3.00 4.38
-1.00 -2.00 1.28 -2.00
-1.00 0.00 -1.00 0.00
1.00 0.00 1.00 -1.29
1.00 2.00 1.00 2.00
3.00 4.00 -2.57 4.00
12.3 Missing Values and Matrix Completion
Actual Data Soft Imputed Data
SBP DBP SBP DBP
-3.00 -4.00 -3.00 -4.19
-1.00 -2.00 -1.42 -2.00
-1.00 0.00 -1.00 0.00
1.00 0.00 1.00 1.40
1.00 2.00 1.00 2.00
3.00 4.00 2.85 4.00
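A sketch of how the soft-imputed values above might be produced with the softImpute package; the object name Xna and the parameter choices below are assumptions, not confirmed by the slides.

library(softImpute)

# Xna: the SBP/DBP matrix with missing entries coded as NA (assumed name)
fit  <- softImpute(Xna, rank.max = 1, lambda = 0, type = "svd")
Xhat <- complete(Xna, fit)  # fill in the NA entries from the low-rank fit
Xhat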
12.4 Clustering Methods
Clustering splits the observations into distinct, homogeneous groups such that
Observations within each group are quite similar to each other
Observations in different groups are quite different from each other.
In this section we focus on perhaps the two best-known clustering approaches:
K-means Clustering
Hierarchical Clustering.
12.4 Clustering Methods
Image from Francesco Archetti's slides
12.4.1 K-Means Clustering
K-means clustering is an approach to partition a data set into $K$ distinct, non-overlapping clusters.
To measure how much two observations $x_i$ and $x_{i'}$ differ from each other, the squared Euclidean distance is used:
$$\sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$$
where $i$ indexes the $n$ observations ($x_i$ and $x_{i'}$ are two different observations) and $j$ indexes the $p$ features.
Image from Wikipedia
12.4.1 K-Means Clustering
Let $k$ index the $K$ clusters and $C_k$ denote the $k$th cluster.
The sum of all pairwise squared Euclidean distances between the observations in the $k$th cluster is
$$\sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$$
Let $|C_k|$ be the number of observations in the $k$th cluster.
The within-cluster variation for cluster $C_k$, denoted $W(C_k)$, measures on average how much the observations within the cluster $C_k$ differ from each other:
$$W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$$
12.4.1 K-Means Clustering
K-means clustering is an approach to partition a data set into $K$ distinct, non-overlapping clusters such that the sum of within-cluster variations is as small as possible:
$$\underset{C_1, \ldots, C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} W(C_k) \right\}$$
$$\underset{C_1, \ldots, C_K}{\text{minimize}} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\}$$
12.4.1 K-Means Clustering
Using $K = 3$ clusters and $p = 2$ features as an example, the algorithm starts by randomly assigning a cluster label $C_1$, $C_2$ or $C_3$ to each of the observations.
12.4.1 K-Means Clustering
We proceed with iteration 1.
For each cluster, calculate the cluster's centroid. The cluster centroids are computed as the mean of the observations assigned to each cluster:
$$\bar{x}_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$$
where $|C_k|$ is the number of observations in cluster $C_k$.
12.4.1 K-Means Clustering
Next, for each observation, calculate the distance between the observation and each of the centroids.
Reassign the observation to the cluster whose centroid is closest.
Calculate $\sum_{k=1}^{K} W(C_k)$ for that iteration.
12.4.1 K-Means Clustering
Repeat the iteration step on the newly assigned clusters until there is minimal change in $\sum_{k=1}^{K} W(C_k)$.
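A minimal sketch of K-means in R with the built-in kmeans() function; the simulated two-feature data below are made up for illustration.

set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)  # 50 observations, 2 features
x[1:25, 1] <- x[1:25, 1] + 3          # shift half of them to create structure
x[1:25, 2] <- x[1:25, 2] - 4

# nstart = 20 runs the algorithm from 20 random initial assignments and
# keeps the solution with the smallest total within-cluster variation
km_out <- kmeans(x, centers = 3, nstart = 20)
km_out$tot.withinss                   # the minimised sum of W(C_k)
plot(x, col = km_out$cluster, pch = 20, main = "K-means with K = 3")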
12.4.1 K-Means Clustering
Caveats
We must decide how many clusters we expect in the data.
The results obtained will depend on the initial (random) cluster assignment of each observation, which may give inconsistent results.
For this reason, it is important to run the algorithm multiple times from different random initial configurations. One then selects the best solution, i.e. the one for which $\sum_{k=1}^{K} W(C_k)$ is smallest.
Hierarchical clustering is an alternative approach to overcome some of the weaknesses of K-Means
Clustering
12.4.2 Hierarchical Clustering
Hierarchical clustering results in a tree-based representation of the observations, called a dendrogram. Each
leaf of the dendrogram represents one observation. Clusters are obtained by cutting the dendrogram at a
given height.
12.4.2 Hierarchical Clustering
In practice, people often look at the dendrogram and select by eye a sensible number of clusters, based on
the heights of the fusion and the number of clusters desired.
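A minimal sketch of cutting a dendrogram with cutree() on simulated data (made up for illustration):

set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)

hc_out <- hclust(dist(x), method = "complete")
plot(hc_out, main = "Complete linkage")

cutree(hc_out, k = 3)  # cut the tree to obtain 3 clusters
cutree(hc_out, h = 2)  # or cut at a chosen height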
12.4.2 Hierarchical Clustering
Each leaf of the dendrogram represents one observation. Starting out at the bottom of the dendrogram,
each of the n observations is treated as its own cluster.
Image from Kavana Rudresh GitHub page for R-Ladies Philly
12.4.2 Hierarchical Clustering
The algorithm then measures all possible pairwise distances among the $n$ observations.
The two observations with the smallest distance are then fused into one cluster, leaving $n - 1$ clusters.
Image from Kavana Rudresh GitHub page for R-Ladies Philly
12.4.2 Hierarchical Clustering
The algorithm continues until all observations have been fused into a single cluster.
The height of each fusion represents the distance between the clusters being merged.
Image from Kavana Rudresh GitHub page for R-Ladies Philly
12.4.2 Hierarchical Clustering
How did we determine that the cluster {p5, p6} should be fused with the cluster {p4} ?
The concept of dissimilarity between a pair of observations needs to be extended to a pair of groups of observations. This extension is achieved by developing the notion of linkage.
Image from Kavana Rudresh GitHub page for R-Ladies Philly
12.4.2 Hierarchical Clustering
Single-linkage: the distance between two clusters
is defined as the shortest distance between two
points in each cluster
Complete-linkage: the distance between two
clusters is defined as the longest distance between
two points in each cluster.
Average-linkage: the distance between two
clusters is defined as the average distance
between each point in one cluster to every point in
the other cluster.
Centroid-linkage: finds the centroid of cluster 1 and the centroid of cluster 2, and then calculates the distance between the two before merging. A sketch comparing these linkages in R follows.
Image from Kavana Rudresh GitHub page for R-Ladies Philly
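A minimal sketch comparing the four linkages with hclust() on simulated data (made up for illustration):

set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
d <- dist(x)  # Euclidean distance matrix

hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")
hc_average  <- hclust(d, method = "average")
hc_centroid <- hclust(d, method = "centroid")

par(mfrow = c(1, 4))
plot(hc_complete, main = "Complete linkage", xlab = "", sub = "")
plot(hc_single,   main = "Single linkage",   xlab = "", sub = "")
plot(hc_average,  main = "Average linkage",  xlab = "", sub = "")
plot(hc_centroid, main = "Centroid linkage", xlab = "", sub = "")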
12.4.2 Hierarchical Clustering
Mustafa Murat's Hierarchical clustering slides
12.4.2 Hierarchical Clustering
12.4.2 Hierarchical Clustering
Image from Kavana Rudresh GitHub page for R-Ladies Philly
12.4.3 Practical Issues in Clustering
Image from Kavana Rudresh GitHub page for R-Ladies Philly
12.4.3 Practical Issues in Clustering
Mustafa Murat's Hierarchical clustering slides
12.4.2 Hierarchical Clustering
Different dissimilarity measures give different clusters.
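A minimal sketch of this point, on made-up data with three features (correlation-based dissimilarity needs at least three): Euclidean and correlation-based dissimilarities can give different dendrograms.

set.seed(2)
x <- matrix(rnorm(30 * 3), ncol = 3)

d_euclidean   <- dist(x)                 # Euclidean distance between observations
d_correlation <- as.dist(1 - cor(t(x)))  # 1 - correlation between observations

par(mfrow = c(1, 2))
plot(hclust(d_euclidean,   method = "complete"), main = "Euclidean distance")
plot(hclust(d_correlation, method = "complete"), main = "Correlation-based distance")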
12.4.3 Practical Issues in Clustering
Should the observations or features be standardized to have standard deviation of one?
In the case of hierarchical clustering,
What dissimilarity measure should be used?
What type of linkage should be used?
Where should we cut the dendrogram in order to obtain clusters?
In the case of K-means clustering,
How many clusters should we look for in the data?
12.4.3 Practical Issues in Clustering
Since K-means and hierarchical clustering force every observation into a cluster, the clusters found may be heavily distorted by outliers that do not belong to any cluster.
Clustering methods generally are not very robust to perturbations to the data. We recommend clustering
subsets of the data in order to get a sense of the robustness of the clusters obtained.
12.4.3 Practical Issues in Clustering
EduClust webpage