Unsupervised learning: PCA and k-means clustering
Seth Flaxman
Imperial College London
3 July 2019
Based on slides from Simon Rogers & Maurizio Filippone
A problem – too many features
- Aim: to build a classifier that can diagnose leukaemia using gene expression data.
- Data: 27 healthy samples, 11 leukaemia samples (N = 38). Each sample is the expression (activity) level for 3751 genes. (We also have an independent test set.)
- In general, the number of parameters will increase with the number of features – here d = 3751.
  - e.g. logistic regression – w would have length 3751!
  - Fitting lots of parameters is hard.
Features
- For visualisation, most examples we've seen have had only 2 features x = [x1, x2]^T.
- Now, we've been given lots (3751) to start with.
- We need to reduce this number.
- 2 general schemes:
  - Use a subset of the originals.
  - Make new ones by combining the originals.
Making new features
- An alternative to choosing features is making new ones.
- Cluster:
  - Cluster the features (turn our clustering problem around).
  - If we use, say, K-means, our new features will be the K mean vectors.
- Projection/combination:
  - Reduce the number of features by projecting into a lower-dimensional space.
  - Do this by making new features that are (linear) combinations of the old ones.
Projection
[Figure: a 3-dimensional object (a hand) and its 2-dimensional projection (its shadow).]
Projection
- We can project data (d dimensions) into a lower number of dimensions (m): Z = XW
  - X is N × d
  - W is d × m
  - Z is N × m – an m-dimensional representation of our N objects.
- W defines the projection.
  - Changing W is like changing where the light is coming from for the shadow (or rotating the hand).
  - (X is the hand, Z is the shadow.)
- Once we've chosen W we can project test data into this new space too: Znew = Xnew W
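Below is a minimal R sketch of this projection on a small made-up data matrix, using an arbitrary (not yet optimised) W; the names X, W, Z just mirror the symbols above.

```r
set.seed(1)

# Toy data: N = 5 objects, d = 3 original features
X <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)

# An arbitrary projection from d = 3 down to m = 2 dimensions
W <- matrix(rnorm(3 * 2), nrow = 3, ncol = 2)

# Z is N x m: the low-dimensional representation of the N objects
Z <- X %*% W
dim(Z)  # 5 2

# Test data is projected into the same space with the same W
Xnew <- matrix(rnorm(2 * 3), nrow = 2, ncol = 3)
Znew <- Xnew %*% W
```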
Choosing W
- Different W will give us different projections (imagine moving the light).
- Which should we use?
- Not all will represent our data well...
[Figure: a projection of the hand that doesn't look like a hand.]
Principal Components Analysis
- Principal Components Analysis (PCA) is a method for choosing W.
- It finds the columns of W one at a time (define the jth column as wj).
- Each d × 1 column defines one new dimension.
- Consider one of the new dimensions (columns of Z): zj = Xwj
- PCA chooses wj to maximise the variance of zj:
  $$\frac{1}{N}\sum_{n=1}^{N}\left(z_{jn} - \mu_j\right)^2, \qquad \mu_j = \frac{1}{N}\sum_{n=1}^{N} z_{jn}$$
- Once the first one has been found, w2 is found that maximises the variance and is orthogonal to the first one, and so on.
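Not on the original slides, but it may help to see why maximising the variance leads to eigenvectors (as stated on the analytic-solution slide below). A short derivation sketch, assuming X has been mean-centred so the sample covariance is C = X^T X / N:

```latex
% Variance of the projection z_j = X w_j (X mean-centred, C = X^\top X / N):
\mathrm{var}(z_j) = \tfrac{1}{N}\, w_j^\top X^\top X\, w_j = w_j^\top C\, w_j
% Maximise subject to w_j^\top w_j = 1 via a Lagrange multiplier \lambda:
\mathcal{L}(w_j, \lambda) = w_j^\top C\, w_j - \lambda\,\left(w_j^\top w_j - 1\right)
% Setting the gradient with respect to w_j to zero:
\nabla_{w_j} \mathcal{L} = 2 C w_j - 2 \lambda w_j = 0
\quad\Rightarrow\quad C w_j = \lambda w_j
% So w_j is an eigenvector of C and \mathrm{var}(z_j) = \lambda:
% the largest eigenvalue gives the first principal component.
```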
PCA – a visualisation
[Figure: 2-dimensional data plotted in (x1, x2), then the same data projected onto three different directions w, with variances on the line of 0.39, 1.2 and 1.9.]
- Original data in 2 dimensions.
- We'd like a 1-dimensional projection.
- Pick some arbitrary w.
- Project the data onto it.
- Compute the variance (on the line).
- The position on the line is our 1-dimensional representation.
- Different choices of w give different variances (here 0.39, 1.2 and 1.9).
PCA – analytic solution
- Could search for w1, . . . , wM.
- But an analytic solution is available.
- The w are the eigenvectors of the covariance matrix of X.
- R: prcomp or princomp
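A minimal R sketch of this on made-up 2-d data, checking that prcomp recovers the eigenvectors of the covariance matrix (columns may differ in sign, since eigenvectors are only defined up to sign):

```r
set.seed(1)

# Made-up 2-d data with some correlation between the features
X <- matrix(rnorm(200), ncol = 2) %*% matrix(c(1, 0.8, 0, 0.6), 2, 2)

# Eigenvectors of the covariance matrix give the projection directions W
S   <- cov(X)
eig <- eigen(S)
W   <- eig$vectors      # columns are w1, w2 (ordered by decreasing eigenvalue)
eig$values              # variance captured by each component

# Same thing via prcomp (which centres the data for us)
pca <- prcomp(X, center = TRUE, scale. = FALSE)
pca$rotation            # W, up to sign
pca$sdev^2              # the eigenvalues again

# Project onto the first component: a 1-dimensional representation
Xc <- scale(X, center = TRUE, scale = FALSE)
z1 <- Xc %*% W[, 1]
var(as.vector(z1))      # matches the largest eigenvalue
```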
PCA – analytic solution
[Figure: the same 2-dimensional data with the first principal component drawn through it; the variance along this direction is 1.9.]
- What would be the second component?
PCA – leukaemia data
[Figure: scatter plot of z2 against z1 for the 38 samples.]
First two principal components in our leukaemia data (points labelled by class).
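A sketch of how a plot like the one above could be produced. The matrix X (38 × 3751 expression values) and the label vector y are hypothetical stand-ins for the course data, not actual object names from the module:

```r
# X: 38 x 3751 gene expression matrix, y: factor of class labels (hypothetical)
pca <- prcomp(X, center = TRUE, scale. = FALSE)

Z <- pca$x[, 1:2]                      # scores: first two principal components
plot(Z[, 1], Z[, 2], col = as.integer(y), pch = 19,
     xlab = "z1", ylab = "z2",
     main = "Leukaemia data: first two principal components")

# New (test) samples are projected with the same W:
# Znew <- predict(pca, newdata = Xnew)[, 1:2]
```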
Summary
- Sometimes we have too much data (too many dimensions).
- Need to select features.
- Features can be dimensions that already exist.
- Or we can make new ones.
- We've seen one example of each.
Clustering
- What if we just have the xn (and no labels)?
Clustering
- For example:
  - xn is a binary vector indicating which products customer n has bought.
  - Can group customers that buy similar products.
  - Can group products bought together.
- This is known as clustering.
- It is an example of unsupervised learning.
Clustering
[Figure: two scatter plots of the same data; left: raw data, right: points coloured by cluster membership.]
- In this example each object has two attributes: xn = [xn1, xn2]^T
- Left: data.
- Right: data after clustering (points coloured according to cluster membership).
What we'll cover
- K-means
- But note: there are dozens and dozens of other clustering methods out there!
K-means
- Assume that there are K clusters.
- Each cluster is defined by a position in the input space: µk = [µk1, µk2]^T
- Each xn is assigned to its closest cluster.
[Figure: 2-dimensional data with cluster centres marked and points assigned to their closest centre.]
- Distance is normally (squared) Euclidean distance: dnk = (xn − µk)^T (xn − µk)
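A minimal R sketch of this assignment step, on toy data and with made-up cluster positions mu:

```r
set.seed(2)

# Toy 2-d data and K = 3 made-up cluster centres (one row per centre)
X  <- matrix(rnorm(20), ncol = 2)
mu <- matrix(c(0, 0,  2, 2,  -2, 1), ncol = 2, byrow = TRUE)

# d_nk = (x_n - mu_k)^T (x_n - mu_k) for every point n and centre k
D <- sapply(1:nrow(mu), function(k) colSums((t(X) - mu[k, ])^2))

# Assign each point to its closest centre
assignment <- max.col(-D)   # index of the smallest distance in each row
```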
How do we find µk?
- No analytical solution – we can't write down µk as a function of X.
- Use an iterative algorithm:
  1. Guess µ1, µ2, . . . , µK
  2. Assign each xn to its closest µk
  3. znk = 1 if xn assigned to µk (0 otherwise)
  4. Update µk to the average of the xn assigned to µk:
     $$\mu_k = \frac{\sum_{n=1}^{N} z_{nk}\, x_n}{\sum_{n=1}^{N} z_{nk}}$$
  5. Return to 2 until assignments do not change.
- The algorithm will converge... it will reach a point where the assignments don't change.
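A compact R implementation of steps 1-5 on made-up 2-d data; K, the data, and the initialisation are arbitrary choices, and for simplicity the sketch does not handle clusters that become empty:

```r
set.seed(3)

# Made-up data: three 2-d blobs
X <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
           matrix(rnorm(60, mean = 4),  ncol = 2),
           matrix(rnorm(60, mean = -4), ncol = 2))

K <- 3
N <- nrow(X)

# 1. Guess the means: K randomly chosen data points
mu <- X[sample(N, K), , drop = FALSE]

assignment <- rep(0, N)
repeat {
  # 2. Assign each x_n to its closest mu_k (squared Euclidean distance)
  D <- sapply(1:K, function(k) colSums((t(X) - mu[k, ])^2))
  new_assignment <- max.col(-D)

  # 5. Stop when the assignments no longer change
  if (all(new_assignment == assignment)) break
  assignment <- new_assignment

  # 3.-4. Update each mu_k to the average of its assigned points
  for (k in 1:K) {
    mu[k, ] <- colMeans(X[assignment == k, , drop = FALSE])
  }
}

plot(X, col = assignment, pch = 19)
points(mu, pch = 4, cex = 2, lwd = 3)
```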
K-means – example
[Figure: a sequence of plots in (x1, x2) showing the algorithm iterating.]
- Cluster means randomly assigned; points assigned to their closest mean.
- Cluster means updated to the mean of the assigned points.
- Points re-assigned to the closest mean; means updated; and so on.
- The assignment and update steps repeat until the assignments stop changing.
- Final plot: the solution at convergence.
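In practice R's built-in kmeans does all of this (its algorithm = "Lloyd" option matches the update rule above; the default is the Hartigan-Wong variant). A minimal sketch, reusing the made-up X from the implementation above and an arbitrary choice of K and restarts:

```r
# Built-in k-means with several random restarts (nstart)
fit <- kmeans(X, centers = 3, nstart = 10, algorithm = "Lloyd")

fit$centers        # the K mean vectors mu_k
fit$cluster        # cluster assignment for each x_n
fit$tot.withinss   # total within-cluster sum of squared distances

plot(X, col = fit$cluster, pch = 19)
points(fit$centers, pch = 4, cex = 2, lwd = 3)
```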
When does K-means break?
[Figure: data forming two concentric rings in (x1, x2).]
- Data has clear cluster structure.
- The outer cluster cannot be represented as a single point.
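A sketch of this failure case in R, on made-up two-ring data (assumed to resemble the figure): k-means with K = 2 typically splits the data into two halves rather than recovering the inner and outer rings, because each cluster must be summarised by a single mean vector.

```r
set.seed(4)

# Two concentric rings: clear cluster structure that k-means cannot capture
theta <- runif(200, 0, 2 * pi)
r     <- c(rep(0.3, 100), rep(1.2, 100)) + rnorm(200, sd = 0.05)
rings <- cbind(x1 = r * cos(theta), x2 = r * sin(theta))

fit <- kmeans(rings, centers = 2, nstart = 10)

# The rings are not recovered: each cluster is just a region around its mean
plot(rings, col = fit$cluster, pch = 19)
points(fit$centers, pch = 4, cex = 2, lwd = 3)
```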