CS 229 – Machine Learning                                https://stanford.edu/~shervine
VIP Cheatsheet: Unsupervised Learning
Afshine Amidi and Shervine Amidi
August 12, 2018
Introduction to Unsupervised Learning

❒ Motivation – The goal of unsupervised learning is to find hidden patterns in unlabeled data {x^(1), ..., x^(m)}.

❒ Jensen's inequality – Let f be a convex function and X a random variable. We have the following inequality:

    E[f(X)] ≥ f(E[X])
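As a quick worked instance of this inequality (an added illustration, not part of the original cheatsheet), take the convex function f(x) = x²:

```latex
% f(x) = x^2 is convex, so Jensen's inequality gives E[X^2] >= (E[X])^2,
% which is equivalent to the familiar fact Var(X) = E[X^2] - (E[X])^2 >= 0.
E[f(X)] = E[X^2] \geq (E[X])^2 = f(E[X])
```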
Expectation-Maximization
❒ Latent variables – Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:

Setting                   Latent variable z    x|z              Comments
Mixture of k Gaussians    Multinomial(φ)       N(µ_j, Σ_j)      µ_j ∈ R^n, φ ∈ R^k
Factor analysis           N(0, I)              N(µ + Λz, ψ)     µ_j ∈ R^n
❒ Algorithm – The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step), as follows:

• E-step: Evaluate the posterior probability Q_i(z^(i)) that each data point x^(i) came from a particular cluster z^(i), as follows:

    Q_i(z^(i)) = P(z^(i) | x^(i); θ)

• M-step: Use the posterior probabilities Q_i(z^(i)) as cluster-specific weights on the data points x^(i) to separately re-estimate each cluster model, as follows:

    θ_i = argmax_θ  Σ_i ∫_{z^(i)} Q_i(z^(i)) log( P(x^(i), z^(i); θ) / Q_i(z^(i)) ) dz^(i)
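To make the E-step/M-step alternation concrete, here is a minimal NumPy/SciPy sketch of EM for a mixture of k Gaussians, the first setting in the table above. The function name `em_gmm`, the initialization scheme and the fixed iteration count are illustrative assumptions, not part of the cheatsheet.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """Minimal EM sketch for a mixture of k Gaussians on the m x n data matrix X."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    # Initialization: uniform priors, random means, identity covariances
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)].astype(float)
    sigma = np.stack([np.eye(n)] * k)

    for _ in range(n_iter):
        # E-step: posterior Q_i(z^(i) = j) = P(z = j | x^(i); theta)
        q = np.zeros((m, k))
        for j in range(k):
            q[:, j] = phi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
        q /= q.sum(axis=1, keepdims=True)

        # M-step: re-estimate each cluster model with the posteriors as weights
        for j in range(k):
            w = q[:, j]
            phi[j] = w.mean()
            mu[j] = w @ X / w.sum()
            diff = X - mu[j]
            sigma[j] = (w[:, None] * diff).T @ diff / w.sum()
    return phi, mu, sigma
```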
k-means clustering

We note c^(i) the cluster of data point i and µ_j the center of cluster j.

❒ Algorithm – After randomly initializing the cluster centroids µ_1, µ_2, ..., µ_k ∈ R^n, the k-means algorithm repeats the following step until convergence:

    c^(i) = argmin_j ||x^(i) − µ_j||²    and    µ_j = ( Σ_{i=1}^m 1{c^(i) = j} x^(i) ) / ( Σ_{i=1}^m 1{c^(i) = j} )

❒ Distortion function – In order to see if the algorithm converges, we look at the distortion function defined as follows:

    J(c, µ) = Σ_{i=1}^m ||x^(i) − µ_c^(i)||²
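Below is a minimal NumPy sketch of this alternating update, using the distortion J(c, µ) as the stopping criterion. The function name `kmeans` and the tolerance/iteration parameters are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: alternate cluster assignment and centroid update."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(m, size=k, replace=False)].astype(float)   # random centroid initialization
    prev_J = np.inf
    for _ in range(n_iter):
        # Assignment step: c^(i) = argmin_j ||x^(i) - mu_j||^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (m, k) squared distances
        c = d2.argmin(axis=1)
        J = d2[np.arange(m), c].sum()                             # distortion J(c, mu)
        # Update step: mu_j = mean of the points currently assigned to cluster j
        for j in range(k):
            if np.any(c == j):
                mu[j] = X[c == j].mean(axis=0)
        if prev_J - J < tol:                                      # stop once J stops decreasing
            break
        prev_J = J
    return c, mu
```

Since the result depends on the random initialization of the centroids, the algorithm is typically run several times and the solution with the lowest distortion is kept.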
Hierarchical clustering

❒ Algorithm – It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.

❒ Types – There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, as summed up in the table below:

Ward linkage                      Average linkage                     Complete linkage
Minimize within-cluster           Minimize average distance           Minimize maximum distance
distance                          between cluster pairs               between cluster pairs
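As an added illustration (not from the cheatsheet), the three linkage criteria above correspond directly to the `method` argument of SciPy's agglomerative clustering routines; the sample data and cluster count below are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                          # arbitrary sample data

# Build the merge hierarchy with each of the three linkage criteria from the table
for method in ("ward", "average", "complete"):
    Z = linkage(X, method=method)                     # (m-1, 4) merge history
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the hierarchy into 3 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes
```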
Clustering assessment metrics

In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.

❒ Silhouette coefficient – By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:

    s = (b − a) / max(a, b)
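A minimal sketch, assuming scikit-learn is available; the data and the cluster assignment below are arbitrary placeholders used only to show the call:

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # arbitrary sample data
labels = (X[:, 0] > 0).astype(int)               # any cluster assignment works here

s_per_sample = silhouette_samples(X, labels)     # s = (b - a) / max(a, b) for each sample
print("mean silhouette:", silhouette_score(X, labels))
```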
❒ Calinski-Harabaz index – By noting k the number of clusters, B_k and W_k the between- and within-clustering dispersion matrices respectively defined as

    B_k = Σ_{j=1}^k n_j (µ_j − µ)(µ_j − µ)^T,    W_k = Σ_{i=1}^m (x^(i) − µ_c^(i))(x^(i) − µ_c^(i))^T

where n_j and µ_j are the number of points in cluster j and its centroid, and µ is the overall centroid, the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the denser and better separated the clusters are. It is defined as follows:

    s(k) = ( Tr(B_k) / Tr(W_k) ) × ( (N − k) / (k − 1) )
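A minimal NumPy sketch of this score, assuming `X` holds the samples and `labels` their cluster assignments; the function name `calinski_harabaz` is an illustrative choice:

```python
import numpy as np

def calinski_harabaz(X, labels):
    """Compute Tr(B_k)/Tr(W_k) * (N - k)/(k - 1) for a given clustering."""
    N, _ = X.shape
    clusters = np.unique(labels)
    k = len(clusters)
    mu = X.mean(axis=0)                               # overall centroid
    tr_B, tr_W = 0.0, 0.0
    for j in clusters:
        Xj = X[labels == j]
        mu_j = Xj.mean(axis=0)
        tr_B += len(Xj) * np.sum((mu_j - mu) ** 2)    # trace of between-cluster dispersion
        tr_W += np.sum((Xj - mu_j) ** 2)              # trace of within-cluster dispersion
    return (tr_B / tr_W) * (N - k) / (k - 1)
```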
Principal component analysis

It is a dimension reduction technique that finds the variance-maximizing directions onto which to project the data.

❒ Eigenvalue, eigenvector – Given a matrix A ∈ R^{n×n}, λ is said to be an eigenvalue of A if there exists a vector z ∈ R^n \ {0}, called an eigenvector, such that we have:

    Az = λz

❒ Spectral theorem – Let A ∈ R^{n×n}. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U ∈ R^{n×n}. By noting Λ = diag(λ_1, ..., λ_n), we have:

    ∃ Λ diagonal,  A = U Λ U^T

Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.
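As a small added illustration, NumPy's `eigh` routine performs this symmetric eigendecomposition; the matrix below is arbitrary:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                        # arbitrary symmetric matrix

lam, U = np.linalg.eigh(A)                        # eigenvalues (ascending) and orthogonal eigenvectors
print(np.allclose(U @ np.diag(lam) @ U.T, A))     # A = U Λ U^T, as in the spectral theorem
print(U[:, -1])                                   # principal eigenvector (largest eigenvalue)
```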
❒ Algorithm – The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:

• Step 1: Normalize the data to have a mean of 0 and standard deviation of 1:

    x_j^(i) ← (x_j^(i) − µ_j) / σ_j    where    µ_j = (1/m) Σ_{i=1}^m x_j^(i)    and    σ_j² = (1/m) Σ_{i=1}^m (x_j^(i) − µ_j)²

• Step 2: Compute Σ = (1/m) Σ_{i=1}^m x^(i) (x^(i))^T ∈ R^{n×n}, which is symmetric with real eigenvalues.

• Step 3: Compute u_1, ..., u_k ∈ R^n, the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.

• Step 4: Project the data on span_R(u_1, ..., u_k). This procedure maximizes the variance among all k-dimensional spaces.
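A minimal NumPy sketch of these four steps; the function name `pca` and the use of `np.linalg.eigh` are illustrative choices rather than a prescribed implementation:

```python
import numpy as np

def pca(X, k):
    """Project the m x n data matrix X onto its top-k principal directions."""
    m, n = X.shape
    # Step 1: normalize each feature to mean 0 and standard deviation 1
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: empirical covariance matrix (symmetric, real eigenvalues)
    Sigma = X.T @ X / m
    # Step 3: k orthogonal eigenvectors with the largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigh returns eigenvalues in ascending order
    U = eigvecs[:, -k:][:, ::-1]                  # top-k principal eigenvectors
    # Step 4: project the data onto span(u_1, ..., u_k)
    return X @ U
```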
Independent component analysis

It is a technique meant to find the underlying generating sources.

❒ Assumptions – We assume that our data x has been generated by the n-dimensional source vector s = (s_1, ..., s_n), where the s_i are independent random variables, via a mixing and non-singular matrix A as follows:

    x = As

The goal is to find the unmixing matrix W = A^{-1} by an update rule.

❒ Bell and Sejnowski ICA algorithm – This algorithm finds the unmixing matrix W by following the steps below:

• Write the probability of x = As = W^{-1} s as:

    p(x) = Π_{i=1}^n p_s(w_i^T x) · |W|

• Write the log likelihood given our training data {x^(i), i ∈ [[1, m]]} and by noting g the sigmoid function as:

    l(W) = Σ_{i=1}^m ( Σ_{j=1}^n log( g'(w_j^T x^(i)) ) + log |W| )

Therefore, the stochastic gradient ascent learning rule is such that for each training example x^(i), we update W as follows:

    W ← W + α ( [1 − 2g(w_1^T x^(i)), 1 − 2g(w_2^T x^(i)), ..., 1 − 2g(w_n^T x^(i))]^T (x^(i))^T + (W^T)^{-1} )
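A minimal NumPy sketch of this stochastic gradient ascent rule; the learning rate, epoch count and function name `ica_bell_sejnowski` are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica_bell_sejnowski(X, alpha=0.01, n_epochs=10, seed=0):
    """Estimate the unmixing matrix W from the m x n data matrix X."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):                 # stochastic: one training example at a time
            x = X[i].reshape(n, 1)                   # column vector x^(i)
            # gradient of the log likelihood for this example:
            # (1 - 2g(Wx)) x^T + (W^T)^{-1}, since w_j^T x is the j-th entry of Wx
            grad = (1.0 - 2.0 * sigmoid(W @ x)) @ x.T + np.linalg.inv(W.T)
            W += alpha * grad
    return W                                         # recovered sources: S = X @ W.T
```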