
CS 229 – Machine Learning  https://stanford.edu/~shervine

VIP Cheatsheet: Unsupervised Learning

Afshine Amidi and Shervine Amidi

August 12, 2018

Introduction to Unsupervised Learning

❒ Motivation – The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1), ..., x(m)}.

❒ Jensen's inequality – Let f be a convex function and X a random variable. We have the following inequality:

E[f(X)] ≥ f(E[X])


k-means clustering

We note c(i) the cluster of data point i and µ_j the center of cluster j.

❒ Algorithm – After randomly initializing the cluster centroids µ_1, µ_2, ..., µ_k ∈ R^n, the k-means algorithm repeats the following step until convergence:

c(i) = argmin_j ||x(i) − µ_j||^2    and    µ_j = ( ∑_{i=1}^m 1{c(i)=j} x(i) ) / ( ∑_{i=1}^m 1{c(i)=j} )

❒ Distortion function – In order to see whether the algorithm converges, we look at the distortion function defined as follows:

J(c, µ) = ∑_{i=1}^m ||x(i) − µ_{c(i)}||^2
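To make the alternating update above concrete, here is a minimal NumPy sketch of the k-means loop. The function names kmeans and distortion, the random initialization scheme, and the convergence check are assumptions of this sketch, not part of the cheatsheet:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: X is (m, n); returns cluster labels c and centroids mu."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # random centroid initialization
    for _ in range(n_iter):
        # assignment step: c(i) = argmin_j ||x(i) - mu_j||^2
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # (m, k) distances
        c = d.argmin(axis=1)
        # centroid update: mu_j = mean of the points assigned to cluster j
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):   # stop once the centroids no longer move
            break
        mu = new_mu
    return c, mu

def distortion(X, c, mu):
    """J(c, mu) = sum_i ||x(i) - mu_{c(i)}||^2, used to monitor convergence."""
    return float(((X - mu[c]) ** 2).sum())
```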
Expectation-Maximization

❒ Latent variables – Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:

Setting                     Latent variable z     x|z               Comments
Mixture of k Gaussians      Multinomial(φ)        N(µ_j, Σ_j)       µ_j ∈ R^n, φ ∈ R^k
Factor analysis             N(0, I)               N(µ + Λz, ψ)      µ_j ∈ R^n

❒ Algorithm – The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation, by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step), as follows:

• E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i), as follows:

Qi(z(i)) = P(z(i) | x(i); θ)

• M-step: Use the posterior probabilities Qi(z(i)) as cluster-specific weights on the data points x(i) to separately re-estimate each cluster model, as follows:

θ_i = argmax_θ ∑_i ∫_{z(i)} Qi(z(i)) log( P(x(i), z(i); θ) / Qi(z(i)) ) dz(i)
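As an illustration of these two steps for a mixture of k Gaussians, here is a minimal NumPy/SciPy sketch. The initialization, the small regularization term added to the covariances, and the fixed number of iterations are simplifying assumptions of this sketch:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, seed=0):
    """Minimal EM for a mixture of k Gaussians; X is (m, n)."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)                        # mixing proportions
    mu = X[rng.choice(m, size=k, replace=False)]     # initialize means from data points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(n)] * k)
    for _ in range(n_iter):
        # E-step: Qi(z(i) = j) = P(z(i) = j | x(i); theta)
        Q = np.stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                      for j in range(k)], axis=1)    # (m, k) joint densities
        Q /= Q.sum(axis=1, keepdims=True)            # normalize into posteriors
        # M-step: re-estimate each cluster model with Q as per-point weights
        Nj = Q.sum(axis=0)                           # effective cluster sizes
        phi = Nj / m
        mu = (Q.T @ X) / Nj[:, None]
        for j in range(k):
            Xc = X - mu[j]
            Sigma[j] = (Q[:, j, None] * Xc).T @ Xc / Nj[j] + 1e-6 * np.eye(n)
    return phi, mu, Sigma, Q
```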


Hierarchical clustering

❒ Algorithm – It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.

❒ Types – There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:

Ward linkage       – minimize within-cluster distance
Average linkage    – minimize average distance between cluster pairs
Complete linkage   – minimize maximum distance between cluster pairs
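A minimal usage sketch of agglomerative clustering with these three linkage criteria, using scipy.cluster.hierarchy; the toy data and the cut into 3 clusters are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))                 # toy 2-D data

# Build the merge tree for each linkage criterion from the table above
for method in ("ward", "average", "complete"):
    Z = linkage(X, method=method)                       # (m-1, 4) merge history
    labels = fcluster(Z, t=3, criterion="maxclust")     # cut the tree into 3 clusters
    print(method, np.bincount(labels)[1:])              # resulting cluster sizes
```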




Clustering assessment metrics

In an unsupervised learning setting, it is often hard to assess the performance of a model, since we do not have the ground-truth labels that were available in the supervised learning setting.

❒ Silhouette coefficient – By noting a the mean distance between a sample and all other points in the same class, and b the mean distance between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:

s = (b − a) / max(a, b)

❒ Calinski-Harabaz index – By noting k the number of clusters, and Bk and Wk the between- and within-clustering dispersion matrices respectively defined as

B_k = ∑_{j=1}^k n_{c(i)} (µ_{c(i)} − µ)(µ_{c(i)} − µ)^T ,    W_k = ∑_{i=1}^m (x(i) − µ_{c(i)})(x(i) − µ_{c(i)})^T

the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the denser and better separated the clusters are. It is defined as follows:

s(k) = ( Tr(B_k) / Tr(W_k) ) × ( (N − k) / (k − 1) )
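Both scores have off-the-shelf implementations in scikit-learn; here is a minimal usage sketch, where the toy blobs and the k-means labels are just one possible clustering to evaluate:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# three well-separated 2-D blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 3, 6)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("silhouette        :", silhouette_score(X, labels))           # in [-1, 1], higher is better
print("calinski-harabasz :", calinski_harabasz_score(X, labels))    # higher is better
```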
Principal component analysis

It is a dimension reduction technique that finds the variance-maximizing directions onto which to project the data.

❒ Eigenvalue, eigenvector – Given a matrix A ∈ R^{n×n}, λ is said to be an eigenvalue of A if there exists a vector z ∈ R^n \ {0}, called eigenvector, such that we have:

Az = λz

❒ Spectral theorem – Let A ∈ R^{n×n}. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U ∈ R^{n×n}. By noting Λ = diag(λ_1, ..., λ_n), we have:

∃Λ diagonal,  A = UΛU^T

Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.

❒ Algorithm – The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data, as follows:

• Step 1: Normalize the data to have a mean of 0 and a standard deviation of 1:

x(i)_j ← ( x(i)_j − µ_j ) / σ_j    where    µ_j = (1/m) ∑_{i=1}^m x(i)_j    and    σ_j^2 = (1/m) ∑_{i=1}^m ( x(i)_j − µ_j )^2

• Step 2: Compute Σ = (1/m) ∑_{i=1}^m x(i) x(i)^T ∈ R^{n×n}, which is symmetric with real eigenvalues.

• Step 3: Compute u_1, ..., u_k ∈ R^n, the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.

• Step 4: Project the data on span_R(u_1, ..., u_k). This procedure maximizes the variance among all k-dimensional spaces.
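A minimal NumPy sketch of the four steps above; the helper name pca and the use of eigh on the symmetric matrix Σ are choices of this sketch:

```python
import numpy as np

def pca(X, k):
    """Project X (m, n) onto its top-k principal directions."""
    # Step 1: normalize each feature to mean 0 and standard deviation 1
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: Sigma = (1/m) * sum_i x(i) x(i)^T, symmetric with real eigenvalues
    Sigma = X.T @ X / len(X)
    # Step 3: eigenvectors of the k largest eigenvalues (eigh returns ascending order)
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # (n, k) principal eigenvectors
    # Step 4: project the data on span(u_1, ..., u_k)
    return X @ U                                     # (m, k) projected data

# Example: reduce 5-dimensional toy data to 2 dimensions
X = np.random.default_rng(0).normal(size=(100, 5))
Z = pca(X, k=2)
print(Z.shape)   # (100, 2)
```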
Independent component analysis

It is a technique meant to find the underlying generating sources.

❒ Assumptions – We assume that our data x has been generated by the n-dimensional source vector s = (s_1, ..., s_n), where the s_i are independent random variables, via a mixing and non-singular matrix A, as follows:

x = As

The goal is to find the unmixing matrix W = A^{−1} by an update rule.

❒ Bell and Sejnowski ICA algorithm – This algorithm finds the unmixing matrix W by following the steps below:

• Write the probability of x = As = W^{−1}s as:

p(x) = ∏_{i=1}^n p_s(w_i^T x) · |W|

• Write the log likelihood given our training data {x(i), i ∈ [[1, m]]}, noting g the sigmoid function, as:

l(W) = ∑_{i=1}^m ( ∑_{j=1}^n log( g'(w_j^T x(i)) ) + log |W| )

Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:

W ← W + α ( [ 1 − 2g(w_1^T x(i)), 1 − 2g(w_2^T x(i)), ..., 1 − 2g(w_n^T x(i)) ]^T x(i)^T + (W^T)^{−1} )
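A minimal NumPy sketch of this stochastic gradient ascent rule; the Laplace-distributed toy sources, the learning rate, and the number of passes are illustrative assumptions:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def bell_sejnowski_ica(X, alpha=0.005, n_passes=20, seed=0):
    """X is (m, n); returns the estimated unmixing matrix W."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n, n)) + np.eye(n)
    for _ in range(n_passes):
        for i in rng.permutation(m):          # one stochastic update per training example
            x = X[i][:, None]                 # column vector x(i), shape (n, 1)
            g = sigmoid(W @ x)                # g(w_j^T x(i)) for every row j of W
            # W <- W + alpha * ( [1 - 2g(w_j^T x(i))]_j x(i)^T + (W^T)^{-1} )
            W += alpha * ((1 - 2 * g) @ x.T + np.linalg.inv(W.T))
    return W

# Toy demo: mix two independent heavy-tailed sources and try to unmix them
rng = np.random.default_rng(1)
S = rng.laplace(size=(2000, 2))               # independent sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])        # mixing matrix
X = S @ A.T                                   # observed data x = A s
W = bell_sejnowski_ica(X)
print(np.round(W @ A, 2))  # should be close to a scaled permutation matrix if training converged
```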

