UNSUPERVISED LEARNING OF DISTRIBUTION
Bor-shen Lin
[email protected]
http://www.cs.ntust.edu.tw/~bslin
LEARNING OF DISTRIBUTION
Unsupervised Learning (without output label)
Given { Xi } → learn the distribution $p(x)$
Discrete variable
Learn the probability weight function (p.w.f.)
$p(x)$ should satisfy $\sum_{x=-\infty}^{\infty} p(x) = 1$
Continuous variable
Learn the probability density function (p.d.f.)
$p(x)$ should satisfy $\int_{-\infty}^{\infty} p(x)\,dx = 1$
LEARNING MODELS
Parametric Model
Discrete distribution
Gaussian distribution
Gaussian mixture model
Non-parametric Model (Instance-based Learning)
Nearest Neighbor Model
Kernel Model
LEARNING OF PARAMETRIC MODEL
1. Learning of Discrete Distribution
2. Learning of Gaussian Distribution
3. Learning of Gaussian Mixture Model (GMM)
ESTIMATION OF PARAMETERS
θ represents a set of parameters for the probability density/mass function P(X|θ)
X is a set of i.i.d. observations X1, X2, …, Xn
• Maximum Likelihood Estimation: $\hat{\theta}_{ML} = \arg\max_{\theta} P(X \mid \theta)$
• Maximum A Posteriori Estimation: $\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid X) = \arg\max_{\theta} \frac{P(\theta, X)}{P(X)} = \arg\max_{\theta} P(\theta)\, P(X \mid \theta)$
1. LEARNING OF DISCRETE DISTRIBUTION
$X$ consists of $X_1, X_2, \ldots, X_N$, i.i.d. with p.w.f.
$P(X = k \mid \theta) = w_k, \quad k = 1, 2, \ldots, n$
where $w_1 + w_2 + \cdots + w_n = 1$ and $\theta$ consists of $w_1, w_2, \ldots, w_n$.
Then $P(X \mid \theta) = \prod_{i=1}^{N} P(X_i \mid \theta) = P(X \mid w_1, w_2, \ldots, w_n) = w_1^{c_1} w_2^{c_2} \cdots w_n^{c_n}$,
where $c_k$ is the number of occurrences of $k$ in $X$.
1. ESTIMATION OF DISCRETE DISTRIBUTION
$\hat{\theta}_{ML} = \arg\max_{\theta} P(X \mid \theta)$ is an optimization problem with the constraint $w_1 + w_2 + \cdots + w_n = 1$,
which can be solved with a Lagrange multiplier: $L(X, \theta) \equiv P(X \mid \theta) + \lambda \left( \sum_{k=1}^{n} w_k - 1 \right)$
$\nabla_{\theta} L(X, \theta) = 0 \ \left(\frac{\partial L}{\partial w_k} = 0\right) \ \Rightarrow\ \frac{c_k}{w_k}\, w_1^{c_1} w_2^{c_2} \cdots w_n^{c_n} = -\lambda \quad \text{for every } k$
$\Rightarrow\ \frac{c_1}{w_1} = \frac{c_2}{w_2} = \cdots = \frac{c_n}{w_n} \equiv \lambda' \ \Rightarrow\ N = \sum_{k=1}^{n} c_k = \lambda' \sum_{k=1}^{n} w_k = \lambda' \quad \left(\text{given } \sum_{k=1}^{n} w_k = 1\right)$
$\hat{w}_k = \frac{c_k}{N} = \frac{c_k}{\sum_{k'=1}^{n} c_{k'}}$
• Use the occurrence counts to estimate the probability weights of a p.w.f.
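A minimal sketch in Python (not from the slides; the function name estimate_pwf is assumed) of the count-based ML estimate $\hat{w}_k = c_k / N$ derived above:

```python
# A minimal sketch: ML estimate of a discrete distribution
# by normalized occurrence counts, w_k = c_k / N.
from collections import Counter

def estimate_pwf(samples):
    """Return {value: probability} estimated from i.i.d. samples."""
    counts = Counter(samples)          # c_k: occurrence count of each value
    n_total = sum(counts.values())     # N: total number of observations
    return {k: c / n_total for k, c in counts.items()}

# Example usage
X = [1, 2, 2, 3, 3, 3, 1, 2]
print(estimate_pwf(X))  # {1: 0.25, 2: 0.375, 3: 0.375}
```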
CONSTRAINT OPTIMIZATION WITH
LAGRANGE MULTIPLIER
Maximize $P(\mathbf{w})$ subject to the constraint $\sum_{m} w_m = 1$:
$\hat{w}_k = \dfrac{w_k \, \partial P(\mathbf{w}) / \partial w_k}{\sum_{m=1}^{M} w_m \, \partial P(\mathbf{w}) / \partial w_m}$
2. LEARNING OF GAUSSIAN DISTRIBUTION
$X_1, X_2, \ldots, X_N$ are i.i.d. with p.d.f. $P(X \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, where $\theta = (\mu, \sigma^2)$.
Then $P(X \mid \theta) = \prod_{i=1}^{N} P(X_i \mid \theta)$, where $X$ consists of $X_1, X_2, \ldots, X_N$.
$2 \log P(X \mid \theta) = \sum_{i=1}^{N} 2 \log P(X_i \mid \theta) = \sum_{i=1}^{N} \left[ -\log(2\pi) - \log(\sigma^2) - \frac{(X_i - \mu)^2}{\sigma^2} \right]$
$= -N \log(2\pi) - N \log(\sigma^2) - \sum_{i=1}^{N} \frac{(X_i - \mu)^2}{\sigma^2}$
$l(\theta) \equiv 2 \log P(X \mid \theta) + N \log(2\pi)$ is monotonic with $P(X \mid \theta)$.
GAUSSIAN DISTRIBUTION
$N(x; \mu, \sigma^2) \equiv \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
$E(X) = \mu, \quad Var(X) = \sigma^2$
[Figure: the bell-shaped curve of N(x; μ, σ²), centered at x = μ]
ESTIMATION OF GAUSSIAN DISTRIBUTION
$\hat{\theta}_{ML} = \arg\max_{\theta} P(X \mid \theta) = \arg\max_{\theta} l(\theta)$
$\frac{\partial l(\mu, \sigma^2)}{\partial \mu} = 0 \ \Rightarrow\ \sum_{i=1}^{N} \frac{2(X_i - \mu)}{\sigma^2} = 0$
$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} X_i$ • Every observation $X_i$ contributes to the estimation of $\mu$ (weight 1/N)
$\frac{\partial l(\mu, \sigma^2)}{\partial \sigma^2} = 0 \ \Rightarrow\ \frac{\partial}{\partial \sigma^2} \sum_{i=1}^{N} \left[ \log(\sigma^2) + \frac{(X_i - \mu)^2}{\sigma^2} \right] = 0$
$\sum_{i=1}^{N} \left[ \frac{1}{\sigma^2} - \frac{(X_i - \mu)^2}{\sigma^4} \right] = 0 \ \Rightarrow\ N \sigma^2 = \sum_{i=1}^{N} (X_i - \mu)^2$
$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \hat{\mu})^2$ • Every observation $X_i$ contributes to the estimation of $\sigma^2$ (weight 1/N)
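A minimal sketch (assumed helper fit_gaussian, not from the slides) of the ML estimates $\hat{\mu}$ and $\hat{\sigma}^2$ above:

```python
# A minimal sketch: ML estimates of a 1-D Gaussian,
# mu_hat = mean(X), sigma2_hat = mean((X - mu_hat)^2).
import numpy as np

def fit_gaussian(x):
    """Return (mu, sigma2) maximizing the likelihood of 1-D samples x."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()                      # every X_i contributes with weight 1/N
    sigma2 = ((x - mu) ** 2).mean()    # ML variance (divides by N, not N-1)
    return mu, sigma2

mu, sigma2 = fit_gaussian([1.2, 0.7, 1.9, 1.1, 0.8])
print(mu, sigma2)
```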
GAUSSIAN MIXTURE MODEL (GMM)
$p(x) = \sum_{k=1}^{M} c_k N(x; \mu_k, \sigma_k^2)$
$\int_{-\infty}^{\infty} p(x)\,dx = \sum_{k=1}^{M} c_k \int_{-\infty}^{\infty} N(x; \mu_k, \sigma_k^2)\,dx = \sum_{k=1}^{M} c_k = 1.0$
[Figure: p(x) as a weighted sum of Gaussian components centered at μ1, μ2, μ3, μ4]
PARTITION GAUSSIAN MODEL (PGM)
$p(x) \approx \max_{k} N(x; \mu_k, \sigma_k^2)$
$\int_{-\infty}^{\infty} p(x)\,dx \approx 1.0$
[Figure: p(x) as the envelope of Gaussian components centered at μ1, μ2, μ3, μ4]
MULTI-DIMENSIONAL GAUSSIAN DISTRIBUTION
$N(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^n |\boldsymbol{\Sigma}|}} e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$
$\boldsymbol{\mu} \equiv E(X) = \int \mathbf{x}\, P(\mathbf{x})\, d\mathbf{x} \ (n \times 1), \qquad \boldsymbol{\Sigma} \equiv E\big((X - \boldsymbol{\mu})(X - \boldsymbol{\mu})^t\big) \ (n \times n)$
• x: n-dimensional vector
• Each Gaussian: μ as the mean vector, Σ as the covariance matrix
• The dimensions are stochastically independent when Σ is diagonal
• Multi-dimensional GMM: θ = {(c_m, μ_m, Σ_m)}
• # of parameters: M(1 + n + n²)
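A minimal sketch (assumed helper gaussian_pdf, not from the slides) that evaluates the n-dimensional Gaussian density defined above:

```python
# A minimal sketch: evaluating the n-dimensional Gaussian density N(x; mu, Sigma).
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of N(x; mu, Sigma) for an n-dimensional vector x."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    n = x.shape[0]
    diff = x - mu
    # quadratic form (x - mu)^t Sigma^{-1} (x - mu), solved without an explicit inverse
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

print(gaussian_pdf([0.0, 0.0], [0.0, 0.0], np.eye(2)))  # ≈ 0.159
```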
3. LEARNING OF GMM
WITH EXPECTATION MAXIMIZATION (EM)
$X_1, X_2, \ldots, X_N$ are i.i.d. with p.d.f. $P(X \mid \theta) = \sum_{m=1}^{M} c_m N(X; \mu_m, \sigma_m^2)$,
where $N(X; \mu_m, \sigma_m^2) = \frac{1}{\sqrt{2\pi}\,\sigma_m} e^{-\frac{(x - \mu_m)^2}{2\sigma_m^2}}$, $\theta = \{c_m, \mu_m, \sigma_m\}$ and $\sum_{m=1}^{M} c_m = 1$.
Then $P(X \mid \theta) = \prod_{i=1}^{N} P(X_i \mid \theta)$, where $X$ consists of $X_1, X_2, \ldots, X_N$.
The Lagrange function is $L(\theta) = \log P(X \mid \theta) + \lambda \left( \sum_{m=1}^{M} c_m - 1 \right) = \sum_{i=1}^{N} \log P(X_i \mid \theta) + \lambda \left( \sum_{m=1}^{M} c_m - 1 \right)$.
RE-ESTIMATION OF GMM PARAMETERS
$\frac{\partial L(\theta)}{\partial \mu_m} = \sum_{i=1}^{N} \frac{c_m \frac{\partial}{\partial \mu_m} N(X_i; \mu_m, \sigma_m^2)}{P(X_i \mid \theta)} = \sum_{i=1}^{N} \frac{c_m N(X_i; \mu_m, \sigma_m^2)\, \frac{(X_i - \mu_m)}{\sigma_m^2}}{P(X_i \mid \theta)} = 0$
Define $P(X_i, C_m) \equiv c_m N(X_i; \mu_m, \sigma_m^2)$, $\theta_m \equiv (c_m, \mu_m, \sigma_m)$, and
$l(m, i) \equiv P(C_m \mid X_i) = \frac{P(X_i, C_m)}{P(X_i)}$
Then $\sum_{i=1}^{N} l(m, i)(X_i - \mu_m) = 0 \ \Rightarrow\ \mu_m \sum_{i=1}^{N} l(m, i) = \sum_{i=1}^{N} l(m, i) X_i$
$\hat{\mu}_m = \frac{\sum_{i=1}^{N} l(m, i)\, X_i}{\sum_{i=1}^{N} l(m, i)}$
• l(m,i): estimated probability that $X_i$ is produced by the m-th mixture (weight)
• The denominator is for normalization
RE-ESTIMATION OF GMM PARAMETERS
$\frac{\partial L(\theta)}{\partial \sigma_m^2} = \sum_{i=1}^{N} \frac{c_m \frac{\partial}{\partial \sigma_m^2} N(X_i; \mu_m, \sigma_m^2)}{P(X_i \mid \theta)} = 0$
$\Rightarrow\ \sum_{i=1}^{N} \frac{P(X_i \mid \theta_m)}{P(X_i \mid \theta)} \left( \frac{(X_i - \mu_m)^2}{\sigma_m^2} - 1 \right) = 0$
$\Rightarrow\ \sum_{i=1}^{N} l(m, i) = \sum_{i=1}^{N} l(m, i)\, \frac{(X_i - \mu_m)^2}{\sigma_m^2}$
$\hat{\sigma}_m^2 = \frac{\sum_{i=1}^{N} l(m, i)(X_i - \mu_m)^2}{\sum_{i=1}^{N} l(m, i)}$
CONCEPT
$\mu_m$ is an independent variable that may be adjusted freely, no matter what the other parameters are.
$L(\theta)$, viewed as a function of $\mu_m$ alone, has a unique global maximum.
The global maximum is located where $\frac{\partial L(\theta)}{\partial \mu_m} = 0$.
A better value of $\mu_m$ is therefore guaranteed through the iteration formula for $\hat{\mu}_m$.
Both $\mu_m$ and $\sigma_m$ are independent variables.
The $c_m$'s are dependent variables, since $\sum_m c_m = 1$.
RE-ESTIMATION OF GMM PARAMETERS
$\frac{\partial L(\theta)}{\partial c_m} = \sum_{i=1}^{N} \frac{N(X_i; \mu_m, \sigma_m^2)}{P(X_i \mid \theta)} + \lambda = 0$
Define $P(X_i \mid \theta_m) \equiv c_m N(X_i; \mu_m, \sigma_m^2)$, $\theta_m \equiv (c_m, \mu_m, \sigma_m)$, $l(m, i) \equiv \frac{P(X_i \mid \theta_m)}{P(X_i \mid \theta)}$
$-\lambda = \sum_{i=1}^{N} \frac{N(X_i; \mu_m, \sigma_m^2)}{P(X_i \mid \theta)} = \sum_{i=1}^{N} \frac{P(X_i \mid \theta_m)}{c_m P(X_i \mid \theta)} = \frac{\sum_{i=1}^{N} l(m, i)}{c_m}$
$1 = \sum_{m=1}^{M} c_m = \sum_{m=1}^{M} \frac{\sum_{i=1}^{N} l(m, i)}{-\lambda} \ \Rightarrow\ -\lambda = \sum_{m=1}^{M} \sum_{i=1}^{N} l(m, i)$
$\hat{c}_m = \frac{\sum_{i=1}^{N} l(m, i)}{\sum_{m'=1}^{M} \sum_{i=1}^{N} l(m', i)} = \frac{\sum_{i=1}^{N} l(m, i)}{N}$
EM REESTIMATES OF GMM PARAMETERS
(MULTI-DIMENSIONAL)
$\hat{\boldsymbol{\mu}}_m = \frac{\sum_i l(m, i)\, \mathbf{x}_i}{\sum_i l(m, i)}$
$\hat{\boldsymbol{\Sigma}}_m = \frac{\sum_i l(m, i)\, (\mathbf{x}_i - \boldsymbol{\mu}_m)(\mathbf{x}_i - \boldsymbol{\mu}_m)^t}{\sum_i l(m, i)}$
$\hat{c}_m = \frac{\sum_i l(m, i)}{\sum_m \sum_i l(m, i)}$
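A minimal sketch (assumed helper em_step, not from the slides) of one EM pass for a multi-dimensional GMM using these re-estimates; the E-step computes the posteriors l(m,i) and the M-step applies the three formulas above:

```python
# A minimal sketch of one EM pass for a multi-dimensional GMM.
import numpy as np

def em_step(X, c, mu, Sigma):
    """X: (N, n) data; c: (M,) weights; mu: (M, n); Sigma: (M, n, n)."""
    N, n = X.shape
    M = c.shape[0]

    # E-step: l[m, i] = c_m N(x_i; mu_m, Sigma_m) / sum_k c_k N(x_i; mu_k, Sigma_k)
    joint = np.empty((M, N))
    for m in range(M):
        diff = X - mu[m]                               # (N, n)
        quad = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigma[m]), diff)
        norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma[m]))
        joint[m] = c[m] * np.exp(-0.5 * quad) / norm
    l = joint / joint.sum(axis=0, keepdims=True)       # posteriors l(m, i)

    # M-step: weighted means, covariances, and mixture weights
    occ = l.sum(axis=1)                                # sum_i l(m, i)
    c_new = occ / occ.sum()                            # equals occ / N
    mu_new = (l @ X) / occ[:, None]
    Sigma_new = np.empty_like(Sigma)
    for m in range(M):
        diff = X - mu_new[m]
        Sigma_new[m] = (l[m, :, None] * diff).T @ diff / occ[m]
    return c_new, mu_new, Sigma_new
```

Iterating em_step until the log-likelihood stops improving gives the usual EM training loop.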
LEARNING FOR GMM
[Flowchart] Initial model θ = {(c_m, μ_m, Σ_m)}, obtained randomly or by splitting
→ Expectation: compute l(m,i) from the training data { Xi }
→ Maximization: re-estimate θ̂ = {θ̂_m} = {(ĉ_m, μ̂_m, Σ̂_m)}
→ If not converged, repeat with θ = θ̂; otherwise output the final model θ* = {θ_m}
STEPS OF GMM TRAINING
1. Set the initial mixture to the mean and covariance of all training data {Xi} (M = 1).
2. Split the largest mixture into two mixtures around its mean, with equal weights.
Another algorithm: LBG (1→2→4→8→…)
3. Re-estimate the model parameters iteratively until convergence.
4. Repeat steps 2 & 3 until M mixtures are obtained (see the sketch below).
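A minimal sketch (assumed helper train_gmm, not from the slides) of this splitting schedule, reusing em_step from the earlier sketch:

```python
# A minimal sketch of the splitting schedule: start from one Gaussian,
# repeatedly split the heaviest mixture by perturbing its mean,
# then re-run EM until the target number of mixtures is reached.
import numpy as np

def train_gmm(X, M_target, em_step, n_iter=50, eps=1e-3):
    N, n = X.shape
    # Step 1: single mixture = global mean and covariance
    c = np.array([1.0])
    mu = X.mean(axis=0, keepdims=True)
    Sigma = np.cov(X.T, bias=True).reshape(1, n, n)
    while True:
        # Step 3: iterative re-estimation with EM
        for _ in range(n_iter):
            c, mu, Sigma = em_step(X, c, mu, Sigma)
        if len(c) >= M_target:
            return c, mu, Sigma
        # Step 2: split the mixture with the largest weight, equal weights
        m = int(np.argmax(c))
        shift = eps * np.sqrt(np.diag(Sigma[m]))
        mu = np.vstack([mu, mu[m] + shift]); mu[m] -= shift
        Sigma = np.vstack([Sigma, Sigma[m][None]])
        c = np.append(c, c[m] / 2); c[m] /= 2
```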
DISTANCE BETWEEN GAUSSIANS - 1
Bhattacharyya Divergence $D_B(f, g)$
$D_B(f, g)$ gives an estimate of the classification error
$D_B(f, g) \equiv -\log \int \sqrt{f(x)\, g(x)}\, dx = \frac{1}{4} (\boldsymbol{\mu}_f - \boldsymbol{\mu}_g)^t (\boldsymbol{\Sigma}_f + \boldsymbol{\Sigma}_g)^{-1} (\boldsymbol{\mu}_f - \boldsymbol{\mu}_g) + \frac{1}{2} \log \left| \frac{\boldsymbol{\Sigma}_f + \boldsymbol{\Sigma}_g}{2} \right| - \frac{1}{4} \log \left| \boldsymbol{\Sigma}_f \boldsymbol{\Sigma}_g \right|$
Then the Bayes error $B_e(f, g) \equiv \frac{1}{2} \int \min(f(x), g(x))\, dx \le \frac{1}{2} e^{-D_B(f, g)}$
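A minimal sketch (assumed helper bhattacharyya, not from the slides) of the closed-form $D_B(f,g)$ between two Gaussians:

```python
# A minimal sketch: Bhattacharyya divergence between two Gaussians
# f = N(mu_f, Sigma_f) and g = N(mu_g, Sigma_g).
import numpy as np

def bhattacharyya(mu_f, Sigma_f, mu_g, Sigma_g):
    mu_f, mu_g = np.asarray(mu_f, float), np.asarray(mu_g, float)
    d = mu_f - mu_g
    S = Sigma_f + Sigma_g
    term1 = 0.25 * d @ np.linalg.solve(S, d)
    term2 = 0.5 * np.log(np.linalg.det(S / 2.0))
    term3 = -0.25 * (np.log(np.linalg.det(Sigma_f)) + np.log(np.linalg.det(Sigma_g)))
    return term1 + term2 + term3

# Identical Gaussians give D_B = 0
print(bhattacharyya([0, 0], np.eye(2), [0, 0], np.eye(2)))  # 0.0
```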
DISTANCE BETWEEN GAUSSIANS - 2
Kullback-Leibler Divergence (KLD)
$D_{KL}(f, g) \equiv \int f(x) \log \frac{f(x)}{g(x)}\, dx = \frac{1}{2} \left[ \log \frac{|\boldsymbol{\Sigma}_g|}{|\boldsymbol{\Sigma}_f|} + \mathrm{Tr}\big( \boldsymbol{\Sigma}_g^{-1} \boldsymbol{\Sigma}_f \big) - d + (\boldsymbol{\mu}_f - \boldsymbol{\mu}_g)^t \boldsymbol{\Sigma}_g^{-1} (\boldsymbol{\mu}_f - \boldsymbol{\mu}_g) \right]$
Then $D_{KL}(f, g) \ge 2 D_B(f, g)$
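A minimal sketch (assumed helper kl_gaussians, not from the slides) of the closed-form $D_{KL}(f,g)$ between two d-dimensional Gaussians:

```python
# A minimal sketch: KL divergence between two d-dimensional Gaussians f and g.
import numpy as np

def kl_gaussians(mu_f, Sigma_f, mu_g, Sigma_g):
    mu_f, mu_g = np.asarray(mu_f, float), np.asarray(mu_g, float)
    d = mu_f.shape[0]
    diff = mu_f - mu_g
    Sigma_g_inv = np.linalg.inv(Sigma_g)
    return 0.5 * (np.log(np.linalg.det(Sigma_g) / np.linalg.det(Sigma_f))
                  + np.trace(Sigma_g_inv @ Sigma_f)
                  - d
                  + diff @ Sigma_g_inv @ diff)

print(kl_gaussians([0, 0], np.eye(2), [1, 0], np.eye(2)))  # 0.5
```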
SIMILARITY BETWEEN TWO GMMS
[Diagram] The similarity between two GMMs is measured from the pairwise distances (e.g. the Bhattacharyya distance) between the mixtures of GMM 1 and the mixtures of GMM 2, combined by a weighted average.
APPLICATIONS OF GMM
Parametric model of continuous variables
Unsupervised learning of p(X)
Clustering (unsupervised): X → C
Regarding every mixture as a cluster
Classification (supervised)
Train p(X|Cm) for all m’s (classes)
Examples
Language identification
Gender identification
Speaker recognition
Image classification/tagging
GMM-BASED CLUSTERING
1. Every mixture of a GMM is regarded as a cluster
Similar to k-Means clustering, but the variances are used for distance normalization when computing the probability (in the exponential term), whereas simple k-Means uses the Euclidean distance
A GMM is trained, and each point $X$ is assigned to a cluster according to (see the sketch after this list):
$k^* = \arg\max_k l(X, k) = \arg\max_k \frac{c_k\, p_k(X)}{\sum_i c_i\, p_i(X)}$
2. A GMM may also be regarded as a point
Clustering of GMMs based on distances
Example: speaker clustering (grouping speakers) by training a GMM for each speaker
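A minimal sketch (not from the slides) of mixture-as-cluster assignment; it assumes scikit-learn's GaussianMixture, but any trained GMM with weights, means and covariances would do:

```python
# A minimal sketch: assign each point to the mixture with the
# largest responsibility l(X, k).
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.randn(200, 2)
X[100:] += 4.0                          # two well-separated groups

gmm = GaussianMixture(n_components=2).fit(X)
labels = gmm.predict(X)                 # k* = argmax_k c_k p_k(X) / sum_i c_i p_i(X)
resp = gmm.predict_proba(X)             # responsibilities l(X, k)
print(labels[:5], resp[:2])
```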
GMM-BASED CLASSIFIER
Train GMMs of p(x|Ci) for i=0,1 respectively
ML Detector
C* = argmaxi p(x|Ci)
MAP Detector (Given the prior distribution)
C* = argmaxi p(Ci|x)
= argmaxi p(Ci)p(x|Ci)
[Diagram] { xi } of Ck → ML Estimation → GMM p(x|Ck)
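A minimal sketch (assumed helpers train_class_gmms and map_detect, not from the slides) of the ML/MAP detector using one GMM per class; scikit-learn's GaussianMixture is assumed for the per-class ML estimation:

```python
# A minimal sketch: train one GMM per class and pick
# argmax_i p(C_i) p(x | C_i) in the log domain.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmms(data_per_class, n_components=4):
    """data_per_class: dict {class_label: (N_i, n) array}. Returns GMMs and log-priors."""
    gmms = {c: GaussianMixture(n_components=n_components).fit(x)
            for c, x in data_per_class.items()}
    total = sum(len(x) for x in data_per_class.values())
    log_prior = {c: np.log(len(x) / total) for c, x in data_per_class.items()}
    return gmms, log_prior

def map_detect(x, gmms, log_prior):
    x = np.atleast_2d(x)
    # score_samples returns log p(x | C_i); drop log_prior[c] for the ML detector
    scores = {c: gmms[c].score_samples(x)[0] + log_prior[c] for c in gmms}
    return max(scores, key=scores.get)
```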
DISCRIMINATIVE TRAINING FOR GMM
ML training
The objective function to be maximized is the likelihood function of each class
Each GMM is trained with the data of its own class
A sample of class k influences only the distribution of that class, i.e. p(x|Ck)
Minimum classification error (MCE) training
The objective function to be minimized is the overall classification error
The GMMs for different classes are trained jointly
instead of individually
Every sample will influence the distributions of all
classes, i.e. p(x|Cj) for all j.
MCE TRAINING
$p_k(x)$ is a GMM with parameters $\{(c_{km}, \mu_{km}, \Sigma_{km})\}$
$p_k(x) = \sum_{m=1}^{M} c_{km} N_{km}(x)$
$g_k(x) = \log p_k(x)$
$d_k(x) = -g_k(x) + g_{\bar{k}}(x) = -g_k(x) + \log \left[ \frac{1}{K-1} \sum_{j \ne k} e^{\eta g_j(x)} \right]^{1/\eta}$
$l_k(x) = \frac{1}{1 + e^{-\gamma d_k(x) + \theta}}$ (sigmoid)
$L(X) = \sum_{k=1}^{K} \sum_{x_i \in C_k} l_k(x_i)$
Minimizing L leads to the minimization of the classification error
The parameters can be obtained by generalized probabilistic descent (GPD): $d\Lambda = -\epsilon \nabla L$
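A minimal sketch (assumed helper mce_loss, not from the slides) that computes the misclassification measure $d_k(x)$ and the sigmoid loss $l_k(x)$ from per-class log-likelihoods; the GPD gradients would then be taken with respect to the GMM parameters:

```python
# A minimal sketch: MCE misclassification measure d_k(x) and sigmoid loss l_k(x).
import numpy as np

def mce_loss(g, k, eta=1.0, gamma=1.0, theta=0.0):
    """g: array of class log-likelihoods g_j(x); k: index of the true class."""
    g = np.asarray(g, dtype=float)
    others = np.delete(g, k)                             # g_j(x) for j != k
    g_anti = np.log(np.mean(np.exp(eta * others))) / eta
    d_k = -g[k] + g_anti                                 # misclassification measure
    return 1.0 / (1.0 + np.exp(-gamma * d_k + theta))    # sigmoid loss l_k(x)

# Correctly classified sample (true class has the largest g): small loss
print(mce_loss([-10.0, -14.0, -13.0], k=0))
```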
MCE FORMULA – DIAGONAL COVARIANCE
For $x_i \in C_k$, define $\theta_{jm} \equiv \frac{c_{jm} p_{jm}}{p_j}$ and $r_k \equiv \gamma\, l_k (1 - l_k)$
$d\mu_{kml} = \varepsilon\, r_k\, \theta_{km}\, \frac{x_l - \mu_{kml}}{\sigma_{kml}^2}$
$d\mu_{jml} = -\varepsilon\, r_k\, \frac{p_j}{\sum_{n \ne k} p_n}\, \theta_{jm}\, \frac{x_l - \mu_{jml}}{\sigma_{jml}^2} \quad \text{for } j \ne k$
$d\sigma_{kml} = \varepsilon\, r_k\, \theta_{km}\, \frac{1}{\sigma_{kml}} \left( \frac{(x_l - \mu_{kml})^2}{\sigma_{kml}^2} - 1 \right)$
$d\sigma_{jml} = -\varepsilon\, r_k\, \frac{p_j}{\sum_{n \ne k} p_n}\, \theta_{jm}\, \frac{1}{\sigma_{jml}} \left( \frac{(x_l - \mu_{jml})^2}{\sigma_{jml}^2} - 1 \right) \quad \text{for } j \ne k$
Reference: Juang, Chou & Lee, "Minimum classification error rate methods for speech recognition," IEEE Trans. on Speech and Audio Processing, 1997.
INSTANCE-BASED LEARNING
Weakness of parametric models
A restricted family of functions might over-simplify the real world
Non-parametric learning
All samples are stored and used as the model
Also called instance-based learning or memory-based learning
Complexity increases as the data set grows
Estimation of p(x)
A type of unsupervised learning
1. Nearest Neighbor Model
2. Kernel Model
NEAREST-NEIGHBOR MODELS
Estimation of density
Use the largest distance D among the k nearest neighbors of x
The larger the distance, the lower the density at the point x: $p(\mathbf{x}) \propto \frac{k}{D^n}$
k is small → $p(\mathbf{x})$ highly variable; k is large → $p(\mathbf{x})$ smooth
[Figure: the k = 3 nearest neighbors of x fall within a radius D]
Distance measure
Euclidean distance might not be appropriate (e.g. $D = \mathbf{d}^t \Sigma^{-1} \mathbf{d}$)
Should consider the physical meanings of the different dimensions
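A minimal sketch (assumed helper knn_density, not from the slides) of k-nearest-neighbor density estimation; normalizing by the volume of the n-ball of radius D is an added assumption beyond the proportionality $p(\mathbf{x}) \propto k / D^n$ on the slide:

```python
# A minimal sketch: k-nearest-neighbor density estimation in n dimensions,
# p(x) ≈ k / (N * V_n(D)), where D is the distance to the k-th nearest
# neighbor and V_n(D) is the volume of the n-ball of radius D.
import numpy as np
from math import gamma, pi

def knn_density(x, data, k=3):
    x, data = np.asarray(x, float), np.asarray(data, float)
    N, n = data.shape
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    D = dists[k - 1]                                  # largest of the k nearest distances
    volume = (pi ** (n / 2) / gamma(n / 2 + 1)) * D ** n
    return k / (N * volume)                           # p(x) ∝ k / D^n

data = np.random.randn(500, 2)
print(knn_density([0.0, 0.0], data, k=5))
```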
KERNEL MODELS
$p(\mathbf{x})$ is estimated with the normalized sum of the kernel functions over all training instances $\{\mathbf{x}_i\}$
$p(\mathbf{x}) = \frac{1}{N} \sum_i K(\mathbf{x}, \mathbf{x}_i)$
$K(\mathbf{x}, \mathbf{x}_i)$ is a measure of similarity that depends on $D(\mathbf{x}, \mathbf{x}_i)$
A popular kernel: $K(\mathbf{x}, \mathbf{x}_i) = \frac{1}{(2\pi w^2)^{d/2}} e^{-\frac{D(\mathbf{x}, \mathbf{x}_i)^2}{2w^2}}$
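A minimal sketch (assumed helper kernel_density, not from the slides) of kernel density estimation with the Gaussian kernel above, using the Euclidean distance for D and bandwidth w:

```python
# A minimal sketch: kernel density estimation with a Gaussian kernel of width w.
import numpy as np

def kernel_density(x, data, w=0.5):
    x, data = np.asarray(x, float), np.asarray(data, float)
    N, d = data.shape
    D2 = np.sum((data - x) ** 2, axis=1)              # D(x, x_i)^2
    K = np.exp(-D2 / (2 * w ** 2)) / (2 * np.pi * w ** 2) ** (d / 2)
    return K.mean()                                   # (1/N) * sum_i K(x, x_i)

data = np.random.randn(1000, 2)
print(kernel_density([0.0, 0.0], data, w=0.3))
```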