
11 Density Estimation with Gaussian Mixture Models
In earlier chapters, we have already covered two fundamental problems in machine learning: regression (Chapter 9) and dimensionality reduction (Chapter 10). In this chapter, we will have a look at a third pillar of machine learning: density estimation. On our journey, we introduce important concepts, such as the Expectation Maximization (EM) algorithm and a latent variable perspective of density estimation with mixture models.

When we apply machine learning to data, we often aim to represent the data in some way. A straightforward way is to take the data points themselves as the representation of the data; see Figure 11.1 for an example. However, this approach may be unhelpful if the dataset is huge or if we are interested in representing characteristics of the data. In density estimation, we represent the data compactly using a density from a parametric family, e.g., a Gaussian or Beta distribution. For example, we may be looking for the mean and variance of a dataset in order to represent the data compactly using a Gaussian distribution. The mean and variance can be found using tools we discussed in Section 8.2: maximum likelihood or maximum a posteriori estimation. We can then use the mean and variance of this Gaussian to represent the distribution underlying the data, i.e., we think of the dataset as a typical realization from this distribution if we were to sample from it.

Figure 11.1 Two-dimensional dataset that cannot be meaningfully represented by a Gaussian.
Draft chapter (November 8, 2018) from "Mathematics for Machine Learning" © 2018 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. To be published by Cambridge University Press. Report errata and feedback to https://mml-book.com. Please do not post or distribute this file; please link to https://mml-book.com.
In practice, the Gaussian (and similarly all other distributions we encountered so far) has limited modeling capabilities. For example, a Gaussian approximation of the density that generated the data in Figure 11.1 would be a poor approximation. In the following, we will look at a more expressive family of distributions, which we can use for density estimation: mixture models.
Mixture models can be used to describe a distribution p(x) by a convex combination of K simple (base) distributions

p(x) = \sum_{k=1}^{K} \pi_k p_k(x),   (11.1)

0 \leqslant \pi_k \leqslant 1, \quad \sum_{k=1}^{K} \pi_k = 1,   (11.2)

where the components p_k are members of a family of basic distributions, e.g., Gaussians, Bernoullis, or Gammas, and the πk are the mixture weights. Mixture models are more expressive than the corresponding base distributions because they allow for multimodal data representations, i.e., they can describe datasets with multiple "clusters", such as the example in Figure 11.1.
We will focus on Gaussian mixture models (GMMs), where the basic distributions are Gaussians. For a given dataset, we aim to maximize the likelihood of the model parameters to train the GMM. For this purpose, we will use results from Chapter 5, Section 7.2, and Chapter 6. However, unlike other applications we discussed earlier (linear regression or PCA), we will not find a closed-form maximum likelihood solution. Instead, we will arrive at a set of dependent simultaneous equations, which we can only solve iteratively.

11.1 Gaussian Mixture Model


A Gaussian mixture model is a density model where
 we combine a finite Gaussian mixture
number of K Gaussian distributions N x | µk , Σk so that model

K
X 
p(x | θ) = πk N x | µk , Σk (11.3)
k=1
K
X
0 6 πk 6 1 , πk = 1 , (11.4)
k=1

where we defined θ := {µk , Σk , πk : k = 1, . . . , K} as the collection of


all parameters of the model. This convex combination of Gaussian distri-
bution gives us significantly more flexibility for modeling complex densi-
ties than a simple Gaussian distribution (which we recover from (11.3) for
K = 1). An illustration is given in Figure 11.2, displaying the weighted


Figure 11.2 Gaussian mixture model. The Gaussian mixture distribution (black) is composed of a convex combination of Gaussian distributions and is more expressive than any individual component. Dashed lines represent the weighted Gaussian components.

An illustration is given in Figure 11.2, displaying the weighted components and the mixture density, which is given as

p(x \mid \theta) = 0.5\,\mathcal{N}(x \mid -2, \tfrac{1}{2}) + 0.2\,\mathcal{N}(x \mid 1, 2) + 0.3\,\mathcal{N}(x \mid 4, 1).   (11.5)
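As a concrete illustration, the following short sketch evaluates the mixture density in (11.5) on a grid of points. It uses only NumPy; the helper gauss_pdf and the grid xs are our own illustrative choices, not anything prescribed by the book.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Univariate Gaussian density N(x | mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Mixture from (11.5): weights, means, and variances of the three components
weights = np.array([0.5, 0.2, 0.3])
means = np.array([-2.0, 1.0, 4.0])
variances = np.array([0.5, 2.0, 1.0])

xs = np.linspace(-6.0, 8.0, 200)                     # evaluation grid
weighted_components = np.stack(                      # dashed lines in Figure 11.2
    [w * gauss_pdf(xs, m, v) for w, m, v in zip(weights, means, variances)])
gmm_density = weighted_components.sum(axis=0)        # mixture density p(x | theta), the solid line
```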

11.2 Parameter Learning via Maximum Likelihood


Assume we are given a dataset X = {x1, . . . , xN}, where xn, n = 1, . . . , N, are drawn i.i.d. from an unknown distribution p(x). Our objective is to find a good approximation/representation of this unknown distribution p(x) by means of a Gaussian mixture model (GMM) with K mixture components. The parameters of the GMM are the K means µk, the covariances Σk, and the mixture weights πk. We summarize all these free parameters in θ := {πk, µk, Σk : k = 1, . . . , K}.

Throughout this chapter, we will have a simple running example that helps us illustrate and visualize important concepts.

Example 11.1 (Initial setting)

Figure 11.3 Initial setting: GMM (black) with three mixture components (dashed) and seven data points (discs).


We consider a one-dimensional dataset X = {−3, −2.5, −1, 0, 2, 4, 5} consisting of seven data points and wish to find a GMM with K = 3 components that models the density of the data. We initialize the mixture components as

p_1(x) = \mathcal{N}(x \mid -4, 1),   (11.6)
p_2(x) = \mathcal{N}(x \mid 0, 0.2),   (11.7)
p_3(x) = \mathcal{N}(x \mid 8, 3),   (11.8)

and assign them equal weights π1 = π2 = π3 = 1/3. The corresponding model (and the data points) are shown in Figure 11.3.
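The initial configuration of this running example can be written down directly in code; the sketch below uses our own variable names (X, means, variances, weights) for the data and for the parameters in (11.6)–(11.8).

```python
import numpy as np

# Seven data points of the running example
X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])

# Initial means and variances of the components p_1, p_2, p_3 from (11.6)-(11.8)
means = np.array([-4.0, 0.0, 8.0])
variances = np.array([1.0, 0.2, 3.0])

# Equal initial mixture weights pi_1 = pi_2 = pi_3 = 1/3
weights = np.full(3, 1.0 / 3.0)
```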

In the following, we detail how to obtain a maximum likelihood estimate θML of the model parameters θ. We start by writing down the likelihood, i.e., the predictive distribution of the training data given the parameters. We exploit our i.i.d. assumption, which leads to the factorized likelihood

p(\mathcal{X} \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta), \qquad p(x_n \mid \theta) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k),   (11.9)

where every individual likelihood term p(xn | θ) is a Gaussian mixture density. Then, we obtain the log-likelihood as

\log p(\mathcal{X} \mid \theta) = \sum_{n=1}^{N} \log p(x_n \mid \theta) = \underbrace{\sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}_{=: \mathcal{L}}.   (11.10)
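The log-likelihood (11.10) is straightforward to evaluate numerically. The sketch below does so for a one-dimensional GMM; it re-states the gauss_pdf helper from the earlier sketch so that it stands on its own, and all names are our own rather than from the book.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gmm_log_likelihood(X, weights, means, variances):
    """Log-likelihood (11.10): sum_n log sum_k pi_k N(x_n | mu_k, sigma_k^2)."""
    # densities[n, k] = pi_k * N(x_n | mu_k, sigma_k^2), via broadcasting over (N, K)
    densities = weights * gauss_pdf(X[:, None], means, variances)
    return np.log(densities.sum(axis=1)).sum()
```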

We aim to find parameters θ*ML that maximize the log-likelihood L defined in (11.10). Our "normal" procedure would be to compute the gradient dL/dθ of the log-likelihood with respect to the model parameters θ, set it to 0, and solve for θ. However, unlike our previous examples for maximum likelihood estimation (e.g., when we discussed linear regression in Section 9.2), we cannot obtain a closed-form solution. However, we can exploit an iterative scheme to find good model parameters θML, which will turn out to be the EM algorithm for Gaussian mixture models. The key idea is to update one model parameter at a time while keeping the others fixed.
Remark. If we were to consider a single Gaussian as the desired density, the sum over k in (11.10) vanishes, and the log can be applied directly to the Gaussian component, such that we get

\log \mathcal{N}(x \mid \mu, \Sigma) = -\tfrac{D}{2}\log(2\pi) - \tfrac{1}{2}\log\det(\Sigma) - \tfrac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu).   (11.11)

This simple form allows us to find closed-form maximum likelihood estimates of µ and Σ, as discussed in Chapter 8.


In (11.10), we cannot move the log into the sum over k, so that we cannot obtain a simple closed-form maximum likelihood solution. ♦
Any local optimum of a function exhibits the property that its gradient with respect to the parameters must vanish (necessary condition); see Chapter 7. In our case, we obtain the following necessary conditions when we optimize the log-likelihood in (11.10) with respect to the GMM parameters µk, Σk, πk:

\frac{\partial \mathcal{L}}{\partial \mu_k} = \mathbf{0}^\top \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \mu_k} = \mathbf{0}^\top,   (11.12)

\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \mathbf{0} \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \Sigma_k} = \mathbf{0},   (11.13)

\frac{\partial \mathcal{L}}{\partial \pi_k} = 0 \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \pi_k} = 0.   (11.14)

For all three necessary conditions, by applying the chain rule (see Section 5.2.2), we require partial derivatives of the form

\frac{\partial \log p(x_n \mid \theta)}{\partial \theta} = \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \theta},   (11.15)

where θ = {µk, Σk, πk, k = 1, . . . , K} are the model parameters and

\frac{1}{p(x_n \mid \theta)} = \frac{1}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.   (11.16)

In the following, we will compute the partial derivatives (11.12)–(11.14). But before we do this, we introduce a quantity that will play a central role in the remainder of this chapter: responsibilities.

11.2.1 Responsibilities


We define the quantity

r_{nk} := \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}   (11.17)

as the responsibility of the k-th mixture component for the n-th data point. The responsibility rnk of the k-th mixture component for data point xn is proportional to the likelihood

p(x_n \mid \pi_k, \mu_k, \Sigma_k) = \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)   (11.18)

of the mixture component given the data point. Therefore, mixture components have a high responsibility for a data point when the data point could be a plausible sample from that mixture component.

Note that r_n := [r_{n1}, . . . , r_{nK}]^\top ∈ R^K is a (normalized) probability vector (r_n follows a Boltzmann/Gibbs distribution), i.e., \sum_k r_{nk} = 1 with r_{nk} \geqslant 0. This probability vector distributes probability mass among the K mixture components, and we can think of r_n as a "soft assignment" of xn to the K mixture components. Therefore, the responsibility rnk from (11.17) represents the probability that xn has been generated by the k-th mixture component.

Example 11.2 (Responsibilities)
For our example from Figure 11.3, we compute the responsibilities rnk as

\begin{bmatrix} 1.0 & 0.0 & 0.0 \\ 1.0 & 0.0 & 0.0 \\ 0.057 & 0.943 & 0.0 \\ 0.001 & 0.999 & 0.0 \\ 0.0 & 0.066 & 0.934 \\ 0.0 & 0.0 & 1.0 \\ 0.0 & 0.0 & 1.0 \end{bmatrix} \in \mathbb{R}^{N \times K}.   (11.19)

Here, the n-th row tells us the responsibilities of all mixture components for xn. The sum of all K responsibilities for a data point (the sum of every row) is 1. The k-th column gives us an overview of the responsibility of the k-th mixture component. We can see that the third mixture component (third column) is not responsible for any of the first four data points, but takes much responsibility for the remaining data points. The sum of all entries of a column gives us the value Nk, i.e., the total responsibility of the k-th mixture component. In our example, we get N1 = 2.058, N2 = 2.008, N3 = 2.934.
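The responsibilities in (11.17) are a one-liner in NumPy. A sketch with our own function and variable names, which for the initial parameters of the running example approximately reproduces the matrix in (11.19) and the column sums Nk:

```python
import numpy as np

def gauss_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def responsibilities(X, weights, means, variances):
    """r[n, k] = pi_k N(x_n | mu_k, var_k) / sum_j pi_j N(x_n | mu_j, var_j), see (11.17)."""
    weighted = weights * gauss_pdf(X[:, None], means, variances)   # shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
r = responsibilities(X, np.full(3, 1 / 3), np.array([-4.0, 0.0, 8.0]), np.array([1.0, 0.2, 3.0]))
N_k = r.sum(axis=0)   # total responsibilities, roughly [2.058, 2.008, 2.934]
```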

In the following, we determine the updates of the model parameters µk, Σk, πk for given responsibilities. We will see that the update equations all depend on the responsibilities, which makes a closed-form solution to the maximum likelihood estimation problem impossible. However, for given responsibilities we will be updating one model parameter at a time, while keeping the others fixed. After this, we will re-compute the responsibilities. Iterating these two steps will eventually converge to a local optimum and is a specific instantiation of the EM algorithm. We will discuss this in some more detail in Section 11.3.

11.2.2 Updating the Means


Theorem 11.1 (Update of the GMM Means). The update of the mean parameters µk, k = 1, . . . , K, of the GMM is given by

\mu_k^{\text{new}} = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}},   (11.20)

where the responsibilities rnk are defined in (11.17).


Remark. The update of the means µk of the individual mixture components in (11.20) depends on all means, covariance matrices Σk, and mixture weights πk via rnk given in (11.17). Therefore, we cannot obtain a closed-form solution for all µk at once. ♦
Proof From (11.15), we see that the gradient of the log-likelihood with respect to the mean parameters µk, k = 1, . . . , K, requires us to compute the partial derivative

\frac{\partial p(x_n \mid \theta)}{\partial \mu_k} = \sum_{j=1}^{K} \pi_j \frac{\partial \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\partial \mu_k} = \pi_k \frac{\partial \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\partial \mu_k}   (11.21)

= \pi_k (x_n - \mu_k)^\top \Sigma_k^{-1} \mathcal{N}(x_n \mid \mu_k, \Sigma_k),   (11.22)

where we exploited that only the k-th mixture component depends on µk.

We use our result from (11.22) in (11.15) and put everything together so that the desired partial derivative of L with respect to µk is given as

\frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \mu_k} = \sum_{n=1}^{N} \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \mu_k}   (11.23a)

= \sum_{n=1}^{N} (x_n - \mu_k)^\top \Sigma_k^{-1} \underbrace{\frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{= r_{nk}}   (11.23b)

= \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)^\top \Sigma_k^{-1}.   (11.23c)

Here, we used the identity from (11.16) and the result of the partial derivative in (11.22) to get to (11.23b). The values rnk are the responsibilities we defined in (11.17).
We now solve (11.23c) for µk^new so that ∂L(µk^new)/∂µk = 0^⊤ and obtain

\sum_{n=1}^{N} r_{nk} x_n = \sum_{n=1}^{N} r_{nk}\, \mu_k^{\text{new}} \iff \mu_k^{\text{new}} = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n,   (11.24)

where we defined

N_k := \sum_{n=1}^{N} r_{nk}   (11.25)

as the total responsibility of the k-th mixture component for the entire dataset. This concludes the proof of Theorem 11.1.

Intuitively, (11.20) can be interpreted as an importance-weighted Monte-Carlo estimate of the mean, where the importance weights of data point xn are the responsibilities rnk of the k-th cluster for xn, k = 1, . . . , K.


Therefore, the mean µk is pulled toward a data point xn with a strength given by rnk. The means are pulled more strongly toward data points for which the corresponding mixture component has a high responsibility, i.e., a high likelihood. Figure 11.4 illustrates this. We can also interpret the mean update in (11.20) as the expected value of all data points under the distribution given by

r_k := [r_{1k}, \ldots, r_{Nk}]^\top / N_k,   (11.26)

which is a normalized probability vector, i.e.,

\mu_k \leftarrow \mathbb{E}_{r_k}[\mathcal{X}].   (11.27)

Figure 11.4 Update of the mean parameter of a mixture component in a GMM. The mean µ is being pulled toward individual data points with the weights given by the corresponding responsibilities.

Example 11.3 (Mean Updates)

Figure 11.5 Effect of updating the mean values in a GMM. (a) GMM before updating the mean values; (b) GMM after updating the mean values µk while retaining the variances and mixture weights.
In our example from Figure 11.3, the mean values are updated as follows:

\mu_1: -4 \to -2.7,   (11.28)
\mu_2: 0 \to -0.4,   (11.29)
\mu_3: 8 \to 3.7.   (11.30)

Here, we see that the means of the first and third mixture components move toward the regime of the data, whereas the mean of the second component does not change so dramatically. Figure 11.5 illustrates this change, where Figure 11.5(a) shows the GMM density prior to updating the means and Figure 11.5(b) shows the GMM density after updating the mean values µk.

The update of the mean parameters in (11.20) looks fairly straightforward. However, note that the responsibilities rnk are a function of πj, µj, Σj for all j = 1, . . . , K, such that the updates in (11.20) depend on all parameters of the GMM, and a closed-form solution, which we obtained for linear regression in Section 9.2 or PCA in Chapter 10, cannot be obtained.
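A minimal sketch of the mean update (11.20) for the one-dimensional running example, assuming the responsibilities matrix r and the data X from the earlier sketches (the function name is our own):

```python
import numpy as np

def update_means(X, r):
    """mu_k_new = sum_n r_nk x_n / sum_n r_nk, see (11.20), for 1-D data."""
    N_k = r.sum(axis=0)                      # total responsibility per component, (11.25)
    return (r * X[:, None]).sum(axis=0) / N_k

# With the responsibilities from (11.19), update_means(X, r) gives roughly
# [-2.7, -0.4, 3.7], matching (11.28)-(11.30).
```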


11.2.3 Updating the Covariances


Theorem 11.2 (Updates of the GMM Covariances). The update of the covariance parameters Σk, k = 1, . . . , K, of the GMM is given by

\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top,   (11.31)

where rnk and Nk are defined in (11.17) and (11.25), respectively.

Proof To prove Theorem 11.2, our approach is to compute the partial derivatives of the log-likelihood L with respect to the covariances Σk, set them to 0, and solve for Σk. We start with our general approach

\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k}.   (11.32)

We already know 1/p(xn | θ) from (11.16). To obtain the remaining partial derivative ∂p(xn | θ)/∂Σk, we write down the definition of the Gaussian distribution p(xn | θ), see (11.9), and drop all terms but the k-th. We then obtain
\frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k}   (11.33a)

= \pi_k \frac{\partial}{\partial \Sigma_k}\Big( (2\pi)^{-\frac{D}{2}} \det(\Sigma_k)^{-\frac{1}{2}} \exp\big(-\tfrac{1}{2}(x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k)\big) \Big)   (11.33b)

= \pi_k (2\pi)^{-\frac{D}{2}} \Big[ \frac{\partial}{\partial \Sigma_k} \det(\Sigma_k)^{-\frac{1}{2}} \exp\big(-\tfrac{1}{2}(x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k)\big) + \det(\Sigma_k)^{-\frac{1}{2}} \frac{\partial}{\partial \Sigma_k} \exp\big(-\tfrac{1}{2}(x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k)\big) \Big].   (11.33c)
We now use the identities

\frac{\partial}{\partial \Sigma_k} \det(\Sigma_k)^{-\frac{1}{2}} \stackrel{(5.100)}{=} -\tfrac{1}{2} \det(\Sigma_k)^{-\frac{1}{2}} \Sigma_k^{-1},   (11.34)

\frac{\partial}{\partial \Sigma_k} (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) \stackrel{(5.105)}{=} -\Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}   (11.35)

and obtain (after some re-arranging) the desired partial derivative required in (11.32) as

\frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k} = \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \cdot \big[ -\tfrac{1}{2}\big(\Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\big) \big].   (11.36)

Putting everything together, the partial derivative of the log-likelihood with respect to Σk is given by

\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k}   (11.37a)

= \sum_{n=1}^{N} \underbrace{\frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{= r_{nk}} \cdot \big[ -\tfrac{1}{2}\big(\Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\big) \big]   (11.37b)

= -\frac{1}{2} \sum_{n=1}^{N} r_{nk} \big(\Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\big)   (11.37c)

= -\frac{1}{2} \Sigma_k^{-1} \underbrace{\sum_{n=1}^{N} r_{nk}}_{= N_k} + \frac{1}{2} \Sigma_k^{-1} \Big( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \Big) \Sigma_k^{-1}.   (11.37d)
We see that the responsibilities rnk also appear in this partial derivative. Setting this partial derivative to 0, we obtain the necessary optimality condition

N_k \Sigma_k^{-1} = \Sigma_k^{-1} \Big( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \Big) \Sigma_k^{-1}   (11.38a)

\iff N_k \mathbf{I} = \Big( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \Big) \Sigma_k^{-1}.   (11.38b)

By solving for Σk, we obtain

\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top,   (11.39)

where r_k is the probability vector defined in (11.26). This gives us a simple update rule for Σk for k = 1, . . . , K and proves Theorem 11.2.

Similar to the update of µk in (11.20), we can interpret the update of the covariance in (11.31) as an importance-weighted expected value of the square of the centered data X̃k := {x1 − µk, . . . , xN − µk}.

Example 11.4 (Variance Updates)


In our example from Figure 11.3, the variances are updated as follows:

\sigma_1^2: 1 \to 0.14,   (11.40)
\sigma_2^2: 0.2 \to 0.44,   (11.41)
\sigma_3^2: 3 \to 1.53.   (11.42)


Here, we see that the variances of the first and third components shrink significantly, while the variance of the second component increases slightly. Figure 11.6 illustrates this setting. Figure 11.6(a) is identical (but zoomed in) to Figure 11.5(b) and shows the GMM density and its individual components prior to updating the variances. Figure 11.6(b) shows the GMM density after updating the variances.
Figure 11.6 Effect of updating the variances in a GMM. (a) GMM before updating the variances; (b) GMM after updating the variances while retaining the means and mixture weights.

Similar to the update of the mean parameters, we can interpret (11.31) as a Monte-Carlo estimate of the weighted covariance of data points xn associated with the k-th mixture component, where the weights are the responsibilities rnk. As with the updates of the mean parameters, this update depends on all πj, µj, Σj, j = 1, . . . , K, through the responsibilities rnk, which prohibits a closed-form solution.
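In the one-dimensional running example, the covariance update (11.31) reduces to a responsibility-weighted average of squared deviations from the updated means. A sketch with our own function name, assuming r and X from the earlier sketches:

```python
import numpy as np

def update_variances(X, r, means_new):
    """sigma_k^2 = (1/N_k) sum_n r_nk (x_n - mu_k)^2, the 1-D case of (11.31)."""
    N_k = r.sum(axis=0)
    sq_dev = (X[:, None] - means_new) ** 2          # shape (N, K)
    return (r * sq_dev).sum(axis=0) / N_k

# With the responsibilities from (11.19) and the updated means from (11.28)-(11.30),
# this yields roughly [0.14, 0.44, 1.53], matching (11.40)-(11.42).
```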

11.2.4 Updating the Mixture Weights


Theorem 11.3 (Update of the GMM Mixture Weights). The mixture weights of the GMM are updated as

\pi_k^{\text{new}} = \frac{N_k}{N}   (11.43)

for k = 1, . . . , K, where N is the number of data points and Nk is defined in (11.25).

Proof To find the partial derivative of the log-likelihood with respect to the weight parameters πk, k = 1, . . . , K, we take the constraint \sum_k \pi_k = 1 into account by using Lagrange multipliers (see Section 7.2). The Lagrangian is

\mathfrak{L} = \mathcal{L} + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big)   (11.44a)
= \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big),   (11.44b)

where L is the log-likelihood from (11.10) and the second term encodes the equality constraint that all the mixture weights need to sum up to 1. We obtain the partial derivative with respect to πk as

\frac{\partial \mathfrak{L}}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda   (11.45a)

= \frac{1}{\pi_k} \underbrace{\sum_{n=1}^{N} \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{= N_k} + \lambda = \frac{N_k}{\pi_k} + \lambda,   (11.45b)

and the partial derivative with respect to the Lagrange multiplier λ as

\frac{\partial \mathfrak{L}}{\partial \lambda} = \sum_{k=1}^{K} \pi_k - 1.   (11.46)
Setting both partial derivatives to 0 (necessary condition for optimum) yields the system of equations

\pi_k = -\frac{N_k}{\lambda},   (11.47)

1 = \sum_{k=1}^{K} \pi_k.   (11.48)

Using (11.47) in (11.48) and solving for πk, we obtain

\sum_{k=1}^{K} \pi_k = 1 \iff -\sum_{k=1}^{K} \frac{N_k}{\lambda} = 1 \iff -\frac{N}{\lambda} = 1 \iff \lambda = -N.   (11.49)

This allows us to substitute −N for λ in (11.47) to obtain

\pi_k^{\text{new}} = \frac{N_k}{N},   (11.50)
which gives us the update for the weight parameters πk and proves Theorem 11.3.

We can identify the mixture weight in (11.43) as the ratio of the total responsibility of the k-th cluster and the number of data points. Since N = \sum_k N_k, the number of data points can also be interpreted as the total responsibility of all mixture components together, such that πk is the relative importance of the k-th mixture component for the dataset.

Remark. Since N_k = \sum_{n=1}^{N} r_{nk}, the update equation (11.43) for the mixture weights πk also depends on all πj, µj, Σj, j = 1, . . . , K via the responsibilities rnk. ♦


Example 11.5 (Weight Parameter Updates)

Figure 11.7 Effect of updating the mixture weights in a GMM. (a) GMM before updating the mixture weights; (b) GMM after updating the mixture weights while retaining the means and variances. Note the different scalings of the vertical axes.
In our running example from Figure 11.3, the mixture weights are updated as follows:

\pi_1: \tfrac{1}{3} \to 0.29,   (11.51)
\pi_2: \tfrac{1}{3} \to 0.29,   (11.52)
\pi_3: \tfrac{1}{3} \to 0.42.   (11.53)

Here we see that the third component gets more weight/importance, while the other components become slightly less important. Figure 11.7 illustrates the effect of updating the mixture weights. Figure 11.7(a) is identical to Figure 11.6(b) and shows the GMM density and its individual components prior to updating the mixture weights. Figure 11.7(b) shows the GMM density after updating the mixture weights.

Overall, having updated the means, the variances, and the weights once, we obtain the GMM shown in Figure 11.7(b). Compared with the initialization shown in Figure 11.3, we can see that the parameter updates caused the GMM density to shift some of its mass toward the data points. After updating the means, variances, and weights once, the GMM fit in Figure 11.7(b) is already remarkably better than its initialization from Figure 11.3. This is also evidenced by the log-likelihood values, which increased from −28.3 (initialization) to −14.4 after one complete update cycle.

11.3 EM Algorithm


Unfortunately, the updates in (11.20), (11.31), and (11.43) do not constitute a closed-form solution for the updates of the parameters µk, Σk, πk of the mixture model because the responsibilities rnk depend on those parameters in a complex way. However, the results suggest a simple iterative scheme for finding a solution to the parameter estimation problem via maximum likelihood.


The Expectation Maximization algorithm (EM algorithm) was proposed by Dempster et al. (1977) and is a general iterative scheme for learning parameters (maximum likelihood or MAP) in mixture models and, more generally, latent-variable models.

In our example of the Gaussian mixture model, we choose initial values for µk, Σk, πk and alternate until convergence between

• E-step: Evaluate the responsibilities rnk (the posterior probability of data point n belonging to mixture component k).
• M-step: Use the updated responsibilities to re-estimate the parameters µk, Σk, πk.

Every step in the EM algorithm increases the log-likelihood function (Neal and Hinton, 1999). For convergence, we can check the log-likelihood or the parameters directly. A concrete instantiation of the EM algorithm for estimating the parameters of a GMM is as follows:
1. Initialize µk, Σk, πk.
2. E-step: Evaluate the responsibilities rnk for every data point xn using the current parameters πk, µk, Σk:

r_{nk} = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.   (11.54)

3. M-step: Re-estimate the parameters πk, µk, Σk using the current responsibilities rnk (from the E-step):

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n,   (11.55)

\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top,   (11.56)

\pi_k = \frac{N_k}{N}.   (11.57)
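These three steps translate almost line by line into code. The following is a minimal NumPy sketch for a one-dimensional GMM; the function name em_gmm and the convergence check via the log-likelihood (11.10) are our own choices, not prescribed by the book.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm(X, means, variances, weights, max_iter=100, tol=1e-6):
    """EM for a 1-D GMM: alternate the E-step (11.54) and the M-step (11.55)-(11.57)."""
    log_lik = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities r[n, k], see (11.54)
        weighted = weights * gauss_pdf(X[:, None], means, variances)
        r = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: re-estimate means, variances, and weights, see (11.55)-(11.57)
        N_k = r.sum(axis=0)
        means = (r * X[:, None]).sum(axis=0) / N_k
        variances = (r * (X[:, None] - means) ** 2).sum(axis=0) / N_k
        weights = N_k / X.shape[0]

        # Convergence check via the log-likelihood (11.10)
        new_log_lik = np.log(
            (weights * gauss_pdf(X[:, None], means, variances)).sum(axis=1)).sum()
        if np.abs(new_log_lik - log_lik) < tol:
            log_lik = new_log_lik
            break
        log_lik = new_log_lik
    return means, variances, weights, log_lik

# Running example: data and initialization from Example 11.1
X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
means, variances, weights, log_lik = em_gmm(
    X, means=np.array([-4.0, 0.0, 8.0]),
    variances=np.array([1.0, 0.2, 3.0]),
    weights=np.full(3, 1.0 / 3.0))
```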

Example 11.6 (GMM Fit)

Figure 11.8 EM algorithm applied to the GMM from Figure 11.2. (a) Final GMM fit: after five iterations, the EM algorithm converges and returns this mixture density. (b) Negative log-likelihood as a function of the EM iterations.
Figure 11.9 Illustration of the EM algorithm for fitting a Gaussian mixture model with three components to a two-dimensional dataset. (a) Dataset; (b) negative log-likelihood (lower is better) as a function of the EM iterations. The red dots indicate the iterations for which the corresponding GMM fits are shown in (c)–(f): (c) EM initialization; (d) EM after 1 iteration; (e) EM after 10 iterations; (f) EM after 63 iterations. The yellow discs indicate the means of the Gaussian distributions.

When we run EM on our example from Figure 11.3, we obtain the final result shown in Figure 11.8(a) after five iterations, and Figure 11.8(b) shows how the negative log-likelihood evolves as a function of the EM iterations. The final GMM is given as

p(x) = 0.29\,\mathcal{N}(x \mid -2.75, 0.06) + 0.28\,\mathcal{N}(x \mid -0.50, 0.25) + 0.43\,\mathcal{N}(x \mid 3.64, 1.63).   (11.58)

We applied the EM algorithm to the two-dimensional dataset shown in Figure 11.1 with K = 3 mixture components. Figure 11.9 illustrates some steps of the EM algorithm and the negative log-likelihood as a function of the EM iteration (Figure 11.9(b)).


Figure 11.10 Dataset colored according to the responsibilities of the mixture components when EM converges. While a single mixture component is clearly responsible for the data on the left, the overlap of the two data clusters on the right could have been generated by two mixture components.

Figure 11.10 visualizes the final responsibilities of the mixture components for the data points. It becomes clear that there are data points that cannot be uniquely assigned to a single (either blue or yellow) component, such that the responsibilities of these two clusters for those points are around 0.5.

11.4 Latent Variable Perspective


We can look at the GMM from the perspective of a discrete latent variable model, i.e., where the latent variable z can attain only a finite set of values. This is in contrast to PCA, where the latent variables were continuous-valued numbers in R^M.

The advantages of the probabilistic perspective are that (i) it will justify some ad-hoc decisions we made in the previous sections, (ii) it allows for a concrete interpretation of the responsibilities as posterior probabilities, and (iii) the iterative algorithm for updating the model parameters can be derived in a principled manner as the EM algorithm for maximum likelihood parameter estimation in latent variable models.

11.4.1 Generative Process and Probabilistic Model


To derive the probabilistic model for GMMs, it is useful to think about the generative process, i.e., the process that allows us to generate data, using a probabilistic model.

We assume a mixture model with K components and that a data point x can be generated by exactly one mixture component. We introduce a binary indicator variable zk ∈ {0, 1} with two states (see Section 6.2) that indicates whether the k-th mixture component generated that data point, so that

p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k).   (11.59)


We define z := [z_1, . . . , z_K]^\top ∈ R^K as a probability vector consisting of K − 1 many 0s and exactly one 1. For example, for K = 3, a valid z would be z = [z_1, z_2, z_3]^\top = [0, 1, 0]^\top, which would select the second mixture component since z_2 = 1. The properties of z imply that \sum_{k=1}^{K} z_k = 1. Therefore, z is a one-hot encoding (also: 1-of-K representation).

Thus far, we assumed that the indicator variables zk are known. However, in practice, this is not the case, and we place a prior distribution on z. Given that we do not know which mixture component generated the data point, we treat the indicator vector z as a latent variable and place a prior

p(z) = \pi = [\pi_1, \ldots, \pi_K]^\top, \quad \sum_{k=1}^{K} \pi_k = 1,   (11.60)

on it. (Sometimes this kind of probability distribution is called "multinoulli", a generalization of the Bernoulli distribution to more than two values (Murphy, 2012).) Then the k-th entry

\pi_k = p(z_k = 1)   (11.61)
of this probability vector describes the probability that the k-th mixture component generated data point x.

Figure 11.11 Graphical model for a GMM with a single data point.

Remark (Sampling from a GMM). The construction of this latent variable model (see the corresponding graphical model in Figure 11.11) lends itself to a very simple sampling procedure (generative process) to generate data:

1. Sample z^{(i)} ∼ p(z).
2. Sample x^{(i)} ∼ p(x | z^{(i)}).

In the first step, we select a mixture component i (via the one-hot encoding z) at random according to p(z) = π; in the second step we draw a sample from the corresponding mixture component. When we discard the samples of the latent variable so that we are left with the x^{(i)}, we have valid samples from the GMM. This kind of sampling, where samples of random variables depend on samples from the variable's parents in the graphical model, is called ancestral sampling. ♦
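Ancestral sampling from a one-dimensional GMM can be sketched in a few lines; the function name sample_gmm and the use of NumPy's default_rng are our own choices.

```python
import numpy as np

def sample_gmm(n_samples, weights, means, variances, seed=0):
    """Ancestral sampling: draw z ~ p(z) = pi, then x ~ N(mu_k, sigma_k^2) for the selected k."""
    rng = np.random.default_rng(seed)
    # Step 1: sample the latent indicator (equivalently, the component index) according to pi
    ks = rng.choice(len(weights), size=n_samples, p=weights)
    # Step 2: sample from the selected Gaussian component; the latent samples are then discarded
    return rng.normal(loc=means[ks], scale=np.sqrt(variances[ks]))

# e.g., samples from the mixture in (11.5)
x = sample_gmm(1000, np.array([0.5, 0.2, 0.3]), np.array([-2.0, 1.0, 4.0]), np.array([0.5, 2.0, 1.0]))
```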
Generally, a probabilistic model is defined by the joint distribution of the data and the latent variables (see Section 8.3). With the prior p(z) defined in (11.60)–(11.61) and the conditional p(x | z) from (11.59), we obtain all K components of this joint distribution via

p(x, z_k = 1) = p(x \mid z_k = 1)\, p(z_k = 1) = \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)   (11.62)

for k = 1, . . . , K, so that

p(x, z) = \begin{bmatrix} p(x, z_1 = 1) \\ \vdots \\ p(x, z_K = 1) \end{bmatrix} = \begin{bmatrix} \pi_1 \mathcal{N}(x \mid \mu_1, \Sigma_1) \\ \vdots \\ \pi_K \mathcal{N}(x \mid \mu_K, \Sigma_K) \end{bmatrix},   (11.63)

which fully specifies the probabilistic model.


11.4.2 Likelihood


To obtain the likelihood p(x | θ) in a latent variable model, we need to marginalize out the latent variables. In our case, this can be done by summing out all latent variables from the joint p(x, z) in (11.63), so that

p(x \mid \theta) = \sum_{z} p(x \mid \theta, z)\, p(z \mid \theta), \qquad \theta := \{\mu_k, \Sigma_k, \pi_k : k = 1, \ldots, K\}.   (11.64)

We now explicitly condition on the parameters θ of the probabilistic model, which we previously omitted. In (11.64), we sum over all K possible one-hot encodings of z, which is denoted by \sum_z. Since there is only a single non-zero entry in each z, there are only K possible configurations/settings of z. For example, if K = 3, then z can have the configurations

\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}.   (11.65)
Summing over all possible configurations of z in (11.64) is equivalent to looking at the non-zero entry of the z-vector and writing

p(x \mid \theta) = \sum_{z} p(x \mid \theta, z)\, p(z \mid \theta)   (11.66a)

= \sum_{k=1}^{K} p(x \mid \theta, z_k = 1)\, p(z_k = 1 \mid \theta),   (11.66b)

so that the desired marginal distribution is given as

p(x \mid \theta) \stackrel{(11.66b)}{=} \sum_{k=1}^{K} p(x \mid \theta, z_k = 1)\, p(z_k = 1 \mid \theta)   (11.67a)

= \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k),   (11.67b)

which we identify as the GMM model from (11.3). Given a dataset X, we immediately obtain the likelihood

p(\mathcal{X} \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta) \stackrel{(11.67b)}{=} \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k),   (11.68)

which is exactly the GMM likelihood from (11.9). Therefore, the latent variable model with latent indicators zk is an equivalent way of thinking about a Gaussian mixture model.

11.4.3 Posterior Distribution


Let us have a brief look at the posterior distribution on the latent variable z.


Figure 11.12 Graphical model for a GMM with N data points.

According to Bayes' theorem, the posterior of the k-th component having generated data point x is

p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{p(x)},   (11.69)

where the marginal p(x) is given in (11.67b). This yields the posterior distribution for the k-th indicator variable zk:

p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)},   (11.70)

which we identify as the responsibility of the k-th mixture component for data point x. Note that we omitted the explicit conditioning on the GMM parameters πk, µk, Σk, where k = 1, . . . , K.

11.4.4 Extension to a Full Dataset


Thus far, we have only discussed the case where the dataset consists of a single data point x. However, the concepts of the prior and the posterior can be directly extended to the case of N data points X := {x1, . . . , xN}.

In the probabilistic interpretation of the GMM, every data point xn possesses its own latent variable

z_n = [z_{n1}, \ldots, z_{nK}]^\top \in \mathbb{R}^K.   (11.71)

Previously (when we only considered a single data point x) we omitted the index n, but now this becomes important.

We share the same prior distribution π across all latent variables z_n. The corresponding graphical model is shown in Figure 11.12, where we use the plate notation.

The conditional distribution p(x_1, . . . , x_N | z_1, . . . , z_N) factorizes over the data points and is given as

p(x_1, \ldots, x_N \mid z_1, \ldots, z_N) = \prod_{n=1}^{N} p(x_n \mid z_n).   (11.72)

To obtain the posterior distribution p(z_{nk} = 1 | x_n), we follow the same reasoning as in Section 11.4.3 and apply Bayes' theorem to obtain

p(z_{nk} = 1 \mid x_n) = \frac{p(x_n \mid z_{nk} = 1)\, p(z_{nk} = 1)}{\sum_{j=1}^{K} p(x_n \mid z_{nj} = 1)\, p(z_{nj} = 1)}   (11.73a)

= \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} = r_{nk}.   (11.73b)

This means that p(z_{nk} = 1 | x_n) is the (posterior) probability that the k-th mixture component generated data point xn and corresponds to the responsibility rnk we introduced in (11.17). Now, the responsibilities have not only an intuitive but also a mathematically justified interpretation as posterior probabilities.

11.4.5 EM Algorithm Revisited


The EM algorithm that we introduced as an iterative scheme for maximum likelihood estimation can be derived in a principled way from the latent variable perspective. Given a current setting θ^{(t)} of model parameters, the E-step calculates the expected log-likelihood

Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \mid x, \theta^{(t)}}[\log p(x, z \mid \theta)]   (11.74a)

= \int \log p(x, z \mid \theta)\, p(z \mid x, \theta^{(t)})\, \mathrm{d}z,   (11.74b)

where the expectation of log p(x, z | θ) is taken with respect to the posterior p(z | x, θ^{(t)}) of the latent variables. The M-step selects an updated set of model parameters θ^{(t+1)} by maximizing (11.74b).

Although an EM iteration does increase the log-likelihood, there are no guarantees that EM converges to the maximum likelihood solution. It is possible that the EM algorithm converges to a local maximum of the log-likelihood. Different initializations of the parameters θ could be used in multiple EM runs to reduce the risk of ending up in a bad local optimum. We do not go into further details here, but refer to the excellent expositions by Rogers and Girolami (2016) and Bishop (2006).
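Such a multiple-restart strategy can be sketched as follows, assuming the em_gmm function from the sketch in Section 11.3; choosing random data points as initial means and the data variance as initial spread is our own illustrative initialization, not a recommendation from the book.

```python
import numpy as np

def em_with_restarts(X, K, n_restarts=10, seed=0):
    """Run EM from several random initializations and keep the best local optimum."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        means0 = rng.choice(X, size=K, replace=False)     # random data points as initial means
        variances0 = np.full(K, X.var())                  # data variance as initial spread
        weights0 = np.full(K, 1.0 / K)
        params = em_gmm(X, means0, variances0, weights0)  # returns (means, variances, weights, log_lik)
        if best is None or params[-1] > best[-1]:         # keep the run with the highest log-likelihood
            best = params
    return best
```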

11.5 Further Reading


The GMM can be considered a generative model in the sense that it is straightforward to generate new data using ancestral sampling (Bishop, 2006). For given GMM parameters πk, µk, Σk, k = 1, . . . , K, we sample an index k from the probability vector [π1, . . . , πK]^⊤ and then sample a data point x ∼ N(µk, Σk). If we repeat this N times, we obtain a dataset that has been generated by a GMM. Figure 11.1 was generated using this procedure.

Throughout this chapter, we assumed that the number of components K is known. In practice, this is often not the case. However, we could use nested cross-validation, as discussed in Section 8.5.1, to find good models.

Gaussian mixture models are closely related to the K-means clustering algorithm. K-means also uses the EM algorithm to assign data points to clusters. If we treat the means in the GMM as cluster centers and ignore the covariances, we arrive at K-means. As also nicely described by MacKay (2003a), K-means makes a "hard" assignment of data points to cluster centers µk, whereas a GMM makes a "soft" assignment via the responsibilities.

We only touched upon the latent variable perspective of GMMs and the EM algorithm. Note that EM can be used for parameter learning in general latent variable models, e.g., nonlinear state-space models (Ghahramani and Roweis, 1999; Roweis and Ghahramani, 1999), and for reinforcement learning as discussed by Barber (2012). Therefore, the latent variable perspective of a GMM is useful to derive the corresponding EM algorithm in a principled way (Bishop, 2006; Barber, 2012; Murphy, 2012).

We only discussed maximum likelihood estimation (via the EM algorithm) for finding GMM parameters. The standard criticisms of maximum likelihood also apply here:
• As in linear regression, maximum likelihood can suffer from severe overfitting. In the GMM case, this happens when the mean of a mixture component is identical to a data point and the covariance tends to 0. Then, the likelihood approaches infinity. Bishop (2006) and Barber (2012) discuss this issue in detail.
• We only obtain a point estimate of the parameters πk, µk, Σk for k = 1, . . . , K, which does not give any indication of uncertainty in the parameter values. A Bayesian approach would place a prior on the parameters, which can be used to obtain a posterior distribution on the parameters. This posterior allows us to compute the model evidence (marginal likelihood), which can be used for model comparison and gives us a principled way to determine the number of mixture components. Unfortunately, closed-form inference is not possible in this setting because there is no conjugate prior for this model. However, approximations, such as variational inference, can be used to obtain an approximate posterior (Bishop, 2006).
In this chapter, we discussed mixture models for density estimation. There is a plethora of density estimation techniques available. In practice, we often use histograms and kernel density estimation.

Histograms provide a non-parametric way to represent continuous densities and were proposed by Pearson (1895). A histogram is constructed by "binning" the data space and counting how many data points fall into each bin. Then a bar is drawn at the center of each bin, and the height of the bar is proportional to the number of data points within that bin.
Draft (2018-11-08) from Mathematics for Machine Learning. Errata and feedback to https://mml-book.com.
11.5 Further Reading 361
0.30 Figure 11.13
Data Histogram (orange
0.25 KDE bars) and kernel
Histogram density estimation
0.20
(blue line). The
p(x)

0.15
kernel density
estimator produces
0.10 a smooth estimate
of the underlying
0.05 density, whereas the
histogram is an
0.00 unsmoothed count
−4 −2 0 2 4 6 8
x measure of how
many data points
(black) fall into a
6547 bin size is a critical hyper-parameter, and a bad choice can lead to overfit- single bin.
6548 ting and underfitting. Cross-validation, as discussed in Section 8.1.4, can
6549 be used to determine a good bin size.
Kernel density estimation, independently proposed by Rosenblatt (1956) kernel density
and Parzen (1962), is a nonparametric way for density estimation. Given estimation
N i.i.d. samples, the kernel density estimator represents the underlying
distribution as
N
1 X x − xn
 
p(x) = k , (11.75)
N h n=1 h
6550 where k is a kernel function, i.e., a non-negative function that integrates
6551 to 1 and h > 0 is a smoothing/bandwidth parameter, which plays a simi-
6552 lar role as the bin size in histograms. Note that we place a kernel on every
6553 single data point xn in the dataset. Commonly used kernel functions are
6554 the uniform distribution and the Gaussian distribution. Kernel density esti-
6555 mates are closely related to histograms, but by choosing a suitable kernel,
6556 we can guarantee smoothness of the density estimate. Figure 11.13 illus-
6557 trates the difference between a histogram and a kernel density estimator
6558 (with a Gaussian-shaped kernel) for a given dataset of 250 data points.

