Chapter 11
Density Estimation with Gaussian Mixture Models
Figure 11.1: Two-dimensional dataset that cannot be meaningfully represented by a Gaussian. (Axes: x1, x2.)
11.1 Gaussian Mixture Model
In practice, the Gaussian distribution (and, similarly, all other distributions we have encountered so far) has limited modeling capabilities. For example, a Gaussian approximation of the density that generated the data in Figure 11.1 would be a poor approximation. In the following, we will look at a more expressive family of distributions, which we can use for density estimation: mixture models.
Mixture models can be used to describe a distribution p(x) by a convex combination of K simple (base) distributions

p(x) = \sum_{k=1}^{K} \pi_k \, p_k(x),   (11.1)

0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1,   (11.2)

where the \pi_k are the mixture weights. A Gaussian mixture model (GMM) is a mixture model in which every component density p_k is a Gaussian \mathcal{N}(x \mid \mu_k, \Sigma_k) with mean \mu_k and covariance \Sigma_k, so that

p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),   (11.3)

0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1,   (11.4)

where \theta := \{\mu_k, \Sigma_k, \pi_k : k = 1, \ldots, K\} collects the parameters of the model.
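To make (11.3) concrete, the following sketch evaluates a one-dimensional GMM density as a convex combination of Gaussian components. The parameter values are illustrative only (they are not taken from the book's figures), and SciPy's norm.pdf is used for the Gaussian density.

```python
import numpy as np
from scipy.stats import norm

def gmm_density(x, weights, means, stds):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, sigma_k^2) at the points x."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0), "weights must form a convex combination"
    # Weighted sum of the individual Gaussian component densities.
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, stds))

# Illustrative three-component GMM evaluated on a grid.
x = np.linspace(-7.5, 15.0, 500)
p = gmm_density(x, weights=[0.3, 0.3, 0.4], means=[-4.0, 0.0, 8.0], stds=[1.0, 0.5, 1.5])
```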
Figure 11.2: The GMM density is a convex combination of Gaussian distributions and is more expressive than any individual component. Dashed lines represent the weighted Gaussian components. (Axes: x, p(x).)
11.2 Parameter Learning via Maximum Likelihood
We aim to find parameters \theta^*_{ML} that maximize the log-likelihood L defined in (11.10). Our "normal" procedure would be to compute the gradient dL/d\theta of the log-likelihood with respect to the model parameters \theta, set it to 0, and solve for \theta. However, unlike our previous examples for maximum likelihood estimation (e.g., when we discussed linear regression in Section 9.2), we cannot obtain a closed-form solution. Instead, we can exploit an iterative scheme to find good model parameters \theta_{ML}, which will turn out to be the EM algorithm for Gaussian mixture models. The key idea is to update one model parameter at a time while keeping the others fixed.
Remark. If we were to consider a single Gaussian as the desired density, the sum over k in (11.10) vanishes, and the log can be applied directly to the Gaussian component, such that we get

\log \mathcal{N}(x \mid \mu, \Sigma) = -\frac{D}{2}\log(2\pi) - \frac{1}{2}\log\det(\Sigma) - \frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu).   (11.11)

This simple form allows us to find closed-form maximum likelihood estimates of \mu and \Sigma, as discussed in Chapter 8. In (11.10), we cannot move
the log into the sum over k, which is why we cannot obtain a simple closed-form maximum likelihood solution. ♦
Any local optimum of a function exhibits the property that its gradient with respect to the parameters must vanish (necessary condition); see Chapter 7. In our case, we obtain the following necessary conditions when we optimize the log-likelihood in (11.10) with respect to the GMM parameters \mu_k, \Sigma_k, \pi_k:

\frac{\partial \mathcal{L}}{\partial \mu_k} = 0^\top \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \mu_k} = 0^\top,   (11.12)

\frac{\partial \mathcal{L}}{\partial \Sigma_k} = 0 \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \Sigma_k} = 0,   (11.13)

\frac{\partial \mathcal{L}}{\partial \pi_k} = 0 \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \pi_k} = 0.   (11.14)

For all three necessary conditions, by applying the chain rule (see Section 5.2.2), we require partial derivatives of the form

\frac{\partial \log p(x_n \mid \theta)}{\partial \theta} = \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \theta},   (11.15)

where \theta = \{\mu_k, \Sigma_k, \pi_k, \; k = 1, \ldots, K\} are the model parameters and

\frac{1}{p(x_n \mid \theta)} = \frac{1}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.   (11.16)
The responsibilities satisfy \sum_{k} r_{nk} = 1 with r_{nk} \ge 0. This probability vector distributes probability mass among the K mixture components, and we can think of r_n as a "soft assignment" of x_n to the K mixture components. Therefore, the responsibility r_{nk} from (11.17) represents the probability that x_n has been generated by the kth mixture component.

Example 11.2 (Responsibilities)
For our example from Figure 11.3, we compute the responsibilities r_{nk}:

r = \begin{pmatrix} 1.0 & 0.0 & 0.0 \\ 1.0 & 0.0 & 0.0 \\ 0.057 & 0.943 & 0.0 \\ 0.001 & 0.999 & 0.0 \\ 0.0 & 0.066 & 0.934 \\ 0.0 & 0.0 & 1.0 \\ 0.0 & 0.0 & 1.0 \end{pmatrix} \in \mathbb{R}^{N \times K}.   (11.19)

Here, the nth row tells us the responsibilities of all mixture components for x_n. The sum of all K responsibilities for a data point (the sum of every row) is 1. The kth column gives us an overview of the responsibility of the kth mixture component. We can see that the third mixture component (third column) is not responsible for any of the first four data points, but takes most of the responsibility for the remaining data points. The sum of all entries of a column gives us the value N_k, i.e., the total responsibility of the kth mixture component. In our example, we get N_1 = 2.058, N_2 = 2.008, N_3 = 2.934.
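As a sanity check of the soft-assignment interpretation, the sketch below computes an N x K responsibility matrix for a one-dimensional GMM. The dataset and the initial standard deviations are hypothetical stand-ins (only the initial means -4, 0, 8 appear in Example 11.3), so the numbers will not exactly reproduce (11.19).

```python
import numpy as np
from scipy.stats import norm

def responsibilities(X, weights, means, stds):
    """r[n, k] = pi_k N(x_n | mu_k, sigma_k^2) / sum_j pi_j N(x_n | mu_j, sigma_j^2)."""
    dens = norm.pdf(np.asarray(X)[:, None], loc=means, scale=stds)  # (N, K) component densities
    weighted = np.asarray(weights) * dens                           # pi_k * N(x_n | mu_k, sigma_k^2)
    return weighted / weighted.sum(axis=1, keepdims=True)           # normalize each row to sum to 1

# Hypothetical dataset of N = 7 points; initial means as in Example 11.3.
X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
r = responsibilities(X, weights=[1/3, 1/3, 1/3], means=[-4.0, 0.0, 8.0], stds=[1.0, 1.0, 1.0])
N_k = r.sum(axis=0)   # column sums: total responsibility N_k of each component
```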
Remark. The update of the means \mu_k of the individual mixture components in (11.20) depends on all means, covariance matrices \Sigma_k, and mixture weights \pi_k via r_{nk} given in (11.17). Therefore, we cannot obtain a closed-form solution for all \mu_k at once. ♦
Proof From (11.15), we see that the gradient of the log-likelihood with respect to the mean parameters \mu_k, k = 1, \ldots, K, requires us to compute the partial derivative

\frac{\partial p(x_n \mid \theta)}{\partial \mu_k} = \sum_{j=1}^{K} \pi_j \frac{\partial \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\partial \mu_k} = \pi_k \frac{\partial \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\partial \mu_k}   (11.21)

= \pi_k (x_n - \mu_k)^\top \Sigma_k^{-1} \mathcal{N}(x_n \mid \mu_k, \Sigma_k),   (11.22)

where we exploited that only the kth mixture component depends on \mu_k.

We use our result from (11.22) in (11.15) and put everything together so that the desired partial derivative of L with respect to \mu_k is given as

\frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \mu_k} = \sum_{n=1}^{N} \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \mu_k}   (11.23a)

= \sum_{n=1}^{N} (x_n - \mu_k)^\top \Sigma_k^{-1} \underbrace{\frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{= r_{nk}}   (11.23b)

= \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)^\top \Sigma_k^{-1}.   (11.23c)

Here, we used the identity from (11.16) and the result of the partial derivative in (11.22) to get to (11.23b). The values r_{nk} are the responsibilities we defined in (11.17).

We now solve (11.23c) for \mu_k^{\text{new}} so that \frac{\partial \mathcal{L}(\mu_k^{\text{new}})}{\partial \mu_k} = 0^\top and obtain

\sum_{n=1}^{N} r_{nk} x_n = \sum_{n=1}^{N} r_{nk} \mu_k^{\text{new}} \iff \mu_k^{\text{new}} = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n,   (11.24)

where we defined

N_k := \sum_{n=1}^{N} r_{nk}   (11.25)

as the total responsibility of the kth mixture component for the entire dataset. This concludes the proof of Theorem 11.1.

Intuitively, (11.20) can be interpreted as an importance-weighted Monte-Carlo estimate of the mean, where the importance weights of data point x_n are the responsibilities r_{nk} of the kth cluster for x_n, k = 1, \ldots, K.
Therefore, the mean \mu_k is pulled toward a data point x_n with strength given by r_{nk}. The means are pulled more strongly toward data points for which the corresponding mixture component has a high responsibility, i.e., a high likelihood. Figure 11.4 illustrates this.

Figure 11.4: Update of the mean parameter of a mixture component in a GMM. The mean \mu is being pulled toward individual data points with the weights given by the corresponding responsibilities.

We can also interpret the mean update in (11.20) as the expected value of all data points under the distribution given by

r_k := [r_{1k}, \ldots, r_{Nk}]^\top / N_k,   (11.26)

which is a normalized probability vector, i.e.,

\mu_k \leftarrow \mathbb{E}_{r_k}[\mathcal{X}].   (11.27)

Example 11.3 (Mean Updates)
In our example from Figure 11.3, the mean values are updated as follows:

\mu_1: -4 \to -2.7   (11.28)
\mu_2: 0 \to -0.4   (11.29)
\mu_3: 8 \to 3.7   (11.30)

Here, we see that the means of the first and third mixture components move toward the regime of the data, whereas the mean of the second component does not change so dramatically. Figure 11.5 illustrates this change, where Figure 11.5(a) shows the GMM density prior to updating the means and Figure 11.5(b) shows the GMM density after updating the mean values \mu_k.
The update of the mean parameters in (11.20) looks fairly straightforward. However, note that the responsibilities r_{nk} are a function of \pi_j, \mu_j, \Sigma_j for all j = 1, \ldots, K, such that the updates in (11.20) depend on all parameters of the GMM, and a closed-form solution, of the kind we obtained for linear regression in Section 9.2 or PCA in Chapter 10, is not available.
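Even though no joint closed-form solution exists, the mean update itself is a one-line computation once the responsibilities are fixed. A minimal sketch (the array shapes are assumptions: X is (N, D), r is (N, K)):

```python
import numpy as np

def update_means(X, r):
    """Mean update mu_k_new = (1/N_k) sum_n r_nk x_n, cf. (11.24)."""
    N_k = r.sum(axis=0)              # (K,) total responsibility per component
    return (r.T @ X) / N_k[:, None]  # (K, D) responsibility-weighted averages of the data
```

Each row of the result is exactly the importance-weighted Monte-Carlo estimate described above.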
The update of the covariances in (11.31) is

\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top,   (11.31)

where r_{nk} and N_k are defined in (11.17) and (11.25), respectively.

After some re-arranging, we obtain the desired partial derivative required in (11.32) as

\frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k} = \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \cdot \big(-\tfrac{1}{2}(\Sigma_k^{-1} - \Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1})\big).   (11.36)
Using (11.36) in (11.15) and summing over all data points, the partial derivative of L with respect to \Sigma_k becomes

\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \sum_{n=1}^{N} \underbrace{\frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{= r_{nk}} \cdot \big(-\tfrac{1}{2}(\Sigma_k^{-1} - \Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1})\big)   (11.37b)

= -\frac{1}{2} \sum_{n=1}^{N} r_{nk} \big(\Sigma_k^{-1} - \Sigma_k^{-1}(x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\big)   (11.37c)

= -\frac{1}{2} \Sigma_k^{-1} \underbrace{\sum_{n=1}^{N} r_{nk}}_{= N_k} + \frac{1}{2} \Sigma_k^{-1} \Big(\sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top\Big) \Sigma_k^{-1}.   (11.37d)

We see that the responsibilities r_{nk} also appear in this partial derivative. Setting this partial derivative to 0, we obtain the necessary optimality condition

N_k \Sigma_k^{-1} = \Sigma_k^{-1} \Big(\sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top\Big) \Sigma_k^{-1}   (11.38a)

\iff N_k I = \Big(\sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top\Big) \Sigma_k^{-1}.   (11.38b)
Here, we see that the variances of the first and third components shrink significantly, whereas the variance of the second component increases slightly. Figure 11.6 illustrates this setting. Figure 11.6(a) is identical to Figure 11.5(b) (but zoomed in) and shows the GMM density and its individual components prior to updating the variances. Figure 11.6(b) shows the GMM density after updating the variances.
Figure 11.6: Effect of updating the variances in a GMM. (a) GMM density and individual components \pi_k \mathcal{N}(x \mid \mu_k, \sigma_k^2) prior to updating the variances; (b) GMM density after updating the variances while retaining the means and mixture weights. (Axes: x, p(x).)
Similar to the update of the mean parameters, we can interpret (11.31) as a Monte-Carlo estimate of the weighted covariance of data points x_n associated with the kth mixture component, where the weights are the responsibilities r_{nk}. As with the updates of the mean parameters, this update depends on all \pi_j, \mu_j, \Sigma_j, j = 1, \ldots, K, through the responsibilities r_{nk}, which prohibits a closed-form solution.
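In code, the covariance update from (11.31) is the responsibility-weighted scatter of the data around the (already updated) means, divided by N_k. A sketch under the same shape assumptions as before:

```python
import numpy as np

def update_covariances(X, r, means):
    """Covariance update Sigma_k_new = (1/N_k) sum_n r_nk (x_n - mu_k)(x_n - mu_k)^T."""
    N, D = X.shape
    K = r.shape[1]
    N_k = r.sum(axis=0)
    covs = np.zeros((K, D, D))
    for k in range(K):
        diff = X - means[k]                                 # (N, D) deviations from the k-th mean
        covs[k] = (r[:, k, None] * diff).T @ diff / N_k[k]  # weighted outer products, averaged by N_k
    return covs
```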
To account for the sum-to-one constraint on the mixture weights, we introduce a Lagrange multiplier \lambda and consider the Lagrangian

\mathfrak{L} = \mathcal{L} + \lambda\Big(\sum_{k=1}^{K} \pi_k - 1\Big)   (11.44a)

= \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) + \lambda\Big(\sum_{k=1}^{K} \pi_k - 1\Big),   (11.44b)

where L is the log-likelihood from (11.10) and the second term encodes the equality constraint that all the mixture weights need to sum up to 1. We obtain the partial derivative with respect to \pi_k as

\frac{\partial \mathfrak{L}}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda   (11.45a)

= \frac{1}{\pi_k} \underbrace{\sum_{n=1}^{N} \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{= N_k} + \lambda = \frac{N_k}{\pi_k} + \lambda.   (11.45b)
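Completing the calculation: setting (11.45b) to zero gives \pi_k = -N_k/\lambda, and summing over k while enforcing \sum_k \pi_k = 1 yields \lambda = -N, so that \pi_k = N_k/N. The remainder of that derivation is not shown above, so treat the following as a sketch of this standard result:

```python
import numpy as np

def update_weights(r):
    """Mixture-weight update pi_k_new = N_k / N, i.e., the average responsibility of component k."""
    return r.sum(axis=0) / r.shape[0]
```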
Figure 11.7: Effect of updating the mixture weights in a GMM: (a) GMM before updating the mixture weights; (b) GMM after updating the mixture weights. (Axes: x, p(x).)
11.3 EM Algorithm
The EM algorithm (expectation maximization algorithm) was proposed by Dempster et al. (1977) and is a general iterative scheme for learning parameters (maximum likelihood or MAP) in mixture models and, more generally, latent-variable models.

In our example of the Gaussian mixture model, we choose initial values for \mu_k, \Sigma_k, \pi_k and alternate until convergence between

• E-step: Evaluate the responsibilities r_{nk} (posterior probability of data point n belonging to mixture component k).
• M-step: Use the updated responsibilities to re-estimate the parameters \mu_k, \Sigma_k, \pi_k.

Every step in the EM algorithm increases the log-likelihood function (Neal and Hinton, 1999). For convergence, we can check the log-likelihood or the parameters directly. A concrete instantiation of the EM algorithm for estimating the parameters of a GMM is as follows:
1. Initialize \mu_k, \Sigma_k, \pi_k.
2. E-step: Evaluate the responsibilities r_{nk} for every data point x_n using the current parameters \pi_k, \mu_k, \Sigma_k:

r_{nk} = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.   (11.54)

3. M-step: Re-estimate the parameters \mu_k, \Sigma_k, \pi_k using the current responsibilities r_{nk}.
Figure 11.8: EM algorithm applied to the GMM from Figure 11.2. (a) Final GMM fit: after five iterations, the EM algorithm converges and returns this mixture density. (b) Negative log-likelihood as a function of the EM iterations. (Axes: x, p(x) in (a); iteration, negative log-likelihood in (b).)
Figure 11.9: Illustration of the EM algorithm for fitting a Gaussian mixture model with three components to a two-dimensional dataset. (a) Dataset; (b) negative log-likelihood (lower is better) as a function of the EM iterations. The red dots indicate the iterations for which the corresponding GMM fits are shown in (c)-(f). The yellow discs indicate the means of the Gaussian distributions. (Axes: x1, x2.)
When we run EM on our example from Figure 11.3, we obtain the final result shown in Figure 11.8(a) after five iterations, and Figure 11.8(b) shows how the negative log-likelihood evolves as a function of the EM iterations. The final GMM is given as

p(x) = 0.29 \, \mathcal{N}(x \mid -2.75, 0.06) + 0.28 \, \mathcal{N}(x \mid -0.50, 0.25) + 0.43 \, \mathcal{N}(x \mid 3.64, 1.63).   (11.58)
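Putting the E-step (11.54) and the M-step updates together gives the full EM loop for a one-dimensional GMM. The sketch below is not the book's implementation; it simply alternates the two steps and records the negative log-likelihood, which is non-increasing over the iterations.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(X, means, stds, weights, n_iter=5):
    """EM for a 1D GMM: alternate the E-step (responsibilities) and the M-step
    (means, variances, mixture weights); return parameters and negative log-likelihoods."""
    X = np.asarray(X, dtype=float)
    means, stds, weights = (np.asarray(a, dtype=float) for a in (means, stds, weights))
    nlls = []
    for _ in range(n_iter):
        # E-step: responsibilities r_nk as in (11.54), shape (N, K).
        dens = weights * norm.pdf(X[:, None], loc=means, scale=stds)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates given the responsibilities.
        N_k = r.sum(axis=0)
        means = (r * X[:, None]).sum(axis=0) / N_k
        variances = (r * (X[:, None] - means) ** 2).sum(axis=0) / N_k
        stds = np.sqrt(variances)
        weights = N_k / X.shape[0]
        # Track the negative log-likelihood for monitoring convergence.
        mix = (weights * norm.pdf(X[:, None], loc=means, scale=stds)).sum(axis=1)
        nlls.append(-np.log(mix).sum())
    return means, stds, weights, nlls
```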
11.4 Latent Variable Perspective
Figure 11.10: Dataset colored according to the responsibilities of the mixture components when EM converges. While a single mixture component is clearly responsible for the data on the left, the overlap of the two data clusters on the right could have been generated by two mixture components. (Axes: x1, x2.)

Figure 11.10 visualizes the final responsibilities of the mixture components for the data points. It becomes clear that there are data points that cannot be uniquely assigned to a single (either blue or yellow) component, such that the responsibilities of these two clusters for those points are around 0.5.
The indicator vector z consists of K - 1 many 0s and exactly one 1. For example, for K = 3, a valid z would be z = [z_1, z_2, z_3]^\top = [0, 1, 0]^\top, which would select the second mixture component since z_2 = 1. The properties of z imply that \sum_{k=1}^{K} z_k = 1. Therefore, z is a one-hot encoding (also: 1-of-K representation).

Thus far, we assumed that the indicator variables z_k are known. However, in practice this is not the case, and we place a prior distribution on z. Given that we do not know which mixture component generated the data point, we treat the indicator vector z as a latent variable and place a prior

p(z) = \pi = [\pi_1, \ldots, \pi_K]^\top, \quad \sum_{k=1}^{K} \pi_k = 1.   (11.60)

Sometimes this kind of probability distribution is called "multinoulli", a generalization of the Bernoulli distribution to more than two values (Murphy, 2012).
[Margin figure: graphical model with nodes \mu_k, \Sigma_k and x, plate k = 1, \ldots, K.]

In the first step, we select a mixture component i (via the one-hot encoding z) at random according to p(z) = \pi; in the second step, we draw a sample from the corresponding mixture component. When we discard the samples of the latent variable so that we are left with the x^{(i)}, we have valid samples from the GMM. This kind of sampling, where samples of random variables depend on samples from the variable's parents in the graphical model, is called ancestral sampling. ♦
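A sketch of ancestral sampling from a GMM: sample the latent indicator from p(z) = \pi first, then sample from the selected Gaussian, and keep only the x samples. Function and parameter names here are my own choice, not the book's.

```python
import numpy as np

def sample_gmm(n_samples, weights, means, covs, seed=None):
    """Ancestral sampling: z ~ p(z) = pi, then x ~ N(mu_z, Sigma_z); the z samples are discarded."""
    rng = np.random.default_rng(seed)
    K = len(weights)
    z = rng.choice(K, size=n_samples, p=weights)   # step 1: sample the mixture indicator
    # Step 2: sample from the Gaussian component selected by each indicator.
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])
```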
Generally, a probabilistic model is defined by the joint distribution of the data and the latent variables (see Section 8.3). With the prior p(z) defined in (11.60)-(11.61) and the conditional p(x | z) from (11.59), we obtain all K components of this joint distribution via

p(x, z_k = 1) = p(x \mid z_k = 1)\, p(z_k = 1) = \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)   (11.62)

for k = 1, \ldots, K, so that

p(x, z) = \begin{pmatrix} p(x, z_1 = 1) \\ \vdots \\ p(x, z_K = 1) \end{pmatrix} = \begin{pmatrix} \pi_1 \mathcal{N}(x \mid \mu_1, \Sigma_1) \\ \vdots \\ \pi_K \mathcal{N}(x \mid \mu_K, \Sigma_K) \end{pmatrix},   (11.63)

which fully specifies the probabilistic model.
which is exactly the GMM likelihood from (11.9). Therefore, the latent-variable model with latent indicators z_k is an equivalent way of thinking about a Gaussian mixture model.
Figure 11.12: Graphical model for a GMM with N data points. (Nodes: \pi, z_n, x_n, \mu_k, \Sigma_k; plates k = 1, \ldots, K and n = 1, \ldots, N.)
This means that p(z_k = 1 \mid x_n) is the (posterior) probability that the kth mixture component generated data point x_n, and it corresponds to the responsibility r_{nk} we introduced in (11.17). The responsibilities therefore have not only an intuitive but also a mathematically justified interpretation as posterior probabilities.
The expectation of \log p(x, z \mid \theta) in (11.74b) is taken with respect to the posterior p(z \mid x, \theta^{(t)}) of the latent variables. The M-step selects an updated set of model parameters \theta^{(t+1)} by maximizing (11.74b).

Although an EM iteration does increase the log-likelihood, there are no guarantees that EM converges to the maximum likelihood solution. It is possible that the EM algorithm converges to a local maximum of the log-likelihood. Different initializations of the parameters \theta can be used in multiple EM runs to reduce the risk of ending up in a bad local optimum. We do not go into further details here, but refer to the excellent expositions by Rogers and Girolami (2016) and Bishop (2006).
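A common practical remedy is the multiple-restart strategy mentioned above: run EM from several random initializations and keep the run with the lowest final negative log-likelihood. The sketch below reuses the hypothetical em_gmm_1d helper from the earlier EM sketch.

```python
import numpy as np

def em_with_restarts(X, K, n_restarts=10, n_iter=100, seed=None):
    """Run EM several times from random initializations and keep the best fit."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    best = None
    for _ in range(n_restarts):
        means0 = rng.choice(X, size=K, replace=False)   # random data points as initial means
        stds0 = np.full(K, X.std())
        weights0 = np.full(K, 1.0 / K)
        params = em_gmm_1d(X, means0, stds0, weights0, n_iter=n_iter)
        if best is None or params[-1][-1] < best[-1][-1]:   # compare final negative log-likelihoods
            best = params
    return best
```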
11.5 Further Reading
Figure 11.13: Histogram (orange bars) and kernel density estimate (blue line). The kernel density estimator produces a smooth estimate of the underlying density, whereas the histogram is an unsmoothed count measure of how many data points (black) fall into a single bin. (Axes: x, p(x).)

The bin size is a critical hyperparameter, and a bad choice can lead to overfitting and underfitting. Cross-validation, as discussed in Section 8.1.4, can be used to determine a good bin size.
Kernel density estimation, independently proposed by Rosenblatt (1956) and Parzen (1962), is a nonparametric method for density estimation. Given N i.i.d. samples, the kernel density estimator represents the underlying distribution as

p(x) = \frac{1}{N h} \sum_{n=1}^{N} k\Big(\frac{x - x_n}{h}\Big),   (11.75)

where k is a kernel function, i.e., a non-negative function that integrates to 1, and h > 0 is a smoothing/bandwidth parameter, which plays a similar role as the bin size in histograms. Note that we place a kernel on every single data point x_n in the dataset. Commonly used kernel functions are the uniform distribution and the Gaussian distribution. Kernel density estimates are closely related to histograms, but by choosing a suitable kernel, we can guarantee smoothness of the density estimate. Figure 11.13 illustrates the difference between a histogram and a kernel density estimator (with a Gaussian-shaped kernel) for a given dataset of 250 data points.
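A direct sketch of (11.75) with a Gaussian kernel, placing one bump of bandwidth h on every data point (the names are mine, not a particular library's API):

```python
import numpy as np
from scipy.stats import norm

def kde_gaussian(x_query, data, h):
    """Kernel density estimate p(x) = 1/(N h) sum_n k((x - x_n)/h) with a standard-normal kernel k."""
    diffs = (np.asarray(x_query)[:, None] - np.asarray(data)[None, :]) / h   # (M, N) scaled differences
    return norm.pdf(diffs).mean(axis=1) / h   # average the kernel bumps, rescale by the bandwidth
```

For a Gaussian kernel this is equivalent to averaging \mathcal{N}(x \mid x_n, h^2) over the data points, i.e., a GMM with one equally weighted component per data point.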