
11 Density Estimation with Gaussian Mixture Models
In earlier chapters, we have already covered two fundamental problems in machine learning: regression (Chapter 9) and dimensionality reduction (Chapter 10). In this chapter, we will have a look at a third pillar of machine learning: density estimation. On our journey, we introduce important concepts, such as the Expectation Maximization (EM) algorithm and a latent variable perspective of density estimation with mixture models.

When we apply machine learning to data, we often aim to represent the data in some way. A straightforward way is to take the data points themselves as the representation of the data; see Figure 11.1 for an example. However, this approach may be unhelpful if the dataset is huge or if we are interested in representing characteristics of the data. In density estimation, we represent the data compactly using a density from a parametric family, e.g., a Gaussian or Beta distribution. For example, we may be looking for the mean and variance of a dataset in order to represent the data compactly using a Gaussian distribution. The mean and variance can be found using tools we discussed in Section 8.2: maximum likelihood or maximum a posteriori estimation. We can then use the mean and variance of this Gaussian to represent the distribution underlying the data, i.e., we think of the dataset as a typical realization from this distribution if we were to sample from it.

Figure 11.1 Two-dimensional dataset that cannot be meaningfully represented by a Gaussian.
Draft chapter (November 8, 2018) from "Mathematics for Machine Learning" © 2018 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. To be published by Cambridge University Press. Report errata and feedback to https://mml-book.com. Please do not post or distribute this file; please link to https://mml-book.com.
In practice, the Gaussian (and similarly all other distributions we encountered so far) has limited modeling capabilities. For example, a Gaussian approximation of the density that generated the data in Figure 11.1 would be a poor approximation. In the following, we will look at a more expressive family of distributions, which we can use for density estimation: mixture models.
Mixture models can be used to describe a distribution p(x) by a convex combination of K simple (base) distributions

p(x) = \sum_{k=1}^{K} \pi_k p_k(x),   (11.1)

0 \leqslant \pi_k \leqslant 1, \quad \sum_{k=1}^{K} \pi_k = 1,   (11.2)

where the components p_k are members of a family of basic distributions, e.g., Gaussians, Bernoullis, or Gammas, and the πk are the mixture weights. Mixture models are more expressive than the corresponding base distributions because they allow for multimodal data representations, i.e., they can describe datasets with multiple "clusters", such as the example in Figure 11.1.
We will focus on Gaussian mixture models (GMMs), where the basic distributions are Gaussians. For a given dataset, we aim to maximize the likelihood of the model parameters to train the GMM. For this purpose, we will use results from Chapter 5, Section 7.2, and Chapter 6. However, unlike other applications we discussed earlier (linear regression or PCA), we will not find a closed-form maximum likelihood solution. Instead, we will arrive at a set of dependent simultaneous equations, which we can only solve iteratively.

11.1 Gaussian Mixture Model


A Gaussian mixture model is a density model where
 we combine a finite Gaussian mixture
number of K Gaussian distributions N x | µk , Σk so that model

K
X 
p(x | θ) = πk N x | µk , Σk (11.3)
k=1
K
X
0 6 πk 6 1 , πk = 1 , (11.4)
k=1

where we defined θ := {µk , Σk , πk : k = 1, . . . , K} as the collection of


all parameters of the model. This convex combination of Gaussian distri-
bution gives us significantly more flexibility for modeling complex densi-
ties than a simple Gaussian distribution (which we recover from (11.3) for
K = 1). An illustration is given in Figure 11.2, displaying the weighted


Figure 11.2 Gaussian mixture model. The Gaussian mixture distribution (black) is composed of a convex combination of Gaussian distributions and is more expressive than any individual component. Dashed lines represent the weighted Gaussian components.

An illustration is given in Figure 11.2, displaying the weighted components and the mixture density, which is given as

p(x \mid \theta) = 0.5\,\mathcal{N}(x \mid -2, \tfrac{1}{2}) + 0.2\,\mathcal{N}(x \mid 1, 2) + 0.3\,\mathcal{N}(x \mid 4, 1).   (11.5)
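As a concrete illustration, the following short sketch evaluates the mixture density in (11.5) on a grid of points. It uses only NumPy; the helper gauss_pdf and the grid xs are our own illustrative choices, not anything prescribed by the book.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Univariate Gaussian density N(x | mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Mixture from (11.5): weights, means, and variances of the three components
weights = np.array([0.5, 0.2, 0.3])
means = np.array([-2.0, 1.0, 4.0])
variances = np.array([0.5, 2.0, 1.0])

xs = np.linspace(-6.0, 8.0, 200)                     # evaluation grid
weighted_components = np.stack(                      # dashed lines in Figure 11.2
    [w * gauss_pdf(xs, m, v) for w, m, v in zip(weights, means, variances)])
gmm_density = weighted_components.sum(axis=0)        # mixture density p(x | theta), the solid line
```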

11.2 Parameter Learning via Maximum Likelihood


Assume we are given a dataset X = {x1, . . . , xN}, where xn, n = 1, . . . , N, are drawn i.i.d. from an unknown distribution p(x). Our objective is to find a good approximation/representation of this unknown distribution p(x) by means of a Gaussian mixture model (GMM) with K mixture components. The parameters of the GMM are the K means µk, the covariances Σk, and the mixture weights πk. We summarize all these free parameters in θ := {πk, µk, Σk : k = 1, . . . , K}.

Throughout this chapter, we will have a simple running example that helps us illustrate and visualize important concepts.

Example 11.1 (Initial setting)

Figure 11.3 Initial setting: GMM (black) with three mixture components (dashed) and seven data points (discs).


We consider a one-dimensional dataset X = {−3, −2.5, −1, 0, 2, 4, 5} consisting of seven data points and wish to find a GMM with K = 3 components that models the density of the data. We initialize the mixture components as

p_1(x) = \mathcal{N}(x \mid -4, 1),   (11.6)
p_2(x) = \mathcal{N}(x \mid 0, 0.2),   (11.7)
p_3(x) = \mathcal{N}(x \mid 8, 3),   (11.8)

and assign them equal weights π1 = π2 = π3 = 1/3. The corresponding model (and the data points) are shown in Figure 11.3.
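The initial configuration of this running example can be written down directly in code; the sketch below uses our own variable names (X, means, variances, weights) for the data and for the parameters in (11.6)–(11.8).

```python
import numpy as np

# Seven data points of the running example
X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])

# Initial means and variances of the components p_1, p_2, p_3 from (11.6)-(11.8)
means = np.array([-4.0, 0.0, 8.0])
variances = np.array([1.0, 0.2, 3.0])

# Equal initial mixture weights pi_1 = pi_2 = pi_3 = 1/3
weights = np.full(3, 1.0 / 3.0)
```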

In the following, we detail how to obtain a maximum likelihood estimate θML of the model parameters θ. We start by writing down the likelihood, i.e., the predictive distribution of the training data given the parameters. We exploit our i.i.d. assumption, which leads to the factorized likelihood

p(\mathcal{X} \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta), \qquad p(x_n \mid \theta) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k),   (11.9)

where every individual likelihood term p(xn | θ) is a Gaussian mixture density. Then, we obtain the log-likelihood as

\log p(\mathcal{X} \mid \theta) = \sum_{n=1}^{N} \log p(x_n \mid \theta) = \underbrace{\sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}_{=: \mathcal{L}}.   (11.10)
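The log-likelihood (11.10) is straightforward to evaluate numerically. The sketch below does so for a one-dimensional GMM; it re-states the gauss_pdf helper from the earlier sketch so that it stands on its own, and all names are our own rather than from the book.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gmm_log_likelihood(X, weights, means, variances):
    """Log-likelihood (11.10): sum_n log sum_k pi_k N(x_n | mu_k, sigma_k^2)."""
    # densities[n, k] = pi_k * N(x_n | mu_k, sigma_k^2), via broadcasting over (N, K)
    densities = weights * gauss_pdf(X[:, None], means, variances)
    return np.log(densities.sum(axis=1)).sum()
```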

We aim to find parameters θ*ML that maximize the log-likelihood L defined in (11.10). Our "normal" procedure would be to compute the gradient dL/dθ of the log-likelihood with respect to the model parameters θ, set it to 0, and solve for θ. However, unlike our previous examples for maximum likelihood estimation (e.g., when we discussed linear regression in Section 9.2), we cannot obtain a closed-form solution. However, we can exploit an iterative scheme to find good model parameters θML, which will turn out to be the EM algorithm for Gaussian mixture models. The key idea is to update one model parameter at a time while keeping the others fixed.
Remark. If we were to consider a single Gaussian as the desired density, the sum over k in (11.10) vanishes, and the log can be applied directly to the Gaussian component, such that we get

\log \mathcal{N}(x \mid \mu, \Sigma) = -\tfrac{D}{2}\log(2\pi) - \tfrac{1}{2}\log\det(\Sigma) - \tfrac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu).   (11.11)

This simple form allows us to find closed-form maximum likelihood estimates of µ and Σ, as discussed in Chapter 8.


In (11.10), we cannot move the log into the sum over k, so that we cannot obtain a simple closed-form maximum likelihood solution. ♦
Any local optimum of a function exhibits the property that its gradient with respect to the parameters must vanish (necessary condition); see Chapter 7. In our case, we obtain the following necessary conditions when we optimize the log-likelihood in (11.10) with respect to the GMM parameters µk, Σk, πk:

\frac{\partial \mathcal{L}}{\partial \mu_k} = \mathbf{0}^\top \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \mu_k} = \mathbf{0}^\top,   (11.12)

\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \mathbf{0} \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \Sigma_k} = \mathbf{0},   (11.13)

\frac{\partial \mathcal{L}}{\partial \pi_k} = 0 \iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \pi_k} = 0.   (11.14)

For all three necessary conditions, by applying the chain rule (see Section 5.2.2), we require partial derivatives of the form

\frac{\partial \log p(x_n \mid \theta)}{\partial \theta} = \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \theta},   (11.15)

where θ = {µk, Σk, πk, k = 1, . . . , K} are the model parameters and

\frac{1}{p(x_n \mid \theta)} = \frac{1}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.   (11.16)

In the following, we will compute the partial derivatives (11.12)–(11.14). But before we do this, we introduce a quantity that will play a central role in the remainder of this chapter: responsibilities.

11.2.1 Responsibilities


We define the quantity

r_{nk} := \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}   (11.17)

as the responsibility of the k-th mixture component for the n-th data point. The responsibility rnk of the k-th mixture component for data point xn is proportional to the likelihood

p(x_n \mid \pi_k, \mu_k, \Sigma_k) = \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)   (11.18)

of the mixture component given the data point. Therefore, mixture components have a high responsibility for a data point when the data point could be a plausible sample from that mixture component.

Note that r_n := [r_{n1}, . . . , r_{nK}]^\top ∈ R^K is a (normalized) probability vector (r_n follows a Boltzmann/Gibbs distribution), i.e., \sum_k r_{nk} = 1 with r_{nk} \geqslant 0. This probability vector distributes probability mass among the K mixture components, and we can think of r_n as a "soft assignment" of xn to the K mixture components. Therefore, the responsibility rnk from (11.17) represents the probability that xn has been generated by the k-th mixture component.

Example 11.2 (Responsibilities)
For our example from Figure 11.3, we compute the responsibilities rnk as

\begin{bmatrix} 1.0 & 0.0 & 0.0 \\ 1.0 & 0.0 & 0.0 \\ 0.057 & 0.943 & 0.0 \\ 0.001 & 0.999 & 0.0 \\ 0.0 & 0.066 & 0.934 \\ 0.0 & 0.0 & 1.0 \\ 0.0 & 0.0 & 1.0 \end{bmatrix} \in \mathbb{R}^{N \times K}.   (11.19)

Here, the n-th row tells us the responsibilities of all mixture components for xn. The sum of all K responsibilities for a data point (the sum of every row) is 1. The k-th column gives us an overview of the responsibility of the k-th mixture component. We can see that the third mixture component (third column) is not responsible for any of the first four data points, but takes much responsibility for the remaining data points. The sum of all entries of a column gives us the value Nk, i.e., the total responsibility of the k-th mixture component. In our example, we get N1 = 2.058, N2 = 2.008, N3 = 2.934.
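The responsibilities in (11.17) are a one-liner in NumPy. A sketch with our own function and variable names, which for the initial parameters of the running example approximately reproduces the matrix in (11.19) and the column sums Nk:

```python
import numpy as np

def gauss_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def responsibilities(X, weights, means, variances):
    """r[n, k] = pi_k N(x_n | mu_k, var_k) / sum_j pi_j N(x_n | mu_j, var_j), see (11.17)."""
    weighted = weights * gauss_pdf(X[:, None], means, variances)   # shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)

X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
r = responsibilities(X, np.full(3, 1 / 3), np.array([-4.0, 0.0, 8.0]), np.array([1.0, 0.2, 3.0]))
N_k = r.sum(axis=0)   # total responsibilities, roughly [2.058, 2.008, 2.934]
```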

In the following, we determine the updates of the model parameters µk, Σk, πk for given responsibilities. We will see that the update equations all depend on the responsibilities, which makes a closed-form solution to the maximum likelihood estimation problem impossible. However, for given responsibilities we will be updating one model parameter at a time, while keeping the others fixed. After this, we will re-compute the responsibilities. Iterating these two steps will eventually converge to a local optimum and is a specific instantiation of the EM algorithm. We will discuss this in some more detail in Section 11.3.

11.2.2 Updating the Means


Theorem 11.1 (Update of the GMM Means). The update of the mean parameters µk, k = 1, . . . , K, of the GMM is given by

\mu_k^{\text{new}} = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}},   (11.20)

where the responsibilities rnk are defined in (11.17).


Remark. The update of the means µk of the individual mixture components in (11.20) depends on all means, covariance matrices Σk, and mixture weights πk via rnk given in (11.17). Therefore, we cannot obtain a closed-form solution for all µk at once. ♦
Proof From (11.15), we see that the gradient of the log-likelihood with respect to the mean parameters µk, k = 1, . . . , K, requires us to compute the partial derivative

\frac{\partial p(x_n \mid \theta)}{\partial \mu_k} = \sum_{j=1}^{K} \pi_j \frac{\partial \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\partial \mu_k} = \pi_k \frac{\partial \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\partial \mu_k}   (11.21)

= \pi_k (x_n - \mu_k)^\top \Sigma_k^{-1} \mathcal{N}(x_n \mid \mu_k, \Sigma_k),   (11.22)

where we exploited that only the k-th mixture component depends on µk.

We use our result from (11.22) in (11.15) and put everything together so that the desired partial derivative of L with respect to µk is given as

\frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \mu_k} = \sum_{n=1}^{N} \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \mu_k}   (11.23a)

= \sum_{n=1}^{N} (x_n - \mu_k)^\top \Sigma_k^{-1} \underbrace{\frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{= r_{nk}}   (11.23b)

= \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)^\top \Sigma_k^{-1}.   (11.23c)

Here, we used the identity from (11.16) and the result of the partial derivative in (11.22) to get to (11.23b). The values rnk are the responsibilities we defined in (11.17).
We now solve (11.23c) for µk^new so that ∂L(µk^new)/∂µk = 0^⊤ and obtain

\sum_{n=1}^{N} r_{nk} x_n = \sum_{n=1}^{N} r_{nk}\, \mu_k^{\text{new}} \iff \mu_k^{\text{new}} = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n,   (11.24)

where we defined

N_k := \sum_{n=1}^{N} r_{nk}   (11.25)

as the total responsibility of the k-th mixture component for the entire dataset. This concludes the proof of Theorem 11.1.

Intuitively, (11.20) can be interpreted as an importance-weighted Monte-Carlo estimate of the mean, where the importance weights of data point xn are the responsibilities rnk of the k-th cluster for xn, k = 1, . . . , K.


Therefore, the mean µk is pulled toward a data point xn with a strength given by rnk. The means are pulled more strongly toward data points for which the corresponding mixture component has a high responsibility, i.e., a high likelihood. Figure 11.4 illustrates this. We can also interpret the mean update in (11.20) as the expected value of all data points under the distribution given by

r_k := [r_{1k}, \ldots, r_{Nk}]^\top / N_k,   (11.26)

which is a normalized probability vector, i.e.,

\mu_k \leftarrow \mathbb{E}_{r_k}[\mathcal{X}].   (11.27)

Figure 11.4 Update of the mean parameter of a mixture component in a GMM. The mean µ is being pulled toward individual data points with the weights given by the corresponding responsibilities.

Example 11.3 (Mean Updates)

Figure 11.5 Effect of updating the mean values in a GMM. (a) GMM before updating the mean values; (b) GMM after updating the mean values µk while retaining the variances and mixture weights.
In our example from Figure 11.3, the mean values are updated as follows:

\mu_1: -4 \to -2.7,   (11.28)
\mu_2: 0 \to -0.4,   (11.29)
\mu_3: 8 \to 3.7.   (11.30)

Here, we see that the means of the first and third mixture components move toward the regime of the data, whereas the mean of the second component does not change so dramatically. Figure 11.5 illustrates this change, where Figure 11.5(a) shows the GMM density prior to updating the means and Figure 11.5(b) shows the GMM density after updating the mean values µk.

The update of the mean parameters in (11.20) looks fairly straightforward. However, note that the responsibilities rnk are a function of πj, µj, Σj for all j = 1, . . . , K, such that the updates in (11.20) depend on all parameters of the GMM, and a closed-form solution, which we obtained for linear regression in Section 9.2 or PCA in Chapter 10, cannot be obtained.
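A minimal sketch of the mean update (11.20) for the one-dimensional running example, assuming the responsibilities matrix r and the data X from the earlier sketches (the function name is our own):

```python
import numpy as np

def update_means(X, r):
    """mu_k_new = sum_n r_nk x_n / sum_n r_nk, see (11.20), for 1-D data."""
    N_k = r.sum(axis=0)                      # total responsibility per component, (11.25)
    return (r * X[:, None]).sum(axis=0) / N_k

# With the responsibilities from (11.19), update_means(X, r) gives roughly
# [-2.7, -0.4, 3.7], matching (11.28)-(11.30).
```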


11.2.3 Updating the Covariances


Theorem 11.2 (Updates of the GMM Covariances). The update of the covariance parameters Σk, k = 1, . . . , K, of the GMM is given by

\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top,   (11.31)

where rnk and Nk are defined in (11.17) and (11.25), respectively.

Proof To prove Theorem 11.2, our approach is to compute the partial derivatives of the log-likelihood L with respect to the covariances Σk, set them to 0, and solve for Σk. We start with our general approach

\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k}.   (11.32)

We already know 1/p(xn | θ) from (11.16). To obtain the remaining partial derivative ∂p(xn | θ)/∂Σk, we write down the definition of the Gaussian distribution p(xn | θ), see (11.9), and drop all terms but the k-th. We then obtain
\frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k}   (11.33a)

= \pi_k \frac{\partial}{\partial \Sigma_k}\Big( (2\pi)^{-\frac{D}{2}} \det(\Sigma_k)^{-\frac{1}{2}} \exp\big(-\tfrac{1}{2}(x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k)\big) \Big)   (11.33b)

= \pi_k (2\pi)^{-\frac{D}{2}} \Big[ \frac{\partial}{\partial \Sigma_k} \det(\Sigma_k)^{-\frac{1}{2}} \exp\big(-\tfrac{1}{2}(x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k)\big) + \det(\Sigma_k)^{-\frac{1}{2}} \frac{\partial}{\partial \Sigma_k} \exp\big(-\tfrac{1}{2}(x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k)\big) \Big].   (11.33c)
We now use the identities

\frac{\partial}{\partial \Sigma_k} \det(\Sigma_k)^{-\frac{1}{2}} \stackrel{(5.100)}{=} -\tfrac{1}{2} \det(\Sigma_k)^{-\frac{1}{2}} \Sigma_k^{-1},   (11.34)

\frac{\partial}{\partial \Sigma_k} (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) \stackrel{(5.105)}{=} -\Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}   (11.35)

and obtain (after some re-arranging) the desired partial derivative required in (11.32) as

\frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k} = \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \cdot \big[ -\tfrac{1}{2}\big(\Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\big) \big].   (11.36)

Putting everything together, the partial derivative of the log-likelihood with respect to Σk is given by

\frac{\partial \mathcal{L}}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \Sigma_k} = \sum_{n=1}^{N} \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k}   (11.37a)

= \sum_{n=1}^{N} \underbrace{\frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{= r_{nk}} \cdot \big[ -\tfrac{1}{2}\big(\Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\big) \big]   (11.37b)

= -\frac{1}{2} \sum_{n=1}^{N} r_{nk} \big(\Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1}\big)   (11.37c)

= -\frac{1}{2} \Sigma_k^{-1} \underbrace{\sum_{n=1}^{N} r_{nk}}_{= N_k} + \frac{1}{2} \Sigma_k^{-1} \Big( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \Big) \Sigma_k^{-1}.   (11.37d)
We see that the responsibilities rnk also appear in this partial derivative. Setting this partial derivative to 0, we obtain the necessary optimality condition

N_k \Sigma_k^{-1} = \Sigma_k^{-1} \Big( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \Big) \Sigma_k^{-1}   (11.38a)

\iff N_k \mathbf{I} = \Big( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \Big) \Sigma_k^{-1}.   (11.38b)

By solving for Σk, we obtain

\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top,   (11.39)

where r_k is the probability vector defined in (11.26). This gives us a simple update rule for Σk for k = 1, . . . , K and proves Theorem 11.2.

Similar to the update of µk in (11.20), we can interpret the update of the covariance in (11.31) as an importance-weighted expected value of the square of the centered data X̃k := {x1 − µk, . . . , xN − µk}.

Example 11.4 (Variance Updates)


In our example from Figure 11.3, the variances are updated as follows:

\sigma_1^2: 1 \to 0.14,   (11.40)
\sigma_2^2: 0.2 \to 0.44,   (11.41)
\sigma_3^2: 3 \to 1.53.   (11.42)


Here, we see that the variances of the first and third components shrink significantly, while the variance of the second component increases slightly. Figure 11.6 illustrates this setting. Figure 11.6(a) is identical (but zoomed in) to Figure 11.5(b) and shows the GMM density and its individual components prior to updating the variances. Figure 11.6(b) shows the GMM density after updating the variances.
Figure 11.6 Effect of updating the variances in a GMM. (a) GMM before updating the variances; (b) GMM after updating the variances while retaining the means and mixture weights.

Similar to the update of the mean parameters, we can interpret (11.31) as a Monte-Carlo estimate of the weighted covariance of data points xn associated with the k-th mixture component, where the weights are the responsibilities rnk. As with the updates of the mean parameters, this update depends on all πj, µj, Σj, j = 1, . . . , K, through the responsibilities rnk, which prohibits a closed-form solution.
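In the one-dimensional running example, the covariance update (11.31) reduces to a responsibility-weighted average of squared deviations from the updated means. A sketch with our own function name, assuming r and X from the earlier sketches:

```python
import numpy as np

def update_variances(X, r, means_new):
    """sigma_k^2 = (1/N_k) sum_n r_nk (x_n - mu_k)^2, the 1-D case of (11.31)."""
    N_k = r.sum(axis=0)
    sq_dev = (X[:, None] - means_new) ** 2          # shape (N, K)
    return (r * sq_dev).sum(axis=0) / N_k

# With the responsibilities from (11.19) and the updated means from (11.28)-(11.30),
# this yields roughly [0.14, 0.44, 1.53], matching (11.40)-(11.42).
```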

11.2.4 Updating the Mixture Weights


Theorem 11.3 (Update of the GMM Mixture Weights). The mixture weights of the GMM are updated as

\pi_k^{\text{new}} = \frac{N_k}{N}   (11.43)

for k = 1, . . . , K, where N is the number of data points and Nk is defined in (11.25).

Proof To find the partial derivative of the log-likelihood with respect to the weight parameters πk, k = 1, . . . , K, we take the constraint \sum_k \pi_k = 1 into account by using Lagrange multipliers (see Section 7.2). The Lagrangian is

\mathfrak{L} = \mathcal{L} + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big)   (11.44a)
= \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big),   (11.44b)

where L is the log-likelihood from (11.10) and the second term encodes the equality constraint that all the mixture weights need to sum up to 1. We obtain the partial derivative with respect to πk as

\frac{\partial \mathfrak{L}}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda   (11.45a)

= \frac{1}{\pi_k} \underbrace{\sum_{n=1}^{N} \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{= N_k} + \lambda = \frac{N_k}{\pi_k} + \lambda,   (11.45b)

and the partial derivative with respect to the Lagrange multiplier λ as

\frac{\partial \mathfrak{L}}{\partial \lambda} = \sum_{k=1}^{K} \pi_k - 1.   (11.46)
Setting both partial derivatives to 0 (necessary condition for optimum) yields the system of equations

\pi_k = -\frac{N_k}{\lambda},   (11.47)

1 = \sum_{k=1}^{K} \pi_k.   (11.48)

Using (11.47) in (11.48) and solving for πk, we obtain

\sum_{k=1}^{K} \pi_k = 1 \iff -\sum_{k=1}^{K} \frac{N_k}{\lambda} = 1 \iff -\frac{N}{\lambda} = 1 \iff \lambda = -N.   (11.49)

This allows us to substitute −N for λ in (11.47) to obtain

\pi_k^{\text{new}} = \frac{N_k}{N},   (11.50)
which gives us the update for the weight parameters πk and proves Theorem 11.3.

We can identify the mixture weight in (11.43) as the ratio of the total responsibility of the k-th cluster and the number of data points. Since N = \sum_k N_k, the number of data points can also be interpreted as the total responsibility of all mixture components together, such that πk is the relative importance of the k-th mixture component for the dataset.

Remark. Since N_k = \sum_{n=1}^{N} r_{nk}, the update equation (11.43) for the mixture weights πk also depends on all πj, µj, Σj, j = 1, . . . , K via the responsibilities rnk. ♦


Example 11.5 (Weight Parameter Updates)

Figure 11.7 Effect of updating the mixture weights in a GMM. (a) GMM before updating the mixture weights; (b) GMM after updating the mixture weights while retaining the means and variances. Note the different scalings of the vertical axes.
In our running example from Figure 11.3, the mixture weights are updated as follows:

\pi_1: \tfrac{1}{3} \to 0.29,   (11.51)
\pi_2: \tfrac{1}{3} \to 0.29,   (11.52)
\pi_3: \tfrac{1}{3} \to 0.42.   (11.53)

Here we see that the third component gets more weight/importance, while the other components become slightly less important. Figure 11.7 illustrates the effect of updating the mixture weights. Figure 11.7(a) is identical to Figure 11.6(b) and shows the GMM density and its individual components prior to updating the mixture weights. Figure 11.7(b) shows the GMM density after updating the mixture weights.

Overall, having updated the means, the variances, and the weights once, we obtain the GMM shown in Figure 11.7(b). Compared with the initialization shown in Figure 11.3, we can see that the parameter updates caused the GMM density to shift some of its mass toward the data points. After updating the means, variances, and weights once, the GMM fit in Figure 11.7(b) is already remarkably better than its initialization from Figure 11.3. This is also evidenced by the log-likelihood values, which increased from −28.3 (initialization) to −14.4 after one complete update cycle.

11.3 EM Algorithm


Unfortunately, the updates in (11.20), (11.31), and (11.43) do not constitute a closed-form solution for the updates of the parameters µk, Σk, πk of the mixture model because the responsibilities rnk depend on those parameters in a complex way. However, the results suggest a simple iterative scheme for finding a solution to the parameter estimation problem via maximum likelihood.


The Expectation Maximization algorithm (EM algorithm) was proposed by Dempster et al. (1977) and is a general iterative scheme for learning parameters (maximum likelihood or MAP) in mixture models and, more generally, latent-variable models.

In our example of the Gaussian mixture model, we choose initial values for µk, Σk, πk and alternate until convergence between

• E-step: Evaluate the responsibilities rnk (the posterior probability of data point n belonging to mixture component k).
• M-step: Use the updated responsibilities to re-estimate the parameters µk, Σk, πk.

Every step in the EM algorithm increases the log-likelihood function (Neal and Hinton, 1999). For convergence, we can check the log-likelihood or the parameters directly. A concrete instantiation of the EM algorithm for estimating the parameters of a GMM is as follows:
1. Initialize µk, Σk, πk.
2. E-step: Evaluate the responsibilities rnk for every data point xn using the current parameters πk, µk, Σk:

r_{nk} = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.   (11.54)

3. M-step: Re-estimate the parameters πk, µk, Σk using the current responsibilities rnk (from the E-step):

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n,   (11.55)

\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top,   (11.56)

\pi_k = \frac{N_k}{N}.   (11.57)
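These three steps translate almost line by line into code. The following is a minimal NumPy sketch for a one-dimensional GMM; the function name em_gmm and the convergence check via the log-likelihood (11.10) are our own choices, not prescribed by the book.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm(X, means, variances, weights, max_iter=100, tol=1e-6):
    """EM for a 1-D GMM: alternate the E-step (11.54) and the M-step (11.55)-(11.57)."""
    log_lik = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities r[n, k], see (11.54)
        weighted = weights * gauss_pdf(X[:, None], means, variances)
        r = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: re-estimate means, variances, and weights, see (11.55)-(11.57)
        N_k = r.sum(axis=0)
        means = (r * X[:, None]).sum(axis=0) / N_k
        variances = (r * (X[:, None] - means) ** 2).sum(axis=0) / N_k
        weights = N_k / X.shape[0]

        # Convergence check via the log-likelihood (11.10)
        new_log_lik = np.log(
            (weights * gauss_pdf(X[:, None], means, variances)).sum(axis=1)).sum()
        if np.abs(new_log_lik - log_lik) < tol:
            log_lik = new_log_lik
            break
        log_lik = new_log_lik
    return means, variances, weights, log_lik

# Running example: data and initialization from Example 11.1
X = np.array([-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0])
means, variances, weights, log_lik = em_gmm(
    X, means=np.array([-4.0, 0.0, 8.0]),
    variances=np.array([1.0, 0.2, 3.0]),
    weights=np.full(3, 1.0 / 3.0))
```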

Example 11.6 (GMM Fit)

Figure 11.8 EM algorithm applied to the GMM from Figure 11.2. (a) Final GMM fit: after five iterations, the EM algorithm converges and returns this mixture density. (b) Negative log-likelihood as a function of the EM iterations.
Figure 11.9 Illustration of the EM algorithm for fitting a Gaussian mixture model with three components to a two-dimensional dataset. (a) Dataset; (b) negative log-likelihood (lower is better) as a function of the EM iterations. The red dots indicate the iterations for which the corresponding GMM fits are shown in (c)–(f): (c) EM initialization; (d) EM after 1 iteration; (e) EM after 10 iterations; (f) EM after 63 iterations. The yellow discs indicate the means of the Gaussian distributions.

When we run EM on our example from Figure 11.3, we obtain the final result shown in Figure 11.8(a) after five iterations, and Figure 11.8(b) shows how the negative log-likelihood evolves as a function of the EM iterations. The final GMM is given as

p(x) = 0.29\,\mathcal{N}(x \mid -2.75, 0.06) + 0.28\,\mathcal{N}(x \mid -0.50, 0.25) + 0.43\,\mathcal{N}(x \mid 3.64, 1.63).   (11.58)

We applied the EM algorithm to the two-dimensional dataset shown in Figure 11.1 with K = 3 mixture components. Figure 11.9 illustrates some steps of the EM algorithm and the negative log-likelihood as a function of the EM iteration (Figure 11.9(b)).


Figure 11.10 Dataset colored according to the responsibilities of the mixture components when EM converges. While a single mixture component is clearly responsible for the data on the left, the overlap of the two data clusters on the right could have been generated by two mixture components.

Figure 11.10 visualizes the final responsibilities of the mixture components for the data points. It becomes clear that there are data points that cannot be uniquely assigned to a single (either blue or yellow) component, such that the responsibilities of these two clusters for those points are around 0.5.

11.4 Latent Variable Perspective


We can look at the GMM from the perspective of a discrete latent variable model, i.e., where the latent variable z can attain only a finite set of values. This is in contrast to PCA, where the latent variables were continuous-valued numbers in R^M.

The advantages of the probabilistic perspective are that (i) it will justify some ad-hoc decisions we made in the previous sections, (ii) it allows for a concrete interpretation of the responsibilities as posterior probabilities, and (iii) the iterative algorithm for updating the model parameters can be derived in a principled manner as the EM algorithm for maximum likelihood parameter estimation in latent variable models.

11.4.1 Generative Process and Probabilistic Model


To derive the probabilistic model for GMMs, it is useful to think about the generative process, i.e., the process that allows us to generate data, using a probabilistic model.

We assume a mixture model with K components and that a data point x can be generated by exactly one mixture component. We introduce a binary indicator variable zk ∈ {0, 1} with two states (see Section 6.2) that indicates whether the k-th mixture component generated that data point, so that

p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k).   (11.59)


We define z := [z_1, . . . , z_K]^\top ∈ R^K as a probability vector consisting of K − 1 many 0s and exactly one 1. For example, for K = 3, a valid z would be z = [z_1, z_2, z_3]^\top = [0, 1, 0]^\top, which would select the second mixture component since z_2 = 1. The properties of z imply that \sum_{k=1}^{K} z_k = 1. Therefore, z is a one-hot encoding (also: 1-of-K representation).

Thus far, we assumed that the indicator variables zk are known. However, in practice, this is not the case, and we place a prior distribution on z. Given that we do not know which mixture component generated the data point, we treat the indicator vector z as a latent variable and place a prior

p(z) = \pi = [\pi_1, \ldots, \pi_K]^\top, \quad \sum_{k=1}^{K} \pi_k = 1,   (11.60)

on it. (Sometimes this kind of probability distribution is called "multinoulli", a generalization of the Bernoulli distribution to more than two values (Murphy, 2012).) Then the k-th entry

\pi_k = p(z_k = 1)   (11.61)
of this probability vector describes the probability that the k-th mixture component generated data point x.

Figure 11.11 Graphical model for a GMM with a single data point.

Remark (Sampling from a GMM). The construction of this latent variable model (see the corresponding graphical model in Figure 11.11) lends itself to a very simple sampling procedure (generative process) to generate data:

1. Sample z^{(i)} ∼ p(z).
2. Sample x^{(i)} ∼ p(x | z^{(i)}).

In the first step, we select a mixture component i (via the one-hot encoding z) at random according to p(z) = π; in the second step we draw a sample from the corresponding mixture component. When we discard the samples of the latent variable so that we are left with the x^{(i)}, we have valid samples from the GMM. This kind of sampling, where samples of random variables depend on samples from the variable's parents in the graphical model, is called ancestral sampling. ♦
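Ancestral sampling from a one-dimensional GMM can be sketched in a few lines; the function name sample_gmm and the use of NumPy's default_rng are our own choices.

```python
import numpy as np

def sample_gmm(n_samples, weights, means, variances, seed=0):
    """Ancestral sampling: draw z ~ p(z) = pi, then x ~ N(mu_k, sigma_k^2) for the selected k."""
    rng = np.random.default_rng(seed)
    # Step 1: sample the latent indicator (equivalently, the component index) according to pi
    ks = rng.choice(len(weights), size=n_samples, p=weights)
    # Step 2: sample from the selected Gaussian component; the latent samples are then discarded
    return rng.normal(loc=means[ks], scale=np.sqrt(variances[ks]))

# e.g., samples from the mixture in (11.5)
x = sample_gmm(1000, np.array([0.5, 0.2, 0.3]), np.array([-2.0, 1.0, 4.0]), np.array([0.5, 2.0, 1.0]))
```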
Generally, a probabilistic model is defined by the joint distribution of the data and the latent variables (see Section 8.3). With the prior p(z) defined in (11.60)–(11.61) and the conditional p(x | z) from (11.59), we obtain all K components of this joint distribution via

p(x, z_k = 1) = p(x \mid z_k = 1)\, p(z_k = 1) = \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)   (11.62)

for k = 1, . . . , K, so that

p(x, z) = \begin{bmatrix} p(x, z_1 = 1) \\ \vdots \\ p(x, z_K = 1) \end{bmatrix} = \begin{bmatrix} \pi_1 \mathcal{N}(x \mid \mu_1, \Sigma_1) \\ \vdots \\ \pi_K \mathcal{N}(x \mid \mu_K, \Sigma_K) \end{bmatrix},   (11.63)

which fully specifies the probabilistic model.


11.4.2 Likelihood


To obtain the likelihood p(x | θ) in a latent variable model, we need to marginalize out the latent variables. In our case, this can be done by summing out all latent variables from the joint p(x, z) in (11.63), so that

p(x \mid \theta) = \sum_{z} p(x \mid \theta, z)\, p(z \mid \theta), \qquad \theta := \{\mu_k, \Sigma_k, \pi_k : k = 1, \ldots, K\}.   (11.64)

We now explicitly condition on the parameters θ of the probabilistic model, which we previously omitted. In (11.64), we sum over all K possible one-hot encodings of z, which is denoted by \sum_z. Since there is only a single non-zero entry in each z, there are only K possible configurations/settings of z. For example, if K = 3, then z can have the configurations

\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}.   (11.65)
Summing over all possible configurations of z in (11.64) is equivalent to looking at the non-zero entry of the z-vector and writing

p(x \mid \theta) = \sum_{z} p(x \mid \theta, z)\, p(z \mid \theta)   (11.66a)

= \sum_{k=1}^{K} p(x \mid \theta, z_k = 1)\, p(z_k = 1 \mid \theta),   (11.66b)

so that the desired marginal distribution is given as

p(x \mid \theta) \stackrel{(11.66b)}{=} \sum_{k=1}^{K} p(x \mid \theta, z_k = 1)\, p(z_k = 1 \mid \theta)   (11.67a)

= \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k),   (11.67b)

which we identify as the GMM model from (11.3). Given a dataset X, we immediately obtain the likelihood

p(\mathcal{X} \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta) \stackrel{(11.67b)}{=} \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k),   (11.68)

which is exactly the GMM likelihood from (11.9). Therefore, the latent variable model with latent indicators zk is an equivalent way of thinking about a Gaussian mixture model.

11.4.3 Posterior Distribution


Let us have a brief look at the posterior distribution on the latent variable z.


Figure 11.12 Graphical model for a GMM with N data points.

According to Bayes' theorem, the posterior of the k-th component having generated data point x is

p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{p(x)},   (11.69)

where the marginal p(x) is given in (11.67b). This yields the posterior distribution for the k-th indicator variable zk:

p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)},   (11.70)

which we identify as the responsibility of the k-th mixture component for data point x. Note that we omitted the explicit conditioning on the GMM parameters πk, µk, Σk, where k = 1, . . . , K.

11.4.4 Extension to a Full Dataset


Thus far, we have only discussed the case where the dataset consists of a single data point x. However, the concepts of the prior and the posterior can be directly extended to the case of N data points X := {x1, . . . , xN}.

In the probabilistic interpretation of the GMM, every data point xn possesses its own latent variable

z_n = [z_{n1}, \ldots, z_{nK}]^\top \in \mathbb{R}^K.   (11.71)

Previously (when we only considered a single data point x) we omitted the index n, but now this becomes important.

We share the same prior distribution π across all latent variables z_n. The corresponding graphical model is shown in Figure 11.12, where we use the plate notation.

The conditional distribution p(x_1, . . . , x_N | z_1, . . . , z_N) factorizes over the data points and is given as

p(x_1, \ldots, x_N \mid z_1, \ldots, z_N) = \prod_{n=1}^{N} p(x_n \mid z_n).   (11.72)

To obtain the posterior distribution p(z_{nk} = 1 | x_n), we follow the same reasoning as in Section 11.4.3 and apply Bayes' theorem to obtain

p(z_{nk} = 1 \mid x_n) = \frac{p(x_n \mid z_{nk} = 1)\, p(z_{nk} = 1)}{\sum_{j=1}^{K} p(x_n \mid z_{nj} = 1)\, p(z_{nj} = 1)}   (11.73a)

= \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} = r_{nk}.   (11.73b)

This means that p(z_{nk} = 1 | x_n) is the (posterior) probability that the k-th mixture component generated data point xn and corresponds to the responsibility rnk we introduced in (11.17). Now, the responsibilities have not only an intuitive but also a mathematically justified interpretation as posterior probabilities.

11.4.5 EM Algorithm Revisited


The EM algorithm that we introduced as an iterative scheme for maximum likelihood estimation can be derived in a principled way from the latent variable perspective. Given a current setting θ^{(t)} of model parameters, the E-step calculates the expected log-likelihood

Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \mid x, \theta^{(t)}}[\log p(x, z \mid \theta)]   (11.74a)

= \int \log p(x, z \mid \theta)\, p(z \mid x, \theta^{(t)})\, \mathrm{d}z,   (11.74b)

where the expectation of log p(x, z | θ) is taken with respect to the posterior p(z | x, θ^{(t)}) of the latent variables. The M-step selects an updated set of model parameters θ^{(t+1)} by maximizing (11.74b).

Although an EM iteration does increase the log-likelihood, there are no guarantees that EM converges to the maximum likelihood solution. It is possible that the EM algorithm converges to a local maximum of the log-likelihood. Different initializations of the parameters θ could be used in multiple EM runs to reduce the risk of ending up in a bad local optimum. We do not go into further details here, but refer to the excellent expositions by Rogers and Girolami (2016) and Bishop (2006).
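Such a multiple-restart strategy can be sketched as follows, assuming the em_gmm function from the sketch in Section 11.3; choosing random data points as initial means and the data variance as initial spread is our own illustrative initialization, not a recommendation from the book.

```python
import numpy as np

def em_with_restarts(X, K, n_restarts=10, seed=0):
    """Run EM from several random initializations and keep the best local optimum."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        means0 = rng.choice(X, size=K, replace=False)     # random data points as initial means
        variances0 = np.full(K, X.var())                  # data variance as initial spread
        weights0 = np.full(K, 1.0 / K)
        params = em_gmm(X, means0, variances0, weights0)  # returns (means, variances, weights, log_lik)
        if best is None or params[-1] > best[-1]:         # keep the run with the highest log-likelihood
            best = params
    return best
```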

11.5 Further Reading


The GMM can be considered a generative model in the sense that it is straightforward to generate new data using ancestral sampling (Bishop, 2006). For given GMM parameters πk, µk, Σk, k = 1, . . . , K, we sample an index k from the probability vector [π1, . . . , πK]^⊤ and then sample a data point x ∼ N(µk, Σk). If we repeat this N times, we obtain a dataset that has been generated by a GMM. Figure 11.1 was generated using this procedure.

Throughout this chapter, we assumed that the number of components K is known. In practice, this is often not the case. However, we could use nested cross-validation, as discussed in Section 8.5.1, to find good models.

Gaussian mixture models are closely related to the K-means clustering algorithm. K-means also uses the EM algorithm to assign data points to clusters. If we treat the means in the GMM as cluster centers and ignore the covariances, we arrive at K-means. As also nicely described by MacKay (2003a), K-means makes a "hard" assignment of data points to cluster centers µk, whereas a GMM makes a "soft" assignment via the responsibilities.

We only touched upon the latent variable perspective of GMMs and the EM algorithm. Note that EM can be used for parameter learning in general latent variable models, e.g., nonlinear state-space models (Ghahramani and Roweis, 1999; Roweis and Ghahramani, 1999), and for reinforcement learning as discussed by Barber (2012). Therefore, the latent variable perspective of a GMM is useful to derive the corresponding EM algorithm in a principled way (Bishop, 2006; Barber, 2012; Murphy, 2012).

We only discussed maximum likelihood estimation (via the EM algorithm) for finding GMM parameters. The standard criticisms of maximum likelihood also apply here:
• As in linear regression, maximum likelihood can suffer from severe overfitting. In the GMM case, this happens when the mean of a mixture component is identical to a data point and the covariance tends to 0. Then, the likelihood approaches infinity. Bishop (2006) and Barber (2012) discuss this issue in detail.
• We only obtain a point estimate of the parameters πk, µk, Σk for k = 1, . . . , K, which does not give any indication of uncertainty in the parameter values. A Bayesian approach would place a prior on the parameters, which can be used to obtain a posterior distribution on the parameters. This posterior allows us to compute the model evidence (marginal likelihood), which can be used for model comparison and gives us a principled way to determine the number of mixture components. Unfortunately, closed-form inference is not possible in this setting because there is no conjugate prior for this model. However, approximations, such as variational inference, can be used to obtain an approximate posterior (Bishop, 2006).
In this chapter, we discussed mixture models for density estimation. There is a plethora of density estimation techniques available. In practice, we often use histograms and kernel density estimation.

Histograms provide a non-parametric way to represent continuous densities and were proposed by Pearson (1895). A histogram is constructed by "binning" the data space and counting how many data points fall into each bin. Then a bar is drawn at the center of each bin, and the height of the bar is proportional to the number of data points within that bin.
Draft (2018-11-08) from Mathematics for Machine Learning. Errata and feedback to https://mml-book.com.
11.5 Further Reading 361
0.30 Figure 11.13
Data Histogram (orange
0.25 KDE bars) and kernel
Histogram density estimation
0.20
(blue line). The
p(x)

0.15
kernel density
estimator produces
0.10 a smooth estimate
of the underlying
0.05 density, whereas the
histogram is an
0.00 unsmoothed count
−4 −2 0 2 4 6 8
x measure of how
many data points
(black) fall into a
6547 bin size is a critical hyper-parameter, and a bad choice can lead to overfit- single bin.
6548 ting and underfitting. Cross-validation, as discussed in Section 8.1.4, can
6549 be used to determine a good bin size.
Kernel density estimation, independently proposed by Rosenblatt (1956) kernel density
and Parzen (1962), is a nonparametric way for density estimation. Given estimation
N i.i.d. samples, the kernel density estimator represents the underlying
distribution as
N
1 X x − xn
 
p(x) = k , (11.75)
N h n=1 h
6550 where k is a kernel function, i.e., a non-negative function that integrates
6551 to 1 and h > 0 is a smoothing/bandwidth parameter, which plays a simi-
6552 lar role as the bin size in histograms. Note that we place a kernel on every
6553 single data point xn in the dataset. Commonly used kernel functions are
6554 the uniform distribution and the Gaussian distribution. Kernel density esti-
6555 mates are closely related to histograms, but by choosing a suitable kernel,
6556 we can guarantee smoothness of the density estimate. Figure 11.13 illus-
6557 trates the difference between a histogram and a kernel density estimator
6558 (with a Gaussian-shaped kernel) for a given dataset of 250 data points.

