Lecture 13
Gaussian Process Classification
Philipp Hennig
08 June 2020
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #  date    content                        Ex |  #  date    content                        Ex
 1  20.04.  Introduction                    1 | 14  09.06.  Generalized Linear Models
 2  21.04.  Reasoning under Uncertainty       | 15  15.06.  Exponential Families            8
 3  27.04.  Continuous Variables            2 | 16  16.06.  Graphical Models
 4  28.04.  Monte Carlo                       | 17  22.06.  Factor Graphs                   9
 5  04.05.  Markov Chain Monte Carlo        3 | 18  23.06.  The Sum-Product Algorithm
 6  05.05.  Gaussian Distributions            | 19  29.06.  Example: Topic Models          10
 7  11.05.  Parametric Regression           4 | 20  30.06.  Mixture Models
 8  12.05.  Learning Representations          | 21  06.07.  EM                             11
 9  18.05.  Gaussian Processes              5 | 22  07.07.  Variational Inference
10  19.05.  Understanding Kernels             | 23  13.07.  Topics
11  25.05.  An Example for GP Regression    6 | 24  14.07.  Example: Inferring …
12  26.05.  Gauss-Markov Models               | 25  20.07.  Example: Kernel Topic Models
13  08.06.  GP Classification               7 | 26  21.07.  Revision
Probabilistic ML — P. Hennig, SS 2020 — Lecture 13: GP Classification — © Philipp Hennig, 2020 CC BY-NC-SA 3.0 1
Classification Problems
a typology of supervised learning problems
[Figure: example classification data set in the (x₁, x₂) plane]
[Figure: Fashion-MNIST image classification examples, https://github.com/zalandoresearch/fashion-mnist]
Classification vs. Regression
Two types of supervised learning problems
Regression:
Given supervised data (xi, yi), i = 1, …, n, with xi ∈ X and yi ∈ ℝᵈ, learn the function values at new inputs (special case d = 1: univariate regression)
Classification:
Given supervised data (xi, yi), i = 1, …, n, with xi ∈ X and yi identifying one of d discrete classes, learn the class membership at new inputs (special case d = 2: binary classification)
Until further notice, consider only discriminative binary classification: the labels are y ∈ {−1, +1}, and we model the class probability p(y | x) directly, rather than the class-conditional densities p(x | y) of a generative model.
Let’s not throw out Gauss just yet
Turning Gaussian process models into discrete likelihoods
[Figure: a latent function f(x) (bottom panel) and the corresponding class probability π(x) ∈ (0, 1) (top panel), plotted over x ∈ [−8, 8]]
Link Functions
sigmoids — the logistic link function
[Figure: the logistic sigmoid σ(f), plotted for f ∈ [−4, 4]]

π(f) := σ(f) = 1 / (1 + exp(−f)) = ∫_{−∞}^{f} ¼ sech²(x/2) dx

σ(f) = 1 − σ(−f)        f(π) = ln π − ln(1 − π)        dπ/df = π(f) · (1 − π(f))
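These identities are easy to check numerically; a minimal sketch in plain NumPy (the function name sigmoid is our own, not taken from the lecture code):

```python
import numpy as np

def sigmoid(f):
    """Logistic link: pi(f) = 1 / (1 + exp(-f))."""
    return 1.0 / (1.0 + np.exp(-f))

f = np.linspace(-4.0, 4.0, 9)
pi = sigmoid(f)

# symmetry: sigma(f) = 1 - sigma(-f)
assert np.allclose(pi, 1.0 - sigmoid(-f))

# inverse link (logit): f(pi) = ln(pi) - ln(1 - pi)
assert np.allclose(np.log(pi) - np.log1p(-pi), f)

# derivative: d pi / d f = pi (1 - pi), checked against finite differences
eps = 1e-6
numerical = (sigmoid(f + eps) - sigmoid(f - eps)) / (2 * eps)
assert np.allclose(numerical, pi * (1.0 - pi))
```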
CODE
Generative Model for Logistic Regression.ipynb
A Gaussian Process model for Classification
Logistic Regression
p(f) = GP(f; m, k)

p(y | fx) = σ(y fx) = σ(fx) if y = +1,  1 − σ(fx) if y = −1,        using σ(x) = 1 − σ(−x).
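As a complement to the notebook above, here is a minimal sketch of sampling from this generative model in NumPy (the squared-exponential kernel and all parameter values are assumptions for illustration, not choices made in the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def k_se(a, b, lengthscale=1.0, output_scale=3.0):
    """Squared-exponential kernel on scalar inputs (an assumed choice)."""
    return output_scale**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

# draw a latent function f ~ GP(0, k) at a grid of inputs
X = np.linspace(-8.0, 8.0, 50)
K = k_se(X, X) + 1e-9 * np.eye(X.size)        # small jitter for numerical stability
f = rng.multivariate_normal(np.zeros(X.size), K)

# squash through the link and draw binary labels y in {-1, +1}
pi = sigmoid(f)
y = np.where(rng.uniform(size=X.size) < pi, 1, -1)
```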
Logistic Regression is non-analytic
We’ll have to break out the toolbox
[Figure: the non-Gaussian posterior over two latent function values (f₁, f₂) arising from the logistic likelihood]
We do not always care about all details of the posterior, just “key aspects”.
The Toolbox
Framework:

∫ p(x₁, x₂) dx₂ = p(x₁)        p(x₁, x₂) = p(x₁ | x₂) p(x₂)        p(x | y) = p(y | x) p(x) / p(y)

Modelling:
▶ Directed Graphical Models
▶ Gaussian Distributions
▶ Kernels
▶ Markov Chains

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶ maximum likelihood / MAP
▶ Laplace approximations
The Laplace Approximation
A local Gaussian approximation (Pierre Simon M. de Laplace, 1814)
[Figure: the non-Gaussian posterior over (f₁, f₂), together with its local Gaussian (Laplace) approximation]
An idea as old as Probabilistic Inference
Théorie analytique des probabilités, 1814
The Laplace Approximation
formally
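A standard formal statement (consistent with how it is instantiated for GP classification on the following slides): expand the log density to second order around its mode, then exponentiate, which yields a Gaussian.

```latex
\hat{x} \;=\; \arg\max_{x} \log p(x),
\qquad
q(x) \;=\; \mathcal{N}\!\Bigl(x;\;\hat{x},\;
  -\bigl(\nabla\nabla^{\top}\log p(x)\,\big|_{x=\hat{x}}\bigr)^{-1}\Bigr)
\;\approx\; p(x).
```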
The Laplace Approximation for GP Classification
conceptual step (implementation details coming up) [based on Rasmussen & Williams, 2006, §3.4]
Recall: p(x) = N(x; m, V), p(z | x) = N(z; Ax, B)  ⇒  p(z) = ∫ p(z | x) p(x) dx = N(z; Am, AVA⊺ + B).
q(fX) = N(fX; f̂, −(∇∇⊺ log p(fX | y)|_{fX=f̂})⁻¹) =: N(fX; f̂, Σ̂)
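Combining q(fX) with the Recall identity above (with A = K∗X KXX⁻¹) gives an approximate Gaussian predictive for the latent function at test inputs x∗; a sketch of the resulting formula, consistent with Rasmussen & Williams, §3.4:

```latex
p(f_{\ast} \mid y)
\;\approx\; \int p(f_{\ast} \mid f_X)\, q(f_X)\, \mathrm{d}f_X
\;=\; \mathcal{N}\!\Bigl(f_{\ast};\;
m_{\ast} + K_{\ast X} K_{XX}^{-1}\bigl(\hat{f} - m_X\bigr),\;
K_{\ast\ast} - K_{\ast X} K_{XX}^{-1}\bigl(K_{XX} - \hat{\Sigma}\bigr) K_{XX}^{-1} K_{X\ast}\Bigr).
```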
▶ the Laplace approximation is only very roughly motivated (see above)
▶ it can be arbitrarily wrong, since it is a local approximation
▶ but it is often among the most computationally efficient things to try
▶ for logistic regression, it tends to work relatively well, because
▶ the log posterior is concave (see below)
▶ the algebraic structure of the link function yields “almost” a Gaussian posterior (cf. picture above)
Implementing the Laplace Approximation
Derivation [based on Rasmussen & Williams, 2006, §3.4]
p(f) = GP(f; m, k)        p(y | fX) = ∏_{i=1}^n σ(yi fxi)        σ(z) = 1 / (1 + exp(−z))

log p(fX | y) = log p(y | fX) + log p(fX) − log p(y)        with  log σ(yi fxi) = −log(1 + exp(−yi fxi))

            = ∑_{i=1}^n log σ(yi fxi) − ½ (fX − mX)⊺ KXX⁻¹ (fX − mX) + const.

∇ log p(fX | y) = ∑_{i=1}^n ∇ log σ(yi fxi) − KXX⁻¹ (fX − mX)        with  ∂ log σ(yi fxi)/∂fxj = δij ((yi + 1)/2 − σ(fxi))

∇∇⊺ log p(fX | y) = ∑_{i=1}^n ∇∇⊺ log σ(yi fxi) − KXX⁻¹        with  ∂² log σ(yi fxi)/(∂fxa ∂fxb) = −δia δib σ(fxi)(1 − σ(fxi)) =: −δia δib wi,  where 0 < wi < 1

                  =: −diag(w) − KXX⁻¹ = −(W + KXX⁻¹)   ⇒  convex minimization / concave maximization!
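The gradient expression above can be sanity-checked with finite differences; a minimal NumPy sketch (the random positive-definite matrix stands in for a kernel matrix and is not from the lecture):

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def log_posterior(f, y, K_inv, m):
    # sum_i log sigma(y_i f_i) - 1/2 (f - m)^T K^{-1} (f - m) + const.
    return np.sum(-np.log1p(np.exp(-y * f))) - 0.5 * (f - m) @ K_inv @ (f - m)

def grad_log_posterior(f, y, K_inv, m):
    # gradient of the log likelihood: r_i = (y_i + 1)/2 - sigma(f_i)
    r = (y + 1) / 2 - sigmoid(f)
    return r - K_inv @ (f - m)

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
K_inv = np.linalg.inv(A @ A.T + n * np.eye(n))   # stand-in for an inverse kernel matrix
m = np.zeros(n)
y = rng.choice([-1.0, 1.0], size=n)
f = rng.standard_normal(n)

eps = 1e-6
numerical = np.array([
    (log_posterior(f + eps * e, y, K_inv, m)
     - log_posterior(f - eps * e, y, K_inv, m)) / (2 * eps)
    for e in np.eye(n)
])
assert np.allclose(numerical, grad_log_posterior(f, y, K_inv, m), atol=1e-5)
```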
Implementing the Laplace Approximation I
Newton Optimization [Rasmussen & Williams, 2006, §3.4]
procedure OPTIMIZE(L(·), f0)
    f ← f0                                  initialize
    while not converged do
        g ← ∇L(f)                           compute gradient
        H ← (∇∇⊺L(f))⁻¹                     compute inverse Hessian
        ∆ ← H g                             Newton step
        f ← f − ∆                           perform step
        converged ← ∥∆∥ < ϵ                 check for convergence
    end while
    return f
end procedure
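The same procedure as a runnable sketch in plain NumPy (the callable interface is our choice for illustration):

```python
import numpy as np

def newton_optimize(grad, hess, f0, tol=1e-8, max_iter=100):
    """Newton's method: iterate f <- f - (hess(f))^{-1} grad(f) until the step is small."""
    f = np.asarray(f0, dtype=float).copy()
    for _ in range(max_iter):
        g = grad(f)                          # gradient at the current iterate
        H_inv = np.linalg.inv(hess(f))       # inverse Hessian (fine for small problems)
        delta = H_inv @ g                    # Newton step
        f = f - delta                        # perform step
        if np.linalg.norm(delta) < tol:      # convergence check on the step size
            break
    return f
```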
Implementing the Laplace Approximation II
Concrete Algorithm, preliminary form [Rasmussen & Williams, 2006, §3.4]
procedure GP-LOGISTIC-TRAIN(KXX, mX, y)
    f ← mX                                  initialize
    while not converged do
        r ← (y + 1)/2 − σ(f)                = ∇ log p(y | fX), gradient of log likelihood
        W ← diag(σ(f) ⊙ (1 − σ(f)))         = −∇∇ log p(y | fX), Hessian of log likelihood
        g ← r − KXX⁻¹ (f − mX)              compute gradient
        H ← −(W + KXX⁻¹)⁻¹                  compute inverse Hessian
        ∆ ← H g                             Newton step
        f ← f − ∆                           perform step
        converged ← ∥∆∥ < ϵ                 check for convergence
    end while
    return f
end procedure
This can be numerically unstable, as it (repeatedly) requires (W + KXX⁻¹)⁻¹. For a numerically stable alternative, use B := I + W^{1/2} KXX W^{1/2} (cf. Rasmussen & Williams).
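A sketch of this numerically stable variant in NumPy/SciPy, structured after Algorithm 3.1 in Rasmussen & Williams; for simplicity it assumes a zero prior mean mX = 0, and the function name and tolerances are ours:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def gp_logistic_train(K, y, tol=1e-8, max_iter=100):
    """Find the posterior mode for GP logistic regression (zero prior mean assumed)."""
    n = y.shape[0]
    f = np.zeros(n)                                   # start at the (zero) prior mean
    for _ in range(max_iter):
        pi = sigmoid(f)
        r = (y + 1) / 2 - pi                          # gradient of the log likelihood
        w = pi * (1 - pi)                             # diagonal of W = -Hessian of log lik.
        sqrt_w = np.sqrt(w)
        B = np.eye(n) + sqrt_w[:, None] * K * sqrt_w[None, :]
        L = cho_factor(B, lower=True)                 # Cholesky of B = I + W^{1/2} K W^{1/2}
        b = w * f + r
        a = b - sqrt_w * cho_solve(L, sqrt_w * (K @ b))
        f_new = K @ a                                 # Newton update: f <- (K^{-1} + W)^{-1} b
        converged = np.linalg.norm(f_new - f) < tol
        f = f_new
        if converged:
            break
    pi = sigmoid(f)
    r_hat = (y + 1) / 2 - pi                          # gradient at the mode, needed for prediction
    w_hat = pi * (1 - pi)
    return f, r_hat, w_hat
```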
Computing Predictions
from the Laplace approximation [Rasmussen & Williams, 2006, §3.4]
log p(fX | y) = ∑_{i=1}^n log σ(yi fxi) − ½ (fX − mX)⊺ KXX⁻¹ (fX − mX) + const.

∇ log p(fX | y) = ∑_{i=1}^n ∇ log σ(yi fxi) − KXX⁻¹ (fX − mX),  with the log-likelihood gradient r := ∇ log p(y | fX), i.e. ri = (yi + 1)/2 − σ(fxi)

At the mode f̂ the gradient vanishes, so f̂ = mX + KXX r̂ with r̂ := ∇ log p(y | fX)|_{fX=f̂}; this vector r̂ is what is needed to compute predictions at new inputs.
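Continuing the sketch above (again with a zero prior mean and the stable Cholesky factorization, in the spirit of Algorithm 3.2 in Rasmussen & Williams), the Laplace predictive at a single test input can be computed as follows; averaging σ(f∗) over the latent Gaussian is done here by simple Monte Carlo, which is our simplification:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def gp_logistic_predict(K, k_star, k_star_star, r_hat, w_hat, n_samples=2000, seed=0):
    """Laplace predictive at one test input x* (zero prior mean assumed).

    K           : kernel matrix on the training inputs
    k_star      : vector k(X, x*)
    k_star_star : scalar k(x*, x*)
    r_hat, w_hat: log-likelihood gradient and W-diagonal at the mode
    """
    n = K.shape[0]
    sqrt_w = np.sqrt(w_hat)
    B = np.eye(n) + sqrt_w[:, None] * K * sqrt_w[None, :]
    L = cho_factor(B, lower=True)

    mean = k_star @ r_hat                              # E[f*] = k*^T grad log p(y | f_hat)
    v = cho_solve(L, sqrt_w * k_star)                  # B^{-1} W^{1/2} k*
    var = k_star_star - (sqrt_w * k_star) @ v          # V[f*] = k** - k*^T (K + W^{-1})^{-1} k*

    # predictive class probability: E[sigma(f*)] under N(mean, var), by Monte Carlo
    rng = np.random.default_rng(seed)
    samples = rng.normal(mean, np.sqrt(var), size=n_samples)
    return mean, var, np.mean(sigmoid(samples))
```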
Pictorial View
Gaussian process logistic regression
[Figure: posterior on the latent function f (top panel) and predictive class probability π (bottom panel), plotted over x ∈ [−8, 8]]
Gaussian Process Classification — (Probabilistic) Logistic Regression:
▶ Supervised classification phrased in a discriminative model with probabilistic interpretation
▶ model binary outputs as a transformation of a latent function with a Gaussian process prior
▶ due to the non-Gaussian likelihood, the posterior is non-Gaussian and exact inference is intractable
▶ Laplace approximation: find the MAP estimate, then use a second-order expansion at that point as a Gaussian approximation
▶ tune the implementation for numerical stability and efficient computation
▶ the Laplace approximation provides a Gaussian posterior on the training points, and hence evidence and predictions