Probabilistic Machine Learning

Lecture 13
Gaussian Process Classification

Philipp Hennig
08 June 2020

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #  date    content                        Ex     #  date    content                        Ex
 1  20.04.  Introduction                    1    14  09.06.  Generalized Linear Models
 2  21.04.  Reasoning under Uncertainty          15  15.06.  Exponential Families            8
 3  27.04.  Continuous Variables            2    16  16.06.  Graphical Models
 4  28.04.  Monte Carlo                          17  22.06.  Factor Graphs                   9
 5  04.05.  Markov Chain Monte Carlo        3    18  23.06.  The Sum-Product Algorithm
 6  05.05.  Gaussian Distributions               19  29.06.  Example: Topic Models          10
 7  11.05.  Parametric Regression           4    20  30.06.  Mixture Models
 8  12.05.  Learning Representations             21  06.07.  EM                             11
 9  18.05.  Gaussian Processes              5    22  07.07.  Variational Inference
10  19.05.  Understanding Kernels                23  13.07.  Topics
11  25.05.  An Example for GP Regression    6    24  14.07.  Example: Inferring …
12  26.05.  Gauss-Markov Models                  25  20.07.  Example: Kernel Topic Models
13  08.06.  GP Classification               7    26  21.07.  Revision

Probabilistic ML — P. Hennig, SS 2020 — Lecture 13: GP Classification — © Philipp Hennig, 2020 CC BY-NC-SA 3.0
Classification Problems
a typology of supervised learning problems

[Figure: scatter plot of two-dimensional data (axes x1, x2, both roughly in [−4, 4]) illustrating a classification problem, built up over several animation frames]
Example image-classification dataset: Fashion-MNIST (https://github.com/zalandoresearch/fashion-mnist)
Classification vs. Regression
Two types of supervised learning problems

Regression:
Given supervised data (special case d = 1: univariate regression)

    (X, Y) := (x_i, y_i)_{i=1,...,n}    with    x_i ∈ X,  y_i ∈ R^d,

find a function f : X → R^d such that f "models" the data, Y ≈ f(X).

Classification:
Given supervised data (special case d = 2: binary classification)

    (X, Y) := (x_i, c_i)_{i=1,...,n}    with    x_i ∈ X,  c_i ∈ {1, ..., d},

find a probability function π : X → U_d, where U_d = {p ∈ [0, 1]^d : Σ_{i=1}^{d} p_i = 1} is the probability simplex, such that π "models" the labels, y_i ∼ π_{x_i}.

Regression predicts a function; classification predicts a probability.
Until further notice, consider only discriminative binary classification:

    y ∈ {−1, +1},        x ↦ π(x) =: π_x ∈ [0, 1],

    p(y | x) =  π(x)        if y = +1,
                1 − π(x)    if y = −1.

Discriminative learning phrased probabilistically:

▶ We would like to learn π_x(y) = p(y | x).
▶ This is almost like regression: p(y | x) = N(y; f_x, σ²) = π_x(y),
▶ only the domain is wrong: y ∈ {−1, +1} vs. y ∈ R.
Let’s not throw out Gauss just yet
Turning Gaussian process models into discrete likelihoods

[Figure: two stacked panels over x ∈ [−8, 8]: the class probability π(x) ∈ [0, 1] (top) and the latent function f(x) (bottom)]
Link Functions
sigmoids — the logistic link function

[Figure: the logistic sigmoid σ(f) plotted for f ∈ [−4, 4], rising monotonically from 0 to 1]

    π_f = σ(f) = 1 / (1 + exp(−f)) = ∫_{−∞}^{f} (1/4) sech²(x/2) dx

Properties:

    σ(f) = 1 − σ(−f)        f(π) = ln π − ln(1 − π)        dπ(f)/df = π(f) · (1 − π(f))
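These identities are easy to check numerically. The short NumPy sketch below is an added illustration, not part of the slides; the helper names `sigmoid` and `logit` are arbitrary.

```python
import numpy as np

def sigmoid(f):
    """Logistic link: sigma(f) = 1 / (1 + exp(-f))."""
    return 1.0 / (1.0 + np.exp(-f))

def logit(p):
    """Inverse link: f(pi) = ln(pi) - ln(1 - pi)."""
    return np.log(p) - np.log(1.0 - p)

f = np.linspace(-4, 4, 9)

# symmetry: sigma(f) = 1 - sigma(-f)
assert np.allclose(sigmoid(f), 1.0 - sigmoid(-f))

# the logit inverts the sigmoid
assert np.allclose(logit(sigmoid(f)), f)

# derivative: d sigma / df = sigma(f) (1 - sigma(f)), checked by central differences
eps = 1e-6
num_grad = (sigmoid(f + eps) - sigmoid(f - eps)) / (2 * eps)
assert np.allclose(num_grad, sigmoid(f) * (1.0 - sigmoid(f)), atol=1e-6)

# integral form: sigma(f) = int_{-inf}^{f} 1/4 sech^2(x/2) dx, checked by a Riemann sum
x = np.linspace(-40.0, f.max(), 400001)
integrand = 0.25 / np.cosh(x / 2) ** 2
cdf = np.cumsum(integrand) * (x[1] - x[0])
assert np.allclose(np.interp(f, x, cdf), sigmoid(f), atol=1e-3)
```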
CODE
Generative Model for Logistic Regression.ipynb

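The notebook itself is not reproduced in this transcript. A minimal sketch of the generative model it refers to, under assumed choices (squared-exponential kernel, unit hyperparameters, zero prior mean), could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(a, b, ell=1.0, theta=1.0):
    """Squared-exponential kernel k(a, b) = theta^2 exp(-(a - b)^2 / (2 ell^2))."""
    return theta**2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

# sample a latent function from the GP prior at a set of inputs
x = np.linspace(-8, 8, 100)
K = rbf_kernel(x, x) + 1e-9 * np.eye(len(x))   # small jitter for numerical stability
f = rng.multivariate_normal(np.zeros(len(x)), K)

# squash through the link and sample binary labels y in {-1, +1}
pi = sigmoid(f)
y = np.where(rng.uniform(size=len(x)) < pi, 1, -1)
```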
A Gaussian Process model for Classification
Logistic Regression

    p(f) = GP(f; m, k)

    p(y | f_x) = σ(y f_x) =  σ(f_x)        if y = +1,
                             1 − σ(f_x)    if y = −1,        using σ(z) = 1 − σ(−z).

The problem: The posterior is not Gaussian!

    p(f_X | Y) = p(Y | f_X) p(f_X) / p(Y)
               = N(f_X; m_X, K_XX) ∏_{i=1}^{n} σ(y_i f_{x_i}) / ∫ N(f_X; m_X, K_XX) ∏_{i=1}^{n} σ(y_i f_{x_i}) df_X

    log p(f_X | Y) = −(1/2) (f_X − m_X)^⊺ K_XX^{−1} (f_X − m_X) + Σ_{i=1}^{n} log σ(y_i f_{x_i}) + const.
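For later reference, the unnormalized log posterior above is straightforward to evaluate. A hedged NumPy sketch (function and argument names are illustrative, not from the slides):

```python
import numpy as np

def log_sigmoid(z):
    """Numerically stable log sigma(z) = -log(1 + exp(-z))."""
    return -np.logaddexp(0.0, -z)

def unnormalized_log_posterior(fX, y, mX, K):
    """log p(f_X | Y) up to the constant -log p(Y), for labels y in {-1, +1}."""
    d = fX - mX
    # prior term -1/2 (f - m)^T K^{-1} (f - m), via a linear solve rather than an explicit inverse
    prior = -0.5 * d @ np.linalg.solve(K, d)
    # likelihood term sum_i log sigma(y_i f_i)
    lik = np.sum(log_sigmoid(y * fX))
    return prior + lik
```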
Logistic Regression is non-analytic
We’ll have to break out the toolbox

[Figure: a two-dimensional density over latent function values f1 and f2, illustrating the non-Gaussian posterior (several animation frames)]
We do not always care about all details of the posterior, just “key aspects”.

▶ Remember that the Gaussian choice was also one of convenience.
▶ Moments of p(f, y) = p(y | f) p(f) we may be interested in:

    E_p(1) = ∫ 1 · p(y, f) df = Z                                                            the evidence
    E_{p(f|y)}(f) = ∫ f · p(f | y) df = (1/Z) ∫ f · p(f, y) df = f̄                           the mean
    E_{p(f|y)}(f²) − f̄² = ∫ f² · p(f | y) df − f̄² = (1/Z) ∫ f² · p(f, y) df − f̄² = var(f)    the variance

    Z for hyperparameter tuning;  f̄ as a point estimator;  var(f) as an error estimator.

Unfortunately, all these are usually intractable. But we can aim to approximate them.
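For a one-dimensional toy model these quantities can at least be estimated by simple Monte Carlo. The sketch below uses an assumed standard-normal prior p(f) = N(f; 0, 1) and a single observation y = +1, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy 1-D model: p(f) = N(f; 0, 1), p(y | f) = sigmoid(y f), observed y = +1
samples = rng.standard_normal(1_000_000)                # draws from the prior p(f)
weights = sigmoid(1.0 * samples)                        # likelihood p(y = +1 | f)

Z = weights.mean()                                      # evidence  Z = E_{p(f)}[p(y | f)]
f_bar = (weights * samples).mean() / Z                  # posterior mean  (1/Z) int f p(f, y) df
f_var = (weights * samples**2).mean() / Z - f_bar**2    # posterior variance

print(Z, f_bar, f_var)
```

By symmetry, Z ≈ 1/2 in this toy case; the posterior mean comes out slightly positive and the variance slightly below the prior variance.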
The Toolbox

Framework:

    ∫ p(x1, x2) dx2 = p(x1)        p(x1, x2) = p(x1 | x2) p(x2)        p(x | y) = p(y | x) p(x) / p(y)

Modelling:
▶ Directed Graphical Models
▶ Gaussian Distributions
▶ Kernels
▶ Markov Chains

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶ maximum likelihood / MAP
▶ Laplace approximations
The Laplace Approximation
A local Gaussian approximation [Pierre Simon M. de Laplace, 1814]

[Figure: a non-Gaussian density over (f1, f2) together with its local Gaussian (Laplace) approximation]
An idea as old as Probabilistic Inference
Théorie analytique des probabilités, 1814

Pierre-Simon, marquis de Laplace


(1749-1827)

The Laplace Approximation
formally

▶ Consider a probability distribution p(θ) (it may be a posterior p(θ | D) or something else).
▶ Find a (local) maximum of p(θ) or, equivalently, of log p(θ):

    θ̂ = arg max_θ log p(θ)    ⇒    ∇ log p(θ̂) = 0

▶ Perform a second-order Taylor expansion of log p around θ̂, writing θ = θ̂ + δ:

    log p(θ̂ + δ) = log p(θ̂) + (1/2) δ^⊺ [∇∇^⊺ log p(θ̂)] δ + O(δ³),    with  Ψ := ∇∇^⊺ log p(θ̂)

  (the linear term vanishes because ∇ log p(θ̂) = 0).
▶ Define the Laplace approximation q to p:

    q(θ) = N(θ; θ̂, −Ψ^{−1})

▶ Note that, if p(θ) = N(θ; m, Σ), then p(θ) = q(θ).
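As a concrete illustration (my own example, not from the lecture): applying this recipe to a Gamma density, with the mode found numerically and the Hessian taken by finite differences, yields a Gaussian whose mean and variance can be compared with the analytic values (α − 1)/β and (α − 1)/β².

```python
import numpy as np
from scipy import optimize, stats

# Laplace approximation of a Gamma(alpha, beta) density (an arbitrary non-Gaussian example)
alpha, beta = 5.0, 2.0
log_p = lambda th: stats.gamma.logpdf(th, a=alpha, scale=1.0 / beta)

# 1. find the mode of log p numerically
res = optimize.minimize_scalar(lambda th: -log_p(th), bounds=(1e-6, 20.0), method="bounded")
theta_hat = res.x                                   # analytic mode would be (alpha - 1) / beta

# 2. second derivative of log p at the mode, by central finite differences
eps = 1e-4
psi = (log_p(theta_hat + eps) - 2 * log_p(theta_hat) + log_p(theta_hat - eps)) / eps**2

# 3. the Laplace approximation q(theta) = N(theta; theta_hat, -1/psi)
q = stats.norm(loc=theta_hat, scale=np.sqrt(-1.0 / psi))

print(theta_hat, -1.0 / psi)   # compare with (alpha-1)/beta = 2.0 and (alpha-1)/beta^2 = 1.0
```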
The Laplace Approximation for GP Classification
conceptual steps (implementation details coming up) [based on Rasmussen & Williams, 2006, §3.4]

▶ Find the maximum posterior probability for the latent f at the training points:

    f̂ = arg max_{f_X} log p(f_X | y)

▶ Assign an approximate Gaussian posterior at the training points:

    q(f_X) = N(f_X; f̂, −(∇∇^⊺ log p(f_X | y)|_{f_X = f̂})^{−1}) =: N(f_X; f̂, Σ̂)

▶ Approximate posterior predictions for the latent function at a test point x:

    q(f_x | y) = ∫ p(f_x | f_X) q(f_X) df_X
               = ∫ N(f_x; m_x + k_{xX} K_XX^{−1} (f_X − m_X), k_xx − k_{xX} K_XX^{−1} k_{Xx}) q(f_X) df_X
               = N(f_x; m_x + k_{xX} K_XX^{−1} (f̂ − m_X), k_xx − k_{xX} K_XX^{−1} k_{Xx} + k_{xX} K_XX^{−1} Σ̂ K_XX^{−1} k_{Xx})

  Compare with the exact predictive mean and variance:

    E_{p(f_x, f_X | y)}(f_x) = ∫ E_{p(f_x | f_X)}(f_x) p(f_X | y) df_X = m_x + k_{xX} K_XX^{−1} (E_{p(f_X | y)}(f_X) − m_X) =: f̄_x

    var_{p(f_x, f_X | y)}(f_x) = ∫ (f_x − f̄_x)² dp(f_x | f_X) dp(f_X | y)
                               = k_xx − k_{xX} K_XX^{−1} k_{Xx} + k_{xX} K_XX^{−1} var_{p(f_X | y)}(f_X) K_XX^{−1} k_{Xx}

▶ Compute predictions for the label probabilities:

    E_{p(f | y)}[π_x] ≈ E_q[π_x] = ∫ σ(f_x) q(f_x | y) df_x        or (not the same!)        π̂_x = σ(E_q(f_x))

  Recall: if p(x) = N(x; m, V) and p(z | x) = N(z; Ax, B), then p(z) = ∫ p(z | x) p(x) dx = N(z; Am, AVA^⊺ + B).
▶ The Laplace approximation is only very roughly motivated (see above).
▶ It can be arbitrarily wrong, since it is a local approximation.
▶ But it is often among the most computationally efficient things to try.
▶ For logistic regression, it tends to work relatively well, because
  ▶ the log posterior is concave (see below), and
  ▶ the algebraic structure of the link function yields “almost” a Gaussian posterior (cf. picture above).
Implementing the Laplace Approximation
Derivation [based on Rasmussen & Williams, 2006, §3.4]

    p(f) = GP(f; m, k),        p(y | f_X) = ∏_{i=1}^{n} σ(y_i f_{x_i}),        σ(z) = 1 / (1 + e^{−z})

    log p(f_X | y) = log p(y | f_X) + log p(f_X) − log p(y)        with  log σ(y_i f_{x_i}) = −log(1 + e^{−y_i f_{x_i}})

                   = Σ_{i=1}^{n} log σ(y_i f_{x_i}) − (1/2) (f_X − m_X)^⊺ K_XX^{−1} (f_X − m_X) + const.

    ∇ log p(f_X | y) = Σ_{i=1}^{n} ∇ log σ(y_i f_{x_i}) − K_XX^{−1} (f_X − m_X)
                       with  ∂ log σ(y_i f_{x_i}) / ∂f_{x_j} = δ_ij ((y_i + 1)/2 − σ(f_{x_i}))

    ∇∇^⊺ log p(f_X | y) = Σ_{i=1}^{n} ∇∇^⊺ log σ(y_i f_{x_i}) − K_XX^{−1}
                          with  ∂² log σ(y_i f_{x_i}) / (∂f_{x_a} ∂f_{x_b}) = −δ_ia δ_ib σ(f_{x_i})(1 − σ(f_{x_i})) =: −δ_ia δ_ib w_i,  0 < w_i < 1

                        =: −diag(w) − K_XX^{−1} = −(W + K_XX^{−1})        ⇒ convex minimization / concave maximization!
Implementing the Laplace Approximation I
Newton Optimization [Rasmussen & Williams, 2006, §3.4]

procedure OPTIMIZE(L(·), f0)
    f ← f0                                  initialize
    while not converged do
        g ← ∇L(f)                           compute gradient
        H ← (∇∇^⊺ L(f))^{−1}                compute inverse Hessian
        ∆ ← H g                             Newton step
        f ← f − ∆                           perform step
        converged ← ∥∆∥ < ϵ                 check for convergence
    end while
    return f
end procedure
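A direct NumPy transcription of this pseudocode might look as follows; it solves the Newton system instead of forming the inverse Hessian explicitly, and the function name is an arbitrary choice:

```python
import numpy as np

def newton_maximize(grad, hess, f0, tol=1e-8, max_iter=100):
    """Generic Newton iteration, mirroring the OPTIMIZE pseudocode above.

    grad, hess: callables returning the gradient and Hessian of the objective L.
    Intended for a concave L (negative-definite Hessian near the optimum)."""
    f = np.array(f0, dtype=float)
    for _ in range(max_iter):
        g = grad(f)
        H = hess(f)
        delta = np.linalg.solve(H, g)       # Newton step H^{-1} g, via a solve instead of an inverse
        f = f - delta
        if np.linalg.norm(delta) < tol:     # converged
            break
    return f

# example: maximize the concave quadratic L(f) = -1/2 f^T A f + b^T f, optimum A^{-1} b
A = np.array([[2.0, 0.5], [0.5, 1.0]]); b = np.array([1.0, -1.0])
f_star = newton_maximize(lambda f: b - A @ f, lambda f: -A, np.zeros(2))
```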
Implementing the Laplace Approximation II
Concrete Algorithm, preliminary form [Rasmussen & Williams, 2006, §3.4]

procedure GP-LOGISTIC-TRAIN(K_XX, m_X, y)
    f ← m_X                                 initialize
    while not converged do
        r ← (y + 1)/2 − σ(f)                = ∇ log p(y | f_X), gradient of log likelihood
        W ← diag(σ(f) ⊙ (1 − σ(f)))         = −∇∇ log p(y | f_X), Hessian of log likelihood
        g ← r − K_XX^{−1} (f − m_X)         compute gradient
        H ← −(W + K_XX^{−1})^{−1}           compute inverse Hessian
        ∆ ← H g                             Newton step
        f ← f − ∆                           perform step
        converged ← ∥∆∥ < ϵ                 check for convergence
    end while
    return f
end procedure

This can be numerically unstable, as it (repeatedly) requires (W + K_XX^{−1})^{−1}. For a numerically
stable alternative, use B := I + W^{1/2} K_XX W^{1/2} (cf. Rasmussen & Williams).
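A hedged NumPy sketch of this training loop, written directly in the stable B = I + W^{1/2} K_XX W^{1/2} parametrization; the interface (argument names, returned tuple) is an assumption of this sketch, not prescribed by the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gp_logistic_train(K, m, y, tol=1e-8, max_iter=100):
    """Laplace mode finding for GP logistic regression in the stable parametrization
    via B = I + W^{1/2} K W^{1/2} (cf. Rasmussen & Williams, 2006, Sec. 3.4).

    K: (n, n) Gram matrix K_XX;  m: (n,) prior mean m_X;  y: (n,) labels in {-1, +1}.
    Returns the mode f_hat plus W, L = Cholesky(B), r, as needed for prediction."""
    n = len(y)
    f = np.array(m, dtype=float)
    for _ in range(max_iter):
        pi = sigmoid(f)
        r = (y + 1) / 2 - pi                      # gradient of the log likelihood
        W = pi * (1 - pi)                         # diagonal of minus its Hessian
        sqrt_W = np.sqrt(W)
        B = np.eye(n) + sqrt_W[:, None] * K * sqrt_W[None, :]
        L = np.linalg.cholesky(B)
        b = W * (f - m) + r
        # Newton step: f_new = m + K a  with  a = b - W^{1/2} B^{-1} W^{1/2} K b
        v = np.linalg.solve(L, sqrt_W * (K @ b))
        a = b - sqrt_W * np.linalg.solve(L.T, v)
        f_new = m + K @ a
        converged = np.linalg.norm(f_new - f) < tol
        f = f_new
        if converged:
            break
    # recompute the returned quantities at the final f
    pi = sigmoid(f)
    r = (y + 1) / 2 - pi
    W = pi * (1 - pi)
    sqrt_W = np.sqrt(W)
    L = np.linalg.cholesky(np.eye(n) + sqrt_W[:, None] * K * sqrt_W[None, :])
    return f, W, L, r
```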
Computing Predictions
from the Laplace approximation [Rasmussen & Williams, 2006, §3.4]

    log p(f_X | y) = Σ_{i=1}^{n} log σ(y_i f_{x_i}) − (1/2) (f_X − m_X)^⊺ K_XX^{−1} (f_X − m_X) + const.

    ∇ log p(f_X | y) = Σ_{i=1}^{n} ∇ log σ(y_i f_{x_i}) − K_XX^{−1} (f_X − m_X),
                       where  r := Σ_{i=1}^{n} ∇ log σ(y_i f_{x_i}),  i.e.  r_i = (y_i + 1)/2 − σ(f_{x_i})

procedure GP-LOGISTIC-PREDICT(f̂, W, R, r, k, x)        f̂, W, R = Cholesky(B), r handed over from training
    for i = 1, . . . , LENGTH(x) do
        f̄_i ← k_{x_i X} r                    mean prediction (at the mode, 0 = ∇ log p(f_X | y) = r − K_XX^{−1}(f_X − m_X))
        s ← R^{−1} (W^{1/2} k_{X x_i})        pre-computation allows this step in O(n²)
        v ← k_{x_i x_i} − s^⊺ s               v = var(f_{x_i})
        π̄_i ← ∫ σ(f_i) N(f_i; f̄_i, v) df_i    predictive probability for class 1: p(y | f̄) = ∫ p(y_x | f_x) p(f_x | f̄) df_x
    end for                                   entire loop is O(n² m) for m test cases
    return π̄_x
end procedure
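Continuing the training sketch above, the prediction loop could be vectorized as follows; the interface and the Gauss–Hermite quadrature used for the final one-dimensional integral π̄ = ∫ σ(f) N(f; f̄, v) df are choices of this sketch, not of the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gp_logistic_predict(W, L, r, m_star, K_star, k_star_diag, n_quad=20):
    """Predictive class-1 probabilities from the Laplace approximation; companion to the
    gp_logistic_train sketch above (W, L = Cholesky(B), r are its outputs).

    m_star: (m,) prior mean at the test inputs (the slide's pseudocode assumes zero mean);
    K_star: (n, m) cross-covariances k(X, x_*);  k_star_diag: (m,) prior variances k(x_*, x_*)."""
    sqrt_W = np.sqrt(W)
    # latent predictive mean: f_bar = m_* + k_{*X} r  (at the mode, K_XX^{-1}(f_hat - m_X) = r)
    f_bar = m_star + K_star.T @ r
    # latent predictive variance: v = k_** - s^T s  with  s = L^{-1} W^{1/2} k_{X*}
    S = np.linalg.solve(L, sqrt_W[:, None] * K_star)
    v = k_star_diag - np.sum(S**2, axis=0)
    # pi_bar = int sigma(f) N(f; f_bar, v) df, evaluated by Gauss-Hermite quadrature
    t, w = np.polynomial.hermite.hermgauss(n_quad)
    nodes = f_bar[None, :] + np.sqrt(2.0 * v)[None, :] * t[:, None]
    pi_bar = (w[:, None] * sigmoid(nodes)).sum(axis=0) / np.sqrt(np.pi)
    return pi_bar, f_bar, v
```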
Pictorial View
Gaussian process logistic regression

[Figure: GP logistic regression on one-dimensional data over x ∈ [−8, 8]: the latent function f (top panel) and the resulting class probability π ∈ [0, 1] (bottom panel), shown in several animation frames]
Gaussian Process Classification — (Probabilistic) Logistic Regression:
▶ Supervised classification phrased in a discriminative model with probabilistic interpretation
▶ model binary outputs as a transformation of a latent function with a Gaussian process prior
▶ due to non-Gaussian likelihood, the posterior is non-Gaussian; exact inference intractable
▶ Laplace approximation: Find MAP estimator, second order expansion for Gaussian approximation
▶ tune code for numerical stability, efficient computations
▶ Laplace approximation provides Gaussian posterior on training points, hence evidence, predictions

