
Probabilistic Inference and Learning

Lecture 08
Learning Representations

Philipp Hennig
11 May 2021

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #  date    content                          Ex  |   #  date    content                        Ex
 1  20.04.  Introduction                      1  |  14  09.06.  Logistic Regression             8
 2  21.04.  Reasoning under Uncertainty          |  15  15.06.  Exponential Families
 3  27.04.  Continuous Variables              2  |  16  16.06.  Graphical Models                9
 4  28.04.  Monte Carlo                          |  17  22.06.  Factor Graphs
 5  04.05.  Markov Chain Monte Carlo          3  |  18  23.06.  The Sum-Product Algorithm      10
 6  05.05.  Gaussian Distributions               |  19  29.06.  Example: Topic Models
 7  11.05.  Parametric Regression             4  |  20  30.06.  Mixture Models                 11
 8  12.05.  Learning Representations             |  21  06.07.  EM
 9  18.05.  Gaussian Processes                5  |  22  07.07.  Variational Inference          12
10  19.05.  An Example for GP Regression         |  23  13.07.  Example: Topic Models
11  25.05.  Understanding Kernels             6  |  24  14.07.  Example: Inferring Topics      13
12  26.05.  Gauss-Markov Models                  |  25  20.07.  Example: Kernel Topic Models
13  08.06.  GP Classification                 7  |  26  21.07.  Revision

Coming up: Ways to learn representations
▶ Can we learn the features?
▶ How do we do this in practice?
▶ hierarchical Bayesian inference
▶ Connections to deep learning

Reminder: General Linear Regression
An unbounded abundance of choices for features

[Figure: sample functions f(x) of a general linear model, plotted over x ∈ [−8, 8]]

p(w) = N(w; µ, Σ)   ⇒   p(f_x) = N(f_x; ϕ_x^⊺ µ, ϕ_x^⊺ Σ ϕ_x)

p(y | w, ϕ_X) = N(y; ϕ_X^⊺ w, σ² I) = N(y; f_X, σ² I)

p(f_x | y, ϕ_X) = N( f_x;  ϕ_x^⊺ µ + ϕ_x^⊺ Σ ϕ_X (ϕ_X^⊺ Σ ϕ_X + σ² I)^{−1} (y − ϕ_X^⊺ µ),
                           ϕ_x^⊺ Σ ϕ_x − ϕ_x^⊺ Σ ϕ_X (ϕ_X^⊺ Σ ϕ_X + σ² I)^{−1} ϕ_X^⊺ Σ ϕ_x )
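As a concrete illustration, the posterior above is a few lines of linear algebra. This is only a minimal sketch: the polynomial feature set, the toy data and the noise level are illustrative assumptions, not the lecture's example.

```python
import numpy as np

def phi(x):
    """Illustrative fixed feature set: polynomial basis (1, x, x^2, x^3); returns shape (F, N)."""
    x = np.atleast_1d(x)
    return np.stack([np.ones_like(x), x, x**2, x**3])

X = np.array([-4.0, -1.5, 0.0, 2.0, 5.0])        # training inputs
y = np.array([-8.0, -1.0, 0.5, 2.0, 9.0])        # training targets
sigma2 = 0.25                                    # observation noise variance
mu, Sigma = np.zeros(4), np.eye(4)               # prior p(w) = N(mu, Sigma)

Phi_X = phi(X)                                           # F x N
K = Phi_X.T @ Sigma @ Phi_X + sigma2 * np.eye(len(X))    # phi_X^T Sigma phi_X + sigma^2 I

xs = np.linspace(-8, 8, 200)                             # prediction grid
Phi_x = phi(xs)
cross = Phi_x.T @ Sigma @ Phi_X                          # phi_x^T Sigma phi_X

# posterior mean and covariance of f_x, exactly as in the posterior formula above
mean = Phi_x.T @ mu + cross @ np.linalg.solve(K, y - Phi_X.T @ mu)
cov = Phi_x.T @ Sigma @ Phi_x - cross @ np.linalg.solve(K, cross.T)
std = np.sqrt(np.maximum(np.diag(cov), 0.0))             # clip round-off negatives
```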
Can we Learn the Features?
Hierarchical Bayesian Inference

p(w | y, ϕ) = p(y | w, ϕ) p(w | ϕ) / p(y | ϕ)

[Figure: a logistic feature ϕ(x) rising from 0 to 1 over x ∈ [−5, 5]]

▶ There is an infinite-dimensional space of feature functions to choose from.
▶ Maybe we can restrict to a finite-dimensional sub-space and search in there? Say

  ϕ_i(x; θ) = 1 / (1 + exp(−(x − θ_1)/θ_2))

▶ θ_1, θ_2 are just unknown parameters!
▶ So can we infer them just like w?
▶ Yes, but not as easily: the likelihood

  p(y | w, θ) = N(y; ϕ(x; θ)^⊺ w, σ²)

  contains a non-linear map of θ.
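A small sketch of this parametrised feature family (the particular offsets, widths and weights below are illustrative assumptions): the likelihood mean ϕ(x; θ)^⊺ w is linear in w, but θ enters through the exponential, i.e. non-linearly.

```python
import numpy as np

def phi(x, offsets, width):
    """Logistic features phi_i(x; theta) = 1 / (1 + exp(-(x - theta_1,i) / theta_2))."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - offsets) / width))

x = np.linspace(-5, 5, 9)
w = np.array([1.0, -2.0, 0.5])                     # final-layer weights
for offsets, width in [(np.array([-2.0, 0.0, 2.0]), 1.0),
                       (np.array([-2.0, 0.0, 2.0]), 0.3),
                       (np.array([-3.0, 1.0, 4.0]), 1.0)]:
    mean = phi(x, offsets, width) @ w              # likelihood mean phi(x; theta)^T w:
    print(np.round(mean, 2))                       # linear in w, non-linear in theta
```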
Hierarchical Bayesian Inference
Bayesian model adaptation

p(f | y, x, θ) = p(y | f, x, θ) p(f | θ) / ∫ p(y | f, x, θ) p(f | θ) df = p(y | f, x, θ) p(f | θ) / p(y | x, θ)

▶ Model parameters like θ are also known as hyper-parameters.
▶ This is largely a computational, practical distinction:
    data             are observed                                               → condition
    variables        are the things we care about                               → full probabilistic treatment
    parameters       are the things we have to deal with to get the model right → integrate out
    hyper-parameters are the top level, too expensive to properly infer         → fit

The model evidence in Bayes’ Theorem is the (marginal) likelihood for the model. So we would like

  p(θ | y) = p(y | θ) p(θ) / ∫ p(y | θ′) p(θ′) dθ′
Hierarchical Bayesian Inference
Bayesian model adaptation

p(f | y, x, θ) = p(y | f, x, θ) p(f | θ) / ∫ p(y | f, x, θ) p(f | θ) df = p(y | f, x, θ) p(f | θ) / p(y | x, θ)

▶ For Gaussians, the evidence has analytic form:

  N(y; ϕ_X^θ⊺ w, Λ) · N(w; µ, Σ)  =  N(w; m_post^θ, V_post^θ) · N(y; ϕ_X^θ⊺ µ, ϕ_X^θ⊺ Σ ϕ_X^θ + Λ)

  (the four factors are, in order, the likelihood p(y | f, x, θ), the prior p(f), the posterior p(f | y, x, θ), and the evidence p(y | θ, x))

▶ BUT: It’s not a linear function of θ, so analytic Gaussian inference is not available!

Computational complexity is the principal challenge of probabilistic reasoning.
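The factorization can be checked numerically in a few lines; all matrices and numbers below are illustrative assumptions. Both sides assign the same log joint density to any pair (y, w).

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
F, N = 3, 5
Phi = rng.standard_normal((F, N))            # feature matrix phi_X^theta, shape F x N
mu, Sigma = np.zeros(F), np.eye(F)           # prior p(w) = N(mu, Sigma)
Lam = 0.1 * np.eye(N)                        # noise covariance Lambda
w = rng.standard_normal(F)                   # an arbitrary weight vector
y = rng.standard_normal(N)                   # an arbitrary observation vector

K = Phi.T @ Sigma @ Phi + Lam                # evidence covariance
m_post = mu + Sigma @ Phi @ np.linalg.solve(K, y - Phi.T @ mu)
V_post = Sigma - Sigma @ Phi @ np.linalg.solve(K, Phi.T @ Sigma)

lhs = mvn.logpdf(y, Phi.T @ w, Lam) + mvn.logpdf(w, mu, Sigma)       # likelihood * prior
rhs = mvn.logpdf(w, m_post, V_post) + mvn.logpdf(y, Phi.T @ mu, K)   # posterior * evidence
assert np.allclose(lhs, rhs)
```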
The Toolbox

Framework:

  ∫ p(x₁, x₂) dx₂ = p(x₁)        p(x₁, x₂) = p(x₁ | x₂) p(x₂)        p(x | y) = p(y | x) p(x) / p(y)

Modelling:                                   Computation:
▶ Directed Graphical Models                  ▶ Monte Carlo
▶ Gaussian Distributions                     ▶ Linear algebra / Gaussian inference
▶ Hierarchical models                        ▶ Maximum likelihood / Maximum a-posteriori
ML / MAP in Practice
Finding the “best fit” θ in Gaussian models   [e.g. D.J.C. MacKay, The evidence framework applied to classification networks, 1992]

θ̂ = arg max_θ p(y | x, θ) = arg max_θ ∫ p(y | f, x, θ) p(f | θ) df
  = arg max_θ N(y; ϕ_X^θ⊺ µ, ϕ_X^θ⊺ Σ ϕ_X^θ + Λ)
  = arg max_θ log N(y; ϕ_X^θ⊺ µ, ϕ_X^θ⊺ Σ ϕ_X^θ + Λ)
  = arg min_θ − log N(y; ϕ_X^θ⊺ µ, ϕ_X^θ⊺ Σ ϕ_X^θ + Λ)
  = arg min_θ 1/2 [ (y − ϕ_X^θ⊺ µ)^⊺ (ϕ_X^θ⊺ Σ ϕ_X^θ + Λ)^{−1} (y − ϕ_X^θ⊺ µ)  +  log |ϕ_X^θ⊺ Σ ϕ_X^θ + Λ|  +  N log 2π ]

The first term in the bracket is the square error (data fit); the second is the model complexity / Occam factor; the last is constant in θ.
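In practice, exactly this negative log evidence is handed to a generic optimizer. A minimal sketch; the logistic feature family, the toy data and the fixed noise variance are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.linspace(-5, 5, 30)
y = np.tanh(X) + 0.1 * rng.standard_normal(30)
sigma2 = 0.01                                    # noise variance, Lambda = sigma2 * I
F = 5                                            # number of features

def neg_log_evidence(theta):
    centers = theta[:F]                          # feature offsets theta_1
    width = np.exp(theta[F])                     # shared width theta_2, kept positive
    Phi = 1.0 / (1.0 + np.exp(-(X[:, None] - centers) / width))   # N x F
    K = Phi @ Phi.T + sigma2 * np.eye(len(X))    # prior: mu = 0, Sigma = I
    sq_error = y @ np.linalg.solve(K, y)         # (y - phi mu)^T K^{-1} (y - phi mu)
    occam = np.linalg.slogdet(K)[1]              # log |K|
    return 0.5 * (sq_error + occam + len(X) * np.log(2 * np.pi))

theta0 = np.concatenate([np.linspace(-4, 4, F), [0.0]])
result = minimize(neg_log_evidence, theta0, method="L-BFGS-B")
print(result.x, result.fun)
```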

log |ϕ_X^θ⊺ Σ ϕ_X^θ + Λ|

“Numquam ponenda est pluralitas sine necessitate.”
“Plurality must never be posited without necessity.”

— William of Occam (1285 (Occam, Surrey) – 1349 (Munich, Bavaria))
[Image: stained-glass window by Lawrence Lee]
What is Model Complexity?
The Occam factor is not always straightforward

[Figure: left — f(x) and the features ϕ(x) over x ∈ [−4, 4]; right — log |ϕ_X^λ⊺ Σ ϕ_X^λ| as a function of the feature parameter λ ∈ [0, 6]]

log |ϕ_X^θ⊺ Σ ϕ_X^θ + Λ|

measures model complexity as the “volume” of hypotheses covered by the joint Gaussian distribution.
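The right-hand panel can be reproduced qualitatively in a few lines. The Gaussian-bump feature family of width λ, the grid of centres, and the small jitter are illustrative assumptions standing in for the lecture's ϕ^λ; the exact numbers depend on those choices.

```python
import numpy as np

X = np.linspace(-4, 4, 40)                      # training inputs
centers = np.linspace(-4, 4, 10)                # fixed feature centres
Sigma = np.eye(10)                              # prior covariance over w
jitter = 1e-6 * np.eye(len(X))                  # small jitter for numerical stability

for lam in [0.1, 0.5, 1.0, 2.0, 4.0, 6.0]:
    Phi = np.exp(-0.5 * ((X[:, None] - centers) / lam) ** 2)   # N x F bump features of width lam
    K = Phi @ Sigma @ Phi.T + jitter
    occam = np.linalg.slogdet(K)[1]             # log |phi^T Sigma phi + Lambda|
    print(f"lambda = {lam:4.1f}   log-det = {occam:10.1f}")
```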
Type II Inference
Fitting a probabilistic model by maximum marginal likelihood

[Figure: left — the loss and its components (log p(y | θ), sq. error, Occam) over optimization steps t; right — the resulting fit f(x) over x ∈ [−5, 5]]
▶ Parameters θ that affect the model should ideally be part of the inference process. The evidence

  p(y | θ) = ∫ p(y | f, θ) p(f | θ) df

  (the denominator in Bayes’ theorem) is the (“type-II” or “marginal”) likelihood for θ.
▶ If analytic inference on θ is intractable (which it usually is), θ can be fitted by “type-II” maximum likelihood (or maximum a-posteriori).
▶ Bayesian inference still has effects here because the marginal likelihood gives rise to complexity penalties / Occam factors.
A Structural Observation
Graphical Model

[Graphical model, read bottom to top:]
  input        x
  parameters   θ_1 … θ_9
  features     [ϕ_x]_1 … [ϕ_x]_9
  weights      w_1 … w_9
  output       y

A linear Gaussian regressor is a single hidden layer neural network, with quadratic output loss and a fixed input layer. Hyperparameter-fitting corresponds to training the input layer. The usual way to train such a network, however, does not include the Occam factor.
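Spelled out in code, the observation reads as follows: a fixed logistic “input layer” followed by a linear read-out trained with quadratic loss is MAP estimation of w in a linear Gaussian model. This is only a sketch; the feature parameters, toy data and ridge/noise constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.linspace(-5, 5, 50)
y = np.sin(X) + 0.1 * rng.standard_normal(50)

# "input layer": fixed logistic features with hand-picked parameters theta
theta_centers = np.linspace(-4, 4, 9)
theta_width = 1.0
H = 1.0 / (1.0 + np.exp(-(X[:, None] - theta_centers) / theta_width))   # N x 9 hidden activations

# "output layer": linear read-out with quadratic loss and Gaussian prior on w
# (ridge regression = MAP estimate of w under p(w) = N(0, I), noise variance sigma2)
sigma2 = 0.01
w_map = np.linalg.solve(H.T @ H + sigma2 * np.eye(9), H.T @ y)
print(np.round(w_map, 2))
```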
What does the Optimizer need from us?
A bit of algorithmic wizardry

L(θ) = 1/2 [ (y − ϕ_X^θ⊺ µ)^⊺ (ϕ_X^θ⊺ Σ ϕ_X^θ + Λ)^{−1} (y − ϕ_X^θ⊺ µ)  +  log |ϕ_X^θ⊺ Σ ϕ_X^θ + Λ| ]

with the intermediate quantities

  ∆ := y − ϕ_X^θ⊺ µ,    K := ϕ_X^θ⊺ Σ ϕ_X^θ + Λ,    G := K^{−1},    e := ∆^⊺ G ∆,    c := log |K|,

so that L(θ) = 1/2 (e + c).
What does the Optimizer need from us?
Automatic Differentiation

[Figure: the computation graph of L(θ): θ → ϕ; ϕ → {∆, K}; K → {G, c}; {∆, G} → e; {e, c} → L, with intermediate quantities m1, …, m9 along its edges]

L(θ) = … = m9 + m8 = (m6^⊺ m5 m6) + log |m7 + Λ| = …
What does the Optimizer need from us?
Automatic Differentiation — Forward Mode

[Figure: the same computation graph, now with the local derivatives ṁ1, …, ṁ9 attached to its edges]

∂L/∂θ = ∂L/∂e ∂e/∂θ + ∂L/∂c ∂c/∂θ = ṁ9 ∂e/∂θ + ṁ8 ∂c/∂θ
      = ṁ9 (∂e/∂∆ ∂∆/∂θ + ∂e/∂G ∂G/∂θ) + ṁ8 ∂c/∂K ∂K/∂θ
      = ṁ9 (ṁ6 ∂∆/∂θ + ṁ5 ∂G/∂θ) + ṁ8 ṁ7 ∂K/∂θ
      = ṁ9 (ṁ6 ∂∆/∂ϕ ∂ϕ/∂θ + ṁ5 ∂G/∂K ∂K/∂θ) + ṁ8 ṁ7 ∂K/∂θ
      = ṁ9 ṁ6 ṁ2 ∂ϕ/∂θ + (ṁ9 ṁ5 ṁ4 + ṁ8 ṁ7) ∂K/∂θ
      = ṁ9 ṁ6 ṁ2 ∂ϕ/∂θ + (ṁ9 ṁ5 ṁ4 + ṁ8 ṁ7) ∂K/∂ϕ ∂ϕ/∂θ
      = (ṁ9 ṁ6 ṁ2 + (ṁ9 ṁ5 ṁ4 + ṁ8 ṁ7) ṁ3) ∂ϕ/∂θ
      = (ṁ9 ṁ6 ṁ2 + (ṁ9 ṁ5 ṁ4 + ṁ8 ṁ7) ṁ3) ṁ1 · 1
What does the Optimizer need from us?
Automatic Differentiation — Forward Mode

[Figure: the same computation graph, with the local derivatives ṁ1, …, ṁ9 attached to its edges]

The local derivatives are simple, known expressions:

  ṁ9 = ∂L/∂e = 1/2                       ṁ8 = ∂L/∂c = 1/2                     [ṁ7]_ij = ∂c/∂K_ij = [K^{−1}]_ij
  [ṁ6]_i = ∂e/∂∆_i = 2 [G∆]_i            [ṁ5]_ij = ∂e/∂G_ij = ∆_i ∆_j         [ṁ4]_{ij,kℓ} = ∂G_ij/∂K_kℓ = −G_ik G_jℓ
  [ṁ3]_{ij,ab} = ∂K_ij/∂ϕ_ab = δ_ia [Σϕ]_bj + δ_ja [Σϕ]_bi
  [ṁ2]_{i,ab} = ∂∆_i/∂ϕ_ab = −δ_ia µ_b
  [ṁ1]_{ab,ℓ} = ∂ϕ_ab/∂θ_ℓ = your choice!
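As a concrete sketch of forward-mode AD for this kind of loss, jax.jvp propagates a tangent alongside the evaluation and returns L(θ) together with a directional derivative in one pass. The logistic feature family, the toy data and all numbers are illustrative assumptions, not the lecture's example.

```python
import jax
import jax.numpy as jnp

X = jnp.linspace(-5.0, 5.0, 20)                  # training inputs
y = jnp.sin(X)                                   # toy targets
mu, Sigma, lam = jnp.zeros(3), jnp.eye(3), 0.1   # prior over w and noise variance

def features(x, theta):
    # three logistic features with shared width theta[1] and offsets theta[0] + {-2, 0, 2}
    centers = theta[0] + jnp.array([-2.0, 0.0, 2.0])
    return jax.nn.sigmoid((x[:, None] - centers) / theta[1])   # shape (N, F)

def loss(theta):
    Phi = features(X, theta)                          # N x F
    K = Phi @ Sigma @ Phi.T + lam * jnp.eye(X.shape[0])
    delta = y - Phi @ mu
    e = delta @ jnp.linalg.solve(K, delta)            # squared-error term
    c = jnp.linalg.slogdet(K)[1]                      # Occam / complexity term
    return 0.5 * (e + c)

theta0 = jnp.array([0.0, 1.0])
direction = jnp.array([1.0, 0.0])
# forward mode: one pass yields L(theta0) and the directional derivative along `direction`
value, dL_along_direction = jax.jvp(loss, (theta0,), (direction,))
print(value, dL_along_direction)
```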
What does the Optimizer need from us?
Automatic Differentiation — Backward Mode   [Seppo Linnainmaa, 1970]

Traversing the same graph from the top, with adjoints m̄ (the derivative of L with respect to each intermediate quantity):

  ∂L/∂θ = ∂L/∂ϕ ∂ϕ/∂θ =: m̄1 = (∂L/∂∆ ∂∆/∂ϕ + ∂L/∂K ∂K/∂ϕ) ∂ϕ/∂θ =: (m̄2 + m̄3) ∂ϕ/∂θ
  m̄2 = ∂L/∂e ∂e/∂∆ ∂∆/∂ϕ =: m̄6 ∂∆/∂ϕ          m̄3 = (∂L/∂G ∂G/∂K + ∂L/∂c ∂c/∂K) ∂K/∂ϕ =: (m̄4 + m̄7) ∂K/∂ϕ
  m̄4 = ∂L/∂e ∂e/∂G ∂G/∂K =: m̄5 ∂G/∂K          m̄5 = ∂L/∂e ∂e/∂G =: m̄9 ∂e/∂G          m̄6 = ∂L/∂e ∂e/∂∆ =: m̄9 ∂e/∂∆
  m̄7 = ∂L/∂c ∂c/∂K =: m̄8 ∂c/∂K                m̄8 = m̄9 = 1/2

Quantities w̄_i = ∂L/∂(subgraph_i) are known as adjoints. Traverse the graph backward to collect the derivative. This is faster than forward mode for single-output, many-input functions, but requires storing the above structure (known as a Wengert list). (cf. “Backpropagation”)
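The reverse-mode counterpart: one backward traversal returns the gradient with respect to all hyper-parameters at once; jax.grad records and traverses the computation graph (the Wengert list) for us. Same illustrative toy model as in the forward-mode sketch above.

```python
import jax
import jax.numpy as jnp

X = jnp.linspace(-5.0, 5.0, 20)
y = jnp.sin(X)
mu, Sigma, lam = jnp.zeros(3), jnp.eye(3), 0.1

def loss(theta):
    centers = theta[0] + jnp.array([-2.0, 0.0, 2.0])
    Phi = jax.nn.sigmoid((X[:, None] - centers) / theta[1])        # N x F logistic features
    K = Phi @ Sigma @ Phi.T + lam * jnp.eye(X.shape[0])
    delta = y - Phi @ mu
    return 0.5 * (delta @ jnp.linalg.solve(K, delta) + jnp.linalg.slogdet(K)[1])

grad_loss = jax.grad(loss)                 # reverse-mode AD ("backpropagation")
print(grad_loss(jnp.array([0.0, 1.0])))    # full gradient w.r.t. both entries of theta
```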
Deep Networks
But not Bayesian deep networks

[Figure: left — training loss L(t) over gradient steps t; middle — the resulting deep-network fit over x ∈ [−5, 5]; right — the network diagram: input x, hidden layers ϕ_1·, ϕ_2·, ϕ_3· with weights w_0, w_1, w_2, w_3, output y]

f̂(x, W) = Σ_{i=1}^F ϕ_{3i}(x, w_lower) w_{3i} = Σ_i ϕ_{3i}( Σ_j ϕ_{2j}( Σ_ℓ ϕ_{1ℓ}(w_{0ℓ} x) w_{1ℓj} ) w_{2ji} ) w_{3i}
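The nested sums are just a forward pass through a small multi-layer network. A minimal sketch; the layer widths, the tanh non-linearity and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
widths = [1, 4, 4, 4, 1]          # input x, three hidden layers phi_1..phi_3, output y
W = [rng.standard_normal((widths[k], widths[k + 1])) for k in range(4)]   # w_0, w_1, w_2, w_3

def f_hat(x, W):
    """f(x, W) = sum_i phi_3i( sum_j phi_2j( sum_l phi_1l(w_0l x) w_1lj ) w_2ji ) w_3i."""
    a = np.atleast_2d(x).T            # shape (N, 1)
    for Wk in W[:-1]:
        a = np.tanh(a @ Wk)           # phi_k applied elementwise to the weighted sums
    return a @ W[-1]                  # final linear read-out with weights w_3

print(f_hat(np.linspace(-5, 5, 3), W).ravel())
```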
Deep Networks
But not Bayesian deep networks

[Figure: same panels as above — the training loss L(t) and the resulting fit, next to the network diagram]

Training fits all weights by regularized least squares, via gradient descent:

  Ŵ = arg min_{W ∈ R^D} ∥y − f̂(x, W)∥² + α² ∥W∥² =: L(W),        W_{t+1} = W_t − τ ∇L(W_t)
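A minimal training-loop sketch for the objective and update above. The architecture, toy data, step size τ and regularisation weight α are illustrative assumptions; JAX is used here only to obtain ∇L.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
X = jnp.linspace(-5.0, 5.0, 50)[:, None]
y = jnp.sin(X[:, 0])

widths = [1, 4, 4, 4, 1]                       # x, three hidden layers, output
keys = jax.random.split(key, 4)
W = [0.5 * jax.random.normal(keys[k], (widths[k], widths[k + 1])) for k in range(4)]

def f_hat(x, W):
    a = x
    for Wk in W[:-1]:
        a = jnp.tanh(a @ Wk)                   # hidden layers
    return (a @ W[-1])[:, 0]                   # linear output layer

def loss(W, alpha=0.1):
    reg = sum(jnp.sum(Wk ** 2) for Wk in W)    # alpha^2 ||W||^2
    return jnp.sum((y - f_hat(X, W)) ** 2) + alpha ** 2 * reg

grad_fn = jax.jit(jax.grad(loss))              # reverse-mode AD gives dL/dW
tau = 1e-3                                     # step size
for t in range(2000):
    grads = grad_fn(W)
    W = [Wk - tau * gk for Wk, gk in zip(W, grads)]   # W_{t+1} = W_t - tau * grad L(W_t)
print(loss(W))
```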
The connection to Deep Learning
Just go MAP all the way

If we consider multiple layers, we might as well not integrate out the final layer’s weights:

  p(w, θ | y) ∝ p(y | w, ϕ^θ) p(w, θ) = p(w, θ) · ∏_{i=1}^n p(y_i | w, ϕ_i^θ) = p(w, θ) · ∏_{i=1}^n N(y_i; ϕ_i^θ⊺ w, σ²)

  arg max_{w,θ} p(w, θ | y) = arg min_{w,θ} − log p(w, θ | y)
    = arg min_{w,θ} − log p(w, θ) + 1/(2σ²) Σ_{i=1}^n ∥y_i − ϕ_i^θ⊺ w∥²        = arg min_{w,θ} r(w, θ) + Σ_{i=1}^n ℓ₂(y_i; θ, w)
    = arg min_{w,θ} Σ_i w_i² + Σ_j θ_j + 1/(2σ²) Σ_{i=1}^n ∥y_i − ϕ_i^θ⊺ w∥²   = arg min_{w,θ} r(w, θ) + L(θ, w),   where L(θ, w) := Σ_{i=1}^n ℓ₂(y_i; θ, w)
    ≈ arg min_{w,θ} Σ_i w_i² + Σ_j θ_j + n/(2σ² b) Σ_{β=1}^b ∥y_β − ϕ_β^θ⊺ w∥²   ∼ N(r + L(θ, w), O(b⁻¹))   (for a random mini-batch of size b)
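Data sub-sampling replaces the full sum over n data points by a rescaled sum over a random mini-batch of size b, giving an unbiased estimate of the objective whose variance shrinks as O(b⁻¹). A minimal sketch; the feature family, the quadratic regulariser and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, b, F = 1000, 32, 10
X = rng.uniform(-5, 5, n)
y = np.sin(X) + 0.1 * rng.standard_normal(n)
sigma2 = 0.01

def features(x, theta):
    return 1.0 / (1.0 + np.exp(-(x[:, None] - theta[:F]) / np.exp(theta[F])))

def regulariser(w, theta):
    return np.sum(w ** 2) + np.sum(theta ** 2)        # -log p(w, theta) up to constants; an illustrative quadratic choice

def full_objective(w, theta):
    resid = y - features(X, theta) @ w
    return regulariser(w, theta) + np.sum(resid ** 2) / (2 * sigma2)

def minibatch_objective(w, theta):
    idx = rng.choice(n, size=b, replace=False)        # random batch of size b
    resid = y[idx] - features(X[idx], theta) @ w
    return regulariser(w, theta) + (n / b) * np.sum(resid ** 2) / (2 * sigma2)   # rescale by n/b -> unbiased

w = rng.standard_normal(F)
theta = np.concatenate([np.linspace(-4, 4, F), [0.0]])
estimates = [minibatch_objective(w, theta) for _ in range(200)]
print(full_objective(w, theta), np.mean(estimates), np.std(estimates))
```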
Connections and Differences
Bayesian and Deep Learning

▶ MAP inference does not capture uncertainty on parameters:
  ▶ no posterior uncertainty from not fully identified parameters
  ▶ no model capacity control from the evidence term
▶ A linear Gaussian regressor is a single hidden layer neural network, with quadratic output loss and a fixed input layer (deep networks can of course be treated in the same way). Hyperparameter-fitting corresponds to training the input layer. The usual way to train such a network, however, does not include the Occam factor. Data sub-sampling can be used just as in other areas to speed up computations, at the cost of reduced computational precision.
▶ All worries one may have about fitting or hand-picking features for Bayesian regression also apply to deep learning. By highlighting assumptions and priors, the probabilistic view forces us to address many problems directly, rather than obscuring them with notation and intuitions.
▶ Automatic Differentiation (AD) is an algorithmic tool that is just as helpful for Bayesian inference as it is for deep learning.

It is possible to construct a point estimate for a Bayesian model, and to construct full posteriors for deep networks. The two domains are not separate; they are just different mental scaffolds. If you’re hoping for a theory of deep learning, probability theory is a primary contender.
Summary:
▶ The features used for Gaussian linear regression can be learnt by hierarchical Bayesian Inference
▶ This is usually intractable. Instead, approximate inference methods are used
▶ For example, maximum a-posteriori probability (MAP) inference fits a point-estimate for feature
parameters
▶ MAP inference is an optimization problem, and can thus be performed in the same way as other
optimization-based ML approaches, including deep learning. That is, using the same optimizers
(e.g. stochastic gradient descent), the same automatic differentiation frameworks (e.g.
TensorFlow, PyTorch, etc.), and the same data subsampling techniques.
The different viewpoints (probabilistic / statistical / empirical (“deep”)) on Machine Learning often overlap
and inform each other. Understanding of Bayesian linear (Gaussian) regression can help us build a better
intuition for deep learning, too.

Next lecture: Instead of learning a few features, sometimes we can get away with using infinitely many
features.

Probabilistic ML — P. Hennig, SS 2021 — Lecture 08: Learning Representations — © Philipp Hennig, 2021, CC BY-NC-SA 3.0
