
Probabilistic Inference and Learning

Lecture 08
Learning Representations

Philipp Hennig
11 May 2021

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #  date    content                          Ex  |   #  date    content                        Ex
 1  20.04.  Introduction                      1  |  14  09.06.  Logistic Regression             8
 2  21.04.  Reasoning under Uncertainty          |  15  15.06.  Exponential Families
 3  27.04.  Continuous Variables              2  |  16  16.06.  Graphical Models                9
 4  28.04.  Monte Carlo                          |  17  22.06.  Factor Graphs
 5  04.05.  Markov Chain Monte Carlo          3  |  18  23.06.  The Sum-Product Algorithm      10
 6  05.05.  Gaussian Distributions               |  19  29.06.  Example: Topic Models
 7  11.05.  Parametric Regression             4  |  20  30.06.  Mixture Models                 11
 8  12.05.  Learning Representations             |  21  06.07.  EM
 9  18.05.  Gaussian Processes                5  |  22  07.07.  Variational Inference          12
10  19.05.  An Example for GP Regression         |  23  13.07.  Example: Topic Models
11  25.05.  Understanding Kernels             6  |  24  14.07.  Example: Inferring Topics      13
12  26.05.  Gauss-Markov Models                  |  25  20.07.  Example: Kernel Topic Models
13  08.06.  GP Classification                 7  |  26  21.07.  Revision

Coming up: Ways to learn representations
▶ Can we learn the features?
▶ How do we do this in practice?
▶ hierarchical Bayesian inference
▶ Connections to deep learning

Reminder: General Linear Regression
An unbounded abundance of choices for features

[Figure: sample functions f(x) of a general linear model, plotted over x ∈ [−8, 8]]

p(w) = N(w; µ, Σ)   ⇒   p(f_x) = N(f_x; ϕ_x^⊺ µ, ϕ_x^⊺ Σ ϕ_x)

p(y | w, ϕ_X) = N(y; ϕ_X^⊺ w, σ² I) = N(y; f_X, σ² I)

p(f_x | y, ϕ_X) = N( f_x;  ϕ_x^⊺ µ + ϕ_x^⊺ Σ ϕ_X (ϕ_X^⊺ Σ ϕ_X + σ² I)^{−1} (y − ϕ_X^⊺ µ),
                           ϕ_x^⊺ Σ ϕ_x − ϕ_x^⊺ Σ ϕ_X (ϕ_X^⊺ Σ ϕ_X + σ² I)^{−1} ϕ_X^⊺ Σ ϕ_x )
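As a concrete illustration, the posterior above is a few lines of linear algebra. This is only a minimal sketch: the polynomial feature set, the toy data and the noise level are illustrative assumptions, not the lecture's example.

```python
import numpy as np

def phi(x):
    """Illustrative fixed feature set: polynomial basis (1, x, x^2, x^3); returns shape (F, N)."""
    x = np.atleast_1d(x)
    return np.stack([np.ones_like(x), x, x**2, x**3])

X = np.array([-4.0, -1.5, 0.0, 2.0, 5.0])        # training inputs
y = np.array([-8.0, -1.0, 0.5, 2.0, 9.0])        # training targets
sigma2 = 0.25                                    # observation noise variance
mu, Sigma = np.zeros(4), np.eye(4)               # prior p(w) = N(mu, Sigma)

Phi_X = phi(X)                                           # F x N
K = Phi_X.T @ Sigma @ Phi_X + sigma2 * np.eye(len(X))    # phi_X^T Sigma phi_X + sigma^2 I

xs = np.linspace(-8, 8, 200)                             # prediction grid
Phi_x = phi(xs)
cross = Phi_x.T @ Sigma @ Phi_X                          # phi_x^T Sigma phi_X

# posterior mean and covariance of f_x, exactly as in the posterior formula above
mean = Phi_x.T @ mu + cross @ np.linalg.solve(K, y - Phi_X.T @ mu)
cov = Phi_x.T @ Sigma @ Phi_x - cross @ np.linalg.solve(K, cross.T)
std = np.sqrt(np.maximum(np.diag(cov), 0.0))             # clip round-off negatives
```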
Can we Learn the Features?
Hierarchical Bayesian Inference

p(w | y, ϕ) = p(y | w, ϕ) p(w | ϕ) / p(y | ϕ)

[Figure: a logistic feature ϕ(x) rising from 0 to 1 over x ∈ [−5, 5]]

▶ There is an infinite-dimensional space of feature functions to choose from.
▶ Maybe we can restrict to a finite-dimensional sub-space and search in there? Say

  ϕ_i(x; θ) = 1 / (1 + exp(−(x − θ_1)/θ_2))

▶ θ_1, θ_2 are just unknown parameters!
▶ So can we infer them just like w?
▶ Yes, but not as easily: the likelihood

  p(y | w, θ) = N(y; ϕ(x; θ)^⊺ w, σ²)

  contains a non-linear map of θ.
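A small sketch of this parametrised feature family (the particular offsets, widths and weights below are illustrative assumptions): the likelihood mean ϕ(x; θ)^⊺ w is linear in w, but θ enters through the exponential, i.e. non-linearly.

```python
import numpy as np

def phi(x, offsets, width):
    """Logistic features phi_i(x; theta) = 1 / (1 + exp(-(x - theta_1,i) / theta_2))."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - offsets) / width))

x = np.linspace(-5, 5, 9)
w = np.array([1.0, -2.0, 0.5])                     # final-layer weights
for offsets, width in [(np.array([-2.0, 0.0, 2.0]), 1.0),
                       (np.array([-2.0, 0.0, 2.0]), 0.3),
                       (np.array([-3.0, 1.0, 4.0]), 1.0)]:
    mean = phi(x, offsets, width) @ w              # likelihood mean phi(x; theta)^T w:
    print(np.round(mean, 2))                       # linear in w, non-linear in theta
```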
Hierarchical Bayesian Inference
Bayesian model adaptation

p(f | y, x, θ) = p(y | f, x, θ) p(f | θ) / ∫ p(y | f, x, θ) p(f | θ) df = p(y | f, x, θ) p(f | θ) / p(y | x, θ)

▶ Model parameters like θ are also known as hyper-parameters.
▶ This is largely a computational, practical distinction:
    data             are observed                                               → condition
    variables        are the things we care about                               → full probabilistic treatment
    parameters       are the things we have to deal with to get the model right → integrate out
    hyper-parameters are the top level, too expensive to properly infer         → fit

The model evidence in Bayes’ Theorem is the (marginal) likelihood for the model. So we would like

  p(θ | y) = p(y | θ) p(θ) / ∫ p(y | θ′) p(θ′) dθ′
Hierarchical Bayesian Inference
Bayesian model adaptation

p(f | y, x, θ) = p(y | f, x, θ) p(f | θ) / ∫ p(y | f, x, θ) p(f | θ) df = p(y | f, x, θ) p(f | θ) / p(y | x, θ)

▶ For Gaussians, the evidence has analytic form:

  N(y; ϕ_X^θ⊺ w, Λ) · N(w; µ, Σ)  =  N(w; m_post^θ, V_post^θ) · N(y; ϕ_X^θ⊺ µ, ϕ_X^θ⊺ Σ ϕ_X^θ + Λ)

  (the four factors are, in order, the likelihood p(y | f, x, θ), the prior p(f), the posterior p(f | y, x, θ), and the evidence p(y | θ, x))

▶ BUT: It’s not a linear function of θ, so analytic Gaussian inference is not available!

Computational complexity is the principal challenge of probabilistic reasoning.
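The factorization can be checked numerically in a few lines; all matrices and numbers below are illustrative assumptions. Both sides assign the same log joint density to any pair (y, w).

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
F, N = 3, 5
Phi = rng.standard_normal((F, N))            # feature matrix phi_X^theta, shape F x N
mu, Sigma = np.zeros(F), np.eye(F)           # prior p(w) = N(mu, Sigma)
Lam = 0.1 * np.eye(N)                        # noise covariance Lambda
w = rng.standard_normal(F)                   # an arbitrary weight vector
y = rng.standard_normal(N)                   # an arbitrary observation vector

K = Phi.T @ Sigma @ Phi + Lam                # evidence covariance
m_post = mu + Sigma @ Phi @ np.linalg.solve(K, y - Phi.T @ mu)
V_post = Sigma - Sigma @ Phi @ np.linalg.solve(K, Phi.T @ Sigma)

lhs = mvn.logpdf(y, Phi.T @ w, Lam) + mvn.logpdf(w, mu, Sigma)       # likelihood * prior
rhs = mvn.logpdf(w, m_post, V_post) + mvn.logpdf(y, Phi.T @ mu, K)   # posterior * evidence
assert np.allclose(lhs, rhs)
```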
The Toolbox

Framework:

  ∫ p(x₁, x₂) dx₂ = p(x₁)        p(x₁, x₂) = p(x₁ | x₂) p(x₂)        p(x | y) = p(y | x) p(x) / p(y)

Modelling:                                   Computation:
▶ Directed Graphical Models                  ▶ Monte Carlo
▶ Gaussian Distributions                     ▶ Linear algebra / Gaussian inference
▶ Hierarchical models                        ▶ Maximum likelihood / Maximum a-posteriori
ML / MAP in Practice
Finding the “best fit” θ in Gaussian models   [e.g. D.J.C. MacKay, The evidence framework applied to classification networks, 1992]

θ̂ = arg max_θ p(y | x, θ) = arg max_θ ∫ p(y | f, x, θ) p(f | θ) df
  = arg max_θ N(y; ϕ_X^θ⊺ µ, ϕ_X^θ⊺ Σ ϕ_X^θ + Λ)
  = arg max_θ log N(y; ϕ_X^θ⊺ µ, ϕ_X^θ⊺ Σ ϕ_X^θ + Λ)
  = arg min_θ − log N(y; ϕ_X^θ⊺ µ, ϕ_X^θ⊺ Σ ϕ_X^θ + Λ)
  = arg min_θ 1/2 [ (y − ϕ_X^θ⊺ µ)^⊺ (ϕ_X^θ⊺ Σ ϕ_X^θ + Λ)^{−1} (y − ϕ_X^θ⊺ µ)  +  log |ϕ_X^θ⊺ Σ ϕ_X^θ + Λ|  +  N log 2π ]

The first term in the bracket is the square error (data fit); the second is the model complexity / Occam factor; the last is constant in θ.
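In practice, exactly this negative log evidence is handed to a generic optimizer. A minimal sketch; the logistic feature family, the toy data and the fixed noise variance are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = np.linspace(-5, 5, 30)
y = np.tanh(X) + 0.1 * rng.standard_normal(30)
sigma2 = 0.01                                    # noise variance, Lambda = sigma2 * I
F = 5                                            # number of features

def neg_log_evidence(theta):
    centers = theta[:F]                          # feature offsets theta_1
    width = np.exp(theta[F])                     # shared width theta_2, kept positive
    Phi = 1.0 / (1.0 + np.exp(-(X[:, None] - centers) / width))   # N x F
    K = Phi @ Phi.T + sigma2 * np.eye(len(X))    # prior: mu = 0, Sigma = I
    sq_error = y @ np.linalg.solve(K, y)         # (y - phi mu)^T K^{-1} (y - phi mu)
    occam = np.linalg.slogdet(K)[1]              # log |K|
    return 0.5 * (sq_error + occam + len(X) * np.log(2 * np.pi))

theta0 = np.concatenate([np.linspace(-4, 4, F), [0.0]])
result = minimize(neg_log_evidence, theta0, method="L-BFGS-B")
print(result.x, result.fun)
```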

log |ϕ_X^θ⊺ Σ ϕ_X^θ + Λ|

“Numquam ponenda est pluralitas sine necessitate.”
“Plurality must never be posited without necessity.”

— William of Occam (1285 (Occam, Surrey) – 1349 (Munich, Bavaria))
[Image: stained-glass window by Lawrence Lee]
What is Model Complexity?
The Occam factor is not always straightforward

[Figure: left — f(x) and the features ϕ(x) over x ∈ [−4, 4]; right — log |ϕ_X^λ⊺ Σ ϕ_X^λ| as a function of the feature parameter λ ∈ [0, 6]]

log |ϕ_X^θ⊺ Σ ϕ_X^θ + Λ|

measures model complexity as the “volume” of hypotheses covered by the joint Gaussian distribution.
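The right-hand panel can be reproduced qualitatively in a few lines. The Gaussian-bump feature family of width λ, the grid of centres, and the small jitter are illustrative assumptions standing in for the lecture's ϕ^λ; the exact numbers depend on those choices.

```python
import numpy as np

X = np.linspace(-4, 4, 40)                      # training inputs
centers = np.linspace(-4, 4, 10)                # fixed feature centres
Sigma = np.eye(10)                              # prior covariance over w
jitter = 1e-6 * np.eye(len(X))                  # small jitter for numerical stability

for lam in [0.1, 0.5, 1.0, 2.0, 4.0, 6.0]:
    Phi = np.exp(-0.5 * ((X[:, None] - centers) / lam) ** 2)   # N x F bump features of width lam
    K = Phi @ Sigma @ Phi.T + jitter
    occam = np.linalg.slogdet(K)[1]             # log |phi^T Sigma phi + Lambda|
    print(f"lambda = {lam:4.1f}   log-det = {occam:10.1f}")
```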
Type II Inference
Fitting a probabilistic model by maximum marginal likelihood

[Figure: left — the loss and its components (log p(y | θ), sq. error, Occam) over optimization steps t; right — the resulting fit f(x) over x ∈ [−5, 5]]
▶ Parameters θ that affect the model should ideally be part of the inference process. The evidence

  p(y | θ) = ∫ p(y | f, θ) p(f | θ) df

  (the denominator in Bayes’ theorem) is the (“type-II” or “marginal”) likelihood for θ.
▶ If analytic inference on θ is intractable (which it usually is), θ can be fitted by “type-II” maximum likelihood (or maximum a-posteriori).
▶ Bayesian inference still has effects here because the marginal likelihood gives rise to complexity penalties / Occam factors.
A Structural Observation
Graphical Model

[Graphical model, read bottom to top:]
  input        x
  parameters   θ_1 … θ_9
  features     [ϕ_x]_1 … [ϕ_x]_9
  weights      w_1 … w_9
  output       y

A linear Gaussian regressor is a single hidden layer neural network, with quadratic output loss and a fixed input layer. Hyperparameter-fitting corresponds to training the input layer. The usual way to train such a network, however, does not include the Occam factor.
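Spelled out in code, the observation reads as follows: a fixed logistic “input layer” followed by a linear read-out trained with quadratic loss is MAP estimation of w in a linear Gaussian model. This is only a sketch; the feature parameters, toy data and ridge/noise constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.linspace(-5, 5, 50)
y = np.sin(X) + 0.1 * rng.standard_normal(50)

# "input layer": fixed logistic features with hand-picked parameters theta
theta_centers = np.linspace(-4, 4, 9)
theta_width = 1.0
H = 1.0 / (1.0 + np.exp(-(X[:, None] - theta_centers) / theta_width))   # N x 9 hidden activations

# "output layer": linear read-out with quadratic loss and Gaussian prior on w
# (ridge regression = MAP estimate of w under p(w) = N(0, I), noise variance sigma2)
sigma2 = 0.01
w_map = np.linalg.solve(H.T @ H + sigma2 * np.eye(9), H.T @ y)
print(np.round(w_map, 2))
```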
What does the Optimizer need from us?
A bit of algorithmic wizardry

L(θ) = 1/2 [ (y − ϕ_X^θ⊺ µ)^⊺ (ϕ_X^θ⊺ Σ ϕ_X^θ + Λ)^{−1} (y − ϕ_X^θ⊺ µ)  +  log |ϕ_X^θ⊺ Σ ϕ_X^θ + Λ| ]

with the intermediate quantities

  ∆ := y − ϕ_X^θ⊺ µ,    K := ϕ_X^θ⊺ Σ ϕ_X^θ + Λ,    G := K^{−1},    e := ∆^⊺ G ∆,    c := log |K|,

so that L(θ) = 1/2 (e + c).
What does the Optimizer need from us?
Automatic Differentiation

[Figure: the computation graph of L(θ): θ → ϕ; ϕ → {∆, K}; K → {G, c}; {∆, G} → e; {e, c} → L, with intermediate quantities m1, …, m9 along its edges]

L(θ) = … = m9 + m8 = (m6^⊺ m5 m6) + log |m7 + Λ| = …
What does the Optimizer need from us?
Automatic Differentiation — Forward Mode

[Figure: the same computation graph, now with the local derivatives ṁ1, …, ṁ9 attached to its edges]

∂L/∂θ = ∂L/∂e ∂e/∂θ + ∂L/∂c ∂c/∂θ = ṁ9 ∂e/∂θ + ṁ8 ∂c/∂θ
      = ṁ9 (∂e/∂∆ ∂∆/∂θ + ∂e/∂G ∂G/∂θ) + ṁ8 ∂c/∂K ∂K/∂θ
      = ṁ9 (ṁ6 ∂∆/∂θ + ṁ5 ∂G/∂θ) + ṁ8 ṁ7 ∂K/∂θ
      = ṁ9 (ṁ6 ∂∆/∂ϕ ∂ϕ/∂θ + ṁ5 ∂G/∂K ∂K/∂θ) + ṁ8 ṁ7 ∂K/∂θ
      = ṁ9 ṁ6 ṁ2 ∂ϕ/∂θ + (ṁ9 ṁ5 ṁ4 + ṁ8 ṁ7) ∂K/∂θ
      = ṁ9 ṁ6 ṁ2 ∂ϕ/∂θ + (ṁ9 ṁ5 ṁ4 + ṁ8 ṁ7) ∂K/∂ϕ ∂ϕ/∂θ
      = (ṁ9 ṁ6 ṁ2 + (ṁ9 ṁ5 ṁ4 + ṁ8 ṁ7) ṁ3) ∂ϕ/∂θ
      = (ṁ9 ṁ6 ṁ2 + (ṁ9 ṁ5 ṁ4 + ṁ8 ṁ7) ṁ3) ṁ1 · 1
What does the Optimizer need from us?
Automatic Differentiation — Forward Mode

[Figure: the same computation graph, with the local derivatives ṁ1, …, ṁ9 attached to its edges]

The local derivatives are simple, known expressions:

  ṁ9 = ∂L/∂e = 1/2                       ṁ8 = ∂L/∂c = 1/2                     [ṁ7]_ij = ∂c/∂K_ij = [K^{−1}]_ij
  [ṁ6]_i = ∂e/∂∆_i = 2 [G∆]_i            [ṁ5]_ij = ∂e/∂G_ij = ∆_i ∆_j         [ṁ4]_{ij,kℓ} = ∂G_ij/∂K_kℓ = −G_ik G_jℓ
  [ṁ3]_{ij,ab} = ∂K_ij/∂ϕ_ab = δ_ia [Σϕ]_bj + δ_ja [Σϕ]_bi
  [ṁ2]_{i,ab} = ∂∆_i/∂ϕ_ab = −δ_ia µ_b
  [ṁ1]_{ab,ℓ} = ∂ϕ_ab/∂θ_ℓ = your choice!
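As a concrete sketch of forward-mode AD for this kind of loss, jax.jvp propagates a tangent alongside the evaluation and returns L(θ) together with a directional derivative in one pass. The logistic feature family, the toy data and all numbers are illustrative assumptions, not the lecture's example.

```python
import jax
import jax.numpy as jnp

X = jnp.linspace(-5.0, 5.0, 20)                  # training inputs
y = jnp.sin(X)                                   # toy targets
mu, Sigma, lam = jnp.zeros(3), jnp.eye(3), 0.1   # prior over w and noise variance

def features(x, theta):
    # three logistic features with shared width theta[1] and offsets theta[0] + {-2, 0, 2}
    centers = theta[0] + jnp.array([-2.0, 0.0, 2.0])
    return jax.nn.sigmoid((x[:, None] - centers) / theta[1])   # shape (N, F)

def loss(theta):
    Phi = features(X, theta)                          # N x F
    K = Phi @ Sigma @ Phi.T + lam * jnp.eye(X.shape[0])
    delta = y - Phi @ mu
    e = delta @ jnp.linalg.solve(K, delta)            # squared-error term
    c = jnp.linalg.slogdet(K)[1]                      # Occam / complexity term
    return 0.5 * (e + c)

theta0 = jnp.array([0.0, 1.0])
direction = jnp.array([1.0, 0.0])
# forward mode: one pass yields L(theta0) and the directional derivative along `direction`
value, dL_along_direction = jax.jvp(loss, (theta0,), (direction,))
print(value, dL_along_direction)
```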
What does the Optimizer need from us?
Automatic Differentiation — Backward Mode   [Seppo Linnainmaa, 1970]

Traversing the same graph from the top, with adjoints m̄ (the derivative of L with respect to each intermediate quantity):

  ∂L/∂θ = ∂L/∂ϕ ∂ϕ/∂θ =: m̄1 = (∂L/∂∆ ∂∆/∂ϕ + ∂L/∂K ∂K/∂ϕ) ∂ϕ/∂θ =: (m̄2 + m̄3) ∂ϕ/∂θ
  m̄2 = ∂L/∂e ∂e/∂∆ ∂∆/∂ϕ =: m̄6 ∂∆/∂ϕ          m̄3 = (∂L/∂G ∂G/∂K + ∂L/∂c ∂c/∂K) ∂K/∂ϕ =: (m̄4 + m̄7) ∂K/∂ϕ
  m̄4 = ∂L/∂e ∂e/∂G ∂G/∂K =: m̄5 ∂G/∂K          m̄5 = ∂L/∂e ∂e/∂G =: m̄9 ∂e/∂G          m̄6 = ∂L/∂e ∂e/∂∆ =: m̄9 ∂e/∂∆
  m̄7 = ∂L/∂c ∂c/∂K =: m̄8 ∂c/∂K                m̄8 = m̄9 = 1/2

Quantities w̄_i = ∂L/∂(subgraph_i) are known as adjoints. Traverse the graph backward to collect the derivative. This is faster than forward mode for single-output, many-input functions, but requires storing the above structure (known as a Wengert list). (cf. “Backpropagation”)
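The reverse-mode counterpart: one backward traversal returns the gradient with respect to all hyper-parameters at once; jax.grad records and traverses the computation graph (the Wengert list) for us. Same illustrative toy model as in the forward-mode sketch above.

```python
import jax
import jax.numpy as jnp

X = jnp.linspace(-5.0, 5.0, 20)
y = jnp.sin(X)
mu, Sigma, lam = jnp.zeros(3), jnp.eye(3), 0.1

def loss(theta):
    centers = theta[0] + jnp.array([-2.0, 0.0, 2.0])
    Phi = jax.nn.sigmoid((X[:, None] - centers) / theta[1])        # N x F logistic features
    K = Phi @ Sigma @ Phi.T + lam * jnp.eye(X.shape[0])
    delta = y - Phi @ mu
    return 0.5 * (delta @ jnp.linalg.solve(K, delta) + jnp.linalg.slogdet(K)[1])

grad_loss = jax.grad(loss)                 # reverse-mode AD ("backpropagation")
print(grad_loss(jnp.array([0.0, 1.0])))    # full gradient w.r.t. both entries of theta
```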
Deep Networks
But not Bayesian deep networks

[Figure: left — training loss L(t) over gradient steps t; middle — the resulting deep-network fit over x ∈ [−5, 5]; right — the network diagram: input x, hidden layers ϕ_1·, ϕ_2·, ϕ_3· with weights w_0, w_1, w_2, w_3, output y]

f̂(x, W) = Σ_{i=1}^F ϕ_{3i}(x, w_lower) w_{3i} = Σ_i ϕ_{3i}( Σ_j ϕ_{2j}( Σ_ℓ ϕ_{1ℓ}(w_{0ℓ} x) w_{1ℓj} ) w_{2ji} ) w_{3i}
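The nested sums are just a forward pass through a small multi-layer network. A minimal sketch; the layer widths, the tanh non-linearity and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
widths = [1, 4, 4, 4, 1]          # input x, three hidden layers phi_1..phi_3, output y
W = [rng.standard_normal((widths[k], widths[k + 1])) for k in range(4)]   # w_0, w_1, w_2, w_3

def f_hat(x, W):
    """f(x, W) = sum_i phi_3i( sum_j phi_2j( sum_l phi_1l(w_0l x) w_1lj ) w_2ji ) w_3i."""
    a = np.atleast_2d(x).T            # shape (N, 1)
    for Wk in W[:-1]:
        a = np.tanh(a @ Wk)           # phi_k applied elementwise to the weighted sums
    return a @ W[-1]                  # final linear read-out with weights w_3

print(f_hat(np.linspace(-5, 5, 3), W).ravel())
```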
Deep Networks
But not Bayesian deep networks

[Figure: same panels as above — the training loss L(t) and the resulting fit, next to the network diagram]

Training fits all weights by regularized least squares, via gradient descent:

  Ŵ = arg min_{W ∈ R^D} ∥y − f̂(x, W)∥² + α² ∥W∥² =: L(W),        W_{t+1} = W_t − τ ∇L(W_t)
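A minimal training-loop sketch for the objective and update above. The architecture, toy data, step size τ and regularisation weight α are illustrative assumptions; JAX is used here only to obtain ∇L.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
X = jnp.linspace(-5.0, 5.0, 50)[:, None]
y = jnp.sin(X[:, 0])

widths = [1, 4, 4, 4, 1]                       # x, three hidden layers, output
keys = jax.random.split(key, 4)
W = [0.5 * jax.random.normal(keys[k], (widths[k], widths[k + 1])) for k in range(4)]

def f_hat(x, W):
    a = x
    for Wk in W[:-1]:
        a = jnp.tanh(a @ Wk)                   # hidden layers
    return (a @ W[-1])[:, 0]                   # linear output layer

def loss(W, alpha=0.1):
    reg = sum(jnp.sum(Wk ** 2) for Wk in W)    # alpha^2 ||W||^2
    return jnp.sum((y - f_hat(X, W)) ** 2) + alpha ** 2 * reg

grad_fn = jax.jit(jax.grad(loss))              # reverse-mode AD gives dL/dW
tau = 1e-3                                     # step size
for t in range(2000):
    grads = grad_fn(W)
    W = [Wk - tau * gk for Wk, gk in zip(W, grads)]   # W_{t+1} = W_t - tau * grad L(W_t)
print(loss(W))
```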
The connection to Deep Learning
Just go MAP all the way

If we consider multiple layers, we might as well not integrate out the final layer’s weights:

  p(w, θ | y) ∝ p(y | w, ϕ^θ) p(w, θ) = p(w, θ) · ∏_{i=1}^n p(y_i | w, ϕ_i^θ) = p(w, θ) · ∏_{i=1}^n N(y_i; ϕ_i^θ⊺ w, σ²)

  arg max_{w,θ} p(w, θ | y) = arg min_{w,θ} − log p(w, θ | y)
    = arg min_{w,θ} − log p(w, θ) + 1/(2σ²) Σ_{i=1}^n ∥y_i − ϕ_i^θ⊺ w∥²        = arg min_{w,θ} r(w, θ) + Σ_{i=1}^n ℓ₂(y_i; θ, w)
    = arg min_{w,θ} Σ_i w_i² + Σ_j θ_j + 1/(2σ²) Σ_{i=1}^n ∥y_i − ϕ_i^θ⊺ w∥²   = arg min_{w,θ} r(w, θ) + L(θ, w),   where L(θ, w) := Σ_{i=1}^n ℓ₂(y_i; θ, w)
    ≈ arg min_{w,θ} Σ_i w_i² + Σ_j θ_j + n/(2σ² b) Σ_{β=1}^b ∥y_β − ϕ_β^θ⊺ w∥²   ∼ N(r + L(θ, w), O(b⁻¹))   (for a random mini-batch of size b)
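Data sub-sampling replaces the full sum over n data points by a rescaled sum over a random mini-batch of size b, giving an unbiased estimate of the objective whose variance shrinks as O(b⁻¹). A minimal sketch; the feature family, the quadratic regulariser and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, b, F = 1000, 32, 10
X = rng.uniform(-5, 5, n)
y = np.sin(X) + 0.1 * rng.standard_normal(n)
sigma2 = 0.01

def features(x, theta):
    return 1.0 / (1.0 + np.exp(-(x[:, None] - theta[:F]) / np.exp(theta[F])))

def regulariser(w, theta):
    return np.sum(w ** 2) + np.sum(theta ** 2)        # -log p(w, theta) up to constants; an illustrative quadratic choice

def full_objective(w, theta):
    resid = y - features(X, theta) @ w
    return regulariser(w, theta) + np.sum(resid ** 2) / (2 * sigma2)

def minibatch_objective(w, theta):
    idx = rng.choice(n, size=b, replace=False)        # random batch of size b
    resid = y[idx] - features(X[idx], theta) @ w
    return regulariser(w, theta) + (n / b) * np.sum(resid ** 2) / (2 * sigma2)   # rescale by n/b -> unbiased

w = rng.standard_normal(F)
theta = np.concatenate([np.linspace(-4, 4, F), [0.0]])
estimates = [minibatch_objective(w, theta) for _ in range(200)]
print(full_objective(w, theta), np.mean(estimates), np.std(estimates))
```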
Connections and Differences
Bayesian and Deep Learning

▶ MAP inference does not capture uncertainty on parameters:
  ▶ no posterior uncertainty from not fully identified parameters
  ▶ no model capacity control from the evidence term
▶ A linear Gaussian regressor is a single hidden layer neural network, with quadratic output loss and a fixed input layer (deep networks can of course be treated in the same way). Hyperparameter-fitting corresponds to training the input layer. The usual way to train such a network, however, does not include the Occam factor. Data sub-sampling can be used just as in other areas to speed up computations, at the cost of reduced computational precision.
▶ All worries one may have about fitting or hand-picking features for Bayesian regression also apply to deep learning. By highlighting assumptions and priors, the probabilistic view forces us to address many problems directly, rather than obscuring them with notation and intuitions.
▶ Automatic Differentiation (AD) is an algorithmic tool that is just as helpful for Bayesian inference as it is for deep learning.

It is possible to construct a point estimate for a Bayesian model, and to construct full posteriors for deep networks. The two domains are not separate; they are just different mental scaffolds. If you’re hoping for a theory of deep learning, probability theory is a primary contender.
Summary:
▶ The features used for Gaussian linear regression can be learnt by hierarchical Bayesian Inference
▶ This is usually intractable. Instead, approximate inference methods are used
▶ For example, maximum a-posteriori probability (MAP) inference fits a point-estimate for feature
parameters
▶ MAP inference is an optimization problem, and can thus be performed in the same way as other
optimization-based ML approaches, including deep learning. That is, using the same optimizers
(e.g. stochastic gradient descent), the same automatic differentiation frameworks (e.g.
TensorFlow, PyTorch, etc.), and the same data subsampling techniques.
The different viewpoints (probabilistic / statistical / empirical (“deep”)) on Machine Learning often overlap
and inform each other. Understanding of Bayesian linear (Gaussian) regression can help us build a better
intuition for deep learning, too.

Next lecture: Instead of learning a few features, sometimes we can get away with using infinitely many
features.

Probabilistic ML — P. Hennig, SS 2021 — Lecture 08: Learning Representations — © Philipp Hennig, 2021, CC BY-NC-SA 3.0
