Lecture 08
Learning Representations
Philipp Hennig
11 May 2021
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #  date    content                       Ex |  #  date    content                       Ex
 1  20.04.  Introduction                   1 | 14  09.06.  Logistic Regression            8
 2  21.04.  Reasoning under Uncertainty      | 15  15.06.  Exponential Families
 3  27.04.  Continuous Variables           2 | 16  16.06.  Graphical Models               9
 4  28.04.  Monte Carlo                      | 17  22.06.  Factor Graphs
 5  04.05.  Markov Chain Monte Carlo       3 | 18  23.06.  The Sum-Product Algorithm     10
 6  05.05.  Gaussian Distributions           | 19  29.06.  Example: Topic Models
 7  11.05.  Parametric Regression          4 | 20  30.06.  Mixture Models                11
 8  12.05.  Learning Representations         | 21  06.07.  EM
 9  18.05.  Gaussian Processes             5 | 22  07.07.  Variational Inference         12
10  19.05.  An Example for GP Regression     | 23  13.07.  Example: Topic Models
11  25.05.  Understanding Kernels          6 | 24  14.07.  Example: Inferring Topics     13
12  26.05.  Gauss-Markov Models              | 25  20.07.  Example: Kernel Topic Models
13  08.06.  GP Classification              7 | 26  21.07.  Revision
Coming up: Ways to learn representations
▶ Can we learn the features?
▶ How do we do this in practice?
▶ hierarchical Bayesian inference
▶ Connections to deep learning
Reminder: General Linear Regression
An unbounded abundance of choices for features
[Figure: six general linear regression fits f(x) on x ∈ [−8, 8], each built from a different choice of features; the bottom panel shows one such set of features ϕ(x) on x ∈ [−5, 5].]
Can we Learn the Features?
Hierarchical Bayesian Inference
[Figure: a set of features ϕ(x) on x ∈ [−5, 5].]
If the features ϕ themselves depend on parameters θ, we can treat θ as another unknown and infer it from the data:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta')\, p(\theta')\, d\theta'}$$
Hierarchical Bayesian Inference
Bayesian model adaptation
▶ BUT: the model is not a linear function of θ, so analytic Gaussian inference over θ is not available!
The Toolbox
Framework:

$$\int p(x_1, x_2)\, dx_2 = p(x_1) \qquad\qquad p(x_1, x_2) = p(x_1 \mid x_2)\, p(x_2) \qquad\qquad p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$$

Modelling:
▶ Directed Graphical Models
▶ Gaussian Distributions
▶ hierarchical models

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶ Maximum likelihood / Maximum a-posteriori
ML / MAP in Practice
Finding the “best fit” θ in Gaussian models [e.g. DJC MacKay, The evidence framework applied to classification networks, 1992]
$$\hat\theta = \arg\max_\theta\, p(y \mid x, \theta) = \arg\max_\theta \int p(y \mid f, x, \theta)\, p(f \mid \theta)\, df$$
$$= \arg\max_\theta\, \mathcal N\big(y;\ \phi_X^{\theta\intercal}\mu,\ \phi_X^{\theta\intercal}\Sigma\,\phi_X^{\theta} + \Lambda\big)
 = \arg\max_\theta\, \log \mathcal N\big(y;\ \phi_X^{\theta\intercal}\mu,\ \phi_X^{\theta\intercal}\Sigma\,\phi_X^{\theta} + \Lambda\big)
 = \arg\min_\theta\, -\log \mathcal N\big(y;\ \phi_X^{\theta\intercal}\mu,\ \phi_X^{\theta\intercal}\Sigma\,\phi_X^{\theta} + \Lambda\big)$$
$$= \arg\min_\theta\ \underbrace{\tfrac12\,(y - \phi_X^{\theta\intercal}\mu)^\intercal \big(\phi_X^{\theta\intercal}\Sigma\,\phi_X^{\theta} + \Lambda\big)^{-1}(y - \phi_X^{\theta\intercal}\mu)}_{\text{square error}}
 + \underbrace{\tfrac12 \log\big|\phi_X^{\theta\intercal}\Sigma\,\phi_X^{\theta} + \Lambda\big|}_{\text{model complexity / Occam factor}}
 + \tfrac{N}{2}\log 2\pi$$
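The objective above can be written down directly. Below is a minimal sketch, assuming Gaussian RBF features with a learnable log length-scale θ; the feature centres, the noise level, and names such as `phi` and `neg_log_evidence` are our illustrative choices, not the lecture's.

```python
import jax.numpy as jnp

centers = jnp.linspace(-8.0, 8.0, 16)   # fixed RBF feature centres
mu = jnp.zeros(16)                      # prior mean of the weights
Sigma = jnp.eye(16)                     # prior covariance of the weights
noise = 0.5                             # observation noise std, Lambda = noise^2 * I

def phi(x, theta):
    """Feature matrix phi_X^theta of shape (F, N); theta is the log length-scale."""
    return jnp.exp(-0.5 * ((centers[:, None] - x[None, :]) / jnp.exp(theta)) ** 2)

def neg_log_evidence(theta, x, y):
    """-log N(y; phi^T mu, phi^T Sigma phi + Lambda) = square error + Occam factor + const."""
    phiX = phi(x, theta)
    K = phiX.T @ Sigma @ phiX + noise**2 * jnp.eye(x.shape[0])
    r = y - phiX.T @ mu
    square_error = 0.5 * r @ jnp.linalg.solve(K, r)
    occam = 0.5 * jnp.linalg.slogdet(K)[1]
    return square_error + occam + 0.5 * x.shape[0] * jnp.log(2.0 * jnp.pi)

# compare the objective for a short and a long length-scale on toy data
x = jnp.linspace(-8.0, 8.0, 40)
y = jnp.sin(x)
print(neg_log_evidence(jnp.log(0.3), x, y), neg_log_evidence(jnp.log(3.0), x, y))
```

The two printed values compare a short and a long length-scale; the optimum trades the square-error term against the Occam factor.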
What is Model Complexity?
The Occam factor is not always straightforward
[Figure: left, f(x) and the features ϕ(x) on x ∈ [−4, 4] for varying length-scale λ; right, the Occam term log |ϕ_X^{λ⊺} Σ ϕ_X^λ| as a function of λ ∈ [0, 6].]

$$\log\big|\phi_X^{\theta\intercal}\Sigma\,\phi_X^{\theta} + \Lambda\big|$$

measures model complexity as the “volume” of hypotheses covered by the joint Gaussian distribution.
Type II Inference
Fitting a probabilistic model by maximum marginal likelihood
[Figure: left, the objective log p(y | θ) and its two components (sq. error and Occam) over optimization steps t; right, the resulting fit f(x) to the data.]
▶ Parameters θ that affect the model should ideally be part of the inference process. Where that is intractable, the evidence
$$p(y \mid \theta) = \int p(y \mid f, \theta)\, p(f \mid \theta)\, df$$
can be maximized with respect to θ instead (maximum marginal likelihood, a.k.a. type-II maximum likelihood).
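As a concrete illustration of the figure above, here is a sketch of type-II inference by plain gradient descent on −log p(y | θ) for a toy model in which θ is a log length-scale; the model, the step size, and all names are illustrative assumptions, not the lecture's code.

```python
import jax
import jax.numpy as jnp

def evidence_terms(theta, x, y, noise=0.3):
    # squared-exponential Gram matrix plays the role of phi^T Sigma phi
    K = jnp.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / jnp.exp(theta) ** 2)
    K = K + noise**2 * jnp.eye(x.shape[0])
    sq_error = 0.5 * y @ jnp.linalg.solve(K, y)
    occam = 0.5 * jnp.linalg.slogdet(K)[1]
    return sq_error, occam

def loss(theta, x, y):
    sq_error, occam = evidence_terms(theta, x, y)
    return sq_error + occam

x = jnp.linspace(-5.0, 5.0, 30)
y = jnp.sin(x)
theta = jnp.array(2.0)                 # start from a far-too-long length-scale
for t in range(200):                   # plain gradient descent on -log p(y | theta)
    theta = theta - 0.01 * jax.grad(loss)(theta, x, y)
print(jnp.exp(theta), evidence_terms(theta, x, y))
```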
A Structural Observation
Graphical Model
[Graphical model: the input x and the parameters θ₁, …, θ₉ determine the features [ϕ_x]₁, …, [ϕ_x]₉, which are weighted by w₁, …, w₉ to produce the output y.]
A linear Gaussian regressor is a single-hidden-layer neural network with quadratic output loss and a fixed input layer. Hyperparameter fitting corresponds to training the input layer. The usual way to train such a network, however, does not include the Occam factor.
What does the Optimizer need from us?
A bit of algorithmic wizardry
$$L(\theta) = \tfrac12\,(y - \phi_X^{\theta\intercal}\mu)^\intercal \big(\phi_X^{\theta\intercal}\Sigma\,\phi_X^{\theta} + \Lambda\big)^{-1}(y - \phi_X^{\theta\intercal}\mu) + \tfrac12 \log\big|\phi_X^{\theta\intercal}\Sigma\,\phi_X^{\theta} + \Lambda\big|$$

with the intermediate quantities
$$\Delta := y - \phi_X^{\theta\intercal}\mu, \qquad K := \phi_X^{\theta\intercal}\Sigma\,\phi_X^{\theta} + \Lambda, \qquad G := K^{-1}, \qquad e := \Delta^\intercal G\,\Delta, \qquad c := \log|K|,$$
so that $L(\theta) = \tfrac12\,(e + c)$.
What does the Optimizer need from us?
Automatic Differentiation
The loss decomposes into a computational graph of elementary operations,
$$\theta \;\to\; \phi \;\to\; \{\Delta, K\}, \qquad K \to G, \qquad (\Delta, G) \to e, \qquad K \to c, \qquad (e, c) \to L,$$
with the edges labelled $m_1, \dots, m_9$ ($m_1: \theta \to \phi$; $m_2: \phi \to \Delta$; $m_3: \phi \to K$; $m_4: K \to G$; $m_5: G \to e$; $m_6: \Delta \to e$; $m_7: K \to c$; $m_8: c \to L$; $m_9: e \to L$). If the local derivative along every edge is known, the chain rule assembles $\partial L / \partial\theta$ mechanically.
What does the Optimizer need from us?
Automatic Differentiation — Forward Mode
Expanding the chain rule over the graph, from the output $L$ down to $\theta$:
$$
\begin{aligned}
\frac{\partial L}{\partial \theta}
&= \frac{\partial L}{\partial e}\frac{\partial e}{\partial \theta} + \frac{\partial L}{\partial c}\frac{\partial c}{\partial \theta}
 = \dot m_9\, \frac{\partial e}{\partial \theta} + \dot m_8\, \frac{\partial c}{\partial \theta}
 = \dot m_9 \left(\frac{\partial e}{\partial \Delta}\frac{\partial \Delta}{\partial \theta} + \frac{\partial e}{\partial G}\frac{\partial G}{\partial \theta}\right) + \dot m_8\, \frac{\partial c}{\partial K}\frac{\partial K}{\partial \theta} \\
&= \dot m_9 \left(\dot m_6\, \frac{\partial \Delta}{\partial \theta} + \dot m_5\, \frac{\partial G}{\partial \theta}\right) + \dot m_8\, \dot m_7\, \frac{\partial K}{\partial \theta}
 = \dot m_9 \left(\dot m_6\, \frac{\partial \Delta}{\partial \phi}\frac{\partial \phi}{\partial \theta} + \dot m_5\, \frac{\partial G}{\partial K}\frac{\partial K}{\partial \theta}\right) + \dot m_8\, \dot m_7\, \frac{\partial K}{\partial \theta} \\
&= \dot m_9\, \dot m_6\, \dot m_2\, \frac{\partial \phi}{\partial \theta} + \left(\dot m_9\, \dot m_5\, \dot m_4 + \dot m_8\, \dot m_7\right)\frac{\partial K}{\partial \theta}
 = \dot m_9\, \dot m_6\, \dot m_2\, \frac{\partial \phi}{\partial \theta} + \left(\dot m_9\, \dot m_5\, \dot m_4 + \dot m_8\, \dot m_7\right)\frac{\partial K}{\partial \phi}\frac{\partial \phi}{\partial \theta} \\
&= \left(\dot m_9\, \dot m_6\, \dot m_2 + \left(\dot m_9\, \dot m_5\, \dot m_4 + \dot m_8\, \dot m_7\right)\dot m_3\right)\frac{\partial \phi}{\partial \theta}
 = \left(\dot m_9\, \dot m_6\, \dot m_2 + \left(\dot m_9\, \dot m_5\, \dot m_4 + \dot m_8\, \dot m_7\right)\dot m_3\right)\dot m_1
\end{aligned}
$$
What does the Optimizer need from us?
Automatic Differentiation — Forward Mode
The local derivatives along the edges:
$$\dot m_9 = \frac{\partial L}{\partial e} = \tfrac12, \qquad \dot m_8 = \frac{\partial L}{\partial c} = \tfrac12, \qquad [\dot m_7]_{ij} = \frac{\partial c}{\partial K_{ij}} = \big[K^{-1}\big]_{ij}$$
$$[\dot m_6]_i = \frac{\partial e}{\partial \Delta_i} = 2\,[G\Delta]_i, \qquad [\dot m_5]_{ij} = \frac{\partial e}{\partial G_{ij}} = \Delta_i \Delta_j, \qquad [\dot m_4]_{ij,k\ell} = \frac{\partial G_{ij}}{\partial K_{k\ell}} = -G_{ik}\, G_{j\ell}$$
$$[\dot m_3]_{ij,ab} = \frac{\partial K_{ij}}{\partial \phi_{ab}} = \delta_{ia}[\Sigma\phi]_{bj} + \delta_{ja}[\Sigma\phi]_{bi}, \qquad [\dot m_2]_{i,ab} = \frac{\partial \Delta_i}{\partial \phi_{ab}} = -\delta_{ia}\,\mu_b, \qquad [\dot m_1]_{ab,\ell} = \frac{\partial \phi_{ab}}{\partial \theta_\ell} = \text{your choice!}$$
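A minimal sketch of forward-mode differentiation in code, using `jax.jvp` to push a tangent through a toy scalar loss with the same ϕ → (∆, K) → (e, c) → L structure; the loss itself and all names are illustrative assumptions, not the lecture's.

```python
import jax
import jax.numpy as jnp

def L(theta):
    phi = jnp.exp(-0.5 * theta**2)   # "features", depend nonlinearly on theta (edge m1)
    delta = 1.0 - phi                # Delta = y - phi^T mu   (toy, scalar)
    K = phi**2 + 0.1                 # K = phi^T Sigma phi + Lambda   (toy, scalar)
    e = delta**2 / K                 # e = Delta^T K^{-1} Delta
    c = jnp.log(K)                   # c = log|K|
    return 0.5 * (e + c)

theta0 = jnp.array(0.7)
# one forward sweep pushes the tangent theta_dot = 1 through the graph
value, dL_dtheta = jax.jvp(L, (theta0,), (jnp.ones_like(theta0),))
print(value, dL_dtheta)
```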
What does the Optimizer need from us?
Automatic Differentiation — Backward Mode [Seppo Linnainmaa, 1970]
In reverse (backward) mode, the chain rule is accumulated from the output towards the inputs:
$$\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial \phi}\frac{\partial \phi}{\partial \theta} =: \bar m_1 = \left(\frac{\partial L}{\partial \Delta}\frac{\partial \Delta}{\partial \phi} + \frac{\partial L}{\partial K}\frac{\partial K}{\partial \phi}\right)\frac{\partial \phi}{\partial \theta} =: (\bar m_2 + \bar m_3)\,\frac{\partial \phi}{\partial \theta}$$
$$\bar m_2 = \frac{\partial L}{\partial e}\frac{\partial e}{\partial \Delta}\frac{\partial \Delta}{\partial \phi} =: \bar m_6\, \frac{\partial \Delta}{\partial \phi}, \qquad
\bar m_3 = \left(\frac{\partial L}{\partial G}\frac{\partial G}{\partial K} + \frac{\partial L}{\partial c}\frac{\partial c}{\partial K}\right)\frac{\partial K}{\partial \phi} =: (\bar m_4 + \bar m_7)\,\frac{\partial K}{\partial \phi}$$
$$\bar m_4 = \frac{\partial L}{\partial e}\frac{\partial e}{\partial G}\frac{\partial G}{\partial K} =: \bar m_5\, \frac{\partial G}{\partial K}, \qquad
\bar m_5 = \frac{\partial L}{\partial e}\frac{\partial e}{\partial G} =: \bar m_9\, \frac{\partial e}{\partial G}, \qquad
\bar m_6 = \frac{\partial L}{\partial e}\frac{\partial e}{\partial \Delta} =: \bar m_9\, \frac{\partial e}{\partial \Delta}$$
$$\bar m_7 = \frac{\partial L}{\partial c}\frac{\partial c}{\partial K} =: \bar m_8\, \frac{\partial c}{\partial K}, \qquad
\bar m_8 = \bar m_9 = \tfrac12$$

Quantities $\bar w_i = \partial L / \partial w_i$ for the intermediate variables (subgraphs) $w_i$ are known as adjoints. Traverse the graph backward to collect the derivative. This is faster than forward mode for single-output, many-input functions, but requires storing the structure above (known as a Wengert list). (cf. “backpropagation”)
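And the reverse-mode counterpart: a sketch using `jax.vjp` (and `jax.grad`, which wraps exactly this pattern) on a small vector-valued version of the same toy loss; again, the loss and names are illustrative assumptions, not the lecture's.

```python
import jax
import jax.numpy as jnp

def L(theta):
    phi = jnp.exp(-0.5 * theta**2)                            # one feature per parameter
    delta = 1.0 - phi                                         # toy residual Delta
    K = jnp.outer(phi, phi) + 0.1 * jnp.eye(theta.shape[0])   # toy Gram matrix K
    e = delta @ jnp.linalg.solve(K, delta)                    # Delta^T K^{-1} Delta
    c = jnp.linalg.slogdet(K)[1]                              # log|K|
    return 0.5 * (e + c)

theta0 = jnp.linspace(-1.0, 1.0, 5)
value, pullback = jax.vjp(L, theta0)             # forward pass, storing the Wengert list
(grad,) = pullback(jnp.ones_like(value))         # one backward pass: all dL/dtheta_i at once
print(jnp.allclose(grad, jax.grad(L)(theta0)))   # jax.grad is exactly this pattern
```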
Deep Networks
But not Bayesian deep networks
[Figure: left, training loss L(t) over t ∈ [0, 2000]; centre, the fitted function over x ∈ [−5, 5]; right, the network diagram x → w₀ → (ϕ₁₁, ϕ₁₂) → w₁ → (ϕ₂₁, ϕ₂₂) → w₂ → (ϕ₃₁, ϕ₃₂) → w₃ → y.]
$$\hat f(x, W) = \sum_{i=1}^{F} \phi_{3i}(x, w_{\text{lower}})\, w_{3i} = \sum_i \phi_{3i}\!\left( \sum_j \phi_{2j}\!\left( \sum_\ell \phi_{1\ell}(w_{0\ell}\, x)\, w_{1\ell j} \right) w_{2ji} \right) w_{3i}$$
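To make the nested sum concrete, here is a sketch of it as a forward pass; the layer widths, the tanh nonlinearities, and all names are illustrative assumptions, not the lecture's code.

```python
import jax.numpy as jnp

def f_hat(x, W):
    w0, w1, w2, w3 = W                  # weights between the four layers
    h1 = jnp.tanh(jnp.outer(x, w0))     # phi_1l(w_0l * x)
    h2 = jnp.tanh(h1 @ w1)              # phi_2j(sum_l h1_l w_1lj)
    h3 = jnp.tanh(h2 @ w2)              # phi_3i(sum_j h2_j w_2ji)
    return h3 @ w3                      # sum_i phi_3i * w_3i

# a width-2 network matching the phi_11 ... phi_32 diagram in the figure
W = (jnp.ones(2), jnp.ones((2, 2)), jnp.ones((2, 2)), jnp.ones(2))
print(f_hat(jnp.linspace(-5.0, 5.0, 7), W))
```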
The connection to Deep Learning
Just go MAP all the way
If we consider multiple layers, we might as well not integrate out the final layer's weights:
$$p(w, \theta \mid y) \propto p(y \mid w, \phi^\theta)\, p(w, \theta) = p(w, \theta) \cdot \prod_{i=1}^n p(y_i \mid w, \phi_i^\theta) = p(w, \theta) \cdot \prod_{i=1}^n \mathcal N\big(y_i;\ \phi_i^{\theta\intercal} w,\ \sigma^2\big)$$
$$
\begin{aligned}
\arg\max_{w,\theta}\, p(w, \theta \mid y) &= \arg\min_{w,\theta}\, -\log p(w, \theta \mid y) \\
&= \arg\min_{w,\theta}\, -\log p(w, \theta) + \frac{1}{2\sigma^2}\sum_{i=1}^n \big\|y_i - \phi_i^{\theta\intercal} w\big\|^2
 = \arg\min_{w,\theta}\, r(w, \theta) + \underbrace{\sum_{i=1}^n \ell_2(y_i; \theta, w)}_{=:L(\theta, w)} \\
&= \arg\min_{w,\theta}\, \sum_i w_i^2 + \sum_j \theta_j + \frac{1}{2\sigma^2}\sum_{i=1}^n \big\|y_i - \phi_i^{\theta\intercal} w\big\|^2
 = \arg\min_{w,\theta}\, r(w, \theta) + \sum_{i=1}^n \ell_2(y_i; \theta, w) \\
&\approx \arg\min_{w,\theta}\, \sum_i w_i^2 + \sum_j \theta_j + \frac{1}{2\sigma^2}\,\frac{n}{b}\sum_{\beta=1}^b \big\|y_\beta - \phi_\beta^{\theta\intercal} w\big\|^2
\end{aligned}
$$
Evaluated on a random minibatch of size $b$, the objective is a stochastic estimate of the full objective, distributed approximately as $\mathcal N\big(r + L(\theta, w),\ \mathcal O(b^{-1})\big)$.
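The last line of the derivation as code: MAP estimation of (w, θ) becomes regularized squared-error training, and a random minibatch, rescaled by n/b, gives a noisy but unbiased estimate of the data term, i.e. stochastic gradient descent on a deep-learning-style loss. This is a minimal sketch assuming one layer of tanh features and Gaussian (L2) priors on all parameters; the function and variable names are ours, not the lecture's.

```python
import jax
import jax.numpy as jnp

def features(x, theta):
    W0, b0 = theta                          # "input layer": the feature parameters theta
    return jnp.tanh(x[:, None] * W0 + b0)   # phi_i^theta, shape (N, F)

def map_loss(params, xb, yb, n_total, sigma=0.5):
    """Minibatch estimate of -log p(w, theta | y): L2 regularizer + (n/b) * squared error."""
    w, theta = params
    r = yb - features(xb, theta) @ w
    reg = sum(jnp.sum(p**2) for p in jax.tree_util.tree_leaves(params))   # Gaussian priors (illustrative)
    return reg + 0.5 / sigma**2 * (n_total / xb.shape[0]) * jnp.sum(r**2)

def sgd_step(params, xb, yb, n_total, lr=1e-3):
    grads = jax.grad(map_loss)(params, xb, yb, n_total)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

# training loop: stochastic gradient descent on the regularized loss
key = jax.random.PRNGKey(0)
x = jnp.linspace(-5.0, 5.0, 200)
y = jnp.sin(x) + 0.1 * jax.random.normal(key, (200,))
params = (0.1 * jax.random.normal(key, (32,)),              # w: output-layer weights
          (jax.random.normal(key, (32,)), jnp.zeros(32)))   # theta = (W0, b0)
for t in range(1000):
    key, sub = jax.random.split(key)
    idx = jax.random.choice(sub, 200, (32,), replace=False)
    params = sgd_step(params, x[idx], y[idx], n_total=200)
```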
Connections and Differences
Bayesian and Deep Learning
Next lecture: Instead of learning a few features, sometimes we can get away with using infinitely many
features.
Probabilistic ML — P. Hennig, SS 2021 — Lecture 08: Learning Representations — © Philipp Hennig, 2021 CC BY-NC-SA 3.0