
Ch14 Bayesian Learning

The document provides supplementary slides on Bayesian Learning, covering key topics such as the formulation of Bayesian learning, conjugate priors, approximate inference, and Gaussian processes. It contrasts frequentist and Bayesian views, explains the Bayesian learning rule, and discusses maximum a posteriori estimation and sequential Bayesian learning. Additionally, it highlights the importance of conjugate priors for computational convenience and outlines methods for approximate inference when conjugate priors are not available.


Chapter 14
Bayesian Learning

Supplementary slides to Machine Learning Fundamentals
© Hui Jiang 2020, published by Cambridge University Press

August 2020


Outline

1 Formulation of Bayesian Learning

2 Conjugate Priors

3 Approximate Inference

4 Gaussian Processes


Bayesian Learning (I)


frequentist vs. Bayesian views in machine learning
◦ frequentist: model parameters as unknown but fixed quantities
◦ Bayesian: model parameters as random variables
Bayesians use probability distributions of model parameters
Bayes’ theorem:
      p(θ | x) = p(x, θ) / p(x) = p(θ) p(x|θ) / p(x)

◦ p(θ): prior distribution of model parameters θ


◦ p(θ|x): the posterior distribution of θ given data x
◦ p(x|θ): the likelihood function of the model
Bayesian learning rule: posterior ∝ prior × likelihood
p(θ|x) ∝ p(θ) p(x|θ)

Bayesian Learning (II)

prior specification: p(θ)


◦ use a prior distribution to describe prior knowledge on models
Bayesian learning
◦ optimally combine prior knowledge with data
◦ given a training set: D = {x1 , x2 , · · · , xN}
◦ Bayesian learning rule: p(θ) → p(θ|D), learned from D as

      p(θ|D) ∝ p(θ) p(D|θ) = p(θ) ∏_{i=1}^{N} p(xi|θ)

      posterior ∝ prior × likelihood
Bayesian inference
◦ make a decision based on p(θ|D)
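
A minimal numerical sketch (not from the book, with made-up coin-flip data and a Beta-shaped prior) of the rule p(θ|D) ∝ p(θ) ∏ p(xi|θ), using a simple grid over a Bernoulli parameter θ:

import numpy as np

# hypothetical coin-flip data: 1 = heads, 0 = tails
D = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# discretize theta = Pr(heads) on a grid
theta = np.linspace(1e-3, 1 - 1e-3, 999)

# prior p(theta): a Beta(2, 2)-shaped prior expressing a mild preference for fair coins
prior = theta**(2 - 1) * (1 - theta)**(2 - 1)
prior /= prior.sum()

# likelihood p(D | theta) = prod_i p(x_i | theta)
likelihood = theta**D.sum() * (1 - theta)**(len(D) - D.sum())

# Bayesian learning rule: posterior ∝ prior × likelihood (normalized on the grid)
posterior = prior * likelihood
posterior /= posterior.sum()

print("posterior mean of theta:", (theta * posterior).sum())
print("MAP estimate of theta  :", theta[posterior.argmax()])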

Bayesian Inference for Classification


given posterior p(θ|D) and likelihood p(x | θ)
define the predictive distribution as

      p(x | D) = ∫_θ p(x | θ) p(θ | D) dθ

Bayesian classification:

◦ K classes: ω1 , ω2 , · · · , ωK
◦ choose a prior p(θk) and a training set Dk for each class ωk
◦ Bayesian learning:

      p(θk | Dk) = p(θk) p(Dk | ωk , θk) / p(Dk) ∝ p(θk) p(Dk | ωk , θk)

◦ Bayesian inference:

      g(x) = arg max_{k=1,...,K} p(x | Dk)
           = arg max_{k=1,...,K} Pr(ωk) ∫_{θk} p(x | ωk , θk) p(θk | Dk) dθk
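
In practice the predictive integral p(x | Dk) = ∫ p(x | θk) p(θk | Dk) dθk can be approximated by Monte Carlo, averaging the likelihood over posterior samples. A small illustrative sketch (my own, not the book's), with two classes modeled as 1-D Gaussians of known variance and assumed Gaussian posteriors over their means:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# hypothetical posteriors p(mu_k | D_k) over each class mean: (posterior mean, posterior std),
# with a shared known observation std sigma0
sigma0 = 1.0
posteriors = {"omega1": (-1.0, 0.3), "omega2": (2.0, 0.4)}
class_priors = {"omega1": 0.5, "omega2": 0.5}

def predictive(x, post_mean, post_std, n_samples=5000):
    # Monte Carlo estimate of p(x | D_k) = E_{mu ~ p(mu | D_k)}[ p(x | mu) ]
    mu_samples = rng.normal(post_mean, post_std, size=n_samples)
    return norm.pdf(x, loc=mu_samples, scale=sigma0).mean()

# Bayesian inference: g(x) = arg max_k Pr(omega_k) * p(x | D_k)
x_new = 0.8
scores = {k: class_priors[k] * predictive(x_new, *posteriors[k]) for k in posteriors}
print(scores)
print("g(x) =", max(scores, key=scores.get))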


Maximum a Posteriori (MAP) Estimation

not easy to use a full distribution p(θ|D) to describe models
point estimation: use a single point to summarize the distribution p(θ|D)
maximum a posteriori (MAP) estimation:

      θMAP = arg max_θ p(θ | D) = arg max_θ p(θ) p(D | θ)

MAP estimation vs. ML estimation, where θML = arg max_θ p(D | θ):

◦ ML solely relies on training data
◦ MAP optimally combines prior knowledge with data
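
A quick numerical contrast between θML and θMAP (an illustrative sketch, not from the slides), using a Bernoulli likelihood with an assumed Beta(a, b) prior, whose MAP estimate has the closed form (k + a − 1)/(n + a + b − 2):

import numpy as np

x = np.array([1, 1, 1, 0, 1])      # hypothetical observations
n, k = len(x), int(x.sum())

# ML: relies on the training data alone
theta_ml = k / n

# MAP with a Beta(a, b) prior: combines prior knowledge with the data
a, b = 2.0, 2.0                    # assumed prior pseudo-counts
theta_map = (k + a - 1) / (n + a + b - 2)

print(f"theta_ML  = {theta_ml:.3f}")   # 0.800
print(f"theta_MAP = {theta_map:.3f}")  # (4+1)/(5+2) = 0.714, pulled toward the prior mean 0.5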

Sequential Bayesian Learning


Bayesian learning is an excellent tool for on-line learning,
where training samples arrive one at a time
sequential Bayesian learning
◦ use the Bayesian learning rule to update the model after each sample
◦ track a slowly-changing environment

p(θ | x1 ) ∝ p(θ)p(x1 | θ)
p(θ | x1 , x2 ) ∝ p(θ | x1 ) p(x2 | θ)


Example: Sequential Bayesian Learning

a univariate Gaussian model with known variance:

      p(x | µ) = N (x | µ, σ0²) = (1/√(2πσ0²)) exp( −(x − µ)² / (2σ0²) )

choose a prior distribution:

      p(µ) = N (µ | ν0 , τ0²) = (1/√(2πτ0²)) exp( −(µ − ν0)² / (2τ0²) )

after the first sample x1:

      p(µ | x1) ∝ p(µ) p(x1 | µ) =⇒ p(µ | x1) = N (µ | ν1 , τ1²)
      with ν1 = (σ0² ν0 + τ0² x1) / (τ0² + σ0²) and τ1² = τ0² σ0² / (τ0² + σ0²)

after n samples:

      p(µ | x1 , · · · , xn) = N (µ | νn , τn²)
      with νn = (n τ0² x̄n + σ0² ν0) / (n τ0² + σ0²) and τn² = τ0² σ0² / (n τ0² + σ0²)

as n → ∞, we have
◦ τn² → 0
◦ νn → x̄n
◦ µMAP → µML
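
These recursions are easy to run; a minimal sketch (my own, with assumed values for ν0, τ0², σ0²) applies the single-sample update repeatedly, so the posterior after each sample becomes the prior for the next:

import numpy as np

sigma0_sq = 1.0          # known observation variance
nu, tau_sq = 0.0, 4.0    # prior p(mu) = N(nu0, tau0^2), assumed values

rng = np.random.default_rng(1)
data = rng.normal(2.0, np.sqrt(sigma0_sq), size=20)   # hypothetical stream of samples

for x in data:
    # one-sample conjugate update: posterior after x_t becomes the prior for x_{t+1}
    nu = (sigma0_sq * nu + tau_sq * x) / (tau_sq + sigma0_sq)
    tau_sq = (tau_sq * sigma0_sq) / (tau_sq + sigma0_sq)

print(f"posterior after {len(data)} samples: N(mu | {nu:.3f}, {tau_sq:.4f})")
print(f"sample mean x_bar = {data.mean():.3f}")   # nu_n approaches x_bar, tau_n^2 shrinks toward 0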


Conjugate Priors

conjugate prior: a prior chosen to ensure that its posterior has
the same functional form as the prior

the prior is conjugate to the likelihood function of the underlying model,
i.e. both have the same functional form

the choice of a conjugate prior leads to computational convenience
in Bayesian learning

not every model has a conjugate prior, e.g. mixture models

all e-family (exponential family) models have conjugate priors


Examples of Conjugate Priors


model p(x|θ)                                  conjugate prior p(θ)
--------------------------------------------  ---------------------------------------------
1-D Gaussian (known variance)                 1-D Gaussian
  N (x | µ, σ0²)                                N (µ | ν, τ²)
1-D Gaussian (known mean)                     inverse-gamma
  N (x | µ0 , σ²)                               gamma⁻¹(σ² | α, β)
Gaussian (known covariance)                   Gaussian
  N (x | µ, Σ0)                                 N (µ | ν, Φ)
Gaussian (known mean)                         inverse-Wishart
  N (x | µ0 , Σ)                                W⁻¹(Σ | Φ, ν)
multivariate Gaussian                         Gaussian-inverse-Wishart
  N (x | µ, Σ)                                  GIW(µ, Σ | ν, Φ, λ, ν) = N (µ | ν, (1/λ)Σ) W⁻¹(Σ | Φ, ν)
multinomial                                   Dirichlet
  Mult(r | w) = C(r) · ∏_{i=1}^{M} wi^{ri}      Dir(w | α) = B(α) · ∏_{i=1}^{M} wi^{αi−1}
  with C(r) = (r1+···+rM)! / (r1!···rM!)        with B(α) = Γ(α1+···+αM) / (Γ(α1)···Γ(αM))


Conjugate Priors for Bayesian Learning: Multinomials


 
a sample of some counts: r = [ r1  r2  · · ·  rM ]
multinomial model: p(r | w) = Mult(r | w) = C(r) · ∏_{i=1}^{M} wi^{ri}
the conjugate prior is Dirichlet:

      p(w) = Dir(w | α(0)) = B(α(0)) · ∏_{i=1}^{M} wi^{αi(0) − 1}

Bayesian learning:

      p(w | r) ∝ p(w) p(r | w) ∝ ∏_{i=1}^{M} wi^{αi(0) + ri − 1}

the posterior is also Dirichlet:

      p(w | r) = Dir(w | α(1)) = B(α(1)) · ∏_{i=1}^{M} wi^{αi(1) − 1}

MAP estimation:

      w(MAP) = arg max_w p(w | r) subject to ∑_{i=1}^{M} wi = 1

      =⇒ wi(MAP) = (αi(1) − 1) / ( ∑_{i=1}^{M} αi(1) − M ) = (ri + αi(0) − 1) / ( ∑_{i=1}^{M} (ri + αi(0)) − M )   ∀ i = 1, 2, · · · , M
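
A short sketch (illustrative counts and prior, not from the book) of the Dirichlet update α(1) = α(0) + r and the resulting MAP estimate:

import numpy as np

r = np.array([12, 5, 3])            # hypothetical observed counts for M = 3 categories
alpha0 = np.array([2.0, 2.0, 2.0])  # assumed Dirichlet prior hyperparameters

# conjugate update: the posterior is Dirichlet with alpha1 = alpha0 + r
alpha1 = alpha0 + r

# MAP estimate: w_i = (alpha1_i - 1) / (sum_i alpha1_i - M)
M = len(r)
w_map = (alpha1 - 1) / (alpha1.sum() - M)

# posterior mean, for comparison: alpha1_i / sum_i alpha1_i
w_mean = alpha1 / alpha1.sum()

print("alpha1 =", alpha1)
print("w_MAP  =", w_map, " sum =", w_map.sum())
print("w_mean =", w_mean)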


Conjugate Priors for Bayesian Learning: Gaussians (1)


Gaussian model: p(x | µ, Σ) = N (x | µ, Σ)
the conjugate prior is a Gaussian-inverse-Wishart (GIW) distribution:

      p(µ, Σ) = GIW(µ, Σ | ν0 , Φ0 , λ0 , ν0) = N (µ | ν0 , (1/λ0) Σ) W⁻¹(Σ | Φ0 , ν0)
              = c0 |Σ⁻¹|^((ν0+d+2)/2) exp( −(λ0/2) (µ − ν0)ᵀ Σ⁻¹ (µ − ν0) − (1/2) tr(Φ0 Σ⁻¹) )

the likelihood function of a training set D = {x1 , x2 , · · · , xN}:

      p(D | µ, Σ) = ∏_{i=1}^{N} p(xi | µ, Σ)
                  = ( |Σ⁻¹|^(N/2) / (2π)^(Nd/2) ) exp( −(1/2) tr(N S Σ⁻¹) − (N/2) (µ − x̄)ᵀ Σ⁻¹ (µ − x̄) )

Bayesian learning:

      p(µ, Σ | D) ∝ GIW(µ, Σ | ν0 , Φ0 , λ0 , ν0) · p(D | µ, Σ)

Conjugate Priors for Bayesian Learning: Gaussians (2)


the posterior is another GIW distribution:

      p(µ, Σ | D) = GIW(µ, Σ | ν1 , Φ1 , λ1 , ν1)
                  = c1 |Σ⁻¹|^((ν1+d+2)/2) exp( −(λ1/2) (µ − ν1)ᵀ Σ⁻¹ (µ − ν1) − (1/2) tr(Φ1 Σ⁻¹) )

      ◦ λ1 = λ0 + N and ν1 = ν0 + N (degrees of freedom)
      ◦ ν1 = (λ0 ν0 + N x̄) / (λ0 + N) (mean vector)
      ◦ Φ1 = Φ0 + N S + (λ0 N / (λ0 + N)) (x̄ − ν0)(x̄ − ν0)ᵀ

MAP estimation: (µMAP , ΣMAP) = arg max_{µ,Σ} p(µ, Σ | D)

      µMAP = ν1 = (λ0 ν0 + N x̄) / (λ0 + N)

      ΣMAP = Φ1 / (ν1 + d + 1) = ( Φ0 + N S + (λ0 N / (λ0 + N)) (x̄ − ν0)(x̄ − ν0)ᵀ ) / (ν0 + N + d + 1)
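
A numpy sketch of these hyperparameter updates and the resulting MAP estimates, using synthetic data and assumed prior settings (the variable names are mine; the degrees-of-freedom parameter is written nu0_df to keep it distinct from the prior mean nu0):

import numpy as np

rng = np.random.default_rng(0)
d, N = 2, 100
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=[[1.0, 0.3], [0.3, 2.0]], size=N)

# sufficient statistics of the training set
x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False, bias=True)   # so that N*S = sum_i (x_i - x_bar)(x_i - x_bar)^T

# assumed GIW prior hyperparameters
nu0 = np.zeros(d)        # prior mean vector
lambda0 = 1.0
Phi0 = np.eye(d)
nu0_df = d + 2.0         # prior degrees of freedom

# posterior hyperparameters (the update formulas on this slide)
lambda1 = lambda0 + N
nu1_df = nu0_df + N
nu1 = (lambda0 * nu0 + N * x_bar) / (lambda0 + N)
diff = (x_bar - nu0).reshape(-1, 1)
Phi1 = Phi0 + N * S + (lambda0 * N / (lambda0 + N)) * diff @ diff.T

# MAP estimates
mu_map = nu1
Sigma_map = Phi1 / (nu1_df + d + 1)

print("mu_MAP    =", mu_map)
print("Sigma_MAP =\n", Sigma_map)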

Approximate Inference

when conjugate priors do not exist, Bayesian learning may
lead to very complicated posterior distributions

approximate inference: approximate the true posterior
distribution with a simple distribution for Bayesian inference

popular approximate inference methods:

1 Laplace's method

2 variational Bayesian (VB) method


Laplace’s Method

use a Gaussian centered at θMAP to approximate
the true posterior p(θ | D)

Taylor expansion of f(θ) = ln p(θ | D) at θMAP:

      f(θ) = f(θMAP) + ∇f(θMAP)ᵀ (θ − θMAP)
             + (1/2!) (θ − θMAP)ᵀ H(θMAP) (θ − θMAP) + · · ·

2nd-order approximation (the gradient ∇f(θMAP) vanishes at the mode):

      f(θ) ≈ f(θMAP) + (1/2) (θ − θMAP)ᵀ H(θMAP) (θ − θMAP)

      p(θ | D) ≈ C · exp( (1/2) (θ − θMAP)ᵀ H(θMAP) (θ − θMAP) )
               = N (θ | θMAP , −H⁻¹(θMAP))
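
A small 1-D sketch of Laplace's method (my own example): the unnormalized log-posterior is an assumed Gamma-like function, θMAP is found numerically, and a finite-difference Hessian at θMAP gives the Gaussian approximation N(θMAP, −H⁻¹):

import numpy as np
from scipy.optimize import minimize_scalar

# an assumed unnormalized log-posterior ln p(theta | D) (Gamma-like shape, theta > 0)
def log_post(theta):
    return 4.0 * np.log(theta) - 2.0 * theta

# find theta_MAP by maximizing ln p(theta | D)
res = minimize_scalar(lambda t: -log_post(t), bounds=(1e-6, 50.0), method="bounded")
theta_map = res.x

# second derivative (Hessian) at theta_MAP via central finite differences
eps = 1e-4
H = (log_post(theta_map + eps) - 2.0 * log_post(theta_map) + log_post(theta_map - eps)) / eps**2

# Laplace approximation: p(theta | D) ≈ N(theta | theta_MAP, -1/H)
print(f"theta_MAP ≈ {theta_map:.4f}")             # analytic mode is 4/2 = 2
print(f"approx. variance -1/H ≈ {-1.0 / H:.4f}")  # analytic value is theta_MAP^2 / 4 = 1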


Bayesian Learning of Logistic Regression



a training set D = {(x1 , y1), · · · , (xN , yN)}, xi ∈ R^d, yi ∈ {0, 1}
likelihood function of logistic regression:

      p(D | w) = ∏_{i=1}^{N} l(wᵀxi)^{yi} (1 − l(wᵀxi))^{1−yi}

choose a Gaussian prior: p(w) = N (w | w0 , Σ0)
Bayesian learning: p(w | D) ∝ p(w) p(D | w)
the posterior p(w | D) is not Gaussian anymore
use Laplace's method to approximate the true posterior
◦ use a gradient descent method to find wMAP

      ∇ ln p(w | D) = −Σ0⁻¹ (w − w0) + ∑_{i=1}^{N} (yi − l(wᵀxi)) xi

◦ use a Gaussian approximation:

      p(w | D) ≈ N (w | wMAP , −H⁻¹(wMAP))
      with H(w) = −Σ0⁻¹ − ∑_{i=1}^{N} l(wᵀxi) (1 − l(wᵀxi)) xi xiᵀ
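
A compact sketch of this procedure on synthetic data (assumptions: the data, the prior Σ0 = I, and the step size are all made up for illustration): gradient ascent on ln p(w|D) finds wMAP, and the Hessian formula above gives the approximate posterior covariance −H⁻¹(wMAP):

import numpy as np

rng = np.random.default_rng(0)

# hypothetical training data for a 2-D logistic regression
N, d = 200, 2
X = rng.normal(size=(N, d))
w_true = np.array([1.5, -2.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

# Gaussian prior p(w) = N(w | w0, Sigma0), with assumed Sigma0 = I
w0 = np.zeros(d)
Sigma0_inv = np.eye(d)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# gradient ascent on ln p(w | D) to find w_MAP
w = np.zeros(d)
for _ in range(1000):
    p = sigmoid(X @ w)
    grad = -Sigma0_inv @ (w - w0) + X.T @ (y - p)
    w += 0.1 * grad / N
w_map = w

# Hessian of ln p(w | D) at w_MAP, and the Laplace posterior covariance -H^{-1}
p = sigmoid(X @ w_map)
H = -Sigma0_inv - (X.T * (p * (1.0 - p))) @ X
cov_laplace = -np.linalg.inv(H)

print("w_MAP =", w_map)
print("approximate posterior covariance =\n", cov_laplace)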

Variational Bayesian Methods (I)


variational Bayesian (VB): use a simpler variational
distribution q(θ) to approximate the true posterior p(θ | D):

      q*(θ) = arg min_q KL( q(θ) ‖ p(θ | D) )

      KL( q(θ) ‖ p(θ | D) ) = ln p(D) − ∫_θ q(θ) ln( p(D, θ) / q(θ) ) dθ ,
      where the last integral is denoted L(q)

      min_q KL( q(θ) ‖ p(θ | D) ) ⇐⇒ max_q L(q)

assume q(θ) = q1(θ1) q2(θ2) · · · qI(θI) can be factorized over
some disjoint subsets θ = θ1 ∪ θ2 ∪ · · · ∪ θI

      L(q) = ∫_θ ∏_{i=1}^{I} qi(θi) ln p(D, θ) dθ − ∑_{i=1}^{I} ∫_{θi} qi(θi) ln qi(θi) dθi

Variational Bayesian Methods (II)


maximize L(q) w.r.t. each qi(θi) separately:

      max_{qi} ∫_{θi} qi(θi) E_{j≠i}[ ln p(D, θ) ] dθi − ∫_{θi} qi(θi) ln qi(θi) dθi

      where E_{j≠i}[ ln p(D, θ) ] = ∫ ∏_{j≠i} qj(θj) ln p(D, θ) dθ_{j≠i}

define a new distribution: p̃(θi ; D) ∝ exp( E_{j≠i}[ ln p(D, θ) ] )

we have qi*(θi) = arg max_{qi} ∫_{θi} qi(θi) ln( p̃(θi ; D) / qi(θi) ) dθi

      =⇒ qi*(θi) = arg min_{qi} KL( qi(θi) ‖ p̃(θi ; D) )

derive qi*(θi) = p̃(θi ; D) ∝ exp( E_{j≠i}[ ln p(D, θ) ] ), or

      ln qi*(θi) = E_{j≠i}[ ln p(D, θ) ] + C

Variational Bayesian Methods (III)

mean field theory: use a factorizable
variational distribution to approximate
a true posterior distribution

example: a 2-D Gaussian with Σ = [ 1  2 ; 2  5 ]

approximate it with a factorized variational
distribution whose covariance is Σ = [ σ1²  0 ; 0  σ2² ]

the best fit is found by minimizing the
KL-divergence
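
For a Gaussian target this mean-field fit is known in closed form: each factor qi(θi) is Gaussian with precision equal to the corresponding diagonal entry of the target's precision matrix Λ = Σ⁻¹, so q underestimates the marginal variances. A tiny sketch, assuming the 2-D covariance reconstructed above, Σ = [[1, 2], [2, 5]]:

import numpy as np

# target: a zero-mean 2-D Gaussian with correlated components (covariance as reconstructed above)
Sigma = np.array([[1.0, 2.0],
                  [2.0, 5.0]])
Lambda = np.linalg.inv(Sigma)          # precision matrix

# mean-field solution q(theta) = q1(theta1) q2(theta2): each factor is Gaussian
# with variance 1 / Lambda_ii (standard result for Gaussian targets)
mf_variances = 1.0 / np.diag(Lambda)

print("marginal variances of p  :", np.diag(Sigma))   # [1, 5]
print("mean-field variances of q:", mf_variances)     # smaller: q underestimates the spread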


Variational Bayesian Learning of GMMs (I)


a Gaussian mixture model (GMM):

      p(x | θ) = ∑_{m=1}^{M} wm · N (x | µm , Σm)

where the model parameters are θ = {wm , µm , Σm | m = 1, 2, · · · , M}
no conjugate prior exists for GMMs
choose a prior distribution as

      p(θ) = p(w1 , · · · , wM) ∏_{m=1}^{M} p(µm , Σm)

with
      p(w1 , · · · , wM) = Dir(w1 , · · · , wM | α1(0) , · · · , αM(0))
      p(µm , Σm) = GIW(µm , Σm | νm(0) , Φm(0) , λm(0) , νm(0))


Variational Bayesian Learning of GMMs (II)

 
introduce a 1-of-M latent variable z = [ z1  z2  · · ·  zM ] for GMMs:

      p(x, z | θ) = ∏_{m=1}^{M} ( wm N (x | µm , Σm) )^{zm}

use the variational Bayesian method to approximate the posterior
distribution p(z, θ | x)
introduce a variational distribution factorized as:

      q(z, θ) = q(z) q(θ) = q(z) q(w1 , · · · , wM) ∏_{m=1}^{M} q(µm , Σm)

derive the best-fit variational distribution q*(z, θ)


Variational Bayesian Learning of GMMs (III)

1 ln q*(z) = Eθ[ ln p(x, z, θ) ] + C = Eθ[ ln p(θ) + ln p(x, z | θ) ] + C

      =⇒ ln q*(z) = C′ + ∑_{m=1}^{M} zm ( E[ln wm] − E[ln |Σm|] / 2 − E[(x − µm)ᵀ Σm⁻¹ (x − µm)] / 2 )

      where the term in parentheses is denoted ln ρm

◦ q*(z) is a multinomial: q*(z) ∝ ∏_{m=1}^{M} ρm^{zm} ∝ ∏_{m=1}^{M} rm^{zm} ,
  where rm = ρm / ∑_{m=1}^{M} ρm for all m

2 ln q*(w1 , · · · , wM) = E_{z, µm, Σm}[ ln p(θ) + ln p(x, z | θ) ]
                        = ∑_{m=1}^{M} (αm(0) − 1) ln wm + ∑_{m=1}^{M} rm ln wm + C

◦ q*(w1 , · · · , wM) is a Dirichlet distribution:

      q*(w1 , · · · , wM) = Dir(w1 , · · · , wM | α1(1) , · · · , αM(1))

  where αm(1) = αm(0) + rm for all m = 1, 2, · · · , M

Variational Bayesian Learning of GMMs (IV)


3 ln q*(µm , Σm) = E_{z, wm}[ ln p(θ) + ln p(x, z | θ) ] + C
                 = ln p(µm , Σm) + E[zm] ln N (x | µm , Σm) + C′

◦ q*(µm , Σm) is also a GIW distribution:

      q*(µm , Σm) = GIW(µm , Σm | νm(1) , Φm(1) , λm(1) , νm(1))

  where
      λm(1) = λm(0) + rm
      νm(1) = νm(0) + rm (degrees of freedom)
      νm(1) = ( λm(0) νm(0) + rm x ) / ( λm(0) + rm ) (mean vector)
      Φm(1) = Φm(0) + ( λm(0) rm / (λm(0) + rm) ) (x − νm(0)) (x − νm(0))ᵀ

Variational Bayesian Learning of GMMs (V)

based on the above distributions, we have

      ln π̃m ≜ E[ln wm] = ψ( αm(1) ) − ψ( ∑_{m=1}^{M} αm(1) )

      ln B̃m ≜ E[ln |Σm|] = ∑_{i=1}^{d} ψ( (νm(1) + 1 − i) / 2 ) − ln |Φm(1)|

      E[(x − µm)ᵀ Σm⁻¹ (x − µm)] = d / λm(1) + νm(1) (x − νm(1))ᵀ (Φm(1))⁻¹ (x − νm(1))

to compute ρm as well as rm (∀ m = 1, 2, · · · , M)
derive an EM-like algorithm to solve the mutual dependency


Variational Bayesian Learning of GMMs (VI)

Variational Bayesian GMMs

Input: { αm(0) , νm(0) , Φm(0) , λm(0) , νm(0) | m = 1, 2, · · · , M }

set n = 0
while not converged do
    E-step: collect statistics:
        { αm(n) , νm(n) , Φm(n) , λm(n) , νm(n) } + x −→ rm
    M-step: update all hyperparameters:
        { αm(n) , νm(n) , Φm(n) , λm(n) , νm(n) } + rm + x
            −→ { αm(n+1) , νm(n+1) , Φm(n+1) , λm(n+1) , νm(n+1) }
    n = n + 1
end while
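
This EM-like VB loop is available in standard libraries; the sketch below (not the book's code) fits a variational Bayesian GMM with scikit-learn's BayesianGaussianMixture, which places a Dirichlet prior on the mixture weights and Gaussian-Wishart priors on the component means and precisions (the precision-space counterpart of the GIW prior used here), on made-up 2-D data:

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# hypothetical 2-D data drawn from two well-separated Gaussians
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.8, size=(200, 2)),
])

# M is deliberately set larger than the true number of components;
# the Dirichlet prior can shrink the weights of unused components toward zero
vb_gmm = BayesianGaussianMixture(
    n_components=5,
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_distribution",
    max_iter=200,
    random_state=0,
)
vb_gmm.fit(X)

print("mixture weights:", np.round(vb_gmm.weights_, 3))
print("component means:\n", np.round(vb_gmm.means_, 2))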


Non-Parametric Bayesian Methods

Bayesian learning of parametric models: rely on
prior/posterior distributions of model parameters

how about Bayesian learning of non-parametric models?

non-parametric Bayesian methods: use stochastic processes as
priors for non-parametric models

◦ Gaussian processes

◦ Dirichlet processes


Gaussian Processes: Concepts (I)


given an arbitrary function f(x)
for any set of N points in R^d, i.e. D = {x1 , x2 , · · · , xN},
the function values form an N-dimensional real-valued vector

      f = [ f(x1)  f(x2)  · · ·  f(xN) ]ᵀ

assume f follows a multivariate Gaussian distribution:

      f = [ f(x1)  f(x2)  · · ·  f(xN) ]ᵀ ∼ N ( µD , ΣD )

where µD and ΣD depend on the N data points in D
if this holds for any D, then f(x) is a sample from a Gaussian process:

      f(x) ∼ GP( m(x), Φ(x, x′) )

◦ m(x): mean function =⇒ µD
◦ Φ(x, x′): covariance function =⇒ ΣD

Gaussian Processes: Concepts (II)


how to specify a Gaussian process?
mean function: m(x) = 0
covariance function: must satisfy Mercer's condition

      Φ(xi , xj) = cov( f(xi), f(xj) )

◦ ΣD = [ Φ(xi , xj) ]_{N×N}

RBF kernel function:

      Φ(xi , xj) = σ² exp( −‖xi − xj‖² / (2 l²) )

◦ σ: vertical scale
◦ l: horizontal scale
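
A short illustrative sketch of what this specification means in practice: build ΣD = [Φ(xi, xj)] with the RBF kernel on a grid of inputs and draw sample functions f ∼ N(0, ΣD) from the GP prior (the grid and hyperparameter values are assumptions for illustration):

import numpy as np

def rbf_kernel(X1, X2, sigma=1.0, length=1.0):
    # Phi(x_i, x_j) = sigma^2 * exp(-||x_i - x_j||^2 / (2 l^2)) for 1-D inputs
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    return sigma**2 * np.exp(-sq_dists / (2.0 * length**2))

rng = np.random.default_rng(0)

# a grid of 1-D inputs standing in for the data set D
x_grid = np.linspace(-5.0, 5.0, 100)
Sigma_D = rbf_kernel(x_grid, x_grid, sigma=1.0, length=1.0)

# draw three functions from the GP prior: f ~ N(0, Sigma_D)
# (a small jitter keeps the covariance numerically positive definite)
f_samples = rng.multivariate_normal(np.zeros(len(x_grid)),
                                    Sigma_D + 1e-8 * np.eye(len(x_grid)),
                                    size=3)
print(f_samples.shape)   # (3, 100): three sampled functions evaluated on the grid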

Gaussian Processes for Non-Parametric Bayesian Learning

Gaussian processes as a non-parametric prior

◦ randomly sample a function f(·) from a Gaussian process
◦ a prior can be implicitly computed with a data set D = {x1 , x2 , · · · , xN}
◦ function values f follow a multivariate Gaussian distribution
◦ non-parametric prior:

      p(f | D) = N (f | 0, ΣD)

Gaussian processes for regression or classification

◦ input-output pairs yield the likelihood function
◦ apply the Bayesian learning rule:
      posterior ∝ prior × likelihood


Gaussian Processes for Regression (I)


basic setting for regression:

◦ f(x) ∼ GP( 0, Φ(x, x′) )
◦ y = f(x) + ε, where ε ∼ N (0, σ0²)

given a training set: D = {x1 , x2 , · · · , xN}
the corresponding outputs: y = [ y1  y2  · · ·  yN ]ᵀ
a non-parametric prior:

      p(f | D) = N (f | 0, ΣD)

the likelihood function due to the residual Gaussian noise ε:

      p(y | f, D) = N (y | f , σ0² I)
 


Gaussian Processes for Regression (II)


Bayesian learning for the marginal distribution:

      p(y | D) = ∫_f p(y, f | D) df = ∫_f p(y | f, D) p(f | D) df
               = ∫_f N (y | f , σ0² I) N (f | 0, ΣD) df
               = N (y | 0, ΣD + σ0² I) = N (y | 0, CN)

hyper-parameter learning:

      {σ*, l*, σ0*} = arg max_{σ, l, σ0} p(y | D, σ, l, σ0) = arg max_{σ, l, σ0} ln N (y | 0, CN)

◦ may use a gradient descent method



Gaussian Processes for Regression (III)

predict output ỹ for a new input x̃:

      p(y, ỹ | D, x̃) = N (y, ỹ | 0, CN+1)

with
      CN+1 = [ CN  k ; kᵀ  κ² ]

where κ² = Φ(x̃, x̃) + σ0² and ki = Φ(xi , x̃)

the predictive distribution:

      p(ỹ | D, y, x̃) = p(y, ỹ | D, x̃) / p(y | D)
                      = N ( ỹ | kᵀ CN⁻¹ y , κ² − kᵀ CN⁻¹ k )

point estimation (MAP or mean):

      E[ỹ | D, y, x̃] = ỹMAP = kᵀ CN⁻¹ y
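
A numpy sketch of these predictive equations on made-up data (the kernel hyperparameters, training set, and noise level are all assumed): form CN = ΣD + σ0²I, then k and κ², and compute the predictive mean kᵀCN⁻¹y and variance κ² − kᵀCN⁻¹k:

import numpy as np

def rbf_kernel(X1, X2, sigma=1.0, length=1.0):
    # Phi(x_i, x_j) = sigma^2 exp(-||x_i - x_j||^2 / (2 l^2)) for 1-D inputs
    return sigma**2 * np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2.0 * length**2))

rng = np.random.default_rng(0)

# hypothetical noisy training data y = sin(x) + eps
sigma0 = 0.1
x_train = rng.uniform(-3.0, 3.0, size=20)
y_train = np.sin(x_train) + sigma0 * rng.normal(size=20)

# C_N = Sigma_D + sigma0^2 I
C_N = rbf_kernel(x_train, x_train) + sigma0**2 * np.eye(len(x_train))
C_N_inv = np.linalg.inv(C_N)

# predict at a new input x_tilde
x_tilde = np.array([0.5])
k = rbf_kernel(x_train, x_tilde)[:, 0]                 # k_i = Phi(x_i, x_tilde)
kappa_sq = rbf_kernel(x_tilde, x_tilde)[0, 0] + sigma0**2

mean = k @ C_N_inv @ y_train                           # k^T C_N^{-1} y
var = kappa_sq - k @ C_N_inv @ k                       # kappa^2 - k^T C_N^{-1} k

print(f"predictive mean ≈ {mean:.3f} (true sin(0.5) = {np.sin(0.5):.3f})")
print(f"predictive variance ≈ {var:.4f}")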


Gaussian Processes for Regression (IV)

derive a non-parametric prior from D and Φ(x, x′)
non-parametric Bayesian learning based on y


Gaussian Processes for Classification


basic setting for binary classification, y ∈ {0, 1}:

◦ f(x) ∼ GP( 0, Φ(x, x′) )
◦ Pr(y = 1 | x) = l( f(x) ) = 1 / (1 + e^{−f(x)})

given a training set D = {x1 , x2 , · · · , xN} and the corresponding
outputs y = [ y1  y2  · · ·  yN ]ᵀ

non-parametric prior: p(f | D) = N (f | 0, ΣD)
likelihood: p(y | f, D) = ∏_{i=1}^{N} l(f(xi))^{yi} ( 1 − l(f(xi)) )^{1−yi}

no closed-form solution to derive the marginal and predictive
distributions, i.e. p(y | D) and p(ỹ | D, y, x̃)
require approximate inference, such as Laplace's method
