
Ch14 Bayesian Learning

The document provides supplementary slides on Bayesian Learning, covering key topics such as the formulation of Bayesian learning, conjugate priors, approximate inference, and Gaussian processes. It contrasts frequentist and Bayesian views, explains the Bayesian learning rule, and discusses maximum a posteriori estimation and sequential Bayesian learning. Additionally, it highlights the importance of conjugate priors for computational convenience and outlines methods for approximate inference when conjugate priors are not available.


Chapter 14
Bayesian Learning

Supplementary slides to Machine Learning Fundamentals
© Hui Jiang 2020, published by Cambridge University Press

August 2020


Outline

1 Formulation of Bayesian Learning

2 Conjugate Priors

3 Approximate Inference

4 Gaussian Processes


Bayesian Learning (I)


frequentist vs. Bayesian views in machine learning
◦ frequentist: model parameters as unknown but fixed quantities
◦ Bayesian: model parameters as random variables
Bayesians use probability distributions of model parameters
Bayes’ theorem:
      p(θ | x) = p(x, θ) / p(x) = p(θ) p(x|θ) / p(x)

◦ p(θ): prior distribution of model parameters θ


◦ p(θ|x): the posterior distribution of θ given data x
◦ p(x|θ): the likelihood function of the model
Bayesian learning rule: posterior ∝ prior × likelihood
p(θ|x) ∝ p(θ) p(x|θ)

Bayesian Learning (II)

prior specification: p(θ)


◦ use a prior distribution to describe prior knowledge on models
Bayesian learning
◦ optimally combine prior knowledge with data
◦ given a training set: D = {x1 , x2 , · · · , xN}
◦ Bayesian learning rule: p(θ) → p(θ|D), learned from D as

      p(θ|D) ∝ p(θ) p(D|θ) = p(θ) ∏_{i=1}^{N} p(xi|θ)

      posterior ∝ prior × likelihood
Bayesian inference
◦ make a decision based on p(θ|D)
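
A minimal numerical sketch (not from the book, with made-up coin-flip data and a Beta-shaped prior) of the rule p(θ|D) ∝ p(θ) ∏ p(xi|θ), using a simple grid over a Bernoulli parameter θ:

import numpy as np

# hypothetical coin-flip data: 1 = heads, 0 = tails
D = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# discretize theta = Pr(heads) on a grid
theta = np.linspace(1e-3, 1 - 1e-3, 999)

# prior p(theta): a Beta(2, 2)-shaped prior expressing a mild preference for fair coins
prior = theta**(2 - 1) * (1 - theta)**(2 - 1)
prior /= prior.sum()

# likelihood p(D | theta) = prod_i p(x_i | theta)
likelihood = theta**D.sum() * (1 - theta)**(len(D) - D.sum())

# Bayesian learning rule: posterior ∝ prior × likelihood (normalized on the grid)
posterior = prior * likelihood
posterior /= posterior.sum()

print("posterior mean of theta:", (theta * posterior).sum())
print("MAP estimate of theta  :", theta[posterior.argmax()])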

Bayesian Inference for Classification


given posterior p(θ|D) and likelihood p(x | θ)
define the predictive distribution as

      p(x | D) = ∫_θ p(x | θ) p(θ | D) dθ

Bayesian classification:

◦ K classes: ω1 , ω2 , · · · , ωK
◦ choose a prior p(θk) and a training set Dk for each class ωk
◦ Bayesian learning:

      p(θk | Dk) = p(θk) p(Dk | ωk , θk) / p(Dk) ∝ p(θk) p(Dk | ωk , θk)

◦ Bayesian inference:

      g(x) = arg max_{k=1,...,K} p(x | Dk)
           = arg max_{k=1,...,K} Pr(ωk) ∫_{θk} p(x | ωk , θk) p(θk | Dk) dθk
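
In practice the predictive integral p(x | Dk) = ∫ p(x | θk) p(θk | Dk) dθk can be approximated by Monte Carlo, averaging the likelihood over posterior samples. A small illustrative sketch (my own, not the book's), with two classes modeled as 1-D Gaussians of known variance and assumed Gaussian posteriors over their means:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# hypothetical posteriors p(mu_k | D_k) over each class mean: (posterior mean, posterior std),
# with a shared known observation std sigma0
sigma0 = 1.0
posteriors = {"omega1": (-1.0, 0.3), "omega2": (2.0, 0.4)}
class_priors = {"omega1": 0.5, "omega2": 0.5}

def predictive(x, post_mean, post_std, n_samples=5000):
    # Monte Carlo estimate of p(x | D_k) = E_{mu ~ p(mu | D_k)}[ p(x | mu) ]
    mu_samples = rng.normal(post_mean, post_std, size=n_samples)
    return norm.pdf(x, loc=mu_samples, scale=sigma0).mean()

# Bayesian inference: g(x) = arg max_k Pr(omega_k) * p(x | D_k)
x_new = 0.8
scores = {k: class_priors[k] * predictive(x_new, *posteriors[k]) for k in posteriors}
print(scores)
print("g(x) =", max(scores, key=scores.get))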


Maximum a Posteriori (MAP) Estimation

not easy to use a full distribution p(θ|D) to describe models
point estimation: use a single point to summarize the distribution p(θ|D)
maximum a posteriori (MAP) estimation:

      θMAP = arg max_θ p(θ | D) = arg max_θ p(θ) p(D | θ)

MAP estimation vs. ML estimation, where θML = arg max_θ p(D | θ):

◦ ML solely relies on training data
◦ MAP optimally combines prior knowledge with data
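
A quick numerical contrast between θML and θMAP (an illustrative sketch, not from the slides), using a Bernoulli likelihood with an assumed Beta(a, b) prior, whose MAP estimate has the closed form (k + a − 1)/(n + a + b − 2):

import numpy as np

x = np.array([1, 1, 1, 0, 1])      # hypothetical observations
n, k = len(x), int(x.sum())

# ML: relies on the training data alone
theta_ml = k / n

# MAP with a Beta(a, b) prior: combines prior knowledge with the data
a, b = 2.0, 2.0                    # assumed prior pseudo-counts
theta_map = (k + a - 1) / (n + a + b - 2)

print(f"theta_ML  = {theta_ml:.3f}")   # 0.800
print(f"theta_MAP = {theta_map:.3f}")  # (4+1)/(5+2) = 0.714, pulled toward the prior mean 0.5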

Sequential Bayesian Learning


Bayesian learning is an excellent tool for on-line learning,
where training samples arrive one at a time
sequential Bayesian learning
◦ use the Bayesian learning rule to update the model after each sample
◦ track a slowly-changing environment

p(θ | x1 ) ∝ p(θ)p(x1 | θ)
p(θ | x1 , x2 ) ∝ p(θ | x1 ) p(x2 | θ)


Example: Sequential Bayesian Learning

a univariate Gaussian model with known variance:

      p(x | µ) = N (x | µ, σ0²) = (1/√(2πσ0²)) exp( −(x − µ)² / (2σ0²) )

choose a prior distribution:

      p(µ) = N (µ | ν0 , τ0²) = (1/√(2πτ0²)) exp( −(µ − ν0)² / (2τ0²) )

after the first sample x1:

      p(µ | x1) ∝ p(µ) p(x1 | µ) =⇒ p(µ | x1) = N (µ | ν1 , τ1²)
      with ν1 = (σ0² ν0 + τ0² x1) / (τ0² + σ0²) and τ1² = τ0² σ0² / (τ0² + σ0²)

after n samples:

      p(µ | x1 , · · · , xn) = N (µ | νn , τn²)
      with νn = (n τ0² x̄n + σ0² ν0) / (n τ0² + σ0²) and τn² = τ0² σ0² / (n τ0² + σ0²)

as n → ∞, we have
◦ τn² → 0
◦ νn → x̄n
◦ µMAP → µML
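
These recursions are easy to run; a minimal sketch (my own, with assumed values for ν0, τ0², σ0²) applies the single-sample update repeatedly, so the posterior after each sample becomes the prior for the next:

import numpy as np

sigma0_sq = 1.0          # known observation variance
nu, tau_sq = 0.0, 4.0    # prior p(mu) = N(nu0, tau0^2), assumed values

rng = np.random.default_rng(1)
data = rng.normal(2.0, np.sqrt(sigma0_sq), size=20)   # hypothetical stream of samples

for x in data:
    # one-sample conjugate update: posterior after x_t becomes the prior for x_{t+1}
    nu = (sigma0_sq * nu + tau_sq * x) / (tau_sq + sigma0_sq)
    tau_sq = (tau_sq * sigma0_sq) / (tau_sq + sigma0_sq)

print(f"posterior after {len(data)} samples: N(mu | {nu:.3f}, {tau_sq:.4f})")
print(f"sample mean x_bar = {data.mean():.3f}")   # nu_n approaches x_bar, tau_n^2 shrinks toward 0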


Conjugate Priors

conjugate prior: a prior chosen to ensure that its posterior has
the same functional form as the prior

the prior is conjugate to the likelihood function of the underlying model,
i.e. both have the same functional form

the choice of a conjugate prior leads to computational convenience
in Bayesian learning

not every model has a conjugate prior, e.g. mixture models

all e-family (exponential family) models have conjugate priors


Examples of Conjugate Priors


model p(x|θ)                                  conjugate prior p(θ)
--------------------------------------------  ---------------------------------------------
1-D Gaussian (known variance)                 1-D Gaussian
  N (x | µ, σ0²)                                N (µ | ν, τ²)
1-D Gaussian (known mean)                     inverse-gamma
  N (x | µ0 , σ²)                               gamma⁻¹(σ² | α, β)
Gaussian (known covariance)                   Gaussian
  N (x | µ, Σ0)                                 N (µ | ν, Φ)
Gaussian (known mean)                         inverse-Wishart
  N (x | µ0 , Σ)                                W⁻¹(Σ | Φ, ν)
multivariate Gaussian                         Gaussian-inverse-Wishart
  N (x | µ, Σ)                                  GIW(µ, Σ | ν, Φ, λ, ν) = N (µ | ν, (1/λ)Σ) W⁻¹(Σ | Φ, ν)
multinomial                                   Dirichlet
  Mult(r | w) = C(r) · ∏_{i=1}^{M} wi^{ri}      Dir(w | α) = B(α) · ∏_{i=1}^{M} wi^{αi−1}
  with C(r) = (r1+···+rM)! / (r1!···rM!)        with B(α) = Γ(α1+···+αM) / (Γ(α1)···Γ(αM))


Conjugate Priors for Bayesian Learning: Multinomials


 
a sample of some counts: r = [ r1  r2  · · ·  rM ]
multinomial model: p(r | w) = Mult(r | w) = C(r) · ∏_{i=1}^{M} wi^{ri}
the conjugate prior is Dirichlet:

      p(w) = Dir(w | α(0)) = B(α(0)) · ∏_{i=1}^{M} wi^{αi(0) − 1}

Bayesian learning:

      p(w | r) ∝ p(w) p(r | w) ∝ ∏_{i=1}^{M} wi^{αi(0) + ri − 1}

the posterior is also Dirichlet:

      p(w | r) = Dir(w | α(1)) = B(α(1)) · ∏_{i=1}^{M} wi^{αi(1) − 1}

MAP estimation:

      w(MAP) = arg max_w p(w | r) subject to ∑_{i=1}^{M} wi = 1

      =⇒ wi(MAP) = (αi(1) − 1) / ( ∑_{i=1}^{M} αi(1) − M ) = (ri + αi(0) − 1) / ( ∑_{i=1}^{M} (ri + αi(0)) − M )   ∀ i = 1, 2, · · · , M
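
A short sketch (illustrative counts and prior, not from the book) of the Dirichlet update α(1) = α(0) + r and the resulting MAP estimate:

import numpy as np

r = np.array([12, 5, 3])            # hypothetical observed counts for M = 3 categories
alpha0 = np.array([2.0, 2.0, 2.0])  # assumed Dirichlet prior hyperparameters

# conjugate update: the posterior is Dirichlet with alpha1 = alpha0 + r
alpha1 = alpha0 + r

# MAP estimate: w_i = (alpha1_i - 1) / (sum_i alpha1_i - M)
M = len(r)
w_map = (alpha1 - 1) / (alpha1.sum() - M)

# posterior mean, for comparison: alpha1_i / sum_i alpha1_i
w_mean = alpha1 / alpha1.sum()

print("alpha1 =", alpha1)
print("w_MAP  =", w_map, " sum =", w_map.sum())
print("w_mean =", w_mean)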


Conjugate Priors for Bayesian Learning: Gaussians (1)


Gaussian model: p(x | µ, Σ) = N (x | µ, Σ)
the conjugate prior is a Gaussian-inverse-Wishart (GIW) distribution:

      p(µ, Σ) = GIW(µ, Σ | ν0 , Φ0 , λ0 , ν0) = N (µ | ν0 , (1/λ0) Σ) W⁻¹(Σ | Φ0 , ν0)
              = c0 |Σ⁻¹|^((ν0+d+2)/2) exp( −(λ0/2) (µ − ν0)ᵀ Σ⁻¹ (µ − ν0) − (1/2) tr(Φ0 Σ⁻¹) )

the likelihood function of a training set D = {x1 , x2 , · · · , xN}:

      p(D | µ, Σ) = ∏_{i=1}^{N} p(xi | µ, Σ)
                  = ( |Σ⁻¹|^(N/2) / (2π)^(Nd/2) ) exp( −(1/2) tr(N S Σ⁻¹) − (N/2) (µ − x̄)ᵀ Σ⁻¹ (µ − x̄) )

Bayesian learning:

      p(µ, Σ | D) ∝ GIW(µ, Σ | ν0 , Φ0 , λ0 , ν0) · p(D | µ, Σ)

Conjugate Priors for Bayesian Learning: Gaussians (2)


the posterior is another GIW distribution:

      p(µ, Σ | D) = GIW(µ, Σ | ν1 , Φ1 , λ1 , ν1)
                  = c1 |Σ⁻¹|^((ν1+d+2)/2) exp( −(λ1/2) (µ − ν1)ᵀ Σ⁻¹ (µ − ν1) − (1/2) tr(Φ1 Σ⁻¹) )

      ◦ λ1 = λ0 + N and ν1 = ν0 + N (degrees of freedom)
      ◦ ν1 = (λ0 ν0 + N x̄) / (λ0 + N) (mean vector)
      ◦ Φ1 = Φ0 + N S + (λ0 N / (λ0 + N)) (x̄ − ν0)(x̄ − ν0)ᵀ

MAP estimation: (µMAP , ΣMAP) = arg max_{µ,Σ} p(µ, Σ | D)

      µMAP = ν1 = (λ0 ν0 + N x̄) / (λ0 + N)

      ΣMAP = Φ1 / (ν1 + d + 1) = ( Φ0 + N S + (λ0 N / (λ0 + N)) (x̄ − ν0)(x̄ − ν0)ᵀ ) / (ν0 + N + d + 1)
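
A numpy sketch of these hyperparameter updates and the resulting MAP estimates, using synthetic data and assumed prior settings (the variable names are mine; the degrees-of-freedom parameter is written nu0_df to keep it distinct from the prior mean nu0):

import numpy as np

rng = np.random.default_rng(0)
d, N = 2, 100
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=[[1.0, 0.3], [0.3, 2.0]], size=N)

# sufficient statistics of the training set
x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False, bias=True)   # so that N*S = sum_i (x_i - x_bar)(x_i - x_bar)^T

# assumed GIW prior hyperparameters
nu0 = np.zeros(d)        # prior mean vector
lambda0 = 1.0
Phi0 = np.eye(d)
nu0_df = d + 2.0         # prior degrees of freedom

# posterior hyperparameters (the update formulas on this slide)
lambda1 = lambda0 + N
nu1_df = nu0_df + N
nu1 = (lambda0 * nu0 + N * x_bar) / (lambda0 + N)
diff = (x_bar - nu0).reshape(-1, 1)
Phi1 = Phi0 + N * S + (lambda0 * N / (lambda0 + N)) * diff @ diff.T

# MAP estimates
mu_map = nu1
Sigma_map = Phi1 / (nu1_df + d + 1)

print("mu_MAP    =", mu_map)
print("Sigma_MAP =\n", Sigma_map)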

Approximate Inference

when conjugate priors do not exist, Bayesian learning may
lead to very complicated posterior distributions

approximate inference: approximate the true posterior
distribution with a simple distribution for Bayesian inference

popular approximate inference methods:

1 Laplace's method

2 variational Bayesian (VB) method


Laplace’s Method

use a Gaussian centered at θMAP to approximate
the true posterior p(θ | D)

Taylor expansion of f(θ) = ln p(θ | D) at θMAP:

      f(θ) = f(θMAP) + ∇f(θMAP)ᵀ (θ − θMAP)
             + (1/2!) (θ − θMAP)ᵀ H(θMAP) (θ − θMAP) + · · ·

2nd-order approximation (the gradient ∇f(θMAP) vanishes at the mode):

      f(θ) ≈ f(θMAP) + (1/2) (θ − θMAP)ᵀ H(θMAP) (θ − θMAP)

      p(θ | D) ≈ C · exp( (1/2) (θ − θMAP)ᵀ H(θMAP) (θ − θMAP) )
               = N (θ | θMAP , −H⁻¹(θMAP))
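
A small 1-D sketch of Laplace's method (my own example): the unnormalized log-posterior is an assumed Gamma-like function, θMAP is found numerically, and a finite-difference Hessian at θMAP gives the Gaussian approximation N(θMAP, −H⁻¹):

import numpy as np
from scipy.optimize import minimize_scalar

# an assumed unnormalized log-posterior ln p(theta | D) (Gamma-like shape, theta > 0)
def log_post(theta):
    return 4.0 * np.log(theta) - 2.0 * theta

# find theta_MAP by maximizing ln p(theta | D)
res = minimize_scalar(lambda t: -log_post(t), bounds=(1e-6, 50.0), method="bounded")
theta_map = res.x

# second derivative (Hessian) at theta_MAP via central finite differences
eps = 1e-4
H = (log_post(theta_map + eps) - 2.0 * log_post(theta_map) + log_post(theta_map - eps)) / eps**2

# Laplace approximation: p(theta | D) ≈ N(theta | theta_MAP, -1/H)
print(f"theta_MAP ≈ {theta_map:.4f}")             # analytic mode is 4/2 = 2
print(f"approx. variance -1/H ≈ {-1.0 / H:.4f}")  # analytic value is theta_MAP^2 / 4 = 1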


Bayesian Learning of Logistic Regression



a training set D = {(x1 , y1), · · · , (xN , yN)}, xi ∈ R^d, yi ∈ {0, 1}
likelihood function of logistic regression:

      p(D | w) = ∏_{i=1}^{N} l(wᵀxi)^{yi} (1 − l(wᵀxi))^{1−yi}

choose a Gaussian prior: p(w) = N (w | w0 , Σ0)
Bayesian learning: p(w | D) ∝ p(w) p(D | w)
the posterior p(w | D) is not Gaussian anymore
use Laplace's method to approximate the true posterior
◦ use a gradient descent method to find wMAP

      ∇ ln p(w | D) = −Σ0⁻¹ (w − w0) + ∑_{i=1}^{N} (yi − l(wᵀxi)) xi

◦ use a Gaussian approximation:

      p(w | D) ≈ N (w | wMAP , −H⁻¹(wMAP))
      with H(w) = −Σ0⁻¹ − ∑_{i=1}^{N} l(wᵀxi) (1 − l(wᵀxi)) xi xiᵀ
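
A compact sketch of this procedure on synthetic data (assumptions: the data, the prior Σ0 = I, and the step size are all made up for illustration): gradient ascent on ln p(w|D) finds wMAP, and the Hessian formula above gives the approximate posterior covariance −H⁻¹(wMAP):

import numpy as np

rng = np.random.default_rng(0)

# hypothetical training data for a 2-D logistic regression
N, d = 200, 2
X = rng.normal(size=(N, d))
w_true = np.array([1.5, -2.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

# Gaussian prior p(w) = N(w | w0, Sigma0), with assumed Sigma0 = I
w0 = np.zeros(d)
Sigma0_inv = np.eye(d)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# gradient ascent on ln p(w | D) to find w_MAP
w = np.zeros(d)
for _ in range(1000):
    p = sigmoid(X @ w)
    grad = -Sigma0_inv @ (w - w0) + X.T @ (y - p)
    w += 0.1 * grad / N
w_map = w

# Hessian of ln p(w | D) at w_MAP, and the Laplace posterior covariance -H^{-1}
p = sigmoid(X @ w_map)
H = -Sigma0_inv - (X.T * (p * (1.0 - p))) @ X
cov_laplace = -np.linalg.inv(H)

print("w_MAP =", w_map)
print("approximate posterior covariance =\n", cov_laplace)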

Variational Bayesian Methods (I)


variational Bayesian (VB): use a simpler variational
distribution q(θ) to approximate the true posterior p(θ | D):

      q*(θ) = arg min_q KL( q(θ) ‖ p(θ | D) )

      KL( q(θ) ‖ p(θ | D) ) = ln p(D) − ∫_θ q(θ) ln( p(D, θ) / q(θ) ) dθ ,
      where the last integral is denoted L(q)

      min_q KL( q(θ) ‖ p(θ | D) ) ⇐⇒ max_q L(q)

assume q(θ) = q1(θ1) q2(θ2) · · · qI(θI) can be factorized over
some disjoint subsets θ = θ1 ∪ θ2 ∪ · · · ∪ θI

      L(q) = ∫_θ ∏_{i=1}^{I} qi(θi) ln p(D, θ) dθ − ∑_{i=1}^{I} ∫_{θi} qi(θi) ln qi(θi) dθi

Variational Bayesian Methods (II)


maximize L(q) w.r.t. each qi(θi) separately:

      max_{qi} ∫_{θi} qi(θi) E_{j≠i}[ ln p(D, θ) ] dθi − ∫_{θi} qi(θi) ln qi(θi) dθi

      where E_{j≠i}[ ln p(D, θ) ] = ∫ ∏_{j≠i} qj(θj) ln p(D, θ) dθ_{j≠i}

define a new distribution: p̃(θi ; D) ∝ exp( E_{j≠i}[ ln p(D, θ) ] )

we have qi*(θi) = arg max_{qi} ∫_{θi} qi(θi) ln( p̃(θi ; D) / qi(θi) ) dθi

      =⇒ qi*(θi) = arg min_{qi} KL( qi(θi) ‖ p̃(θi ; D) )

derive qi*(θi) = p̃(θi ; D) ∝ exp( E_{j≠i}[ ln p(D, θ) ] ), or

      ln qi*(θi) = E_{j≠i}[ ln p(D, θ) ] + C

Variational Bayesian Methods (III)

mean field theory: use a factorizable
variational distribution to approximate
a true posterior distribution

example: a 2-D Gaussian with Σ = [ 1  2 ; 2  5 ]

approximate it with a factorized variational
distribution whose covariance is Σ = [ σ1²  0 ; 0  σ2² ]

the best fit is found by minimizing the
KL-divergence
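
For a Gaussian target this mean-field fit is known in closed form: each factor qi(θi) is Gaussian with precision equal to the corresponding diagonal entry of the target's precision matrix Λ = Σ⁻¹, so q underestimates the marginal variances. A tiny sketch, assuming the 2-D covariance reconstructed above, Σ = [[1, 2], [2, 5]]:

import numpy as np

# target: a zero-mean 2-D Gaussian with correlated components (covariance as reconstructed above)
Sigma = np.array([[1.0, 2.0],
                  [2.0, 5.0]])
Lambda = np.linalg.inv(Sigma)          # precision matrix

# mean-field solution q(theta) = q1(theta1) q2(theta2): each factor is Gaussian
# with variance 1 / Lambda_ii (standard result for Gaussian targets)
mf_variances = 1.0 / np.diag(Lambda)

print("marginal variances of p  :", np.diag(Sigma))   # [1, 5]
print("mean-field variances of q:", mf_variances)     # smaller: q underestimates the spread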


Variational Bayesian Learning of GMMs (I)


a Gaussian mixture model (GMM):

      p(x | θ) = ∑_{m=1}^{M} wm · N (x | µm , Σm)

where the model parameters are θ = {wm , µm , Σm | m = 1, 2, · · · , M}
no conjugate prior exists for GMMs
choose a prior distribution as

      p(θ) = p(w1 , · · · , wM) ∏_{m=1}^{M} p(µm , Σm)

with
      p(w1 , · · · , wM) = Dir(w1 , · · · , wM | α1(0) , · · · , αM(0))
      p(µm , Σm) = GIW(µm , Σm | νm(0) , Φm(0) , λm(0) , νm(0))


Variational Bayesian Learning of GMMs (II)

 
introduce a 1-of-M latent variable z = [ z1  z2  · · ·  zM ] for GMMs:

      p(x, z | θ) = ∏_{m=1}^{M} ( wm N (x | µm , Σm) )^{zm}

use the variational Bayesian method to approximate the posterior
distribution p(z, θ | x)
introduce a variational distribution factorized as:

      q(z, θ) = q(z) q(θ) = q(z) q(w1 , · · · , wM) ∏_{m=1}^{M} q(µm , Σm)

derive the best-fit variational distribution q*(z, θ)


Variational Bayesian Learning of GMMs (III)

1 ln q*(z) = Eθ[ ln p(x, z, θ) ] + C = Eθ[ ln p(θ) + ln p(x, z | θ) ] + C

      =⇒ ln q*(z) = C′ + ∑_{m=1}^{M} zm ( E[ln wm] − E[ln |Σm|] / 2 − E[(x − µm)ᵀ Σm⁻¹ (x − µm)] / 2 )

      where the term in parentheses is denoted ln ρm

◦ q*(z) is a multinomial: q*(z) ∝ ∏_{m=1}^{M} ρm^{zm} ∝ ∏_{m=1}^{M} rm^{zm} ,
  where rm = ρm / ∑_{m=1}^{M} ρm for all m

2 ln q*(w1 , · · · , wM) = E_{z, µm, Σm}[ ln p(θ) + ln p(x, z | θ) ]
                        = ∑_{m=1}^{M} (αm(0) − 1) ln wm + ∑_{m=1}^{M} rm ln wm + C

◦ q*(w1 , · · · , wM) is a Dirichlet distribution:

      q*(w1 , · · · , wM) = Dir(w1 , · · · , wM | α1(1) , · · · , αM(1))

  where αm(1) = αm(0) + rm for all m = 1, 2, · · · , M

Variational Bayesian Learning of GMMs (IV)


3 ln q*(µm , Σm) = E_{z, wm}[ ln p(θ) + ln p(x, z | θ) ] + C
                 = ln p(µm , Σm) + E[zm] ln N (x | µm , Σm) + C′

◦ q*(µm , Σm) is also a GIW distribution:

      q*(µm , Σm) = GIW(µm , Σm | νm(1) , Φm(1) , λm(1) , νm(1))

  where
      λm(1) = λm(0) + rm
      νm(1) = νm(0) + rm (degrees of freedom)
      νm(1) = ( λm(0) νm(0) + rm x ) / ( λm(0) + rm ) (mean vector)
      Φm(1) = Φm(0) + ( λm(0) rm / (λm(0) + rm) ) (x − νm(0)) (x − νm(0))ᵀ

Variational Bayesian Learning of GMMs (V)

based on the above distributions, we have

      ln π̃m ≜ E[ln wm] = ψ( αm(1) ) − ψ( ∑_{m=1}^{M} αm(1) )

      ln B̃m ≜ E[ln |Σm|] = ∑_{i=1}^{d} ψ( (νm(1) + 1 − i) / 2 ) − ln |Φm(1)|

      E[(x − µm)ᵀ Σm⁻¹ (x − µm)] = d / λm(1) + νm(1) (x − νm(1))ᵀ (Φm(1))⁻¹ (x − νm(1))

to compute ρm as well as rm (∀ m = 1, 2, · · · , M)
derive an EM-like algorithm to solve the mutual dependency


Variational Bayesian Learning of GMMs (VI)

Variational Bayesian GMMs

Input: { αm(0) , νm(0) , Φm(0) , λm(0) , νm(0) | m = 1, 2, · · · , M }

set n = 0
while not converged do
    E-step: collect statistics:
        { αm(n) , νm(n) , Φm(n) , λm(n) , νm(n) } + x −→ rm
    M-step: update all hyperparameters:
        { αm(n) , νm(n) , Φm(n) , λm(n) , νm(n) } + rm + x
            −→ { αm(n+1) , νm(n+1) , Φm(n+1) , λm(n+1) , νm(n+1) }
    n = n + 1
end while
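
This EM-like VB loop is available in standard libraries; the sketch below (not the book's code) fits a variational Bayesian GMM with scikit-learn's BayesianGaussianMixture, which places a Dirichlet prior on the mixture weights and Gaussian-Wishart priors on the component means and precisions (the precision-space counterpart of the GIW prior used here), on made-up 2-D data:

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# hypothetical 2-D data drawn from two well-separated Gaussians
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.8, size=(200, 2)),
])

# M is deliberately set larger than the true number of components;
# the Dirichlet prior can shrink the weights of unused components toward zero
vb_gmm = BayesianGaussianMixture(
    n_components=5,
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_distribution",
    max_iter=200,
    random_state=0,
)
vb_gmm.fit(X)

print("mixture weights:", np.round(vb_gmm.weights_, 3))
print("component means:\n", np.round(vb_gmm.means_, 2))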


Non-Parametric Bayesian Methods

Bayesian learning of parametric models: rely on
prior/posterior distributions of model parameters

how about Bayesian learning of non-parametric models?

non-parametric Bayesian methods: use stochastic processes as
priors for non-parametric models

◦ Gaussian processes

◦ Dirichlet processes


Gaussian Processes: Concepts (I)


given an arbitrary function f(x)
for any set of N points in R^d, i.e. D = {x1 , x2 , · · · , xN},
the function values form an N-dimensional real-valued vector

      f = [ f(x1)  f(x2)  · · ·  f(xN) ]ᵀ

assume f follows a multivariate Gaussian distribution:

      f = [ f(x1)  f(x2)  · · ·  f(xN) ]ᵀ ∼ N ( µD , ΣD )

where µD and ΣD depend on the N data points in D
if this holds for any D, then f(x) is a sample from a Gaussian process:

      f(x) ∼ GP( m(x), Φ(x, x′) )

◦ m(x): mean function =⇒ µD
◦ Φ(x, x′): covariance function =⇒ ΣD

Gaussian Processes: Concepts (II)


how to specify a Gaussian process?
mean function: m(x) = 0
covariance function: must satisfy Mercer's condition

      Φ(xi , xj) = cov( f(xi), f(xj) )

◦ ΣD = [ Φ(xi , xj) ]_{N×N}

RBF kernel function:

      Φ(xi , xj) = σ² exp( −‖xi − xj‖² / (2 l²) )

◦ σ: vertical scale
◦ l: horizontal scale
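
A short illustrative sketch of what this specification means in practice: build ΣD = [Φ(xi, xj)] with the RBF kernel on a grid of inputs and draw sample functions f ∼ N(0, ΣD) from the GP prior (the grid and hyperparameter values are assumptions for illustration):

import numpy as np

def rbf_kernel(X1, X2, sigma=1.0, length=1.0):
    # Phi(x_i, x_j) = sigma^2 * exp(-||x_i - x_j||^2 / (2 l^2)) for 1-D inputs
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    return sigma**2 * np.exp(-sq_dists / (2.0 * length**2))

rng = np.random.default_rng(0)

# a grid of 1-D inputs standing in for the data set D
x_grid = np.linspace(-5.0, 5.0, 100)
Sigma_D = rbf_kernel(x_grid, x_grid, sigma=1.0, length=1.0)

# draw three functions from the GP prior: f ~ N(0, Sigma_D)
# (a small jitter keeps the covariance numerically positive definite)
f_samples = rng.multivariate_normal(np.zeros(len(x_grid)),
                                    Sigma_D + 1e-8 * np.eye(len(x_grid)),
                                    size=3)
print(f_samples.shape)   # (3, 100): three sampled functions evaluated on the grid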

Gaussian Processes for Non-Parametric Bayesian Learning

Gaussian processes as a non-parametric prior

◦ randomly sample a function f(·) from a Gaussian process
◦ a prior can be implicitly computed with a data set D = {x1 , x2 , · · · , xN}
◦ function values f follow a multivariate Gaussian distribution
◦ non-parametric prior:

      p(f | D) = N (f | 0, ΣD)

Gaussian processes for regression or classification

◦ input-output pairs yield the likelihood function
◦ apply the Bayesian learning rule:
      posterior ∝ prior × likelihood


Gaussian Processes for Regression (I)


basic setting for regression:

◦ f(x) ∼ GP( 0, Φ(x, x′) )
◦ y = f(x) + ε, where ε ∼ N (0, σ0²)

given a training set: D = {x1 , x2 , · · · , xN}
the corresponding outputs: y = [ y1  y2  · · ·  yN ]ᵀ
a non-parametric prior:

      p(f | D) = N (f | 0, ΣD)

the likelihood function due to the residual Gaussian noise ε:

      p(y | f, D) = N (y | f , σ0² I)
 


Gaussian Processes for Regression (II)


Bayesian learning for the marginal distribution:

      p(y | D) = ∫_f p(y, f | D) df = ∫_f p(y | f, D) p(f | D) df
               = ∫_f N (y | f , σ0² I) N (f | 0, ΣD) df
               = N (y | 0, ΣD + σ0² I) = N (y | 0, CN)

hyper-parameter learning:

      {σ*, l*, σ0*} = arg max_{σ, l, σ0} p(y | D, σ, l, σ0) = arg max_{σ, l, σ0} ln N (y | 0, CN)

◦ may use a gradient descent method



Gaussian Processes for Regression (III)

predict output ỹ for a new input x̃:

      p(y, ỹ | D, x̃) = N (y, ỹ | 0, CN+1)

with
      CN+1 = [ CN  k ; kᵀ  κ² ]

where κ² = Φ(x̃, x̃) + σ0² and ki = Φ(xi , x̃)

the predictive distribution:

      p(ỹ | D, y, x̃) = p(y, ỹ | D, x̃) / p(y | D)
                      = N ( ỹ | kᵀ CN⁻¹ y , κ² − kᵀ CN⁻¹ k )

point estimation (MAP or mean):

      E[ỹ | D, y, x̃] = ỹMAP = kᵀ CN⁻¹ y
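
A numpy sketch of these predictive equations on made-up data (the kernel hyperparameters, training set, and noise level are all assumed): form CN = ΣD + σ0²I, then k and κ², and compute the predictive mean kᵀCN⁻¹y and variance κ² − kᵀCN⁻¹k:

import numpy as np

def rbf_kernel(X1, X2, sigma=1.0, length=1.0):
    # Phi(x_i, x_j) = sigma^2 exp(-||x_i - x_j||^2 / (2 l^2)) for 1-D inputs
    return sigma**2 * np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2.0 * length**2))

rng = np.random.default_rng(0)

# hypothetical noisy training data y = sin(x) + eps
sigma0 = 0.1
x_train = rng.uniform(-3.0, 3.0, size=20)
y_train = np.sin(x_train) + sigma0 * rng.normal(size=20)

# C_N = Sigma_D + sigma0^2 I
C_N = rbf_kernel(x_train, x_train) + sigma0**2 * np.eye(len(x_train))
C_N_inv = np.linalg.inv(C_N)

# predict at a new input x_tilde
x_tilde = np.array([0.5])
k = rbf_kernel(x_train, x_tilde)[:, 0]                 # k_i = Phi(x_i, x_tilde)
kappa_sq = rbf_kernel(x_tilde, x_tilde)[0, 0] + sigma0**2

mean = k @ C_N_inv @ y_train                           # k^T C_N^{-1} y
var = kappa_sq - k @ C_N_inv @ k                       # kappa^2 - k^T C_N^{-1} k

print(f"predictive mean ≈ {mean:.3f} (true sin(0.5) = {np.sin(0.5):.3f})")
print(f"predictive variance ≈ {var:.4f}")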


Gaussian Processes for Regression (IV)

derive a non-parametric prior from D and Φ(x, x′)
non-parametric Bayesian learning based on y


Gaussian Processes for Classification


basic setting for binary classification, y ∈ {0, 1}:

◦ f(x) ∼ GP( 0, Φ(x, x′) )
◦ Pr(y = 1 | x) = l( f(x) ) = 1 / (1 + e^{−f(x)})

given a training set D = {x1 , x2 , · · · , xN} and the corresponding
outputs y = [ y1  y2  · · ·  yN ]ᵀ

non-parametric prior: p(f | D) = N (f | 0, ΣD)
likelihood: p(y | f, D) = ∏_{i=1}^{N} l(f(xi))^{yi} ( 1 − l(f(xi)) )^{1−yi}

no closed-form solution to derive the marginal and predictive
distributions, i.e. p(y | D) and p(ỹ | D, y, x̃)
require approximate inference, such as Laplace's method
