An Introduction to PAC-Bayesian Analysis

John Shawe-Taylor, Benjamin Guedj, Maria Perez Ortiz, Omar Rivasplata

University College London

AIDA e-Lecture Series

May 18, 2021
Learning is to be able to generalise

From examples, what can a system learn about the underlying phenomenon?

Memorising the already seen data is usually bad → overfitting.

Generalisation is the ability to 'perform' well on unseen data.

[Figure from Wikipedia]
Statistical Learning Theory is about high confidence

For a fixed algorithm, function class and sample size, generating random samples → a distribution of test errors.

Focusing on the mean of the error distribution?
  . can be misleading: the learner only has one sample

Statistical Learning Theory: the tail of the distribution
  . finding bounds which hold with high probability over random samples of size m

Compare to a statistical test at the 99% confidence level:
  . the chances of the conclusion not being true are less than 1%

PAC: probably approximately correct [59]
  Use a 'confidence parameter' δ: P^m[large error] ≤ δ
  δ is the probability of being misled by the training set

Hence high confidence: P^m[approximately correct] ≥ 1 − δ
Error distribution picture

[Figure: histograms of test errors over random samples, marking the means and the 95th percentiles]
Mathematical formalization

Learning algorithm A : Z^m → H
  • Z = X × Y, where X = set of inputs and Y = set of outputs (e.g. labels)
  • H = hypothesis class = set of predictors (e.g. classifiers)

Training set (aka sample): S_m = ((X_1, Y_1), ..., (X_m, Y_m)),
a finite sequence of input-output examples.

Classical assumptions:
  • A data-generating distribution P over Z.
  • The learner doesn't know P, and only sees the training set.
  • The training set examples are i.i.d. from P: S_m ∼ P^m
  . these can be relaxed (mostly beyond the scope of this tutorial)
What to achieve from the sample?

Use the available sample to:
  1 learn a predictor
  2 certify the predictor's performance

Learning a predictor:
  • algorithm driven by some learning principle
  • informed by prior knowledge, resulting in an inductive bias

Certifying performance:
  • what happens beyond the training set
  • generalization bounds

Actually, these two goals interact with each other!
Risk (aka error) measures

A loss function ℓ(h(X), Y) is used to measure the discrepancy between a predicted output h(X) and the true output Y.

Empirical risk (in-sample):      R_in(h) = (1/m) ∑_{i=1}^m ℓ(h(X_i), Y_i)

Theoretical risk (out-of-sample):   R_out(h) = E[ ℓ(h(X), Y) ]

Examples:
  • ℓ(h(X), Y) = 1[h(X) ≠ Y] : 0-1 loss (classification)
  • ℓ(h(X), Y) = (Y − h(X))² : square loss (regression)
  • ℓ(h(X), Y) = (1 − Y h(X))_+ : hinge loss
  • ℓ(h(X), Y) = − log(h(X)) : log loss (density estimation)
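To make these definitions concrete, here is a minimal code sketch (not part of the slides) of the empirical risk for the losses listed above. The synthetic data, the fixed linear classifier h and all numerical values are illustrative assumptions; NumPy is assumed available.

```python
import numpy as np

def zero_one_loss(pred, y):      # classification, labels in {-1, +1}
    return (pred != y).astype(float)

def square_loss(pred, y):        # regression
    return (y - pred) ** 2

def hinge_loss(pred, y):         # labels in {-1, +1}, pred is a real-valued score
    return np.maximum(0.0, 1.0 - y * pred)

def log_loss(pred, y=None):      # density estimation: pred = h(x) is a density value
    return -np.log(pred)

def empirical_risk(h, X, Y, loss):
    """R_in(h) = (1/m) * sum_i loss(h(X_i), Y_i)."""
    return float(np.mean(loss(h(X), Y)))

# Example: a fixed (hand-picked) linear classifier on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
h = lambda X: np.sign(X @ np.array([1.0, 0.0]))
print(empirical_risk(h, X, Y, zero_one_loss))
```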
Generalization

If predictor h does well on the in-sample (X, Y) pairs...
...will it still do well on out-of-sample pairs?

Generalization gap: Δ(h) = R_out(h) − R_in(h)

Upper bounds: w.h.p. Δ(h) ≤ ε(m, δ)
  ▸ R_out(h) ≤ R_in(h) + ε(m, δ)

Lower bounds: w.h.p. Δ(h) ≥ ε̃(m, δ)

Flavours:
  distribution-free / distribution-dependent
  algorithm-free / algorithm-dependent
Before PAC-Bayes

Single hypothesis h (building block):
  with probability ≥ 1 − δ,  R_out(h) ≤ R_in(h) + √( (1/(2m)) log(1/δ) ).

Finite function class H (worst-case approach):
  w.p. ≥ 1 − δ,  ∀h ∈ H,  R_out(h) ≤ R_in(h) + √( (1/(2m)) log(|H|/δ) ).

Structural risk minimisation: data-dependent hypotheses h_i, each associated with a prior weight p_i:
  w.p. ≥ 1 − δ,  ∀h_i ∈ H,  R_out(h_i) ≤ R_in(h_i) + √( (1/(2m)) log(1/(p_i δ)) ).

Uncountably infinite function class: VC dimension, Rademacher complexity...

These approaches are suited to analyse the performance of individual functions, and take some account of correlations.

→ Extension: PAC-Bayes allows us to consider distributions over hypotheses.
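The three bounds above translate directly into code. The following sketch evaluates each right-hand side for given R_in, m and δ; the numbers plugged in at the end are made up and purely illustrative.

```python
import math

def single_hypothesis_bound(r_in, m, delta):
    """R_out(h) <= R_in(h) + sqrt(log(1/delta) / (2m)), w.p. >= 1 - delta."""
    return r_in + math.sqrt(math.log(1.0 / delta) / (2 * m))

def finite_class_bound(r_in, m, delta, class_size):
    """Union bound over |H| hypotheses: sqrt(log(|H|/delta) / (2m))."""
    return r_in + math.sqrt(math.log(class_size / delta) / (2 * m))

def srm_bound(r_in, m, delta, p_i):
    """Structural risk minimisation with prior weight p_i on hypothesis h_i."""
    return r_in + math.sqrt(math.log(1.0 / (p_i * delta)) / (2 * m))

# Illustrative (made-up) numbers, not from any experiment.
m, delta = 1000, 0.01
print(single_hypothesis_bound(0.05, m, delta))
print(finite_class_bound(0.05, m, delta, class_size=10**6))
print(srm_bound(0.05, m, delta, p_i=2.0 ** -20))
```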
The PAC-Bayes framework

Before data, fix a distribution P ∈ M_1(H)   . 'prior'

Based on data, learn a distribution Q ∈ M_1(H)   . 'posterior'

Predictions:
  • draw h ∼ Q and predict with the chosen h.
  • each prediction with a fresh random draw.

The risk measures R_in(h) and R_out(h) are extended by averaging:

  R_in(Q) ≡ ∫_H R_in(h) dQ(h)        R_out(Q) ≡ ∫_H R_out(h) dQ(h)

KL(Q‖P) = E_{h∼Q} ln( Q(h)/P(h) ) is the Kullback-Leibler divergence.
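As a small illustration of these objects, the sketch below works with a finite hypothesis class, so the integrals become sums. The threshold classifiers, the particular posterior weights and the synthetic data are assumptions made purely for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite H: threshold classifiers h_t(x) = sign(x - t) on 1-D inputs.
thresholds = np.linspace(-2, 2, 41)
H = [lambda x, t=t: np.sign(x - t) for t in thresholds]

P = np.full(len(H), 1.0 / len(H))                  # 'prior' over H (uniform)
logits = -10.0 * thresholds**2                     # pretend these came from data
Q = np.exp(logits - logits.max()); Q /= Q.sum()    # 'posterior' over H

def risk(h, X, Y):
    return float(np.mean(h(X) != Y))

X = rng.normal(size=500)
Y = np.sign(X + 0.2 * rng.normal(size=500))

# Averaged (Gibbs) in-sample risk: R_in(Q) = sum_h Q(h) R_in(h).
R_in_Q = sum(q * risk(h, X, Y) for q, h in zip(Q, H))

# Randomised prediction: a fresh draw h ~ Q for each prediction.
def gibbs_predict(x):
    h = H[rng.choice(len(H), p=Q)]
    return h(x)

# KL(Q || P) for finite distributions over H.
KL_QP = float(np.sum(Q * np.log(Q / P)))
print(R_in_Q, KL_QP, gibbs_predict(0.3))
```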
PAC-Bayes aka Generalised Bayes

"Prior": exploration mechanism of H.

"Posterior": the prior, twisted after confronting it with data.
PAC-Bayes bounds vs. Bayesian learning

Prior
  • PAC-Bayes: bounds hold for any prior distribution
  • Bayes: the prior choice impacts inference

Posterior
  • PAC-Bayes: bounds hold for any posterior distribution
  • Bayes: the posterior is uniquely defined by the prior and the statistical model

Data distribution
  • PAC-Bayes: bounds hold for any data distribution
  • Bayes: randomness lies in the noise model generating the output
A General PAC-Bayesian Theorem

∆-function: a "distance" between R_in(Q) and R_out(Q);
a convex function ∆ : [0, 1] × [0, 1] → R.

General theorem (Bégin et al. [7, 8], Germain [21])

For any distribution D on X × Y, for any set H of voters, for any distribution P on H, for any δ ∈ (0, 1], and for any ∆-function, we have, with probability at least 1 − δ over the choice of S ∼ D^m,

  ∀ Q on H :   ∆( R_in(Q), R_out(Q) ) ≤ (1/m) [ KL(Q‖P) + ln( I_∆(m) / δ ) ],

where

  I_∆(m) = sup_{r ∈ [0,1]} ∑_{k=0}^m  C(m,k) r^k (1 − r)^{m−k}  e^{m ∆(k/m, r)},

and C(m,k) r^k (1 − r)^{m−k} = Bin(k; m, r), the binomial probability mass function.
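The quantity I_∆(m) can be evaluated numerically. The sketch below (assuming NumPy and SciPy are available) computes it for ∆ = kl on a grid over r, using a log-sum-exp for stability; for this choice of ∆ the value stays below 2√m, the constant used in the corollary later on.

```python
import numpy as np
from scipy.special import gammaln

def kl_binary(q, p, eps=1e-12):
    q = np.clip(q, eps, 1 - eps); p = np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def log_binom_pmf(k, m, r, eps=1e-12):
    r = np.clip(r, eps, 1 - eps)
    return (gammaln(m + 1) - gammaln(k + 1) - gammaln(m - k + 1)
            + k * np.log(r) + (m - k) * np.log(1 - r))

def I_delta(m, delta_fn, grid=2001):
    """sup over r of sum_k Bin(k; m, r) * exp(m * Delta(k/m, r))."""
    k = np.arange(m + 1)
    best = -np.inf
    for r in np.linspace(0.0, 1.0, grid):
        logs = log_binom_pmf(k, m, r) + m * delta_fn(k / m, r)
        best = max(best, np.logaddexp.reduce(logs))   # log-sum-exp
    return np.exp(best)

m = 200
print(I_delta(m, kl_binary), 2 * np.sqrt(m))
```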
Proof of the general theorem

General theorem
  Pr_{S∼D^m} [ ∀ Q on H : ∆( R_in(Q), R_out(Q) ) ≤ (1/m) ( KL(Q‖P) + ln( I_∆(m)/δ ) ) ] ≥ 1 − δ.

Proof ideas.

Change of Measure Inequality
For any P and Q on H, and for any measurable function φ : H → R, we have

  − ln E_{h∼P} e^{φ(h)}  =  − ln E_{h∼Q} [ (P(h)/Q(h)) e^{φ(h)} ]
                         ≤  E_{h∼Q} ln( Q(h)/P(h) ) − E_{h∼Q} φ(h)
                         =  KL(Q‖P) − E_{h∼Q} φ(h).

Markov's inequality
For a random variable X satisfying X ≥ 0,

  Pr(X ≥ a) ≤ E[X]/a   ⟺   Pr( X ≤ E[X]/δ ) ≥ 1 − δ.
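A quick numerical sanity check (illustrative only) of the change-of-measure inequality, rewritten as E_{h∼Q} φ(h) ≤ KL(Q‖P) + ln E_{h∼P} e^{φ(h)}, on a randomly generated finite hypothesis class; the Dirichlet distributions and the function φ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
P = rng.dirichlet(np.ones(n))         # any 'prior' on a finite H
Q = rng.dirichlet(np.ones(n))         # any 'posterior' on the same H
phi = rng.normal(scale=3.0, size=n)   # any measurable function H -> R

lhs = float(np.sum(Q * phi))                               # E_{h~Q} phi(h)
kl = float(np.sum(Q * np.log(Q / P)))                      # KL(Q || P)
rhs = kl + float(np.log(np.sum(P * np.exp(phi))))          # + ln E_{h~P} e^phi
assert lhs <= rhs + 1e-9
print(lhs, rhs)
```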
Proof of the general theorem

Probability of observing k misclassifications among m examples

Given a voter h, consider a binomial variable of m trials with success probability R_out(h):

  Pr_{S∼D^m} [ R_in(h) = k/m ] = C(m,k) R_out(h)^k ( 1 − R_out(h) )^{m−k} = Bin( k; m, R_out(h) ).
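This binomial step can be checked by simulation: for a fixed voter with a chosen true risk, the error count m·R_in(h) over fresh i.i.d. samples follows the binomial probability mass function. A small sketch (the risk value 0.3, the sample size and the number of trials are arbitrary):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)
m, r_out, trials = 50, 0.3, 100_000

# Each trial: draw m i.i.d. 0/1 losses with mean r_out and count the errors.
counts = rng.binomial(1, r_out, size=(trials, m)).sum(axis=1)

k = 10
empirical = float(np.mean(counts == k))
exact = comb(m, k) * r_out**k * (1 - r_out)**(m - k)
print(empirical, exact)   # should agree up to Monte Carlo error
```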
Proof of the general theorem (continued)

  Pr_{S∼D^m} [ ∀ Q on H : ∆( R_in(Q), R_out(Q) ) ≤ (1/m) ( KL(Q‖P) + ln( I_∆(m)/δ ) ) ] ≥ 1 − δ.

Proof.

  m · ∆( E_{h∼Q} R_in(h), E_{h∼Q} R_out(h) )

  (Jensen's inequality)     ≤ E_{h∼Q} m · ∆( R_in(h), R_out(h) )

  (Change of measure)       ≤ KL(Q‖P) + ln E_{h∼P} e^{m ∆( R_in(h), R_out(h) )}

  (Markov's inequality,
   w.p. ≥ 1 − δ)            ≤ KL(Q‖P) + ln [ (1/δ) E_{S'∼D^m} E_{h∼P} e^{m ∆( R_in(h), R_out(h) )} ]

  (Expectation swap)        = KL(Q‖P) + ln [ (1/δ) E_{h∼P} E_{S'∼D^m} e^{m ∆( R_in(h), R_out(h) )} ]

  (Binomial law)            = KL(Q‖P) + ln [ (1/δ) E_{h∼P} ∑_{k=0}^m Bin( k; m, R_out(h) ) e^{m ∆( k/m, R_out(h) )} ]

  (Supremum over risk)      ≤ KL(Q‖P) + ln [ (1/δ) sup_{r∈[0,1]} ∑_{k=0}^m Bin( k; m, r ) e^{m ∆( k/m, r )} ]

                            = KL(Q‖P) + ln( I_∆(m) / δ ).
General theorem

  Pr_{S∼D^m} [ ∀ Q on H : ∆( R_in(Q), R_out(Q) ) ≤ (1/m) ( KL(Q‖P) + ln( I_∆(m)/δ ) ) ] ≥ 1 − δ.

Corollary
[...] with probability at least 1 − δ over the choice of S ∼ D^m, for all Q on H:

  (a)  kl( R_in(Q), R_out(Q) ) ≤ (1/m) [ KL(Q‖P) + ln( 2√m / δ ) ],                      Langford and Seeger [31]

  (b)  R_out(Q) ≤ R_in(Q) + √( (1/(2m)) [ KL(Q‖P) + ln( 2√m / δ ) ] ),                   McAllester [40, 43]

  (c)  R_out(Q) ≤ (1 / (1 − e^{−c})) [ c · R_in(Q) + (1/m) ( KL(Q‖P) + ln(1/δ) ) ],      Catoni [11]

  (d)  R_out(Q) ≤ R_in(Q) + (1/λ) [ KL(Q‖P) + ln(1/δ) + f(λ, m) ].                       Alquier et al. [4]

with the ∆-functions

  kl(q, p)   := q ln(q/p) + (1 − q) ln( (1 − q)/(1 − p) )  ≥  2(q − p)²,
  ∆_c(q, p)  := − ln[ 1 − (1 − e^{−c}) · p ] − c · q,
  ∆_λ(q, p)  := (λ/m) (p − q).
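The corollary's bounds are easy to evaluate once R_in(Q), KL(Q‖P), m and δ are known. Below is a hedged sketch: the Seeger bound (a) is inverted by bisection on the binary kl, and the example numbers at the bottom are made up, not taken from any experiment.

```python
import math

def kl_binary(q, p, eps=1e-12):
    q = min(max(q, eps), 1 - eps); p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, A, tol=1e-9):
    """max{ p in [q, 1) : kl(q, p) <= A }, found by bisection."""
    lo, hi = q, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_binary(q, mid) <= A:
            lo = mid
        else:
            hi = mid
    return lo

def seeger_bound(r_in, kl_qp, m, delta):            # corollary (a)
    A = (kl_qp + math.log(2 * math.sqrt(m) / delta)) / m
    return kl_inverse(r_in, A)

def mcallester_bound(r_in, kl_qp, m, delta):        # corollary (b)
    return r_in + math.sqrt((kl_qp + math.log(2 * math.sqrt(m) / delta)) / (2 * m))

def catoni_bound(r_in, kl_qp, m, delta, c=1.0):     # corollary (c)
    return (1 / (1 - math.exp(-c))) * (c * r_in + (kl_qp + math.log(1 / delta)) / m)

# Illustrative (made-up) inputs.
r_in, kl_qp, m, delta = 0.1, 5.0, 10_000, 0.05
print(seeger_bound(r_in, kl_qp, m, delta),
      mcallester_bound(r_in, kl_qp, m, delta),
      catoni_bound(r_in, kl_qp, m, delta))
```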
Proof of the Langford/Seeger bound

Follows immediately from the General Theorem by choosing ∆(q, p) = kl(q, p).

Indeed, in that case we have

  E_{S∼D^m} E_{h∼P} e^{m kl( R_S(h), R(h) )}
    = E_{h∼P} E_{S∼D^m} [ ( R_S(h) / R(h) )^{m R_S(h)} ( (1 − R_S(h)) / (1 − R(h)) )^{m (1 − R_S(h))} ]
    = E_{h∼P} ∑_{k=0}^m Pr_{S∼D^m}( R_S(h) = k/m ) ( (k/m) / R(h) )^k ( (1 − k/m) / (1 − R(h)) )^{m−k}
    = ∑_{k=0}^m C(m,k) (k/m)^k (1 − k/m)^{m−k}                                            (1)
    ≤ 2√m.

Note that, in line (1) of the proof, Pr_{S∼D^m}( R_S(h) = k/m ) is replaced by the probability mass function of the binomial.

This is only true if the examples of S are drawn i.i.d. (i.e., S ∼ D^m).

So this result is no longer valid in the non-i.i.d. case, even if the General Theorem is.
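The combinatorial sum in line (1), ∑_k C(m,k)(k/m)^k(1−k/m)^{m−k}, and its 2√m upper bound can be checked directly for moderate m. This is a purely numerical illustration, complementing the earlier sketch of I_∆(m):

```python
from math import comb, sqrt

def maurer_sum(m):
    # sum_k C(m,k) (k/m)^k (1 - k/m)^(m-k); note 0**0 == 1 in Python.
    return sum(comb(m, k) * (k / m) ** k * (1 - k / m) ** (m - k)
               for k in range(m + 1))

for m in (10, 100, 1000):
    print(m, maurer_sum(m), 2 * sqrt(m))
```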
Linear classifiers

We will choose the prior and posterior distributions to be Gaussians with unit variance.

The prior P will be centred at the origin, with unit variance.

The centre of the posterior Q(w, µ) will be specified by a unit vector w and a scale factor µ.
PAC-Bayes Bound for SVM (1/2)

Prior P is Gaussian N(0, 1), centred at the origin.

Posterior Q is Gaussian, in the direction w, at distance µ from the origin.

[Figure: the prior P centred at the origin and the posterior Q centred at distance µ along the direction w]
PAC-Bayes Bound for SVM (2/2)

The linear classifier's performance may be bounded by

  KL( Q̂_S(w, µ) ‖ Q_D(w, µ) ) ≤ [ KL( P ‖ Q(w, µ) ) + ln( (m + 1)/δ ) ] / m

Q_D(w, µ): the true performance of the stochastic classifier.

The SVM is a deterministic classifier that exactly corresponds to sgn( E_{c∼Q(w,µ)}[c(x)] ), since the centre of the Gaussian gives the same classification as the halfspace with more weight.

Hence its error is bounded by 2 Q_D(w, µ): as observed above, if x is misclassified then at least half of the classifiers c ∼ Q err on x.
The other quantities in the bound are as follows.

Q̂_S(w, µ): a stochastic measure of the training error,

  Q̂_S(w, µ) = E_m[ F̃( µ γ(x, y) ) ],

with the normalised margin and the Gaussian tail function

  γ(x, y) = y wᵀ φ(x) / ( ‖φ(x)‖ ‖w‖ ),
  F̃(t) = 1 − (1/√(2π)) ∫_{−∞}^t e^{−x²/2} dx.
Prior P ≡ Gaussian centred at the origin.

Posterior Q ≡ Gaussian along w, at a distance µ from the origin.

For these two unit-variance Gaussians, KL(P‖Q) = µ²/2.
δ is the confidence parameter: the bound holds with probability 1 − δ over the random i.i.d. selection of the training data.
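Putting the pieces together, the sketch below evaluates Q̂_S(w, µ) and KL(P‖Q) = µ²/2 for a given unit vector w and scale µ, assuming the feature map is available explicitly as rows of a matrix. The data, w and µ are illustrative assumptions, and SciPy's Gaussian tail function norm.sf stands in for F̃.

```python
import numpy as np
from scipy.stats import norm

def stochastic_train_error(Phi, y, w, mu):
    """Q_S_hat(w, mu) = E_m[ F_tilde(mu * gamma(x, y)) ], with
    gamma(x, y) = y * <w, phi(x)> / (||phi(x)|| * ||w||) and F_tilde the Gaussian tail."""
    gamma = y * (Phi @ w) / (np.linalg.norm(Phi, axis=1) * np.linalg.norm(w))
    return float(np.mean(norm.sf(mu * gamma)))   # sf(t) = 1 - Phi_cdf(t)

def kl_prior_posterior(mu):
    """KL between the unit-variance Gaussians P (at the origin) and Q (at mu*w)."""
    return 0.5 * mu ** 2

# Illustrative data and parameters.
rng = np.random.default_rng(4)
Phi = rng.normal(size=(200, 5))                  # explicit features phi(x) as rows
w = rng.normal(size=5); w /= np.linalg.norm(w)   # a unit-norm weight vector
y = np.sign(Phi @ w + 0.3 * rng.normal(size=200))
print(stochastic_train_error(Phi, y, w, mu=3.0), kl_prior_posterior(3.0))
```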
Form of the SVM bound

Note that the bound holds for all posterior distributions, so we can choose µ to optimise the bound.

If we define the inverse of the KL by

  KL⁻¹(q, A) = max{ p : KL(q‖p) ≤ A },

then we have, with probability at least 1 − δ,

  Pr( ⟨w, φ(x)⟩ ≠ y ) ≤ 2 min_µ KL⁻¹( E_m[ F̃( µ γ(x, y) ) ],  ( µ²/2 + ln( (m + 1)/δ ) ) / m ).
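A sketch of how this optimised bound might be computed in practice: invert the binary kl by bisection and minimise over a grid of µ values. The margin vector gamma is synthetic here, standing in for the normalised margins y⟨w, φ(x)⟩/(‖φ(x)‖‖w‖) of a trained SVM, and the µ grid and δ are arbitrary choices.

```python
import math
import numpy as np
from scipy.stats import norm

def kl(q, p, eps=1e-12):
    q = min(max(q, eps), 1 - eps); p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, A):
    """max{ p >= q : kl(q, p) <= A } by bisection."""
    lo, hi = q, 1.0 - 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if kl(q, mid) <= A else (lo, mid)
    return lo

def svm_pac_bayes_bound(gamma, delta=0.05, mus=np.linspace(0.1, 20, 200)):
    m = len(gamma)
    best = 1.0
    for mu in mus:
        q_hat = float(np.mean(norm.sf(mu * gamma)))          # stochastic train error
        A = (0.5 * mu**2 + math.log((m + 1) / delta)) / m    # complexity term
        best = min(best, 2.0 * kl_inverse(q_hat, A))         # factor 2: Gibbs -> SVM
    return best

# Synthetic margins, mostly positive, standing in for a trained classifier.
rng = np.random.default_rng(5)
gamma = np.abs(rng.normal(0.5, 0.3, size=2000)) * rng.choice([1, 1, 1, -1], size=2000)
print(svm_pac_bayes_bound(gamma))
```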
Gives SVM Optimisation

Primal form:
  min_{w, ξ}   (1/2) ‖w‖² + C ∑_{i=1}^m ξ_i
  s.t.         y_i wᵀ φ(x_i) ≥ 1 − ξ_i,   i = 1, ..., m
               ξ_i ≥ 0,                   i = 1, ..., m

Dual form:
  max_α   ∑_{i=1}^m α_i − (1/2) ∑_{i,j=1}^m α_i α_j y_i y_j κ(x_i, x_j)
  s.t.    0 ≤ α_i ≤ C,   i = 1, ..., m

where κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ and ⟨w, φ(x)⟩ = ∑_{i=1}^m α_i y_i κ(x_i, x).
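For completeness, a hedged scikit-learn sketch of how the ingredients of the bound could be recovered from a trained kernel SVM. The dataset and hyperparameters are invented for illustration, and the intercept is removed because the analysis above is for classifiers without a bias term.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=400) > 0, 1, -1)

gamma_rbf = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=gamma_rbf).fit(X, y)

# <w, phi(x)> = sum_i alpha_i y_i k(x_i, x); subtract sklearn's intercept to drop the bias.
scores = clf.decision_function(X) - clf.intercept_[0]

# ||w||^2 = sum_{i,j} (alpha_i y_i)(alpha_j y_j) k(x_i, x_j); SVC stores alpha_i*y_i in dual_coef_.
K = rbf_kernel(clf.support_vectors_, clf.support_vectors_, gamma=gamma_rbf)
w_norm = np.sqrt(clf.dual_coef_ @ K @ clf.dual_coef_.T).item()

# For the RBF kernel k(x, x) = 1, so ||phi(x)|| = 1 and gamma(x, y) = y*scores/||w||.
margins = y * scores / w_norm
print("bias-free training error:", float(np.mean(margins <= 0)))
```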
Slack variable conversion

[Figure: plot of the slack variable conversion over margin values ranging from −2 to 2]
Model Selection with the new bound: setup

Comparison of 10-fold cross-validation, the PAC-Bayes Bound and the Prior PAC-Bayes Bound.

UCI datasets.

Select the C and σ that lead to minimum Classification Error (CE):
  • for 10-fold cross-validation, select the pair that minimises the validation error;
  • for the PAC-Bayes Bound and the Prior PAC-Bayes Bound, select the pair that minimises the bound.
Results

                        SVM                                   ηPrior SVM
Problem              2FCV     10FCV    PAC      PrPAC       PrPAC    τ-PrPAC
digits     Bound     –        –        0.175    0.107       0.050    0.047
           TE        0.007    0.007    0.007    0.014       0.010    0.009
waveform   Bound     –        –        0.203    0.185       0.178    0.176
           TE        0.090    0.086    0.084    0.088       0.087    0.086
pima       Bound     –        –        0.424    0.420       0.428    0.416
           TE        0.244    0.245    0.229    0.229       0.233    0.233
ringnorm   Bound     –        –        0.203    0.110       0.053    0.050
           TE        0.016    0.016    0.018    0.018       0.016    0.016
spam       Bound     –        –        0.254    0.198       0.186    0.178
           TE        0.066    0.063    0.067    0.077       0.070    0.072
Average    TE        0.0846   0.0834   0.081    0.0852      0.0832   0.0832
Take home messages

The bounds are remarkably tight: for the final column, the average factor between bound and TE is under 3.

Model selection from the bounds is as good as 10-fold cross-validation: in fact, all but one of the PAC-Bayes model selections give better averages for TE.

The better bounds do not appear to give better model selection: the best model selection comes from the simplest bound.

A. Ambroladze, E. Parrado-Hernández, and J. Shawe-Taylor. Tighter PAC-Bayes bounds. In Advances in Neural Information Processing Systems 18, (2006), pages 9–16.

P. Germain, A. Lacasse, F. Laviolette and M. Marchand. PAC-Bayesian learning of linear classifiers. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009, Montréal, Canada). ACM Press (2009), 382, pages 453–460.
Deep Learning Results

[Figures: deep learning results not reproduced here]
A flexible framework

Since 1997, PAC-Bayes has been successfully used in many machine learning settings (this list is by no means exhaustive).

Statistical learning theory: Audibert and Bousquet [6], Catoni [9, 10], Guedj [25], Guedj and Pujol [27], Maurer [39], McAllester [41, 42, 44, 45], Mhammedi et al. [46], Seeger [51, 52], Shawe-Taylor and Williamson [56], Thiemann et al. [58]

SVMs & linear classifiers: Germain et al. [19], Langford and Shawe-Taylor [32], McAllester [44]

Supervised learning algorithms reinterpreted as bound minimizers: Ambroladze et al. [5], Germain et al. [22], Shawe-Taylor and Hardoon [57]

High-dimensional regression: Alquier and Biau [1], Alquier and Lounici [2], Guedj and Robbiano [24], Guedj and Alquier [26], Li et al. [35]

Classification: Catoni [9, 10], Lacasse et al. [30], Langford and Shawe-Taylor [32], Parrado-Hernández et al. [49]
A flexible framework (continued)

Transductive learning, domain adaptation: Bégin et al. [7], Derbeko et al. [12], Germain et al. [20], Nozawa et al. [48]

Non-iid or heavy-tailed data: Alquier and Guedj [3], Holland [29], Lever et al. [34], Seldin et al. [54, 55]

Density estimation: Higgs and Shawe-Taylor [28], Seldin and Tishby [53]

Reinforcement learning: Fard and Pineau [16], Fard et al. [17], Ghavamzadeh et al. [23], Seldin et al. [54, 55]

Sequential learning: Gerchinovitz [18], Li et al. [36]

Algorithmic stability, differential privacy: Dziugaite and Roy [13, 14], London [37], London et al. [38], Rivasplata et al. [50]

Deep neural networks: Dziugaite and Roy [15], Letarte et al. [33], Neyshabur et al. [47], Zhou et al. [60]

...
References I
[1] P. Alquier and G. Biau. Sparse single-index model. Journal of Machine Learning Research, 14:243–280, 2013.
[2] P. Alquier and K. Lounici. PAC-Bayesian theorems for sparse regression estimation with exponential weights. Electronic
Journal of Statistics, 5:127–145, 2011.
[3] Pierre Alquier and Benjamin Guedj. Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5):887–902, 2018.
[4] Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of variational approximations of Gibbs posteriors. ArXiv
e-prints, 2015. URL http://arxiv.org/abs/1506.04091.
[5] A. Ambroladze, E. Parrado-Hernández, and J. Shawe-taylor. Tighter PAC-Bayes bounds. In Advances in Neural Information
Processing Systems, NIPS, pages 9–16, 2007.
[6] Jean-Yves Audibert and Olivier Bousquet. Combining PAC-Bayesian and generic chaining bounds. Journal of Machine
Learning Research, 2007.
[7] Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy. PAC-Bayesian theory for transductive learning. In
AISTATS, 2014.
[8] Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy. PAC-Bayesian bounds based on the Rényi divergence.
In AISTATS, 2016.
[9] O. Catoni. Statistical Learning Theory and Stochastic Optimization. École d’Été de Probabilités de Saint-Flour 2001. Springer,
2004.
[10] O. Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56 of Lecture notes
– Monograph Series. Institute of Mathematical Statistics, 2007.
[11] Olivier Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning, volume 56. Inst. of
Mathematical Statistic, 2007.
[12] Philip Derbeko, Ran El-Yaniv, and Ron Meir. Explicit learning curves for transduction and application to clustering and
compression algorithms. J. Artif. Intell. Res. (JAIR), 22, 2004.
[13] G. K. Dziugaite and D. M. Roy. Data-dependent PAC-Bayes priors via differential privacy. In NeurIPS, 2018.
[14] G. K. Dziugaite and D. M. Roy. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of
Entropy-SGD and data-dependent priors. In International Conference on Machine Learning, pages 1376–1385, 2018.

References II
[15] Gintare K. Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks
with many more parameters than training data. In Proceedings of Uncertainty in Artificial Intelligence (UAI), 2017.
[16] Mahdi Milani Fard and Joelle Pineau. PAC-Bayesian model selection for reinforcement learning. In Advances in Neural
Information Processing Systems (NIPS), 2010.
[17] Mahdi Milani Fard, Joelle Pineau, and Csaba Szepesvári. PAC-Bayesian Policy Evaluation for Reinforcement Learning. In UAI,
Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 195–202, 2011.
[18] S. Gerchinovitz. Prédiction de suites individuelles et cadre statistique classique : étude de quelques liens autour de la
régression parcimonieuse et des techniques d’agrégation. PhD thesis, Université Paris-Sud, 2011.
[19] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In Proceedings of the
26th Annual International Conference on Machine Learning, ICML, 2009.
[20] P. Germain, A. Habrard, F. Laviolette, and E. Morvant. A new PAC-Bayesian perspective on domain adaptation. In Proceedings
of International Conference on Machine Learning, volume 48, 2016.
[21] Pascal Germain. Généralisations de la théorie PAC-bayésienne pour l’apprentissage inductif, l’apprentissage transductif et
l’adaptation de domaine. PhD thesis, Université Laval, 2015.
[22] Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, and François Laviolette. From PAC-Bayes bounds to KL
regularization. In Advances in Neural Information Processing Systems, pages 603–610, 2009.
[23] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar. Bayesian reinforcement learning: A survey. Foundations and Trends in
Machine Learning, 8(5-6):359–483, 2015.
[24] B. Guedj and S. Robbiano. PAC-Bayesian high dimensional bipartite ranking. Journal of Statistical Planning and Inference,
196:70 – 86, 2018. ISSN 0378-3758.
[25] Benjamin Guedj. A primer on PAC-Bayesian learning. arXiv:1901.05353, 2019. To appear in the Proceedings of the French
Mathematical Society.
[26] Benjamin Guedj and Pierre Alquier. PAC-Bayesian estimation and prediction in sparse additive models. Electron. J. Statist., 7:
264–291, 2013.
[27] Benjamin Guedj and Louis Pujol. Still no free lunches: the price to pay for tighter PAC-Bayes bounds. arXiv preprint
arXiv:1910.04460, 2019.

References III
[28] Matthew Higgs and John Shawe-Taylor. A PAC-Bayes bound for tailored density estimation. In Proceedings of the International
Conference on Algorithmic Learning Theory (ALT), 2010.
[29] Matthew J Holland. PAC-Bayes under potentially heavy tails. arXiv:1905.07900, 2019. To appear in NeurIPS.
[30] A. Lacasse, F. Laviolette, M. Marchand, P. Germain, and N. Usunier. PAC-Bayes bounds for the risk of the majority vote and
the variance of the Gibbs classifier. In Advances in Neural information processing systems, pages 769–776, 2007.
[31] John Langford and Matthias Seeger. Bounds for averaging classifiers. Technical report, Carnegie Mellon, Departement of
Computer Science, 2001.
[32] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems (NIPS),
2002.
[33] Gaël Letarte, Pascal Germain, Benjamin Guedj, and François Laviolette. Dichotomize and Generalize: PAC-Bayesian Binary
Activated Deep Neural Networks. arXiv:1905.10259, 2019. To appear at NeurIPS.
[34] G. Lever, F. Laviolette, and J. Shawe-Taylor. Distribution-dependent PAC-Bayes priors. In International Conference on
Algorithmic Learning Theory, pages 119–133. Springer, 2010.
[35] C. Li, W. Jiang, and M. Tanner. General oracle inequalities for Gibbs posterior with application to ranking. In Conference on
Learning Theory, pages 512–521, 2013.
[36] Le Li, Benjamin Guedj, and Sébastien Loustau. A quasi-Bayesian perspective to online clustering. Electron. J. Statist., 12(2):
3071–3113, 2018.
[37] B. London. A PAC-Bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in
Neural Information Processing Systems, pages 2931–2940, 2017.
[38] B. London, B. Huang, B. Taskar, and L. Getoor. PAC-Bayesian collective stability. In Artificial Intelligence and Statistics, pages
585–594, 2014.
[39] A. Maurer. A note on the PAC-Bayesian Theorem. arXiv preprint cs/0411099, 2004.
[40] D. A. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
[41] David McAllester. Some PAC-Bayesian theorems. In Proceedings of the International Conference on Computational Learning
Theory (COLT), 1998.

References IV
[42] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37, 1999.
[43] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3), 1999.
[44] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1), 2003.
[45] David McAllester. Simplified PAC-Bayesian margin bounds. In COLT, 2003.
[46] Zakaria Mhammedi, Peter D. Grunwald, and Benjamin Guedj. PAC-Bayes Un-Expected Bernstein Inequality. arXiv preprint
arXiv:1905.13367, 2019. Accepted at NeurIPS 2019.
[47] B. Neyshabur, S. Bhojanapalli, D. A. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in
Neural Information Processing Systems, pages 5947–5956, 2017.
[48] Kento Nozawa, Pascal Germain, and Benjamin Guedj. PAC-Bayesian contrastive unsupervised representation learning. arXiv
preprint arXiv:1910.04464, 2019.
[49] E. Parrado-Hernández, A. Ambroladze, J. Shawe-Taylor, and S. Sun. PAC-Bayes bounds with data dependent priors. Journal
of Machine Learning Research, 13:3507–3531, 2012.
[50] O. Rivasplata, E. Parrado-Hernandez, J. Shawe-Taylor, S. Sun, and C. Szepesvari. PAC-Bayes bounds for stable algorithms
with instance-dependent priors. In Advances in Neural Information Processing Systems, pages 9214–9224, 2018.
[51] M. Seeger. PAC-Bayesian generalization bounds for gaussian processes. Journal of Machine Learning Research, 3:233–269,
2002.
[52] M. Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations.
PhD thesis, University of Edinburgh, 2003.
[53] Y. Seldin and N. Tishby. PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11:
3595–3646, 2010.
[54] Y. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. PAC-Bayesian inequalities for martingales. IEEE
Transactions on Information Theory, 58(12):7086–7093, 2012.
[55] Yevgeny Seldin, Peter Auer, François Laviolette, John Shawe-Taylor, and Ronald Ortner. PAC-Bayesian analysis of contextual
bandits. In Advances in Neural Information Processing Systems (NIPS), 2011.

References V

[56] J. Shawe-Taylor and R. C. Williamson. A PAC analysis of a Bayes estimator. In Proceedings of the 10th annual conference on
Computational Learning Theory, pages 2–9. ACM, 1997. doi: 10.1145/267460.267466.
[57] John Shawe-Taylor and David Hardoon. PAC-Bayes analysis of maximum entropy classification. In Proceedings of the
International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
[58] Niklas Thiemann, Christian Igel, Olivier Wintenberger, and Yevgeny Seldin. A Strongly Quasiconvex PAC-Bayesian Bound. In
International Conference on Algorithmic Learning Theory, ALT, pages 466–492, 2017.
[59] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[60] Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the
ImageNet scale: a PAC-Bayesian compression approach. In ICLR, 2019.
