An Introduction to PAC-Bayesian Analysis

John Shawe-Taylor, Benjamin Guedj, Maria Perez Ortiz, Omar Rivasplata

University College London

AIDA e-Lecture Series

May 18, 2021
Learning is to be able to generalise

From examples, what can a system learn about the underlying phenomenon?

Memorising the already seen data is usually bad → overfitting.

Generalisation is the ability to 'perform' well on unseen data.

[Figure from Wikipedia]
Statistical Learning Theory is about high confidence

For a fixed algorithm, function class and sample size, generating random samples → a distribution of test errors.

Focusing on the mean of the error distribution?
  . can be misleading: the learner only has one sample

Statistical Learning Theory: the tail of the distribution
  . finding bounds which hold with high probability over random samples of size m

Compare to a statistical test at the 99% confidence level:
  . the chances of the conclusion not being true are less than 1%

PAC: probably approximately correct [59]
  Use a 'confidence parameter' δ: P^m[large error] ≤ δ
  δ is the probability of being misled by the training set

Hence high confidence: P^m[approximately correct] ≥ 1 − δ
Error distribution picture

[Figure: histograms of test errors over random samples, marking the means and the 95th percentiles]
Mathematical formalization

Learning algorithm A : Z^m → H
  • Z = X × Y, where X = set of inputs and Y = set of outputs (e.g. labels)
  • H = hypothesis class = set of predictors (e.g. classifiers)

Training set (aka sample): S_m = ((X_1, Y_1), ..., (X_m, Y_m)),
a finite sequence of input-output examples.

Classical assumptions:
  • A data-generating distribution P over Z.
  • The learner doesn't know P, and only sees the training set.
  • The training set examples are i.i.d. from P: S_m ∼ P^m
  . these can be relaxed (mostly beyond the scope of this tutorial)
What to achieve from the sample?

Use the available sample to:
  1 learn a predictor
  2 certify the predictor's performance

Learning a predictor:
  • algorithm driven by some learning principle
  • informed by prior knowledge, resulting in an inductive bias

Certifying performance:
  • what happens beyond the training set
  • generalization bounds

Actually, these two goals interact with each other!
Risk (aka error) measures

A loss function ℓ(h(X), Y) is used to measure the discrepancy between a predicted output h(X) and the true output Y.

Empirical risk (in-sample):      R_in(h) = (1/m) ∑_{i=1}^m ℓ(h(X_i), Y_i)

Theoretical risk (out-of-sample):   R_out(h) = E[ ℓ(h(X), Y) ]

Examples:
  • ℓ(h(X), Y) = 1[h(X) ≠ Y] : 0-1 loss (classification)
  • ℓ(h(X), Y) = (Y − h(X))² : square loss (regression)
  • ℓ(h(X), Y) = (1 − Y h(X))_+ : hinge loss
  • ℓ(h(X), Y) = − log(h(X)) : log loss (density estimation)
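To make these definitions concrete, here is a minimal code sketch (not part of the slides) of the empirical risk for the losses listed above. The synthetic data, the fixed linear classifier h and all numerical values are illustrative assumptions; NumPy is assumed available.

```python
import numpy as np

def zero_one_loss(pred, y):      # classification, labels in {-1, +1}
    return (pred != y).astype(float)

def square_loss(pred, y):        # regression
    return (y - pred) ** 2

def hinge_loss(pred, y):         # labels in {-1, +1}, pred is a real-valued score
    return np.maximum(0.0, 1.0 - y * pred)

def log_loss(pred, y=None):      # density estimation: pred = h(x) is a density value
    return -np.log(pred)

def empirical_risk(h, X, Y, loss):
    """R_in(h) = (1/m) * sum_i loss(h(X_i), Y_i)."""
    return float(np.mean(loss(h(X), Y)))

# Example: a fixed (hand-picked) linear classifier on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
h = lambda X: np.sign(X @ np.array([1.0, 0.0]))
print(empirical_risk(h, X, Y, zero_one_loss))
```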
Generalization

If predictor h does well on the in-sample (X, Y) pairs...
...will it still do well on out-of-sample pairs?

Generalization gap: Δ(h) = R_out(h) − R_in(h)

Upper bounds: w.h.p. Δ(h) ≤ ε(m, δ)
  ▸ R_out(h) ≤ R_in(h) + ε(m, δ)

Lower bounds: w.h.p. Δ(h) ≥ ε̃(m, δ)

Flavours:
  distribution-free / distribution-dependent
  algorithm-free / algorithm-dependent
Before PAC-Bayes

Single hypothesis h (building block):
  with probability ≥ 1 − δ,  R_out(h) ≤ R_in(h) + √( (1/(2m)) log(1/δ) ).

Finite function class H (worst-case approach):
  w.p. ≥ 1 − δ,  ∀h ∈ H,  R_out(h) ≤ R_in(h) + √( (1/(2m)) log(|H|/δ) ).

Structural risk minimisation: data-dependent hypotheses h_i, each associated with a prior weight p_i:
  w.p. ≥ 1 − δ,  ∀h_i ∈ H,  R_out(h_i) ≤ R_in(h_i) + √( (1/(2m)) log(1/(p_i δ)) ).

Uncountably infinite function class: VC dimension, Rademacher complexity...

These approaches are suited to analyse the performance of individual functions, and take some account of correlations.

→ Extension: PAC-Bayes allows us to consider distributions over hypotheses.
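The three bounds above translate directly into code. The following sketch evaluates each right-hand side for given R_in, m and δ; the numbers plugged in at the end are made up and purely illustrative.

```python
import math

def single_hypothesis_bound(r_in, m, delta):
    """R_out(h) <= R_in(h) + sqrt(log(1/delta) / (2m)), w.p. >= 1 - delta."""
    return r_in + math.sqrt(math.log(1.0 / delta) / (2 * m))

def finite_class_bound(r_in, m, delta, class_size):
    """Union bound over |H| hypotheses: sqrt(log(|H|/delta) / (2m))."""
    return r_in + math.sqrt(math.log(class_size / delta) / (2 * m))

def srm_bound(r_in, m, delta, p_i):
    """Structural risk minimisation with prior weight p_i on hypothesis h_i."""
    return r_in + math.sqrt(math.log(1.0 / (p_i * delta)) / (2 * m))

# Illustrative (made-up) numbers, not from any experiment.
m, delta = 1000, 0.01
print(single_hypothesis_bound(0.05, m, delta))
print(finite_class_bound(0.05, m, delta, class_size=10**6))
print(srm_bound(0.05, m, delta, p_i=2.0 ** -20))
```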
The PAC-Bayes framework

Before data, fix a distribution P ∈ M_1(H)   . 'prior'

Based on data, learn a distribution Q ∈ M_1(H)   . 'posterior'

Predictions:
  • draw h ∼ Q and predict with the chosen h.
  • each prediction with a fresh random draw.

The risk measures R_in(h) and R_out(h) are extended by averaging:

  R_in(Q) ≡ ∫_H R_in(h) dQ(h)        R_out(Q) ≡ ∫_H R_out(h) dQ(h)

KL(Q‖P) = E_{h∼Q} ln( Q(h)/P(h) ) is the Kullback-Leibler divergence.
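As a small illustration of these objects, the sketch below works with a finite hypothesis class, so the integrals become sums. The threshold classifiers, the particular posterior weights and the synthetic data are assumptions made purely for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite H: threshold classifiers h_t(x) = sign(x - t) on 1-D inputs.
thresholds = np.linspace(-2, 2, 41)
H = [lambda x, t=t: np.sign(x - t) for t in thresholds]

P = np.full(len(H), 1.0 / len(H))                  # 'prior' over H (uniform)
logits = -10.0 * thresholds**2                     # pretend these came from data
Q = np.exp(logits - logits.max()); Q /= Q.sum()    # 'posterior' over H

def risk(h, X, Y):
    return float(np.mean(h(X) != Y))

X = rng.normal(size=500)
Y = np.sign(X + 0.2 * rng.normal(size=500))

# Averaged (Gibbs) in-sample risk: R_in(Q) = sum_h Q(h) R_in(h).
R_in_Q = sum(q * risk(h, X, Y) for q, h in zip(Q, H))

# Randomised prediction: a fresh draw h ~ Q for each prediction.
def gibbs_predict(x):
    h = H[rng.choice(len(H), p=Q)]
    return h(x)

# KL(Q || P) for finite distributions over H.
KL_QP = float(np.sum(Q * np.log(Q / P)))
print(R_in_Q, KL_QP, gibbs_predict(0.3))
```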
PAC-Bayes aka Generalised Bayes

"Prior": exploration mechanism of H.

"Posterior": the prior, twisted after confronting it with data.
PAC-Bayes bounds vs. Bayesian learning

Prior
  • PAC-Bayes: bounds hold for any prior distribution
  • Bayes: the prior choice impacts inference

Posterior
  • PAC-Bayes: bounds hold for any posterior distribution
  • Bayes: the posterior is uniquely defined by the prior and the statistical model

Data distribution
  • PAC-Bayes: bounds hold for any data distribution
  • Bayes: randomness lies in the noise model generating the output
A General PAC-Bayesian Theorem

∆-function: a "distance" between R_in(Q) and R_out(Q);
a convex function ∆ : [0, 1] × [0, 1] → R.

General theorem (Bégin et al. [7, 8], Germain [21])

For any distribution D on X × Y, for any set H of voters, for any distribution P on H, for any δ ∈ (0, 1], and for any ∆-function, we have, with probability at least 1 − δ over the choice of S ∼ D^m,

  ∀ Q on H :   ∆( R_in(Q), R_out(Q) ) ≤ (1/m) [ KL(Q‖P) + ln( I_∆(m) / δ ) ],

where

  I_∆(m) = sup_{r ∈ [0,1]} ∑_{k=0}^m  C(m,k) r^k (1 − r)^{m−k}  e^{m ∆(k/m, r)},

and C(m,k) r^k (1 − r)^{m−k} = Bin(k; m, r), the binomial probability mass function.
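The quantity I_∆(m) can be evaluated numerically. The sketch below (assuming NumPy and SciPy are available) computes it for ∆ = kl on a grid over r, using a log-sum-exp for stability; for this choice of ∆ the value stays below 2√m, the constant used in the corollary later on.

```python
import numpy as np
from scipy.special import gammaln

def kl_binary(q, p, eps=1e-12):
    q = np.clip(q, eps, 1 - eps); p = np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def log_binom_pmf(k, m, r, eps=1e-12):
    r = np.clip(r, eps, 1 - eps)
    return (gammaln(m + 1) - gammaln(k + 1) - gammaln(m - k + 1)
            + k * np.log(r) + (m - k) * np.log(1 - r))

def I_delta(m, delta_fn, grid=2001):
    """sup over r of sum_k Bin(k; m, r) * exp(m * Delta(k/m, r))."""
    k = np.arange(m + 1)
    best = -np.inf
    for r in np.linspace(0.0, 1.0, grid):
        logs = log_binom_pmf(k, m, r) + m * delta_fn(k / m, r)
        best = max(best, np.logaddexp.reduce(logs))   # log-sum-exp
    return np.exp(best)

m = 200
print(I_delta(m, kl_binary), 2 * np.sqrt(m))
```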
Proof of the general theorem

General theorem
  Pr_{S∼D^m} [ ∀ Q on H : ∆( R_in(Q), R_out(Q) ) ≤ (1/m) ( KL(Q‖P) + ln( I_∆(m)/δ ) ) ] ≥ 1 − δ.

Proof ideas.

Change of Measure Inequality
For any P and Q on H, and for any measurable function φ : H → R, we have

  − ln E_{h∼P} e^{φ(h)}  =  − ln E_{h∼Q} [ (P(h)/Q(h)) e^{φ(h)} ]
                         ≤  E_{h∼Q} ln( Q(h)/P(h) ) − E_{h∼Q} φ(h)
                         =  KL(Q‖P) − E_{h∼Q} φ(h).

Markov's inequality
For a random variable X satisfying X ≥ 0,

  Pr(X ≥ a) ≤ E[X]/a   ⟺   Pr( X ≤ E[X]/δ ) ≥ 1 − δ.
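A quick numerical sanity check (illustrative only) of the change-of-measure inequality, rewritten as E_{h∼Q} φ(h) ≤ KL(Q‖P) + ln E_{h∼P} e^{φ(h)}, on a randomly generated finite hypothesis class; the Dirichlet distributions and the function φ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
P = rng.dirichlet(np.ones(n))         # any 'prior' on a finite H
Q = rng.dirichlet(np.ones(n))         # any 'posterior' on the same H
phi = rng.normal(scale=3.0, size=n)   # any measurable function H -> R

lhs = float(np.sum(Q * phi))                               # E_{h~Q} phi(h)
kl = float(np.sum(Q * np.log(Q / P)))                      # KL(Q || P)
rhs = kl + float(np.log(np.sum(P * np.exp(phi))))          # + ln E_{h~P} e^phi
assert lhs <= rhs + 1e-9
print(lhs, rhs)
```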
Proof of the general theorem

Probability of observing k misclassifications among m examples

Given a voter h, consider a binomial variable of m trials with success probability R_out(h):

  Pr_{S∼D^m} [ R_in(h) = k/m ] = C(m,k) R_out(h)^k ( 1 − R_out(h) )^{m−k} = Bin( k; m, R_out(h) ).
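This binomial step can be checked by simulation: for a fixed voter with a chosen true risk, the error count m·R_in(h) over fresh i.i.d. samples follows the binomial probability mass function. A small sketch (the risk value 0.3, the sample size and the number of trials are arbitrary):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)
m, r_out, trials = 50, 0.3, 100_000

# Each trial: draw m i.i.d. 0/1 losses with mean r_out and count the errors.
counts = rng.binomial(1, r_out, size=(trials, m)).sum(axis=1)

k = 10
empirical = float(np.mean(counts == k))
exact = comb(m, k) * r_out**k * (1 - r_out)**(m - k)
print(empirical, exact)   # should agree up to Monte Carlo error
```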
Proof of the general theorem (continued)

  Pr_{S∼D^m} [ ∀ Q on H : ∆( R_in(Q), R_out(Q) ) ≤ (1/m) ( KL(Q‖P) + ln( I_∆(m)/δ ) ) ] ≥ 1 − δ.

Proof.

  m · ∆( E_{h∼Q} R_in(h), E_{h∼Q} R_out(h) )

  (Jensen's inequality)     ≤ E_{h∼Q} m · ∆( R_in(h), R_out(h) )

  (Change of measure)       ≤ KL(Q‖P) + ln E_{h∼P} e^{m ∆( R_in(h), R_out(h) )}

  (Markov's inequality,
   w.p. ≥ 1 − δ)            ≤ KL(Q‖P) + ln [ (1/δ) E_{S'∼D^m} E_{h∼P} e^{m ∆( R_in(h), R_out(h) )} ]

  (Expectation swap)        = KL(Q‖P) + ln [ (1/δ) E_{h∼P} E_{S'∼D^m} e^{m ∆( R_in(h), R_out(h) )} ]

  (Binomial law)            = KL(Q‖P) + ln [ (1/δ) E_{h∼P} ∑_{k=0}^m Bin( k; m, R_out(h) ) e^{m ∆( k/m, R_out(h) )} ]

  (Supremum over risk)      ≤ KL(Q‖P) + ln [ (1/δ) sup_{r∈[0,1]} ∑_{k=0}^m Bin( k; m, r ) e^{m ∆( k/m, r )} ]

                            = KL(Q‖P) + ln( I_∆(m) / δ ).
General theorem

  Pr_{S∼D^m} [ ∀ Q on H : ∆( R_in(Q), R_out(Q) ) ≤ (1/m) ( KL(Q‖P) + ln( I_∆(m)/δ ) ) ] ≥ 1 − δ.

Corollary
[...] with probability at least 1 − δ over the choice of S ∼ D^m, for all Q on H:

  (a)  kl( R_in(Q), R_out(Q) ) ≤ (1/m) [ KL(Q‖P) + ln( 2√m / δ ) ],                      Langford and Seeger [31]

  (b)  R_out(Q) ≤ R_in(Q) + √( (1/(2m)) [ KL(Q‖P) + ln( 2√m / δ ) ] ),                   McAllester [40, 43]

  (c)  R_out(Q) ≤ (1 / (1 − e^{−c})) [ c · R_in(Q) + (1/m) ( KL(Q‖P) + ln(1/δ) ) ],      Catoni [11]

  (d)  R_out(Q) ≤ R_in(Q) + (1/λ) [ KL(Q‖P) + ln(1/δ) + f(λ, m) ].                       Alquier et al. [4]

with the ∆-functions

  kl(q, p)   := q ln(q/p) + (1 − q) ln( (1 − q)/(1 − p) )  ≥  2(q − p)²,
  ∆_c(q, p)  := − ln[ 1 − (1 − e^{−c}) · p ] − c · q,
  ∆_λ(q, p)  := (λ/m) (p − q).
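The corollary's bounds are easy to evaluate once R_in(Q), KL(Q‖P), m and δ are known. Below is a hedged sketch: the Seeger bound (a) is inverted by bisection on the binary kl, and the example numbers at the bottom are made up, not taken from any experiment.

```python
import math

def kl_binary(q, p, eps=1e-12):
    q = min(max(q, eps), 1 - eps); p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, A, tol=1e-9):
    """max{ p in [q, 1) : kl(q, p) <= A }, found by bisection."""
    lo, hi = q, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_binary(q, mid) <= A:
            lo = mid
        else:
            hi = mid
    return lo

def seeger_bound(r_in, kl_qp, m, delta):            # corollary (a)
    A = (kl_qp + math.log(2 * math.sqrt(m) / delta)) / m
    return kl_inverse(r_in, A)

def mcallester_bound(r_in, kl_qp, m, delta):        # corollary (b)
    return r_in + math.sqrt((kl_qp + math.log(2 * math.sqrt(m) / delta)) / (2 * m))

def catoni_bound(r_in, kl_qp, m, delta, c=1.0):     # corollary (c)
    return (1 / (1 - math.exp(-c))) * (c * r_in + (kl_qp + math.log(1 / delta)) / m)

# Illustrative (made-up) inputs.
r_in, kl_qp, m, delta = 0.1, 5.0, 10_000, 0.05
print(seeger_bound(r_in, kl_qp, m, delta),
      mcallester_bound(r_in, kl_qp, m, delta),
      catoni_bound(r_in, kl_qp, m, delta))
```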
Proof of the Langford/Seeger bound

Follows immediately from the General Theorem by choosing ∆(q, p) = kl(q, p).

Indeed, in that case we have

  E_{S∼D^m} E_{h∼P} e^{m kl( R_S(h), R(h) )}
    = E_{h∼P} E_{S∼D^m} [ ( R_S(h) / R(h) )^{m R_S(h)} ( (1 − R_S(h)) / (1 − R(h)) )^{m (1 − R_S(h))} ]
    = E_{h∼P} ∑_{k=0}^m Pr_{S∼D^m}( R_S(h) = k/m ) ( (k/m) / R(h) )^k ( (1 − k/m) / (1 − R(h)) )^{m−k}
    = ∑_{k=0}^m C(m,k) (k/m)^k (1 − k/m)^{m−k}                                            (1)
    ≤ 2√m.

Note that, in line (1) of the proof, Pr_{S∼D^m}( R_S(h) = k/m ) is replaced by the probability mass function of the binomial.

This is only true if the examples of S are drawn i.i.d. (i.e., S ∼ D^m).

So this result is no longer valid in the non-i.i.d. case, even if the General Theorem is.
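The combinatorial sum in line (1), ∑_k C(m,k)(k/m)^k(1−k/m)^{m−k}, and its 2√m upper bound can be checked directly for moderate m. This is a purely numerical illustration, complementing the earlier sketch of I_∆(m):

```python
from math import comb, sqrt

def maurer_sum(m):
    # sum_k C(m,k) (k/m)^k (1 - k/m)^(m-k); note 0**0 == 1 in Python.
    return sum(comb(m, k) * (k / m) ** k * (1 - k / m) ** (m - k)
               for k in range(m + 1))

for m in (10, 100, 1000):
    print(m, maurer_sum(m), 2 * sqrt(m))
```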
Linear classifiers

We will choose the prior and posterior distributions to be Gaussians with unit variance.

The prior P will be centred at the origin, with unit variance.

The centre of the posterior Q(w, µ) will be specified by a unit vector w and a scale factor µ.
PAC-Bayes Bound for SVM (1/2)

Prior P is Gaussian N(0, 1), centred at the origin.

Posterior Q is Gaussian, in the direction w, at distance µ from the origin.

[Figure: the prior P centred at the origin and the posterior Q centred at distance µ along the direction w]
PAC-Bayes Bound for SVM (2/2)

The linear classifier's performance may be bounded by

  KL( Q̂_S(w, µ) ‖ Q_D(w, µ) ) ≤ [ KL( P ‖ Q(w, µ) ) + ln( (m + 1)/δ ) ] / m

Q_D(w, µ): the true performance of the stochastic classifier.

The SVM is a deterministic classifier that exactly corresponds to sgn( E_{c∼Q(w,µ)}[c(x)] ), since the centre of the Gaussian gives the same classification as the halfspace with more weight.

Hence its error is bounded by 2 Q_D(w, µ): as observed above, if x is misclassified then at least half of the classifiers c ∼ Q err on x.
The other quantities in the bound are as follows.

Q̂_S(w, µ): a stochastic measure of the training error,

  Q̂_S(w, µ) = E_m[ F̃( µ γ(x, y) ) ],

with the normalised margin and the Gaussian tail function

  γ(x, y) = y wᵀ φ(x) / ( ‖φ(x)‖ ‖w‖ ),
  F̃(t) = 1 − (1/√(2π)) ∫_{−∞}^t e^{−x²/2} dx.
Prior P ≡ Gaussian centred at the origin.

Posterior Q ≡ Gaussian along w, at a distance µ from the origin.

For these two unit-variance Gaussians, KL(P‖Q) = µ²/2.
δ is the confidence parameter: the bound holds with probability 1 − δ over the random i.i.d. selection of the training data.
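Putting the pieces together, the sketch below evaluates Q̂_S(w, µ) and KL(P‖Q) = µ²/2 for a given unit vector w and scale µ, assuming the feature map is available explicitly as rows of a matrix. The data, w and µ are illustrative assumptions, and SciPy's Gaussian tail function norm.sf stands in for F̃.

```python
import numpy as np
from scipy.stats import norm

def stochastic_train_error(Phi, y, w, mu):
    """Q_S_hat(w, mu) = E_m[ F_tilde(mu * gamma(x, y)) ], with
    gamma(x, y) = y * <w, phi(x)> / (||phi(x)|| * ||w||) and F_tilde the Gaussian tail."""
    gamma = y * (Phi @ w) / (np.linalg.norm(Phi, axis=1) * np.linalg.norm(w))
    return float(np.mean(norm.sf(mu * gamma)))   # sf(t) = 1 - Phi_cdf(t)

def kl_prior_posterior(mu):
    """KL between the unit-variance Gaussians P (at the origin) and Q (at mu*w)."""
    return 0.5 * mu ** 2

# Illustrative data and parameters.
rng = np.random.default_rng(4)
Phi = rng.normal(size=(200, 5))                  # explicit features phi(x) as rows
w = rng.normal(size=5); w /= np.linalg.norm(w)   # a unit-norm weight vector
y = np.sign(Phi @ w + 0.3 * rng.normal(size=200))
print(stochastic_train_error(Phi, y, w, mu=3.0), kl_prior_posterior(3.0))
```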
Form of the SVM bound

Note that the bound holds for all posterior distributions, so we can choose µ to optimise the bound.

If we define the inverse of the KL by

  KL⁻¹(q, A) = max{ p : KL(q‖p) ≤ A },

then we have, with probability at least 1 − δ,

  Pr( ⟨w, φ(x)⟩ ≠ y ) ≤ 2 min_µ KL⁻¹( E_m[ F̃( µ γ(x, y) ) ],  ( µ²/2 + ln( (m + 1)/δ ) ) / m ).
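A sketch of how this optimised bound might be computed in practice: invert the binary kl by bisection and minimise over a grid of µ values. The margin vector gamma is synthetic here, standing in for the normalised margins y⟨w, φ(x)⟩/(‖φ(x)‖‖w‖) of a trained SVM, and the µ grid and δ are arbitrary choices.

```python
import math
import numpy as np
from scipy.stats import norm

def kl(q, p, eps=1e-12):
    q = min(max(q, eps), 1 - eps); p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, A):
    """max{ p >= q : kl(q, p) <= A } by bisection."""
    lo, hi = q, 1.0 - 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if kl(q, mid) <= A else (lo, mid)
    return lo

def svm_pac_bayes_bound(gamma, delta=0.05, mus=np.linspace(0.1, 20, 200)):
    m = len(gamma)
    best = 1.0
    for mu in mus:
        q_hat = float(np.mean(norm.sf(mu * gamma)))          # stochastic train error
        A = (0.5 * mu**2 + math.log((m + 1) / delta)) / m    # complexity term
        best = min(best, 2.0 * kl_inverse(q_hat, A))         # factor 2: Gibbs -> SVM
    return best

# Synthetic margins, mostly positive, standing in for a trained classifier.
rng = np.random.default_rng(5)
gamma = np.abs(rng.normal(0.5, 0.3, size=2000)) * rng.choice([1, 1, 1, -1], size=2000)
print(svm_pac_bayes_bound(gamma))
```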
Gives SVM Optimisation

Primal form:
  min_{w, ξ}   (1/2) ‖w‖² + C ∑_{i=1}^m ξ_i
  s.t.         y_i wᵀ φ(x_i) ≥ 1 − ξ_i,   i = 1, ..., m
               ξ_i ≥ 0,                   i = 1, ..., m

Dual form:
  max_α   ∑_{i=1}^m α_i − (1/2) ∑_{i,j=1}^m α_i α_j y_i y_j κ(x_i, x_j)
  s.t.    0 ≤ α_i ≤ C,   i = 1, ..., m

where κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ and ⟨w, φ(x)⟩ = ∑_{i=1}^m α_i y_i κ(x_i, x).
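For completeness, a hedged scikit-learn sketch of how the ingredients of the bound could be recovered from a trained kernel SVM. The dataset and hyperparameters are invented for illustration, and the intercept is removed because the analysis above is for classifiers without a bias term.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=400) > 0, 1, -1)

gamma_rbf = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=gamma_rbf).fit(X, y)

# <w, phi(x)> = sum_i alpha_i y_i k(x_i, x); subtract sklearn's intercept to drop the bias.
scores = clf.decision_function(X) - clf.intercept_[0]

# ||w||^2 = sum_{i,j} (alpha_i y_i)(alpha_j y_j) k(x_i, x_j); SVC stores alpha_i*y_i in dual_coef_.
K = rbf_kernel(clf.support_vectors_, clf.support_vectors_, gamma=gamma_rbf)
w_norm = np.sqrt(clf.dual_coef_ @ K @ clf.dual_coef_.T).item()

# For the RBF kernel k(x, x) = 1, so ||phi(x)|| = 1 and gamma(x, y) = y*scores/||w||.
margins = y * scores / w_norm
print("bias-free training error:", float(np.mean(margins <= 0)))
```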
Slack variable conversion

[Figure: plot of the slack variable conversion over margin values ranging from −2 to 2]
Model Selection with the new bound: setup

Comparison of 10-fold cross-validation, the PAC-Bayes Bound and the Prior PAC-Bayes Bound.

UCI datasets.

Select the C and σ that lead to minimum Classification Error (CE):
  • for 10-fold cross-validation, select the pair that minimises the validation error;
  • for the PAC-Bayes Bound and the Prior PAC-Bayes Bound, select the pair that minimises the bound.
Results

                        SVM                                   ηPrior SVM
Problem              2FCV     10FCV    PAC      PrPAC       PrPAC    τ-PrPAC
digits     Bound     –        –        0.175    0.107       0.050    0.047
           TE        0.007    0.007    0.007    0.014       0.010    0.009
waveform   Bound     –        –        0.203    0.185       0.178    0.176
           TE        0.090    0.086    0.084    0.088       0.087    0.086
pima       Bound     –        –        0.424    0.420       0.428    0.416
           TE        0.244    0.245    0.229    0.229       0.233    0.233
ringnorm   Bound     –        –        0.203    0.110       0.053    0.050
           TE        0.016    0.016    0.018    0.018       0.016    0.016
spam       Bound     –        –        0.254    0.198       0.186    0.178
           TE        0.066    0.063    0.067    0.077       0.070    0.072
Average    TE        0.0846   0.0834   0.081    0.0852      0.0832   0.0832
Take home messages

The bounds are remarkably tight: for the final column, the average factor between bound and TE is under 3.

Model selection from the bounds is as good as 10-fold cross-validation: in fact, all but one of the PAC-Bayes model selections give better averages for TE.

The better bounds do not appear to give better model selection: the best model selection comes from the simplest bound.

A. Ambroladze, E. Parrado-Hernández, and J. Shawe-Taylor. Tighter PAC-Bayes bounds. In Advances in Neural Information Processing Systems 18, (2006), pages 9–16.

P. Germain, A. Lacasse, F. Laviolette and M. Marchand. PAC-Bayesian learning of linear classifiers. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009, Montréal, Canada). ACM Press (2009), 382, pages 453–460.
Deep Learning Results

[Figures: deep learning results not reproduced here]
A flexible framework

Since 1997, PAC-Bayes has been successfully used in many machine learning settings (this list is by no means exhaustive).

Statistical learning theory: Audibert and Bousquet [6], Catoni [9, 10], Guedj [25], Guedj and Pujol [27], Maurer [39], McAllester [41, 42, 44, 45], Mhammedi et al. [46], Seeger [51, 52], Shawe-Taylor and Williamson [56], Thiemann et al. [58]

SVMs & linear classifiers: Germain et al. [19], Langford and Shawe-Taylor [32], McAllester [44]

Supervised learning algorithms reinterpreted as bound minimizers: Ambroladze et al. [5], Germain et al. [22], Shawe-Taylor and Hardoon [57]

High-dimensional regression: Alquier and Biau [1], Alquier and Lounici [2], Guedj and Robbiano [24], Guedj and Alquier [26], Li et al. [35]

Classification: Catoni [9, 10], Lacasse et al. [30], Langford and Shawe-Taylor [32], Parrado-Hernández et al. [49]
A flexible framework (continued)

Transductive learning, domain adaptation: Bégin et al. [7], Derbeko et al. [12], Germain et al. [20], Nozawa et al. [48]

Non-iid or heavy-tailed data: Alquier and Guedj [3], Holland [29], Lever et al. [34], Seldin et al. [54, 55]

Density estimation: Higgs and Shawe-Taylor [28], Seldin and Tishby [53]

Reinforcement learning: Fard and Pineau [16], Fard et al. [17], Ghavamzadeh et al. [23], Seldin et al. [54, 55]

Sequential learning: Gerchinovitz [18], Li et al. [36]

Algorithmic stability, differential privacy: Dziugaite and Roy [13, 14], London [37], London et al. [38], Rivasplata et al. [50]

Deep neural networks: Dziugaite and Roy [15], Letarte et al. [33], Neyshabur et al. [47], Zhou et al. [60]

...
References I
[1] P. Alquier and G. Biau. Sparse single-index model. Journal of Machine Learning Research, 14:243–280, 2013.
[2] P. Alquier and K. Lounici. PAC-Bayesian theorems for sparse regression estimation with exponential weights. Electronic
Journal of Statistics, 5:127–145, 2011.
[3] Pierre Alquier and Benjamin Guedj. Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5):887–902, 2018.
[4] Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of variational approximations of Gibbs posteriors. ArXiv
e-prints, 2015. URL http://arxiv.org/abs/1506.04091.
[5] A. Ambroladze, E. Parrado-Hernández, and J. Shawe-taylor. Tighter PAC-Bayes bounds. In Advances in Neural Information
Processing Systems, NIPS, pages 9–16, 2007.
[6] Jean-Yves Audibert and Olivier Bousquet. Combining PAC-Bayesian and generic chaining bounds. Journal of Machine
Learning Research, 2007.
[7] Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy. PAC-Bayesian theory for transductive learning. In
AISTATS, 2014.
[8] Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy. PAC-Bayesian bounds based on the Rényi divergence.
In AISTATS, 2016.
[9] O. Catoni. Statistical Learning Theory and Stochastic Optimization. École d’Été de Probabilités de Saint-Flour 2001. Springer,
2004.
[10] O. Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56 of Lecture notes
– Monograph Series. Institute of Mathematical Statistics, 2007.
[11] Olivier Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning, volume 56. Inst. of
Mathematical Statistic, 2007.
[12] Philip Derbeko, Ran El-Yaniv, and Ron Meir. Explicit learning curves for transduction and application to clustering and
compression algorithms. J. Artif. Intell. Res. (JAIR), 22, 2004.
[13] G. K. Dziugaite and D. M. Roy. Data-dependent PAC-Bayes priors via differential privacy. In NeurIPS, 2018.
[14] G. K. Dziugaite and D. M. Roy. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of
Entropy-SGD and data-dependent priors. In International Conference on Machine Learning, pages 1376–1385, 2018.

References II
[15] Gintare K. Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks
with many more parameters than training data. In Proceedings of Uncertainty in Artificial Intelligence (UAI), 2017.
[16] Mahdi Milani Fard and Joelle Pineau. PAC-Bayesian model selection for reinforcement learning. In Advances in Neural
Information Processing Systems (NIPS), 2010.
[17] Mahdi Milani Fard, Joelle Pineau, and Csaba Szepesvári. PAC-Bayesian Policy Evaluation for Reinforcement Learning. In UAI,
Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 195–202, 2011.
[18] S. Gerchinovitz. Prédiction de suites individuelles et cadre statistique classique : étude de quelques liens autour de la
régression parcimonieuse et des techniques d’agrégation. PhD thesis, Université Paris-Sud, 2011.
[19] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In Proceedings of the
26th Annual International Conference on Machine Learning, ICML, 2009.
[20] P. Germain, A. Habrard, F. Laviolette, and E. Morvant. A new PAC-Bayesian perspective on domain adaptation. In Proceedings
of International Conference on Machine Learning, volume 48, 2016.
[21] Pascal Germain. Généralisations de la théorie PAC-bayésienne pour l’apprentissage inductif, l’apprentissage transductif et
l’adaptation de domaine. PhD thesis, Université Laval, 2015.
[22] Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, and François Laviolette. From PAC-Bayes bounds to KL
regularization. In Advances in Neural Information Processing Systems, pages 603–610, 2009.
[23] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar. Bayesian reinforcement learning: A survey. Foundations and Trends in
Machine Learning, 8(5-6):359–483, 2015.
[24] B. Guedj and S. Robbiano. PAC-Bayesian high dimensional bipartite ranking. Journal of Statistical Planning and Inference,
196:70 – 86, 2018. ISSN 0378-3758.
[25] Benjamin Guedj. A primer on PAC-Bayesian learning. arXiv:1901.05353, 2019. To appear in the Proceedings of the French
Mathematical Society.
[26] Benjamin Guedj and Pierre Alquier. PAC-Bayesian estimation and prediction in sparse additive models. Electron. J. Statist., 7:
264–291, 2013.
[27] Benjamin Guedj and Louis Pujol. Still no free lunches: the price to pay for tighter PAC-Bayes bounds. arXiv preprint
arXiv:1910.04460, 2019.

References III
[28] Matthew Higgs and John Shawe-Taylor. A PAC-Bayes bound for tailored density estimation. In Proceedings of the International
Conference on Algorithmic Learning Theory (ALT), 2010.
[29] Matthew J Holland. PAC-Bayes under potentially heavy tails. arXiv:1905.07900, 2019. To appear in NeurIPS.
[30] A. Lacasse, F. Laviolette, M. Marchand, P. Germain, and N. Usunier. PAC-Bayes bounds for the risk of the majority vote and
the variance of the Gibbs classifier. In Advances in Neural information processing systems, pages 769–776, 2007.
[31] John Langford and Matthias Seeger. Bounds for averaging classifiers. Technical report, Carnegie Mellon, Departement of
Computer Science, 2001.
[32] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems (NIPS),
2002.
[33] Gaël Letarte, Pascal Germain, Benjamin Guedj, and François Laviolette. Dichotomize and Generalize: PAC-Bayesian Binary
Activated Deep Neural Networks. arXiv:1905.10259, 2019. To appear at NeurIPS.
[34] G. Lever, F. Laviolette, and J. Shawe-Taylor. Distribution-dependent PAC-Bayes priors. In International Conference on
Algorithmic Learning Theory, pages 119–133. Springer, 2010.
[35] C. Li, W. Jiang, and M. Tanner. General oracle inequalities for Gibbs posterior with application to ranking. In Conference on
Learning Theory, pages 512–521, 2013.
[36] Le Li, Benjamin Guedj, and Sébastien Loustau. A quasi-Bayesian perspective to online clustering. Electron. J. Statist., 12(2):
3071–3113, 2018.
[37] B. London. A PAC-Bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in
Neural Information Processing Systems, pages 2931–2940, 2017.
[38] B. London, B. Huang, B. Taskar, and L. Getoor. PAC-Bayesian collective stability. In Artificial Intelligence and Statistics, pages
585–594, 2014.
[39] A. Maurer. A note on the PAC-Bayesian Theorem. arXiv preprint cs/0411099, 2004.
[40] D. A. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
[41] David McAllester. Some PAC-Bayesian theorems. In Proceedings of the International Conference on Computational Learning
Theory (COLT), 1998.

References IV
[42] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37, 1999.
[43] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3), 1999.
[44] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1), 2003.
[45] David McAllester. Simplified PAC-Bayesian margin bounds. In COLT, 2003.
[46] Zakaria Mhammedi, Peter D. Grunwald, and Benjamin Guedj. PAC-Bayes Un-Expected Bernstein Inequality. arXiv preprint
arXiv:1905.13367, 2019. Accepted at NeurIPS 2019.
[47] B. Neyshabur, S. Bhojanapalli, D. A. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in
Neural Information Processing Systems, pages 5947–5956, 2017.
[48] Kento Nozawa, Pascal Germain, and Benjamin Guedj. PAC-Bayesian contrastive unsupervised representation learning. arXiv
preprint arXiv:1910.04464, 2019.
[49] E. Parrado-Hernández, A. Ambroladze, J. Shawe-Taylor, and S. Sun. PAC-Bayes bounds with data dependent priors. Journal
of Machine Learning Research, 13:3507–3531, 2012.
[50] O. Rivasplata, E. Parrado-Hernandez, J. Shawe-Taylor, S. Sun, and C. Szepesvari. PAC-Bayes bounds for stable algorithms
with instance-dependent priors. In Advances in Neural Information Processing Systems, pages 9214–9224, 2018.
[51] M. Seeger. PAC-Bayesian generalization bounds for gaussian processes. Journal of Machine Learning Research, 3:233–269,
2002.
[52] M. Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations.
PhD thesis, University of Edinburgh, 2003.
[53] Y. Seldin and N. Tishby. PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11:
3595–3646, 2010.
[54] Y. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. PAC-Bayesian inequalities for martingales. IEEE
Transactions on Information Theory, 58(12):7086–7093, 2012.
[55] Yevgeny Seldin, Peter Auer, François Laviolette, John Shawe-Taylor, and Ronald Ortner. PAC-Bayesian analysis of contextual
bandits. In Advances in Neural Information Processing Systems (NIPS), 2011.

References V

[56] J. Shawe-Taylor and R. C. Williamson. A PAC analysis of a Bayes estimator. In Proceedings of the 10th annual conference on
Computational Learning Theory, pages 2–9. ACM, 1997. doi: 10.1145/267460.267466.
[57] John Shawe-Taylor and David Hardoon. PAC-Bayes analysis of maximum entropy classification. In Proceedings of the
International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
[58] Niklas Thiemann, Christian Igel, Olivier Wintenberger, and Yevgeny Seldin. A Strongly Quasiconvex PAC-Bayesian Bound. In
International Conference on Algorithmic Learning Theory, ALT, pages 466–492, 2017.
[59] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[60] Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the
ImageNet scale: a PAC-Bayesian compression approach. In ICLR, 2019.
