PAC-Bayesian Learning: Introduction and Analysis
John Shawe-Taylor, Benjamin Guedj, Maria Perez Ortiz, Omar Rivasplata
Learning is to be able to generalise
Statistical Learning Theory is about high confidence
For a fixed algorithm, function class and sample size, generating random samples → distribution of test errors.
Focusing on the mean of the error distribution?
• can be misleading: the learner only has one sample
Error distribution picture
[Figure: distribution of test errors over random samples, with the means and the 95th percentiles marked.]
Mathematical formalization
Learning algorithm A : Z^m → H
• Z = X × Y, where X = set of inputs and Y = set of outputs (e.g. labels)
• H = hypothesis class = set of predictors (e.g. classifiers)
Training set (aka sample): S_m = ((X_1, Y_1), ..., (X_m, Y_m)), a finite sequence of input-output examples.
Classical assumptions:
• A data-generating distribution P over Z.
• The learner doesn't know P, it only sees the training set.
• The training set examples are i.i.d. from P: S_m ∼ P^m.
  These assumptions can be relaxed (mostly beyond the scope of this tutorial).
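To fix ideas, here is a minimal sketch of this setup in Python. The toy distribution P and the least-squares "algorithm" A are our own illustrative assumptions, not part of the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_S(m):
    """Draw an i.i.d. sample S_m ~ P^m from a toy data-generating distribution P
    over Z = X x Y, with X = R^2 and Y = {-1, +1}."""
    X = rng.normal(size=(m, 2))
    Y = np.where(X[:, 0] + 0.5 * rng.normal(size=m) > 0, 1, -1)  # noisy labels
    return X, Y

def A(S):
    """A toy learning algorithm A : Z^m -> H, returning a linear predictor
    h_w(x) = sign(<w, x>) fitted by least squares."""
    X, Y = S
    w, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return w

w_hat = A(sample_S(m=200))
```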
What to achieve from the sample?
Use the available sample to:
1. learn a predictor
2. certify the predictor's performance
Learning a predictor:
• algorithm driven by some learning principle
• informed by prior knowledge, resulting in an inductive bias
Certifying performance:
• what happens beyond the training set
• generalization bounds
Risk (aka error) measures
A loss function ℓ(h(X), Y) is used to measure the discrepancy between a predicted output h(X) and the true output Y.
Empirical risk: R_in(h) = (1/m) Σ_{i=1}^m ℓ(h(X_i), Y_i)   (in-sample)
Theoretical risk: R_out(h) = E ℓ(h(X), Y)   (out-of-sample)
Examples:
• ℓ(h(X), Y) = 1[h(X) ≠ Y] : 0-1 loss (classification)
• ℓ(h(X), Y) = (Y − h(X))² : square loss (regression)
• ℓ(h(X), Y) = (1 − Y h(X))₊ : hinge loss
• ℓ(h(X), Y) = − log(h(X)) : log loss (density estimation)
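As a concrete illustration, here is a small sketch computing the empirical risk R_in(h) under the 0-1 and hinge losses. The helpers are our own, and the example reuses the toy sampler sketched above:

```python
import numpy as np

def zero_one_loss(pred, y):
    return (np.sign(pred) != y).astype(float)   # 1[h(X) != Y]

def hinge_loss(pred, y):
    return np.maximum(0.0, 1.0 - y * pred)      # (1 - Y h(X))_+

def empirical_risk(h, X, Y, loss):
    """R_in(h) = (1/m) * sum_i loss(h(X_i), Y_i)."""
    return loss(h(X), Y).mean()

w = np.array([1.0, -0.5])                       # a fixed linear predictor h(x) = <w, x>
h = lambda X: X @ w
X, Y = sample_S(m=200)                          # toy sampler from the previous sketch
print(empirical_risk(h, X, Y, zero_one_loss), empirical_risk(h, X, Y, hinge_loss))
```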
Generalization
If predictor h does well on the in-sample (X, Y) pairs... will it still do well on out-of-sample pairs?
Flavours:
• distribution-free vs. distribution-dependent
• algorithm-free vs. algorithm-dependent
Before PAC-Bayes
Single hypothesis h (building block):
with probability ≥ 1 − δ,   R_out(h) ≤ R_in(h) + √( log(1/δ) / (2m) ).
Finite function class H (worst-case approach):
w.p. ≥ 1 − δ, ∀ h ∈ H,   R_out(h) ≤ R_in(h) + √( log(|H|/δ) / (2m) ).
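A minimal sketch of evaluating these two bounds numerically; the sample size, empirical risk and |H| below are illustrative assumptions:

```python
import math

def hoeffding_bound(r_in, m, delta, class_size=1):
    """Right-hand side R_in + sqrt(log(|H|/delta) / (2m)); class_size=1 gives the single-hypothesis bound."""
    return r_in + math.sqrt(math.log(class_size / delta) / (2 * m))

print(hoeffding_bound(r_in=0.05, m=1000, delta=0.05))                    # single h
print(hoeffding_bound(r_in=0.05, m=1000, delta=0.05, class_size=10**6))  # |H| = 10^6
```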
The PAC-Bayes framework
The risk measures R_in(h) and R_out(h) are extended by averaging:
R_in(Q) ≡ ∫_H R_in(h) dQ(h)        R_out(Q) ≡ ∫_H R_out(h) dQ(h)
KL(Q‖P) = E_{h∼Q} ln( Q(h) / P(h) ) is the Kullback-Leibler divergence.
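A sketch of these averaged quantities when P and Q are unit-variance Gaussians over the weights of a linear classifier, a common choice in the PAC-Bayes literature. The Monte Carlo estimator and the toy usage are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_empirical_risk(mean_q, X, Y, n_samples=1000):
    """Monte Carlo estimate of R_in(Q) = E_{h~Q} R_in(h), with Q = N(mean_q, I)
    over linear classifiers h_w(x) = sign(<w, x>) and the 0-1 loss."""
    W = mean_q + rng.normal(size=(n_samples, mean_q.shape[0]))  # draws w ~ Q
    preds = np.sign(X @ W.T)                                    # shape (m, n_samples)
    return (preds != Y[:, None]).mean()

def kl_isotropic_gaussians(mean_q, mean_p):
    """KL( N(mean_q, I) || N(mean_p, I) ) = ||mean_q - mean_p||^2 / 2."""
    return 0.5 * np.sum((mean_q - mean_p) ** 2)

# Example (reusing the toy sampler sketched earlier):
# X, Y = sample_S(m=200)
# print(gibbs_empirical_risk(np.array([1.0, -0.5]), X, Y))
```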
PAC-Bayes aka Generalised Bayes
PAC-Bayes bounds vs. Bayesian learning
Prior
• PAC-Bayes: bounds hold for any distribution
• Bayes: prior choice impacts inference
Posterior
• PAC-Bayes: bounds hold for any distribution
• Bayes: posterior uniquely defined by prior and statistical model
Data distribution
• PAC-Bayes: bounds hold for any distribution
• Bayes: randomness lies in the noise model generating the output
A General PAC-Bayesian Theorem
∆-function: a “distance” between R_in(Q) and R_out(Q); a convex function ∆ : [0, 1] × [0, 1] → R.
With probability at least 1 − δ over S ∼ D^m,
∀ Q on H :   ∆( R_in(Q), R_out(Q) ) ≤ (1/m) [ KL(Q‖P) + ln( I_∆(m) / δ ) ],
where
I_∆(m) = sup_{r∈[0,1]} Σ_{k=0}^m  Bin(k; m, r) · e^{m·∆(k/m, r)},    with Bin(k; m, r) = C(m, k) r^k (1 − r)^{m−k}.
Proof of the general theorem
General theorem:
Pr_{S∼D^m}[ ∀ Q on H : ∆( R_in(Q), R_out(Q) ) ≤ (1/m) ( KL(Q‖P) + ln( I_∆(m) / δ ) ) ] ≥ 1 − δ.
Proof ideas.
Change of measure inequality: for any P and Q on H, and for any measurable function φ : H → R,
−ln E_{h∼P} e^{φ(h)} = −ln E_{h∼Q} [ (P(h)/Q(h)) e^{φ(h)} ]
                     ≤ E_{h∼Q} ln( Q(h)/P(h) ) − E_{h∼Q} φ(h)
                     = KL(Q‖P) − E_{h∼Q} φ(h).
Markov's inequality: for a nonnegative random variable X, with probability at least 1 − δ, X ≤ E[X]/δ.
Proof of the general theorem
Recall: Pr_{S∼D^m}[ ∀ Q on H : ∆( R_in(Q), R_out(Q) ) ≤ (1/m) ( KL(Q‖P) + ln( I_∆(m) / δ ) ) ] ≥ 1 − δ.
Proof.
m · ∆( E_{h∼Q} R_in(h), E_{h∼Q} R_out(h) )
  ≤ E_{h∼Q} m · ∆( R_in(h), R_out(h) )                                                      [Jensen's inequality]
  ≤ KL(Q‖P) + ln E_{h∼P} e^{m·∆(R_in(h), R_out(h))}                                         [change of measure]
  ≤ KL(Q‖P) + ln( (1/δ) E_{S'∼D^m} E_{h∼P} e^{m·∆(R_in(h), R_out(h))} )                     [Markov's inequality, w.p. ≥ 1 − δ]
  = KL(Q‖P) + ln( (1/δ) E_{h∼P} E_{S'∼D^m} e^{m·∆(R_in(h), R_out(h))} )                     [expectation swap]
  = KL(Q‖P) + ln( (1/δ) E_{h∼P} Σ_{k=0}^m Bin(k; m, R_out(h)) e^{m·∆(k/m, R_out(h))} )      [binomial law]
  ≤ KL(Q‖P) + ln( (1/δ) sup_{r∈[0,1]} Σ_{k=0}^m Bin(k; m, r) e^{m·∆(k/m, r)} )              [supremum over risk]
  = KL(Q‖P) + ln( I_∆(m) / δ ).   □
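To make I_∆(m) concrete, here is a sketch that evaluates it numerically for ∆ = kl over a grid of r values. This is our own illustration; numerically, I_kl(m) grows like √m, consistent with the 2√m factor appearing in corollaries (a) and (b) below:

```python
import numpy as np
from scipy.stats import binom
from scipy.special import logsumexp

def kl_bernoulli(q, p, eps=1e-12):
    """kl(q, p) = q ln(q/p) + (1 - q) ln((1 - q)/(1 - p))."""
    q = np.clip(q, eps, 1 - eps)
    p = np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def I_delta(m, delta_fn, grid=1000):
    """I_delta(m) = sup_r sum_k Bin(k; m, r) * exp(m * delta(k/m, r)),
    with the supremum approximated over a finite grid of r (computed in log space)."""
    ks = np.arange(m + 1)
    best = -np.inf
    for r in np.linspace(1e-4, 1 - 1e-4, grid):
        log_terms = binom.logpmf(ks, m, r) + m * delta_fn(ks / m, r)
        best = max(best, logsumexp(log_terms))
    return float(np.exp(best))

m = 200
print(I_delta(m, kl_bernoulli), 2 * np.sqrt(m))   # I_kl(m) stays below 2*sqrt(m)
```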
General theorem
Corollary
[...] with probability at least 1 − δ over the choice of S ∼ D^m, for all Q on H:
(a) kl( R_in(Q), R_out(Q) ) ≤ (1/m) [ KL(Q‖P) + ln( 2√m / δ ) ]                  — Langford and Seeger [31]
(b) R_out(Q) ≤ R_in(Q) + √( (1/(2m)) [ KL(Q‖P) + ln( 2√m / δ ) ] )               — McAllester [40, 43]
(c) R_out(Q) ≤ (1 − e^{−c})^{−1} [ c · R_in(Q) + (1/m) ( KL(Q‖P) + ln(1/δ) ) ]   — Catoni [11]
(d) R_out(Q) ≤ R_in(Q) + (1/λ) [ KL(Q‖P) + ln(1/δ) + f(λ, m) ]                   — Alquier et al. [4]
where the corresponding ∆-functions are
kl(q, p) := q ln(q/p) + (1 − q) ln( (1 − q)/(1 − p) ) ≥ 2(q − p)²,
∆_c(q, p) := −ln[ 1 − (1 − e^{−c}) · p ] − c · q,
∆_λ(q, p) := (λ/m) (p − q).
Proof of the Langford/Seeger bound
Linear classifiers
PAC-Bayes Bound for SVM (1/2)
Prior P is Gaussian N(0, 1), centred at the origin 0.
Posterior Q is in the direction of w, at distance µ from the origin.
[Figure: the prior P at the origin and the posterior Q shifted by µ along the direction w.]
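For this choice (unit-variance Gaussians differing only in their mean), the KL term in the bound has the simple closed form KL(Q‖P) = µ²/2. A small sketch of this, as our own illustration:

```python
import numpy as np

def kl_svm_posterior_prior(w, mu):
    """KL(Q || P) for P = N(0, I) and Q = N(mu * w / ||w||, I): equals mu^2 / 2,
    independently of the direction w."""
    mean_q = mu * np.asarray(w, dtype=float) / np.linalg.norm(w)
    return 0.5 * np.sum(mean_q ** 2)

print(kl_svm_posterior_prior(w=[3.0, 4.0], mu=2.0))   # 2.0 = mu^2 / 2
```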
PAC-Bayes Bound for SVM (2/2)
δ is the confidence: the bound holds with probability 1 − δ over the random i.i.d. selection of the training data.
Form of the SVM bound
Note that the bound holds for all posterior distributions, so we can choose µ to optimise the bound.
If we define the inverse of the (binary) kl by kl⁻¹(q | ε) = max{ p ∈ [0, 1] : kl(q, p) ≤ ε }, the bound can be rearranged into an upper bound on R_out(Q).
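A sketch of evaluating this kl inverse numerically by bisection; the helper and the numbers plugged into corollary (a) below are our own illustrative assumptions:

```python
import math

def kl(q, p):
    """Binary KL divergence kl(q, p), for q, p in (0, 1)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, eps, tol=1e-9):
    """max{ p in [q, 1) : kl(q, p) <= eps }, by bisection (kl(q, .) is increasing on [q, 1))."""
    lo, hi = q, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl(q, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

# Plugging illustrative numbers into the Langford-Seeger bound (corollary (a)):
m, delta, KL_QP, r_in = 5000, 0.05, 3.0, 0.1
eps = (KL_QP + math.log(2 * math.sqrt(m) / delta)) / m
print(kl_inverse(r_in, eps))   # upper bound on R_out(Q)
```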
Gives SVM Optimisation
Primal form:
min_{w, ξ}  (1/2) ‖w‖² + C Σ_{i=1}^m ξ_i
s.t.  y_i wᵀφ(x_i) ≥ 1 − ξ_i,   i = 1, ..., m
      ξ_i ≥ 0,                  i = 1, ..., m
Dual form:
max_α  Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j κ(x_i, x_j)
s.t.  0 ≤ α_i ≤ C,   i = 1, ..., m
where κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ and ⟨w, φ(x)⟩ = Σ_{i=1}^m α_i y_i κ(x_i, x).
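For reference, a minimal sketch of solving this dual in practice with scikit-learn; the toy data, the RBF kernel and the parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# SVC solves the dual above; C is the box constraint on the alpha_i.
clf = SVC(C=1.0, kernel="rbf", gamma=1.0).fit(X, y)
print(clf.n_support_)        # number of support vectors per class (the x_i with alpha_i > 0)
print(clf.dual_coef_.shape)  # the alpha_i * y_i for the support vectors
```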
Slack variable conversion
[Figure: slack variable conversion, plotted over margin values from −2 to 2.]
Model Selection with the new bound: setup
Results
(Bound = value of the risk bound; TE = test error.)

                          SVM                                  ηPrior SVM
Problem             2FCV    10FCV    PAC     PrPAC       PrPAC    τ-PrPAC
digits     Bound    –       –        0.175   0.107       0.050    0.047
           TE       0.007   0.007    0.007   0.014       0.010    0.009
waveform   Bound    –       –        0.203   0.185       0.178    0.176
           TE       0.090   0.086    0.084   0.088       0.087    0.086
pima       Bound    –       –        0.424   0.420       0.428    0.416
           TE       0.244   0.245    0.229   0.229       0.233    0.233
ringnorm   Bound    –       –        0.203   0.110       0.053    0.050
           TE       0.016   0.016    0.018   0.018       0.016    0.016
spam       Bound    –       –        0.254   0.198       0.186    0.178
           TE       0.066   0.063    0.067   0.077       0.070    0.072
Average    TE       0.0846  0.0834   0.081   0.0852      0.0832   0.0832
Take home messages
Deep Learning Results
A flexible framework
References I
[1] P. Alquier and G. Biau. Sparse single-index model. Journal of Machine Learning Research, 14:243–280, 2013.
[2] P. Alquier and K. Lounici. PAC-Bayesian theorems for sparse regression estimation with exponential weights. Electronic
Journal of Statistics, 5:127–145, 2011.
[3] Pierre Alquier and Benjamin Guedj. Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5):887–902, 2018.
[4] Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of variational approximations of Gibbs posteriors. ArXiv
e-prints, 2015. URL http://arxiv.org/abs/1506.04091.
[5] A. Ambroladze, E. Parrado-Hernández, and J. Shawe-taylor. Tighter PAC-Bayes bounds. In Advances in Neural Information
Processing Systems, NIPS, pages 9–16, 2007.
[6] Jean-Yves Audibert and Olivier Bousquet. Combining PAC-Bayesian and generic chaining bounds. Journal of Machine
Learning Research, 2007.
[7] Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy. PAC-Bayesian theory for transductive learning. In
AISTATS, 2014.
[8] Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy. PAC-Bayesian bounds based on the Rényi divergence.
In AISTATS, 2016.
[9] O. Catoni. Statistical Learning Theory and Stochastic Optimization. École d’Été de Probabilités de Saint-Flour 2001. Springer,
2004.
[10] O. Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56 of Lecture notes
– Monograph Series. Institute of Mathematical Statistics, 2007.
[11] Olivier Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning, volume 56. Inst. of
Mathematical Statistic, 2007.
[12] Philip Derbeko, Ran El-Yaniv, and Ron Meir. Explicit learning curves for transduction and application to clustering and
compression algorithms. J. Artif. Intell. Res. (JAIR), 22, 2004.
[13] G. K. Dziugaite and D. M. Roy. Data-dependent PAC-Bayes priors via differential privacy. In NeurIPS, 2018.
[14] G. K. Dziugaite and D. M. Roy. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of
Entropy-SGD and data-dependent priors. In International Conference on Machine Learning, pages 1376–1385, 2018.
References II
[15] Gintare K. Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks
with many more parameters than training data. In Proceedings of Uncertainty in Artificial Intelligence (UAI), 2017.
[16] Mahdi Milani Fard and Joelle Pineau. PAC-Bayesian model selection for reinforcement learning. In Advances in Neural
Information Processing Systems (NIPS), 2010.
[17] Mahdi Milani Fard, Joelle Pineau, and Csaba Szepesvári. PAC-Bayesian Policy Evaluation for Reinforcement Learning. In UAI,
Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 195–202, 2011.
[18] S. Gerchinovitz. Prédiction de suites individuelles et cadre statistique classique : étude de quelques liens autour de la
régression parcimonieuse et des techniques d’agrégation. PhD thesis, Université Paris-Sud, 2011.
[19] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In Proceedings of the
26th Annual International Conference on Machine Learning, ICML, 2009.
[20] P. Germain, A. Habrard, F. Laviolette, and E. Morvant. A new PAC-Bayesian perspective on domain adaptation. In Proceedings
of International Conference on Machine Learning, volume 48, 2016.
[21] Pascal Germain. Généralisations de la théorie PAC-bayésienne pour l’apprentissage inductif, l’apprentissage transductif et
l’adaptation de domaine. PhD thesis, Université Laval, 2015.
[22] Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, and François Laviolette. From PAC-Bayes bounds to KL
regularization. In Advances in Neural Information Processing Systems, pages 603–610, 2009.
[23] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar. Bayesian reinforcement learning: A survey. Foundations and Trends in
Machine Learning, 8(5-6):359–483, 2015.
[24] B. Guedj and S. Robbiano. PAC-Bayesian high dimensional bipartite ranking. Journal of Statistical Planning and Inference,
196:70 – 86, 2018. ISSN 0378-3758.
[25] Benjamin Guedj. A primer on PAC-Bayesian learning. arXiv:1901.05353, 2019. To appear in the Proceedings of the French
Mathematical Society.
[26] Benjamin Guedj and Pierre Alquier. PAC-Bayesian estimation and prediction in sparse additive models. Electron. J. Statist., 7:
264–291, 2013.
[27] Benjamin Guedj and Louis Pujol. Still no free lunches: the price to pay for tighter PAC-Bayes bounds. arXiv preprint
arXiv:1910.04460, 2019.
References III
[28] Matthew Higgs and John Shawe-Taylor. A PAC-Bayes bound for tailored density estimation. In Proceedings of the International
Conference on Algorithmic Learning Theory (ALT), 2010.
[29] Matthew J Holland. PAC-Bayes under potentially heavy tails. arXiv:1905.07900, 2019. To appear in NeurIPS.
[30] A. Lacasse, F. Laviolette, M. Marchand, P. Germain, and N. Usunier. PAC-Bayes bounds for the risk of the majority vote and
the variance of the Gibbs classifier. In Advances in Neural information processing systems, pages 769–776, 2007.
[31] John Langford and Matthias Seeger. Bounds for averaging classifiers. Technical report, Carnegie Mellon, Department of Computer Science, 2001.
[32] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems (NIPS),
2002.
[33] Gaël Letarte, Pascal Germain, Benjamin Guedj, and François Laviolette. Dichotomize and Generalize: PAC-Bayesian Binary
Activated Deep Neural Networks. arXiv:1905.10259, 2019. To appear at NeurIPS.
[34] G. Lever, F. Laviolette, and J. Shawe-Taylor. Distribution-dependent PAC-Bayes priors. In International Conference on
Algorithmic Learning Theory, pages 119–133. Springer, 2010.
[35] C. Li, W. Jiang, and M. Tanner. General oracle inequalities for Gibbs posterior with application to ranking. In Conference on
Learning Theory, pages 512–521, 2013.
[36] Le Li, Benjamin Guedj, and Sébastien Loustau. A quasi-Bayesian perspective to online clustering. Electron. J. Statist., 12(2):
3071–3113, 2018.
[37] B. London. A PAC-Bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in
Neural Information Processing Systems, pages 2931–2940, 2017.
[38] B. London, B. Huang, B. Taskar, and L. Getoor. PAC-Bayesian collective stability. In Artificial Intelligence and Statistics, pages
585–594, 2014.
[39] A. Maurer. A note on the PAC-Bayesian Theorem. arXiv preprint cs/0411099, 2004.
[40] D. A. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
[41] David McAllester. Some PAC-Bayesian theorems. In Proceedings of the International Conference on Computational Learning
Theory (COLT), 1998.
References IV
[42] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37, 1999.
[43] David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3), 1999.
[44] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1), 2003.
[45] David McAllester. Simplified PAC-Bayesian margin bounds. In COLT, 2003.
[46] Zakaria Mhammedi, Peter D. Grunwald, and Benjamin Guedj. PAC-Bayes Un-Expected Bernstein Inequality. arXiv preprint
arXiv:1905.13367, 2019. Accepted at NeurIPS 2019.
[47] B. Neyshabur, S. Bhojanapalli, D. A. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in
Neural Information Processing Systems, pages 5947–5956, 2017.
[48] Kento Nozawa, Pascal Germain, and Benjamin Guedj. PAC-Bayesian contrastive unsupervised representation learning. arXiv
preprint arXiv:1910.04464, 2019.
[49] E. Parrado-Hernández, A. Ambroladze, J. Shawe-Taylor, and S. Sun. PAC-Bayes bounds with data dependent priors. Journal
of Machine Learning Research, 13:3507–3531, 2012.
[50] O. Rivasplata, E. Parrado-Hernandez, J. Shawe-Taylor, S. Sun, and C. Szepesvari. PAC-Bayes bounds for stable algorithms
with instance-dependent priors. In Advances in Neural Information Processing Systems, pages 9214–9224, 2018.
[51] M. Seeger. PAC-Bayesian generalization bounds for gaussian processes. Journal of Machine Learning Research, 3:233–269,
2002.
[52] M. Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations.
PhD thesis, University of Edinburgh, 2003.
[53] Y. Seldin and N. Tishby. PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11:
3595–3646, 2010.
[54] Y. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. PAC-Bayesian inequalities for martingales. IEEE
Transactions on Information Theory, 58(12):7086–7093, 2012.
[55] Yevgeny Seldin, Peter Auer, François Laviolette, John Shawe-Taylor, and Ronald Ortner. PAC-Bayesian analysis of contextual
bandits. In Advances in Neural Information Processing Systems (NIPS), 2011.
References V
[56] J. Shawe-Taylor and R. C. Williamson. A PAC analysis of a Bayes estimator. In Proceedings of the 10th annual conference on
Computational Learning Theory, pages 2–9. ACM, 1997. doi: 10.1145/267460.267466.
[57] John Shawe-Taylor and David Hardoon. PAC-Bayes analysis of maximum entropy classification. In Proceedings of the
International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
[58] Niklas Thiemann, Christian Igel, Olivier Wintenberger, and Yevgeny Seldin. A Strongly Quasiconvex PAC-Bayesian Bound. In
International Conference on Algorithmic Learning Theory, ALT, pages 466–492, 2017.
[59] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[60] Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach. In ICLR, 2019.