Supervised Learning (COMP0078)
6. Learning Theory (Part I)
Carlo Ciliberto
University College London
Department of Computer Science
1
Previous Classes
In the previous classes we:
• Formulated the supervised learning problem.
• Focused on Empirical Risk Minimization (ERM).
• Discussed overfitting and how to tackle it (heuristically).
2
Previous Classes
We studied a few different algorithms (in particular focusing on
classification):
• Nearest neighbor
• Least-squares
• Support Vector Machines
• Logistic Regression
• Decision Trees
• Ensemble methods (Bagging, Boosting)
We observed that many of these methods can be formulated as
ERM problems.
3
Outline
Equipped with our shiny new tool-set of supervised learning
algorithms, we will now go back to asking ourselves the
question:
| When/why/how do these methods “work”?
Outline:
• Empirical Risk Minimization (Again!)
• Generalization error
• Excess Risk Decomposition
• Regularization & Bias-Variance Trade-off (Again!)
4
Refresher on the Learning Problem
The goal of Supervised Learning is to find a “good” estimator
fn : X → Y, approximating the lowest expected risk
    inf_{f : X→Y} E(f),     E(f) = ∫_{X×Y} ℓ(f(x), y) dρ(x, y)

given only a finite number of (training) examples (x_i, y_i)_{i=1}^n
sampled independently from the unknown distribution ρ.
5
A Wishlist
What do we mean by... “good”?
To have a low excess risk E(f_n) − E(f_*).
• Consistency. Does E(f_n) − E(f_*) → 0
  • in expectation?
  • in probability?
as the size of the training set S = (x_i, y_i)_{i=1}^n of points
randomly sampled from ρ grows, n → +∞?
• Learning Rates. How "fast" is consistency achieved?
Non-asymptotic bounds: finite sample complexity, tail
bounds, error bounds...
6
Empirical Risk as a Proxy
If ρ is unknown, can we say anything about E(f_n) − inf_{f∈F} E(f)?
We only "glimpse" ρ via the samples (x_i, y_i)_{i=1}^n. Can we use
them to gather some information about ρ (or better, on E(f))?
Consider a function f : X → Y and its empirical risk

    E_n(f) = (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i)

Is this a good idea? A simple calculation shows that

    E_{S∼ρ^n}[E_n(f)] = (1/n) Σ_{i=1}^n E_{(x_i,y_i)∼ρ}[ℓ(f(x_i), y_i)] = (1/n) Σ_{i=1}^n E(f) = E(f)

The expectation of E_n(f) is the expected risk E(f)!
7
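The unbiasedness claim above can be checked numerically. The sketch below is purely illustrative: the distribution ρ (uniform inputs, linear signal plus Gaussian noise), the fixed hypothesis f, and the squared loss are all toy choices, not from the slides. Averaging E_n(f) over many independent training sets should recover E(f).

```python
import random

random.seed(0)

def sample_point():
    # toy distribution rho: x uniform on [0, 1], y = x + Gaussian noise
    x = random.random()
    y = x + random.gauss(0.0, 0.1)
    return x, y

def f(x):
    # a fixed hypothesis, chosen before seeing any data
    return 0.5 * x

def sq_loss(yhat, y):
    return (yhat - y) ** 2

def empirical_risk(f, sample):
    return sum(sq_loss(f(x), y) for x, y in sample) / len(sample)

# Average the empirical risk over many independent training sets S ~ rho^n:
n, trials = 50, 2000
mean_emp = sum(empirical_risk(f, [sample_point() for _ in range(n)])
               for _ in range(trials)) / trials

# Estimate the expected risk E(f) with one very large sample:
big = [sample_point() for _ in range(200_000)]
expected = empirical_risk(f, big)

print(mean_emp, expected)  # the two agree: E_S[E_n(f)] = E(f)
```

Note that f is fixed independently of the data; this is exactly the assumption that fails for the ERM estimator f_n later in the lecture.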
Empirical Vs Expected
Nice! But... how close is E_n(f) to E(f) with respect to n?
Let X and (X_i)_{i=1}^n be i.i.d. random variables, and let
X̄_n = (1/n) Σ_{i=1}^n X_i. Then

    E[(X̄_n − E[X])²] = Var(X̄_n) = Var(X) / n

Therefore, the expected (squared) distance between the
empirical mean of the X_i and their expectation E[X] goes to
zero as O(1/n)... (assuming X has finite variance).
8
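The O(1/n) decay of Var(X̄_n) is easy to see in simulation. A minimal sketch, with uniform X_i on [0, 1] (so Var(X) = 1/12) as an illustrative choice:

```python
import random

random.seed(0)

# X_i uniform on [0, 1]: Var(X) = 1/12, so Var(X_bar_n) should be 1/(12 n)
def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

trials = 5_000
emp_var = {}
for n in (10, 100, 400):
    means = [sample_mean(n) for _ in range(trials)]
    mu = sum(means) / trials
    emp_var[n] = sum((m - mu) ** 2 for m in means) / trials
    print(n, emp_var[n], 1 / (12 * n))  # empirical variance tracks Var(X)/n
```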
Empirical Vs Expected Risk
If X_i = ℓ(f(x_i), y_i), we have X̄_n = E_n(f) and therefore

    E[(E_n(f) − E(f))²] = V_f / n

where V_f = Var(ℓ(f(x), y)). In particular,

    E[|E_n(f) − E(f)|] ≤ √(V_f / n)

Exercise. Can we similarly get a "tail" bound?

    P( |E_n(f) − E(f)| ≤ ε(n, δ) ) ≥ 1 − δ
9
Empirical Vs Expected
Assume there exists a minimizer f_* : X → Y of the expected risk

    E(f_*) = inf_{f∈F} E(f)

Then, for any f : X → Y we can decompose the excess risk as

    E(f) − E(f_*) = [E(f) − E_n(f)] + [E_n(f) − E_n(f_*)] + [E_n(f_*) − E(f_*)]

We can therefore leverage the statistical relation between E_n
and E to study the expected risk in terms of the empirical risk.
This leads naturally to Empirical Risk Minimization.
10
Empirical Risk Minimization (ERM)
Let fn be the minimizer of the empirical risk
    f_n = argmin_{f∈F} E_n(f)

We automatically have E_n(f_n) − E_n(f_*) ≤ 0 (for any training set).
Then...

    E[E(f_n) − E(f_*)] ≤ E[E(f_n) − E_n(f_n)]   (Question. Why?)

We can "just" focus on studying the generalization error

    E[E(f_n) − E_n(f_n)]
11
Generalization Error
How can we control the generalization error

    E_n(f_n) − E(f_n)

with respect to the number n of examples?
This question is far from trivial (a key one in SLT, in fact):
In general,

    E[E_n(f_n) − E(f_n)] ≠ 0   (Question. Why?)

E_n and f_n both depend on the sampled training data. Therefore,
we cannot use the result

    E[|E_n(f_n) − E(f_n)|] ≤ √( Var(ℓ(f_n(x), y)) / n )

which indeed is not true in general...
12
Issues with ERM
Let X = Y = ℝ, ρ with dense support¹ and ℓ(y, y) = 0 for all y ∈ Y.
For any S = (x_i, y_i)_{i=1}^n with distinct inputs x_i, let f_n : X → Y be

    f_n(x) = y_i   if x = x_i for some i = 1, ..., n
    f_n(x) = 0     otherwise

Then, for any number n of training points:
• E[E_n(f_n)] = 0
• E[E(f_n)] = E(0), which is greater than E(f_*) (unless f_* ≡ 0)
Therefore E[E(f_n) − E_n(f_n)] = E(0) ↛ 0 as n increases!
¹ and such that every pair (x, y) has measure zero according to ρ
13
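The memorizing estimator above can be simulated directly. In this illustrative sketch (the noiseless distribution y = x and the squared loss are toy assumptions, not from the slides), the estimator stores the training pairs in a lookup table: its empirical risk is exactly zero, while its expected risk stays at E(0) ≈ E[y²] no matter how large n is.

```python
import random

random.seed(0)

def sample(n):
    # rho: x uniform on [0, 1] (dense support, distinct x_i a.s.), y = x
    return [(x, x) for x in (random.random() for _ in range(n))]

train = sample(200)
lookup = dict(train)

def f_n(x):
    # memorize the training set, predict 0 everywhere else
    return lookup.get(x, 0.0)

sq = lambda a, b: (a - b) ** 2
train_risk = sum(sq(f_n(x), y) for x, y in train) / len(train)
test_risk = sum(sq(f_n(x), y) for x, y in sample(10_000)) / 10_000

print(train_risk, test_risk)  # 0.0 versus roughly E[y^2] = 1/3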
Overfitting
An estimator fn is said to overfit the training data if ∀n ∈ N:
• E [E(fn ) − E(f∗ )] > C for a constant C > 0, and
• E [En (fn ) − En (f∗ )] ≤ 0
According to this definition ERM overfits...
14
ERM on Finite Hypotheses Spaces
Is ERM hopeless? Consider the case where X and Y are finite.
Then F = Y^X = {f : X → Y} is finite as well (albeit possibly
very large), and therefore:

    E|E_n(f_n) − E(f_n)| ≤ E[ sup_{f∈F} |E_n(f) − E(f)| ]
                         ≤ Σ_{f∈F} E|E_n(f) − E(f)| ≤ |F| √(V_F / n)

where V_F = sup_{f∈F} V_f and |F| denotes the cardinality of F.
Here ERM works!

    lim_{n→+∞} E|E(f_n) − E(f_*)| = 0
15
ERM on Finite Hypotheses (Sub) Spaces
The same argument holds in general: let H ⊂ F be a finite
space of hypotheses (even if F is not). Then,
    E|E_n(f_n) − E(f_n)| ≤ |H| √(V_H / n)

In particular, if f_* ∈ H, then

    E|E(f_n) − E(f_*)| ≤ |H| √(V_H / n)
and ERM is a good estimator for the problem considered.
16
Example: Threshold functions
Consider a binary classification problem with Y = {0, 1}. Someone
has told us that the minimizer of the risk is a "threshold
function" f_{a_*}(x) = 1_{[a_*,+∞)}(x) with a_* ∈ [−1, 1].
[Figure: two threshold functions f_a and f_b plotted on the interval [−1.5, 1.5].]
We can learn on H = {f_a | a ∈ [−1, 1]} ≅ [−1, 1]. However, on a
computer we can only represent real numbers up to a given
precision. 17
Example: Threshold Functions (with precision p)
Discretization: given p > 0, we can consider

    H_p = { f_a | a ∈ [−1, 1],  a · 10^p = [a · 10^p] }

with [a] denoting the integer part (i.e. the closest integer) of a
scalar a. The value p can be interpreted as the "precision" of
our space of functions H_p. Note that |H_p| = 2 · 10^p.
If f_* ∈ H_p, then we automatically have

    E|E(f_n) − E(f_*)| ≤ |H_p| √(V_H / n) ≤ 10^p / √n

(This is because V_H ≤ 1/4. Question. Why?)
18
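ERM over the finite space H_p is simple enough to run exhaustively. A minimal sketch, assuming a noiseless toy distribution (uniform inputs on [−1, 1], labels given exactly by a hypothetical true threshold a_* = 0.3) and precision p = 2:

```python
import random

random.seed(0)

a_star = 0.3                      # hypothetical true threshold (illustrative)
f_star = lambda x: 1 if x >= a_star else 0

n = 2000
data = [(x, f_star(x)) for x in (random.uniform(-1, 1) for _ in range(n))]

p = 2                             # precision: grid of step 10^(-p) on [-1, 1]
grid = [k / 10**p for k in range(-10**p, 10**p + 1)]

def emp_risk(a):
    # 0-1 empirical risk of the threshold function f_a on the training set
    return sum(((x >= a) != (y == 1)) for x, y in data) / n

a_hat = min(grid, key=emp_risk)   # ERM: exhaustive search over H_p
print(a_hat)                      # close to a_star = 0.3
```

With noiseless labels the empirical risk of the returned threshold is exactly zero, and a_hat lands within roughly the grid resolution of a_*.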
Rates in Expectation Vs Probability
In practice, even for small values of p,

    E|E(f_n) − E(f_*)| ≤ 10^p / √n

will need a very large n in order to give a meaningful bound on
the expected error.
Interestingly, we can get much better constants (not rates
though!) by working in probability...
19
Hoeffding’s Inequality
Let X_1, ..., X_n be independent random variables such that
X_i ∈ [a_i, b_i]. Let X̄ = (1/n) Σ_{i=1}^n X_i. Then,

    P( |X̄ − E[X̄]| ≥ ε ) ≤ 2 exp( −2n²ε² / Σ_{i=1}^n (b_i − a_i)² )
20
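Hoeffding's inequality can be checked by Monte Carlo. Sketch below, with the illustrative choice X_i ∼ Bernoulli(1/2) so that X_i ∈ [0, 1] and the bound reads P(|X̄ − 1/2| ≥ ε) ≤ 2 exp(−2nε²):

```python
import math
import random

random.seed(0)

n, eps, trials = 100, 0.1, 20_000

def deviates():
    # one draw of X_bar for n Bernoulli(1/2) variables
    xbar = sum(random.random() < 0.5 for _ in range(n)) / n
    return abs(xbar - 0.5) >= eps

empirical = sum(deviates() for _ in range(trials)) / trials
bound = 2 * math.exp(-2 * n * eps**2)
print(empirical, bound)   # the empirical tail probability sits below the bound
```

Hoeffding is not tight here (the true tail is much smaller than the bound), but it holds for every n, ε, and bounded distribution.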
Applying Hoeffding’s inequality
Assume that ∀f ∈ H, x ∈ X , y ∈ Y the loss is bounded
|ℓ(f (x), y)| ≤ M by some constant M > 0. Then, for any f ∈ H
we have
    P( |E_n(f) − E(f)| ≥ ε ) ≤ 2 exp( −nε² / (2M²) )
21
Controlling the Generalization Error
We would like to control the generalization error E_n(f_n) − E(f_n)
of our estimator in probability. One possible way to do that is by
controlling the generalization error of the whole set H:

    P( |E_n(f_n) − E(f_n)| ≥ ε ) ≤ P( sup_{f∈H} |E_n(f) − E(f)| ≥ ε )

The latter term is the probability that at least one of the events
|E_n(f) − E(f)| ≥ ε occurs for f ∈ H. In other words, the
probability of the union of such events. Therefore

    P( sup_{f∈H} |E_n(f) − E(f)| ≥ ε ) ≤ Σ_{f∈H} P( |E_n(f) − E(f)| ≥ ε )

by the so-called union bound.
22
Hoeffding the Generalization Error
By applying Hoeffding's inequality and the union bound,

    P( |E_n(f_n) − E(f_n)| ≥ ε ) ≤ 2|H| exp( −nε² / (2M²) )

Or, equivalently, for any δ ∈ (0, 1],

    |E_n(f_n) − E(f_n)| ≤ √( 2M² log(2|H|/δ) / n )

with probability at least 1 − δ.
with probability at least 1 − δ.
23
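The high-probability bound above is a closed-form function of n, δ, M and |H|, so it is worth coding up. A minimal sketch (the parameter values passed in are illustrative only):

```python
import math

def generalization_bound(n, delta, M, card_H):
    # eps(n, delta): |E_n(f_n) - E(f_n)| <= eps with probability >= 1 - delta
    return math.sqrt(2 * M**2 * math.log(2 * card_H / delta) / n)

for n in (100, 10_000, 1_000_000):
    print(n, generalization_bound(n, delta=0.01, M=1.0, card_H=200))
```

Note that |H| enters only through a logarithm, while δ costs just log(1/δ): high confidence is cheap.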
Example: Threshold Functions (in Probability)
Going back to the space H_p of threshold functions...

    |E_n(f_n) − E(f_n)| ≤ √( (4 + 6p − 2 log δ) / n )

since M = 1 and

    log(2|H_p|) = log(4 · 10^p) = log 4 + p log 10 ≤ 2 + 3p.

For example, let δ = 0.001, so that −2 log δ ≤ 14. We can say that

    |E_n(f_n) − E(f_n)| ≤ √( (6p + 18) / n )

holds at least 99.9% of the time.
24
Bounds in Expectation Vs Probability
Comparing the two bounds:

    E|E_n(f_n) − E(f_n)| ≤ 10^p / √n    (Expectation)

while, with probability greater than 99.9%,

    |E_n(f_n) − E(f_n)| ≤ √( (6p + 18) / n )    (Probability)
Although we cannot be 100% sure of it, we can be quite
confident that the generalization error will be much smaller than
what the bound in expectation tells us...
Rates: note however that the rates of convergence to 0 are the
same (i.e. O(1/√n)).
25
Improving the bound in Expectation
Exploiting the bound in probability and the knowledge that on
Hp the excess risk is bounded by a constant, we can improve
the bound in expectation...
Let X be a random variable such that |X| < M for some constant
M > 0. Then, for any ε > 0 we have

    E|X| ≤ ε P(|X| ≤ ε) + M P(|X| > ε)

Applying this to our problem: for any δ ∈ (0, 1],

    E|E_n(f_n) − E(f_n)| ≤ (1 − δ) √( 2M² log(2|H_p|/δ) / n ) + δM

Therefore only log |H_p| appears (no |H_p| alone).
26
Infinite Hypotheses Spaces
What if f∗ ∈ H \ Hp for any p > 0?
ERM on Hp will never minimize the expected risk. There will
always be a gap for E(fn,p ) − E(f∗ ).
For p → +∞ it is natural to expect such gap to decrease... BUT
if p increases too fast (with respect to the number n of
examples) we cannot control the generalization error anymore!
    |E_n(f_n) − E(f_n)| ≤ √( (6p + 18) / n ) → +∞   for p → +∞ (n fixed)
Therefore we need to increase p gradually as a function p(n) of
the number of training examples. This approach is known as
regularization.
27
Approximation Error for Threshold Functions
Consider f_p = 1_{[a_p,+∞)} = argmin_{f∈H_p} E(f), with a_p ∈ [−1, 1].
We decompose the excess risk E(f_n) − E(f_*) as

    [E(f_n) − E_n(f_n)] + [E_n(f_n) − E_n(f_p)] + [E_n(f_p) − E(f_p)] + [E(f_p) − E(f_*)]

where the second term satisfies E_n(f_n) − E_n(f_p) ≤ 0.
We already know how to control the generalization of f_n (via the
supremum over H_p) and of f_p (since it is a single function).
Moreover, we have that the approximation error is

    E(f_p) − E(f_*) ≤ |a_p − a_*| ≤ 10^{−p}   (why?)
Note that it does not depend on training data!
28
Approximation Error for Threshold Functions II
Putting everything together: for any δ ∈ (0, 1] and p ≥ 0,
    E(f_n) − E(f_*) ≤ 2 √( (4 + 6p − 2 log δ) / n ) + 10^{−p} = φ(n, δ, p)

holds with probability greater than or equal to 1 − δ.
So, for any n and δ, we can choose the best precision as

    p(n, δ) = argmin_{p≥0} φ(n, δ, p)

which leads to an error bound ε(n, δ) = φ(n, δ, p(n, δ)) holding
with probability at least 1 − δ.
29
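The trade-off φ(n, δ, p) can be optimized numerically. A minimal sketch, restricting to integer precisions p for illustration (the slide allows any p ≥ 0):

```python
import math

def phi(n, delta, p):
    # excess-risk bound: estimation term + approximation term 10^(-p)
    return 2 * math.sqrt((4 + 6 * p - 2 * math.log(delta)) / n) + 10.0 ** (-p)

def best_precision(n, delta, p_max=12):
    # integer precisions are enough for this illustration
    return min(range(p_max + 1), key=lambda p: phi(n, delta, p))

for n in (100, 10_000, 1_000_000):
    p = best_precision(n, 0.001)
    print(n, p, phi(n, 0.001, p))
```

The optimal precision grows slowly with n: a larger training set licenses a richer (finer) hypothesis space, which is exactly the regularization idea of the next slide.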
Regularization
Most hypotheses spaces are “too” large and therefore prone to
overfitting. Regularization is the process of controlling the
"freedom" of an estimator as a function of the number of
training examples.
Idea. Parametrize H as a union H = ∪γ>0 Hγ of hypotheses
spaces Hγ that are not prone to overfitting (e.g. finite spaces).
γ is known as the regularization parameter (e.g. the precision p
in our examples). Assume Hγ ⊂ Hγ ′ if γ ≤ γ ′ .
Regularization Algorithm. Given n training points, find an
estimator fγ,n on Hγ (e.g. ERM on Hγ ). Let γ = γ(n) increase
as n → +∞.
30
Regularization and Decomposition of the Excess Risk
Let γ > 0 and f_γ = argmin_{f∈H_γ} E(f).
We can decompose the excess risk E(f_{γ,n}) − E(f_*) as

    [E(f_{γ,n}) − E(f_γ)] + [E(f_γ) − inf_{f∈H} E(f)] + [inf_{f∈H} E(f) − E(f_*)]
      (Sample error)        (Approximation error)         (Irreducible error)
31
Irreducible Error
    inf_{f∈H} E(f) − E(f_*)
Recall: H is the “largest” possible Hypotheses space we are
considering.
If the irreducible error is zero, H is called universal (e.g. the
RKHS induced by the Gaussian kernel is a universal
Hypotheses space).
32
Approximation Error
    E(f_γ) − inf_{f∈H} E(f)
• Does not depend on the dataset (deterministic).
• Does depend on the distribution ρ.
• Also referred to as bias.
33
Convergence of the Approximation Error
Under mild assumptions,
    lim_{γ→+∞} [ E(f_γ) − inf_{f∈H} E(f) ] = 0
34
Density Results
    lim_{γ→+∞} [ E(f_γ) − E(f_*) ] = 0
• Convergence of Approximation error
+
• Universal Hypotheses space
Note: It corresponds to a density property of the space H in
F = {f : X → Y}
35
Approximation error bounds
    E(f_γ) − inf_{f∈H} E(f) ≤ A(ρ, γ)
• No rates without assumptions – related to the so-called No
Free Lunch Theorem.
• Studied in Approximation Theory using tools such as
Kolmogorov n-width, K-functionals, interpolation spaces. . .
Prototypical result:
If f_* is "regular"² with regularity parameter equal to s, then
A(ρ, γ) = c γ^{−s}.
2
Some abstract notion of regularity parametrizing the class of target
functions. Typical example: f∗ in a Sobolev space W s,2 .
36
Sample Error
E(fγ,n ) − E(fγ )
Random quantity depending on the data.
Two main ways to study it:
• Capacity/Complexity estimates on Hγ .
• Stability.
37
Sample Error Decomposition
We have seen how to decompose E(fγ,n ) − E(fγ ) in
    [E(f_{γ,n}) − E_n(f_{γ,n})] + [E_n(f_{γ,n}) − E_n(f_γ)] + [E_n(f_γ) − E(f_γ)]
     (Generalization error)       (Excess empirical risk)     (Generalization error)
38
Generalization Error(s)
As we have observed,
    E(f_{γ,n}) − E_n(f_{γ,n})   and   E_n(f_γ) − E(f_γ)

can be controlled by studying the empirical process

    sup_{f∈H_γ} |E_n(f) − E(f)|

Example: we have already observed that for a finite space H_γ

    E[ sup_{f∈H_γ} |E_n(f) − E(f)| ] ≤ |H_γ| √( V_{H_γ} / n )
39
ERM on Finite Spaces and Computational Efficiency
The strategy used for threshold functions can be generalized to
any H for which it is possible to find a finite discretization Hp
with respect to the L1 (X , ρX ) norm (e.g. H compact with
respect to such norm).
However, it could be computationally very expensive to find the
empirical risk minimizer on a discretization Hp , since in
principle it could be necessary to evaluate En (f ) for any f ∈ Hp .
40
ERM on Convex Spaces?
As we have seen in previous classes, ERM on convex (thus
dense) spaces is often much more amenable to computations.
In principle, we have observed that on infinite hypotheses
spaces it is difficult to control the generalization error but...
...we might be able to leverage the discretization argument
used for threshold functions to control the generalization error
of ERM for a larger family of hypotheses spaces.
41
Example: Risks for Continuous functions
Let X ⊂ Rd be a compact (i.e., closed and bounded) space and
C(X ) be the space of continuous functions. Let ∥ · ∥∞ be
defined for any f ∈ C(X ) as ∥f ∥∞ = supx∈X |f (x)|.
If the loss function ℓ : Y × Y → R is such that ℓ(·, y) is uniformly
Lipschitz with constant L > 0, for any y ∈ Y, we have that
1) |E(f_1) − E(f_2)| ≤ L ∥f_1 − f_2∥_{L¹(X,ρ_X)} ≤ L ∥f_1 − f_2∥_∞, and

2) |E_n(f_1) − E_n(f_2)| ≤ (1/n) Σ_{i=1}^n |ℓ(f_1(x_i), y_i) − ℓ(f_2(x_i), y_i)| ≤ L ∥f_1 − f_2∥_∞
Therefore, “close” functions in ∥ · ∥∞ will have similar expected
and empirical risks!
42
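Property 2) above is easy to verify numerically. In this illustrative sketch we use the absolute loss ℓ(a, y) = |a − y|, which is Lipschitz in its first argument with L = 1; the two hypotheses and the toy data are arbitrary choices, not from the slides:

```python
import math
import random

random.seed(0)

# Absolute loss l(a, y) = |a - y| is Lipschitz in its first argument with L = 1
f1 = lambda x: x ** 2
f2 = lambda x: x ** 2 + 0.05 * math.sin(10 * x)   # a small sup-norm perturbation

# ||f1 - f2||_inf on [0, 1], approximated on a fine grid
sup_dist = max(abs(f1(t / 1000) - f2(t / 1000)) for t in range(1001))

data = [(random.random(), random.random()) for _ in range(500)]
def En(f):
    return sum(abs(f(x) - y) for x, y in data) / len(data)

gap = abs(En(f1) - En(f2))
print(gap, sup_dist)   # gap <= L * ||f1 - f2||_inf with L = 1
```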
Example: Covering numbers
We define the covering number of H at radius η > 0 as the
cardinality of a minimal cover of H with balls of radius η:

    N(H, η) = inf{ m | H ⊆ ∪_{i=1}^m B_η(h_i),  h_i ∈ H }

[Figure: a cover of a set H by balls of radius η. Image credits: Lorenzo Rosasco.]

Example. If H ≅ B_R(0) is a ball of radius R in ℝ^d, then

    N(B_R(0), η) ≤ (4R/η)^d
43
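For d = 1 the covering number can be computed exactly and compared with the (4R/η)^d estimate. A minimal sketch (the floating-point guard is an implementation detail, not part of the theory):

```python
import math

def cover_interval(R, eta):
    # minimal number of intervals of radius eta (length 2*eta) covering [-R, R]
    # (the small slack guards against floating-point artifacts in the division)
    return math.ceil(R / eta - 1e-12)

def cover_bound(R, eta, d):
    # the (4R/eta)^d estimate for a ball of radius R in d dimensions
    return (4 * R / eta) ** d

for eta in (0.5, 0.1, 0.01):
    print(eta, cover_interval(1.0, eta), cover_bound(1.0, eta, d=1))
```

As expected, the exact count R/η for the interval sits well below the generic estimate, which trades tightness for validity in every dimension.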
Example: Covering numbers (continued)
Note that the resolution η is related to the precision in the
sense of a distance between two hypothesis. We can apply the
same reasoning used for threshold functions
    E[ sup_{f∈H} |E_n(f) − E(f)| ] ≤ 2Lη + N(H, η) √( V_H / n )

For η → 0 the covering number N(H, η) → +∞, while for fixed η
the second term vanishes as n → +∞.
It is typically possible to choose an η(n) → 0 for which the
whole bound tends to zero as n → +∞.
(Exercise. Find such an η(n) for H a ball of radius R.)
44
Example: Covering numbers (continued)
The same argument can be reproduced for bounds in probability,
namely for any δ ∈ (0, 1],

    sup_{f∈H} |E_n(f) − E(f)| ≤ 2Lη + √( 2M² log(2 N(H, η)/δ) / n )

holds with probability at least 1 − δ.
45
Complexity Measures
In general, the error
sup |En (f ) − E(f )|
f ∈Hγ
Can be controlled via capacity/complexity measures:
• Covering numbers,
• combinatorial dimension, e.g. VC-dimension, fat-shattering
dimension
• Rademacher complexities
• Gaussian complexities
• ...
46
Prototypical Results
A prototypical result (under suitable assumptions, e.g.
regularity of f∗ ):
    E(f_{γ,n}) − E(f_*) ≤ [E(f_{γ,n}) − E(f_γ)] + [E(f_γ) − E(f_*)]
                            ≲ γ^β n^{−α}           ≲ γ^{−τ}
                            (Variance)             (Bias)
47
Choosing γ(n) in practice
The best γ(n) depends on the unknown distribution ρ. So how
can we choose such a parameter in practice?
Problem known as model selection. Possible approaches:
• Cross validation,
• complexity regularization/structural risk minimization,
• balancing principles.
• ...
48
Abstract Regularization
We got a new perspective on the concept of regularization:
controlling the expressiveness of the hypotheses space
according to the number of training examples in order to
guarantee good prediction performance and consistency.
There are many ways to implement this strategy in practice:
• Tikhonov (and Ivanov) regularization
• Spectral filtering
• Early stopping
• Random sampling
• ...
49
Wrapping Up
Building on a few (reasonable?) assumptions on the learning
problem, we:
• Have shown how the empirical risk can be a proxy for the
expected risk.
• Identified the main reasons behind overfitting and
discussed how to counteract it (in a more principled way!).
• Highlighted the key role played by the choice of the
hypotheses space and how its "complexity" affects
performance.
Next class we will focus on one such measure of complexity
and derive upper bounds on the generalization error.
50
Recommended Reading
Chapters 4, 5 and 6 of Shalev-Shwartz, Shai, and Shai
Ben-David. Understanding Machine Learning: From Theory to
Algorithms. Cambridge University Press, 2014.
51