CS-E4715 Supervised Machine Learning
Lecture 3: Learning with infinite hypothesis classes
Recall: PAC learnability
• A class C is PAC-learnable if there exists an algorithm A that, given
a training sample S, outputs a hypothesis hS whose generalization
error satisfies
Pr(R(hS) ≤ ε) ≥ 1 − δ
• for any distribution D, for arbitrary ε, δ > 0, and a sample size m = |S|
that grows at most polynomially in 1/ε and 1/δ
Recall: PAC learning of a finite hypothesis class
• Sample complexity bound relying on the size of the hypothesis class
(Mohri et al., 2018): Pr(R(hS) ≤ ε) ≥ 1 − δ if
m ≥ (1/ε) (log(|H|) + log(1/δ))
• An equivalent generalization error bound:
R(h) ≤ (1/m) (log(|H|) + log(1/δ))
• Holds for any finite hypothesis class assuming there is a consistent
hypothesis, one with zero empirical risk
• The extra term compared to the rectangle learning example is
(1/ε) log(|H|)
• The more hypotheses there are in H, the more training examples are
needed
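As a quick numerical illustration, the sample complexity bound can be evaluated directly; a minimal sketch in Python (the hypothesis class used below, Boolean monomials over 10 variables with |H| = 3^10, is an assumed example):

import math

def sample_complexity(hypothesis_count, eps, delta):
    # m >= (1/eps) * (log|H| + log(1/delta)) for a consistent learner
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / eps)

# Assumed example: Boolean monomials over 10 variables, |H| = 3^10
print(sample_complexity(3 ** 10, eps=0.05, delta=0.05))   # -> 280 examples suffice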
Learning with infinite hypothesis classes
• The size of the hypothesis class is a useful measure of complexity for
finite hypothesis classes (e.g. Boolean formulae)
• However, most classifiers used in practice rely on infinite hypothesis
classes, e.g.
• H = axis-aligned rectangles in R2 (the example last lecture)
• H = hyperplanes in Rd (e.g. Support vector machines)
• H = neural networks with continuous input variables
• Need better tools to analyze these cases
Vapnik-Chervonenkis dimension
Intuition
• VC dimension measures the capacity of a hypothesis class to adapt
to different concepts
• It can be understood through the following thought experiment:
• Pick a fixed hypothesis class H, e.g. axis-aligned rectangles in R2
• Let us enumerate all possible labelings of a training set of size m:
Y^m = {y1, y2, . . . , y_{2^m}}, where yj = (y_{j1}, . . . , y_{jm}) and y_{ji} ∈ {0, 1} is
the label of the i'th example in the j'th labeling
• We are allowed to freely choose a distribution D generating the
inputs and to generate the input data x1, . . . , xm
• VCdim(H) = the size of the largest training set for which we can find
a consistent classifier for every labeling in Y^m
• Intuitively:
• low VCdim =⇒ easy to learn, low sample complexity
• high VCdim =⇒ hard to learn, high sample complexity
• infinite VCdim =⇒ cannot learn in PAC framework
Shattering
• The underlying concept in VC dimension is shattering
• Given a set of points S = {x1 , . . . , xm } and a fixed class of functions
H
• H is said to shatter S if, for every possible partition of S into a positive
subset S+ and a negative subset S−, we can find a hypothesis h ∈ H for
which h(x) = 1 if and only if x ∈ S+
Figure source:
https://datascience.stackexchange.com
How to show that VCdim(H) = d
• How to show that VCdim(H) = d for a hypothesis class
• We need to show two facts:
1. There exists a set of inputs of size d that can be shattered by
hypotheses in H (i.e. we may pick the set of inputs any way we like):
VCdim(H) ≥ d
2. There does not exist any set of inputs of size d + 1 that can be
shattered (i.e. we need to show a general property): VCdim(H) < d + 1
Example: intervals on a real line
• Let the hypothesis class be intervals in R
• Each hypothesis is defined by two parameters bh, eh ∈ R, the
beginning and end of the interval: h(x) = 1_{bh ≤ x ≤ eh}
• We can shatter any set of two points by changing the end points of
the interval:
• We cannot shatter any three-point set, as the middle point cannot be
excluded while the left-hand and right-hand points are included
We conclude that the VC dimension of intervals on the real line is 2
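The interval argument can also be verified by brute force; a minimal sketch in Python (restricting candidate interval endpoints to midpoints between the data points plus two outer sentinels, which suffices for this class):

from itertools import product

def interval_shatters(points):
    # Candidate endpoints: midpoints between sorted points plus outer sentinels
    pts = sorted(points)
    cands = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    for labels in product([0, 1], repeat=len(pts)):
        # Search for an interval [b, e] that realizes this labeling
        ok = any(all((b <= x <= e) == bool(y) for x, y in zip(pts, labels))
                 for b in cands for e in cands if b <= e)
        if not ok:
            return False
    return True

print(interval_shatters([0.0, 1.0]))        # True  -> VCdim >= 2
print(interval_shatters([0.0, 1.0, 2.0]))   # False -> the labeling (1, 0, 1) fails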
Lines in R2
• A hypothesis class of lines h(x) = ax + b shatters a set of three
points in R2.
• We conclude that VC dimension is ≥ 3
Lines in R2
Four points cannot be shattered by lines in R2 :
• There are only two possible configurations of four points in general
position in R2:
1. All four points lie on the boundary of their convex hull
2. Three points form the convex hull and one lies in its interior
• In the first case (left), we cannot draw a line separating the top and
bottom points from the left-hand and right-hand side points
• In the second case, we cannot separate the interior point from the
points on the boundary of the convex hull with a line
• The two cases are sufficient to show that VCdim = 3
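The first case can be checked numerically with a small linear feasibility program (a minimal sketch, assuming NumPy and SciPy are available; it tests separability by an arbitrary affine separator w·x + b):

import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    # Feasibility LP: find (w1, w2, b) with y_i * (w . x_i + b) >= 1 for all i
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)                    # labels in {-1, +1}
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * 3)
    return res.success

square = [(0, 0), (1, 1), (0, 1), (1, 0)]
print(linearly_separable(square, [+1, +1, -1, -1]))   # False: the "diagonal" labeling
print(linearly_separable(square, [+1, -1, -1, +1]))   # True: bottom vs. top split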
VC-dimension of axis-aligned rectangles
• With axis-aligned rectangles we can shatter a set of four points
(the picture shows 4 of the 16 configurations)
• This implies VCdim(H) ≥ 4
VC-dimension of axis-aligned rectangles
• For five distinct points, consider the minimum bounding box of the
points
• There are two possible configurations:
1. There are one or more points in the interior of the box: then one
cannot include the points on the boundary and exclude the points in
the interior
2. At least one of the edges contains two points: in this case we can
pick either of the two points and verify that this point cannot be
excluded while all the other points are included
• Thus, by the two cases, we have established that VCdim(H) = 4
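The same kind of brute-force check works for axis-aligned rectangles; a minimal sketch (candidate rectangle sides are taken just outside, between, and beyond the data coordinates, which suffices for these point sets):

from itertools import product

def rectangle_shatters(points):
    xs, ys = sorted({p[0] for p in points}), sorted({p[1] for p in points})
    cx = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    cy = [ys[0] - 1] + [(a + b) / 2 for a, b in zip(ys, ys[1:])] + [ys[-1] + 1]
    for labels in product([0, 1], repeat=len(points)):
        found = any(
            all((x1 <= px <= x2 and y1 <= py <= y2) == bool(l)
                for (px, py), l in zip(points, labels))
            for x1 in cx for x2 in cx for y1 in cy for y2 in cy
            if x1 <= x2 and y1 <= y2)
        if not found:
            return False
    return True

diamond = [(0, 1), (2, 1), (1, 0), (1, 2)]     # one point on each side of the bounding box
print(rectangle_shatters(diamond))             # True  -> VCdim >= 4
print(rectangle_shatters(diamond + [(1, 1)]))  # False -> the interior-point case above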
Vapnik-Chervonenkis dimension formally
• Formally, VCdim(H) is defined through the growth function
ΠH(m) = max over sets {x1, . . . , xm} ⊆ X of |{(h(x1), . . . , h(xm)) : h ∈ H}|
• The growth function gives the maximum number of distinct labelings
that the hypothesis class H can produce on any set of m input points
• The maximum possible value of the growth function is 2^m for a set of
m examples
• The Vapnik-Chervonenkis dimension is then
VCdim(H) = max{m | ΠH(m) = 2^m}
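For simple classes the growth function can be computed by brute force; a minimal sketch for intervals on the real line (placing the points at 1, . . . , m, which loses no generality for this class):

def interval_growth(m):
    # Number of distinct labelings of m points on a line induced by intervals
    pts = list(range(1, m + 1))
    cands = [i + 0.5 for i in range(m + 1)]           # boundaries between the points
    labelings = {tuple(int(b <= x <= e) for x in pts)
                 for b in cands for e in cands if b <= e}
    return len(labelings)

for m in range(1, 6):
    print(m, interval_growth(m), 2 ** m)
# Growth: 2, 4, 7, 11, 16 versus 2^m: 2, 4, 8, 16, 32 -- the growth function
# falls below 2^m from m = 3 onwards, matching VCdim = 2 for intervals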
Visualization
• The ratio of the growth function ΠH (m) to the maximum number of
labelings of a set of size m is shown
• Hypothesis class is 20-dimensional hyperplanes (VC dimension = 21)
VC dimension of finite hypothesis classes
• Any finite hypothesis class has VC dimension VCdim(H) ≤ log2 |H|
• To see this:
• Consider a set of m examples S = {x1 , . . . , xm }
• This set can be labeled in 2^m different ways, by choosing the labels
yi ∈ {0, 1} independently
• Each hypothesis h ∈ H fixes one labeling, a length-m binary
vector y(h, S) = (h(x1), . . . , h(xm))
• All hypotheses in H together can provide at most |H| different
labelings in total (different vectors y(h, S), h ∈ H)
• If |H| < 2^m we cannot shatter S =⇒ we cannot shatter any set of
size m > log2 |H|
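A small numeric sanity check of this bound (the class sizes below are assumed counts, used only for illustration):

import math

# Boolean monomials over d variables: |H| = 3^d (each variable appears
# positively, negatively, or not at all) -- an assumed count for illustration
for d in (5, 10, 20):
    print(d, math.log2(3 ** d))   # bound of about 1.58*d, consistent with VCdim = d on the next slide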
VC dimension: Further examples
Examples of classes with a finite VC dimension:
• convex d-polygons in R2: VCdim = 2d + 1 (for comparison, general
rectangles not restricted to be axis-aligned have VCdim = 5)
• hyperplanes in Rd: VCdim = d + 1 (e.g. a single neural unit, linear
SVM)
• neural networks with the sign activation function: VCdim of order
|E| log |E|, where E is the set of edges in the network
• Boolean monomials of d variables: VCdim = d
• arbitrary Boolean formulae of d variables: VCdim = 2^d
Half-time poll: VC dimension of threshold functions in R
Consider the hypothesis class H = {hθ} of threshold functions
hθ : R → {0, 1}, θ ∈ R:
hθ(x) = 1 if x > θ, and hθ(x) = 0 otherwise
What is the VC dimension of this hypothesis class?
1. VCdim = 1
2. VCdim = 2
3. VCdim = ∞
Answer the poll in MyCourses by 11:15: go to the Lectures page and scroll
down to "Lecture 3 poll".
Answers are anonymous and do not affect grading of the course.
Convex polygons have VC dimension = ∞
• Let our hypothesis class be convex polygons in R2 without any
restriction on the number of vertices d
• Let us draw an arbitrary circle in R2 - the distribution D will be
concentrated on the circumference of the circle
• This is a difficult distribution for learning polygons - we choose it on
purpose
Convex polygons have VC dimension = ∞
• Let us consider a set of m points with arbitrary binary labels
• For any m, let us position m points on the circumference of the
circle
• simulating drawing the inputs from the distribution D
Convex polygons have VC dimension = ∞
• Start from an arbitrary positive point (red circles)
• Traverse the circumference clockwise, skipping all negative points
and stopping at positive points
Convex polygons have VC dimension = ∞
• Connect adjacent positive points with an edge
• This forms a p-polygon inside the circle, where p is the number of
positive data points
Convex polygons have VC dimension = ∞
• Define h(x) = 1 for points inside the
polygon and h(x) = 0 outside
• Each of the 2^m labelings of m
examples gives us a p-polygon that
includes the p positive points in that
labeling and excludes the negative
points =⇒ we can shatter a set of
size m: VCdim(H) ≥ m
• Since m was arbitrary, we can grow it
without limit: VCdim(H) = ∞
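The construction can be sketched in code (a minimal sketch: it places m points on the unit circle, forms the polygon of the positive points, and classifies with a standard point-in-convex-polygon test; floating-point issues are ignored for moderate m):

import math, random

def shatter_on_circle(labels):
    m = len(labels)
    pts = [(math.cos(2 * math.pi * i / m), math.sin(2 * math.pi * i / m))
           for i in range(m)]
    poly = [p for p, l in zip(pts, labels) if l == 1]    # vertices of the p-polygon

    def h(x):
        if x in poly:                 # positive points are vertices of the polygon
            return 1
        if len(poly) < 3:             # 0, 1 or 2 positive points: degenerate polygon
            return 0
        # x is inside the convex polygon iff it lies on the same side of every edge
        cross = [(b[0] - a[0]) * (x[1] - a[1]) - (b[1] - a[1]) * (x[0] - a[0])
                 for a, b in zip(poly, poly[1:] + poly[:1])]
        return int(all(c >= 0 for c in cross) or all(c <= 0 for c in cross))

    return [h(p) for p in pts]

labels = [random.randint(0, 1) for _ in range(12)]
print(labels == shatter_on_circle(labels))   # True: the polygon realizes the labeling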
Generalization bound based on the VC-dimension
• (Mohri et al., 2018) Let H be a family of functions taking values in
{−1, +1} with VC-dimension d. Then for any δ > 0, with
probability at least 1 − δ the following holds for all h ∈ H:
R(h) ≤ R̂(h) + √( 2 log(em/d) / (m/d) ) + √( log(1/δ) / (2m) )
• e ≈ 2.71828 is the base of the natural logarithm
• The bound reveals that the critical quantity is m/d, i.e. the number
of examples divided by the VC-dimension
• A manifestation of the Occam's razor principle: to justify an increase
in complexity, we need a corresponding increase in the amount of data
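A minimal sketch evaluating the bound numerically (the empirical risk 0.05 and VC dimension d = 21, as for the hyperplanes above, are assumed example values):

import math

def vc_bound(emp_risk, m, d, delta):
    # R(h) <= R_hat(h) + sqrt(2 log(em/d) / (m/d)) + sqrt(log(1/delta) / (2m))
    return (emp_risk
            + math.sqrt(2 * math.log(math.e * m / d) / (m / d))
            + math.sqrt(math.log(1 / delta) / (2 * m)))

for m in (100, 1000, 10000, 100000):
    print(m, round(vc_bound(emp_risk=0.05, m=m, d=21, delta=0.05), 3))
# The bound becomes informative (< 0.5) only once m/d is large enough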
Rademacher complexity
Experiment: how well does your hypothesis class fit noise?
• Consider a set of training examples S0 = {(xi, yi)}_{i=1}^m
• Generate M new datasets S1, . . . , SM from S0 by randomly drawing
a new label σi^k ∈ Y for each training example in S0:
Sk = {(xi, σi^k)}_{i=1}^m
• Train a classifier hk minimizing the empirical risk on training set Sk,
and record its empirical risk
R̂(hk) = (1/m) Σ_{i=1}^m 1{hk(xi) ≠ σi^k}
• Compute the average empirical risk over all the datasets:
ε̄ = (1/M) Σ_{k=1}^M R̂(hk)
Experiment: how well does your hypothesis class fit noise?
• Observe the quantity
R̂ = 1/2 − ε̄
• We have R̂ = 0 when ε̄ = 0.5, that is, when the predictions
correspond to random coin flips (0.5 probability of predicting either
class)
• We have R̂ = 0.5 when ε̄ = 0, that is, when all hypotheses
hk, k = 1, . . . , M have zero empirical error (perfect fit to noise, not
good!)
• Intuitively, we would like our hypothesis class
• to be able to separate noise from signal - to have low R̂
• to have low empirical error on the real data - otherwise it is impossible
to obtain low generalization error
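A minimal sketch of this experiment, using one-dimensional decision stumps (thresholds with either orientation) as the hypothesis class and brute-force empirical risk minimization; the class and the data are chosen here purely for illustration:

import random

def erm_stump_risk(xs, ys):
    # Empirical risk of the best stump h(x) = [x > theta] (either orientation),
    # found by brute force over candidate thresholds
    m, best = len(xs), 1.0
    for theta in sorted(xs) + [min(xs) - 1]:
        for flip in (False, True):
            pred = [int((x > theta) != flip) for x in xs]
            best = min(best, sum(p != y for p, y in zip(pred, ys)) / m)
    return best

random.seed(0)
m, M = 50, 200
xs = [random.random() for _ in range(m)]
risks = [erm_stump_risk(xs, [random.randint(0, 1) for _ in range(m)]) for _ in range(M)]
avg = sum(risks) / M
print(avg, 0.5 - avg)   # average empirical risk on noise, and the quantity R_hat = 1/2 - avg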
Rademacher complexity
• Rademacher complexity defines complexity as the capacity of a
hypothesis class to fit random noise
• For binary classification with labels Y = {−1, +1}, the empirical
Rademacher complexity can be defined as
R̂S(H) = (1/2) Eσ [ sup_{h∈H} (1/m) Σ_{i=1}^m σi h(xi) ]
• σi ∈ {−1, +1} are Rademacher random variables, drawn
independently from the uniform distribution (i.e. Pr{σi = 1} = 0.5)
• The expression inside the expectation is the highest correlation, over
all hypotheses h ∈ H, between the random labels σi and the
predicted labels h(xi)
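The expectation over σ can be estimated by Monte Carlo whenever the supremum can be computed; a minimal sketch, reusing the one-dimensional stumps from above with {−1, +1} outputs:

import random

def stump_predictions(xs):
    # All {-1, +1} labelings of xs realizable by a threshold, in either orientation
    preds = set()
    for theta in sorted(xs) + [min(xs) - 1]:
        base = tuple(1 if x > theta else -1 for x in xs)
        preds.add(base)
        preds.add(tuple(-p for p in base))
    return list(preds)

def empirical_rademacher(xs, n_draws=2000, seed=0):
    rng = random.Random(seed)
    m, hs, total = len(xs), stump_predictions(xs), 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        # sup over h in H of (1/m) * sum_i sigma_i h(x_i)
        total += max(sum(s * p for s, p in zip(sigma, h)) / m for h in hs)
    return 0.5 * total / n_draws        # keeps the 1/2 factor of the definition above

data_rng = random.Random(1)
xs = [data_rng.random() for _ in range(50)]
print(empirical_rademacher(xs))         # well below 1/2 for this simple class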
Rademacher complexity
R̂S(H) = (1/2) Eσ [ sup_{h∈H} (1/m) Σ_{i=1}^m σi h(xi) ]
• Let us rewrite R̂S(H) in terms of the empirical error
• Note that with labels Y = {+1, −1},
σi h(xi) = 1 if σi = h(xi), and σi h(xi) = −1 if σi ≠ h(xi)
• Thus, writing ε̂(h) for the empirical error of h on the randomly labeled data,
(1/m) Σ_{i=1}^m σi h(xi) = (1/m) ( Σ_i 1{h(xi) = σi} − Σ_i 1{h(xi) ≠ σi} )
= (1/m) ( m − 2 Σ_i 1{h(xi) ≠ σi} ) = 1 − 2 ε̂(h)
Rademacher complexity
• Plugging this in:
R̂S(H) = (1/2) Eσ [ sup_{h∈H} (1 − 2 ε̂(h)) ]
= (1/2) ( 1 − 2 Eσ [ inf_{h∈H} ε̂(h) ] ) = 1/2 − Eσ [ inf_{h∈H} ε̂(h) ]
• Now we have expressed the empirical Rademacher complexity in
terms of expected empirical error of classifying randomly labeled data
• But how does the Rademacher complexity help in model selection?
• We need to relate it to generalization error
Generalization bound with Rademacher complexity
(Mohri et al. 2018): For any δ > 0, with probability at least 1 − δ over a
sample drawn from an unknown distribution D, for any h ∈ H we have:
R(h) ≤ R̂S(h) + R̂S(H) + 3 √( log(2/δ) / (2m) )
The bound is composed of the sum of:
• The empirical risk of h on the training data S (with the original
labels): R̂S (h)
• The empirical Rademacher complexity: R̂S (H)
• A term that tends to zero as a function of the size of the training
data, as O(1/√m) for constant δ
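A minimal sketch that assembles the bound from its three parts (all numbers below are assumed placeholders):

import math

def rademacher_bound(emp_risk, emp_rad, m, delta):
    # R(h) <= R_hat_S(h) + R_hat_S(H) + 3 * sqrt(log(2/delta) / (2m))
    return emp_risk + emp_rad + 3 * math.sqrt(math.log(2 / delta) / (2 * m))

print(rademacher_bound(emp_risk=0.10, emp_rad=0.15, m=500, delta=0.05))   # about 0.43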
Example: Rademacher and VC bounds on a real dataset
• Prediction of protein
subcellular localization
• 10-500 training examples,
172 test examples
• Comparing Rademacher and
VC bounds using δ = 0.05
• Training and test error also
shown
Example: Rademacher and VC bounds on a real dataset
• Rademacher bound is sharper
than the VC bound
• The VC bound is not yet informative
with 500 examples (it remains above
0.5 at δ = 0.05)
• The gap between the mean
of the error distribution (≈
test error) and the 0.05
probability tail (VC and
Rademacher bounds) is
evident (and expected)
Rademacher vs. VC
Note the differences between Rademacher complexity and VC dimension
• VC dimension is independent of any training sample or distribution
generating the data: it measures the worst case, where the data is
generated in a way that is bad for the learner
• Rademacher complexity depends on the training sample and is thus
dependent on the data-generating distribution
• VC dimension focuses on the extreme case of realizing all labelings of
the data
• Rademacher complexity measures smoothly the ability to realize
random labelings
Rademacher vs. VC
• Generalization bounds based on Rademacher complexity are
applicable to any binary classifier (SVM, neural network, decision
tree)
• Rademacher complexity motivates state-of-the-art learning algorithms
such as support vector machines
• But computing it might be hard, if we need to train a large number
of classifiers
• Vapnik-Chervonenkis dimension (VCdim) is an alternative that is
usually easier to derive analytically
Summary: Statistical learning theory
• Statistical learning theory focuses on analyzing the generalization
ability of learning algorithms
• The Probably Approximately Correct (PAC) framework is the most
studied theoretical framework; it asks for the generalization error to be
bounded by ε with high probability (1 − δ), for arbitrary error level
ε > 0 and confidence parameter δ > 0
• The Vapnik-Chervonenkis dimension lets us study learnability of
infinite hypothesis classes through the concept of shattering
• Rademacher complexity is a practical alternative to VC dimension,
giving typically sharper bounds (but requires a lot of simulations to
be run)