CS-E4715 Supervised Machine Learning
Lecture 3: Learning with infinite hypothesis classes
Recall: PAC learnability
• A class C is PAC-learnable if there exists an algorithm A that, given
a training sample S, outputs a hypothesis hS whose generalization
error satisfies
Pr(R(hS) ≤ ε) ≥ 1 − δ
• for any distribution D, for arbitrary ε, δ > 0, and a sample size m = |S|
that grows at most polynomially in 1/ε and 1/δ
Recall: PAC learning of a finite hypothesis class
• Sample complexity bound relying on the size of the hypothesis class
(Mohri et al., 2018): Pr(R(hS) ≤ ε) ≥ 1 − δ if
m ≥ (1/ε) (log(|H|) + log(1/δ))
• An equivalent generalization error bound:
R(h) ≤ (1/m) (log(|H|) + log(1/δ))
• Holds for any finite hypothesis class assuming there is a consistent
hypothesis, one with zero empirical risk
• The extra term compared to the rectangle learning example is
(1/ε) log(|H|)
• The more hypotheses there are in H, the more training examples are
needed
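As a quick numerical illustration, the sample complexity bound can be evaluated directly; a minimal sketch in Python (the hypothesis class used below, Boolean monomials over 10 variables with |H| = 3^10, is an assumed example):

import math

def sample_complexity(hypothesis_count, eps, delta):
    # m >= (1/eps) * (log|H| + log(1/delta)) for a consistent learner
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / eps)

# Assumed example: Boolean monomials over 10 variables, |H| = 3^10
print(sample_complexity(3 ** 10, eps=0.05, delta=0.05))   # -> 280 examples suffice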
Learning with infinite hypothesis classes
• The size of the hypothesis class is a useful measure of complexity for
finite hypothesis classes (e.g. Boolean formulae)
• However, most classifiers used in practice rely on infinite hypothesis
classes, e.g.
• H = axis-aligned rectangles in R2 (the example last lecture)
• H = hyperplanes in Rd (e.g. Support vector machines)
• H = neural networks with continuous input variables
• Need better tools to analyze these cases
Vapnik-Chervonenkis dimension
Intuition
• VC dimension measures the capacity of a hypothesis class to adapt
to different concepts
• It can be understood through the following thought experiment:
• Pick a fixed hypothesis class H, e.g. axis-aligned rectangles in R2
• Let us enumerate all possible labelings of a training set of size m:
Y^m = {y1, y2, . . . , y_{2^m}}, where yj = (y_{j1}, . . . , y_{jm}) and y_{ji} ∈ {0, 1} is
the label of the i'th example in the j'th labeling
• We are allowed to freely choose a distribution D generating the
inputs and to generate the input data x1, . . . , xm
• VCdim(H) = the size of the largest training set for which we can find
a consistent classifier for every labeling in Y^m
• Intuitively:
• low VCdim =⇒ easy to learn, low sample complexity
• high VCdim =⇒ hard to learn, high sample complexity
• infinite VCdim =⇒ cannot learn in PAC framework
Shattering
• The underlying concept in VC dimension is shattering
• Given a set of points S = {x1 , . . . , xm } and a fixed class of functions
H
• H is said to shatter S if, for every possible partition of S into a positive
subset S+ and a negative subset S−, we can find a hypothesis h ∈ H for
which h(x) = 1 if and only if x ∈ S+
Figure source:
https://datascience.stackexchange.com
How to show that VCdim(H) = d
• How to show that VCdim(H) = d for a hypothesis class
• We need to show two facts:
1. There exists a set of inputs of size d that can be shattered by
hypotheses in H (i.e. we may pick the set of inputs any way we like):
VCdim(H) ≥ d
2. There does not exist any set of inputs of size d + 1 that can be
shattered (i.e. we need to show a general property): VCdim(H) < d + 1
Example: intervals on a real line
• Let the hypothesis class be intervals in R
• Each hypothesis is defined by two parameters bh, eh ∈ R, the
beginning and end of the interval: h(x) = 1_{bh ≤ x ≤ eh}
• We can shatter any set of two points by changing the end points of
the interval:
• We cannot shatter any three-point set, as the middle point cannot be
excluded while the left-hand and right-hand points are included
We conclude that the VC dimension of intervals on the real line is 2
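The interval argument can also be verified by brute force; a minimal sketch in Python (restricting candidate interval endpoints to midpoints between the data points plus two outer sentinels, which suffices for this class):

from itertools import product

def interval_shatters(points):
    # Candidate endpoints: midpoints between sorted points plus outer sentinels
    pts = sorted(points)
    cands = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    for labels in product([0, 1], repeat=len(pts)):
        # Search for an interval [b, e] that realizes this labeling
        ok = any(all((b <= x <= e) == bool(y) for x, y in zip(pts, labels))
                 for b in cands for e in cands if b <= e)
        if not ok:
            return False
    return True

print(interval_shatters([0.0, 1.0]))        # True  -> VCdim >= 2
print(interval_shatters([0.0, 1.0, 2.0]))   # False -> the labeling (1, 0, 1) fails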
Lines in R2
• A hypothesis class of lines h(x) = ax + b shatters a set of three
points in R2.
• We conclude that VC dimension is ≥ 3
Lines in R2
Four points cannot be shattered by lines in R2 :
• There are only two possible configurations of four points in general
position in R2:
1. All four points lie on the boundary of their convex hull
2. Three points form the convex hull and one lies in its interior
• In the first case (left), we cannot draw a line separating the top and
bottom points from the left-hand and right-hand side points
• In the second case, we cannot separate the interior point from the
points on the boundary of the convex hull with a line
• The two cases are sufficient to show that VCdim = 3
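The first case can be checked numerically with a small linear feasibility program (a minimal sketch, assuming NumPy and SciPy are available; it tests separability by an arbitrary affine separator w·x + b):

import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    # Feasibility LP: find (w1, w2, b) with y_i * (w . x_i + b) >= 1 for all i
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)                    # labels in {-1, +1}
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * 3)
    return res.success

square = [(0, 0), (1, 1), (0, 1), (1, 0)]
print(linearly_separable(square, [+1, +1, -1, -1]))   # False: the "diagonal" labeling
print(linearly_separable(square, [+1, -1, -1, +1]))   # True: bottom vs. top split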
VC-dimension of axis-aligned rectangles
• With axis-aligned rectangles we can shatter a set of four points
(the picture shows 4 of the 16 configurations)
• This implies VCdim(H) ≥ 4
VC-dimension of axis-aligned rectangles
• For five distinct points, consider the minimum bounding box of the
points
• There are two possible configurations:
1. There are one or more points in the interior of the box: then one
cannot include the points on the boundary and exclude the points in
the interior
2. At least one of the edges contains two points: in this case we can
pick either of the two points and verify that this point cannot be
excluded while all the other points are included
• Thus, by the two cases, we have established that VCdim(H) = 4
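The same kind of brute-force check works for axis-aligned rectangles; a minimal sketch (candidate rectangle sides are taken just outside, between, and beyond the data coordinates, which suffices for these point sets):

from itertools import product

def rectangle_shatters(points):
    xs, ys = sorted({p[0] for p in points}), sorted({p[1] for p in points})
    cx = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    cy = [ys[0] - 1] + [(a + b) / 2 for a, b in zip(ys, ys[1:])] + [ys[-1] + 1]
    for labels in product([0, 1], repeat=len(points)):
        found = any(
            all((x1 <= px <= x2 and y1 <= py <= y2) == bool(l)
                for (px, py), l in zip(points, labels))
            for x1 in cx for x2 in cx for y1 in cy for y2 in cy
            if x1 <= x2 and y1 <= y2)
        if not found:
            return False
    return True

diamond = [(0, 1), (2, 1), (1, 0), (1, 2)]     # one point on each side of the bounding box
print(rectangle_shatters(diamond))             # True  -> VCdim >= 4
print(rectangle_shatters(diamond + [(1, 1)]))  # False -> the interior-point case above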
Vapnik-Chervonenkis dimension formally
• Formally, VCdim(H) is defined through the growth function
ΠH(m) = max over sets {x1, . . . , xm} ⊆ X of |{(h(x1), . . . , h(xm)) : h ∈ H}|
• The growth function gives the maximum number of distinct labelings
that the hypothesis class H can produce on any set of m input points
• The maximum possible value of the growth function is 2^m for a set of
m examples
• The Vapnik-Chervonenkis dimension is then
VCdim(H) = max{m | ΠH(m) = 2^m}
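For simple classes the growth function can be computed by brute force; a minimal sketch for intervals on the real line (placing the points at 1, . . . , m, which loses no generality for this class):

def interval_growth(m):
    # Number of distinct labelings of m points on a line induced by intervals
    pts = list(range(1, m + 1))
    cands = [i + 0.5 for i in range(m + 1)]           # boundaries between the points
    labelings = {tuple(int(b <= x <= e) for x in pts)
                 for b in cands for e in cands if b <= e}
    return len(labelings)

for m in range(1, 6):
    print(m, interval_growth(m), 2 ** m)
# Growth: 2, 4, 7, 11, 16 versus 2^m: 2, 4, 8, 16, 32 -- the growth function
# falls below 2^m from m = 3 onwards, matching VCdim = 2 for intervals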
Visualization
• The ratio of the growth function ΠH (m) to the maximum number of
labelings of a set of size m is shown
• Hypothesis class is 20-dimensional hyperplanes (VC dimension = 21)
VC dimension of finite hypothesis classes
• Any finite hypothesis class has VC dimension VCdim(H) ≤ log2 |H|
• To see this:
• Consider a set of m examples S = {x1 , . . . , xm }
• This set can be labeled in 2^m different ways, by choosing the labels
yi ∈ {0, 1} independently
• Each hypothesis h ∈ H fixes one labeling, a length-m binary
vector y(h, S) = (h(x1), . . . , h(xm))
• All hypotheses in H together can provide at most |H| different
labelings in total (different vectors y(h, S), h ∈ H)
• If |H| < 2^m we cannot shatter S =⇒ we cannot shatter any set of
size m > log2 |H|
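A small numeric sanity check of this bound (the class sizes below are assumed counts, used only for illustration):

import math

# Boolean monomials over d variables: |H| = 3^d (each variable appears
# positively, negatively, or not at all) -- an assumed count for illustration
for d in (5, 10, 20):
    print(d, math.log2(3 ** d))   # bound of about 1.58*d, consistent with VCdim = d on the next slide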
VC dimension: Further examples
Examples of classes with a finite VC dimension:
• convex d-polygons in R2: VCdim = 2d + 1 (for comparison, general
rectangles not restricted to be axis-aligned have VCdim = 5)
• hyperplanes in Rd: VCdim = d + 1 (e.g. a single neural unit, linear
SVM)
• neural networks with the sign activation function: VCdim of order
|E| log |E|, where E is the set of edges in the network
• Boolean monomials of d variables: VCdim = d
• arbitrary Boolean formulae of d variables: VCdim = 2^d
Half-time poll: VC dimension of threshold functions in R
Consider the hypothesis class H = {hθ} of threshold functions
hθ : R → {0, 1}, θ ∈ R:
hθ(x) = 1 if x > θ, and hθ(x) = 0 otherwise
What is the VC dimension of this hypothesis class?
1. VCdim = 1
2. VCdim = 2
3. VCdim = ∞
Answer the poll in MyCourses by 11:15: go to the Lectures page and scroll
down to "Lecture 3 poll".
Answers are anonymous and do not affect grading of the course.
Convex polygons have VC dimension = ∞
• Let our hypothesis class be convex polygons in R2 without any
restriction on the number of vertices d
• Let us draw an arbitrary circle in R2 - the distribution D will be
concentrated on the circumference of the circle
• This is a difficult distribution for learning polygons - we choose it on
purpose
Convex polygons have VC dimension = ∞
• Let us consider a set of m points with arbitrary binary labels
• For any m, let us position m points on the circumference of the
circle
• simulating drawing the inputs from the distribution D
Convex polygons have VC dimension = ∞
• Start from an arbitrary positive point (red circles)
• Traverse the circumference clockwise, skipping all negative points
and stopping at positive points
Convex polygons have VC dimension = ∞
• Connect adjacent positive points with an edge
• This forms a p-polygon inside the circle, where p is the number of
positive data points
Convex polygons have VC dimension = ∞
• Define h(x) = 1 for points inside the
polygon and h(x) = 0 outside
• Each of the 2^m labelings of m
examples gives us a p-polygon that
includes the p positive points in that
labeling and excludes the negative
points =⇒ we can shatter a set of
size m: VCdim(H) ≥ m
• Since m was arbitrary, we can grow it
without limit: VCdim(H) = ∞
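The construction can be sketched in code (a minimal sketch: it places m points on the unit circle, forms the polygon of the positive points, and classifies with a standard point-in-convex-polygon test; floating-point issues are ignored for moderate m):

import math, random

def shatter_on_circle(labels):
    m = len(labels)
    pts = [(math.cos(2 * math.pi * i / m), math.sin(2 * math.pi * i / m))
           for i in range(m)]
    poly = [p for p, l in zip(pts, labels) if l == 1]    # vertices of the p-polygon

    def h(x):
        if x in poly:                 # positive points are vertices of the polygon
            return 1
        if len(poly) < 3:             # 0, 1 or 2 positive points: degenerate polygon
            return 0
        # x is inside the convex polygon iff it lies on the same side of every edge
        cross = [(b[0] - a[0]) * (x[1] - a[1]) - (b[1] - a[1]) * (x[0] - a[0])
                 for a, b in zip(poly, poly[1:] + poly[:1])]
        return int(all(c >= 0 for c in cross) or all(c <= 0 for c in cross))

    return [h(p) for p in pts]

labels = [random.randint(0, 1) for _ in range(12)]
print(labels == shatter_on_circle(labels))   # True: the polygon realizes the labeling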
Generalization bound based on the VC-dimension
• (Mohri et al., 2018) Let H be a family of functions taking values in
{−1, +1} with VC-dimension d. Then for any δ > 0, with
probability at least 1 − δ the following holds for all h ∈ H:
R(h) ≤ R̂(h) + √( 2 log(em/d) / (m/d) ) + √( log(1/δ) / (2m) )
• e ≈ 2.71828 is the base of the natural logarithm
• The bound reveals that the critical quantity is m/d, i.e. the number
of examples divided by the VC-dimension
• A manifestation of the Occam's razor principle: to justify an increase
in complexity, we need a corresponding increase in the amount of data
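A minimal sketch evaluating the bound numerically (the empirical risk 0.05 and VC dimension d = 21, as for the hyperplanes above, are assumed example values):

import math

def vc_bound(emp_risk, m, d, delta):
    # R(h) <= R_hat(h) + sqrt(2 log(em/d) / (m/d)) + sqrt(log(1/delta) / (2m))
    return (emp_risk
            + math.sqrt(2 * math.log(math.e * m / d) / (m / d))
            + math.sqrt(math.log(1 / delta) / (2 * m)))

for m in (100, 1000, 10000, 100000):
    print(m, round(vc_bound(emp_risk=0.05, m=m, d=21, delta=0.05), 3))
# The bound becomes informative (< 0.5) only once m/d is large enough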
Rademacher complexity
Experiment: how well does your hypothesis class fit noise?
• Consider a set of training examples S0 = {(xi, yi)}_{i=1}^m
• Generate M new datasets S1, . . . , SM from S0 by randomly drawing
a new label σi^k ∈ Y for each training example in S0:
Sk = {(xi, σi^k)}_{i=1}^m
• Train a classifier hk minimizing the empirical risk on training set Sk,
and record its empirical risk
R̂(hk) = (1/m) Σ_{i=1}^m 1{hk(xi) ≠ σi^k}
• Compute the average empirical risk over all the datasets:
ε̄ = (1/M) Σ_{k=1}^M R̂(hk)
Experiment: how well does your hypothesis class fit noise?
• Observe the quantity
R̂ = 1/2 − ε̄
• We have R̂ = 0 when ε̄ = 0.5, that is, when the predictions
correspond to random coin flips (0.5 probability of predicting either
class)
• We have R̂ = 0.5 when ε̄ = 0, that is, when all hypotheses
hk, k = 1, . . . , M have zero empirical error (perfect fit to noise, not
good!)
• Intuitively, we would like our hypothesis class
• to be able to separate noise from signal - to have low R̂
• to have low empirical error on the real data - otherwise it is impossible
to obtain low generalization error
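A minimal sketch of this experiment, using one-dimensional decision stumps (thresholds with either orientation) as the hypothesis class and brute-force empirical risk minimization; the class and the data are chosen here purely for illustration:

import random

def erm_stump_risk(xs, ys):
    # Empirical risk of the best stump h(x) = [x > theta] (either orientation),
    # found by brute force over candidate thresholds
    m, best = len(xs), 1.0
    for theta in sorted(xs) + [min(xs) - 1]:
        for flip in (False, True):
            pred = [int((x > theta) != flip) for x in xs]
            best = min(best, sum(p != y for p, y in zip(pred, ys)) / m)
    return best

random.seed(0)
m, M = 50, 200
xs = [random.random() for _ in range(m)]
risks = [erm_stump_risk(xs, [random.randint(0, 1) for _ in range(m)]) for _ in range(M)]
avg = sum(risks) / M
print(avg, 0.5 - avg)   # average empirical risk on noise, and the quantity R_hat = 1/2 - avg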
Rademacher complexity
• Rademacher complexity defines complexity as the capacity of a
hypothesis class to fit random noise
• For binary classification with labels Y = {−1, +1}, the empirical
Rademacher complexity can be defined as
R̂S(H) = (1/2) Eσ [ sup_{h∈H} (1/m) Σ_{i=1}^m σi h(xi) ]
• σi ∈ {−1, +1} are Rademacher random variables, drawn
independently from the uniform distribution (i.e. Pr{σi = 1} = 0.5)
• The expression inside the expectation is the highest correlation, over
all hypotheses h ∈ H, between the random labels σi and the
predicted labels h(xi)
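The expectation over σ can be estimated by Monte Carlo whenever the supremum can be computed; a minimal sketch, reusing the one-dimensional stumps from above with {−1, +1} outputs:

import random

def stump_predictions(xs):
    # All {-1, +1} labelings of xs realizable by a threshold, in either orientation
    preds = set()
    for theta in sorted(xs) + [min(xs) - 1]:
        base = tuple(1 if x > theta else -1 for x in xs)
        preds.add(base)
        preds.add(tuple(-p for p in base))
    return list(preds)

def empirical_rademacher(xs, n_draws=2000, seed=0):
    rng = random.Random(seed)
    m, hs, total = len(xs), stump_predictions(xs), 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        # sup over h in H of (1/m) * sum_i sigma_i h(x_i)
        total += max(sum(s * p for s, p in zip(sigma, h)) / m for h in hs)
    return 0.5 * total / n_draws        # keeps the 1/2 factor of the definition above

data_rng = random.Random(1)
xs = [data_rng.random() for _ in range(50)]
print(empirical_rademacher(xs))         # well below 1/2 for this simple class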
Rademacher complexity
R̂S(H) = (1/2) Eσ [ sup_{h∈H} (1/m) Σ_{i=1}^m σi h(xi) ]
• Let us rewrite R̂S(H) in terms of the empirical error
• Note that with labels Y = {+1, −1},
σi h(xi) = 1 if σi = h(xi), and σi h(xi) = −1 if σi ≠ h(xi)
• Thus, writing ε̂(h) for the empirical error of h on the randomly labeled data,
(1/m) Σ_{i=1}^m σi h(xi) = (1/m) ( Σ_i 1{h(xi) = σi} − Σ_i 1{h(xi) ≠ σi} )
= (1/m) ( m − 2 Σ_i 1{h(xi) ≠ σi} ) = 1 − 2 ε̂(h)
Rademacher complexity
• Plugging this in:
R̂S(H) = (1/2) Eσ [ sup_{h∈H} (1 − 2 ε̂(h)) ]
= (1/2) ( 1 − 2 Eσ [ inf_{h∈H} ε̂(h) ] ) = 1/2 − Eσ [ inf_{h∈H} ε̂(h) ]
• Now we have expressed the empirical Rademacher complexity in
terms of expected empirical error of classifying randomly labeled data
• But how does the Rademacher complexity help in model selection?
• We need to relate it to generalization error
Generalization bound with Rademacher complexity
(Mohri et al. 2018): For any δ > 0, with probability at least 1 − δ over a
sample drawn from an unknown distribution D, for any h ∈ H we have:
R(h) ≤ R̂S(h) + R̂S(H) + 3 √( log(2/δ) / (2m) )
The bound is composed of the sum of:
• The empirical risk of h on the training data S (with the original
labels): R̂S (h)
• The empirical Rademacher complexity: R̂S (H)
• A term that tends to zero as a function of the size of the training
data, as O(1/√m) for constant δ
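A minimal sketch that assembles the bound from its three parts (all numbers below are assumed placeholders):

import math

def rademacher_bound(emp_risk, emp_rad, m, delta):
    # R(h) <= R_hat_S(h) + R_hat_S(H) + 3 * sqrt(log(2/delta) / (2m))
    return emp_risk + emp_rad + 3 * math.sqrt(math.log(2 / delta) / (2 * m))

print(rademacher_bound(emp_risk=0.10, emp_rad=0.15, m=500, delta=0.05))   # about 0.43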
Example: Rademacher and VC bounds on a real dataset
• Prediction of protein
subcellular localization
• 10-500 training examples,
172 test examples
• Comparing Rademacher and
VC bounds using δ = 0.05
• Training and test error also
shown
Example: Rademacher and VC bounds on a real dataset
• Rademacher bound is sharper
than the VC bound
• The VC bound is not yet informative
with 500 examples (it remains above
0.5 at δ = 0.05)
• The gap between the mean
of the error distribution (≈
test error) and the 0.05
probability tail (VC and
Rademacher bounds) is
evident (and expected)
Rademacher vs. VC
Note the differences between Rademacher complexity and VC dimension
• VC dimension is independent of any training sample or distribution
generating the data: it measures the worst case, where the data is
generated in a way that is bad for the learner
• Rademacher complexity depends on the training sample and is thus
dependent on the data-generating distribution
• VC dimension focuses on the extreme case of realizing all labelings of
the data
• Rademacher complexity measures smoothly the ability to realize
random labelings
Rademacher vs. VC
• Generalization bounds based on Rademacher complexity are
applicable to any binary classifier (SVM, neural network, decision
tree)
• Rademacher complexity motivates state-of-the-art learning algorithms
such as support vector machines
• But computing it might be hard, if we need to train a large number
of classifiers
• Vapnik-Chervonenkis dimension (VCdim) is an alternative that is
usually easier to derive analytically
Summary: Statistical learning theory
• Statistical learning theory focuses on analyzing the generalization
ability of learning algorithms
• The Probably Approximately Correct (PAC) framework is the most
studied theoretical framework; it asks for the generalization error to be
bounded by ε with high probability (1 − δ), for arbitrary error level
ε > 0 and confidence parameter δ > 0
• The Vapnik-Chervonenkis dimension lets us study learnability of
infinite hypothesis classes through the concept of shattering
• Rademacher complexity is a practical alternative to VC dimension,
giving typically sharper bounds (but requires a lot of simulations to
be run)