Bounds
• Markov inequality: if X is a non-negative random variable with E(X) = µ, then, for any ε > 0, P(X ≥ ε) ≤ µ/ε.
• Chebyshev's bound: assumes finite variance σ² and bounds how far a random variable deviates from its mean: P(|X − µ| ≥ ε) ≤ σ²/ε².
• Chernoff's bound: an exponentially decreasing upper bound on the tail of a random variable. If X is a random variable, then for any a in R and any s > 0 we can write P(X ≥ a) = P(e^(sX) ≥ e^(sa)) ≤ e^(−sa) E[e^(sX)]; similarly, for s < 0, P(X ≤ a) ≤ e^(−sa) E[e^(sX)]. The minimum of all such exponential bounds forms the Chernoff or Chernoff-Cramér bound, which may decay faster than exponential (e.g. sub-Gaussian / Hoeffding). It is especially useful for sums of independent random variables, such as sums of Bernoulli random variables.
• Hoeffding's inequality: applicable to sample and bin means (how far they deviate from each other): P(|ν − µ| > ε) ≤ 2 exp(−2ε²N). The requirement is that the random variables themselves are bounded: no more tails of distributions stretched off toward infinity.
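A quick numerical comparison makes the relative tightness of these bounds concrete. The sketch below is only illustrative: the coin parameter mu, the sample size N and the tolerance eps are assumed values, not taken from the slides.

```python
import math

# Illustrative numbers (assumed): coin with mu = 0.5, sample mean nu over
# N = 400 tosses, deviation eps = 0.1.
mu, N, eps = 0.5, 400, 0.1

# Markov (nu is non-negative): P(nu >= mu + eps) <= E[nu] / (mu + eps)
markov = mu / (mu + eps)

# Chebyshev: P(|nu - mu| >= eps) <= Var(nu) / eps^2, with Var(nu) = mu(1-mu)/N
chebyshev = mu * (1 - mu) / (N * eps ** 2)

# Hoeffding (nu is a mean of variables bounded in [0, 1]):
# P(|nu - mu| > eps) <= 2 exp(-2 eps^2 N)
hoeffding = 2 * math.exp(-2 * eps ** 2 * N)

print(f"Markov    bound: {markov:.4f}")     # ~0.83 (very loose, one-sided)
print(f"Chebyshev bound: {chebyshev:.4f}")  # 0.0625
print(f"Hoeffding bound: {hoeffding:.6f}")  # ~0.00067
```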
Example
• Let µ = 0.4 (fraction of oranges in the bin). Use Hoeffding's inequality to bound the probability that a sample of 10 fruits has a fraction of oranges ν ≤ 0.1. What bound do you get?
  1. 0.67
  2. 0.4
  3. 0.33
  4. 0.05
• Set N = 10 and ε = 0.3 (= 0.4 − 0.1) and you get the answer: 2 exp(−2ε²N) ≈ 0.33, i.e. option 3.
• Also note that option 4 is the actual probability; Hoeffding gives only an upper bound to that probability.
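For completeness, the sketch below evaluates both numbers quoted in the options: the Hoeffding bound and the exact binomial probability of drawing at most one orange.

```python
import math
from math import comb

# Bin example from the slide: mu = 0.4 (fraction of oranges), N = 10 fruits.
mu, N = 0.4, 10
eps = mu - 0.1                     # nu <= 0.1 means a deviation of at least 0.3

# Two-sided Hoeffding bound
bound = 2 * math.exp(-2 * eps ** 2 * N)
print(f"Hoeffding bound  : {bound:.2f}")   # ~0.33 -> option 3

# Exact probability of drawing at most 1 orange out of 10 (i.e. nu <= 0.1)
exact = sum(comb(N, k) * mu ** k * (1 - mu) ** (N - k) for k in range(2))
print(f"Exact probability: {exact:.2f}")   # ~0.05 -> option 4
```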
Formal guarantee
• For any fixed 'h', with big data (large N), the in-sample error Ein(h) is probably close to the out-of-sample error Eout(h), within an acceptable tolerance ε:
  P[|Ein(h) − Eout(h)| > ε] ≤ 2 exp(−2ε²N)
• Also, like before:
  • Valid for all N and ε
  • Does not depend on Eout(h): no need to 'know' Eout(h); f and P can remain unknown
  • Larger sample size N or looser gap ε means a higher probability for 'ν ≈ µ'
• 'Ein(h) = Eout(h)' is probably approximately correct (PAC)
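To get a feel for how quickly this guarantee tightens, the sketch below evaluates 2 exp(−2ε²N) for a few sample sizes; the tolerance eps and the values of N are assumed for illustration.

```python
import math

eps = 0.05   # assumed tolerance on |Ein - Eout|
for N in (100, 500, 1000, 5000):
    bound = 2 * math.exp(-2 * eps ** 2 * N)
    print(f"N = {N:5d}: P[|Ein - Eout| > {eps}] <= {bound:.2e}")
# N = 100 gives a vacuous bound (> 1); by N = 5000 the bound is ~2.8e-11.
```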
Verification of one h
• For any single h, when N is large and Ein(h) ≈ Eout(h):
• Can we claim good learning (g ≈ f)?
  • Yes! If Ein(h) is small and Algorithm A has picked this h as g, then 'g ≈ f' is PAC.
  • No! If Ein(h) is not small and A was forced to pick this h as g, then 'g ≠ f' is PAC.
• Hence the need to choose the 'best' h from multiple h's: a single h may or may not give the smallest error.
• How do we now define probability bounds when multiple h's exist?
Bad events – example 1
• If you toss a fair coin 10 times, what is the probability that you
will get 10 heads?
• ~ 0.1%
• If you toss 1000 fair coins 10 times each, what is the probability
that some trial will get 10 heads? (binomial trials)
• ~ 63%
What is happening? If you try hard enough, the 'disjoint' probabilities of a bad event across many trials can add up to give a much higher chance of it happening somewhere.
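Both numbers on this slide follow from two lines of arithmetic, sketched below.

```python
p_ten_heads = 0.5 ** 10                        # one coin, 10 tosses, all heads
p_some_coin = 1 - (1 - p_ten_heads) ** 1000    # at least one of 1000 coins succeeds
print(f"{p_ten_heads:.4%}")   # ~0.0977%, i.e. ~0.1%
print(f"{p_some_coin:.1%}")   # ~62.4%,  i.e. ~63%
```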
Question
• If all of you flip a coin 4 times, what is the probability that at least one of the students gets all 4 heads for their coin 'g'? (No coin? Go digital: https://flipsimu.com/ )
• Is that coin very special?
• ~ 0.999567
• BAD sample: Ein (for that coin) and Eout (a general measure) are far apart.
• That coin is an example of a bad h.
• Objective of this exercise: an individual hypothesis may not always represent the 'expected' behaviour.
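The quoted probability is reproduced below assuming a class of about 120 students; the class size is an assumption made here only to match the number on the slide.

```python
p_four_heads = 0.5 ** 4        # one student: 4 heads in 4 flips
n_students = 120               # assumed class size (not stated on the slide)
p_at_least_one = 1 - (1 - p_four_heads) ** n_students
print(f"{p_at_least_one:.6f}") # ~0.999567
```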
Multiple bins – what is the error bound when multiple hypotheses exist?
• Generalizing the bin model to more than one hypothesis
[Figure: one bin per hypothesis h1, h2, …, hM]
• Both µ and ν depend on which hypothesis h is considered
• µm is 'out of sample', denoted by Eout(hm)
• νm is 'in sample', denoted by Ein(hm)
Hoeffding inequality for multiple bins?
Union bound for bad data – M different hypotheses
• Finite-bin version of Hoeffding: P[|Ein(g) − Eout(g)| > ε] ≤ 2 M exp(−2ε²N), valid for all M, N and ε
• Does not depend on any Eout(hm), no need to 'know' Eout(hm): f and P can stay unknown
• 'Ein(g) = Eout(g)' is PAC, regardless of A
• The 'most reasonable' A (like PLA/pocket): pick the hm with lowest Ein(hm) as g
• Also, the implication is that the more sophisticated the model (larger M), the more loosely the in-sample error will track the out-of-sample error.
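The sketch below shows how this union bound scales with M; the sample size and tolerance are assumed values chosen only to illustrate the trend.

```python
import math

N, eps = 1000, 0.05   # assumed sample size and tolerance
for M in (1, 10, 100, 1000, 10_000):
    bound = 2 * M * math.exp(-2 * eps ** 2 * N)
    print(f"M = {M:6d}: bound = {bound:.3f}")
# The bound grows linearly with M and quickly exceeds 1, i.e. becomes vacuous.
```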
Summary - Generalization
• Can in-sample error closely represent out-of-sample error?
• Answer: Yes, because of Hoeffding’s inequality
• Can we have low enough Ein?
• Yes, we have many hypotheses, and we choose the one giving lowest error
• We reject the hypothesis if the in-sample error (Ein) is large, and then change the model/hypothesis until we obtain a small Ein.
• The above two together ensure selection of a good enough 'g', and hence lead to 'generalization'.
But,
• However, if M is very large (or infinite), the bound blows up and becomes meaningless. How do we address this?
• In other words: does M have to be really large?
M changes with model complexity
• Linear separator (1D): size of [w] = 1 or 2. Simpler model, lower complexity. This could ALSO very well be the ideal 'f'!
• Linear separator (2D): size of [w] = 2 + 1; for d dimensions, size of [w] = d + 1.
• Quadratic separator: size of [w] = 4 (for 2D). More parameters, increased complexity: an even larger number of hypotheses compared to linear-2D. This could very well be the ideal 'f'!
Trade-off on M
• Small M: 'Eout(g) ≈ Ein(g)'? Yes! Small enough Ein(g)? No! (too few choices)
• Large M: 'Eout(g) ≈ Ein(g)'? No! Small enough Ein(g)? Yes! (more choices, hence better possibility)
Using the right M is important: how do we ensure it?
Background
• Hoeffding's inequality for multiple bins, which blows up when the number of hypotheses is large:
  P[|Ein(g) − Eout(g)| > ε] ≤ 2 M exp(−2ε²N)
• Now, establish a finite quantity m_H(N) that replaces M:
  P[|Ein(g) − Eout(g)| > ε] ≤ 2 m_H(N) exp(−2ε²N)
• Justify the feasibility of learning for infinite M
• Study m_H(N) to understand its trade-off for the right 'H'
• Why did PLA work so nicely without all of this?
Now let us address: does M really have to be this large?
• P[|Ein(g) − Eout(g)| > ε] ≤ 2 M exp(−2ε²N)
• M was obtained by adding up all the 'bad' events (union bound)
• This assumed the worst case: all bad events are non-overlapping
• Is this really true?
• A 'bad' event Bm: |Ein(hm) − Eout(hm)| > ε
• Choosing the right hypothesis requires a bound on P[B1 or B2 or … or BM]
• Non-overlapping assumption (union bound):
  P[B1 or B2 or … or BM] ≤ P[B1] + P[B2] + … + P[BM]
A problem with union bound
• Union bound: P[B1 or B2 or … or BM] ≤ P[B1] + P[B2] + … + P[BM]
• Bad events Bm: |Ein(hm) − Eout(hm)| > ε
• Overlapping for similar hypotheses:
  • Eout(h1) ≈ Eout(h2)
  • For most D, Ein(h1) ≈ Ein(h2)
• Therefore, considerable over-estimation!
• Way out: group similar hypotheses by type/kind?
Examples: What is ‘M’ in a PLA
• The hypothesis set: all lines in the plane
• No. of lines: infinitely many
• How many "kinds" of lines if viewed w.r.t. one input vector x1?
• 2 kinds, i.e. M = 2 (max. number of 'dichotomies' with this hypothesis set)
• Either x1 is classified as +1, or x1 is classified as −1
Examples: What is ‘M’ in a PLA
• The hypothesis set: all lines in the plane
• How many "kinds" of lines if viewed w.r.t. two input vectors x1 and x2?
• 4 kinds
• 1 input: 2 kinds
• 2 inputs: 4 kinds
• 3 inputs: ?
Examples: What is ‘M’ in a PLA
• The hypothesis set: all lines in the plane
• How many "kinds" of lines if viewed w.r.t. three input vectors x1, x2 and x3?
• 8 kinds
• 1 input: 2 kinds
• 2 inputs: 4 kinds
• 3 inputs: 8 kinds
• ALWAYS?
Examples: What is ‘M’ in a PLA
• The hypothesis set: all lines in the plane
• How many "kinds" of lines if viewed w.r.t. three input vectors x1, x2 and x3?
• 8 kinds
• 6, when the points are degenerate (collinear): a special case
• (With any 3 points, it's always a convex set.)
• 1 input: 2 kinds
• 2 inputs: 4 kinds
• 3 inputs: 8 kinds
• NOT ALWAYS 2^N!
Question: What is ‘M’ in a PLA
• The hypothesis set: all lines in the plane
• How many "kinds" of lines if viewed w.r.t. FOUR input vectors x1, x2, x3 and x4?
• 14 kinds (check!)
• 1 input: 2 kinds
• 2 inputs: 4 kinds
• 3 inputs: 8 kinds (6 only for points on a line, which is nothing but positive rays)
• 4 inputs: 14 kinds
Effective Number of Lines
• Max. kinds of lines w.r.t. N inputs x1 … xN: the effective no. of lines, effective(N)
• Must be ≤ 2^N (why?)
  N = 1: effective(N) = 2
  N = 2: effective(N) = 4
  N = 3: effective(N) = 8
  N = 4: effective(N) = 14 (< 2^N)
• A finite grouping of the infinitely many lines of H
• Therefore: P[|Ein(g) − Eout(g)| > ε] ≤ 2 effective(N) exp(−2ε²N)
• Good news! If N is large enough, (a) effective(N) can replace M, and (b) effective(N) << 2^N, then learning is possible with infinitely many lines!
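A randomized brute-force check of these effective(N) values: the sketch below counts, for a handful of points (coordinates assumed, kept in general position), how many distinct labelings a line can produce by sweeping random directions and all threshold splits along each direction.

```python
import math
import random

def dichotomies_of_lines(points, n_dirs=20000, seed=0):
    """Randomized count of the labelings ('dichotomies') a line can realize on
    `points`: for each random direction, every threshold split of the sorted
    projections is an achievable labeling, in both orientations."""
    rng = random.Random(seed)
    n = len(points)
    seen = set()
    for _ in range(n_dirs):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        w = (math.cos(angle), math.sin(angle))
        order = sorted(range(n),
                       key=lambda i: w[0] * points[i][0] + w[1] * points[i][1])
        for k in range(n + 1):                      # threshold after position k
            labels = [0] * n
            for i in order[k:]:
                labels[i] = 1
            seen.add(tuple(labels))
            seen.add(tuple(1 - v for v in labels))  # flipped orientation
    return len(seen)

# Point coordinates are assumed; they are only kept in general position.
print(dichotomies_of_lines([(0, 0), (1, 0.2)]))                        # 4
print(dichotomies_of_lines([(0, 0), (1, 0.2), (0.3, 1)]))              # 8
print(dichotomies_of_lines([(0, 0), (1, 0.1), (0.2, 1), (1.1, 1.2)]))  # 14
```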
Define Growth function
• effective(N) depends on the particular inputs x1, …, xN
• Growth function: remove the dependence by taking the max. over all possible x1, …, xN:
  m_H(N) = max over x1, …, xN of the number of dichotomies H generates on them
• This is finite, upper bounded by 2^N
  N = 1: m_H(N) = 2
  N = 2: m_H(N) = 4
  N = 3: m_H(N) = max(…, 6, 8) = 8
  N = 4: m_H(N) = 14 (< 2^N)
• How to calculate this function? Refer to Learning from Data, Chapter 2.
• (exact derivation beyond the scope of this course)
Summary of growth functions
• k: a no. of points on which H can no longer generate all 2^k dichotomies, i.e. m_H(k) < 2^k (a 'break point'; it is 4 for the 2D perceptron)
• Positive rays: m_H(N) = N + 1; break point at 2
• Positive intervals: m_H(N) = ½N² + ½N + 1; break point at 3
• Convex sets: m_H(N) = 2^N; no break point (but data points are seldom convex in practice)
• 2D perceptrons: m_H(N) < 2^N in some cases; break point at 4
• Conjecture:
  • No break point: m_H(N) = 2^N (guaranteed!)
  • Break point k: m_H(N) is polynomial in N, instead of exponential
• Good news! Take away: the number of points for data-driven analysis is usually a 'lot'. A break point can then guarantee that we are safe with m_H(N).
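A short sketch comparing these growth functions against 2^N (the formulas are the ones quoted above; the values of N are arbitrary):

```python
def m_positive_rays(N):      return N + 1
def m_positive_intervals(N): return N * (N + 1) // 2 + 1   # = 0.5 N^2 + 0.5 N + 1
def m_convex_sets(N):        return 2 ** N                 # no break point

for N in (5, 10, 20, 50):
    print(f"N={N:3d}  rays={m_positive_rays(N):4d}  "
          f"intervals={m_positive_intervals(N):6d}  convex=2^N={m_convex_sets(N)}")
# The polynomial growth functions fall far below 2^N as N grows.
```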
Examples:
• H: positive rays: m_H(N) = N + 1 (break point k = 2), i.e. m_H(2) = 3 < 2² = 4
• H: positive intervals: m_H(N) = ½N² + ½N + 1 (break point k = 3)
• H: 2D perceptrons: break point k = 4
• Note that the power of N seems to have a correlation with the break point
Therefore
𝟐 https://www.csie.ntu.edu.tw/~htlin/mooc/
• Instead of ,
𝟐
• We need:
Summary: With different data sets,
the hypothesis set need not increase
as per Union bound. It has to have a
several common hypothesis, and
those duplicates can be removed
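As a closing sketch, the bound with a polynomial growth function is compared against the no-break-point case. The tolerance eps is assumed, and the closed form m_H(N) = N² − N + 2 for the 2D perceptron is a standard result quoted here only because it matches the values 2, 4, 8, 14 above; it is not derived on these slides.

```python
import math

eps = 0.1                                   # assumed tolerance

def m_no_break(N):   return 2 ** N          # no break point: m_H(N) = 2^N
def m_perceptron(N): return N * N - N + 2   # assumed closed form; matches 2, 4, 8, 14

for N in (10, 100, 1000):
    b_exp  = 2 * m_no_break(N)   * math.exp(-2 * eps ** 2 * N)
    b_poly = 2 * m_perceptron(N) * math.exp(-2 * eps ** 2 * N)
    print(f"N = {N:4d}: exponential m_H -> {b_exp:.3g}, polynomial m_H -> {b_poly:.3g}")
# With a polynomial m_H(N) the exponential factor wins and the bound goes to 0;
# with m_H(N) = 2^N it never does.
```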