Lecture 09 Bounds For Bad Hypotheses

The document discusses various inequalities related to random variables, including Chebyshev's bound, Hoeffding's inequality, and Markov's inequality, which help in estimating the probabilities of deviations from the mean. It emphasizes the importance of in-sample and out-of-sample errors in machine learning, and how to choose hypotheses to minimize these errors. Additionally, it addresses the challenges of multiple hypotheses and the implications of union bounds on error estimation in learning models.


Bounds

• If X is a discrete random variable with E(X) = µ, then, for any ϵ > 0:

• Markov's inequality (X non-negative): P(X ≥ ϵ) ≤ E(X)/ϵ

• Chebyshev's bound: P(|X − µ| ≥ ϵ) ≤ Var(X)/ϵ²
  Assumes finite variance and checks how far a R.V. deviates from its mean.

• Hoeffding's inequality: P(|ν − µ| > ϵ) ≤ 2·exp(−2ϵ²N)
  Applicable to sample and bin means (how far they deviate from each other).
  The requirement for Hoeffding is that the RVs themselves are bounded —
  no more tails of distributions stretching off toward infinity.

• Chernoff's bound: an exponentially decreasing upper bound on the tail of a random variable. The minimum of all such
  exponential bounds forms the Chernoff (Chernoff–Cramér) bound, which may decay faster than
  exponential (e.g. sub-Gaussian / Hoeffding). It is especially useful for sums of independent random
  variables, such as sums of Bernoulli random variables.

• If X is a random variable, then for any a ∈ R, we can write,
  for s > 0:  P(X ≥ a) = P(e^(sX) ≥ e^(sa)) ≤ e^(−sa)·E(e^(sX))
  Similarly, for s < 0:  P(X ≤ a) = P(e^(sX) ≥ e^(sa)) ≤ e^(−sa)·E(e^(sX))
S Karagadde, ME 228, IIT Bombay 2
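The short script below is a minimal numerical sketch (not part of the slides) comparing how tight these bounds are for the sample mean of Bernoulli draws; the choices of µ, N, ε, the trial count, and the use of NumPy are illustrative assumptions.

```python
# Illustrative comparison of Markov, Chebyshev and Hoeffding bounds for the
# sample mean nu of N Bernoulli(mu) draws; mu, N, eps are assumed values.
import numpy as np

rng = np.random.default_rng(0)
mu, N, eps, trials = 0.5, 100, 0.1, 100_000

nu = rng.binomial(N, mu, size=trials) / N        # simulated sample frequencies
empirical = np.mean(np.abs(nu - mu) > eps)       # Monte Carlo estimate of P[|nu - mu| > eps]

markov    = mu / (mu + eps)                      # Markov (one-sided): P[nu >= mu + eps] <= E[nu]/(mu + eps)
chebyshev = mu * (1 - mu) / (N * eps**2)         # Chebyshev: Var(nu) = mu(1 - mu)/N
hoeffding = 2 * np.exp(-2 * eps**2 * N)          # Hoeffding: 2 exp(-2 eps^2 N)

print(f"empirical {empirical:.4f} | markov {markov:.3f} | "
      f"chebyshev {chebyshev:.3f} | hoeffding {hoeffding:.3f}")
```

Hoeffding decays exponentially in N, while Chebyshev decays only like 1/N — consistent with the bullets above.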
Example
• Let µ = 0.4. Use Hoeffding's Inequality

to bound the probability that a sample of 10 fruits will have a sample fraction of oranges ν
≤ 0.1. What bound do you get?

1. 0.67
2. 0.4
3. 0.33
4. 0.05

Set N = 10 and ε = 0.3 (0.4 − 0.1): 2·exp(−2·(0.3)²·10) = 2·exp(−1.8) ≈ 0.33, i.e. option #3.
Also note that option #4 is the actual probability, and Hoeffding gives only an upper bound to that probability.

S Karagadde, ME 228, IIT Bombay 3
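A quick check of this example (a sketch, not from the slides): the Hoeffding bound with N = 10 and ε = 0.3, against the exact binomial probability of seeing ν ≤ 0.1.

```python
# Hoeffding bound vs the exact binomial tail for the fruit example above.
from math import comb, exp

N, mu, eps = 10, 0.4, 0.3
bound = 2 * exp(-2 * eps**2 * N)                      # 2 exp(-1.8) ~ 0.33  -> option 3
exact = sum(comb(N, k) * mu**k * (1 - mu)**(N - k)    # P[nu <= 0.1] = P[0 or 1 orange in 10]
            for k in (0, 1))
print(f"Hoeffding bound ~ {bound:.2f}, exact probability ~ {exact:.2f}")   # ~0.33 vs ~0.05
```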


Formal guarantee
• For any fixed ‘h’, in big data (large N),
• the in-sample error Ein(h) is probably close to
• the out-of-sample error Eout(h), within an acceptable tolerance ε

• Also, like before:  P[ |Ein(h) − Eout(h)| > ε ] ≤ 2·exp(−2ε²N)

• Valid for all N and ε

• does not depend on Eout(h), no need to ‘know’ Eout(h)


•  f and P can remain unknown (‘underground’)

• larger sample size N or looser gap ε → higher probability for ‘ν ≈ µ’

• ‘Ein(h) ≈ Eout(h)’ is probably approximately correct (PAC)


S Karagadde, ME 228, IIT Bombay 4
Verification of one h
• For any h, when N is large and Ein(h) ≈ Eout(h),
• can we claim good learning (g ≈ f )?

• Yes! If Ein(h) is small


• and Algorithm A has picked this h as g,
• then ‘g ≈ f’ is PAC.

• No! If Ein(h) was not small and A was forced to pick this h as g:


• Ein(g) could be large, and since Ein ≈ Eout, Eout(g) would also be large
• ‘g ≠ f’ is PAC

• Hence the need to choose the ‘best’ h from multiple h’s — a single h may or may not give the smallest error.
• How do we now define probability bounds when multiple h’s exist?

S Karagadde, ME 228, IIT Bombay 5


Bad events – example 1
• If you toss a fair coin 10 times, what is the probability that you
will get 10 heads?

• ~ 0.1%

• If you toss 1000 fair coins 10 times each, what is the probability
that some trial will get 10 heads? (binomial trials)

• ~ 63%

What is happening? If you try hard enough, the ‘disjoint’ probabilities of
a bad event can add up to give a much higher chance of it happening.
S Karagadde, ME 228, IIT Bombay 6
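The two numbers above can be verified in a couple of lines (a sketch using standard probability, not part of the notes):

```python
# One fair coin tossed 10 times vs. the best of 1000 such coins.
p_ten_heads = 0.5 ** 10                       # one coin: all 10 tosses heads
p_some_coin = 1 - (1 - p_ten_heads) ** 1000   # at least one of 1000 coins shows 10 heads
print(f"{p_ten_heads:.3%}")                   # ~0.098%  (the '~0.1%')
print(f"{p_some_coin:.1%}")                   # ~62.4%   (the '~63%')
```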
Question
• If all of you flip a coin 4 times, what is the probability that at least one of the
students gets all 4 heads for their coin ‘g’? (No coin? go digital -
https://flipsimu.com/ )
• Is that coin very special?

• 0.999567

• BAD sample: Ein (for that coin) and Eout (the general measure) are far apart.
• That coin is an example of a bad h.


• Objective of this exercise: an individual hypothesis may not always
represent the ‘expected’ behaviour.
S Karagadde, ME 228, IIT Bombay 7
Multiple bins – what is the error bound when multiple hypotheses exist?

• Generalizing the bin model to more than one hypothesis


Both µ and ν depend on which hypothesis h is chosen.

Hypothesis:       h1    h2    ……    hM
out of sample:    μ1    μ2    ……    μM     (µ is `out of sample', denoted by Eout(h))
in sample:        ν1    ν2    ……    νM     (ν is `in sample', denoted by Ein(h))

Hoeffding inequality for multiple bins?

S Karagadde, ME 228, IIT Bombay 8
Union bound for bad data – M different hypotheses

• finite-bin version of Hoeffding, valid for all M, N and ε:

  P[ |Ein(g) − Eout(g)| > ε ] ≤ 2·M·exp(−2ε²N)

• does not depend on any Eout(hm), no need to ‘know’ Eout(hm)
• f and P can stay unknown
• ‘Ein(g) ≈ Eout(g)’ is PAC, regardless of A

• ‘most reasonable’ A (like PLA/pocket): pick the hm with lowest Ein(hm) as g

Also, the implication is that the more sophisticated the model (larger M), the more loosely the in-sample error will track the out-of-sample error.

S Karagadde, ME 228, IIT Bombay 9
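A minimal sketch of the finite-M bound quoted on this slide, showing how it loosens as M grows; the values of ε, N and M below are illustrative assumptions.

```python
# Union-bound (finite-M) Hoeffding: P[|Ein(g) - Eout(g)| > eps] <= 2 M exp(-2 eps^2 N)
from math import exp

def union_hoeffding(M, N, eps):
    return 2 * M * exp(-2 * eps**2 * N)

for M in (1, 100, 10_000, 1_000_000):
    print(f"M = {M:>9}: bound = {union_hoeffding(M, N=500, eps=0.1):.4g}")
# The bound grows linearly in M; beyond some M it exceeds 1 and says nothing
# unless N is increased as well -- the issue raised on the following slides.
```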
Summary - Generalization
• Can in-sample error closely represent out-of-sample error?
• Answer: Yes, because of Hoeffding’s inequality

• Can we have low enough Ein?


• Yes, we have many hypotheses, and we choose the one giving lowest error
• We reject the hypothesis if the in-sample error (Ein) is large – and change the
model/hypothesis until we get a low enough Ein.

• Above two shall ensure selection of a good enough ‘g’, and hence lead to
‘generalization’

S Karagadde, ME 228, IIT Bombay 10


But,

• However, if M is very large (or infinite), the bound is no longer finite or meaningful. How do we address this?

• In other words: does M have to be really large?

S Karagadde, ME 228, IIT Bombay 11


M changes with model complexity

Linear separator (1D)
• Size of [w] = 1 or 2
• This could ALSO very well be the ideal ‘f’!
• Simpler model → lower complexity

Linear separator
• Size of [w] = 2+1 (for 2D)
• Size of [w] = d+1 (for d-D)

Quadratic separator
• This could very well be the ideal ‘f’!
• Size of [w] = 4 (for 2D)
• (More parameters → complexity increased)
• Even larger number of hypotheses compared to linear-2D.

S Karagadde, ME 228, IIT Bombay 12
Trade-off on M

                                         Small M                Large M

Ein(g) close to Eout(g)?                 Yes!                   No!

Can we find a g with low enough Ein?     No! Too few choices    Yes! More choices, hence better possibility

Using the right M is important – how do we ensure it?

S Karagadde, ME 228, IIT Bombay 13
Background
• Hoeffding’s Inequality for multiple bins, which blows up when the no. of h is large:

  P[ |Ein(g) − Eout(g)| > ε ] ≤ 2·M·exp(−2ε²N)

• Now, establish a finite quantity mH(N) that replaces M:

  P[ |Ein(g) − Eout(g)| > ε ] ≤ 2·mH(N)·exp(−2ε²N)

• Justify the feasibility of learning for infinite M


• Study it to understand the trade-off for the right ‘H’

• Why did PLA work so nicely without all of this?

S Karagadde, ME 228, IIT Bombay 14


Now let us address: does M really have to be large?

  P[ |Ein(g) − Eout(g)| > ε ] ≤ 2·M·exp(−2ε²N)

• M was obtained by adding up all the ‘bad’ events → Union bound


• This assumed the worst case → all bad events were non-overlapping

• Is this really true?

• A ‘bad’ event Bm : |Ein(hm) − Eout(hm)| > ε
• Choice of the right hypothesis → bound for P[B1 or B2 or … or BM]
• Non-overlapping assumption: P[B1 or B2 or … or BM] ≤ P[B1] + P[B2] + … + P[BM]

S Karagadde, ME 228, IIT Bombay 15


A problem with union bound
• Union bound → P[B1 or B2 or … or BM] ≤ P[B1] + P[B2] + … + P[BM]

• Bad events: Bm : |Ein(hm) − Eout(hm)| > ε

• Overlapping for similar hypotheses:

• Eout(h1) ≈ Eout(h2)
• For most D, Ein(h1) = Ein(h2)

• Therefore, considerable over-estimation!

• Way out: group similar hypotheses by type/kind?

S Karagadde, ME 228, IIT Bombay 16
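Below is a hedged Monte Carlo sketch of this over-estimation; the target f, the two hypotheses h1 and h2, and all parameter values are invented purely for illustration. Two very similar hypotheses have almost identical bad events, so P[B1 or B2] stays close to P[B1] rather than reaching P[B1] + P[B2].

```python
# Two nearly identical 1D hypotheses: their bad events overlap heavily, so the
# union bound P[B1] + P[B2] noticeably over-estimates P[B1 or B2].
import numpy as np

rng = np.random.default_rng(1)
N, eps, trials = 100, 0.05, 20_000

f  = lambda x: np.sign(x)          # target
h1 = lambda x: np.sign(x - 0.20)   # two very similar hypotheses
h2 = lambda x: np.sign(x - 0.21)
Eout1, Eout2 = 0.100, 0.105        # exact disagreement with f for x ~ U[-1, 1]

b1 = b2 = b12 = 0
for _ in range(trials):
    x = rng.uniform(-1, 1, N)
    bad1 = abs(np.mean(h1(x) != f(x)) - Eout1) > eps
    bad2 = abs(np.mean(h2(x) != f(x)) - Eout2) > eps
    b1 += bad1; b2 += bad2; b12 += (bad1 or bad2)

print(f"P[B1] ~ {b1/trials:.3f}  P[B2] ~ {b2/trials:.3f}")
print(f"P[B1 or B2] ~ {b12/trials:.3f}  vs union bound {(b1 + b2)/trials:.3f}")
```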


Examples: What is ‘M’ in a PLA
• The hypothesis set H = {all lines in the plane}
• No. of lines → infinite
• How many “kinds” of lines if viewed w.r.t. one input vector x1?

• 2 kinds (max. number of ‘dichotomies’ with this hypothesis set)


• x1 is classified as +1 or −1

S Karagadde, ME 228, IIT Bombay 17


Examples: What is ‘M’ in a PLA
• The hypothesis set H = {all lines in the plane}
• No. of lines → infinite
• How many “kinds” of lines if viewed w.r.t. one input vector x1?

• 2 kinds → M = 2
• x1 is classified as +1 or −1

S Karagadde, ME 228, IIT Bombay 18


Examples: What is ‘M’ in a PLA
• The hypothesis set H = {all lines in the plane}
• How many “kinds” of lines if viewed w.r.t. two input vectors x1 and x2?

• 4 kinds

• 1 input → 2 kinds
• 2 inputs → 4 kinds
• 3 inputs → ?

S Karagadde, ME 228, IIT Bombay 19


Examples: What is ‘M’ in a PLA
• The hypothesis set H = {all lines in the plane}
• How many “kinds” of lines if viewed w.r.t.
three input vectors x1, x2, x3?

• 8 kinds

• 1 input → 2 kinds
• 2 inputs → 4 kinds
• 3 inputs → 8 kinds
• ALWAYS?
S Karagadde, ME 228, IIT Bombay 20
Examples: What is ‘M’ in a PLA
• The hypothesis set H = {all lines in the plane}
• How many “kinds” of lines if viewed w.r.t.
three input vectors x1, x2, x3?

• 8 kinds
• 6, when the points are degenerate (collinear)
  • A special case
  • With any 3 points, it’s always a convex set.

• 1 input → 2 kinds
• 2 inputs → 4 kinds
• 3 inputs → 8 kinds
• NOT ALWAYS 2^N!
S Karagadde, ME 228, IIT Bombay 21
Question: What is ‘M’ in a PLA
• The hypothesis set H = {all lines in the plane}
• How many “kinds” of lines if viewed w.r.t.
FOUR input vectors x1, x2, x3, x4?

• 14 kinds (check!)

• 1 input → 2 kinds
• 2 inputs → 4 kinds
• 3 inputs → 8 kinds (6 only for points on a line, which is nothing but positive rays)
• 4 inputs → 14 kinds
S Karagadde, ME 228, IIT Bombay 22
Effective Number of Lines
• Max. kinds of lines w.r.t. N inputs x1 … xN

• Effective no. of lines: effective(N)

• Must be ≤ 2^N (why?)

• Finite grouping of infinitely many lines of H

   N    effective(N)
   1    2
   2    4
   3    8
   4    14 < 2^N

• Therefore, P[ |Ein(g) − Eout(g)| > ε ] ≤ 2·effective(N)·exp(−2ε²N)

Good news! If N is large enough, and
(a) effective(N) can replace M, and
(b) effective(N) << 2^N,
then learning is possible with infinite lines!
S Karagadde, ME 228, IIT Bombay 23
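The table above can be reproduced by brute force (an illustrative sketch, not the course's derivation): sample many random lines and count how many distinct labellings of N points they produce.

```python
# Estimate the effective number of lines for N points in 2D by sampling many
# random lines sign(w0 + w1*x1 + w2*x2) and counting the distinct labellings.
import numpy as np

rng = np.random.default_rng(2)

def count_dichotomies(points, n_lines=200_000):
    X = np.hstack([np.ones((len(points), 1)), points])   # prepend the bias coordinate
    W = rng.standard_normal((n_lines, 3))                 # random lines (w0, w1, w2)
    labels = np.sign(W @ X.T)                              # row m = labelling of the points by line m
    return len({tuple(row) for row in labels})

for N in (1, 2, 3, 4):
    pts = rng.uniform(-1, 1, size=(N, 2))                  # N points in general position (w.h.p.)
    print(N, count_dichotomies(pts), "vs 2^N =", 2 ** N)
# Typically prints 2, 4, 8, 14 -- matching the table above.
```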
Define Growth function
• effective(N) depends on the inputs x’s

• Growth function mH(N): remove the dependence by taking the max. over all possible x1, …, xN

   N    mH(N)
   1    2
   2    4
   3    max(…, 6, 8) = 8
   4    14 < 2^N

• This is finite, upper bounded by 2^N

• How to calculate this function? Refer to Learning from Data, Chapter 2.


• (exact derivation beyond the scope of this course)

S Karagadde, ME 228, IIT Bombay 24


Summary of growth functions
• Break point k : the smallest no. of points for which mH(k) < 2^k
  → break point is 4 for the 2D perceptron

• Positive rays:       mH(N) = N + 1                 Break point at 2     (polynomial instead of exponential)
• Positive intervals:  mH(N) = ½N² + ½N + 1          Break point at 3
• Convex sets:         mH(N) = 2^N                   No break point       (data points are seldom convex in practice)
• 2D perceptrons:      mH(N) < 2^N in some cases     Break point at 4

• Conjecture:
  • No break point: mH(N) = 2^N (guaranteed!)
  • Break point k: mH(N) is polynomial in N (of order N^(k−1))

Good news!
Take away: the number of points for data-driven analysis
is usually a ‘lot’. A break point can then guarantee that we
are safe with mH(N).
S Karagadde, ME 228, IIT Bombay 25
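A small tabulation (a sketch based on the formulas above) of how the polynomial growth functions compare with 2^N as N grows:

```python
# Growth functions from the summary above, tabulated against 2^N.
def m_rays(N):      return N + 1                 # positive rays (break point 2)
def m_intervals(N): return N * (N + 1) // 2 + 1  # positive intervals (break point 3)
def m_convex(N):    return 2 ** N                # convex sets (no break point)

print(f"{'N':>4}{'rays':>8}{'intervals':>12}{'convex / 2^N':>16}")
for N in (1, 2, 3, 5, 10, 20):
    print(f"{N:>4}{m_rays(N):>8}{m_intervals(N):>12}{m_convex(N):>16}")
# With a break point k, the growth function is polynomial (~ N^(k-1)), so it is
# eventually overwhelmed by the exp(-2 eps^2 N) factor in the bound.
```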
Examples:

• H: Positive rays: mH(N) = N + 1 (break point k = 2), i.e. of order N¹

• H: Positive intervals: mH(N) = ½N² + ½N + 1 (break point k = 3), i.e. of order N²

• H: 2D Perceptrons: break point k = 4

• Note that the power of N seems to be correlated with the break point (≈ N^(k−1))

S Karagadde, ME 228, IIT Bombay 26


Therefore
• Instead of:  P[ |Ein(g) − Eout(g)| > ε ] ≤ 2·M·exp(−2ε²N)

• We need:     P[ |Ein(g) − Eout(g)| > ε ] ≤ 2·mH(N)·exp(−2ε²N)

Summary: With different data sets, the hypothesis set need not
grow as per the union bound: it contains several effectively
common hypotheses, and those duplicates can be removed.

Reference: https://www.csie.ntu.edu.tw/~htlin/mooc/

S Karagadde, ME 228, IIT Bombay 27
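As a rough numerical illustration (an assumption-laden sketch, not the formal VC argument of the course): bounding mH(N) by Sauer's polynomial bound, the sum of C(N, i) for i < k, with the 2D-perceptron break point k = 4, shows the replaced bound does vanish for large N, whereas 2·M·exp(−2ε²N) does not if M keeps growing.

```python
# Replace M with a polynomial bound on m_H(N) (Sauer's bound for break point k)
# and watch the overall bound 2 * m_H(N) * exp(-2 eps^2 N) vanish as N grows.
from math import comb, exp

def sauer(N, k):                          # m_H(N) <= sum_{i=0}^{k-1} C(N, i)
    return sum(comb(N, i) for i in range(k))

k, eps = 4, 0.1                           # k = 4: break point of the 2D perceptron
for N in (100, 500, 1000, 2000, 5000):
    print(f"N = {N:>5}: 2*m_H(N)*exp(-2*eps^2*N) <= {2 * sauer(N, k) * exp(-2 * eps**2 * N):.3g}")
```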
