Bounds
• Markov inequality: if X is a non-negative random variable with E(X) = µ, then, for any ε > 0, P(X ≥ ε) ≤ µ/ε.
• Chebyshev's bound: assumes finite variance σ² and bounds how far a random variable deviates from its mean: P(|X − µ| ≥ ε) ≤ σ²/ε².
• Chernoff's bound: an exponentially decreasing upper bound on the tail of a random variable. If X is a random variable, then for any a in R and any s > 0 we can write P(X ≥ a) = P(e^(sX) ≥ e^(sa)) ≤ e^(−sa) E[e^(sX)]; similarly, for s < 0, P(X ≤ a) ≤ e^(−sa) E[e^(sX)]. The minimum of all such exponential bounds forms the Chernoff or Chernoff-Cramér bound, which may decay faster than exponential (e.g. sub-Gaussian / Hoeffding). It is especially useful for sums of independent random variables, such as sums of Bernoulli random variables.
• Hoeffding's inequality: applicable to sample and bin means (how far they deviate from each other): P(|ν − µ| > ε) ≤ 2 exp(−2ε²N). The requirement is that the random variables themselves are bounded: no more tails of distributions stretched off toward infinity.
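A quick numerical comparison makes the relative tightness of these bounds concrete. The sketch below is only illustrative: the coin parameter mu, the sample size N and the tolerance eps are assumed values, not taken from the slides.

```python
import math

# Illustrative numbers (assumed): coin with mu = 0.5, sample mean nu over
# N = 400 tosses, deviation eps = 0.1.
mu, N, eps = 0.5, 400, 0.1

# Markov (nu is non-negative): P(nu >= mu + eps) <= E[nu] / (mu + eps)
markov = mu / (mu + eps)

# Chebyshev: P(|nu - mu| >= eps) <= Var(nu) / eps^2, with Var(nu) = mu(1-mu)/N
chebyshev = mu * (1 - mu) / (N * eps ** 2)

# Hoeffding (nu is a mean of variables bounded in [0, 1]):
# P(|nu - mu| > eps) <= 2 exp(-2 eps^2 N)
hoeffding = 2 * math.exp(-2 * eps ** 2 * N)

print(f"Markov    bound: {markov:.4f}")     # ~0.83 (very loose, one-sided)
print(f"Chebyshev bound: {chebyshev:.4f}")  # 0.0625
print(f"Hoeffding bound: {hoeffding:.6f}")  # ~0.00067
```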
Example
• Let µ = 0.4 (fraction of oranges in the bin). Use Hoeffding's inequality to bound the probability that a sample of 10 fruits has a fraction of oranges ν ≤ 0.1. What bound do you get?
  1. 0.67
  2. 0.4
  3. 0.33
  4. 0.05
• Set N = 10 and ε = 0.3 (= 0.4 − 0.1) and you get the answer: 2 exp(−2ε²N) ≈ 0.33, i.e. option 3.
• Also note that option 4 is the actual probability; Hoeffding gives only an upper bound to that probability.
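For completeness, the sketch below evaluates both numbers quoted in the options: the Hoeffding bound and the exact binomial probability of drawing at most one orange.

```python
import math
from math import comb

# Bin example from the slide: mu = 0.4 (fraction of oranges), N = 10 fruits.
mu, N = 0.4, 10
eps = mu - 0.1                     # nu <= 0.1 means a deviation of at least 0.3

# Two-sided Hoeffding bound
bound = 2 * math.exp(-2 * eps ** 2 * N)
print(f"Hoeffding bound  : {bound:.2f}")   # ~0.33 -> option 3

# Exact probability of drawing at most 1 orange out of 10 (i.e. nu <= 0.1)
exact = sum(comb(N, k) * mu ** k * (1 - mu) ** (N - k) for k in range(2))
print(f"Exact probability: {exact:.2f}")   # ~0.05 -> option 4
```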
Formal guarantee
• For any fixed 'h', with big data (large N), the in-sample error Ein(h) is probably close to the out-of-sample error Eout(h), within an acceptable tolerance ε:
  P[|Ein(h) − Eout(h)| > ε] ≤ 2 exp(−2ε²N)
• Also, like before:
  • Valid for all N and ε
  • Does not depend on Eout(h): no need to 'know' Eout(h); f and P can remain unknown
  • Larger sample size N or looser gap ε means a higher probability for 'ν ≈ µ'
• 'Ein(h) = Eout(h)' is probably approximately correct (PAC)
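To get a feel for how quickly this guarantee tightens, the sketch below evaluates 2 exp(−2ε²N) for a few sample sizes; the tolerance eps and the values of N are assumed for illustration.

```python
import math

eps = 0.05   # assumed tolerance on |Ein - Eout|
for N in (100, 500, 1000, 5000):
    bound = 2 * math.exp(-2 * eps ** 2 * N)
    print(f"N = {N:5d}: P[|Ein - Eout| > {eps}] <= {bound:.2e}")
# N = 100 gives a vacuous bound (> 1); by N = 5000 the bound is ~2.8e-11.
```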
Verification of one h
• For any single h, when N is large and Ein(h) ≈ Eout(h):
• Can we claim good learning (g ≈ f)?
  • Yes! If Ein(h) is small and Algorithm A has picked this h as g, then 'g ≈ f' is PAC.
  • No! If Ein(h) is not small and A was forced to pick this h as g, then 'g ≠ f' is PAC.
• Hence the need to choose the 'best' h from multiple h's: a single h may or may not give the smallest error.
• How do we now define probability bounds when multiple h's exist?
Bad events – example 1
• If you toss a fair coin 10 times, what is the probability that you
will get 10 heads?
• ~ 0.1%
• If you toss 1000 fair coins 10 times each, what is the probability
that some trial will get 10 heads? (binomial trials)
• ~ 63%
What is happening? If you try hard enough, the 'disjoint' probabilities of a bad event across many trials can add up to give a much higher chance of it happening somewhere.
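Both numbers on this slide follow from two lines of arithmetic, sketched below.

```python
p_ten_heads = 0.5 ** 10                        # one coin, 10 tosses, all heads
p_some_coin = 1 - (1 - p_ten_heads) ** 1000    # at least one of 1000 coins succeeds
print(f"{p_ten_heads:.4%}")   # ~0.0977%, i.e. ~0.1%
print(f"{p_some_coin:.1%}")   # ~62.4%,  i.e. ~63%
```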
Question
• If all of you flip a coin 4 times, what is the probability that at least one of the students gets all 4 heads for their coin 'g'? (No coin? Go digital: https://flipsimu.com/ )
• Is that coin very special?
• ~ 0.999567
• BAD sample: Ein (for that coin) and Eout (a general measure) are far apart.
• That coin is an example of a bad h.
• Objective of this exercise: an individual hypothesis may not always represent the 'expected' behaviour.
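The quoted probability is reproduced below assuming a class of about 120 students; the class size is an assumption made here only to match the number on the slide.

```python
p_four_heads = 0.5 ** 4        # one student: 4 heads in 4 flips
n_students = 120               # assumed class size (not stated on the slide)
p_at_least_one = 1 - (1 - p_four_heads) ** n_students
print(f"{p_at_least_one:.6f}") # ~0.999567
```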
Multiple bins – what is the error bound when multiple hypotheses exist?
• Generalizing the bin model to more than one hypothesis
[Figure: one bin per hypothesis h1, h2, …, hM]
• Both µ and ν depend on which hypothesis h is considered
• µm is 'out of sample', denoted by Eout(hm)
• νm is 'in sample', denoted by Ein(hm)
Hoeffding inequality for multiple bins?
Union bound for bad data – M different hypotheses
• Finite-bin version of Hoeffding: P[|Ein(g) − Eout(g)| > ε] ≤ 2 M exp(−2ε²N), valid for all M, N and ε
• Does not depend on any Eout(hm), no need to 'know' Eout(hm): f and P can stay unknown
• 'Ein(g) = Eout(g)' is PAC, regardless of A
• The 'most reasonable' A (like PLA/pocket): pick the hm with lowest Ein(hm) as g
• Also, the implication is that the more sophisticated the model (larger M), the more loosely the in-sample error will track the out-of-sample error.
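The sketch below shows how this union bound scales with M; the sample size and tolerance are assumed values chosen only to illustrate the trend.

```python
import math

N, eps = 1000, 0.05   # assumed sample size and tolerance
for M in (1, 10, 100, 1000, 10_000):
    bound = 2 * M * math.exp(-2 * eps ** 2 * N)
    print(f"M = {M:6d}: bound = {bound:.3f}")
# The bound grows linearly with M and quickly exceeds 1, i.e. becomes vacuous.
```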
Summary - Generalization
• Can in-sample error closely represent out-of-sample error?
• Answer: Yes, because of Hoeffding’s inequality
• Can we have low enough Ein?
• Yes, we have many hypotheses, and we choose the one giving lowest error
• We reject the hypothesis if the in-sample error (Ein) is large, and then change the model/hypothesis until we obtain a small Ein.
• The above two together ensure selection of a good enough 'g', and hence lead to 'generalization'.
But,
• However, if M is very large (or infinite), the bound blows up and becomes meaningless. How do we address this?
• In other words: does M have to be really large?
M changes with model complexity
• Linear separator (1D): size of [w] = 1 or 2. Simpler model, lower complexity. This could ALSO very well be the ideal 'f'!
• Linear separator (2D): size of [w] = 2 + 1; for d dimensions, size of [w] = d + 1.
• Quadratic separator: size of [w] = 4 (for 2D). More parameters, increased complexity: an even larger number of hypotheses compared to linear-2D. This could very well be the ideal 'f'!
Trade-off on M
• Small M: 'Eout(g) ≈ Ein(g)'? Yes! Small enough Ein(g)? No! (too few choices)
• Large M: 'Eout(g) ≈ Ein(g)'? No! Small enough Ein(g)? Yes! (more choices, hence better possibility)
Using the right M is important: how do we ensure it?
Background
• Hoeffding's inequality for multiple bins, which blows up when the number of hypotheses is large:
  P[|Ein(g) − Eout(g)| > ε] ≤ 2 M exp(−2ε²N)
• Now, establish a finite quantity m_H(N) that replaces M:
  P[|Ein(g) − Eout(g)| > ε] ≤ 2 m_H(N) exp(−2ε²N)
• Justify the feasibility of learning for infinite M
• Study m_H(N) to understand its trade-off for the right 'H'
• Why did PLA work so nicely without all of this?
Now let us address: does M really have to be this large?
• P[|Ein(g) − Eout(g)| > ε] ≤ 2 M exp(−2ε²N)
• M was obtained by adding up all the 'bad' events (union bound)
• This assumed the worst case: all bad events are non-overlapping
• Is this really true?
• A 'bad' event Bm: |Ein(hm) − Eout(hm)| > ε
• Choosing the right hypothesis requires a bound on P[B1 or B2 or … or BM]
• Non-overlapping assumption (union bound):
  P[B1 or B2 or … or BM] ≤ P[B1] + P[B2] + … + P[BM]
A problem with union bound
• Union bound: P[B1 or B2 or … or BM] ≤ P[B1] + P[B2] + … + P[BM]
• Bad events Bm: |Ein(hm) − Eout(hm)| > ε
• Overlapping for similar hypotheses:
  • Eout(h1) ≈ Eout(h2)
  • For most D, Ein(h1) ≈ Ein(h2)
• Therefore, considerable over-estimation!
• Way out: group similar hypotheses by type/kind?
Examples: What is ‘M’ in a PLA
• The hypothesis set: all lines in the plane
• No. of lines: infinitely many
• How many "kinds" of lines if viewed w.r.t. one input vector x1?
• 2 kinds, i.e. M = 2 (max. number of 'dichotomies' with this hypothesis set)
• Either x1 is classified as +1, or x1 is classified as −1
Examples: What is ‘M’ in a PLA
• The hypothesis set: all lines in the plane
• How many "kinds" of lines if viewed w.r.t. two input vectors x1 and x2?
• 4 kinds
• 1 input: 2 kinds
• 2 inputs: 4 kinds
• 3 inputs: ?
Examples: What is ‘M’ in a PLA
• The hypothesis set: all lines in the plane
• How many "kinds" of lines if viewed w.r.t. three input vectors x1, x2 and x3?
• 8 kinds
• 1 input: 2 kinds
• 2 inputs: 4 kinds
• 3 inputs: 8 kinds
• ALWAYS?
Examples: What is ‘M’ in a PLA
• The hypothesis set: all lines in the plane
• How many "kinds" of lines if viewed w.r.t. three input vectors x1, x2 and x3?
• 8 kinds
• 6, when the points are degenerate (collinear): a special case
• (With any 3 points, it's always a convex set.)
• 1 input: 2 kinds
• 2 inputs: 4 kinds
• 3 inputs: 8 kinds
• NOT ALWAYS 2^N!
Question: What is ‘M’ in a PLA
• The hypothesis set: all lines in the plane
• How many "kinds" of lines if viewed w.r.t. FOUR input vectors x1, x2, x3 and x4?
• 14 kinds (check!)
• 1 input: 2 kinds
• 2 inputs: 4 kinds
• 3 inputs: 8 kinds (6 only for points on a line, which is nothing but positive rays)
• 4 inputs: 14 kinds
Effective Number of Lines
• Max. kinds of lines w.r.t. N inputs x1 … xN: the effective no. of lines, effective(N)
• Must be ≤ 2^N (why?)
  N = 1: effective(N) = 2
  N = 2: effective(N) = 4
  N = 3: effective(N) = 8
  N = 4: effective(N) = 14 (< 2^N)
• A finite grouping of the infinitely many lines of H
• Therefore: P[|Ein(g) − Eout(g)| > ε] ≤ 2 effective(N) exp(−2ε²N)
• Good news! If N is large enough, (a) effective(N) can replace M, and (b) effective(N) << 2^N, then learning is possible with infinitely many lines!
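A randomized brute-force check of these effective(N) values: the sketch below counts, for a handful of points (coordinates assumed, kept in general position), how many distinct labelings a line can produce by sweeping random directions and all threshold splits along each direction.

```python
import math
import random

def dichotomies_of_lines(points, n_dirs=20000, seed=0):
    """Randomized count of the labelings ('dichotomies') a line can realize on
    `points`: for each random direction, every threshold split of the sorted
    projections is an achievable labeling, in both orientations."""
    rng = random.Random(seed)
    n = len(points)
    seen = set()
    for _ in range(n_dirs):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        w = (math.cos(angle), math.sin(angle))
        order = sorted(range(n),
                       key=lambda i: w[0] * points[i][0] + w[1] * points[i][1])
        for k in range(n + 1):                      # threshold after position k
            labels = [0] * n
            for i in order[k:]:
                labels[i] = 1
            seen.add(tuple(labels))
            seen.add(tuple(1 - v for v in labels))  # flipped orientation
    return len(seen)

# Point coordinates are assumed; they are only kept in general position.
print(dichotomies_of_lines([(0, 0), (1, 0.2)]))                        # 4
print(dichotomies_of_lines([(0, 0), (1, 0.2), (0.3, 1)]))              # 8
print(dichotomies_of_lines([(0, 0), (1, 0.1), (0.2, 1), (1.1, 1.2)]))  # 14
```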
Define Growth function
• effective(N) depends on the particular inputs x1, …, xN
• Growth function: remove the dependence by taking the max. over all possible x1, …, xN:
  m_H(N) = max over x1, …, xN of the number of dichotomies H generates on them
• This is finite, upper bounded by 2^N
  N = 1: m_H(N) = 2
  N = 2: m_H(N) = 4
  N = 3: m_H(N) = max(…, 6, 8) = 8
  N = 4: m_H(N) = 14 (< 2^N)
• How to calculate this function? Refer to Learning from Data, Chapter 2.
• (exact derivation beyond the scope of this course)
Summary of growth functions
• k: a no. of points on which H can no longer generate all 2^k dichotomies, i.e. m_H(k) < 2^k (a 'break point'; it is 4 for the 2D perceptron)
• Positive rays: m_H(N) = N + 1; break point at 2
• Positive intervals: m_H(N) = ½N² + ½N + 1; break point at 3
• Convex sets: m_H(N) = 2^N; no break point (but data points are seldom convex in practice)
• 2D perceptrons: m_H(N) < 2^N in some cases; break point at 4
• Conjecture:
  • No break point: m_H(N) = 2^N (guaranteed!)
  • Break point k: m_H(N) is polynomial in N, instead of exponential
• Good news! Take away: the number of points for data-driven analysis is usually a 'lot'. A break point can then guarantee that we are safe with m_H(N).
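A short sketch comparing these growth functions against 2^N (the formulas are the ones quoted above; the values of N are arbitrary):

```python
def m_positive_rays(N):      return N + 1
def m_positive_intervals(N): return N * (N + 1) // 2 + 1   # = 0.5 N^2 + 0.5 N + 1
def m_convex_sets(N):        return 2 ** N                 # no break point

for N in (5, 10, 20, 50):
    print(f"N={N:3d}  rays={m_positive_rays(N):4d}  "
          f"intervals={m_positive_intervals(N):6d}  convex=2^N={m_convex_sets(N)}")
# The polynomial growth functions fall far below 2^N as N grows.
```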
Examples:
• H: positive rays: m_H(N) = N + 1 (break point k = 2), i.e. m_H(2) = 3 < 2² = 4
• H: positive intervals: m_H(N) = ½N² + ½N + 1 (break point k = 3)
• H: 2D perceptrons: break point k = 4
• Note that the power of N seems to have a correlation with the break point
Therefore
𝟐 https://www.csie.ntu.edu.tw/~htlin/mooc/
• Instead of ,
𝟐
• We need:
Summary: With different data sets,
the hypothesis set need not increase
as per Union bound. It has to have a
several common hypothesis, and
those duplicates can be removed
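As a closing sketch, the bound with a polynomial growth function is compared against the no-break-point case. The tolerance eps is assumed, and the closed form m_H(N) = N² − N + 2 for the 2D perceptron is a standard result quoted here only because it matches the values 2, 4, 8, 14 above; it is not derived on these slides.

```python
import math

eps = 0.1                                   # assumed tolerance

def m_no_break(N):   return 2 ** N          # no break point: m_H(N) = 2^N
def m_perceptron(N): return N * N - N + 2   # assumed closed form; matches 2, 4, 8, 14

for N in (10, 100, 1000):
    b_exp  = 2 * m_no_break(N)   * math.exp(-2 * eps ** 2 * N)
    b_poly = 2 * m_perceptron(N) * math.exp(-2 * eps ** 2 * N)
    print(f"N = {N:4d}: exponential m_H -> {b_exp:.3g}, polynomial m_H -> {b_poly:.3g}")
# With a polynomial m_H(N) the exponential factor wins and the bound goes to 0;
# with m_H(N) = 2^N it never does.
```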