Today, Lecture 4
1. Bayesian Statistics
• Bayesian prediction, testing
2. E-Processes with simple nulls
• Simple alternative: GRO e-variable
• Composite alternative: learning in Bayesian & non-Bayesian manner
3. Bayesian vs. Neyman-Pearson vs. E-Process Testing with simple nulls
Models
• Let Ω = 𝒳^n be a sample space and suppose we observe data (x_1, …, x_n) ∈ 𝒳^n
• We call a set of distributions ℳ = {P_θ : θ ∈ Θ} on Ω a statistical model (or often: hypothesis) for the data
• Simple example: 𝒳 = {0,1}, Θ = [0,1], ℳ is the Bernoulli model, defined by
  p_θ(x_1, …, x_n) = θ^{n_1} (1−θ)^{n−n_1}, where n_1 is the number of 1s in x_1, …, x_n
Note: of all distributions on Ω, the Bernoulli model is the restriction to those under which the outcomes are i.i.d.
Maximum Likelihood
• The method of maximum likelihood (Fisher, 1922) tells us to pick, as a 'best guess' of the true θ, the value θ̂ maximizing the probability of the actually observed data:
  θ̂ = arg max_θ p_θ(x_1, …, x_n)
• For the Bernoulli model this gives θ̂ = n_1/n, the observed frequency of 1s
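The Bernoulli MLE can be checked numerically. This is a small sketch of my own (not part of the slides): it computes θ̂ = n_1/n and verifies by brute force over a grid that no other candidate achieves a higher likelihood.

```python
from fractions import Fraction

def bernoulli_likelihood(theta, xs):
    """p_theta(x_1,...,x_n) = theta^{n_1} * (1-theta)^{n-n_1}."""
    n1 = sum(xs)
    return theta ** n1 * (1 - theta) ** (len(xs) - n1)

def bernoulli_mle(xs):
    """The likelihood above is maximized by the observed frequency n_1 / n."""
    return Fraction(sum(xs), len(xs))

xs = [1, 0, 1, 1, 0, 1]            # n = 6 observations, n_1 = 4 ones
theta_hat = bernoulli_mle(xs)      # Fraction(2, 3)

# Brute-force check: the best grid point sits next to theta_hat.
grid = [Fraction(k, 100) for k in range(101)]
best = max(grid, key=lambda t: bernoulli_likelihood(t, xs))
```

Exact `Fraction` arithmetic avoids any floating-point ties when comparing likelihood values on the grid.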
The Likelihood Function
[Plot: p_θ(X^n) as a function of θ]
The Bayesian Posterior
• From the Bayesian perspective, you do not necessarily want to make a 'single' estimate of θ
• Rather, you want to report the full posterior – this encapsulates everything you have learned from the data
• Example – Bernoulli model with prior P on Θ = [0,1]
• We have already seen the example with a prior P putting mass 1/2 on each of two parameter values; the posterior was P(θ|D), a probability distribution on those 2 parameter values
The Bayesian Posterior
• If we want to take a prior on the full Bernoulli model, we should take one with a continuous probability density p(θ)
• Everything works as before: the posterior is
  p(θ | x^n) = p_θ(x^n) p(θ) / ∫ p_{θ′}(x^n) p(θ′) dθ′ ∝ p_θ(x^n) p(θ)
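To make the continuous-prior case concrete: for Bernoulli data under the uniform prior, the posterior is a Beta(n_1+1, n_0+1) density. A sketch of my own (not from the slides) that spells out the normalizing constant and checks numerically that the density integrates to 1:

```python
from math import comb

def posterior_density(theta, n1, n0):
    """Posterior of theta under a uniform prior, given n1 ones and n0 zeros:
    proportional to theta^{n1} (1-theta)^{n0}; the normalizing constant is
    1/B(n1+1, n0+1) = (n+1) * C(n, n1)."""
    n = n1 + n0
    return (n + 1) * comb(n, n1) * theta**n1 * (1 - theta)**n0

# Midpoint-rule check that the density integrates to 1 over [0, 1].
m = 100_000
total = sum(posterior_density((k + 0.5) / m, n1=7, n0=3) for k in range(m)) / m
```

The identity 1/B(n_1+1, n_0+1) = (n+1)·C(n, n_1) follows from B(a, b) = (a−1)!(b−1)!/(a+b−1)!.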
The Bayesian Posterior
• If we take the uniform prior p(θ) ≡ 1, the posterior is proportional to the likelihood function!
• For more general models, a uniform prior is not always well-defined (and even for Bernoulli, perhaps not desirable!)
• Why not desirable? It is not invariant to reparametrization:
  • …we could just as well have parametrized the model by p_θ(X_i = 1) = θ², and a prior that is uniform in one parametrization is not uniform in the other
The Bayesian Posterior
• For general parametric models and continuous priors, the posterior looks more and more like a normal distribution as n increases, centered around the MLE θ̂, with standard deviation of order 1/√n (variance of order 1/n)
A Note On Notation
• We will henceforth use w(θ) and w(θ|D) = w(θ|X^n) for prior and posterior (w stands for "weight"), and write p_θ(X^n) instead of p(X^n | θ), and p_W(X^n) for p(X^n), the marginal probability of the data.
• So Bayes' theorem becomes
  w(θ | X^n) = p_θ(X^n) w(θ) / p_W(X^n)
  …and p_W(X^n) = ∫ p_θ(X^n) w(θ) dθ
Bayesian Prediction / Predictive Estimation
• As a Bayesian you prefer to output the full posterior
• But what if you are asked to make a specific prediction for the next outcome? Then you have to come up with a single distribution after all
• Bayesian predictive distribution:
  p̄(X_{n+1} | X^n) = ∫ p_θ(X_{n+1}) w(θ | X^n) dθ
Laplace Rule of Succession
• For the Bernoulli model with uniform prior W,
  P̄(X_{n+1} = 1 | X^n) = (n_1 + 1) / (n + 2), where n_1 is the number of 1s in X^n
  …a formula first derived by Laplace, around 1800.
• We can also view these predictions as a 'Bayesian estimate' of θ…
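The rule can be verified as a ratio of Bayes marginal likelihoods: under the uniform prior the marginal of a binary sequence with n_1 ones and n_0 zeros is n_1! n_0!/(n+1)!, and the predictive probability is the marginal of (x^n, 1) divided by the marginal of x^n. A sketch of my own:

```python
from fractions import Fraction
from math import factorial

def laplace_rule(xs):
    """P(X_{n+1} = 1 | x^n) = (n_1 + 1) / (n + 2)."""
    return Fraction(sum(xs) + 1, len(xs) + 2)

def bayes_marginal(xs):
    """Marginal probability of x^n under a uniform prior on theta:
    integral of theta^{n1} (1-theta)^{n0} dtheta = n1! n0! / (n+1)!."""
    n1 = sum(xs)
    n0 = len(xs) - n1
    return Fraction(factorial(n1) * factorial(n0), factorial(len(xs) + 1))

xs = [1, 0, 1]
# Predictive probability = marginal of (x^n, 1) divided by marginal of x^n:
ratio = bayes_marginal(xs + [1]) / bayes_marginal(xs)
```

With exact fractions the two computations agree identically, not just up to rounding.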
Two Fundamentally Different Uses of Bayes
Theorem
1. A Priori Probabilities can be meaningfully estimated
(medical testing, for example!)
2. A Priori Probabilities are a wild guess (and conceivably do not exist)
• Sweden/France
• “Bayesian inference” in statistics
(…in reality it’s often ‘somewhere in the middle’)
Hypothesis Testing
via Bayes Factors
• Bayes factor testing: an alternative to Neyman-Pearson / E-based testing
• First, a very special case: H_0 and H_1 are both point (simple) hypotheses, just like the last two weeks
• Jeffreys: evidence in favor of H_1, against H_0, should be measured by the Bayes factor
  BF = p(D | H_1) / p(D | H_0)
  = likelihood ratio (but only if H_0 and H_1 are simple)
  = posterior odds if prior odds are equal
Hypothesis Testing
via Bayes Factors
• Composite case: still the Bayes factor p(D | H_1) / p(D | H_0)
• …with now p(D | H_j) = ∫ p(D | θ) p(θ | H_j) dθ given by the marginal likelihood (the probability of the data, averaged according to the prior 'within' H_j)
• Evidence in favor of H_1, against H_0, is still measured by the Bayes factor
  = marginal likelihood ratio, ≠ standard likelihood ratio
  = posterior odds if prior odds are equal
Example:
testing whether a coin is fair
• Under P_θ, data are i.i.d. Bernoulli(θ)
• Θ_0 = {1/2}, Θ_1 = [0,1] ∖ {1/2}
• Θ_0 is simple, so no need to put a prior on its elements
• Θ_1 is represented by (for example) w_1(θ), the uniform prior density on [0,1] (it puts mass 0 on 1/2, so this seems o.k.)
• Evidence against H_0 measured by the Bayes factor
  ∫ p_θ(X^n) w_1(θ) dθ / p_{1/2}(X^n)
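For this example the Bayes factor has a closed form: the marginal likelihood under the uniform prior is n_1! n_0!/(n+1)!, while p_{1/2}(x^n) = 2^{−n}. A sketch of my own illustrating the slide's formula with exact arithmetic:

```python
from fractions import Fraction
from math import factorial

def bayes_factor_fair_coin(n1, n0):
    """BF_{10} for H_0: theta = 1/2 vs H_1: uniform prior on theta."""
    n = n1 + n0
    marginal_h1 = Fraction(factorial(n1) * factorial(n0), factorial(n + 1))
    likelihood_h0 = Fraction(1, 2**n)
    return marginal_h1 / likelihood_h0

# Ten heads in a row: strong evidence against fairness.
bf_extreme = bayes_factor_fair_coin(10, 0)    # 1024/11, about 93
# Five heads, five tails: the Bayes factor favors the fair coin.
bf_balanced = bayes_factor_fair_coin(5, 5)
```

A BF above 1 favors H_1; a BF below 1 favors the fair-coin null.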
Bayes factor testing in
'non-Bayesian' notation
H_0 = {p_θ : θ ∈ Θ_0} vs H_1 = {p_θ : θ ∈ Θ_1}:
Evidence in favour of H_1 provided by the data is measured by
  p̄_1(X^n) / p̄_0(X^n), where p̄_j(X^n) = ∫ p_θ(X^n) w_j(θ) dθ
Example:
testing whether a coin is fair
• Evidence against H_0 measured by the Bayes factor
  ∫ p_θ(X^n) w_1(θ) dθ / p_{1/2}(X^n)
• …wait! Last week we saw the same formula as an e-process for testing H_0 against H_1!?
Today, Lecture 4
1. Bayesian Statistics
• Bayesian prediction, testing
2. E-Processes with simple nulls
• Simple alternative: GRO e-variable
• Composite alternative: learning in Bayesian & non-Bayesian manner
3. Bayesian vs. Neyman-Pearson vs. E-Process Testing with simple nulls
E-Processes and Betting
• Let 𝒳 = {1, …, K}.
  At each time t = 1, 2, … there are K tickets available. Ticket k pays off 1/p_0(k) if the outcome is k, and 0 otherwise.
  You may buy multiple and fractional numbers of tickets.
• You start by investing an initial capital of $1 in round 1.
• At each time t you put fraction P̄_1(X_t = k | X^{t−1}) of your money on outcome k. Then your total capital M^{(t)} gets multiplied by M_t := p̄_1(X_t | X^{t−1}) / p_0(X_t)
• After 1 outcome you either stop with end capital M_1 or continue, putting fraction P̄_1(X_2 = k | X_1) of M_1 on outcome X_2 = k ("reinvest everything"). After the 2nd outcome you stop with end capital M^{(2)} = M_1 ⋅ M_2, or you continue, and so on…
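The protocol above can be simulated directly: capital evolves as the running product of p̄_1/p_0 factors. A minimal sketch of my own, where the fair-coin null and the Laplace-rule betting strategy are my example choices:

```python
def run_bets(xs, p0, strategy):
    """Bet fraction strategy(history)[k] of current capital on outcome k;
    ticket k pays 1/p0[k], so capital multiplies by strategy(history)[x]/p0[x]."""
    capital = 1.0
    history = []
    for x in xs:
        bets = strategy(history)       # a probability distribution over outcomes
        capital *= bets[x] / p0[x]
        history.append(x)
    return capital

def laplace_strategy(history):
    """pbar_1 given by the Laplace rule of succession."""
    n1 = sum(history)
    p_one = (n1 + 1) / (len(history) + 2)
    return {1: p_one, 0: 1 - p_one}

p0 = {0: 0.5, 1: 0.5}                  # null: fair coin
capital = run_bets([1, 1, 1, 1], p0, laplace_strategy)   # grows on skewed data
```

On four heads in a row the factors are 1 · (4/3) · (3/2) · (8/5), so the capital ends at 3.2; on balanced data the same strategy loses money.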
Good Betting Strategies
• If the null is true, you do not expect to gain any money, under any stopping time, no matter what strategy p̄_1 you use
• If you think the alternative is a specific p_1, then using p̄_1 = p_1 is a good idea
  • "constant" strategy
• If you think H_0 is wrong, but you do not know which alternative is true, then… you can try to learn p_1
  • Use a p̄_1 that better and better mimics the true, or just "best", fixed p_1
Simple H_1, log-optimal betting
If null and alternative are simple, H_0 = {P_0}, H_1 = {P_1}, and X_1, X_2, … are i.i.d. according to P_1, then using p̄_1 = p_1 is a good idea. Why?
• For any choice of e-variable S_i = s(X_i), we have, with S^{(n)} = ∏_{i=1}^n s(X_i),
  (1/n) log S^{(n)} = (1/n) ∑_{i=1}^n log S_i → E_{X∼P_1}[log s(X)],  P_1-a.s.
• …hence if we measure evidence against H_0 with the same e-variable s(X_i) at each i, we would like to pick the s*(X) maximizing
  E_{X∼P_1}[log s(X)] over all e-variables s(X) for H_0:
  it leads a.s. to exponentially more money than any other e-variable!
• The argument can be extended: ∏_{i=1}^n s*(X_i) remains best even among all (non-time-constant) e-processes
Simple H_1, log-optimal betting
We aim to pick the s*(X) maximizing
  E_{X∼P_1}[log s(X)] over all e-variables s(X) for H_0
It turns out that the maximum is achieved for s*(X) = p_1(X)/p_0(X): the LR e-variable
• We say: betting according to p_1(X_i) at each X_i is log-optimal or GRO (GRO = Growth-Rate Optimal)
• We say that the LR e-variable s*(X) is log-optimal/GRO
• Note that many sub-log-optimal e-variables exist as well…
  e.g. λ + (1−λ) p_1(X)/p_0(X) for any λ ∈ [0,1], or the Neyman-Pearson e-variable
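The growth-rate comparison can be checked numerically on a small example. In this sketch (the particular P_0 and P_1 are my own choices, not from the slides), the LR e-variable achieves the largest E_{P_1}[log s(X)], while the λ-mixtures remain valid e-variables (P_0-expectation 1) with strictly smaller growth:

```python
from math import log

p0 = {0: 0.5, 1: 0.5}          # simple null
p1 = {0: 0.2, 1: 0.8}          # simple alternative

def growth_rate(s, p):
    """E_{X~P}[log s(X)] over a finite outcome space."""
    return sum(p[x] * log(s[x]) for x in p)

def expectation(s, p):
    """E_{X~P}[s(X)]; equals 1 under p0 for a valid e-variable."""
    return sum(p[x] * s[x] for x in p)

lr = {x: p1[x] / p0[x] for x in p0}                 # GRO / log-optimal e-variable
mixtures = [{x: lam + (1 - lam) * lr[x] for x in p0}
            for lam in (0.25, 0.5, 0.75)]           # sub-log-optimal e-variables
```

Here growth_rate(lr, p1) equals the KL divergence between P_1 and P_0, the best achievable exponential growth rate.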
Simple H_1, log-optimal betting
We aim to pick the s*(X) maximizing
  E_{X∼P_1}[log s(X)] over all e-variables s(X) for H_0
The maximum is achieved for s*(X) = p_1(X)/p_0(X).
Proof: homework (with a substantial hint)
Composite H_1
• If you think H_0 is wrong, but you do not know which alternative is true, then… you can try to learn p_1
• Use a p̄_1 that better and better mimics the true, or just "best", fixed p_1
Example, H_0: X_i ∼ Ber(1/2), H_1: X_i ∼ Ber(θ), θ ≠ 1/2: set
  p̄_1(X_{n+1} = 1 | x^n) := (n_1 + 1)/(n + 2), where n_1 is the number of 1s in x^n
…we use notation for conditional probabilities, but we should really think of p̄_1 as a sequential betting strategy, with the "conditional probabilities" indicating how to bet/invest in the next round, given the past data
Composite H_1
Example, H_0: X_i ∼ Ber(1/2); set
  p̄_1(X_{n+1} = 1 | x^n) := (n_1 + 1)/(n + 2), where n_1 is the number of 1s in x^n
Using telescoping-in-reverse, we find that p̄_1 also uniquely defines a marginal probability distribution for X^n, for each n, and our accumulated capital at time n is again given by the likelihood ratio:
  ∏_{i=1}^n p̄_1(X_i | X^{i−1}) / p_0(X_i) = p̄_1(X^n) / p_0(X^n) = ∫ p_θ(X^n) w(θ) dθ / p_0(X^n)
(with w the uniform prior)
Last week's "plug-in" strategy turns out to be equal to a Bayesian strategy: the Laplace Rule of Succession
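The telescoping identity can be verified exactly with rational arithmetic: the running product of Laplace-rule prediction ratios coincides with the Bayes marginal likelihood ratio under the uniform prior, on every binary sequence. A sketch of my own:

```python
from fractions import Fraction
from math import factorial
from itertools import product

def plugin_capital(xs):
    """Product over t of pbar_1(x_t | x^{t-1}) / p_0(x_t),
    with the Laplace rule and the fair-coin null p_0 = 1/2."""
    cap = Fraction(1)
    n1 = 0
    for t, x in enumerate(xs):
        pred_one = Fraction(n1 + 1, t + 2)
        pred = pred_one if x == 1 else 1 - pred_one
        cap *= pred / Fraction(1, 2)
        n1 += x
    return cap

def bayes_capital(xs):
    """Marginal likelihood under the uniform prior, divided by p_0(x^n) = 2^{-n}."""
    n1 = sum(xs)
    n = len(xs)
    marginal = Fraction(factorial(n1) * factorial(n - n1), factorial(n + 1))
    return marginal / Fraction(1, 2**n)

# The two capitals agree exactly on every binary sequence of length 6.
all_equal = all(plugin_capital(list(xs)) == bayes_capital(list(xs))
                for xs in product([0, 1], repeat=6))
```

Because the prediction for x = 0 is (n_0 + 1)/(t + 2), each factor is exactly the ratio of consecutive Beta marginals, which is what makes the product telescope.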
Composite H_1: plug-in vs. Bayes
Two general strategies for learning P_1 ∈ H_1:
• "prequential plug-in" (or simply "plug-in") vs.
• "method-of-mixture" (or, in the present simple context, simply "Bayesian")
H_1 the Bernoulli model:
• the plug-in based on the regularized MLE (n_1 + 1)/(n + 2) is precisely equal to the Bayesian strategy based on the uniform prior
Composite H_1: plug-in vs. Bayes
Two general strategies for learning P_1 ∈ H_1:
• "prequential plug-in" (or simply "plug-in") vs.
• "method-of-mixture" (or, in the present simple context, simply "Bayesian")
H_1 the Bernoulli model:
• the plug-in based on the regularized MLE (n_1 + m_1)/(n + m_1 + m_2) is precisely equal to the Bayesian strategy based on the beta prior B(m_1, m_2)
Composite H_1: plug-in vs. Bayes
H_1 the Bernoulli model:
• plug-in can be precisely equal to a Bayesian strategy
• This is highly specific to Bernoulli/multinomial models; e.g. take H_1 = {N(μ, 1) : μ ∈ ℝ}:
  • plug-in: normal density with mean (∑_{i=1}^n X_i + a)/(n + 1) and variance 1
  • Bayes with normal prior N(a, ρ): Bayes predictive distribution with the same mean but variance 1 + ρ_n > 1, with ρ_n the posterior variance of μ ("out-model": the predictive does not lie in H_1)
• Other models: differences even more substantial
General Insight for
Simple Nulls, Composite Alternatives
• If the null is simple, every Bayes factor defines an e-process: with q(X^n) = ∫ p_θ(X^n) w_1(θ) dθ the Bayes marginal under H_1,
  E_{P_0}[ q(X^n) / p_0(X^n) ] = ∫ p_0(X^n) ⋅ ( q(X^n) / p_0(X^n) ) dX^n = ∫ q(X^n) dX^n = 1
• …but there are e-processes which are not Bayes factors
  • general plug-in processes, e.g. for non-Bernoulli models
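The identity E_{P_0}[q(X^n)/p_0(X^n)] = 1 can be checked by brute force over all binary sequences of a given length, with q the Bayes marginal under the uniform prior. A sketch of my own:

```python
from fractions import Fraction
from math import factorial
from itertools import product

def q_marginal(xs):
    """Bayes marginal of x^n under a uniform prior on Bernoulli theta."""
    n1 = sum(xs)
    n = len(xs)
    return Fraction(factorial(n1) * factorial(n - n1), factorial(n + 1))

n = 8
p0_seq = Fraction(1, 2**n)   # each length-n sequence has probability 2^{-n} under Ber(1/2)

# E_{P_0}[ q(X^n) / p_0(X^n) ] summed exactly over the whole sample space:
e_value = sum(p0_seq * (q_marginal(xs) / p0_seq) for xs in product([0, 1], repeat=n))
```

The p_0 factors cancel, so the sum is just the total q-probability of the sample space, which is 1: exactly the argument on the slide, instantiated numerically.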
Today, Lecture 4
1. Bayesian Statistics
• Bayesian prediction, testing
2. E-Processes with simple nulls
• Simple alternative: GRO e-variable
• Composite alternative: learning in Bayesian & non-Bayesian manner
3. Bayesian vs. Neyman-Pearson vs. E-Process Testing with simple nulls
Similarities & Differences
Bayes Factor vs Neyman-Pearson vs E-Testing
• In Bayesian testing, the roles of H_0 and H_1 are symmetrical
• In NP and E-testing they are not
  • Type-I error control is the most important
  • May seem like a bug, but turns out to be a feature when moving to confidence intervals
• Likelihood ratios play an important role in all three theories
  • NP: via the NP Lemma
  • E: via growth-rate optimality of the likelihood ratio
  • Bayes: via the occurrence of the likelihood in Bayes' theorem
Differences
Bayes Factor vs Neyman-Pearson
• The Bayesian views (marginal) likelihood ratios as evidence in favour of either hypothesis and views the goal of testing as induction: one wants to find out which is true, H_0 or H_1, and gets statements like 'the probability that H_1 is true is close to 95%'
• The Neymanian thinks that statements like 'the probability of H_1 is…' are meaningless and that finding out which one is true is too ambitious. She is only interested in inductive behavior: not making mistakes too often if one does many hypothesis tests in one's lifetime
BF vs NP vs E
• Even though the philosophies are different, we can still try to compare the methods more closely
• As a Bayesian you can report the full posterior, but it is also fine to merely use the posterior as a tool if your goal is to make a specific decision (which, as in NP theory, can e.g. be 'accept' or 'reject')
• It then makes sense to reject the null if the Bayes posterior for H_0 is smaller than α, since then the conditional (on the data) Type-I error, i.e. the probability that H_0 is true given that you reject it, is bounded by α:
  P(H_0 is true | δ(X^n) = reject) ≤ α
The Bayesian's Conditional Type-I Error
  P(H_0 is true | δ(X^n) = reject) ≤ α
• This is intuitively correct, but it does need proof:
• P(H_0 is true | {X^n : δ(X^n) = reject})
  = E_{X^n ∼ P | {X^n : δ(X^n) = reject}} [ P(H_0 is true | X^n) ]
  ≤ E_{X^n ∼ P | {X^n : δ(X^n) = reject}} [α] = α
BF in "some sense"
less conservative than E
• With α = 0.05 = 1/20 and w(H_0) = w(H_1) = 1/2, P(H_0 | X^n) ≤ 1/20 is equivalent to Bayes factor ≥ 19
• The Bayesian would reject the null if BF ≥ 19 and would get a conditional Type-I error probability bound of 0.05
• The E-Statistician, who uses Bayesian learning for H_1, would reject the null if BF ≥ 20 and get an unconditional Type-I error probability bound of 0.05
• Conditional bounds imply unconditional ones (why?), but not vice versa.
• It seems the Bayesian gets a better bound with a less conservative rule!?!?
This is possible because the Bayesian makes much stronger assumptions:
• E-bounds hold irrespective of whether the (uniform) prior on H_1 is "correct"
• Bayesian bounds rely on the correctness of this prior.
BF usually
more conservative than NP
• With α = 0.05 = 1/20 and w(H_0) = w(H_1) = 1/2, P(H_0 | X^n) < 1/20 is equivalent to BF > 19
• Suppose H_0, H_1 are simple (so the Bayes factor = LR), α = 0.05
• NP: reject the null if LR ≥ ℓ, with ℓ such that P_{H_0}(LR ≥ ℓ) = 0.05, i.e. p ≤ 0.05
  • (in contrast to BF and E, the NP test does not depend on the actual alternative P_1 ∈ H_1 or a prior thereon; this is one advantage of it!)
How difficult is p < 0.05 as a function of n? (fair-coin test: number of 1s needed)
  n:               10    20    30    50    100   200   500
  n_1 needed:      ≥9    ≥15   ≥20   ≥32   ≥59   ≥113  ≥269
  fraction of 1s:  90%   75%   67%   64%   59%   56%   54%
How difficult is BF > 19?
  n:               10    20    30    50    100   200   500
  n_1 needed:      ≥10   ≥17   ≥24   ≥36   ≥66   ≥124  ≥289
  fraction of 1s:  100%  85%   80%   72%   66%   62%   58%
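Both tables can be reproduced with a short computation (a sketch of my own): the NP row is the smallest n_1 whose one-sided binomial tail probability under Ber(1/2) is at most 0.05, and the BF row is the smallest n_1 (at or above n/2) whose Bayes factor under the uniform prior exceeds 19. The one-sided tail is an assumption on my part; it matches the tabulated entries.

```python
from fractions import Fraction
from math import comb, factorial

def np_threshold(n, alpha=Fraction(5, 100)):
    """Smallest n1 with P(N >= n1) <= alpha under Binomial(n, 1/2), one-sided."""
    tail = Fraction(0)
    for k in range(n, -1, -1):
        tail += Fraction(comb(n, k), 2**n)
        if tail > alpha:
            return k + 1
    return 0

def bf_threshold(n, bound=19):
    """Smallest n1 >= n/2 whose Bayes factor (uniform prior) exceeds bound.
    BF(n1) = 2^n * n1! (n-n1)! / (n+1)!, symmetric around n/2."""
    for n1 in range(n // 2, n + 1):
        bf = Fraction(factorial(n1) * factorial(n - n1), factorial(n + 1)) * 2**n
        if bf > bound:
            return n1
    return None
```

Exact `Fraction` arithmetic means the comparisons against 0.05 and 19 are free of rounding issues.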
Upcoming Weeks
• Beyond Testing: Confidence Intervals
• Composite null hypotheses
• Math: exponential families, concentration inequalities