Bayesian-Inference Class Notes

These notes introduce the St. Petersburg Paradox, which illustrates the limitations of expected monetary value, and contrast Bayesian and frequentist statistical methods. They explain Bayesian inference, its historical context, and the application of Bayes' theorem in estimating probabilities based on prior beliefs and observed data. Additionally, they highlight the shortcomings of frequentist inference and provide examples to illustrate Bayesian principles.


As wealth increases, the "utility" that one receives from each additional increase in wealth grows less than
proportionally. In the St. Petersburg Paradox, prizes go up at the same rate that the probabilities
decline. In order to obtain a finite valuation, the trick is to allow the "value" or "utility" of
prizes to increase more slowly than the rate at which the probabilities decline.

The St. Petersburg Paradox


Consider the following game. A coin is tossed until a head appears. If the first head appears on the jth
toss, then the payoff is £2^j. How much should you pay to play this game?
Let Sj denote the event that the first head appears on the jth toss. Then p(Sj) = (1/2)^j and the payoff for
Sj is Vj = 2^j, so
EMV(A1) = Σ_{j=1}^∞ Vj p(Sj) = Σ_{j=1}^∞ 2^j (1/2)^j = Σ_{j=1}^∞ 1 = ∞
The expected monetary payoff is infinite. However much you pay to play the game, you may expect to
win more. Would you risk everything that you possess to play this game? One would suppose that real-
world people would not be willing to risk an infinite amount to play this game.
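
As an illustration (not part of the original notes), a short simulation of the game shows how the average payoff over many plays is large and unstable because of rare long runs of tails, which is the practical face of the infinite expected value. The cap on tosses is only a practical limit for the simulation.

    import random

    def play_once(max_tosses=60):
        """Toss a fair coin until the first head appears; the payoff is 2**j for a head on toss j.
        max_tosses is just a practical cap for the simulation, not part of the game."""
        j = 1
        while random.random() < 0.5 and j < max_tosses:   # random() < 0.5 counts as "tails"
            j += 1
        return 2 ** j

    n_plays = 100_000
    average = sum(play_once() for _ in range(n_plays)) / n_plays
    print(f"average payoff over {n_plays} plays: {average:.1f}")   # large and varies a lot from run to run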

Bayesian Estimation and Inference


Introduction
The most frequently used statistical methods are known as frequentist (or classical) methods. These
methods assume that unknown parameters are fixed constants, and they define probability by using limiting
relative frequencies. It follows from these assumptions that probabilities are objective and that you cannot
make probabilistic statements about parameters because they are fixed. Bayesian methods offer an
alternative approach; they treat parameters as random variables and define probability as “degrees of belief”
(that is, the probability of an event is the degree to which you believe the event is true). It follows from
these postulates that probabilities are subjective and that you can make probability statements about
parameters. The term “Bayesian” comes from the prevalent usage of Bayes’ theorem, which was named
after the Reverend Thomas Bayes, an eighteenth century Presbyterian minister. Bayes was interested in
solving the question of inverse probability: after observing a collection of events, what is the probability of
one event?

As opposed to the point estimators (means, variances) used by classical statistics, Bayesian statistics is
concerned with generating the posterior distribution of the unknown parameters given both the data and
some prior density for these parameters. As such, Bayesian statistics provides a much more complete
picture of the uncertainty in the estimation of the unknown parameters, especially after the confounding
effects of nuisance parameters are removed.

Suppose you are interested in estimating θ from data y = {y1, y2, ..., yn} by using a statistical model
described by a density 𝑝(𝑦/𝜃). Bayesian philosophy states that 𝜃 cannot be determined exactly, and
uncertainty about the parameter is expressed through probability statements and distributions. You can say
that 𝜃 follows a normal distribution with mean 0 and variance 1, if it is believed that this distribution best
describes the uncertainty associated with the parameter. The following steps describe the essential elements
of Bayesian inference:
i) A probability distribution for 𝜃 is formulated as 𝜋(𝜃), which is known as the prior distribution, or just
the prior. The prior distribution expresses your beliefs about the parameter before observing the data.


ii) Given the observed data y, you choose a statistical model 𝑝(𝑦/𝜃)to describe the distribution of y given
𝜃.
iii) You update your beliefs about 𝜃 by combining information from the prior distribution and the data
through the calculation of the posterior distribution, π(θ/y).
The third step is carried out by using Bayes’ theorem, which enables you to combine the prior distribution
and the model.
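
As a concrete sketch of these three steps (the Beta(2, 2) prior and the 7-heads-in-10-tosses data below are assumed purely for illustration, not taken from the notes), prior and likelihood can be combined numerically on a grid of θ values:

    import numpy as np
    from scipy.stats import beta, binom

    # Step (i): a prior pi(theta), here a Beta(2, 2) prior on a coin's head probability (assumed example).
    # Step (ii): a model p(y/theta), here y = 7 heads in n = 10 tosses with a binomial likelihood (assumed data).
    # Step (iii): the posterior pi(theta/y), proportional to prior times likelihood, normalised on the grid.
    theta = np.linspace(0.001, 0.999, 999)
    dtheta = theta[1] - theta[0]
    prior = beta.pdf(theta, 2, 2)
    likelihood = binom.pmf(7, 10, theta)
    unnormalised = prior * likelihood
    posterior = unnormalised / (unnormalised.sum() * dtheta)   # numerical normalising constant p(y)

    print("approximate posterior mean:", (theta * posterior).sum() * dtheta)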

History of Bayesian Statistics


Bayesian methods originated with Bayes and Laplace (late 1700s to mid 1800s). In the early 1920s, Fisher
put forward an opposing viewpoint: that statistical inference must be based entirely on probabilities with a
direct experimental interpretation, i.e. the repeated sampling principle.

In 1939 Jeffreys' book 'Theory of Probability' started a resurgence of interest in Bayesian inference.
This continued throughout the 1950s and 1960s, especially as problems with the frequentist approach started to
emerge. The development of simulation-based inference has transformed Bayesian statistics in the last 20-30
years, and it now plays a prominent part in modern statistics.

Problems with Frequentist Inference


1) Frequentist inference generally does not condition on the observed data
A confidence interval is a set-valued function C(X) ⊆ Θ of the data X which covers the parameter,
θ ∈ C(X), for a fraction 1 - α of repeated draws of X taken under θ.
This is not the same as the statement that, given data X = x, the interval C(x) covers θ with
probability 1 - α. But this is the type of statement we might wish to make.
Example
Suppose X1, X2 ~ U(θ - ½, θ + ½) and let X(1) and X(2) be the order statistics. Then C(X) = (X(1), X(2))
is a ½-level CI for θ. Suppose that in your data X = x you find X(2) - X(1) > ½ (this happens in a quarter of data
sets); then θ ∈ (X(1), X(2)) with probability one.

2) Frequentist Inference depends on data that were never observed


The likelihood principle. Suppose that two experiments relating to θ, E1 and E2, give rise to data y1 and y2 such
that the corresponding likelihoods are proportional, that is, for all θ,
L(θ; y1, E1) = c L(θ; y2, E2).
Then the two experiments lead to identical conclusions about θ. Key point: MLEs respect the
likelihood principle, i.e. the MLEs for θ are identical in both experiments. But significance tests do
not respect the likelihood principle.
Consider a pure test of significance where we specify just H0. We must choose a test statistic T(x),
and define the p-value for data T(x) = t as p-value = P(T(X) at least as extreme as t / H0).
The choice of T(X) amounts to a statement about the direction of likely departures from the null, which
requires some consideration of alternative models.

Note
(i) The calculation of the p-value involves a sum (or integral) over data that was not observed, and
this can depend upon the form of the experiment.
(ii) A p-value is not P(H0/T(X) = t).
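
A small numerical check of point (i), using the standard 9-heads-in-12-tosses illustration (these numbers are not from the notes): a binomial experiment (fixed number of tosses) and a negative binomial experiment (toss until the third tail) give proportional likelihoods for θ, yet different p-values under H0: θ = 0.5, because the p-value sums over unobserved data whose definition depends on the stopping rule.

    from scipy.stats import binom, nbinom

    # Same data (9 heads, 3 tails), same likelihood kernel theta^9 (1 - theta)^3, two stopping rules.
    # Experiment E1: toss exactly n = 12 times and count heads.
    p_binomial = binom.sf(8, 12, 0.5)           # P(at least 9 heads in 12 tosses)
    # Experiment E2: toss until the 3rd tail appears; nbinom counts the heads seen before then.
    p_negative_binomial = nbinom.sf(8, 3, 0.5)  # P(at least 9 heads before the 3rd tail)
    print(p_binomial, p_negative_binomial)      # roughly 0.073 versus 0.033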


Bayes’ Theorem
Bayes' theorem shows the relation between two conditional probabilities that are the reverse of each other.
This theorem is named after Reverend Thomas Bayes (1701-1761), and is also referred to as Bayes' law or
Bayes' rule (Bayes and Price 1763). The foundation of Bayesian statistics is Bayes’ theorem. Suppose we
observe a random variable X and wish to make inferences about another random variable 𝜃, where 𝜃 is
drawn from some distribution 𝑝(𝜃). From the definition of conditional probability,
p(θ/x) = p(x, θ)/p(x)
Again from the definition of conditional probability, we can express the joint probability by conditioning on
θ to give p(x, θ) = p(θ)p(x/θ). Putting these together gives Bayes' theorem:
p(θ/x) = p(θ)p(x/θ)/p(x)
With n possible outcomes (θ1, θ2, ..., θn),
p(θi/x) = p(θi)p(x/θi)/p(x) = p(θi)p(x/θi) / Σ_{i=1}^n p(θi)p(x/θi)
p(θ) is the prior distribution of the possible θ values, while p(θ/x) is the posterior distribution of θ
given the observed data x. The origin of Bayes' theorem has a fascinating history (Stigler 1983). It is named
after the Rev. Thomas Bayes, a priest who never published a mathematical paper in his lifetime. The paper
in which the theorem appears was posthumously read before the Royal Society by his friend Richard Price
in 1764. Stigler suggests it was first discovered by Nicholas Saunderson, a blind mathematician/optician
who, at age 29, became Lucasian Professor of Mathematics at Cambridge (the position held earlier by Isaac
Newton).

Example 1
At a certain assembly plant, three machines make 30%, 45%, and 25%, respectively, of the products. It is
known from past experience that 2%, 3% and 2% of the products made by each machine, respectively,
are defective. Now, suppose that a finished product is randomly selected.
a) What is the probability that it is defective?
b) If a product were chosen randomly and found to be defective, what is the probability that it was made
by machine 3?

Solution Consider the following events:


A: the product is defective and Bi : the product is made by machine i=1, 2, 3,
Applying additive and multiplicative rules, we can write
(a) P(A) = P(B1)P(A/B1) + P(B2)P(A/B2) + P(B3)P(A/B3)
         = (0.3)(0.02) + (0.45)(0.03) + (0.25)(0.02) = 0.006 + 0.0135 + 0.005 = 0.0245
(b) Using Bayes' rule, P(B3/A) = P(B3)P(A/B3)/P(A) = 0.005/0.0245 = 0.2041
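
The same computation can be written as a tiny routine for any discrete set of hypotheses (a sketch for illustration; the numbers are those of the example):

    def posterior_probs(priors, likelihoods):
        """Discrete Bayes' rule: posterior_i is proportional to prior_i * likelihood_i."""
        joint = [p * l for p, l in zip(priors, likelihoods)]
        total = sum(joint)                    # P(A), by the law of total probability
        return total, [j / total for j in joint]

    # Machines B1, B2, B3 with the shares and defect rates from the example.
    p_defective, posterior = posterior_probs([0.30, 0.45, 0.25], [0.02, 0.03, 0.02])
    print(p_defective)    # 0.0245
    print(posterior[2])   # P(B3 / defective), about 0.2041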

Example 2 Suppose one in every 1000 families has a genetic disorder (sex-bias) in which they produce
only female offspring. For any particular family we can define the (indicator) random variable
θ = 0 if the family is normal, and θ = 1 if it is a sex-bias family.
Suppose we observe a family with 5 girls and no boys. What is the probability that this family is a sex-bias?
Solution
From prior information, there is a 1/1000 chance that any randomly-chosen family is a sex-bias family, so
𝑝(𝜃 = 1) = 0.001


Likewise x = "five girls", and P(five girls / sex-bias family) = 1. This is p(x/θ). It remains to compute
the probability that a random family from the population with five children has all girls. Conditioning over
all types of families (normal + sex-bias),
Pr(5 girls) = Pr(5 girls / normal) × Pr(normal) + Pr(5 girls / sex-bias) × Pr(sex-bias),
giving
Pr(x) = 0.5^5 × 0.999 + 1 × 0.001 = 0.0322
Hence,
p(θ = 1 / x = 5 girls) = p(θ = 1) × p(x = 5 girls / θ = 1) / p(x = 5 girls) = (0.001 × 1)/0.0322 = 0.03106
Thus, a family with five girls is 31 times more likely than a random family to have the sex-bias disorder.

Model-Based Bayesian Inference


π(Θ/y) = f(y/Θ) π(Θ) / p(y)
where p(y) will be discussed below, π(Θ) is the set of prior distributions of parameter set Θ before y is
observed, f(y/Θ) is the likelihood of y under a model, and π(Θ/y) is the joint posterior distribution,
sometimes called the full posterior distribution, of parameter set Θ that expresses uncertainty about
parameter set Θ after taking both the prior and data into account. Since there are usually multiple
parameters, Θ represents a set of n parameters, and may be considered hereafter as
Θ = {θ1, θ2, ..., θn}
The denominator, p(y), is an integral over all values of Θ of the product f(y/Θ) π(Θ), i.e.
p(y) = ∫ f(y/Θ) π(Θ) dΘ
and can be regarded as a normalising constant whose presence ensures that
π(Θ/y) is a proper density and integrates to one. By replacing p(y) with c, which is short for a 'constant of
proportionality', the model-based formulation of Bayes' theorem becomes
π(Θ/y) = (1/c) f(y/Θ) π(Θ)
Removing c from the equation, one can express Bayes' theorem as
π(Θ/y) ∝ f(y/Θ) π(Θ)
This form can be stated as the unnormalized joint posterior being proportional to the likelihood times the
prior. However, the goal in model-based Bayesian inference is usually not to summarize the unnormalized
joint posterior distribution, but to summarize the marginal distributions of the parameters. The full
parameter set Θ can typically be partitioned into Θ = {Φ, Λ}, where Φ is the sub-vector of interest and Λ
is the complementary sub-vector of Θ, often referred to as a vector of nuisance parameters. In a Bayesian
framework, the presence of nuisance parameters does not pose any formal, theoretical problems. A
nuisance parameter is a parameter that exists in the joint posterior distribution of a model, though it is not a
parameter of interest. The marginal posterior distribution of Φ, the parameter of interest, can simply be
written as
π(Φ/y) = ∫ π(Φ, Λ/y) dΛ

In model-based Bayesian inference, Bayes' theorem is used to estimate the unnormalized joint posterior
distribution, and finally the user can assess and make inferences from the marginal posterior distributions.
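
As an informal sketch of this marginalisation (not part of the original notes; the bivariate normal "joint posterior" below is hypothetical), when joint posterior draws of (Φ, Λ) are available, the marginal posterior of Φ is summarised by simply ignoring the nuisance coordinate:

    import numpy as np

    # Hypothetical joint posterior draws of (phi, lambda), here a correlated bivariate normal.
    rng = np.random.default_rng(1)
    joint = rng.multivariate_normal(mean=[0.0, 2.0], cov=[[1.0, 0.6], [0.6, 2.0]], size=50_000)
    phi_draws = joint[:, 0]                 # keep the parameter of interest, drop the nuisance column
    print(phi_draws.mean(), np.quantile(phi_draws, [0.025, 0.975]))   # marginal posterior summaries of phi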

Likelihood Function
The likelihood function L(θ/x) is a function of θ that shows how "likely" various parameter values θ are
to have produced the data x that were observed. In classical statistics, the specific value of θ that
maximizes L(θ/x) is the maximum likelihood estimator (MLE) of θ.
In many common probability models, when the sample size n is large, L(θ/x) is unimodal in θ.

Definition: Let X1, X2, ..., Xn have a joint density function f(x1, x2, ..., xn / θ). Given that
X1 = x1, X2 = x2, ..., Xn = xn is observed, the function of θ defined by
L(θ/x) = L(θ/x1, x2, ..., xn) = f(x1, x2, ..., xn / θ)
is called the likelihood function. Mathematically, if X1, X2, ..., Xn are independently and identically distributed with Xi ~ f(xi/θ), then
L(θ/x) = ∏_{i=1}^n f(xi/θ)   (where x1, x2, ..., xn are the n data vectors).

The Likelihood Principle of Birnbaum states that (given the data) all of the evidence about θ is contained
in the likelihood function. Often it is more convenient to use the log likelihood
l(θ/x) = ln L(θ/x).
This definition almost seems to be defining the likelihood function to be the same as the pdf or pmf. The
only distinction between these two functions is which variable is considered fixed and which is varying.
When we consider the pdf or pmf f(x/θ), we are considering θ as fixed and x as the variable; when we
consider the likelihood function L(θ/x), we are considering x to be the observed sample point and θ to be
varying over all possible parameter values.

Remarks
• The likelihood function is not a probability density function.
• It is an important component of both frequentist and Bayesian analyses
• It measures the support provided by the data for each possible value of the parameter. If we compare
the likelihood function at two parameter points and find that L(θ1/x) > L(θ2/x), then the sample we
actually observed is more likely to have occurred if θ = θ1 than if θ = θ2, which can be interpreted as
saying that θ1 is a more plausible value for the true value of θ than is θ2. We carefully use the word
"plausible" rather than "probable" because we often think of θ as a fixed value.

Example: Normal distribution. Assume that x1, x2, ..., xn is a random sample from N(μ, σ²), where both
μ and σ² are unknown parameters, μ ∈ R and σ² > 0. With θ = (μ, σ²), the likelihood is
L(θ/x) = ∏_{i=1}^n (2πσ²)^{-1/2} exp( -(xi - μ)²/(2σ²) ) = (2π)^{-n/2} σ^{-n} exp( -(1/(2σ²)) Σ_{i=1}^n (xi - μ)² )
and the log-likelihood is
l(θ/x) = -(n/2) ln(2π) - n ln σ - (1/(2σ²)) Σ_{i=1}^n (xi - μ)²

Example: Poisson distribution. Assume that x1, ..., xn is a random sample from Poisson(λ), with
unknown λ > 0; then the likelihood is
L(λ/x) = ∏_{i=1}^n e^{-λ} λ^{xi} / xi! = e^{-nλ} λ^{Σ xi} / ∏_{i=1}^n (xi!)
and the log-likelihood is
l(λ/x) = -nλ + (ln λ) Σ_{i=1}^n xi - Σ_{i=1}^n ln(xi!)

Example: M&M’s sold in the United States have 50% red candies compared to 30% in those sold in
Canada. In an experimental study, a sample of 5 candies was drawn from an unlabelled bag and 2 red
candies were observed. Is it more plausible that this bag was from the United States or from Canada?


Solution
The likelihood function is L(θ/x) ∝ θ²(1 - θ)³, with θ = 0.3 or 0.5.
L(0.3/x) = 0.03087 < 0.03125 = L(0.5/x), suggesting that it is more plausible that the bag used in the
experiment was from the United States.
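
A quick check of these two likelihood values (a small sketch; the binomial coefficient is the same for both countries, so the kernel θ²(1 - θ)³ used above suffices for the comparison):

    from scipy.stats import binom

    # Likelihood of 2 red candies in a sample of 5, under the US (0.5) and Canadian (0.3) red proportions.
    for theta in (0.3, 0.5):
        print(theta, theta**2 * (1 - theta)**3)   # kernel used in the notes: 0.03087 vs 0.03125
        print(theta, binom.pmf(2, 5, theta))      # full binomial pmf; same ordering of plausibility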

Likelihood Principle
If x and y are two sample points such that L(θ/x) ∝ L(θ/y), then the conclusions drawn from x and y
should be identical. Thus the likelihood principle implies that the likelihood function can be used to compare
the plausibility of various parameter values. For example, if L(θ2/x) = 2L(θ1/x) and
L(θ/x) ∝ L(θ/y), then L(θ2/y) = 2L(θ1/y). Therefore, whether we observed x or y, we would come
to the conclusion that θ2 is twice as plausible as θ1.

Example: Consider the distribution Multinomial(n = 6; θ, θ, 1 - 2θ). The following two samples drawn
from this distribution have proportional likelihoods (as functions of θ):
X = (1, 3, 2):  (6!/(1!3!2!)) θ · θ³ · (1 - 2θ)² ∝ θ⁴(1 - 2θ)²   and
X = (2, 2, 2):  (6!/(2!2!2!)) θ² · θ² · (1 - 2θ)² ∝ θ⁴(1 - 2θ)²
This means both samples would lead us to the same conclusion regarding the relative plausibility of
different values of θ.

The Bayesian Framework


Suppose we observe an iid sample of data x = (x1, x2, ..., xn). Now x is considered fixed and known. We
also must specify π(θ), the prior distribution for θ, based on any knowledge we have about θ before
observing the data.
Our model for the distribution of the data will give us the likelihood
L(θ/x) = ∏_{i=1}^n f(xi/θ)
Then by Bayes' law, our posterior distribution is
π(θ/x) = π(θ)L(θ/x)/p(x) = π(θ)L(θ/x) / ∫ π(θ)L(θ/x) dθ
Note that the marginal distribution of X, p(x), is simply the joint density p(θ, x) (i.e., the numerator) with
θ integrated out. With respect to θ, it is simply a normalizing constant.
Therefore π(θ/x) ∝ π(θ)L(θ/x).
Often we can calculate the posterior distribution by multiplying the prior by the likelihood and then
normalizing the posterior at the last step, by including the necessary constant.

Conjugate Priors
In the Bayesian setting it is important to compute posterior distributions. This is not always an easy task.
The main difficulty is to compute the normalizing constant in the denominator of Bayes theorem. The
appropriate likelihood function (Binomial, Gaussian, Poisson, Bernoulli,...) is typically clear from the
data, but there is a great deal of flexibility when choosing the prior distribution. However, for certain
parametric families there are convenient choices of prior distributions. Particularly convenient is when
the posterior belongs to the same family of distributions as the prior. Such families are called conjugate
families.
Definition (Conjugate Priors): A prior π(θ) for a sampling model is called a conjugate prior if the resulting
posterior π(θ/x) is in the same distributional family as the prior. For example, a Beta prior combined with a binomial
likelihood gives a Beta posterior (with different parameter values!).
Remarks
• The parameters of the prior distribution are called prior hyperparameters. We choose them to best represent our
beliefs about the distribution of θ. The parameters of the posterior distribution are called posterior
hyperparameters.
• Any time a likelihood model is used together with its conjugate prior, we know the posterior is from
the same family as the prior, and moreover we have an explicit formula for the posterior
hyperparameters. A table summarizing some of the useful conjugate prior relationships follows. There
are many more conjugate prior relationships that are not shown in the following table, but they can be
found in reference books on Bayesian statistics.
Likelihood                          Conjugate prior    Prior hyperparameters    Posterior hyperparameters
Bernoulli                           Beta               α, β                     α + x, β + 1 - x
Binomial                            Beta               α, β                     α + x, β + n - x
Poisson                             Gamma              α, β                     α + Σ xi, β + n
Geometric                           Beta               α, β                     α + 1, β + x
Uniform[0, θ]                       Pareto             xs, k                    max{max xi, xs}, k + 1
Exponential                         Gamma              α, β                     α + 1, β + x
Normal (unknown mean, known σ²)     Normal             μ0, τ²                   (n x̄ τ² + μ0 σ²)/(n τ² + σ²), σ²τ²/(n τ² + σ²)
We will now discuss a few of these conjugate prior relationships to try to gain additional insight.

Complete Derivation of Beta/Binomial Bayesian Model


Suppose we observe X1, X2, ..., Xn which are iid Bernoulli(θ) r.v.'s and put Y = Σ_{i=1}^n Xi. Then
Y ~ Bin(n, θ). Let the prior distribution be Beta(α, β), which is a conjugate prior to the binomial
likelihood. Then the posterior of θ given Y = y is Beta(y + α, n - y + β).
We first write the joint density of Y and θ:
π(θ, y) = f(y/θ) π(θ) = [ C(n, y) θ^y (1 - θ)^{n-y} ] × [ (Γ(α + β)/(Γ(α)Γ(β))) θ^{α-1} (1 - θ)^{β-1} ]
        = [ Γ(n + 1)/(Γ(y + 1)Γ(n - y + 1)) ] [ Γ(α + β)/(Γ(α)Γ(β)) ] θ^{y+α-1} (1 - θ)^{n-y+β-1}
Although it is not really necessary, let's derive the marginal density of Y:
p(y) = ∫_0^1 π(θ, y) dθ = [ Γ(n + 1)/(Γ(y + 1)Γ(n - y + 1)) ] [ Γ(α + β)/(Γ(α)Γ(β)) ] ∫_0^1 θ^{y+α-1} (1 - θ)^{n-y+β-1} dθ
     = [ Γ(n + 1)Γ(α + β) / (Γ(α)Γ(β)Γ(y + 1)Γ(n - y + 1)) ] × [ Γ(α + y)Γ(n - y + β) / Γ(n + α + β) ]


Then the posterior π(θ/y) is
π(θ/y) = π(θ, y)/p(y); the common Γ factors in the numerator and denominator cancel, leaving
π(θ/y) = [ Γ(n + α + β) / (Γ(α + y)Γ(n - y + β)) ] θ^{y+α-1} (1 - θ)^{n-y+β-1},   0 < θ < 1
Clearly this posterior is a Beta(α + y, n - y + β) distribution.

Inference with Beta/Binomial Model


As an interval estimate for θ, we could use a (quantile-based or HPD) credible interval based on this
posterior.
As a point estimator of θ, we could use:
i) the posterior mean E(θ/x) (the usual Bayes estimator)
ii) the posterior median
iii) the posterior mode
The mean of the (posterior) beta distribution is
E(θ/x) = (α + y) / [ (α + y) + (n - y + β) ] = (α + y)/(n + α + β)
Note E(θ/x) = (α + y)/(n + α + β) = (y/n) · n/(n + α + β) + (α/(α + β)) · (α + β)/(n + α + β)
So the Bayes estimator E(θ/x) is a weighted average of the usual frequentist estimator (the sample mean) and
the prior mean. As n increases, the sample data are weighted more heavily and the prior information less
heavily. In general, with Bayesian estimation, as the sample size increases, the likelihood dominates the
prior.
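
The weighted-average form of the posterior mean can be checked numerically; the Beta(2, 2) prior and the data (7 successes in 10 trials) below are assumed purely for illustration:

    from scipy.stats import beta

    a, b, y, n = 2, 2, 7, 10                    # assumed prior hyperparameters and data
    posterior = beta(a + y, b + n - y)

    weight_data = n / (n + a + b)
    bayes_mean = weight_data * (y / n) + (1 - weight_data) * (a / (a + b))
    print(bayes_mean, posterior.mean())         # both equal (a + y)/(a + b + n) = 0.6428...
    print(posterior.interval(0.95))             # equal-tail 95% credible interval for theta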

The Gamma/Poisson Bayesian Model


If our data X1, X2, ..., Xn are iid Poisson(λ), then a gamma(α, β) prior on λ is a conjugate prior.
Likelihood: L(λ/x) = ∏_{i=1}^n e^{-λ} λ^{xi} / xi! ∝ λ^{Σ xi} e^{-nλ},   and prior p(λ) = β^α λ^{α-1} e^{-βλ} / Γ(α), λ > 0. Thus the posterior is
π(λ/x) ∝ λ^{α + Σ xi - 1} e^{-(n + β)λ},  λ > 0, which is a gamma(α + Σ xi, n + β) (conjugate!).
The posterior mean is
E(λ/x) = (α + Σ xi)/(n + β) = (Σ xi / n) · n/(n + β) + (α/β) · β/(n + β)
Again, the data get weighted more heavily as n → ∞.
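
A brief sketch of this Gamma/Poisson update (the counts and the prior shape and rate below are assumed for illustration; note that scipy's gamma is parameterised by shape and scale, so the rate is inverted):

    import numpy as np
    from scipy.stats import gamma

    alpha, beta_rate = 2.0, 1.0                  # assumed Gamma(shape, rate) prior
    x = np.array([3, 5, 4, 6, 2])                # hypothetical Poisson counts
    shape_post = alpha + x.sum()
    rate_post = beta_rate + len(x)

    posterior = gamma(a=shape_post, scale=1.0 / rate_post)
    print(posterior.mean())                      # (alpha + sum x)/(beta + n)
    print(posterior.interval(0.90))              # 90% equal-tail credible interval for lambda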

Loss and Risk Functions and Minimax Theory


Suppose we want to estimate a parameter θ using data X = (X1, X2, ..., Xn). What is the best possible
estimator θ̂ = θ̂(X1, X2, ..., Xn) for θ? Minimax theory provides a framework for answering this question.

Loss Function
Let θ̂ = θ̂(X) be an estimator for the parameter θ. We start with a loss function L(θ, θ̂) that measures how
good the estimator is. Effectively, L(θ, θ̂) is used to quantify the consequence that would be incurred for
each possible decision for various possible values of θ.


Examples of loss functions include:
i) Squared error loss: L(θ, θ̂) = (θ - θ̂)²
ii) Absolute error loss: L(θ, θ̂) = |θ - θ̂|
iii) Lp loss: L(θ, θ̂) = |θ - θ̂|^p
iv) Zero-one loss: L(θ, θ̂) = 0 if θ = θ̂, and 1 if θ ≠ θ̂
In general, we use a non-negative loss, L(θ, θ̂) ≥ 0.

Risk Function
Intuitively, we prefer decision rules with small "expected loss" resulting from the use of θ̂(x) repeatedly
with varying x. This leads to the risk function of a decision rule.
The risk function of an estimator θ̂ is
R(θ, θ̂) = E[L(θ, θ̂)] = Σ_{x ∈ X} L(θ, θ̂) f(x/θ)    if x is discrete
R(θ, θ̂) = E[L(θ, θ̂)] = ∫_X L(θ, θ̂) f(x/θ) dx      if x is continuous
where X is the sample space (the set of possible outcomes of x).

When the loss function is squared error, the risk is just the MSE (mean squared error):

R(θ, θ̂) = E[(θ̂ - θ)²] = var(θ̂) + bias²(θ̂)
Note If the loss function is unspecified, assume the squared error loss function.

Bias-Variance Decomposition of MSE


Consider the squared error loss function. The risk is then known as the mean squared error (MSE), MSE = E[(θ̂ - θ)²].
We show that the MSE has the following decomposition:
MSE = E[(θ̂(x) - θ)²] = E[(θ̂(x) - E[θ̂(x)] + E[θ̂(x)] - θ)²]
    = E[(θ̂(x) - E[θ̂(x)])²] + (E[θ̂(x)] - θ)² = Var[θ̂(x)] + Bias²[θ̂(x)]
This is known as the bias-variance trade-off.
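
A quick Monte Carlo check of this decomposition (the shrinkage estimator 0.8·x̄ and the values θ = 1, σ = 1, n = 10 are assumed purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    theta, sigma, n, reps = 1.0, 1.0, 10, 20_000

    # Compare the sample mean with a hypothetical shrinkage rule 0.8 * xbar.
    samples = rng.normal(theta, sigma, size=(reps, n))
    xbar = samples.mean(axis=1)
    shrunk = 0.8 * xbar

    for name, est in [("sample mean", xbar), ("0.8 * sample mean", shrunk)]:
        mse = np.mean((est - theta) ** 2)
        decomposition = est.var() + (est.mean() - theta) ** 2   # variance + squared bias
        print(name, round(mse, 4), round(decomposition, 4))     # the two columns agree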

Risk Comparison
How do we compare two estimators?
Given θ̂1(x) and θ̂2(x), if R(θ, θ̂1) ≤ R(θ, θ̂2) for all θ, then θ̂1(x) is the preferred estimator.
Ideally, we would like to use the decision rule θ̂(x) which minimizes the risk R(θ, θ̂) for all values of θ.
However, this problem has no solution, as it is possible to reduce the risk at a specific θ0 to zero by making
θ̂(x) equal to θ0 for all X.

Minimax rules
A rule θ̂ is a minimax rule if max_θ R(θ, θ̂) ≤ max_θ R(θ, θ̂') for any other rule θ̂'. It minimizes the
maximum risk. Sometimes this does not produce a sensible choice of decision rule.

Since minimax minimizes the maximum risk (i.e., the loss averaged over all possible data X), the choice of
rule is not influenced by the actual data X = x (though given the rule θ̂, the action θ̂(x) is data-dependent).
It makes sense when the maximum-loss scenario must be avoided, but it can lead to poor performance on
average.


The minimax risk is R_n = inf_{θ̂} sup_θ R(θ, θ̂), where the infimum is over all estimators. An estimator θ̂ is a
minimax estimator if
sup_θ R(θ, θ̂) = inf_{θ̂} sup_θ R(θ, θ̂)

Bayes Rule and the Posterior Risks


Definition: Suppose we have a prior probability π(θ) for θ. Denote by
r(π, θ̂) = ∫ R(θ, θ̂) π(θ) dθ
the Bayes risk of rule θ̂. A Bayes rule is a rule that minimizes the Bayes risk. A Bayes rule is sometimes called a Bayes procedure.

Let π(θ/x) = L(θ/x)π(θ)/p(x) denote the posterior following from likelihood L(θ/x) and prior π(θ). The
expected posterior loss (posterior risk) is defined as ∫ L(θ, θ̂) π(θ/x) dθ.

Lemma A Bayes rule minimizes the expected posterior loss.


Proof
r(π, θ̂) = ∫ R(θ, θ̂) π(θ) dθ = ∫∫ L(θ, θ̂) f(x/θ) π(θ) dx dθ = ∫∫ L(θ, θ̂) π(θ/x) p(x) dx dθ
        = ∫ [ ∫ L(θ, θ̂) π(θ/x) dθ ] p(x) dx
That is, for each x we choose θ̂(x) to minimize the integral ∫ L(θ, θ̂) π(θ/x) dθ.

The form of the Bayes rule depends upon the loss function in the following way:
• Zero-one loss (more precisely, the loss that is 0 when |θ̂ - θ| ≤ b and 1 otherwise, in the limit b → 0) leads to the posterior mode.
• Absolute error loss leads to the posterior median.
• Quadratic loss leads to the posterior mean.
Note: these are not the only loss functions one could use in a given situation, and other loss functions will
lead to different Bayes rules.

Example: X ~ Bin(n, θ), and the prior π(θ) is a Beta(α, β) distribution. The prior distribution is unimodal if
α, β > 1, with mode (α - 1)/(α + β - 2), and E(θ) = α/(α + β). The posterior distribution of θ/x is
Beta(α + x, n - x + β).
With zero-one loss (b → 0) the Bayes estimator is θ̂ = (α + x - 1)/(α + β + n - 2).
For a quadratic loss function, the Bayes estimator is θ̂ = (α + x)/(α + β + n).
For an absolute error loss function, it is the median of the posterior.
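
As a numerical sketch of these three Bayes estimators (the specific numbers α = β = 2, n = 10, x = 7 are assumed for illustration, not taken from the notes):

    from scipy.stats import beta

    a, b, n, x = 2, 2, 10, 7                       # assumed prior hyperparameters and data
    post = beta(a + x, b + n - x)                  # posterior Beta(a + x, b + n - x)

    posterior_mean = (a + x) / (a + b + n)         # Bayes rule under quadratic loss
    posterior_mode = (a + x - 1) / (a + b + n - 2) # Bayes rule under zero-one loss (small-tolerance limit)
    posterior_median = post.median()               # Bayes rule under absolute error loss
    print(posterior_mean, posterior_mode, posterior_median)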

Example 1: Let X1, X2, ..., Xn ~ N(θ, 1). We will see that x̄ is minimax with respect to many different loss
functions. The risk is 1/n.

Example 2: Let X1, X2, ..., Xn be a sample from a density f. Let F be the class of smooth densities (defined
more precisely later). We will see (later in the course) that the minimax risk for estimating f is Cn^{-4/5}.


Prior Distributions
A prior distribution of a parameter is the probability distribution that represents your uncertainty about the
parameter before the current data are examined. Multiplying the prior distribution and the likelihood
function together leads to the posterior distribution of the parameter. You use the posterior distribution to
carry out all inferences. You cannot carry out any Bayesian inference or perform any modeling without
using a prior distribution.

Objective Priors versus Subjective Priors


Bayesian probability measures the degree of belief that you have in a random event. By this definition,
probability is highly subjective. It follows that all priors are subjective priors. Not everyone agrees with this
notion of subjectivity when it comes to specifying prior distributions. There has long been a desire to obtain
results that are objectively valid. Within the Bayesian paradigm, this can be somewhat achieved by using
prior distributions that are “objective” (that is, that have a minimal impact on the posterior distribution).
Such distributions are called objective or non-informative priors (see the next section). However, while
non-informative priors are very popular in some applications, they are not always easy to construct.

Non-informative Priors
Roughly speaking, a prior distribution is non-informative if the prior is “flat” relative to the likelihood
function. Thus, a prior 𝜋(𝜃) is noninformative if it has minimal impact on the posterior distribution of 𝜃.
Other names for the non-informative prior are vague, diffuse, and flat prior. Many statisticians favor non-
informative priors because they appear to be more objective. However, it is unrealistic to expect that non-
informative priors represent total ignorance about the parameter of interest. In some cases, non-informative
priors can lead to improper posteriors (non-integrable posterior density). You cannot make inferences with
improper posterior distributions. In addition, non-informative priors are often not invariant under
transformation; that is, a prior might be non-informative in one parameterization but not necessarily non-
informative if a transformation is applied.

Improper Priors
A prior 𝜋(𝜃) is said to be improper if ∫ 𝜋(𝜃)𝑑𝜃 = ∞
For example, a uniform prior distribution on the real line, 𝜋(𝜃) ∝ 1 for − ∞ < 𝜃 < ∞, is an improper
prior. Improper priors are often used in Bayesian inference since they usually yield non-informative priors
and proper posterior distributions. Improper prior distributions can lead to an improper posterior
distribution. To determine whether a posterior distribution is proper, you need to make sure that the
normalizing constant ∫ 𝐿(𝜃/𝑥)𝜋(𝜃)𝑑𝜃 is finite for all x. If an improper prior distribution leads to an
improper posterior distribution, inference based on the improper posterior distribution is invalid.

Informative Priors
An informative prior is a prior that is not dominated by the likelihood and that has an impact on the
posterior distribution. If a prior distribution dominates the likelihood, it is clearly an informative prior.
These types of distributions must be specified with care in actual practice. On the other hand, the proper use
of prior distributions illustrates the power of the Bayesian method: information gathered from the previous
study, past experience, or expert opinion can be combined with current information in a natural way.

Conjugate Priors
A prior is said to be a conjugate prior for a family of distributions if the prior and posterior distributions are
from the same family, which means that the form of the posterior has the same distributional form as the
prior distribution. For example, if the likelihood is binomial, y ~ Bin(n, θ), a conjugate prior on θ is the
beta distribution; it follows that the posterior distribution of θ is also a beta distribution. Other commonly
used conjugate prior/likelihood combinations include the normal/normal, gamma/Poisson, gamma/gamma,
and gamma/beta cases. The development of conjugate priors was partially driven by a desire for computational
convenience: conjugacy provides a practical way to obtain the posterior distributions in closed form. General-purpose
Bayesian sampling procedures, however, do not need to rely on conjugacy.

Jeffreys’ Prior
A very useful prior is Jeffreys' prior (Jeffreys 1961). It satisfies the local uniformity property: a prior that
does not change much over the region in which the likelihood is significant and does not assume large
values outside that range. It is based on the Fisher information matrix. Jeffreys' prior is defined as
π(θ) ∝ |I(θ)|^{1/2}
where | · | denotes the determinant and I(θ) is the Fisher information matrix based on the likelihood function:
I(θ) = -E[ ∂² log f(y/θ) / ∂θ² ]
Jeffreys’ prior is locally uniform and hence non-informative. It provides an automated scheme for finding a
non-informative prior for any parametric model P(𝑦/𝜃). Another appealing property of Jeffreys’ prior is
that it is invariant with respect to one-to-one transformations. The invariance property means that if you
have a locally uniform prior on θ and φ(θ) is a one-to-one function of θ, then p(φ(θ)) = π(θ) · |φ'(θ)|^{-1}
is a locally uniform prior for φ(θ). This invariance principle carries through to multidimensional
parameters as well. While Jeffreys' prior provides a general recipe for obtaining non-informative priors, it
has some shortcomings: the prior is improper for many models, it can lead to an improper posterior in
some cases, and it can be cumbersome to use in high dimensions.

Example: Consider the likelihood for n independent draws from a binomial, L(θ/x) = C θ^x (1 - θ)^{n-x}, where
the constant C does not involve θ. Taking logs gives
l(θ/x) = ln L(θ/x) = ln C + x ln θ + (n - x) ln(1 - θ)
Thus
∂l(θ/x)/∂θ = x/θ - (n - x)/(1 - θ)   and likewise   ∂²l(θ/x)/∂θ² = -x/θ² - (n - x)/(1 - θ)² = -[ x/θ² + (n - x)/(1 - θ)² ]
Since E(x) = nθ we have
I(θ/x) = -E[ ∂²l(θ/x)/∂θ² ] = nθ/θ² + n(1 - θ)/(1 - θ)² = n/θ + n/(1 - θ) = n θ^{-1}(1 - θ)^{-1}
Hence the Jeffreys' prior becomes p(θ) ∝ θ^{-1/2}(1 - θ)^{-1/2}, which is a Beta(1/2, 1/2) distribution (a Beta
distribution, which we discuss later).
When there are multiple parameters, I is the Fisher information matrix, the matrix of the expected second
partials,
I(θ/x)_{ij} = -E[ ∂² l(θ/x) / ∂θi ∂θj ]
In this case, the Jeffreys' prior becomes p(θ) ∝ |I(θ/x)|^{1/2}.

Bayesian Inference
Bayesian inference about θ is primarily based on the posterior distribution of θ. There are various ways in
which you can summarize this distribution. For example, you can report your findings through point
estimates. You can also use the posterior distribution to construct hypothesis tests or probability statements.


Point Estimation
Classical methods often report the maximum likelihood estimator (MLE) or the method of moments
estimator (MOME) of a parameter. In contrast, Bayesian approaches often use the posterior mean. The
definition of the posterior mean is given by
θ̂ = E[θ/x] = ∫_{-∞}^{∞} θ p(θ/x) dθ
We can also follow maximum likelihood and use the posterior mode, defined as
θ̂ = posterior mode = arg max_θ p(θ/x)
Another candidate is the median of the posterior distribution, where the estimator θ̂ satisfies
p(θ > θ̂/x) = p(θ < θ̂/x) = 0.5, hence
∫_{θ̂}^{∞} p(θ/x) dθ = ∫_{-∞}^{θ̂} p(θ/x) dθ = 0.5
However, using any of the above estimators, or even all three simultaneously, loses the full power of a
Bayesian analysis, as the full estimator is the entire posterior density itself. If we cannot obtain the full
form of the posterior distribution, it may still be possible to obtain one of the three above estimators.
However, we can generally obtain the posterior by simulation using Gibbs sampling, and hence the Bayes
estimate of a parameter is frequently presented as a frequency histogram from (Gibbs) samples of the
posterior distribution.

Bayesian Interval Estimation


The Bayesian interval estimates are called credible sets, which are also known as credible intervals. This is
analogous to the concept of confidence intervals used in classical statistics. Given a posterior distribution
π(θ/x), a 100(1 - α)% credible set C is a subset of Θ such that ∫_C π(θ/x) dθ = 1 - α. If the parameter
space Θ is discrete, a sum replaces the integral.
You can construct credible sets that have equal tails.
Quantile-Based Intervals
If θ_L* is the α/2 posterior quantile for θ, and θ_U* is the 1 - α/2 posterior quantile for θ, then (θ_L*, θ_U*) is a
100(1 - α)% equal-tail credible interval for θ.
Note p(θ < θ_L* / x) = α/2 and p(θ > θ_U* / x) = α/2
⇒ p(θ ∈ (θ_L*, θ_U*) / x) = 1 - [ p(θ < θ_L* / x) + p(θ > θ_U* / x) ] = 1 - α
i.e. p(θ ∈ (θ_L*, θ_U*) / x) = ∫_{θ_L*}^{θ_U*} π(θ/x) dθ = 1 - α

Example 1: Quantile-Based Interval


Suppose X1, ..., Xn are the durations of cabinets for a sample of cabinets from Western European
countries. We assume the Xi's follow an exponential distribution
p(x/λ) = λ e^{-λx}, x > 0   ⇒   L(λ/x) = λ^n e^{-λ Σ_{i=1}^n xi}
Suppose our prior distribution for λ is p(λ) ∝ 1/λ, λ > 0 (larger values of λ are less likely a priori).
Then π(λ/x) ∝ p(λ) L(λ/x) = (1/λ) λ^n e^{-λ Σ xi} = λ^{n-1} e^{-λ Σ xi}
This is the kernel of a gamma distribution with "shape" parameter n and "rate" parameter Σ_{i=1}^n xi.
So including the normalizing constant,
π(λ/x) = (Σ_{i=1}^n xi)^n λ^{n-1} e^{-λ Σ xi} / Γ(n),   λ > 0
Now, given the observed data x1, ..., xn, we can calculate any quantiles of this gamma distribution. The
0.05 and 0.95 quantiles will give us a 90% credible interval for λ.

Suppose we feel π(λ) ∝ 1/λ, λ > 0 is too subjective and favors small values of λ too much. Instead, let's
consider the non-informative prior π(λ) ∝ 1, λ > 0 (which favors all values of λ equally).
Then our posterior is π(λ/x) ∝ p(λ) L(λ/x) = 1 · λ^n e^{-λ Σ xi} = λ^{(n+1)-1} e^{-λ Σ xi}. This posterior is a gamma with
parameters (n + 1) and Σ_{i=1}^n xi.
We can similarly find the equal-tail credible interval.

Example 2: Quantile-Based Interval


Consider 10 flips of a coin having P{Heads} = θ. Suppose we observe 2 "heads".
We model the count of heads as binomial:
p(x/θ) = 10Cx θ^x (1 - θ)^{10-x},   x = 0, 1, 2, ..., 10
Let's use a uniform prior for θ, i.e. p(θ) = 1, 0 ≤ θ ≤ 1.
Then the posterior is π(θ/x) ∝ p(θ) L(θ/x) = 1 · 10Cx θ^x (1 - θ)^{10-x} ∝ θ^x (1 - θ)^{10-x},  0 ≤ θ ≤ 1
This is a beta distribution for θ with parameters x + 1 and 10 - x + 1. Since x = 2 here, π(θ/X = 2) is
Beta(3, 9). The 0.025 and 0.975 quantiles of a Beta(3, 9) are (0.0602, 0.5178), which gives a 95% credible interval
for θ.
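
The quoted quantiles can be reproduced with scipy (a small check of the Beta(3, 9) posterior derived above):

    from scipy.stats import beta

    posterior = beta(3, 9)                    # posterior for theta after 2 heads in 10 flips, uniform prior
    print(posterior.ppf([0.025, 0.975]))      # roughly (0.0602, 0.5178), the 95% equal-tail interval
    print(posterior.interval(0.95))           # the same interval via the convenience method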

HPD Intervals / Regions


The equal-tail credible interval approach is ideal when the posterior distribution is symmetric.
Definition: A 100(1 - α)% HPD region for θ is a subset C ⊆ Θ defined by C = {θ : π(θ/x) ≥ k}, where k is
the largest number such that ∫_C π(θ/x) dθ = 1 - α.
The value k can be thought of as a horizontal line placed over the posterior density whose intersection(s)
with the posterior define regions with probability 1 - α.
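
As a rough illustration of this definition (not part of the original notes), the following grid-based sketch finds an approximate 95% HPD interval for the Beta(3, 9) posterior of the previous example, by keeping the highest-density grid points until the retained region reaches the required probability:

    import numpy as np
    from scipy.stats import beta

    theta = np.linspace(0, 1, 100_001)
    density = beta.pdf(theta, 3, 9)
    order = np.argsort(density)[::-1]                  # grid points, highest posterior density first
    mass = np.cumsum(density[order]) * (theta[1] - theta[0])
    in_hpd = order[: np.searchsorted(mass, 0.95) + 1]  # smallest set of points reaching 95% mass
    print(theta[in_hpd].min(), theta[in_hpd].max())    # approximate 95% HPD interval (shorter than equal-tail)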

Remarks
• The HPD region will be an interval when the posterior is unimodal.
• If the posterior is multimodal, the HPD region might be a discontiguous set.

It is critical to note that there is a profound difference between a confidence interval (CI) from classical
(frequentist) statistics and a Bayesian interval. The interpretation of a classical confidence interval is that if
we repeat the experiment a large number of times and construct CIs in the same fashion, then (1 - α) of
the time the confidence interval will enclose the (unknown) parameter. With a Bayesian HDR, there is a
(1 - α) probability that the interval contains the true value of the unknown parameter. Often the CI and
Bayesian intervals have essentially the same value, but again the interpretational difference remains. The
key point is that the Bayesian prior allows us to make direct probability statements about θ, while under
classical statistics we can only make statements about the behavior of the statistic if we repeat an
experiment a large number of times. Given the important conceptual difference between classical and
Bayesian intervals, Bayesians often avoid using the term confidence interval.

Bayesian Hypothesis Testing


Before we go into the details of Bayesian hypothesis testing, let us briefly review frequentist hypothesis
testing. Recall that in the Neyman-Pearson paradigm characteristic of frequentist hypothesis testing, there is
an asymmetric relationship between two hypotheses: the null hypothesis H0 and the alternative hypothesis
H1. A decision procedure is devised by which, on the basis of a set of collected data, the null hypothesis
will either be rejected in favour of H1, or accepted.

In Bayesian hypothesis testing, there can be more than two hypotheses under consideration, and they do not
necessarily stand in an asymmetric relationship. Rather, Bayesian hypothesis testing works just like any
other type of Bayesian inference. Let us consider the case where we are comparing only two hypotheses:
Then the Bayesian hypothesis testing can be done as follows.

Suppose you have the following null and alternative hypotheses: H0: θ ∈ Θ0 and H1: θ ∈ Θ'0,
where Θ0 is a subset of the parameter space and Θ'0 is its complement. Using the posterior distribution
π(θ/x), you can compute the posterior probabilities P(θ ∈ Θ0/x) and P(θ ∈ Θ'0/x), the probabilities that
H0 and H1 are true, respectively. One way to perform a Bayesian hypothesis test is to accept the null
hypothesis if P(θ ∈ Θ0/x) > P(θ ∈ Θ'0/x), and vice versa, or to accept the null hypothesis if P(θ ∈ Θ0/x)
is greater than a predefined threshold, such as 0.75, to guard against falsely accepting the null hypothesis.

It is more difficult to carry out a point null hypothesis test in a Bayesian analysis. A point null hypothesis is
a test of H0: θ = θ0 versus H1: θ ≠ θ0. If the prior distribution π(θ) is a continuous density, then the
posterior probability of the null hypothesis being true is 0, and there is no point in carrying out the test. One
alternative is to restate the null as a small interval hypothesis: θ ∈ Θ0 = (θ0 - a, θ0 + a), where a is a
very small constant. The Bayesian paradigm can deal with an interval hypothesis more easily. Another
approach is to give a mixture prior distribution to θ with a positive probability of p0 on θ0 and the
density (1 - p0)π(θ) on θ ≠ θ0. This prior ensures a nonzero posterior probability on θ0, and you can
then make realistic probabilistic comparisons.

Bayes Factors and Hypothesis Testing


In the classical hypothesis testing framework, we have two alternatives. The null hypothesis H0 that the
unknown parameter 𝜃 belongs to some set or interval Θ0 (𝜃 ∈ Θ0 ), versus the alternative hypothesis H1 that
θ belongs to the alternative set Θ1 (θ ∈ Θ1), where Θ0 and Θ1 are disjoint and Θ0 ∪ Θ1 = Θ.

In the classical statistical framework of the frequentists, one uses the observed data to test the significance of
a particular hypothesis and (if possible) compute a p-value (the probability p of observing the given value
of the test statistic if the null hypothesis is indeed correct). Hence, at first blush one would think that the
idea of a hypothesis test is trivial in a Bayesian framework, as the posterior distribution directly gives probabilities such as
P(θ > θ0 / x) = ∫_{θ0}^{∞} π(θ/x) dθ   and   P(θ0 < θ < θ1 / x) = ∫_{θ0}^{θ1} π(θ/x) dθ
The kicker with a Bayesian analysis is that we also have prior information, and Bayesian hypothesis testing
addresses whether, given the data, we are more or less inclined towards the hypothesis than we initially
were. For example, suppose the prior distribution of θ is such that P(θ > θ0) = 0.1, while for the
posterior distribution P(θ > θ0) = 0.05. The latter is significant at the 5 percent level in a classical
hypothesis testing framework, but the data only roughly double our confidence in the alternative hypothesis
relative to our belief based on prior information. If P(θ > θ0) = 0.5 for the prior, then a 5% posterior
probability would greatly increase our confidence in the alternative hypothesis. Hence, the prior
probabilities certainly influence hypothesis testing.
To formalize this idea, let 𝑝0 = 𝑃𝑟(𝜃 ∈ Θ0 /𝑥) and 𝑝1 = 𝑃𝑟(𝜃 ∈ Θ1 /𝑥) denote the probability, given the
observed data x, that 𝜃 is in the null (𝑝0 ) and alternative (𝑝1) hypothesis sets. Note that these are posterior

probabilities. Since Θ0 ∩ Θ1 = ∅ and Θ0 ∪ Θ1 = Θ, it follows that p0 + p1 = 1. Likewise, for the prior
probabilities we have π0 = Pr(θ ∈ Θ0) and π1 = Pr(θ ∈ Θ1).
Thus the prior odds of H0 versus H1 are π0/π1, while the posterior odds are p0/p1.
The Bayes factor B0 in favor of H0 versus H1 is given by the ratio of the posterior odds divided by the prior
odds,
B0 = (p0/p1)/(π0/π1) = (p0 π1)/(π0 p1)
The Bayes factor is loosely interpreted as the odds in favor of H0 versus H1 that are given by the data. Since
π1 = 1 - π0 and p1 = 1 - p0, we can also express this as
B0 = p0(1 - π0) / (π0(1 - p0))
Likewise, by symmetry, the Bayes factor B1 in favor of H1 versus H0 is just
B1 = 1/B0 = π0(1 - p0) / (p0(1 - π0))
Consider the first case, where the prior and posterior probabilities for the null were
0.1 and 0.05 (respectively). The Bayes factor in favor of H1 versus H0 is given by
B1 = π0(1 - p0) / (p0(1 - π0)) = (0.1 × 0.95)/(0.05 × 0.9) = 2.11
Similarly, for the second example, where the prior for the null was 0.5,
B1 = π0(1 - p0) / (p0(1 - π0)) = (0.5 × 0.95)/(0.05 × 0.5) = 19
When the hypotheses are simple, say Θ0 = {θ0} and Θ1 = {θ1}, then for i = 0, 1,
pi ∝ Pr(θi) Pr(x/θi) = πi Pr(x/θi)
Thus p0/p1 = [π0 P(x/θ0)] / [π1 P(x/θ1)], and the Bayes factor (in favor of the null) reduces to B0 = P(x/θ0)/P(x/θ1), which is simply a
likelihood ratio.
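
A two-line helper makes the two worked examples above easy to reproduce (the function simply evaluates the B1 formula; it is an illustration, not a library routine):

    def bayes_factor_alt(p0, pi0):
        """Bayes factor B1 in favour of H1 over H0, from the posterior (p0) and prior (pi0) null probabilities."""
        return (pi0 * (1 - p0)) / (p0 * (1 - pi0))

    print(bayes_factor_alt(p0=0.05, pi0=0.1))   # about 2.11: modest evidence despite the "5%" posterior
    print(bayes_factor_alt(p0=0.05, pi0=0.5))   # 19.0: strong evidence when the prior was even odds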

Normal Models
Why is it so common to model data using a normal distribution?
• Approximately normally distributed quantities appear often in nature.
• The CLT tells us any variable that is basically a sum of independent components should be approximately
normal.
• Note x̄ and S² are independent when sampling from a normal population, so if beliefs about the
mean are independent of beliefs about the variance, a normal model may be appropriate.
• The normal model is analytically convenient (exponential family, sufficient statistics x̄ and S²).
• Inference about the population mean based on a normal model will be correct as n → ∞ even if the
data are truly non-normal.
• When we assume a normal likelihood, we can get a wide class of posterior distributions by using
different priors.

A Conjugate analysis with Normal Data (variance known)


Simple situation: Assume data X1, ..., Xn are iid N(μ, σ²), with μ unknown and σ² known.
We will make inference about μ. The likelihood is

L(μ/x) = ∏_{i=1}^n (2πσ²)^{-1/2} exp( -(xi - μ)²/(2σ²) ) = (2πσ²)^{-n/2} exp( -(1/(2σ²)) Σ_{i=1}^n (xi - μ)² )

A conjugate prior for μ is μ ~ N(μ0, τ²), i.e. p(μ) = (1/(√(2π) τ)) exp( -(1/2)((μ - μ0)/τ)² ). So the posterior is:

π(μ/x) ∝ L(μ/x) p(μ) ∝ exp( -(1/(2σ²)) Σ_{i=1}^n (xi - μ)² ) × exp( -(1/(2τ²))(μ - μ0)² )
       = exp( -(1/2)[ (1/σ²) Σ_{i=1}^n (xi² - 2xiμ + μ²) + (1/τ²)(μ² - 2μμ0 + μ0²) ] )

Collecting the terms that involve μ,

π(μ/x) ∝ exp( -(1/(2σ²τ²))[ μ²(nτ² + σ²) - 2μ(σ²μ0 + τ² n x̄) ] + k )   (where k collects terms not involving μ)
       ∝ exp( -(1/2)(n/σ² + 1/τ²)[ μ - (n x̄/σ² + μ0/τ²)/(n/σ² + 1/τ²) ]² )
       = exp( -(1/2)(n/σ² + 1/τ²)[ μ - (n x̄ τ² + μ0 σ²)/(n τ² + σ²) ]² )

Hence the posterior for μ is a normal distribution with

Mean = (n x̄ τ² + μ0 σ²)/(n τ² + σ²)   and   variance = (n/σ² + 1/τ²)^{-1} = σ²τ²/(n τ² + σ²)

The precision is the reciprocal of the variance. Here, 1/τ² is the prior precision, n/σ² is the data precision,
and n/σ² + 1/τ² is the posterior precision.

Note the posterior mean E(μ/x) is simply
E(μ/x) = (n x̄ τ² + μ0 σ²)/(n τ² + σ²) = (n τ²/(n τ² + σ²)) x̄ + (σ²/(n τ² + σ²)) μ0
a combination of the prior mean and the sample mean. If the prior is highly precise, the weight is large on μ0, and if the
data are highly precise (e.g., when n is large), the weight is large on x̄.
Clearly as n → ∞, E(μ/x) → x̄, and Var(μ/x) → σ²/n if we choose a large prior variance τ².
⇒ for τ² large and n large, Bayesian and frequentist inference about μ will be nearly identical

A Conjugate analysis with Normal Data (mean known)


Now suppose X1, ..., Xn are iid N(μ, σ²) with μ known and σ² unknown. We will make inference about
σ². Our likelihood is

L(σ²/x) = (2πσ²)^{-n/2} exp( -(1/(2σ²)) Σ_{i=1}^n (xi - μ)² ) ∝ (σ²)^{-n/2} exp( -(n/(2σ²)) w )

where w = (1/n) Σ_{i=1}^n (xi - μ)² denotes the sufficient statistic.
The conjugate prior for σ² is the inverse gamma distribution. If a r.v. Y ~ gamma, then 1/Y ~ inverse
gamma (IG).

The prior for σ² is π(σ²) = (β^α/Γ(α)) (σ²)^{-(α+1)} e^{-β/σ²} for σ² > 0, where α > 0, β > 0.
Note the prior mean and variance are
E(σ²) = β/(α - 1) provided that α > 1, and Var(σ²) = β²/((α - 1)²(α - 2)) provided that α > 2.
So the posterior for σ² is:

π(σ²/x) ∝ L(σ²/x) π(σ²) ∝ (σ²)^{-n/2} exp( -nw/(2σ²) ) (σ²)^{-(α+1)} e^{-β/σ²} = (σ²)^{-(α + n/2 + 1)} exp( -(β + nw/2)/σ² )

Hence the posterior is clearly an IG(α + n/2, β + (n/2)w) distribution, where w = (1/n) Σ_{i=1}^n (xi - μ)².
How to choose the prior parameters α and β? Note
α = 2 + E(σ²)²/Var(σ²)   and   β = E(σ²)(α - 1)
so we could make guesses about E(σ²) and Var(σ²) and use these to determine α and β.

A Model for Normal Data (mean and variance both unknown)


When X1, ..., Xn are iid N(μ, σ²) with both μ and σ² unknown, the conjugate prior for the mean explicitly
depends on the variance:
p(σ²) ∝ (σ²)^{-(α+1)} e^{-β/σ²}   and   p(μ/σ²) ∝ (σ²)^{-1/2} exp( -(μ - μ0)²/(2σ²/s0) )
The prior parameter s0 measures the analyst's confidence in the prior specification.
When s0 is large, we strongly believe in our prior.
The joint posterior for (μ, σ²) is:
π(μ, σ²/x) ∝ L(μ, σ²/x) p(σ²) p(μ/σ²)
           ∝ (σ²)^{-(α + n/2 + 3/2)} exp( -(1/(2σ²)) Σ_{i=1}^n (xi - μ)² - β/σ² - (μ - μ0)²/(2σ²/s0) )
Expanding Σ(xi - μ)² = Σ xi² - 2n x̄ μ + nμ² and collecting terms,
π(μ, σ²/x) ∝ (σ²)^{-(α + n/2 + 3/2)} exp( -(1/(2σ²))[ Σ xi² - n x̄² + 2β ] ) × exp( -(1/(2σ²))[ (n + s0)μ² - 2(n x̄ + s0 μ0)μ + (n x̄² + s0 μ0²) ] )

Note the second part is simply a normal kernel for μ.

To get the posterior for σ², we integrate out μ:

π(σ²/x) = ∫ π(μ, σ²/x) dμ ∝ (σ²)^{-(α + n/2 + 1)} exp( -(1/(2σ²))[ 2β + Σ(xi - x̄)² + (n s0/(n + s0))(x̄ - μ0)² ] )

since the second piece (which depends on μ) integrates to a normalizing constant proportional to (σ²)^{1/2},
leaving behind the part of its exponent that does not involve μ.
Hence the posterior for σ² is inverse gamma:
σ²/x ~ IG( α + n/2 , β + (1/2) Σ_{i=1}^n (xi - x̄)² + (1/2)(n s0/(n + s0))(x̄ - μ0)² )
Note that π(μ/σ², x) = π(μ, σ²/x)/π(σ²/x), and after lots of cancellation,
π(μ/σ², x) ∝ exp( -(1/(2σ²))[ (n + s0)μ² - 2(n x̄ + s0 μ0)μ ] ) ∝ exp( -((n + s0)/(2σ²))[ μ - (n x̄ + s0 μ0)/(n + s0) ]² )
Clearly π(μ/σ², x) is normal:
μ/σ², x ~ N( (n x̄ + s0 μ0)/(n + s0) , σ²/(n + s0) ).   Note that as s0 → 0, μ/σ², x ~ N( x̄ , σ²/n ).
Note also that the conditional posterior mean is x̄ · n/(n + s0) + μ0 · s0/(n + s0).
The relative sizes of n and s0 determine the weighting of the sample mean x̄ and the prior mean μ0.
The marginal posterior for μ is obtained by integrating out σ². Completing the square in μ, the joint posterior can be written as
π(μ, σ²/x) ∝ (σ²)^{-(α + n/2 + 3/2)} exp( -A/(2σ²) ),   A = 2β̃ + (n + s0)(μ - μ̃)²,
where μ̃ = (n x̄ + s0 μ0)/(n + s0) and 2β̃ = 2β + Σ(xi - x̄)² + (n s0/(n + s0))(x̄ - μ0)².
Letting z = A/(2σ²), so that σ² = A/(2z) and dσ² = -(A/(2z²)) dz, we get
π(μ/x) = ∫_0^∞ π(μ, σ²/x) dσ² ∝ ∫_0^∞ (A/(2z))^{-(α + n/2 + 3/2)} e^{-z} (A/(2z²)) dz ∝ A^{-(α + n/2 + 1/2)} ∫_0^∞ z^{(α + n/2 + 1/2) - 1} e^{-z} dz
This last integrand is the kernel of a gamma density and thus the integral is a constant. So
π(μ/x) ∝ A^{-(α + n/2 + 1/2)} = [ 2β̃ + (n + s0)(μ - μ̃)² ]^{-(2α + n + 1)/2}
which is a (scaled and shifted) t kernel centred at μ̃ with 2α + n degrees of freedom.


Appendix
Bayesian inference differs from classical inference (such as the MLE, least squares and the method of
moments) in that, whereas the classical approach treats the parameter θ as a fixed quantity and draws on a
repeated sampling principle, the Bayesian approach regards θ as the realized value of a random variable Θ
and specifies a probability distribution for the parameter(s) of interest. This makes life easier because it is
clear that if we observe data X = x, then we need to compute the conditional density of Θ given X = x ("the
posterior").

Why use Bayesian methods? Some reasons:


i) We wish to specifically incorporate previous knowledge we have about a parameter of interest.
ii) To logically update our knowledge about the parameter after observing sample data
iii) To make formal probability statements about the parameter of interest.
iv) To specify model assumptions and check model quality and sensitivity to these assumptions in a
straightforward way.
Why do people use classical methods?
• If the parameter(s) of interest is/are truly fixed (without the possibility of changing), as is possible in a
highly controlled experiment
• If there is no prior information available about the parameter(s)
• If they prefer "cookbook"-type formulas with little input from the scientist/researcher

Many of the reasons classical methods are more common than Bayesian methods are historical:
• Many methods were developed in the context of controlled experiments.
• Bayesian methods require a bit more mathematical formalism.
• Historically (but not now), realistic Bayesian analyses had been infeasible due to a lack of computing
power.

Motivation for Bayesian Modeling


1) Bayesians treat unobserved data and unknown parameters in similar ways.
2) They describe each with a probability distribution.
3) As their model, Bayesians specify:
(i) A joint density function, which describes the form of the distribution of the full sample of data
(given the parameter values)
(ii) A prior distribution, which describes the behavior of the parameter(s) unconditional on the data
4) The prior could reflect:
(i) Uncertainty about a parameter that is actually fixed
(ii) the variety of values that a truly stochastic parameter could take.
