Bayesian-Inference Class Notes
As wealth increases, the "utility" that one receives from each additional increase in wealth grows less than proportionally. In the St. Petersburg Paradox, prizes go up at the same rate that the probabilities decline. In order to obtain a finite valuation, the trick is to allow the "value" or "utility" of the prizes to increase more slowly than the rate at which the probabilities decline.
E(payoff) = ∑ⱼ₌₁^∞ (1/2)ʲ · 2ʲ = ∑ⱼ₌₁^∞ 1 = ∞
The expected monetary payoff is infinite. However much you pay to play the game, you may expect to
win more. Would you risk everything that you possess to play this game? One would suppose that real-
world people would not be willing to risk an infinite amount to play this game.
As opposed to the point estimators (means, variances) used by classical statistics, Bayesian statistics is
concerned with generating the posterior distribution of the unknown parameters given both the data and
some prior density for these parameters. As such, Bayesian statistics provides a much more complete
picture of the uncertainty in the estimation of the unknown parameters, especially after the confounding
effects of nuisance parameters are removed.
Suppose you are interested in estimating 𝜃 from data 𝒚 = {𝑦₁, 𝑦₂, …, 𝑦ₙ} by using a statistical model
described by a density 𝑝(𝑦/𝜃). Bayesian philosophy states that 𝜃 cannot be determined exactly, and
uncertainty about the parameter is expressed through probability statements and distributions. You can say
that 𝜃 follows a normal distribution with mean 0 and variance 1, if it is believed that this distribution best
describes the uncertainty associated with the parameter. The following steps describe the essential elements
of Bayesian inference:
i) A probability distribution for 𝜃 is formulated as 𝜋(𝜃), which is known as the prior distribution, or just
the prior. The prior distribution expresses your beliefs about the parameter before observing the data.
ii) Given the observed data y, you choose a statistical model 𝑝(𝑦/𝜃) to describe the distribution of y given 𝜃.
iii) You update your beliefs about 𝜃 by combining information from the prior distribution and the data through the calculation of the posterior distribution, 𝜋(𝜃/𝒚).
The third step is carried out by using Bayes’ theorem, which enables you to combine the prior distribution
and the model.
In 1939, Jeffreys' book 'Theory of Probability' started a resurgence of interest in Bayesian inference. This continued throughout the 1950s and 1960s, especially as problems with the frequentist approach started to emerge. The development of simulation-based inference has transformed Bayesian statistics in the last 20–30 years, and it now plays a prominent part in modern statistics.
Note
(i) The calculation of the p-value involves a sum (or integral) over data that were not observed, and this can depend upon the form of the experiment.
(ii) A p-value is not P(H₀ / T(X) = t).
Bayes’ Theorem
Bayes' theorem shows the relation between two conditional probabilities that are the reverse of each other.
This theorem is named after Reverend Thomas Bayes (1701-1761), and is also referred to as Bayes' law or
Bayes' rule (Bayes and Price 1763). The foundation of Bayesian statistics is Bayes’ theorem. Suppose we
observe a random variable X and wish to make inferences about another random variable 𝜃, where 𝜃 is
drawn from some distribution 𝑝(𝜃). From the definition of conditional probability,
𝑝(𝜃/𝑥) = 𝑝(𝑥, 𝜃) / 𝑝(𝑥)
Again from the definition of conditional probability, we can express the joint probability by conditioning on
𝜃 to give 𝑝(𝑥 , 𝜃) = 𝑝(𝜃)𝑝(𝑥/𝜃). Putting these together gives Bayes’ theorem:
𝑝(𝜃/𝑥) = 𝑝(𝜃)𝑝(𝑥/𝜃) / 𝑝(𝑥)
With n possible outcomes (𝜃₁, 𝜃₂, …, 𝜃ₙ),
𝑝(𝜃ᵢ/𝑥) = 𝑝(𝜃ᵢ)𝑝(𝑥/𝜃ᵢ) / 𝑝(𝑥) = 𝑝(𝜃ᵢ)𝑝(𝑥/𝜃ᵢ) / ∑ᵢ₌₁ⁿ 𝑝(𝜃ᵢ)𝑝(𝑥/𝜃ᵢ)
𝑝(𝜃) is the prior distribution of the possible 𝜃 values, while 𝑝(𝜃/𝑥) is the posterior distribution of 𝜃 given the observed data x. The origin of Bayes' theorem has a fascinating history (Stigler 1983). It is named after the Rev. Thomas Bayes, a priest who never published a mathematical paper in his lifetime. The paper in which the theorem appears was posthumously read before the Royal Society by his friend Richard Price in 1764. Stigler suggests it was first discovered by Nicholas Saunderson, a blind mathematician who, at age 29, became Lucasian Professor of Mathematics at Cambridge (the position held earlier by Isaac Newton).
Example 1
At a certain assembly plant, three machines make 30%, 45%, and 25%, respectively, of the products. It is known from past experience that 2%, 3%, and 2% of the products made by each machine, respectively, are defective. Now, suppose that a finished product is randomly selected.
a) What is the probability that it is defective?
b) If a product were chosen randomly and found to be defective, what is the probability that it was made
by machine 3?
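A quick numerical check of both parts (not part of the original notes), using only the percentages given above and plain Python:

```python
# Law of total probability and Bayes' theorem for the assembly-plant example.
share = [0.30, 0.45, 0.25]        # proportion of output from machines 1, 2, 3
defect_rate = [0.02, 0.03, 0.02]  # P(defective / machine i)

# a) P(defective) by the law of total probability
p_defective = sum(s * d for s, d in zip(share, defect_rate))
print(p_defective)                       # 0.0245

# b) P(machine 3 / defective) by Bayes' theorem
p_m3_given_def = share[2] * defect_rate[2] / p_defective
print(round(p_m3_given_def, 4))          # ~0.2041
```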
Example 2 Suppose one in every 1000 families has a genetic disorder (sex-bias) in which they produce
only female offspring. For any particular family we can define the (indicator) random variable
𝜃 = {0 if normal family, 1 if sex-bias family}
Suppose we observe a family with 5 girls and no boys. What is the probability that this family is a sex-bias family?
Solution
From prior information, there is a 1/1000 chance that any randomly-chosen family is a sex-bias family, so
𝑝(𝜃 = 1) = 0.001
Likewise 𝑥 = five girls, and 𝑃(five girls / sex-bias family) = 1. This is 𝑝(𝑥/𝜃 = 1). It remains to compute the probability that a random family from the population with five children has all girls. Conditioning over all types of families (normal + sex-bias),
𝑃𝑟(5 girls) = 𝑃𝑟(5 girls / normal) × 𝑃𝑟(normal) + 𝑃𝑟(5 girls / sex-bias) × 𝑃𝑟(sex-bias),
giving
𝑃𝑟(𝑥) = 0.5⁵ × 0.999 + 1 × 0.001 = 0.0322
Hence,
𝑝(𝜃 = 1/𝑥 = 5 girls) = 𝑝(𝜃 = 1) × 𝑝(𝑥 = 5 girls/𝜃 = 1) / 𝑝(𝑥 = 5 girls) = (0.001 × 1)/0.0322 = 0.03106
Thus, a family with five girls is 31 times more likely than a random family to have the sex-bias disorder.
Likelihood Function
The likelihood function L(θ/x) is a function of θ that shows how "likely" various parameter values are to have produced the data x that were observed. In classical statistics, the specific value of θ that maximizes L(θ/x) is the maximum likelihood estimator (MLE) of θ.
In many common probability models, when the sample size n is large, L(θ/x) is unimodal in θ.
The Likelihood Principle of Birnbaum states that (given the data) all of the evidence about θ is contained in the likelihood function. Often it is more convenient to use the log-likelihood
l(θ/x) = ln L(θ/x).
This definition almost seems to be defining the likelihood function to be the same as the pdf or pmf. The only distinction between these two functions is which variable is considered fixed and which is varying. When we consider the pdf or pmf f(x/θ), we are considering θ as fixed and x as the variable; when we consider the likelihood function L(θ/x), we are considering x to be the observed sample point and θ to be varying over all possible parameter values.
Remarks
• The likelihood function is not a probability density function.
• It is an important component of both frequentist and Bayesian analyses.
• It measures the support provided by the data for each possible value of the parameter. If we compare the likelihood function at two parameter points and find that L(θ₁/x) > L(θ₂/x), then the sample we actually observed is more likely to have occurred if θ = θ₁ than if θ = θ₂, which can be interpreted as saying that θ₁ is a more plausible value for the true value of θ than is θ₂. We carefully use the word "plausible" rather than "probable" because we often think of θ as a fixed value.
Example: Normal distribution. Assume that x₁, x₂, …, xₙ is a random sample from N(μ, σ²), where both μ and σ² are unknown parameters, μ ∈ ℝ and σ² > 0. With θ = (μ, σ²), the likelihood is
L(θ/x) = ∏ᵢ₌₁ⁿ (2πσ²)^{−1/2} exp(−(xᵢ − μ)²/(2σ²)) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) ∑ᵢ₌₁ⁿ (xᵢ − μ)²)
and the log-likelihood is
l(θ/x) = −(n/2) ln(2π) − n ln σ − (1/(2σ²)) ∑ᵢ₌₁ⁿ (xᵢ − μ)²
Example: Poisson distribution. Assume that x₁, …, xₙ is a random sample from Poisson(θ), with unknown θ > 0; then the likelihood is
L(θ/x) = ∏ᵢ₌₁ⁿ e^{−θ} θ^{xᵢ} / xᵢ! = e^{−nθ} θ^{∑xᵢ} / ∏ᵢ₌₁ⁿ (xᵢ!)
and the log-likelihood is
l(θ/x) = −nθ + (ln θ) ∑ᵢ₌₁ⁿ xᵢ − ∑ᵢ₌₁ⁿ ln(xᵢ!)
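As a small illustration (not from the original notes), the sketch below evaluates this Poisson log-likelihood on a grid for some hypothetical count data and confirms that it is maximized near the sample mean, which is the MLE; it assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy.special import gammaln

# Poisson log-likelihood: l(theta/x) = -n*theta + (sum x_i)*ln(theta) - sum(ln(x_i!))
x = np.array([2, 4, 3, 5, 1, 3])      # hypothetical count data
n, s = len(x), x.sum()

theta = np.linspace(0.5, 8.0, 1000)
loglik = -n * theta + s * np.log(theta) - gammaln(x + 1).sum()

print(theta[np.argmax(loglik)])        # grid maximizer, close to ...
print(x.mean())                        # ... the sample mean (the MLE), 3.0
```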
Example: M&M’s sold in the United States have 50% red candies compared to 30% in those sold in
Canada. In an experimental study, a sample of 5 candies was drawn from an unlabelled bag and 2 red
candies were observed. Is it more plausible that this bag was from the United States or from Canada?
Solution
The likelihood function is L(θ/x) = θ²(1 − θ)³, for θ = 0.3 or 0.5.
L(0.3/x) = 0.03087 < 0.03125 = L(0.5/x), suggesting that it is more plausible that the bag used in the experiment was from the United States.
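A two-line check of these numbers (a sketch, not part of the original notes); the binomial coefficient is common to both candidate values of θ, so it does not affect the comparison:

```python
from scipy.stats import binom

# Likelihood kernel theta^2 (1-theta)^3 and full binomial pmf for 2 reds in 5 draws.
for theta, country in [(0.5, "United States"), (0.3, "Canada")]:
    print(country, theta**2 * (1 - theta)**3, binom.pmf(2, 5, theta))
# The theta = 0.5 (US) value is larger under either version.
```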
Likelihood Principle
If x and y are two sample points such that L(θ/x) ∝ L(θ/y), then the conclusions drawn from x and y should be identical. Thus the likelihood principle implies that the likelihood function can be used to compare the plausibility of various parameter values. For example, if L(θ₂/x) = 2L(θ₁/x) and L(θ/x) ∝ L(θ/y), then L(θ₂/y) = 2L(θ₁/y). Therefore, whether we observed x or y we would come to the conclusion that θ₂ is twice as plausible as θ₁.
Example: Consider the distribution Multinomial(n = 6, θ, θ, 1 − 2θ). The following two samples drawn from this distribution have the same likelihood (up to a constant of proportionality):
X = (1, 3, 2):  (6!/(1!3!2!)) θ¹θ³(1 − 2θ)² ∝ θ⁴(1 − 2θ)²  and
X = (2, 2, 2):  (6!/(2!2!2!)) θ²θ²(1 − 2θ)² ∝ θ⁴(1 − 2θ)²
This means both samples would lead us to the same conclusion regarding the relative plausibility of different values of θ.
Conjugate Priors
In the Bayesian setting it is important to compute posterior distributions. This is not always an easy task.
The main difficulty is to compute the normalizing constant in the denominator of Bayes theorem. The
appropriate likelihood function (Binomial, Gaussian, Poisson, Bernoulli,...) is typically clear from the
data, but there is a great deal of flexibility when choosing the prior distribution. However, for certain
parametric families there are convenient choices of prior distributions. Particularly convenient is when
the posterior belongs to the same family of distributions as the prior. Such families are called conjugate
families.
Definition (Conjugate Priors): A prior π(θ) for a sampling model is called a conjugate prior if the resulting posterior π(θ/x) is in the same distributional family as the prior. For example, with a beta prior and a binomial likelihood, the posterior is also beta (with different parameter values!).
Remarks
The parameters of the prior distribution are called prior hyperparameters. We choose them to best represent our
beliefs about the distribution of θ. The parameters of the posterior distribution are called posterior
hyperparameters.
Any time a likelihood model is used together with its conjugate prior, we know the posterior is from
the same family of the prior, and moreover we have an explicit formula for the posterior
hyperparameters. A table summarizing some of the useful conjugate prior relationships follows. There are many more conjugate prior relationships that are not shown in the following table, but they can be found in reference books on Bayesian statistics.
Likelihood | Conjugate prior | Prior hyperparameters | Posterior hyperparameters
Bernoulli | Beta | α, β | α + x, β + 1 − x
Binomial | Beta | α, β | α + x, β + n − x
Poisson | Gamma | α, β | α + ∑xᵢ, β + n
Geometric | Beta | α, β | α + 1, β + x
Uniform[0, θ] | Pareto | xₛ, k | max{maxᵢ xᵢ, xₛ}, k + n
Exponential | Gamma | α, β | α + 1, β + x
Normal (unknown mean μ, known σ²) | Normal | δ, τ² | (n x̄ τ² + δσ²)/(nτ² + σ²), σ²τ²/(nτ² + σ²)
We will now discuss a few of these conjugate prior relationships to try to gain additional insight.
For example, consider a Poisson(θ) likelihood with a Gamma(α, β) prior. The posterior is
π(θ/x) ∝ θ^{α + ∑xᵢ − 1} e^{−(β + n)θ}, θ > 0, which is a Gamma(α + ∑xᵢ, β + n) (conjugate!).
The posterior mean is
E(θ/x) = (α + ∑xᵢ)/(β + n) = [β/(β + n)](α/β) + [n/(β + n)] x̄,
a weighted average of the prior mean α/β and the sample mean x̄. Again, the data get weighted more heavily as n → ∞.
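A minimal sketch of this conjugate update (not from the original notes), with hypothetical hyperparameters and data, verifying that the posterior mean equals the stated weighted average:

```python
import numpy as np

# Gamma(alpha, beta) prior + Poisson likelihood -> Gamma(alpha + sum x_i, beta + n) posterior.
alpha, beta = 2.0, 1.0                    # hypothetical prior hyperparameters
x = np.array([3, 5, 4, 6, 2, 4, 5, 3])    # hypothetical count data
n, s = len(x), x.sum()

post_mean = (alpha + s) / (beta + n)
weighted = (beta / (beta + n)) * (alpha / beta) + (n / (beta + n)) * x.mean()
print(post_mean, weighted)                # identical; as n grows, x-bar dominates
```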
Loss Function
Let θ̂ = θ̂(X) be an estimator for the parameter θ. We start with a loss function L(θ, θ̂) that measures how good the estimator is. Effectively, L(θ, θ̂) is used to quantify the consequence that would be incurred for each possible decision for various possible values of θ.
Risk Function
Intuitively, we prefer decision rules with small "expected loss" resulting from the use of θ̂(x) repeatedly with varying x. This leads to the risk function of a decision rule.
The risk function of an estimator θ̂ is
R(θ, θ̂) = E[L(θ, θ̂)] = ∑_{x∈𝒳} L(θ, θ̂(x)) f(x/θ) if x is discrete, or ∫_𝒳 L(θ, θ̂(x)) f(x/θ) dx if x is continuous,
where 𝒳 is the sample space (the set of possible outcomes) of x.
When the loss function is squared error, the risk is just the MSE (mean squared error):
R(θ, θ̂) = E[(θ̂ − θ)²] = var(θ̂) + bias²(θ̂)
Note If the loss function is unspecified, assume the squared error loss function.
E[(θ̂(x) − θ)²] = E[(θ̂(x) − E θ̂(x))²] + (E θ̂(x) − θ)² = Var(θ̂(x)) + Bias²(θ̂(x))
This is known as the bias-variance trade-off.
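A Monte Carlo illustration of the decomposition (a sketch, not in the original notes), using a deliberately biased shrinkage estimator 0.9·x̄ of a normal mean; the values of θ, n, and the shrinkage factor are arbitrary:

```python
import numpy as np

# Check MSE = Var + Bias^2 by simulation for the estimator 0.9 * xbar.
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 200_000

xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
est = 0.9 * xbar

mse = np.mean((est - theta) ** 2)
var, bias = est.var(), est.mean() - theta
print(mse, var + bias**2)   # both close to 0.81/10 + (0.1*2)^2 = 0.121
```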
Risk Comparison
How do we compare two estimators?
Given θ̂₁(x) and θ̂₂(x), if R(θ, θ̂₁) ≤ R(θ, θ̂₂) for all θ, then θ̂₁(x) is the preferred estimator.
Ideally, we would like to use the decision rule θ̂(x) which minimizes the risk R(θ, θ̂) for all values of θ. However, this problem has no solution, as it is possible to reduce the risk at a specific θ₀ to zero by making θ̂(x) equal to θ₀ for all x.
Minimax rules
A rule θ̂ is a minimax rule if max_θ R(θ, θ̂) ≤ max_θ R(θ, θ̂′) for any other rule θ̂′. It minimizes the maximum risk. Sometimes this does not produce a sensible choice of decision rule.
Since minimax minimizes the maximum risk (i.e., the loss averaged over all possible data X), the choice of rule is not influenced by the actual data X = x (though given the rule θ̂, the action θ̂(x) is data-dependent). It makes sense when the maximum-loss scenario must be avoided, but it can lead to poor performance on average.
Formally, θ̂ is a minimax estimator if sup_θ R(θ, θ̂) = inf_{θ̂′} sup_θ R(θ, θ̂′).
Bayes Rules
Let π(θ/x) = L(θ/x)π(θ)/p(x) denote the posterior following from likelihood L(θ/x) and prior π(θ). The expected posterior loss (posterior risk) is defined as ∫_S L(θ, θ̂) π(θ/x) dθ.
That is, for each x we choose θ̂(x) to minimize the integral ∫_S L(θ, θ̂) π(θ/x) dθ.
The form of the Bayes rule depends upon the loss function in the following way:
• Zero-one loss (in the limit as the tolerance b → 0) leads to the posterior mode.
• Absolute error loss leads to the posterior median.
• Quadratic loss leads to the posterior mean.
Note: These are not the only loss functions one could use in a given situation, and other loss functions will lead to different Bayes rules.
Example: X ~ Bin(n, θ), and the prior π(θ) is a Beta(α, β) distribution. The prior is unimodal if α, β > 1, with mode (α − 1)/(α + β − 2) and mean E(θ) = α/(α + β). The posterior distribution of θ/x is Beta(α + x, β + n − x).
With zero-one loss and b → 0, the Bayes estimator is θ̂ = (α + x − 1)/(α + β + n − 2) (the posterior mode).
For a quadratic loss function, the Bayes estimator is θ̂ = (α + x)/(α + β + n) (the posterior mean).
For an absolute error loss function, the Bayes estimator is the median of the posterior.
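A short sketch (not from the original notes) computing the three Bayes estimates for hypothetical values of α, β, n, and x; SciPy's beta distribution supplies the posterior median:

```python
from scipy.stats import beta

# Beta(a, b) prior with Bin(n, theta) data x -> Beta(a + x, b + n - x) posterior.
a, b = 2.0, 2.0          # hypothetical prior hyperparameters
n, x = 20, 7             # hypothetical data

posterior = beta(a + x, b + n - x)
post_mean = (a + x) / (a + b + n)                 # quadratic loss
post_median = posterior.median()                  # absolute error loss
post_mode = (a + x - 1) / (a + b + n - 2)         # zero-one loss (posterior mode)
print(post_mean, post_median, post_mode)
```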
Example 1: Let X₁, X₂, …, Xₙ ~ N(θ, 1). We will see that x̄ is minimax with respect to many different loss functions. The risk is 1/n.
Example 2: Let X₁, X₂, …, Xₙ be a sample from a density f. Let F be the class of smooth densities (defined more precisely later). We will see (later in the course) that the minimax risk for estimating f is Cn^{−4/5}.
Prior Distributions
A prior distribution of a parameter is the probability distribution that represents your uncertainty about the
parameter before the current data are examined. Multiplying the prior distribution and the likelihood
function together leads to the posterior distribution of the parameter. You use the posterior distribution to
carry out all inferences. You cannot carry out any Bayesian inference or perform any modeling without
using a prior distribution.
Non-informative Priors
Roughly speaking, a prior distribution is non-informative if the prior is “flat” relative to the likelihood
function. Thus, a prior 𝜋(𝜃) is noninformative if it has minimal impact on the posterior distribution of 𝜃.
Other names for the non-informative prior are vague, diffuse, and flat prior. Many statisticians favor non-
informative priors because they appear to be more objective. However, it is unrealistic to expect that non-
informative priors represent total ignorance about the parameter of interest. In some cases, non-informative
priors can lead to improper posteriors (non-integrable posterior density). You cannot make inferences with
improper posterior distributions. In addition, non-informative priors are often not invariant under
transformation; that is, a prior might be non-informative in one parameterization but not necessarily non-
informative if a transformation is applied.
Improper Priors
A prior 𝜋(𝜃) is said to be improper if ∫ 𝜋(𝜃)𝑑𝜃 = ∞
For example, a uniform prior distribution on the real line, 𝜋(𝜃) ∝ 1 for − ∞ < 𝜃 < ∞, is an improper
prior. Improper priors are often used in Bayesian inference since they usually yield non-informative priors
and proper posterior distributions. Improper prior distributions can lead to an improper posterior
distribution. To determine whether a posterior distribution is proper, you need to make sure that the
normalizing constant ∫ 𝐿(𝜃/𝑥)𝜋(𝜃)𝑑𝜃 is finite for all x. If an improper prior distribution leads to an
improper posterior distribution, inference based on the improper posterior distribution is invalid.
Informative Priors
An informative prior is a prior that is not dominated by the likelihood and that has an impact on the
posterior distribution. If a prior distribution dominates the likelihood, it is clearly an informative prior.
These types of distributions must be specified with care in actual practice. On the other hand, the proper use
of prior distributions illustrates the power of the Bayesian method: information gathered from the previous
study, past experience, or expert opinion can be combined with current information in a natural way.
Conjugate Priors
A prior is said to be a conjugate prior for a family of distributions if the prior and posterior distributions are
from the same family, which means that the form of the posterior has the same distributional form as the
prior distribution. For example, if the likelihood is binomial, 𝑦 ~ Bin(n, 𝜃), a conjugate prior on 𝜃 is the beta distribution; it follows that the posterior distribution of 𝜃 is also a beta distribution. Other commonly
used conjugate prior/likelihood combinations include the normal/normal, gamma/Poisson, gamma/gamma,
and gamma/beta cases. The development of conjugate priors was partially driven by a desire for computational convenience: conjugacy provides a practical way to obtain the posterior distribution in closed form. Simulation-based Bayesian procedures, however, do not need to rely on conjugacy for posterior sampling.
Jeffreys’ Prior
A very useful prior is Jeffreys’ prior (Jeffreys 1961). It satisfies the local uniformity property: a prior that
does not change much over the region in which the likelihood is significant and does not assume large
values outside that range. It is based on the Fisher information matrix. Jeffreys' prior is defined as 𝜋(𝜃) ∝ |𝐼(𝜃)|^{1/2}, where |·| denotes the determinant and 𝐼(𝜃) is the Fisher information matrix based on the likelihood function:
𝐼(𝜃) = −E[∂² log f(𝑦/𝜃) / ∂𝜃²]
Jeffreys’ prior is locally uniform and hence non-informative. It provides an automated scheme for finding a
non-informative prior for any parametric model P(𝑦/𝜃). Another appealing property of Jeffreys’ prior is
that it is invariant with respect to one-to-one transformations. The invariance property means that if you
have a locally uniform prior on 𝜃 and 𝜑(𝜃) is a one-to-one function of 𝜃, then 𝑝(𝜑(𝜃)) = 𝜋(𝜃) ∙ |𝜑′(𝜃)|⁻¹
is a locally uniform prior for 𝜑(𝜃). This invariance principle carries through to multidimensional
parameters as well. While Jeffreys’ prior provides a general recipe for obtaining non-informative priors, it
has some shortcomings: the prior is improper for many models, and it can lead to improper posterior in
some cases; and the prior can be cumbersome to use in high dimensions.
Example: Consider the likelihood for n independent draws from a binomial, L(θ/x) = C θˣ(1 − θ)ⁿ⁻ˣ, where the constant C does not involve θ. Taking logs gives
l(θ/x) = ln L(θ/x) = ln C + x ln θ + (n − x) ln(1 − θ)
Thus
∂l(θ/x)/∂θ = x/θ − (n − x)/(1 − θ)  and likewise  ∂²l(θ/x)/∂θ² = −x/θ² − (n − x)/(1 − θ)²
Since E(x) = nθ we have
I(θ/x) = −E[∂²l(θ/x)/∂θ²] = nθ/θ² + n(1 − θ)/(1 − θ)² = n[θ⁻¹ + (1 − θ)⁻¹]
Hence, Jeffreys' prior becomes p(θ) ∝ I(θ/x)^{1/2} ∝ θ^{−1/2}(1 − θ)^{−1/2}, which is a Beta(1/2, 1/2) distribution (a distribution we discuss later).
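A quick numerical confirmation (not part of the original notes) that the square root of the Fisher information is proportional to the Beta(1/2, 1/2) density; the value of n and the grid are arbitrary:

```python
import numpy as np
from scipy.stats import beta

# For the binomial model, I(theta) = n / (theta*(1-theta)); Jeffreys' prior is
# proportional to sqrt(I(theta)), i.e. to the Beta(1/2, 1/2) density.
n = 10
theta = np.linspace(0.01, 0.99, 5)

jeffreys = np.sqrt(n / (theta * (1 - theta)))
print(jeffreys / beta(0.5, 0.5).pdf(theta))   # constant ratio => proportional
```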
When there are multiple parameters, I is the Fisher information matrix, the matrix of the expected second partials,
I(θ/x)ᵢⱼ = −E[∂² ln L(θ/x) / ∂θᵢ∂θⱼ]
In this case, Jeffreys' prior becomes p(θ) ∝ |I(θ/x)|^{1/2}.
Bayesian Inference
Bayesian inference about 𝜃 is primarily based on the posterior distribution of 𝜃. There are various ways in which you can summarize this distribution. For example, you can report your findings through point estimates. You can also use the posterior distribution to construct hypothesis tests or probability statements.
Point Estimation
Classical methods often report the maximum likelihood estimator (MLE) or the method of moments
estimator (MOME) of a parameter. In contrast, Bayesian approaches often use the posterior mean. The
definition of the posterior mean is given by
𝜃̂ = 𝐸[𝜃/𝑥] = ∫₋∞^∞ 𝜃 𝑝(𝜃/𝑥) 𝑑𝜃
We can also follow maximum likelihood and use the posterior mode, defined as
𝜃̂ = posterior mode = arg max_𝜃 𝑝(𝜃/𝑥)
Another candidate is the median of the posterior distribution, where the estimator 𝜃̂ satisfies 𝑝(𝜃 > 𝜃̂/𝑥) = 𝑝(𝜃 < 𝜃̂/𝑥) = 0.5, hence
∫_{𝜃̂}^{∞} 𝑝(𝜃/𝑥) 𝑑𝜃 = ∫_{−∞}^{𝜃̂} 𝑝(𝜃/𝑥) 𝑑𝜃 = 0.5
However, using any of the above estimators, or even all three simultaneously, loses the full power of a
Bayesian analysis, as the full estimator is the entire posterior density itself. If we cannot obtain the full
form of the posterior distribution, it may still be possible to obtain one of the three above estimators.
However, we can generally obtain the posterior by simulation using Gibbs sampling, and hence the Bayes
estimate of a parameter is frequently presented as a frequency histogram from (Gibbs) samples of the
posterior distribution.
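The sketch below (not from the original notes) mimics this situation: it treats a set of random draws as if they were posterior samples from a sampler and computes the three point estimates from them; the Gamma "posterior" standing in for the samples is entirely hypothetical.

```python
import numpy as np
from scipy.stats import gamma

# Pretend these draws came from a Gibbs sampler; summarize them directly.
rng = np.random.default_rng(1)
draws = gamma(a=8, scale=1 / 4).rvs(50_000, random_state=rng)

post_mean = draws.mean()
post_median = np.median(draws)
counts, edges = np.histogram(draws, bins=200)          # histogram-based mode estimate
post_mode = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])

print(post_mean, post_median, post_mode)   # roughly 2.0, 1.92, 1.75 for Gamma(8, rate 4)
```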
Example: Suppose x₁, …, xₙ are independent observations from an exponential distribution with rate θ, so that
p(xᵢ/θ) = θ e^{−θxᵢ}, xᵢ > 0,  and  L(θ/x) = θⁿ e^{−θ ∑xᵢ}
Suppose our prior distribution for θ is p(θ) = θ⁻¹, θ > 0 (larger values of θ are less likely a priori). Then
π(θ/x) ∝ p(θ) L(θ/x) = θ⁻¹ · θⁿ e^{−θ ∑xᵢ} = θ^{n−1} e^{−θ ∑xᵢ}
This is the kernel of a gamma distribution with "shape" parameter n and "rate" parameter ∑ᵢ₌₁ⁿ xᵢ.
Including the normalizing constant, the posterior is
π(θ/x) = (∑xᵢ)ⁿ θ^{n−1} e^{−θ ∑xᵢ} / Γ(n),  θ > 0
Now, given the observed data x₁, …, xₙ, we can calculate any quantiles of this gamma distribution. The 0.05 and 0.95 quantiles will give us a 90% credible interval for θ.
Suppose we feel π(θ) = θ⁻¹, θ > 0 is too subjective and favors small values of θ too much. Instead, let us consider the non-informative prior π(θ) = 1, θ > 0 (which favors all values of θ equally). Then our posterior is
π(θ/x) ∝ p(θ) L(θ/x) = 1 · θⁿ e^{−θ ∑xᵢ} = θ^{(n+1)−1} e^{−θ ∑xᵢ}
This posterior is a gamma with parameters (n + 1) and ∑ᵢ₌₁ⁿ xᵢ.
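A minimal sketch (not in the original notes) of the credible-interval calculation for the flat-prior posterior Gamma(n + 1, rate ∑xᵢ), using hypothetical data; SciPy parameterizes the gamma by a scale equal to 1/rate.

```python
import numpy as np
from scipy.stats import gamma

x = np.array([0.8, 1.3, 0.2, 2.1, 0.6, 1.0, 0.4, 1.7])   # hypothetical data
n, s = len(x), x.sum()

posterior = gamma(a=n + 1, scale=1 / s)      # Gamma(n + 1, rate = sum x_i)
print(posterior.ppf([0.05, 0.95]))           # 90% equal-tail credible interval for theta
```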
Remarks
• The HPD (highest posterior density) region will be an interval when the posterior is unimodal.
• If the posterior is multimodal, the HPD region might be a discontiguous set.
It is critical to note that there is a profound difference between a confidence interval (CI) from classical (frequentist) statistics and a Bayesian interval. The interpretation of a classical confidence interval is that if we repeat the experiment a large number of times and construct CIs in the same fashion, then (1 − 𝛼) of the time the confidence interval will enclose the (unknown) parameter. With a Bayesian HPD region, there is a (1 − 𝛼) probability that the interval contains the true value of the unknown parameter. Often the CI and Bayesian intervals have essentially the same value, but the interpretational difference remains. The key point is that the Bayesian prior allows us to make direct probability statements about 𝜃, while under classical statistics we can only make statements about the behavior of the statistic if we repeat an experiment a large number of times. Given the important conceptual difference between classical and Bayesian intervals, Bayesians often avoid using the term confidence interval.
In Bayesian hypothesis testing, there can be more than two hypotheses under consideration, and they do not
necessarily stand in an asymmetric relationship. Rather, Bayesian hypothesis testing works just like any
other type of Bayesian inference. Let us consider the case where we are comparing only two hypotheses:
Then the Bayesian hypothesis testing can be done as follows.
Suppose you have the following null and alternative hypotheses: H₀: 𝜃 ∈ Θ₀ and H₁: 𝜃 ∈ Θ₀′, where Θ₀ is a subset of the parameter space and Θ₀′ is its complement. Using the posterior distribution 𝜋(𝜃/𝑥), you can compute the posterior probabilities P(𝜃 ∈ Θ₀/𝑥) and P(𝜃 ∈ Θ₀′/𝑥), the probabilities that H₀ and H₁ are true, respectively. One way to perform a Bayesian hypothesis test is to accept the null hypothesis if P(𝜃 ∈ Θ₀/𝑥) > P(𝜃 ∈ Θ₀′/𝑥) and vice versa, or to accept the null hypothesis if P(𝜃 ∈ Θ₀/𝑥) is greater than a predefined threshold, such as 0.75, to guard against falsely accepting the null hypothesis.
It is more difficult to carry out a point null hypothesis test in a Bayesian analysis. A point null hypothesis is a test of H₀: 𝜃 = 𝜃₀ versus H₁: 𝜃 ≠ 𝜃₀. If the prior distribution 𝜋(𝜃) is a continuous density, then the posterior probability of the null hypothesis being true is 0, and there is no point in carrying out the test. One alternative is to restate the null to be a small interval hypothesis: 𝜃 ∈ Θ₀ = (𝜃₀ − 𝑎, 𝜃₀ + 𝑎), where a is a very small constant. The Bayesian paradigm can deal with an interval hypothesis more easily. Another approach is to give a mixture prior distribution to 𝜃, with a positive probability p₀ on 𝜃₀ and the density (1 − p₀)𝜋(𝜃) on 𝜃 ≠ 𝜃₀. This prior ensures a nonzero posterior probability on 𝜃₀, and you can then make realistic probabilistic comparisons.
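As an illustration (a sketch, not from the original notes), suppose a Beta(1, 1) prior, x = 7 successes in n = 10 binomial trials, and the one-sided hypotheses H₀: θ ≤ 0.5 versus H₁: θ > 0.5; the posterior probabilities are then just areas under the Beta(8, 4) posterior:

```python
from scipy.stats import beta

posterior = beta(1 + 7, 1 + 10 - 7)   # Beta(8, 4) posterior
p_h0 = posterior.cdf(0.5)             # P(theta <= 0.5 / x)
p_h1 = 1 - p_h0                       # P(theta >  0.5 / x)
print(p_h0, p_h1)                     # favor whichever hypothesis has the larger probability
```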
In the classical statistical framework of the frequentists, one uses the observed data to test the significance of a particular hypothesis and (if possible) compute a p-value (the probability of observing a value of the test statistic at least as extreme as the one observed if the null hypothesis is indeed correct). Hence, at first blush one would think that the idea of a hypothesis test is trivial in a Bayesian framework, as we can simply use the posterior distribution:
P(𝜃 > 𝜃₀) = ∫_{𝜃₀}^{∞} 𝜋(𝜃/𝑥) 𝑑𝜃  and  P(𝜃₀ < 𝜃 < 𝜃₁) = ∫_{𝜃₀}^{𝜃₁} 𝜋(𝜃/𝑥) 𝑑𝜃
The kicker with a Bayesian analysis is that we also have prior information, and Bayesian hypothesis testing addresses whether, given the data, we are more or less inclined towards the hypothesis than we initially were. For example, suppose the prior distribution of 𝜃 is such that P(𝜃 > 𝜃₀) = 0.1, while for the posterior distribution P(𝜃 > 𝜃₀) = 0.05. The latter is significant at the 5 percent level in a classical hypothesis testing framework, but the data only roughly double our confidence in the alternative hypothesis relative to our belief based on prior information. If P(𝜃 > 𝜃₀) = 0.5 for the prior, then a 5% posterior probability would greatly increase our confidence in the alternative hypothesis. Hence, the prior probabilities certainly influence hypothesis testing.
To formalize this idea, let p₀ = Pr(𝜃 ∈ Θ₀/𝑥) and p₁ = Pr(𝜃 ∈ Θ₁/𝑥) denote the probabilities, given the observed data x, that 𝜃 is in the null (p₀) and alternative (p₁) hypothesis sets, and let 𝜋₀ and 𝜋₁ denote the corresponding prior probabilities. Note that p₀ and p₁ are posterior probabilities. The Bayes factor in favor of the alternative is the ratio of posterior odds to prior odds; for the example above (𝜋₀ = 0.5, p₀ = 0.05),
B₁ = (p₁/p₀) / (𝜋₁/𝜋₀) = 𝜋₀(1 − p₀) / [p₀(1 − 𝜋₀)] = (0.5 × 0.95) / (0.05 × 0.5) = 19
When the hypotheses are simple, say Θ₀ = {𝜃₀} and Θ₁ = {𝜃₁}, then for i = 0, 1,
pᵢ ∝ Pr(𝜃ᵢ) Pr(𝑥/𝜃ᵢ) = 𝜋ᵢ Pr(𝑥/𝜃ᵢ)
Thus p₀/p₁ = 𝜋₀P(𝑥/𝜃₀) / [𝜋₁P(𝑥/𝜃₁)], and the Bayes factor (in favor of the null) reduces to B₀ = P(𝑥/𝜃₀)/P(𝑥/𝜃₁), which is simply a likelihood ratio.
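For instance (a sketch, not in the original notes), treating the earlier M&M example as two simple hypotheses, θ = 0.5 (United States) versus θ = 0.3 (Canada), the Bayes factor in favor of the US bag is just the likelihood ratio:

```python
from scipy.stats import binom

# 2 red candies observed in 5 draws.
b0 = binom.pmf(2, 5, 0.5) / binom.pmf(2, 5, 0.3)
print(b0)   # ~1.01: the data only barely favor theta = 0.5
```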
Normal Models
Why is it so common to model data using a normal distribution?
• Approximately normally distributed quantities appear often in nature.
• The CLT tells us that any variable that is basically a sum of independent components should be approximately normal.
• Note: x̄ and S² are independent when sampling from a normal population, so if beliefs about the mean are independent of beliefs about the variance, a normal model may be appropriate.
• The normal model is analytically convenient (exponential family, sufficient statistics x̄ and S²).
• Inference about the population mean based on a normal model will be correct as n → ∞ even if the data are truly non-normal.
• When we assume a normal likelihood, we can get a wide class of posterior distributions by using different priors.
For normal data with known variance σ², the likelihood as a function of μ is
L(μ/x) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) ∑ᵢ₌₁ⁿ (xᵢ − μ)²) ∝ exp(−(1/(2σ²)) ∑ᵢ₌₁ⁿ (xᵢ − μ)²)
A conjugate prior for μ is μ ~ N(δ, τ²), i.e. p(μ) ∝ exp(−(μ − δ)²/(2τ²)). So the posterior is
π(μ/x) ∝ L(μ/x) p(μ) ∝ exp(−(1/(2σ²)) ∑ᵢ₌₁ⁿ (xᵢ − μ)² − (1/(2τ²))(μ − δ)²)
= exp(−(1/(2σ²)) ∑ᵢ₌₁ⁿ (xᵢ² − 2μxᵢ + μ²) − (1/(2τ²))(μ² − 2μδ + δ²))
Completing the square in μ shows that the posterior is normal with mean (n x̄ τ² + δσ²)/(nτ² + σ²) and variance σ²τ²/(nτ² + σ²), the normal/normal entry in the conjugate table above.
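A numerical sketch of this update (not from the original notes), with hypothetical values for the known variance, the prior, and the data:

```python
import numpy as np

sigma2 = 4.0                       # assumed known data variance
delta, tau2 = 0.0, 9.0             # hypothetical prior mean and variance for mu
x = np.array([1.2, 2.5, 0.7, 1.9, 2.2, 1.4])
n, xbar = len(x), x.mean()

post_mean = (n * xbar * tau2 + delta * sigma2) / (n * tau2 + sigma2)
post_var = sigma2 * tau2 / (n * tau2 + sigma2)
print(post_mean, post_var)         # the posterior mean shrinks xbar towards delta
```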
Now consider the variance. As a function of σ² (with μ held fixed), the likelihood is
L(σ²/x) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) ∑ᵢ₌₁ⁿ (xᵢ − μ)²) ∝ (σ²)^{−n/2} exp(−(1/(2σ²)) ∑ᵢ₌₁ⁿ (xᵢ − μ)²)
Let W = ∑ᵢ₌₁ⁿ (xᵢ − μ)² denote the sufficient statistic.
The conjugate prior for σ² is the inverse gamma distribution. If a r.v. Y ~ gamma, then 1/Y ~ inverse gamma (IG). With hyperparameters α and β,
π(σ²) ∝ (σ²)^{−(α+1)} e^{−β/σ²}
For this distribution, E(σ²) = β/(α − 1) (for α > 1) and Var(σ²) = β²/[(α − 1)²(α − 2)] (for α > 2), so we could make guesses about E(σ²) and Var(σ²) and use these to determine α and β.
Integrating μ out of the joint posterior gives the marginal posterior for σ²:
π(σ²/x) = ∫ π(μ, σ²/x) dμ ∝ (σ²)^{−(n/2 + 1/2)} exp(−(1/(2σ²))[∑ᵢ₌₁ⁿ xᵢ² − n x̄²])
since the second piece (which depends on μ) just integrates to a normalizing constant.
Hence, since (n/2 + 1/2) = (n/2 − 1/2) + 1 and ∑ᵢ₌₁ⁿ xᵢ² − n x̄² = ∑ᵢ₌₁ⁿ (xᵢ − x̄)², we see the posterior for σ² is inverse gamma:
σ²/x ~ IG(n/2 − 1/2, ½ ∑ᵢ₌₁ⁿ (xᵢ − x̄)²)
Note that π(μ/σ², x) = π(μ, σ²/x) / π(σ²/x) and, after lots of cancellation,
π(μ/σ², x) ∝ (σ²)^{−1/2} exp(−(1/(2σ²))[(n + s₀)μ² − 2(n x̄ + s₀δ)μ + (n x̄² + s₀δ²)])
∝ exp(−[μ − (n x̄ + s₀δ)/(n + s₀)]² / (2σ²/(n + s₀)))
(Here δ is the prior mean of μ and s₀ acts as a prior sample size, the conditional prior being μ/σ² ~ N(δ, σ²/s₀).) Clearly π(μ/σ², x) is normal:
μ/σ², x ~ N((n x̄ + s₀δ)/(n + s₀), σ²/(n + s₀)).  Note that as s₀ → 0, μ/σ², x ~ N(x̄, σ²/n).
Note also that the conditional posterior mean is [n/(n + s₀)] x̄ + [s₀/(n + s₀)] δ.
The relative sizes of n and s₀ determine the weighting of the sample mean x̄ and the prior mean δ.
The marginal posterior for μ is obtained by integrating out σ²:
π(μ/x) = ∫₀^∞ π(μ, σ²/x) dσ² ∝ ∫₀^∞ (σ²)^{−(n/2 + 3/2)} exp(−[2β* + (n + s₀)(μ − δ)²]/(2σ²)) dσ²
where 2β* collects the terms in the exponent that do not involve μ. Letting A = 2β* + (n + s₀)(μ − δ)² and z = A/(2σ²), so that σ² = A/(2z) and dσ² = −(A/(2z²)) dz, we have
π(μ/x) ∝ ∫₀^∞ (A/(2z))^{−(n/2 + 3/2)} e^{−z} (A/(2z²)) dz ∝ A^{−(n/2 + 1/2)} ∫₀^∞ z^{(n/2 + 1/2) − 1} e^{−z} dz
This integrand is the kernel of a gamma density and thus the integral is a constant. So
π(μ/x) ∝ A^{−(n+1)/2} = [2β* + (n + s₀)(μ − δ)²]^{−(n+1)/2} ∝ [1 + (n + s₀)(μ − δ)²/(2β*)]^{−(n+1)/2}
which is a (scaled) noncentral t kernel having noncentrality (location) parameter δ and n degrees of freedom.
Appendix
Bayesian inference differs from classical inference (such as the MLE, least squares, and the method of moments) in that whereas the classical approach treats the parameter θ as a fixed quantity and draws on a repeated sampling principle, the Bayesian approach regards θ as the realized value of a random variable Θ and specifies a probability distribution for the parameter(s) of interest. This makes life easier because it is clear that if we observe data X = x, then we need to compute the conditional density of Θ given X = x ("the posterior").
Many of the reasons that classical methods are more common than Bayesian methods are historical:
• Many methods were developed in the context of controlled experiments.
• Bayesian methods require a bit more mathematical formalism.
• Historically (but not now), realistic Bayesian analyses had been infeasible due to a lack of computing power.