FIT3154 Lecture 5
Bayesian Inference #2
Daniel F. Schmidt
Faculty of Information Technology, Monash University
August 21, 2024
Outline
1 Prior Distributions Revisited
Bayesian Inference of Normal Distribution
Transformations of Priors
2 Weakly Informative Prior Distributions
The Cauchy Prior
Bayesian Inference of Normal Revisited
Bayes Inference - A Recap (1)
We have a model of our data, p(y | θ)
Assume our population parameter is a R.V. distributed as per
θ ∼ π(θ)dθ
where π(θ) is our prior distribution
Chosen to represent prior beliefs about θ/prior ignorance/convenience
We observe some data y = (y1 , . . . , yn )
We form the posterior distribution of θ, given y
p(θ | y) = p(y | θ) π(θ) / ∫ p(y | θ) π(θ) dθ
Bayes Inference - A Recap (2)
Frequentist vs Bayesian Inference
Model of population (both): p(y | θ), with the true population parameter θ unknown.
Population parameter: Frequentist: true θ is unknown, but fixed. Bayesian: true θ is a random variable, i.e., θ ∼ π(θ)dθ.
Point estimates: Frequentist: maximum likelihood θ̂_ML, penalised maximum likelihood, etc. Bayesian: posterior mean, posterior mode, general Bayes estimators.
Measures of uncertainty: Frequentist: standard error √(V[θ̂_ML]). Bayesian: posterior standard deviation √(V[θ | y]).
Interval estimates: Frequentist: 100α% confidence intervals, A(y) such that P(θ ∈ A(y)) = α when y ∼ p(y | θ) with θ unknown but fixed. Bayesian: 100α% credible intervals, A such that P(θ ∈ A | y) = α, conditional on the observed y.
Today’s Relevant Figure (N/A)
Dennis Lindley (1923 - 2013). Born in London, England. Studied mathematics at
Cambridge, where he first encountered statistics. He was a committed Bayesian,
and was one of the few prominent figures in the 50s, 60s and 70s to promote
Bayesian statistics. Made many fundamental contributions to the area of
Bayesian inference.
Outline
1 Prior Distributions Revisited
Bayesian Inference of Normal Distribution
Transformations of Priors
2 Weakly Informative Prior Distributions
The Cauchy Prior
Bayesian Inference of Normal Revisited
Bayesian Inference of the Normal Distribution
Let’s start with another example of Bayesian inference
Let us examine Bayesian inference of the normal distribution
p(y | µ, σ²) = (1/(2πσ²))^(n/2) exp( −(1/(2σ²)) Σ_{j=1}^n (yj − µ)² )
with unknown mean µ and known variance σ 2
We will relax the latter assumption later
So given a data sample y = (y1 , . . . , yn ), we want to infer the population mean µ using Bayesian inference:
Point estimate for µ
Interval estimates for µ
Bayesian Inference of the Normal Distribution
For Bayesian inference we need a prior distribution
Describes our a priori beliefs about potential values of µ
Let us use the normal distribution as prior for µ:
π(µ | m, s²) = (1/(2πs²))^(1/2) exp( −(µ − m)² / (2s²) )
where m and s² are the prior hyperparameters.
Prior mean E [µ] = m sets our “best guess” of the population
parameter
Prior variance V [µ] = s2 controls how much confidence we have in our
guess
Normal is convenient as it is the “conjugate prior”
Normal Prior Distributions
Two normal prior distributions; the first, N (m = 10, s2 = 1) expresses strong
belief µ is near 10; the second, N (m = 10, s2 = 16) expresses much weaker belief
that µ is near 10.
Bayesian Inference of the Normal Distribution
We can write our Bayesian model as the hierarchy
yj | µ, σ 2 ∼ N (µ, σ 2 ), j = 1, . . . , n
µ | m, s2 ∼ N (m, s2 )
After observing a sample y = (y1 , . . . , yn ), the posterior is
µ | y ∼ N( wȳ + (1 − w)m, (n/σ² + 1/s²)^(−1) )
where
w = ns² / (ns² + σ²)
is the weight put on the information in the data
=⇒ (1 − w) is the weight put on the prior information
Weight controlled by sample size and population and prior variances
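The update above is simple enough to compute directly; here is a minimal R sketch (the function name and interface are my own, not from the unit materials):
normal_posterior <- function(y, sigma2, m, s2) {
  # posterior of mu given data y, known variance sigma2 and prior N(m, s2)
  n <- length(y)
  w <- n * s2 / (n * s2 + sigma2)          # weight w on the data
  post_mean <- w * mean(y) + (1 - w) * m   # w*ybar + (1 - w)*m
  post_var  <- 1 / (n / sigma2 + 1 / s2)   # (n/sigma^2 + 1/s^2)^(-1)
  c(mean = post_mean, var = post_var)
}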
Bayesian Inference of the Normal Distribution
We have recorded the body mass index (BMI) of n = 25 members of the Pima Indian ethnic group
We want to estimate the average BMI for the Pima population
We are told that the population standard deviation of BMI in the US is 6.5 kg/m²
We choose to use a normal prior distribution for µ
Need to choose values for m and s
We know that the average BMI of people in the US is 28.6 (source: CDC)
So we could set m = 28.6 kg/m²
Bayesian Inference of the Normal Distribution
How do we select s (the prior standard deviation)?
We know that a majority of the US population has a BMI > 25
So we could choose s = 2, so that there is a 95% prior probability that
the population mean BMI for Pima indians is between ≈ (24.6, 32.6)
So most prior probability concentrated on µ being in the higher range
of plausible BMI values
Let’s see how our analysis turns out
Bayesian Inference of the Normal Distribution
The sample mean of our data is ȳ = 33.20kg/m2
Standard error of 1.3kg/m2
Using our prior (m = 28.6, s = 2) yields the posterior
µ | y ∼ N (31.84, 1.188)
and the 95% credible interval
(29.703, 33.976)
using qnorm(c(0.025,0.975),31.84,sqrt(1.188))
How sensitive is this to choice of prior guess m?
Instead use population average BMI of Japan (m = 22); then
µ | y ∼ N (29.87, 1.188)
which is a ≈ 7% change in estimate of mean BMI
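These numbers are easy to reproduce from the quantities quoted above (ȳ = 33.20, σ = 6.5, n = 25); a small R sketch:
n <- 25; ybar <- 33.20; sigma <- 6.5   # sample size, sample mean, known population sd
m <- 28.6; s <- 2                      # prior hyperparameters
w <- n * s^2 / (n * s^2 + sigma^2)     # weight on the data, approx 0.70
w * ybar + (1 - w) * m                 # posterior mean, approx 31.84
1 / (n / sigma^2 + 1 / s^2)            # posterior variance, approx 1.188
w * ybar + (1 - w) * 22                # with the Japanese prior guess m = 22: approx 29.87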
Normal Posterior Distributions
[Figure: posterior densities of µ under the N(m = 28.6, s² = 4) and N(m = 22, s² = 4) priors, with the sample mean marked]
Posterior distributions of µ for Pima Indians BMI data, for two choices of normal
prior distribution (m = 28.6kg/m2 and m = 22kg/m2 ). Notice how much the
posterior is affected by choice of prior guess m.
Why is the posterior mean so sensitive?
The posterior mean (and mode, and median) is given by
E [µ | y] = wȳ + (1 − w)m
where
w = ns² / (ns² + σ²)
Important observations:
As s2 → ∞, w → 1 (use only the data)
As s2 → 0, w → 0 (use only the prior guess)
If s² < ∞, then as |m| → ∞, |E[µ | y] − ȳ| → ∞
that is, the posterior mean gets further and further away from the
sample mean as we move the prior mean guess
=⇒ the posterior is very sensitive to choices of m and s2
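Plugging in the Pima numbers makes this concrete (the prior guesses other than 28.6 and 22 are made-up values, purely for illustration):
w <- 25 * 2^2 / (25 * 2^2 + 6.5^2)   # weight on the data, approx 0.70
m <- c(28.6, 22, 10, 100)            # increasingly unreasonable prior guesses
w * 33.2 + (1 - w) * m               # posterior means drift without bound as |m| grows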
Uninformative Priors
How to solve this sensitivity?
One way is to try and make the prior uninformative
In our analysis of the normal-normal posterior, we noticed that if s² is larger, the prior has less effect
So perhaps we could let s2 grow very large to make a more
“uninformative” prior ...
Uninformative Priors
[Figure: N(28.6, s²) prior densities for s = 2, 10, 20 and 100]
Normal prior distributions as s increases. Notice for very large s the distribution
becomes “flat” and spreads its probability very thinly across the µ-line.
Uninformative Priors
In the extreme case that s2 → ∞, the prior becomes uniform
π(µ) ∝ 1
which tries to say that any value of µ is a priori equally likely
=⇒ no a priori preference for any particular value of µ
One issue is that the uniform prior on R doesn’t normalise, i.e.,
∫_{−∞}^{+∞} (1) dµ = ∞
This type of prior is called improper
It lacks even a subjective probability interpretation
Assigns “zero” prior probability to every bounded set A ⊂ R ...
Uninformative Priors
Incredibly, despite being improper, using it with the normal likelihood
results in a proper posterior
We now have the hierarchy
yj | µ, σ 2 ∼ N (µ, σ 2 ), j = 1, . . . , n
µ ∼ (1)dµ
with p(x)dx denoting the distribution associated with the PDF p(x)
The posterior is now
µ | y ∼ N( ȳ, σ²/n )
where we can see that
The posterior mean is now the sample mean
The posterior variance is the square of the standard error
=⇒ data completely determines inferences
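For the Pima data this gives µ | y ∼ N(33.20, 1.3²), so a 95% credible interval can be read off directly (a sketch using the sample figures quoted earlier):
qnorm(c(0.025, 0.975), mean = 33.20, sd = 1.3)   # approx (30.65, 35.75)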
Normal Posterior Distributions
[Figure: posterior densities of µ under the N(m = 28.6, s² = 4) prior, the N(m = 22, s² = 4) prior, and the uninformative (s = ∞) prior, with the sample mean marked]
Posterior distributions of µ for Pima Indians BMI data, for two choices of normal
prior distribution (m = 28.6kg/m2 and m = 22kg/m2 ) and uninformative
(uniform prior).
Uninformative Priors
Why does the posterior normalise even though prior does not?
We can write any prior probability distribution as
π(θ) = π_u(θ) / π_c
where
π_u(θ) is the prior density up to constants in θ
π_c is the normalizing constant so that ∫ π(θ) dθ = 1, i.e.,
π_c = ∫ π_u(θ) dθ
The posterior can then be written as
p(θ | y) = p(y | θ) π_u(θ) (1/π_c) / [ (1/π_c) ∫ p(y | θ) π_u(θ) dθ ]
so that the normalizing constant π_c cancels
Uninformative Priors
Uniform prior distributions are “uninformative”
So we have solved the problem of Bayesian inference!
Not quite ...
There are three problems with this approach:
1 Posteriors based on improper priors lack many Bayesian optimality
properties
2 The marginal probability p(y) is zero – which causes problems for some
parts of Bayesian theory
3 Being uniform for one parameterisation does not necessarily mean
“uninformative” in actuality
To see the last point we now need to examine transformations of
random variables
Reparameterisation of models (1)
Recall our definition of a model:
A distribution p(y | θ) over dataspace y ∈ Y n
The quantity θ are the parameter(s)
Example: normal distribution
p(y | µ, σ²) = (1/(2πσ²))^(1/2) exp( −(y − µ)² / (2σ²) )
θ = (µ, σ 2 ) are the parameters
But this parameterisation is not unique!
Reparameterisation of models (2)
We can choose any one-to-one transformation of θ, i.e.,
ϕ = f (θ) ⇐⇒ θ = f −1 (ϕ)
so that ϕ is the new parameterisation; then
p(y | ϕ) ≡ p(y | θ = f −1 (ϕ))
Example, we can use precision, τ , instead of variance
τ = 1/σ² ⇐⇒ σ = √(1/τ)
which leads to
p(y | µ, σ = τ^(−1/2)) = (τ/(2π))^(1/2) exp( −τ(y − µ)²/2 )
E.g., (µ = 0, σ = 2) is the same model as (µ = 0, τ = 1/4)
No parameterisation is better than any other, though some may be
more interpretable
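A quick numerical check that the two parameterisations give the same density (the evaluation point y = 1.7 is an arbitrary choice of mine):
y <- 1.7
dnorm(y, mean = 0, sd = 2)                  # density under (mu = 0, sigma = 2)
tau <- 1/4
sqrt(tau / (2 * pi)) * exp(-tau * y^2 / 2)  # density under (mu = 0, tau = 1/4): same value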
Transformation of Random Variables (1)
We now consider transformation of random variables
Let X be a discrete RV with probability distribution p(X = x)
Now consider a one-to-one transformation
Y = f(X) ⇐⇒ X = f⁻¹(Y);
then
P(Y = y) = P(X = f⁻¹(y))
Example; if X ∈ {1, 2, 3, 4, 5, 6} is the result of a fair dice throw, and
Y = 1/X, then
Y ∈ {1, 1/2, 1/3, 1/4, 1/5, 1/6}
and
P(Y = 1/3) = P(X = 3)
Transformation of Random Variables (2)
For continuous RV we have a pdf, say p(X = x) ≡ p(x)
This means that p(y) ̸= p(x = f −1 (y)), in general
=⇒ p(x) is not the probability of x; p(x)dx is for small dx
So, we have
p(y) dy = p(x) dx
p(y) = p(x) |dx/dy|
p(y) = p(f⁻¹(y)) |df⁻¹(y)/dy|
where the last term, |df⁻¹(y)/dy|, is called the Jacobian
We need y = f(x) to be differentiable as well as one-to-one
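A small simulation sketch of the formula (this lognormal example is my own, not from the slides): if X ∼ N(0, 1) and Y = exp(X), then f⁻¹(y) = log y and |df⁻¹(y)/dy| = 1/y, so p(y) = dnorm(log y)/y.
set.seed(1)
x <- rnorm(1e5)                  # X ~ N(0, 1)
y <- exp(x)                      # Y = f(X) = exp(X)
hist(y, breaks = 200, freq = FALSE, xlim = c(0, 5))
yy <- seq(0.01, 5, length.out = 500)
lines(yy, dnorm(log(yy)) / yy)   # p(f^{-1}(y)) * |d f^{-1}(y)/dy| matches the histogram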
Transformation of Prior Distributions (1)
Why is this important to us?
In Bayesian inference we have:
1 A probability model, p(y | θ);
2 A prior distribution π(θ) on the parameter θ
If we reparameterise our model, say ϕ = f (θ), then
Our new probability model is p(y | ϕ) ≡ p(y | θ = f −1 (ϕ))
Our new prior is the transformation of π(θ) to π(ϕ)
=⇒ θ is a random variable, so need to transform the prior
This means uniform prior for one parameterisation does not imply
uniform prior for another ...
Transformation of Prior Distributions (2)
Example: Bayesian inference of Bernoulli model
Here our probability model is
p(yj | θ) = θ^yj (1 − θ)^(1−yj)
so θ is the probability of success
Let's choose a uniform prior on θ, i.e.,
θ ∼ Beta(1, 1)
which has a probability density
π(θ) = 1
so that we are “uninformative”
=⇒ any probability of success equally likely
Transformation of Prior Distributions (3)
But an alternative parameterisation could be in terms of odds
O = θ/(1 − θ),  θ = O/(O + 1)
i.e., how much more likely a success is than a failure
This is one-to-one and differentiable; for example
θ = 0.5 ⇐⇒ O = 1
θ = 0.9 ⇐⇒ O = 9
θ = 0.1 ⇐⇒ O = 1/9
In terms of odds, our probability model is
p(yj | θ = O/(O+1)) = (O/(O+1))^yj (1 − O/(O+1))^(1−yj)
                    = O^yj / (O + 1)
What about our prior for O?
Transformation of Prior Distributions (4)
We said that θ ∼ Beta(1, 1) (i.e., uniform)
So to find π(O) we need to transform the probability density
π(O) = π(θ = O/(O+1)) · d[O/(O+1)]/dO
     = 1/(O + 1)²
This is clearly not uniform
Let’s have a look at it ...
Note: see if you can transform this prior back to θ (you should
recover the uniform distribution ...)
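A quick simulation check of this transformation (my own sketch, not from the slides): draw θ uniformly, map each draw to odds, and check the implied probabilities against 1/(O + 1)².
set.seed(1)
theta <- runif(1e5)        # theta ~ Beta(1,1), i.e. uniform on (0,1)
O <- theta / (1 - theta)   # transform each draw to odds
mean(O < 1)                # approx 0.5 = integral of (O + 1)^(-2) from 0 to 1
mean(O < 9)                # approx 0.9, i.e. P(theta < 0.9)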
Example: Bayesian Analysis of Bernoulli Distribution (2)
Uniform prior on θ leads to very non-uniform prior on odds O; would we expect uniform prior on O to make sense? Think: θ < 1/2 ⇔ O < 1, so that ∫₀¹ (1 + O)⁻² dO = 1/2 just as ∫₀^(1/2) (1) dθ = 1/2.
Transformation of Prior Distributions (3)
But it can be much worse; imagine we start by putting a uniform prior
on O
π(O) ∝ 1
which is improper as O ∈ (0, ∞)
We might think we are being uninformative, as we are saying all odds
are equally likely; but does this make sense?
Transforming back to a prior on θ yields:
π(θ) ∝ π(O = θ/(1−θ)) · d[θ/(1−θ)]/dθ
     ∝ 1/(1 − θ)²
This is definitely not uninformative; in fact, let’s look at it ...
Example: Bayesian Analysis of Bernoulli Distribution (2)
Uniform prior on O leads to prior that heavily favours θ close to 1 – so “uniform”
does not mean uninformative; we need to think about what our priors are saying
about our beliefs about the population parameter.
Outline
1 Prior Distributions Revisited
Bayesian Inference of Normal Distribution
Transformations of Priors
2 Weakly Informative Prior Distributions
The Cauchy Prior
Bayesian Inference of Normal Revisited
Weakly Informative Priors (1)
We now look at weakly informative priors
They are proper and let us specify prior beliefs
But they don’t affect our inferences too much
For our normal example, instead of using a normal prior for µ, let's use the Cauchy distribution
π(µ | m, s) = 1 / ( πs (1 + (µ − m)²/s²) )
where
m is the location parameter, and
s is the scale parameter
The Cauchy is bell-shaped and symmetric around m
However, Cauchy does not have a mean or variance
The parameter m sets the median (and mode)
Weakly Informative Priors (2)
Our hierarchy is then
yj | µ, σ 2 ∼ N (µ, σ 2 ), j = 1, . . . , n
µ | m, s ∼ C(m, s)
where X ∼ C(m, s) denotes that X is a RV distributed as per a Cauchy with location m and scale s
Unfortunately, the posterior does not have a nice form and the
marginal p(y) does not have a nice solution
Because we only have one unknown parameter (here µ) we can use numerical integration to compute the marginal
Let's see how it works; a short sketch of this computation follows below
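Here is a minimal sketch of that grid-based numerical integration (the function name, grid range and interface are my own assumptions, not the unit's code):
cauchy_posterior_mean <- function(y, sigma, m, s, grid = seq(0, 60, length.out = 1e4)) {
  loglik  <- sapply(grid, function(mu) sum(dnorm(y, mu, sigma, log = TRUE)))
  logpost <- loglik + dcauchy(grid, location = m, scale = s, log = TRUE)
  w <- exp(logpost - max(logpost))   # unnormalised posterior evaluated on the grid
  w <- w / sum(w)                    # normalising on the grid stands in for computing p(y)
  sum(w * grid)                      # approximate posterior mean E[mu | y]
}
This kind of calculation is what produces the Cauchy-prior posterior means quoted a few slides below.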
Weakly Informative Priors (3)
For this example, let’s choose m and s to calibrate with our normal
prior
Our normal prior with s = 2 placed 95% prior probability on µ ∈ (24.6, 32.6); we choose the Cauchy hyperparameters to match this
Let’s set m = 28.6 for Cauchy, and then adjust s until
1 - pcauchy(32.6, 28.6, scale=s)
is approximately 0.025
Choice of s = 0.32 approximately satisfies this, so that
∫_{24.6}^{32.6} π(µ | m = 28.6, s = 0.32) dµ ≈ 0.95
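The calibration itself can be automated with a short root search (a sketch; uniroot is just one convenient way to do it):
s_cal <- uniroot(function(s) 1 - pcauchy(32.6, location = 28.6, scale = s) - 0.025,
                 interval = c(0.01, 5))$root
s_cal                                                        # approx 0.32
diff(pcauchy(c(24.6, 32.6), location = 28.6, scale = 0.32))  # approx 0.95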
Cauchy vs Normal Prior Distribution (1)
Normal and Cauchy prior distributions for µ calibrated so that P(µ ∈ (24.6, 32.6)) ≈ 0.95. Note how much more peaked the Cauchy prior is; this suggests it might be more informative ...
Cauchy vs Normal Prior Distribution (2)
... but plotting the two densities on the log scale shows that the normal distribution races off to zero probability much faster than the Cauchy as |µ − m| grows. The Cauchy has “heavier tails”.
Cauchy vs Normal Prior Distribution (3)
Sample mean ȳ = 33.20
For normal prior:
When µ ∼ N (28.6, 4), then E [µ | y] = 31.84kg/m2
When µ ∼ N (22.0, 4), then E [µ | y] = 29.87kg/m2
For Cauchy prior:
When µ ∼ C(28.6, 0.32), then E [µ | y] = 31.96kg/m2
When µ ∼ C(22.0, 0.32), then E [µ | y] = 32.89kg/m2
Setting our prior guess further away from ȳ made our estimate closer to ȳ!
Let’s see behaviour as our prior guess m is varied for the two priors
(normal vs Cauchy)
Cauchy vs Normal Prior Distribution (4)
Posterior mean for the Pima indians BMI data for the three different priors as prior
guess m is varied. The uninformative prior is π(µ) ∝ 1 and has no “prior guess”.
Note how the Cauchy prior only uses its prior guess when m is “near” ȳ = 33.2.
Weakly Informative Priors: Summary
In summary, using weakly informative priors:
If our sample mean is in “vicinity” of our prior guess, use prior
information
If our sample mean is far from our prior guess, then ignore our prior
guess
Cauchy is very robust
It means we can specify any prior information we might have, but be safe in the knowledge that if it is grossly wrong, it won't bias our results
Better than uninformative prior because it remains proper
(normalizable)
(Minor) drawback is they can be harder to work with
Tails of Distributions
Cauchy distribution vs normal distribution
For simplicity, let m = 0 and s = 1 for both
Then, for the normal:
π(µ) ∝ exp( −µ²/2 )
For the Cauchy:
π(µ) ∝ 1 / (1 + µ²)
So as |µ| → ∞ ...
probability vanishes to zero exponentially fast for normal;
probability vanishes to zero polynomially fast for Cauchy
Cauchy goes to zero infinitely slower (“heavier tails”)
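A quick numerical look at the two tails (m = 0, s = 1; the evaluation points are arbitrary):
x <- c(2, 5, 10)
dnorm(x) / dnorm(0)      # normal density relative to its peak: vanishes extremely fast
dcauchy(x) / dcauchy(0)  # Cauchy density relative to its peak: vanishes like 1/(1 + x^2)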
Inference of Normal with Unknown σ 2 (1)
Let’s finish by revisiting inference of the normal distribution
So far we have assumed the data variance σ 2 was known
Somewhat restrictive - and unrealistic
Can we relax this assumption in the Bayesian framework?
Of course we can!
But we need a prior for σ ...
Inference of Normal with Unknown σ 2 (2)
σ controls the scale of the data (m, km, etc.)
The uninformative prior for σ is
π(σ) ∝ 1/σ
which is not normalizable
Instead we might choose to use the (unit) half-Cauchy
π(σ) = 2 / ( π(1 + σ²) ),  σ > 0
which is a good default choice for any scale type parameter
This has heavy tails; no mean, or variance
Median is at σ = 1
If an RV X follows a unit half-Cauchy, we can say that
X ∼ C + (0, 1)
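One convenient way to draw from C⁺(0, 1) in R (a sketch of my own, not from the slides) is to take absolute values of standard Cauchy draws:
set.seed(1)
sigma_draws <- abs(rcauchy(1e5))   # |X| with X ~ C(0, 1) is half-Cauchy C+(0, 1)
median(sigma_draws)                # close to 1, the prior median
mean(sigma_draws > 10)             # heavy tail: a few percent of mass beyond 10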
Half-Cauchy Prior Distribution
Unit half-Cauchy prior distribution for σ; it tails off to probability of zero slowly
as σ → ∞.
Inference of Normal with Unknown σ 2
Our hierarchy is then
yj | µ, σ 2 ∼ N (µ, σ 2 ), j = 1, . . . , n
µ | m, s ∼ C(m, s)
σ ∼ C + (0, 1)
The posterior distribution is then
p(µ, σ | y) = p(y | µ, σ) π(µ | m, s) π(σ) / ∫∫ p(y | µ, σ) π(µ | m, s) π(σ) dµ dσ
The denominator integral does not exist in closed form
Two dimensional integral, harder to compute numerically
Even if it did, dealing with two parameter PDFs is difficult
Hard to calculate credible intervals, etc.
Bayesian Inference with Difficult Posteriors
This problem with the denominator integral is the norm rather than the exception
Can we overcome this or is that it for the Bayesian approach?
It turns out it is much easier to draw random samples from p(θ | y)
than to compute the integral p(y)
So what we usually do is draw many “samples” of θ from the
posterior, and use these samples to approximate mean, intervals, etc.
Called the Markov-Chain Monte-Carlo (MCMC) approach
We will look at how we do this a little in the next lecture
Example: Pima indian BMI data
For now, let’s return to inference of our normal distribution
Our hierarchy is then
yj | µ, σ 2 ∼ N (µ, σ 2 ), j = 1, . . . , n
µ | m, s ∼ C(m, s)
σ ∼ C + (0, 1)
For our Pima indian BMI data, we chose to use m = 28.6 and
s = 0.32 for our hyperparameters
Draw 100,000 samples of µ and σ from the posterior p(µ, σ | y)
Use histograms of samples to visualise posterior
Use mean, sd, etc. to get statistics
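For completeness, here is a rough random-walk Metropolis sketch of how such samples could be drawn (this is my own illustrative implementation, not the code used for the results below; step size and starting values are arbitrary):
log_post <- function(mu, sigma, y, m, s) {
  if (sigma <= 0) return(-Inf)                  # half-Cauchy support: sigma > 0
  sum(dnorm(y, mu, sigma, log = TRUE)) +
    dcauchy(mu, location = m, scale = s, log = TRUE) +
    log(2 / (pi * (1 + sigma^2)))               # unit half-Cauchy prior on sigma
}
metropolis <- function(y, m, s, iters = 1e5, step = 0.5) {
  draws <- matrix(NA, iters, 2, dimnames = list(NULL, c("mu", "sigma")))
  mu <- mean(y); sigma <- sd(y)                 # crude starting values
  for (i in 1:iters) {
    mu_prop    <- mu    + rnorm(1, 0, step)     # symmetric random-walk proposals
    sigma_prop <- sigma + rnorm(1, 0, step)
    log_accept <- log_post(mu_prop, sigma_prop, y, m, s) - log_post(mu, sigma, y, m, s)
    if (log(runif(1)) < log_accept) { mu <- mu_prop; sigma <- sigma_prop }
    draws[i, ] <- c(mu, sigma)
  }
  draws
}
After running, e.g., samples <- metropolis(y, m = 28.6, s = 0.32), the µ and σ columns can be summarised with mean, sd and quantile as described above. In practice general-purpose samplers do this work for us (next lecture).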
Posterior Samples of µ
Posterior samples of µ: mean ≈ 32.91; sd ≈ 1.27; central 95% interval (0.025 and 0.975 quantiles) ≈ (30.36, 35.40).
Note the two modes – one near ȳ and one near our prior guess m.
Posterior Samples of σ
Posterior samples of σ: mean ≈ 6.17; sd ≈ 0.92; central 95% interval (0.025 and 0.975 quantiles) ≈ (4.68, 8.24).
Markov-Chain Monte-Carlo Approaches
In general, this sampling approach lets us explore the posterior for any
Bayesian hierarchy
The drawback is we don’t get a clear mathematical
look/understanding of what is happening
The advantage is we don't have to worry too much about being clever, as there are general-purpose programs for this
The drawback of those is that they are much slower than being clever
and developing specialised programs
Terms to Revise
Terms you should know/be aware of:
Improper prior
Transformation of random variables
Weakly informative prior distributions
Cauchy distribution
Next week we will examine Bayesian Poisson models and Bayesian
linear regression