
Chapter 8: Statistical Models & Statistical Inference

Mark Andrews

Contents

Introduction

Statistical inference

Classical statistical inference
  Maximum likelihood estimation
  Sampling distribution of θ̂
  p-values
  Null hypotheses and significance
  Confidence intervals

Bayesian statistical inference
  Priors
  Bayes' rule and the posterior distribution
  Posterior summaries
  Monte Carlo sampling

Model evaluation
  Deviance and log likelihood ratio tests
  Cross validation and out-of-sample predictive performance
  AIC
  WAIC

References

Introduction
As important as exploratory analysis is, it is a fundamentally different undertaking to statistical modelling
and statistical inference. Exploratory data analysis aims to describe and visualize the data and to identify
possible trends and patterns in it. Statistical models, by contrast, are mathematical models of the population
from which the data originated. The term population is here used in a technical sense that is specific to
statistics. In general, a population is the hypothetical set from which our actual data are assumed to be a
random sample. From the perspective of statistical modelling, our ultimate interest lies not in the data itself
but rather in the population, or more specifically in our mathematical models of the population.
Let us assume that we have the following data set: y1 , y2 . . . yn , where each yi is a single (scalar) value. From
the perspective of statistical modelling, y1 , y2 . . . yn is a random sample from a population. Ultimately, this
population is described by a probability distribution. More specifically, we treat the values y1 , y2 . . . yn as a
realization of the random variables Y1 , Y2 . . . Yn . A random variable has a formal mathematical definition, but
informally speaking, it is a variable that can take on different values according to some probability distribution.
The probability distribution over the variables Y1 , Y2 . . . Yn can be denoted generally by P(Y1 , Y2 . . . Yn ). This
probability distribution is a function over an n dimensional space that gives the probabilities of every possible
combination of values of the n variables Y1 , Y2 . . . Yn . Our observed data y1 , y2 . . . yn is but one realization of
the random variables Y1 , Y2 . . . Yn , and every possible realization is a random sample from the probability
distribution P(Y1 , Y2 . . . Yn ). Thus, the possible values that Y1 , Y2 . . . Yn can take and their corresponding
probabilities formally define the population.
In general, except in artificial scenarios, we do not know the nature of the population probability distribution
P(Y1 , Y2 . . . Yn ). Our aim, therefore, is to first propose a model of this probability distribution, and then
use statistical inference to infer the properties of this model. The model that we develop is a probabilistic
model, which means that it defines a probability distribution over the variables Y1 , Y2 . . . Yn . This model is
often referred to simply as the statistical model. In some contexts, it is also known as the generative model,
the probabilistic generative model, the data generating model, and so on. Almost always, as we will see, this
statistical model is defined in terms of parameters and other variables that are assumed to have fixed but
unknown values. We use statistical inference, which we will discuss at length below, to infer what the values
of these variables are.
The process of developing a statistical model and inferring its parameters and other unknown variables is, in
fact, an iterative process, and is illustrated in the following diagram.

[Diagram: the iterative modelling cycle. A proposed model leads, via inference, to a fitted model; evaluation of the fitted model leads back to a (possibly revised) proposed model.]

First, we propose or assume a statistical model. This may be done on the basis of some exploratory data
analysis and visualization, as well as by using our scientific knowledge and understanding of the phenomenon
being studied. Having assumed some model, we then infer its parameters and other unknowns. We are
now in a position to critically evaluate the resulting fitted model. Specifically, we evaluate whether or not
the predictions of the model make sense and whether they are consistent with the data. On the basis of this
evaluation, we may need to elaborate or possibly simplify our originally proposed model, which leads to
a new proposed model. The unknowns of this model are then inferred, and the new fitted model is evaluated.
This process iterates until we are satisfied that the model is sufficient for practical purposes or that no
major alternative models have been overlooked. The final fitted model is then used as the model of the
population. As we will see through examples, this is then effectively a mathematical or probabilistic model of
the phenomenon being studied. With this model, we can explain and reason about the phenomenon, make
predictions about future data, and so on.
As a brief illustrative outline of this modelling process, consider the following housing_df data frame.
housing_df <- read_csv('data/housing.csv')
housing_df
## # A tibble: 546 x 1
## price
## <dbl>
## 1 42000
## 2 38500
## 3 49500
## 4 60500
## 5 61000
## 6 66000
## 7 66000
## 8 69000
## 9 83800
## 10 88500
## # ... with 536 more rows
This gives the prices (in Canadian dollars) of a set of 546 houses in the city of Windsor, Ontario in 1987. We
can denote these values as y1 , y2 . . . yn and treat them as a realization of the random variables Y1 , Y2 . . . Yn
whose probability distribution is P(Y1 , Y2 . . . Yn ). This defines the population. How best to conceive of this
population in real world terms is usually a matter of debate and discussion, rather than a simple matter of
fact. For example, the population might be conceived of narrowly as the set of house prices in Windsor in
1987, or more widely as the set of house prices in mid-sized cities in Canada in the late 1980s, or more widely
still as the set of house prices in North America during the time period, and so on. Regardless of how we
conceive of the population, we do not know the true nature of P(Y1 , Y2 . . . Yn ) and so begin by proposing
a model of it. Among other things, this initial proposal could be based on exploratory data analysis and
visualization, such as that shown in Figure 1a-d. From the histograms and QQ plots shown here, we see that
the logarithm of house prices appears to be distributed as a normal distribution.


Figure 1: a) A histogram of house prices (in Canadian dollars) from the city of Windsor, Ontario, Canada, in
1987. b) A histogram of the logarithm of the house prices. c) A QQ plot of the prices compared to a normal
distribution whose mean and standard deviation are set to the median and median absolute deviation (mad)
of the prices. d) The QQ plot of log of the prices, again compared to a normal distribution whose mean and
standard deviation are set to the median and mad of the log prices.

This leads us to the following proposed model: for each i ∈ 1 . . . n, Yi ∼ logN(µ, σ 2 ). Note that logN(µ, σ 2 ) is
shorthand to denote a log-normal distribution whose parameters are µ and σ 2 . A log-normal distribution is a
distribution of a variable whose logarithm is normally distributed. In the model statement, the ∼ is read as
is distributed as. This model therefore states that each random variable Yi , for i ∈ 1 . . . n, is distributed as a
log-normal distribution whose parameters are µ and σ 2 . This model is an example of an independent and
identically distributed (iid) model: each of the n random variables is modelled as independent of one another,
and each one has the same probability distribution.
Having proposed an initial model, we must now infer the values of the unknown variables µ and σ. As we will
discuss at length below, there are two major approaches to doing statistical inference, but in practical terms,
both methods effectively lead to estimates of µ and σ as well as measures of the uncertainty of these estimates.
If we denote the estimates of µ and σ by µ̂ and σ̂, our fitted model is then the log-normal distribution
logN(µ̂, σ̂ 2 ). At this point, we may evaluate this fitted model and determine if its assumptions or predictions
are consistent with the observed data. If, on the basis of this evaluation, the assumed model is deemed
satisfactory for present practical purposes, we then can use the fitted model, particularly taking into account
the uncertainties in our estimates, to explain, reason about, or make predictions concerning house prices in
cities like Windsor during this period.
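As a brief preview of that inference step, here is a minimal sketch in R, assuming the housing_df data frame loaded above. For the log-normal model, the maximum likelihood estimates µ̂ and σ̂ are simply the mean and the standard deviation (computed with a divisor of n rather than n − 1) of the log prices:
log_price <- log(housing_df$price)
mu_hat <- mean(log_price)
sigma_hat <- sqrt(mean((log_price - mu_hat)^2))  # MLE divides by n, not n - 1
c(mu_hat, sigma_hat)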

Statistical inference
Statistical inference is the inference of the values of unknown variables in a statistical model. There are
two major approaches to statistical inference. The first approach is variously referred to as the classical,
frequentist, or sampling theory based approach. The second is the Bayesian approach. The classical approach
is still the dominant one in practice, particularly for the more introductory or medium level topics. It is also
the principal, or even only, approach taught in most applied statistics courses. As such, it is the only approach
that most working scientists will have been formally introduced to. The Bayesian approach, on the other
hand, although having its origins in the 18th century and being used in practice throughout the 19th century,
had a long hiatus in statistics until around the end of the 1980s. Since then, and as we will see, largely
because of the growth in computing power, it has steadily grown in terms of its popularity and widespread
usage throughout statistics and data science. Throughout the remainder of this book, we will attempt to pay
adequate attention to both the classical and Bayesian approaches wherever we discuss statistical models. In
this section, we aim to provide a brief general introduction to both approaches. For this, we will use a simple
problem, propose a statistical model for it, and then infer the values of its unknown parameters using the
classical and the Bayesian approaches.
Example problem: The problem we will consider was described in the Guardian newspaper on January
4, 2002¹: “Polish mathematicians Tomasz Gliszczynski and Waclaw Zawadowski. . . spun a Belgian one
euro coin 250 times, and found it landed heads up 140 times.” Here, the data is the observed number of
heads, m = 140. The total number of spins n = 250 is the sample size, and is a fixed and known quantity,
and so is not modelled per se. From the perspective of statistical modelling, the observed number of heads
is treated as a sample from a population. In this case, the population is the set of all possible observed
number of heads, and their relative frequencies, that would be obtained if Gliszczynski and Zawadowski were
to infinitely repeat their study under identical circumstances with the exact same Belgian one euro coin.
The population is therefore a probability distribution over the set of possible values of the number of heads
in a sample of n = 250 trials, which is a probability distribution over 0 . . . n. Thus, m is a realization of
a random variable Y whose possible values are 0 . . . n and whose probability distribution is P(Y ). We do
not know P(Y ) and so we begin by proposing a model of it. In this case, the only viable option for this
model is the binomial distribution: Y ∼ Binomial(θ, n = 250). While more complex options are possible in
principle, they would all go beyond what the data can tell us, and so simply can not be evaluated. In general
terms, a binomial distribution gives the probability of the number of so-called “successes” in a fixed number
n of “trials”, where the probability of a success on any trial is a fixed quantity θ and all trials are independent
of one another. Translated into the terms of the current example, the binomial distribution is the probability
distribution over the observed number of heads in n = 250 spins where the probability of a heads on any trial
is θ, and where we assume the outcome on each trial is independent of every other trial. In this binomial
model, the parameter θ is a fixed but unknown quantity. The value of θ is therefore the objective of our
statistical inference.
1 https://www.theguardian.com/world/2002/jan/04/euro.eu2
Classical statistical inference
Classical statistical inference begins with an estimator of the value of θ, denoted by θ̂, and then considers the
sampling distribution of θ̂ for any hypothetical true value of θ. Informally speaking, we can see the estimator
θ̂ as an educated guess of what the true value of θ is. There are different methods of estimation available,
and each one can be evaluated, as we will see, according to certain criteria, such as bias, variance, etc. One
widely used method of estimation, perhaps the most widely used method, is maximum likelihood estimation.

Maximum likelihood estimation


The maximum likelihood estimator of θ is the value of θ that maximizes the likelihood function. The likelihood
function is an extremely important function in both classical and Bayesian approaches to statistical inference.
It is a function over the set of all possible values of θ, which we will denote by Θ. It gives the probability of
observing the data given any particular value of θ. In order to determine the likelihood function in the case
of the binomial distribution model, we first start with the definition of the binomial distribution itself. If Y is
random variable distributed as a binomial distribution with parameter θ and sample size n, which we can
state succinctly as Y ∼ Binomial(θ, n), then the probability that Y takes on the value m is as follows:

P(Y = m | θ, n) = Binomial(Y = m | θ, n) = \binom{n}{m} θ^m (1 − θ)^{n−m}.

Why the binomial distribution has the probability mass function shown on the very right hand side here is
not something we will derive here, and we will just take it as a given, but it is relatively straightforward to
derive from the above definition of a binomial problem. Note that this probability mass function is a function
that maps values of m ∈ 0 . . . n to the interval [0, 1] for fixed values of θ and n. The corresponding likelihood
function takes the same formula, i.e. \binom{n}{m} θ^m (1 − θ)^{n−m}, and treats it as a function of θ for fixed values of m
and n. In other words, the likelihood function is as follows:

L(θ | m, n) = \binom{n}{m} θ^m (1 − θ)^{n−m},

where θ is assumed to take values in the interval Θ = [0, 1] and where n and m are fixed. In other words, the
likelihood function gives the probability of the observed data for every possible value of the unknown variable
θ ∈ Θ. Technically speaking, any likelihood function is defined only up to a proportionality constant: multiplying
or dividing a likelihood function by a fixed constant results in the same likelihood function. In practice,
therefore, when writing a likelihood function, we usually drop any constant multipliers. In the above likelihood
function, the binomial coefficient \binom{n}{m} is a constant that does not change for any value of θ, and so it can be
dropped, leading to the likelihood function being written as

L(θ | m, n) = θ^m (1 − θ)^{n−m}.

The likelihood function for m = 140 and n = 250 is shown in Figure 2a. As we can see from this plot,
the likelihood function is concentrated from around θ = 0.45 to around θ = 0.65. Outside of that range,
the likelihood of any value of θ becomes negligible. In itself, this is very informative. It tells us that the
probability of observing the data that we did, i.e., m = 140 heads in n = 250 spins, is negligible if θ is less
than around 0.45 or greater than around 0.65. Therefore, only values in the range of approximately 0.45 to
0.65 have evidential support from the data.
The value of θ that maximizes the likelihood function is obtained by the usual procedure of optimizing
functions, namely calculating the derivative of the function with respect to θ, setting the derivative equal
to zero, and solving for the value of θ. We can do this more easily by using the logarithm of the likelihood
function, rather than the likelihood function itself. The logarithm of the likelihood function is shown in
Figure 2b. Because the logarithm is a monotonic transformation, the value of θ that maximizes the logarithm
of the likelihood also maximizes the likelihood function. The derivative of the log of the likelihood function

Figure 2: a) The binomial likelihood function for θ when m = 140 and n = 250. b) The logarithm of the
likelihood function.

with respect to θ is as follows:

\frac{d}{dθ} \log\left(θ^m (1 − θ)^{n−m}\right) = \frac{d}{dθ}\left[m \log(θ) + (n − m)\log(1 − θ)\right] = \frac{m}{θ} − \frac{n − m}{1 − θ}.

Setting this derivative equal to zero and solving for θ gives us the following:

\frac{m}{θ} − \frac{n − m}{1 − θ} = 0, \quad θ = \frac{m}{n}.
Thus, the maximum likelihood estimator for θ is θ̂ = m/n = 140/250 = 0.56. This is obviously a very simple and
intuitive result. It tells us that the best guess for the value of θ, which is the probability of obtaining heads
on any given spin, is simply the relative frequency of heads in the n spins so far.
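To make this concrete, a small sketch in R: we can evaluate the log-likelihood over the interval (0, 1) and confirm numerically that it is maximized at m/n = 0.56.
m <- 140; n <- 250
log_lik <- function(theta) m * log(theta) + (n - m) * log(1 - theta)
# optimize() performs a one-dimensional numerical maximization over the interval
optimize(log_lik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
## approximately 0.56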

Sampling distribution of θ̂
The maximum likelihood estimator can be seen as a random variable. It is a deterministic function of the
observed data m, but m would vary were we to repeat the experiment even under identical circumstances. For
example, if we knew the true value of θ, let's call this true value θ⋆, and we spun the coin n times, and did so
ad infinitum, then we know that the distribution of m (the observed number of heads) would be the binomial
distribution with sample size n and parameter θ⋆. Given that the maximum likelihood estimator is always
θ̂ = m/n, the probability that θ̂ takes the value m/n for any given value of m, if the true value of θ is θ⋆, is

\binom{n}{m} θ⋆^m (1 − θ⋆)^{n−m}.

In general, the sampling distribution of θ̂ when the true value of θ is θ⋆ can be written as follows:

P(θ̂ | θ⋆, n) = \binom{n}{nθ̂} θ⋆^{nθ̂} (1 − θ⋆)^{n−nθ̂}.

The sampling distribution of θ̂ when n = 250 and for the example value of θ⋆ = 0.64 is shown in Figure 3.
The expected value of the sampling distribution of θ̂ is θ⋆; in other words, on average, θ̂ is equal to θ⋆. This
tells us that the binomial maximum likelihood estimator is an unbiased estimator of the true value of θ,

Figure 3: Sampling distribution of the binomial maximum likelihood estimator θ̂ when n = 250 and θ⋆ = 0.64.
Here, we limit the x axis to just beyond those values of θ̂ that have nontrivial probabilities.

i.e. θ⋆. The variance of the sampling distribution of θ̂ is \frac{1}{n} θ⋆(1 − θ⋆). The lower the variance of the estimator, the less
sampling variability there will be in its values. Here, we see that the variance decreases as n
increases, and so as the sample size increases, there is less variability in the estimator's values. The standard
deviation of the sampling distribution is \frac{1}{\sqrt{n}} \sqrt{θ⋆(1 − θ⋆)}. This is known as the standard error, and it often
plays an important role in classical statistical inference, as we will see in later examples.
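A quick simulation sketch illustrates these facts: we draw many samples of m from Binomial(θ⋆ = 0.64, n = 250), compute θ̂ = m/n for each, and compare the simulated mean and standard deviation with θ⋆ and the standard error formula. The seed and number of replicates here are arbitrary choices:
set.seed(101)
n <- 250; theta_star <- 0.64
theta_hat <- rbinom(1e5, size = n, prob = theta_star) / n
mean(theta_hat)                          # approximately theta_star (unbiasedness)
sd(theta_hat)                            # approximately the standard error
sqrt(theta_star * (1 - theta_star) / n)  # the standard error formula, approx 0.03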

p-values
Informally speaking, a p-value tells us whether the value of an estimator, or more generally, the value of
any statistic or function of the observed data, is consistent with some hypothetical value of the unknown
variable. If the p-value is low, the estimator’s value is not consistent with the hypothesized value. The higher
the p-value is, the more consistent the estimator’s value is with the hypothesized value. If the p-value is
sufficiently low, we can, practically speaking, rule out or reject the hypothesized value. In general, the p-value
takes values from 0 to 1 and so is an inherently continuous measure of support for a hypothesis. Where
we “draw the line” on this continuum between low and not-low is a matter of convention, but the almost
universally held convention is that a p-value must be lower than at most 0.05 to be considered sufficiently low
for the corresponding hypothesized value to be rejected. This threshold for rejection/non-rejection is usually
signified by α.
Technically speaking, p-values are tail areas of the sampling distribution of the estimator corresponding to a
particular hypothesized value of the unknown variable. Once we have the sampling distribution, p-values
are straightforward to calculate. In the current problem, the unknown variable is θ and we can in principle
hypothesize that its true value is any value between 0 and 1. If, for example, we hypothesize that the true
value of θ is θ⋆ = 0.64, the sampling distribution of θ̂ is that shown in Figure 3. On the basis of this sampling
distribution, we can see that some values of θ̂ are expected and others are not. For example, we see that the
θ̂ values are mostly from around 0.58 to around 0.7, and values of θ̂ much below or above those extremes
rarely occur. On the basis of the estimator's value that we did obtain, namely θ̂ = 140/250 = 0.56, we can see
that this result seems to be outside the range of values of θ̂ that we would expect if θ⋆ = 0.64. In order to
be precise in our statement of whether θ̂ = 0.56 is beyond what we would expect if θ⋆ = 0.64, we calculate
the tail areas of the sampling distribution defined by values as or more extreme than θ̂ = 0.56. These tail
areas are shaded in Figure 4a. The total area in these tails defines the p-value. In other words, the p-value is
the probability of observing a value of the estimator as or more extreme than θ̂ = 0.56 if θ⋆ = 0.64. If the
p-value is low, then we know that θ̂ = 0.56 is far into the tails of the sampling distribution when θ⋆ = 0.64. In
this particular example, the area of these tails, and so therefore the p-value, is approximately 0.01. This is
clearly low according to the conventional standards mentioned above, and so we say that the result
θ̂ = 0.56 is not consistent with the hypothesis that θ⋆ = 0.64; in practical terms, we can reject the
hypothesis that the true value of θ is θ⋆ = 0.64.

Figure 4: a) Sampling distribution of the binomial maximum likelihood estimator θ̂ when n = 250 and
θ⋆ = 0.64. b) Sampling distribution of the binomial maximum likelihood estimator θ̂ when n = 250 and
θ⋆ = 0.5. In both cases, as in Figure 3, we limit the x axis to just beyond those values of θ̂ that have nontrivial
probabilities. The shaded tail areas correspond to values of θ̂ that are as or more extreme, relative to the
center, than the value of the estimator we observed, which was θ̂ = 140/250 = 0.56.

In more detail, the p-value is the sum of the two tail areas of the sampling distribution when θ⋆ = 0.64. The lower
tail is defined by all those values from θ̂ = 0 up to θ̂ = 0.56. These are the values of θ̂ that are less than or equal to
the observed θ̂ of 140/250 = 0.56. The upper tail is defined by all those values from θ̂ = 0.64 + |0.64 − 0.56| = 0.72
up to 1. These are the values of θ̂ that are as extreme relative to θ⋆ = 0.64 as 0.56 is, but in the opposite direction.
Thus, the p-value is calculated as follows:

p-value = \underbrace{\int_0^{0.56} \binom{n}{nθ̂} θ⋆^{nθ̂} (1 − θ⋆)^{n−nθ̂} \, dθ̂}_{\text{area of lower tail}} + \underbrace{\int_{0.72}^{1} \binom{n}{nθ̂} θ⋆^{nθ̂} (1 − θ⋆)^{n−nθ̂} \, dθ̂}_{\text{area of upper tail}} ≈ 0.01.

More generally speaking, the precise definition of a p-value is as follows:

p-value = P(|θ̂ − θ_H| ≥ |θ̂_obs − θ_H|),

where, for clarity, we distinguish between θ̂_obs, which is the actual value of the estimator calculated from the
observed data; θ_H, which is the hypothesized value of θ; and θ̂, which is the estimator's random variable
whose distribution is the sampling distribution when θ = θ_H.
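As a sketch of this calculation in R, using the discrete binomial tails directly (this symmetric-tails calculation matches the integrals above; its final digits may differ slightly from the method used by binom.test below):
m <- 140; n <- 250; theta_H <- 0.64
# Lower tail: counts <= 140; upper tail: counts >= 0.72 * 250 = 180
sum(dbinom(0:m, size = n, prob = theta_H)) +
  sum(dbinom(180:n, size = n, prob = theta_H))
## approximately 0.01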
We can calculate p-values for binomial problems like this using R with the binom.test command. We need
to pass in the observed number of “successes”, which are heads in this case, as the value of the x parameter,
and the sample size as the value of n. The hypothesized value of θ is passed in as the value of p.
binom_model <- binom.test(x = 140, n = 250, p = 0.64)
The value of the maximum likelihood estimator is given by the value of the estimate element of the output
object.
binom_model$estimate
## probability of success
## 0.56
The p-value is given by the value of p.value element.
binom_model$p.value
## [1] 0.01004809

Null hypotheses and significance


In general, we can test any hypothetical value of the unknown variable. Often some hypothetical values have
a special meaning in that they correspond to values that entail that there is no effect of any interesting kind.
Hypotheses of this kind are known as null hypotheses. In the present example, the hypothesis that θ = 0.5
entails that the coin is completely unbiased; it is no more likely to come up heads than tails, and so any
differences in the observed numbers of heads and tails in a set of spins or flips is just a matter of chance.
As such, this hypothesis would usually be seen as a null hypothesis. The sampling distribution of θ̂ when
θ⋆ = 0.5 is shown in Figure 4b. The total tail area, which is the p-value, is calculated similarly to the example
above:

p-value = \underbrace{\int_0^{0.44} \binom{n}{nθ̂} θ⋆^{nθ̂} (1 − θ⋆)^{n−nθ̂} \, dθ̂}_{\text{area of lower tail}} + \underbrace{\int_{0.56}^{1} \binom{n}{nθ̂} θ⋆^{nθ̂} (1 − θ⋆)^{n−nθ̂} \, dθ̂}_{\text{area of upper tail}} ≈ 0.066.

Note that here the lower tail is defined up to 0.5 − |0.5 − 0.56| = 0.44. From this p-value of 0.066, we see that
this is not sufficiently low to reject the null hypothesis at the conventional α = 0.05 threshold, though of
course it is quite close to this threshold too. We can calculate this p-value using binom.test.
null_binom_model <- binom.test(x = 140, n = 250, p = 0.5) # p = 0.5 is default
null_binom_model$p.value
## [1] 0.06642115
As a general point about null hypothesis testing, given that a null hypothesis is a hypothesis of no interesting
effect, if we reject that hypothesis, we say the result is significant. Saying that a result is significant in
this sense of the term, however, is not necessarily saying much. The estimated effect may be small or even
negligible in practical terms, but may still be statistically significant. Moreover, even a highly significant p-value
does not necessarily mean a large effect in practical terms. As an example, there were 731,213 live births in
the United Kingdom in 2018. The p-value for the null hypothesis that the probability of a birth being
male rather than female is 0.5 is ≈ 3.8 × 10⁻¹¹⁹. This is a tiny p-value and so an extremely significant result.
However, the (maximum likelihood) estimate of the probability of a male birth is 0.514. Although certainly
not practically meaningless, this is nonetheless quite a small effect: it corresponds to around 28 more males
than females in every 1000 births. As such, an extremely statistically significant result can correspond to a small
effect in practical terms. In general, because the p-value for a non-null true effect will always decrease as the
sample size increases, we can always construct cases of arbitrarily small p-values for arbitrarily small effects,
and so even practically trivial effects may be highly statistically significant.

Confidence intervals
Confidence intervals are counterparts to p-values. As we’ve seen, each p-value corresponds to a particular
hypothesis about the true value of θ. If the p-value is sufficiently low, we will reject the corresponding
hypothesis. If the p-value is not sufficiently low, we cannot reject the hypothesis. However, not rejecting a
hypothesis does not entail that we should accept it; it simply means that we cannot rule it
out. The set of hypothetical values of θ that we do not reject at the α p-value threshold corresponds to the
1 − α confidence interval. Practically speaking, we can treat all values in this range as the set of plausible
values for θ.
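A sketch of this construction in R: scan a grid of hypothesized θ values, keep those whose p-value exceeds α = 0.05, and report the range. This reproduces the discretized interval [0.498, 0.622] reported below:
theta_grid <- seq(0.46, 0.66, by = 1e-3)
p_vals <- sapply(theta_grid,
                 function(theta_H) binom.test(x = 140, n = 250, p = theta_H)$p.value)
range(theta_grid[p_vals > 0.05])
## approximately 0.498 0.622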

Figure 5: The p-value for each hypothetical value of θ from 0.46 to 0.66 in steps of 10⁻³. Shaded in grey
are all those values of θ that have p-values less than α = 0.05; these values of θ would be rejected in a
hypothesis test. All other values of θ have p-values greater than α = 0.05 and so would not be rejected by a
hypothesis test. These values of θ comprise the confidence interval.

In Figure 5, all those values of θ that would be rejected by a hypothesis test at α = 0.05 are shaded in grey.
All other values of θ would not be rejected, and so are in the 1 − α = 0.95 confidence interval. The lower and upper bounds
of this interval as calculated in this figure, which is based on a discretization of the θ interval into steps of
10⁻³, are [0.498, 0.622]. The interval can be calculated exactly by using a relationship between the cumulative
binomial distribution and the cumulative beta distribution, a method known as the Clopper-Pearson method
(Clopper and Pearson 1934). Using R, we can calculate this Clopper-Pearson confidence interval as follows.
c(qbeta(0.025, 140, 250 - 140 + 1),
qbeta(1-0.025, 140+1, 250 - 140))
## [1] 0.4960662 0.6224941
This interval is also returned by binom.test.
M_binom <- binom.test(x = 140, n = 250)
M_binom$conf.int
## [1] 0.4960662 0.6224941
## attr(,"conf.level")
## [1] 0.95
Confidence intervals have the following frequentist property: in an infinite repetition of an experiment (here,
we use the term experiment in the sense in which it is used in probability theory, namely a well defined procedure
that generates data according to a specified probability distribution), the 95% confidence interval will contain
the true value of the unknown variable 95% of the time. What this means in the case of the present problem
is that if we were to repeat ad infinitum the original coin spinning experiment under identical conditions,
and so the probability of a heads outcome is a fixed though unknown value of θ, and on each repetition
calculate the 95% confidence interval, then 95% of these confidence intervals would contain the true value
of θ. In Figure 6, we provide an illustration of this phenomenon. For 100 repetitions, we generate data
from a binomial distribution with n = 250 and θ = 0.6. From the data on each repetition, we calculate the
90% confidence interval. Those intervals that do not contain θ = 0.6 are shaded in grey. In this set of 100
repetitions, there are 92 intervals out of 100 that contain θ = 0.6.
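A sketch of this kind of simulation in R; the seed is an arbitrary choice, and the exact count will vary with it:
set.seed(10101)
n <- 250; theta_true <- 0.6
covered <- replicate(100, {
  m <- rbinom(1, size = n, prob = theta_true)
  ci <- binom.test(x = m, n = n, conf.level = 0.9)$conf.int
  ci[1] <= theta_true && theta_true <= ci[2]
})
sum(covered)  # close to 90 out of 100, on average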

Figure 6: The 90% confidence intervals in 100 repetitions of a binomial experiment with n = 250 and θ = 0.6.
Shaded in grey are any confidence intervals that do not contain θ = 0.6.

Bayesian statistical inference


Bayesian approaches to statistical inference ultimately aim to solve the same problem as that of classical
approaches, namely the inference of the unknown values of variables in a statistical model. While the classical
approach is based on the calculation of estimators and their sampling distributions, Bayesian approaches rely
on an 18th century mathematical result known as Bayes’ rule or Bayes’ theorem to calculate a probability
distribution over the possible values of the unknown variable. This probability distribution, known as the
posterior distribution, gives us the probability that the unknown variable takes on any given value, contingent
on the assumptions of the model. To introduce Bayesian inference, we will continue with the same example
problem covered above.
It is important to emphasize that the choice of the statistical model is prior to and not dependent on whether
a classical or Bayesian approach to inference is used. In other words, we first assume or propose a statistical
model for the data at hand and then, in principle, we can choose to use a classical or Bayesian approach to
inference of the unknown variables in the model. For the problem at hand, as described above, our statistical
model is that Y is a random variable with a binomial probability distribution with unknown parameter θ and
sample size n = 250. As we’ve seen, we can state this as follows:
Y ∼ Binomial(θ, n = 250).
The observed number of heads, m = 140, is a realization of the random variable Y, and the probability that Y
takes on the value of m given fixed values of θ and n is given by the following probability mass function:

P(Y = m | θ, n) = Binomial(Y = m | θ, n) = \binom{n}{m} θ^m (1 − θ)^{n−m}.
As we’ve also seen, the likelihood function corresponding to this probability distribution is
L(θ | m, n) = θ^m (1 − θ)^{n−m}.
As we’ll see, the likelihood function plays a major role in Bayesian inference.

Priors
Having established our statistical model, to perform Bayesian inference on the value of θ, we must first
provide a probability distribution for the possible values that θ can take on in principle. This is known as the
prior distribution. As an example, if we assume that θ can take on any possible value in the interval [0, 1]
and each value has equal likelihood, our prior would be a uniform distribution over θ. On the other hand,
if we assume that values of θ are more likely to be equal to θ = 0.5, but may be above or below θ = 0.5
too, our prior might be a symmetrical unimodal distribution centered at θ = 0.5. How wide this unimodal
distribution is depends on what we think are the relative probabilities of values close to and far from θ = 0.5.
Likewise, if we assume that θ is likely to correspond to a bias towards heads, then the prior might be another
symmetrical unimodal distribution but centered on some value of θ greater than 0.5. Examples of priors like
this are shown in Figure 7.

Figure 7: Examples of priors over θ. a) A uniform prior. b-c) Priors where θ is more likely to be 0.5 than
otherwise, but values greater and less than 0.5 are probable too, and more so in the case of c), whose variance
is wider than that of the prior in b). d) A prior where θ is more likely to be θ = 0.6 than any other value and more
likely to be above rather than below 0.5.

There is a vast literature on what priors are, how to choose them, and the relative merits of, for example,
subjective, objective, informative, noninformative, weakly informative, reference, maximum entropy, and
other kinds of priors. We will not even attempt a summary of this vast literature here. Our general advice on
the topic is that the choice of a prior should not be automatic, but rather should be treated as a modelling
assumption, just like the assumptions that lead to our choice of the original statistical model. When choosing
the prior, just like when we choose the statistical model, we should follow a number of guiding principles:
we should use our general understanding of the problem at hand, and also of the data itself; our choices
should be reasonable and justifiable rather than arbitrary; our choices should be seen as tentative and subject
to revision if the assumptions or reasoning on which they were based are found to be invalid or wrong. More
specifically, following Gelman, Simpson, and Betancourt (2017), we recommend that our choices of priors
should be guided by whether the prior could generate the type of data that we expect to see in the problem
at hand, whether it covers the range of plausible values of the unknown parameter, and whether most of the prior mass
lies in the parts of the parameter space that are likely to generate the type of data we are modelling.
It should be emphasised, however, that although in general the choice of priors is always important, it has
more practical consequences in some problems than in others. In particular, in relatively simple models where
the ratio of observed data to unknown variables is relatively high, as is the case with the present problem,
most choices of priors, unless they are extreme, will lead to practically similar conclusions. In complex models,
on the other hand, or models where the amount of data relative to the number of unknown variables is low,
the priors play a much more important role in the final fitted model, and so the choice of priors in these contexts
needs to be more careful and judicious.
For the present problem, the parameter θ is the probability that the coin lands heads up after a spin. The
value of θ is a function of both the physical properties of the coin and also the manner in which it is spun. It
is arguably difficult to physically bias a coin so that one side is more likely in a flip (Gelman and Nolan 2002),
but experimental results with different materials can lead to outcomes that are as much as 70-30 biased
(Kerrich 1946), and precise control over how it is flipped can lead to even 100% biased outcomes (Diaconis,
Holmes, and Montgomery 2007). Coin spins, as opposed to flips, can be easily biased to as much as 75-25
bias (Gelman and Nolan 2002). From this, it seems reasonable to have a unimodal and symmetrical prior
centered at θ = 0.5, but with a relatively wide spread to allow for relatively extreme possibilities. For this, we
will use the prior displayed in Figure 7c. Clearly, the most likely value of θ according to this prior is θ = 0.5,
but values of θ = 0.25 or θ = 0.75 have substantial prior mass in their vicinity, and even extreme values of
greater than θ = 0.9 or less than θ = 0.1 have non-trivial prior mass.
The prior that we have chosen (displayed in Figure 7c) is a beta distribution, which is a probability distribution
over the interval from 0 to 1. There are two parameters in the beta distribution, which we call hyperparameters
to distinguish them from the parameters of the statistical model, and these are conventionally denoted by α
and β. In the particular beta distribution that we have chosen their values are α = 5 and β = 5. The density
function of any beta distribution is
P(θ | α, β) = \frac{Γ(α + β)}{Γ(α) Γ(β)} θ^{α−1} (1 − θ)^{β−1},

and its mean and variance are

⟨θ⟩ = \frac{α}{α + β}, \quad Var(θ) = \frac{αβ}{(α + β)^2 (α + β + 1)}.

For our choice of prior therefore, the mean of the distribution is 0.5 and the variance is 0.023 (and so the
standard deviation is 0.151).
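For instance, we can verify these prior summaries with a quick check in R:
alpha <- 5; beta <- 5
alpha / (alpha + beta)                                        # prior mean: 0.5
alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))        # prior variance: approx 0.023
sqrt(alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1)))  # prior sd: approx 0.151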

Bayes’ rule and the posterior distribution


Having specified our statistical model and our prior, we can now write out the full Bayesian model.

Y ∼ Binomial(θ, n = 250),
θ ∼ Beta(α = 5, β = 5).

We can view this model as an expanded generative model of the observed data. In other words, to generate
hypothetical data sets, first, we sample a value of θ from a beta distribution with hyperparameters α = β = 5,
and then we sample a value from a binomial distribution with parameter θ and sample size n = 250. This
expanded model therefore defines a joint probability distribution over Y and θ, conditional on α, β and n,
i.e. P(Y, θ | α, β, n). We will use this joint probability distribution, coupled with the fact that we observe the
value of Y , to calculate a probability distribution over the possible values of θ conditional on all the observed
data and the assumptions of the model. This is known as the posterior distribution, and is the central result
in Bayesian inference. To understand how the posterior distribution is calculated, we must introduce some
elementary results from probability theory.
If we have two random variables A and B, an elementary probability theory result shows how the joint
probability distribution of A and B can be factored into products of conditional probability distributions and
marginal probability distributions:

\underbrace{P(A, B)}_{\text{joint}} = \underbrace{P(A|B)}_{\text{conditional}} \underbrace{P(B)}_{\text{marginal}} = \underbrace{P(B|A)}_{\text{conditional}} \underbrace{P(A)}_{\text{marginal}}.

In other words, the joint probability of A and B is equal to the conditional probability of A given B times
the probability of B, or equally, it is equal to the conditional probability of B given A times the probability
of A. Another elementary result is how we calculate marginal probability distributions from summations over
joint distributions:

P(A) = \sum_{\{B\}} P(A, B) = \sum_{\{B\}} P(A|B) P(B),
P(B) = \sum_{\{A\}} P(A, B) = \sum_{\{A\}} P(B|A) P(A),

where {A} and {B} in the summations indicate the set of all values of A and B, respectively.

We can illustrate these results easily using any joint probability distribution table. For example, we can use
the following numbers, which give the number of males and females who survived or not on the RMS Titanic.
There were 2201 people on board, and so we can divide the numbers in each category by 2201 to obtain a
joint probability distribution table.

           Perished   Survived   Total
Male       1364       367        1731
Female     126        344        470
Total      1490       711        2201

           Perished   Survived   Total
Male       0.620      0.167      0.786
Female     0.057      0.156      0.214
Total      0.677      0.323      1.000
Note that the row totals and column totals are provided here; these give the probabilities of a person on
board the Titanic being male or female, and of surviving or perishing, respectively. From the joint probability distribution,
we can get two conditional probability distributions. The first tells us the probability of surviving or not
given that we know that the person is male or female:

           Perished   Survived
Male       0.788      0.212
Female     0.268      0.732

The second conditional probability distribution table tells us the probability of being male or female given
that the person survived or not:

           Perished   Survived
Male       0.915      0.516
Female     0.085      0.484

Using the joint table, we can see that the probability of a person being both male and a survivor is

P(Sex = Male, Survival = Survived) = 0.167.
This is equal to the probability of being a male given that the person survived times the probability of being
a survivor.
P(Sex = Male|Survival = Survived)P(Survival = Survived) = 0.516 × 0.323 = 0.167.
It is also equal to the probability of being a survivor given that the person is a male
P(Survival = Survived|Sex = Male)P(Sex = Male) = 0.212 × 0.786 = 0.167.
The same holds for any other element of the joint probability table. Using these tables we can also calculate
marginal probabilities.
P(Sex = Male) = P(Sex = Male, Survival = Survived) + P(Sex = Male, Survival = Perished),
= P(Sex = Male|Survival = Survived)P(Survival = Survived)
+ P(Sex = Male|Survival = Perished)P(Survival = Perished),
= 0.516 × 0.323 + 0.915 × 0.677,
= 0.786
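A quick sketch verifying this arithmetic in R, using the joint probability table above (row and column sums differ from the rounded totals only in the last digit):
joint <- matrix(c(0.620, 0.057, 0.167, 0.156), nrow = 2,
                dimnames = list(Sex = c("Male", "Female"),
                                Survival = c("Perished", "Survived")))
rowSums(joint)  # marginal P(Sex): approximately 0.786, 0.214
colSums(joint)  # marginal P(Survival): approximately 0.677, 0.323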

With these elementary results from probability theory, we end up with the following uncontroversial result
known as Bayes' rule:

P(B|A) = \frac{P(A|B) P(B)}{\sum_{\{B\}} P(A|B) P(B)}.

This can be easily derived as follows:

P(B|A) P(A) = P(A|B) P(B),
P(B|A) = \frac{P(A|B) P(B)}{P(A)} = \frac{P(A|B) P(B)}{\sum_{\{B\}} P(A|B) P(B)}.
We can use this result to solve elementary probability puzzles like the following².
Box A has 10 lightbulbs, of which 4 are defective. Box B has 6 lightbulbs, of which 1 is defective.
Box C has 8 lightbulbs, of which 3 are defective. A box is chosen at random and a bulb drawn from it
at random. If the chosen bulb is nondefective, what is the probability it came from Box C?
Here, we are being asked for P(Box = C | Bulb = Working). By Bayes' rule, this is

P(Box = C | Bulb = Working) = \frac{P(Bulb = Working | Box = C) P(Box = C)}{P(Bulb = Working)},

where the denominator here is

P(Bulb = Working) = P(Bulb = Working | Box = A) P(Box = A)
                  + P(Bulb = Working | Box = B) P(Box = B)
                  + P(Bulb = Working | Box = C) P(Box = C).

The conditional probabilities of drawing a working bulb from boxes A, B, and C are given by knowing the
number of bulbs in each box and the number of defective bulbs in each, and so are 6/10, 5/6, and 5/8, respectively.
The marginal probabilities of boxes A, B, and C are 1/3, 1/3, and 1/3. From this, we have

P(Box = C | Bulb = Working) = \frac{\frac{5}{8} \cdot \frac{1}{3}}{\frac{6}{10} \cdot \frac{1}{3} + \frac{5}{6} \cdot \frac{1}{3} + \frac{5}{8} \cdot \frac{1}{3}} = \frac{5/8}{6/10 + 5/6 + 5/8} ≈ 0.304.
2 This problem is taken from Schaum’s Outlines: Probability (2nd ed., 2000), pages 87-88.
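We can verify this puzzle numerically with a short R sketch:
prior <- c(A = 1/3, B = 1/3, C = 1/3)
likelihood <- c(A = 6/10, B = 5/6, C = 5/8)  # P(working | box)
posterior <- likelihood * prior / sum(likelihood * prior)
posterior["C"]
## approximately 0.304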

Returning now to our joint distribution over Y and θ, i.e., P(Y, θ | α, β, n), from this, we have the following:

P(θ | Y, α, β, n) = \frac{P(Y | θ, n) P(θ | α, β)}{\int P(Y | θ, n) P(θ | α, β) \, dθ}.

The left hand side is the posterior distribution. There are three components on the right hand side. First,
there is P(Y | θ, n). Here, the values of Y and n are known and θ is a free variable, and so P(Y | θ, n) is a
function over θ. This is the likelihood function over θ that we have already encountered, and which we can
also write as L(θ | Y, n). Second, there is P(θ | α, β), which is the prior. Like the likelihood function, this is
also a function over the set of all possible values of θ. The third component is the denominator, which is the
integral of the product of the likelihood function and the prior, integrated over all possible values of θ. This
integral is known as the marginal likelihood, and is a single value that gives the area under the curve of the
function that is the product of the likelihood function and the prior. We can write this as follows:

\underbrace{P(θ | Y, α, β, n)}_{\text{posterior}} = \frac{\overbrace{L(θ | Y, n)}^{\text{likelihood}} \; \overbrace{P(θ | α, β)}^{\text{prior}}}{\underbrace{\int L(θ | Y, n) P(θ | α, β) \, dθ}_{\text{marginal likelihood}}}.

What this tells us is that having observed the data, the probability distribution over the possible values of θ
is the normalized product of the likelihood function and the prior over θ. The prior tells us what values θ
could take on in principle. The likelihood function effectively tells us the evidence in favour of any given value
of θ according to the data. We multiply these two functions together (the numerator above) and then divide
by the area under the curve of this product of functions (the marginal likelihood, which is the denominator).
Dividing by the marginal likelihood ensures that the area under the curve of the posterior is normalized so
that its integral is exactly 1.
Filling out the details of this equation, we have the following:

P(θ | Y, α, β, n) = \frac{L(θ | Y, n) P(θ | α, β)}{\int L(θ | Y, n) P(θ | α, β) \, dθ}
                 = \frac{θ^m (1 − θ)^{n−m} \cdot \frac{Γ(α+β)}{Γ(α)Γ(β)} θ^{α−1} (1 − θ)^{β−1}}{\int θ^m (1 − θ)^{n−m} \cdot \frac{Γ(α+β)}{Γ(α)Γ(β)} θ^{α−1} (1 − θ)^{β−1} \, dθ}
                 = \frac{θ^{m+α−1} (1 − θ)^{n−m+β−1}}{\int θ^{m+α−1} (1 − θ)^{n−m+β−1} \, dθ}.

The integral evaluates as follows:

\int θ^{m+α−1} (1 − θ)^{n−m+β−1} \, dθ = \frac{Γ(m + α) Γ(n − m + β)}{Γ(n + α + β)}.

From this, we have

P(θ | Y, α, β, n) = \frac{Γ(n + α + β)}{Γ(m + α) Γ(n − m + β)} θ^{m+α−1} (1 − θ)^{n−m+β−1}.
For the case of our data and prior, i.e., m = 140, n = 250, α = β = 5, the posterior, likelihood, and prior are
shown in Figure 8.
This formula for the posterior distribution is, in fact, a beta distribution with hyperparameters m + α and
n − m + β, respectively. This situation, where the posterior distribution is of the same parametric family as
the prior, is an example of a conjugate prior. A conjugate prior is a prior that, when combined with a particular
likelihood function, yields a posterior distribution of the same probability distribution family. In this case, the
beta distribution prior, when used with the binomial likelihood, always leads to a posterior distribution that
is also a beta distribution.


Figure 8: The posterior, likelihood, and prior in a binomial problem with m = 140 and n = 250, and where
the prior is a beta distribution with α = 5, β = 5. In b), the same functions are plotted as in a), but over a
limited range of the x-axis in order to make the difference between the posterior and the likelihood more
apparent. Note also that the likelihood function is scaled so that it integrates to 1. This is simply to make it
easier to visualize on the same plot as the prior and posterior. Scaling the likelihood by an arbitrary positive
quantity does not affect the calculation of the posterior.

Having a mathematical formula for the posterior distribution, as we do here, is an example of an analytic
solution to the posterior distribution. This is not always possible. In other words, it is not always possible to
have a mathematical expression with a finite number of terms and operations that describes the posterior
distribution exactly. When an analytic solution is not possible, and in fact it is only possible in a relatively
limited number of cases, we must rely on numerical methods to evaluate the posterior distribution, and in
particular, on Monte Carlo methods. We will return to this important topic in detail below.

Posterior summaries
The result of any Bayesian inference of any unknown variable is the posterior distribution. On the assumption
that the statistical model and the prior are valid, the posterior distribution provides us with everything that is
known about the true value of the variable. For the current problem, this is P(θ | Y = m, α = 5, β = 5, n = 250),
which is a beta distribution with hyperparameters m + α and n − m + β. From the properties of the beta
distribution mentioned above, the mean and standard deviation of this posterior are as follows:

⟨θ⟩ = \frac{α + m}{n + α + β} = 0.558, \quad SD(θ) = \sqrt{\frac{(α + m)(β + n − m)}{(α + β + n)^2 (α + β + n + 1)}} = 0.031.
We can also use the cumulative distribution function of the beta distribution to determine the 2.5th and
97.5th percentiles of the posterior. Between these two bounds, there is 95% of the posterior probability mass.
We can calculate this using R with the qbeta command, which is the inverse of the cumulative distribution
function of the beta distribution.
n <- 250
m <- 140
alpha <- 5
beta <- 5
c(qbeta(0.025, m + alpha, n - m + beta),
qbeta(0.975, m + alpha, n - m + beta))
## [1] 0.4970679 0.6174758
This is a posterior interval, also known as a credible interval. More precisely, this interval is the central or
percentile posterior interval.
Another posterior interval of interest is known as the high posterior density (hpd) interval. The precise
definition of the hpd is as follows. The ϕ hpd interval for the posterior density function f (θ) is computed by
finding a probability density value p∗ such that
P({θ : f (θ) ≥ p∗ }) = ϕ.
In other words, we find the value p∗ such that the probability mass of the set of points whose density is
greater than p∗ is exactly ϕ. By this definition, no value of θ outside the hpd has a higher density than
any value within the hpd. The hpd is also the shortest posterior interval containing probability mass ϕ. The hpd as precisely
defined here is not easily calculated, and in general requires an optimization procedure to solve for p∗. In this
model, using a numerical optimization technique, we calculate it to be (0.4974, 0.6177). This is obviously
very close to the quantile based posterior interval defined above.
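Since, as noted above, the hpd interval of a unimodal posterior is the shortest interval containing the required mass, one simple numerical approach in R is to minimize the interval width over the lower tail probability (a sketch):
a <- 140 + 5; b <- 250 - 140 + 5  # posterior hyperparameters m + alpha and n - m + beta
width <- function(lo) qbeta(lo + 0.95, a, b) - qbeta(lo, a, b)
opt <- optimize(width, interval = c(0, 0.05))  # search over lower tail probabilities
c(qbeta(opt$minimum, a, b), qbeta(opt$minimum + 0.95, a, b))
## approximately 0.4974 0.6177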
The posterior mean and the hpd interval can be compared to the maximum likelihood estimator and confidence
interval in classical inference. Recall that the maximum likelihood estimator was 0.56 and the 95% confidence
interval was (0.496, 0.622). By contrast, the posterior mean and 95% central posterior interval are 0.558 and
(0.497, 0.617), respectively. While not identical, these are very close and for practical purposes are probably
indistinguishable. This illustrates that classical and Bayesian methods can be, especially in simple models,
practically equivalent.

Monte Carlo sampling


In the example problem that we have been discussing, the prior we chose was a beta distribution. As
we’ve seen, when this prior is used with a binomial likelihood function, the posterior distribution is also a
beta distribution. Thus, we have a relatively simple mathematical expression for the posterior distribution.
Moreover, we can now use the properties of the beta distribution to determine the posterior mean, standard
deviation, posterior intervals, etc. As mentioned, this is an example of an analytic solution to the posterior
distribution. This is not always possible. Consider, for example, the posterior distribution when we change
the prior to a logit-normal distribution. This is a normal distribution over log(θ/(1 − θ)), which is the log odds
of θ. A plot of this prior for the case of a zero mean normal distribution with standard deviation τ = 0.5 is
shown in Figure 9a. In Figure 9b, we show a beta distribution over θ with parameters α = β = 8.42, which is
virtually identical to the N(0, 0.5²) distribution over log(θ/(1 − θ)).

Despite the similarities of these two priors, if we use the normal distribution over the log odds, the posterior
distribution is as follows:

P(θ | m, n, τ) = \frac{1}{Z} θ^m (1 − θ)^{n−m} \frac{1}{θ(1 − θ)} e^{−\frac{\mathrm{logit}(θ)^2}{2τ^2}},

Figure 9: a) A zero mean normal distribution with standard deviation τ = 0.5 over log(θ/(1 − θ)). b) A beta
distribution with parameters α = β = 8.42.

where

Z = \int θ^m (1 − θ)^{n−m} \frac{1}{θ(1 − θ)} e^{−\frac{\mathrm{logit}(θ)^2}{2τ^2}} \, dθ.

This does not simplify to an analytic expression, and it is not a known probability density function with well
documented properties. The problem here arises primarily because of the integral, which does not have an
analytic solution. In cases like this, we need to resort to numerical alternatives to the analytic solution. For
simple models such as this one, there are many options for how to do this. However, a general approach
that applies to all Bayesian models, including and especially complex and high dimensional (i.e. with large
numbers of unknown variables) models, is to use Monte Carlo sampling methods.
Monte Carlo methods, first introduced by Metropolis and Ulam (1949), can be generally defined as numerical
methods for approximating mathematical expectations of random variables. If X is a random variable of
dimensionality d whose probability distribution is P(X), and φ(X) is a function of X, then the expectation
or expected value of φ(X) is

⟨φ(X)⟩ = \int φ(x) P(X = x) \, dx.

This can be approximated by

⟨φ(X)⟩ ≈ \frac{1}{n} \sum_{i=1}^{n} φ(x_i),

where x₁, x₂ . . . xₙ are n samples from P(X). Particularly important is the fact that the error of this
approximation decreases as a function of \sqrt{n} and is independent of d, the dimensionality of X.
Quantities of interest related to the probability distribution P(X), such as the mean, variance, or probabilities
of being above or below a certain value, can all be expressed as expectations:
$$\langle X \rangle = \int x\, \mathrm{P}(X = x)\, dx, \qquad \mathrm{V}(X) = \int (x - \langle X \rangle)^2\, \mathrm{P}(X = x)\, dx, \qquad \mathrm{P}(X > x_0) = \int \mathrm{I}(x > x_0)\, \mathrm{P}(X = x)\, dx.$$
They can, therefore, be approximated as follows:
$$\langle X \rangle \approx \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \mathrm{V}(X) \approx \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \mathrm{P}(X > x_0) \approx \frac{1}{n}\sum_{i=1}^{n} \mathrm{I}(x_i > x_0).$$

What this entails for Bayesian inference is that if we can draw samples from the posterior distribution, Monte
Carlo methods can be used to calculate all quantities of interest related to the posterior. Moreover, this
can be done in high dimensional problems without encountering the exponential rise in approximation error
known as the curse of dimensionality.
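For example, a minimal R sketch of these approximations, using samples from a known distribution (a standard normal, chosen here purely for illustration) so that the approximations can be checked against the true values:
x <- rnorm(1e5)        # 100,000 samples from P(X) = N(0, 1)
mean(x)                # approximates <X>, whose true value is 0
mean((x - mean(x))^2)  # approximates V(X), whose true value is 1
mean(x > 1)            # approximates P(X > 1), whose true value is about 0.159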
There are many Monte Carlo methods for drawing samples from posterior probability distributions. Here, we
will describe just two: the Metropolis sampler, and the Hamiltonian Monte Carlo (hmc) sampler. Both of
these are Markov Chain Monte Carlo (mcmc) samplers, as we will see. The Metropolis sampler was one of
the earliest mcmc samplers, introduced by Metropolis et al. (1953), and has traditionally been one of the
most widely used samplers in Bayesian analysis. The hmc sampler is an extension of the Metropolis sampler.
It is the main sampler used in the Stan probabilistic programming language, which we will use extensively
throughout the remainder of this book.
In order to discuss this topic generally, rather than just for a specific problem or model, we will assume
that our observed data is D and that the unknown variable(s) that we are trying to infer is a d dimensional
variable θ. The posterior distribution is

$$\mathrm{P}(\theta|D) = \frac{L(\theta|D)\,\mathrm{P}(\theta)}{\int L(\theta|D)\,\mathrm{P}(\theta)\, d\theta},$$

where L(θ|D) is the likelihood function and P(θ) is the prior. We can rewrite the posterior distribution as
$$\mathrm{P}(\theta|D) = \frac{1}{Z} f(\theta), \qquad f(\theta) = L(\theta|D)\, g(\theta), \qquad Z = \int L(\theta|D)\, g(\theta)\, d\theta.$$

Here, f(θ) is the unnormalized posterior distribution (where for notational simplicity we drop explicit reference
to D), Z is the normalization constant, and g(θ) is the prior distribution.

Metropolis sampler
For the Metropolis sampler, we need only be able to evaluate f(θ) at any given value of θ, and we do not
need to know the value of Z. Evaluating f(θ) is often very straightforward even in complex models. If we can
evaluate the likelihood function at θ and the (normalized or unnormalized) prior at θ, we simply multiply their
values together to obtain f(θ). To draw samples using the Metropolis sampler, we begin with an arbitrary
initial value θ̃0, and then use a proposal distribution based around θ̃0 to propose a new point in the θ space,
which we will call θ̃′. We can write this proposal distribution Q(θ̃′|θ̃0). In the standard implementation of
the Metropolis sampler, the proposal distribution must be symmetric, such that for any two values θ̃i and θ̃j,
Q(θ̃i|θ̃j) = Q(θ̃j|θ̃i). In the Metropolis-Hastings variant of the Metropolis sampler, as we will see, the proposal
distribution need not be symmetric.

Having sampled θ̃′, if f(θ̃′) ≥ f(θ̃0), we accept the proposed new point and set θ̃1 = θ̃′. We then propose a
new point in θ space using our proposal distribution centered at θ̃1. If, on the other hand, f(θ̃′) < f(θ̃0), we
accept θ̃′ with probability
$$\frac{f(\tilde{\theta}')}{f(\tilde{\theta}_0)}.$$
If we accept it, then we set θ̃1 = θ̃′ and propose a new point using our proposal distribution centered at θ̃1. If
we reject θ̃′, we then propose a new point using the proposal distribution centered at θ̃0, and repeat the
above steps.
Continuing in this way, we produce a sequence of samples θ̃0, θ̃1, θ̃2 ... that are realizations of random variables
θ0, θ1, θ2 .... Because each variable is dependent on the immediately preceding variable, and not on any
variable before that, these random variables form a first-order Markov chain. The marginal probability
distributions of these variables are π0(θ0), π1(θ1), π2(θ2) .... The first distribution, π0(θ0), is an arbitrary
starting distribution. The nature of the second distribution, π1(θ1), can be understood as follows. If θ0 = θ̃a,
then the probability that θ1 = θ̃b will be Q(θ̃′ = θ̃b | θ̃a) if f(θ̃b) ≥ f(θ̃a), and will be Q(θ̃′ = θ̃b | θ̃a) × f(θ̃b)/f(θ̃a)
otherwise. Thus, π1(θ1) is
$$\pi_1(\theta_1) = \int \underbrace{\min\left(\frac{f(\theta_1)}{f(\theta_0)},\, 1\right) Q(\theta_1|\theta_0)}_{T(\theta_1|\theta_0)}\, \pi_0(\theta_0)\, d\theta_0.$$

By extension, for any i, we have
$$\pi_i(\theta_i) = \int T(\theta_i|\theta_{i-1})\, \pi_{i-1}(\theta_{i-1})\, d\theta_{i-1}.$$

It can be shown that under some general and minimal conditions³ on Markov chains, the sequence of probability
distributions π0(θ0), π1(θ1), π2(θ2) ... will converge upon a unique distribution that we will label π(θ). This
is known as the invariant or stationary distribution of the Markov chain. Furthermore, if we have a function
ψ(θ) that satisfies
$$T(\theta_i = \tilde{\theta}_b \,|\, \theta_{i-1} = \tilde{\theta}_a)\, \psi(\theta_{i-1} = \tilde{\theta}_a) = T(\theta_i = \tilde{\theta}_a \,|\, \theta_{i-1} = \tilde{\theta}_b)\, \psi(\theta_{i-1} = \tilde{\theta}_b)$$
for any two states θ̃a and θ̃b, then ψ(θ) is this invariant distribution. We can see that ψ(θ) is the posterior
distribution as follows. First, we will assume that f(θ̃a) > f(θ̃b); if this is not the case, we simply reverse the
labels of θ̃a and θ̃b. Then we have the following.
$$T(\tilde{\theta}_b|\tilde{\theta}_a)\,\psi(\tilde{\theta}_a) = T(\tilde{\theta}_a|\tilde{\theta}_b)\,\psi(\tilde{\theta}_b),$$
$$\frac{f(\tilde{\theta}_b)}{f(\tilde{\theta}_a)}\, Q(\tilde{\theta}_b|\tilde{\theta}_a)\,\psi(\tilde{\theta}_a) = Q(\tilde{\theta}_a|\tilde{\theta}_b)\,\psi(\tilde{\theta}_b).$$
Because of the symmetry of the proposal distribution, we have
$$\frac{f(\tilde{\theta}_b)}{f(\tilde{\theta}_a)}\,\psi(\tilde{\theta}_a) = \psi(\tilde{\theta}_b), \qquad \text{and so} \qquad \frac{f(\tilde{\theta}_b)}{f(\tilde{\theta}_a)} = \frac{\psi(\tilde{\theta}_b)}{\psi(\tilde{\theta}_a)}.$$
From this we have
$$\psi(\theta) = \frac{1}{Z} f(\theta)$$
being the invariant distribution of the Markov chain.
In Figure 10, we show some of the sequence of distributions π0, π1, π2 ... of a Metropolis sampler as it
converges to the true posterior distribution, which is also shown. For this illustration, we use a binomial
problem like the one above, but where the data are m = 14 and n = 25, use a logit-normal prior, and assume
a uniform proposal distribution. From this illustration, we see a relatively quick convergence to the true
posterior distribution.
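Although the code used to produce Figure 10 is not shown, the sequence of distributions can be computed by discretizing θ onto a grid and repeatedly applying the transition kernel T. The following R sketch does this under stated assumptions: a grid of 100 values, the m = 14, n = 25 data, τ = 0.5, and a uniform proposal; the grid size and endpoints are illustrative choices.
logit <- function(x) log(x / (1 - x))
theta_grid <- seq(0.005, 0.995, length.out = 100)
# unnormalized posterior f at each grid point (m = 14, n = 25, tau = 0.5)
f_vals <- theta_grid^13 * (1 - theta_grid)^10 *
  exp(-logit(theta_grid)^2 / (2 * 0.5^2))
# T_mat[b, a]: probability of moving from grid point a to grid point b,
# i.e. the uniform proposal probability times min(f(b)/f(a), 1)
T_mat <- (1 / 100) * pmin(outer(f_vals, f_vals, "/"), 1)
# rejected proposals leave the chain where it is: add that mass to the diagonal
diag(T_mat) <- diag(T_mat) + (1 - colSums(T_mat))
pi_i <- rep(1 / 100, 100) # arbitrary (uniform) starting distribution
for (iter in 1:50) pi_i <- as.vector(T_mat %*% pi_i) # converges as in Figure 10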
The Metropolis-Hastings variant of the original Metropolis sampler allows for asymmetric proposal distri-
butions. Given a current sample θ̃i and a proposed new sample θ̃′, sampled from the proposal distribution
Q(θ̃′|θ̃i), we accept θ̃′ if
$$f(\tilde{\theta}')\, Q(\tilde{\theta}_i|\tilde{\theta}') \geq f(\tilde{\theta}_i)\, Q(\tilde{\theta}'|\tilde{\theta}_i),$$
and otherwise we accept it with probability
$$\frac{f(\tilde{\theta}')\, Q(\tilde{\theta}_i|\tilde{\theta}')}{f(\tilde{\theta}_i)\, Q(\tilde{\theta}'|\tilde{\theta}_i)}.$$
³These are irreducibility, which is the non-zero probability of eventually transitioning from any one state to any other;
aperiodicity, which is having no perfectly cyclic state trajectories; and non-transience, which is the non-zero probability of
returning to any given state.

Figure 10: Convergence of a Metropolis sampler to the posterior distribution having started at an arbitrary
distribution. The panels show the initial distribution; the distributions after 2, 3, 5, and 50 iterations; and
the true posterior.

Sampling in this way, as with the original Metropolis sampler, we converge upon an invariant probability
distribution over θ that is the posterior distribution (1/Z)f(θ).
To illustrate the Metropolis algorithm, we will use our binomial problem with the logit normal distribution.
The likelihood function at any value of θ is

$$L(\theta|m, n) = \theta^m (1-\theta)^{n-m}.$$

The unnormalized logit normal prior is
$$g(\theta) = \frac{1}{\theta(1-\theta)}\, e^{-\frac{|\mathrm{logit}(\theta)|^2}{2\tau^2}}.$$

From this, we have
$$f(\theta) = \theta^{m-1} (1-\theta)^{n-m-1}\, e^{-\frac{|\mathrm{logit}(\theta)|^2}{2\tau^2}}.$$
This can be easily implemented using R as follows:
logit <- function(x) log(x / (1 - x))

# unnormalized posterior; the defaults are the data of our running example
f <- function(theta, m = 140, n = 250, tau = 0.5) {
  theta^(m - 1) * (1 - theta)^(n - m - 1) * exp(-logit(theta)^2 / (2 * tau^2))
}
For the proposal distribution, for simplicity, we will use a uniform distribution on (0, 1). This proposal ignores the current state, and so is trivially symmetric.

proposal <- function(theta_0) {
  # uniform proposal on (0, 1); the current state theta_0 is ignored
  runif(1)
}
The following code implements a Metropolis sampler that draws 100 samples.
nsamples <- 100
i <- 1
theta <- numeric(length = nsamples)
theta[i] <- runif(1) # initial sample from U(0,1)

while (i < nsamples) {
  theta_star <- proposal(theta[i])           # proposed sample
  p <- min(f(theta_star) / f(theta[i]), 1)   # prob. of acceptance
  if (runif(1) <= p) {
    i <- i + 1
    theta[i] <- theta_star
  }
}
The trajectories of samples from two separate chains, each started at opposite ends of the θ space, are shown
in Figure 11a. These plots are known as trace-plots. As can be seen, both trajectories quickly converge
upon the same area of θ space. The histogram of samples from one chain is shown in Figure 11b.

Figure 11: a) The trajectory, or trace-plot, of 100 samples from two separate chains of a Metropolis sampler
for the binomial problem using a logit-normal prior. b) The histogram of ten thousand samples from one of
the samplers.
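For instance, to reproduce something like the ten-thousand-sample histogram in Figure 11b, the sampler above can be wrapped in a reusable function; metropolis below is a helper of our own, not a library function:
metropolis <- function(nsamples, f, proposal) {
  theta <- numeric(nsamples)
  i <- 1
  theta[i] <- runif(1)                        # initial sample from U(0, 1)
  while (i < nsamples) {
    theta_star <- proposal(theta[i])          # proposed sample
    p <- min(f(theta_star) / f(theta[i]), 1)  # probability of acceptance
    if (runif(1) <= p) {
      i <- i + 1
      theta[i] <- theta_star
    }
  }
  theta
}
samples <- metropolis(10000, f, proposal)
hist(samples, breaks = 50, xlab = expression(theta), main = NULL)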

Hamiltonian Monte Carlo


The Metropolis algorithm suffers from the fact that its proposal distribution takes random steps in parameter
space. Hamiltonian Monte Carlo (hmc), originally known as Hybrid Monte Carlo, aims to overcome this
limitation by taking account of the gradient of the posterior distribution when making its proposals. Both
theoretically and in terms of its technical details, hmc is considerably more complex than the Metropolis
method (see Betancourt 2017; Neal 2011). However, hmc is extremely efficient and is now widely used in
Bayesian data analysis, and so any contemporary introduction to mcmc would be incomplete without
covering it.

To appreciate hmc, we return to the fact that in any Bayesian analysis, our results are ultimately expectations
of functions according to the posterior distribution. Thus, if the posterior distribution over our unknown
variables is P(θ|D), the expected value of some function φ(θ) according to the posterior is
$$\langle \phi(\theta) \rangle = \int \phi(\theta)\, \mathrm{P}(\theta|D)\, d\theta.$$

The value of this integral is largely determined by where P(θ|D) is massed. In high dimensional spaces, most
of this mass is concentrated in a thin shell far from the mode of the distribution that is known as the typical
set (see Betancourt 2017). All mcmc methods ideally sample from this typical set. However, because the
typical set is a thin shell, the random proposals made by the Metropolis sampler will be mostly to regions of
low probability density outside of it, which will be rejected, leading to inefficiency. Likewise, those proposals
that will be accepted will often be just those close to the current state of the sampler, and in this sense, the
sampler becomes stuck in one region and does not explore the typical set.
hmc tries to overcome these problems of random walk Metropolis by using the gradient of the posterior
distribution to guide its proposals of new points to sample. For any given value of θ, the gradient of the
posterior distribution is evaluated. This indicates the directions with higher or lower posterior density. Were
we to simply follow the gradient, we would end up moving away from the typical set and toward the mode
of the distribution. While the mode will have high density, it will have infinitesimally low volume in high
dimensional problems, and so is not part of the typical set as defined above. In order to remain within the
typical set, therefore, each point in θ space is given a momentum value, denoted by r, which gives a direction
and velocity to the point. The point speeds up as it moves toward areas of high density, and slows and
reverses in regions of low density.
In order to implement this system, hmc represents the value of θ as a point in a physical space that has a
potential energy and a kinetic energy. The potential energy is determined by the point's position, and can be
seen as the effect of gravity. The kinetic energy is determined by the momentum of the point. Intuitively, we
can imagine this as a smooth (frictionless) hilly landscape with a ball moving over it. If we
kick the ball, we give it some momentum and it will move in a certain direction with a certain speed. If that
direction is downwards, under the effect of gravity, it will speed up and then when it gets to the bottom of
the hill, its momentum will allow it to continue moving and it will start to roll up the side of another hill.
Eventually, the effect of gravity will start to slow it down, and it will then reverse and roll back down the hill,
and so on. The movement of the ball is physically determined by its position and its momentum and their
respective potential and kinetic energies.
In more detail, the potential energy at each point is given by
$$U(\theta) = -\log(f(\theta)),$$

where f(θ) ∝ P(θ|D) as defined above. This is equivalent to
$$\mathrm{P}(\theta|D) = \frac{1}{Z}\, e^{-U(\theta)}.$$
From this perspective, the hmc sampler can be seen as a physical system with an infinite number of states,
each of which has a potential energy associated with it. The probability of the system being in any state is
determined by the potential energy associated with that state⁴. The lower the energy, the more likely the
system is to be in that state; the higher the energy, the less likely it is to be in that state. In addition to its
potential energy, each state also has a kinetic energy, denoted K(r), which is determined by the momentum
with value r. In one dimension, if the point has unit mass, the kinetic energy is K(r) = r²/2, and in d
dimensions, it is
$$K(r) = \frac{1}{2} \sum_{k=1}^{d} r_k^2.$$
⁴In statistical mechanics, a physical system where the probability of being in any given state is proportional to e to the power
of the negative of the energy at that state is said to have a Boltzmann distribution.

Given this energy function, the probability distribution corresponding to it is P(r) ∝ e^{−K(r)}, which is a
d-dimensional standard normal distribution. The total energy in this system is given by the Hamiltonian
function:
$$H(\theta, r) = U(\theta) + K(r).$$
The change in the position and the momentum is now determined by classical mechanics and is given by the
Hamiltonian equations:
$$\frac{d\theta_k}{dt} = \frac{\partial H}{\partial r_k} = \frac{\partial K(r)}{\partial r_k}, \qquad \frac{dr_k}{dt} = -\frac{\partial H}{\partial \theta_k} = -\frac{\partial U(\theta)}{\partial \theta_k},$$
where dθk/dt and drk/dt are the rates of change, at dimension k, of the position θ and the momentum r,
respectively, and ∂H/∂rk and ∂H/∂θk are the partial derivatives of the Hamiltonian function with respect to
rk and θk, respectively.
Now, if we start at any point in θ space and give this point a momentum, the Hamiltonian equations will
determine how the point will move through this space under the actions of both the potential and kinetic
energy.
In order to simulate this continuous dynamical system, which is governed by differential equations, on a
computer, we must use a discrete approximation to its dynamics. Widely used methods such as Runge-Kutta
methods or Euler's method are possible, but a more suitable method is the leapfrog algorithm, which involves
taking steps, or half-steps, of size δ as follows:
$$r_k \leftarrow r_k - \frac{\delta}{2} \frac{\partial U(\theta)}{\partial \theta_k}, \quad \text{first half-step in } r \text{ space},$$
$$\theta_k \leftarrow \theta_k + \delta\, \frac{\partial K(r)}{\partial r_k}, \quad \text{step in } \theta \text{ space},$$
$$r_k \leftarrow r_k - \frac{\delta}{2} \frac{\partial U(\theta)}{\partial \theta_k}, \quad \text{second half-step in } r \text{ space}.$$

As a simple example of Hamiltonian dynamics, let us assume that the posterior distribution is a two
dimensional normal distribution with mean µ and covariance matrix Σ. In this case, we have
$$U(\theta) = \tfrac{1}{2} (\theta - \mu)^\intercal \Sigma^{-1} (\theta - \mu),$$
and we will assume that $K(r) = \tfrac{1}{2}\sum_{k=1}^{d} r_k^2$. The partial derivatives are
$$\frac{\partial U(\theta)}{\partial \theta} = (\theta - \mu)^\intercal \Sigma^{-1}, \qquad \frac{\partial K(r)}{\partial r_k} = r_k,$$
and so our leapfrog steps are
$$r \leftarrow r - \frac{\delta}{2} (\theta - \mu)^\intercal \Sigma^{-1}, \qquad \theta_k \leftarrow \theta_k + \delta r_k.$$
Starting with an arbitrary point θ̃ = (0, 1.5), and choosing r at random from a 2d standard normal, we
simulate the Hamiltonian dynamics for 45 leapfrog steps with δ = 0.1. This is shown in Figure 12; a minimal
R sketch of this simulation is given below. As we can see, the trajectories move smoothly back and forth
through the posterior distribution.
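The following sketch implements these leapfrog steps in R. The covariance matrix Sigma here is an illustrative assumption; the exact covariance used to draw Figure 12 is not stated in the text.
mu <- c(0, 0)
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)  # assumed covariance matrix
Sigma_inv <- solve(Sigma)
grad_U <- function(theta) as.vector(Sigma_inv %*% (theta - mu)) # gradient of U
delta <- 0.1
theta <- c(0, 1.5) # arbitrary starting point
r <- rnorm(2)      # momentum drawn from a 2d standard normal
trajectory <- matrix(NA, nrow = 45, ncol = 2)
for (step in 1:45) {
  r <- r - (delta / 2) * grad_U(theta) # first half-step in r space
  theta <- theta + delta * r           # step in theta space
  r <- r - (delta / 2) * grad_U(theta) # second half-step in r space
  trajectory[step, ] <- theta
}
plot(trajectory, type = "l", xlab = expression(theta[0]), ylab = expression(theta[1]))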
The key feature of hmc that differentiates it from random walk Metropolis is that it replaces the random
proposal distribution with the deterministic dynamics of the Hamiltonian system. Specifically, from the
current point θ̃ in parameter space, a value of the momentum variable r is chosen from the probability
distribution corresponding to K(r), i.e. P(r) ∝ e^{−K(r)}, and the Hamiltonian system deterministically
evolves for a period of time to arrive at a new point θ̃′ and new momentum r′. The new point θ̃′ is then
accepted if H(θ̃′, r′) ≤ H(θ̃, r). Otherwise, θ̃′ is accepted with probability
$$e^{H(\tilde{\theta}, r) - H(\tilde{\theta}', r')} = \frac{\mathrm{P}(\tilde{\theta}'|D)}{\mathrm{P}(\tilde{\theta}|D)} \frac{\mathrm{P}(r')}{\mathrm{P}(r)}.$$
Figure 12: Three trajectories of a Hamiltonian dynamical system where the posterior distribution is a 2d
normal distribution, where each trajectory starts at θ̃ = (0, 1.5), and where we choose the momentum r at
random from a 2d standard normal. The system is simulated for 45 leapfrog steps with δ = 0.1.

This acceptance step is essentially identical to the Metropolis-Hastings acceptance-rejection step.


One final issue to address concerns how long the trajectories of the deterministic Hamiltonian dynamics ought
to be. If the trajectories are too short, the sampler will move slowly through the posterior distribution. If
a trajectory is too long, as we can see in Figure 12, it may loop back upon itself. The extent to which the
trajectories loop around depends on the curvature of the U(θ) space: in flat regions, looping back will not
happen, while in curved regions it will happen sooner. The No-U-Turn Sampler (NUTS) by Hoffman and
Gelman (2014) tunes the trajectory lengths according to the local curvature, allowing near-optimal trajectory
lengths without manual tuning or preliminary tuning runs of the sampler. The probabilistic programming
language Stan, to which Chapter 17 is devoted, is a hmc sampler that uses NUTS.
As an illustration of a hmc sampler, in fact based on the Stan language, we will use the brms package
(Bürkner 2018).
library(brms)
This package allows us to implement a very wide range of models using minimal and familiar R command
syntax and creates, compiles, and executes the Stan sampler. We will use brms based models often in the
remaining chapters. We can implement the logit normal prior based binomial model mentioned above as
follows.
M_hmc <- brm(m | trials(n) ~ 1,
             data = tibble(n = n, m = m), # n and m hold the running example's data
             prior = prior(normal(0, 0.5), class = Intercept),
             family = binomial)
By default, this will sample 4 independent chains, each with 1000 post-warmup samples. The warmup
iterations are iterations prior to convergence on the typical set, and during this period the hmc sampler is
fine-tuned. The summary of this model is as follows.
summary(M_hmc)
## Family: binomial
## Links: mu = logit
## Formula: m | trials(n) ~ 1
## Data: tibble(n = n, m = m) (Number of observations: 1)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000

##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.23 0.13 -0.02 0.48 1.00 1539 2023
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The histogram of the 4000 posterior samples of θ is shown in Figure 13.
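Note that brms parameterizes this model in terms of the log odds (the Intercept in the summary above), so samples of θ are obtained by applying the inverse logit transformation to the samples of the intercept. A minimal sketch, assuming a recent version of brms (which provides the as_draws_df function):
draws <- as_draws_df(M_hmc)                 # one row per post-warmup draw
theta_samples <- plogis(draws$b_Intercept)  # inverse logit of the log odds
hist(theta_samples, breaks = 50, xlab = expression(theta), main = NULL)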


Figure 13: Histogram of the samples from the posterior distribution over θ from a hmc sampler of a
logit-normal prior based binomial model.

Model evaluation
Let us return to the following diagram from the introduction to this chapter.

[Diagram: a statistical model and a fitted model, joined by an inference arc leading from the model to the
fitted model, and an evaluation arc leading from the fitted model back to the model.]

In our coverage of classical and then Bayesian methods of inference, we have dealt with the inference arc
on the right. This is where we begin with a statistical model that has variables or parameters whose values
are unknown, and then use one general approach or another to obtain estimates of these unknown variables,
as well as measures of the uncertainty of these estimates. The result is a model that is fitted to the data.
However, whether it is a good fit, and more importantly, whether the model is able to generalize to new
data, remains to be seen. Model evaluation is the general term for this stage of the modelling process. Model
evaluation is ultimately a large and multifaceted topic, involving various quantitative and graphical methods.
Here, we will focus on the most common methods, and in particular on model comparison, which is where we
directly compare two or more models against one another.

Deviance and Log likelihood ratio tests


Let us first consider a widely used approach for model comparison for classical inference based models. We
will call the model that we are evaluating M1 , whose vector of unknown parameters is θ1 and the maximum
likelihood estimator of these values based on observed data D is θ̂1 . The probability of D according to M1
and θ̂1 is
P(D|M1 , θ̂1 ).
Note that this is the value of the likelihood function for the model evaluated at its maximum. In itself, this
value is not very informative about how well D is predicted by M1 with parameters θ̂1: it is almost
always a very small value, simply because there is a very large set of possible values that the model
could predict. However, it is more informative to compare the relative probabilities of the data according to
two models that we wish to compare. If we name the comparison model M0, and name its parameter vector
θ0 and its estimator of these parameters θ̂0, then the ratio of the probabilities of D according to these two
models is
$$\frac{\mathrm{P}(D|M_0, \hat{\theta}_0)}{\mathrm{P}(D|M_1, \hat{\theta}_1)}.$$
This ratio is usually referred to as the likelihood ratio. Note that it is specifically the ratio of the maximum
values of the likelihood functions of the two models. The logarithm of this likelihood ratio is
$$\log\left(\frac{\mathrm{P}(D|M_0, \hat{\theta}_0)}{\mathrm{P}(D|M_1, \hat{\theta}_1)}\right) = \log \mathrm{P}(D|M_0, \hat{\theta}_0) - \log \mathrm{P}(D|M_1, \hat{\theta}_1).$$
For reasons that will become clear, we usually multiply this logarithm by −2 to obtain the following:
$$-2 \log\left(\frac{\mathrm{P}(D|M_0, \hat{\theta}_0)}{\mathrm{P}(D|M_1, \hat{\theta}_1)}\right) = -2\log \mathrm{P}(D|M_0, \hat{\theta}_0) - \left(-2\log \mathrm{P}(D|M_1, \hat{\theta}_1)\right) = D_0 - D_1.$$
Here, D0 and D1, which are simply −2 times the logarithms of the corresponding maximized likelihoods, are
known as the deviances of models M0 and M1, respectively. Note that because of the negative sign of the
multiplier, the larger the deviance, the lower the likelihood of the model. Thus, if D0 > D1, model M1 is
better able to predict the data than model M0.
Differences of deviances, and so the log likelihood ratio, are particularly widely used when comparing nested
models. Nested models are where one model's parameter space is a subset of that of another. For example,
if M1 is a normal distribution model with mean and variance parameters µ1 and σ1², both of which are
unknown, and M0 is also a normal distribution but with fixed mean µ0 = 0 and only σ0² being unknown,
then M0 is nested in M1 because {µ0 = 0, σ0²} ⊆ {µ1, σ1²}. When comparing nested models, we can make
use of Wilks's theorem, which states that if models M0 and M1 predict D equally well, then asymptotically
(i.e., as sample size increases)
$$\Delta D = D_0 - D_1 \sim \chi^2_{k_1 - k_0},$$
where k1 and k0 are the numbers of parameters in M1 and M0, respectively. In other words, if M0 and M1
predict the data equally well, then the difference of their deviances, which would be due to sampling variation
alone, will be distributed as a χ² distribution whose degrees of freedom are equal to the difference of the
numbers of parameters in the two models.
Wilks’s theorem is widely used in regression models, as we will see in subsequent chapters. However, for
now, we can illustrate with a simple example. Let us return to the houseprices_df data above. One model,
M1 , of the price variable could that its logarithm is normally distributed with some mean µ1 and variance
σ12 , both of which are unknown. A nested model, M0 , could be that the logarithm of price is normally
distributed with a fixed mean µ0 = log(60000) and unknown variance σ02 . Without providing too much details
of the R commands, which we will cover in more detail in subsequent chapters, we can fit these two models
as follows.

M_1 <- lm(log(price) ~ 1, data = housing_df)
mu_0 <- rep(log(60000), nrow(housing_df))
M_0 <- lm(log(price) ~ 0 + offset(mu_0), data = housing_df)
The logarithms of the likelihoods of M_1 and M_0 at the maximum values are as follows.
logLik(M_1)
## 'log Lik.' -234.2995 (df=2)
logLik(M_0)
## 'log Lik.' -240.6163 (df=1)
The corresponding deviances are -2 times these values, which we may also obtain as follows.
(d_1 <- -2 * logLik(M_1))
## 'log Lik.' 468.599 (df=2)
(d_0 <- -2 * logLik(M_0))
## 'log Lik.' 481.2327 (df=1)
The difference of the deviances ∆D is as follows.
delta_d <- as.numeric(d_0 - d_1)
delta_d
## [1] 12.6336
Assuming M1 and M0 predict the observed data equally well, this difference is distributed as a χ2 distribution
with 1 degree of freedom. This value for the degrees of freedom is based on the fact that one model has two
parameters and the other has just one. We can calculate the p-value for ∆D by calculating the probability of
obtaining a result as or more extreme than ∆D in a χ2 distribution with 1 degree of freedom.
pchisq(delta_d, df = 1, lower.tail = F)
## [1] 0.0003788739
Simply put, this result tells us that a value of ∆D = 12.63 is not an expected result if models M0 and M1
predict the data equally well, and so we can conclude that M1, which has the lower deviance, predicts the
data significantly better than M0.

Cross validation and out-of-sample predictive performance


Although widely used and very useful, the deviance based model comparison just described is limited to
classical methods and to nested models. It is also limited in that it examines how well a pair of models
can predict the observed data on which they were fitted. A more important question is how well a
model can generalize to new data from the same putative source as the original data. This is known
as out-of-sample predictive performance. A model may in fact predict the data on which it was fitted well,
or even perfectly, and yet generalize poorly because it is too highly tuned to the peculiarities or essentially
random elements of the data. This is known as over-fitting, and it is a major problem, especially with complex
models. To properly evaluate a model and identify any over-fitting, we need to see how well the model can
generalize to new data. Rather than waiting for new data to be collected, a simple solution is to remove
some data from the data set used for model fitting, fit the model with the remaining data, and then test
how well the fitted model predicts the reserved data. This is known as cross-validation.
One common approach to cross-validation is known as K-fold cross-validation; a sketch of the general
procedure is given below. The original data set is divided randomly into K subsets. One of these subsets is
selected to be reserved for testing; the remaining K − 1 are used for fitting, and the generalization to the
reserved subset is evaluated. This process is repeated for all K subsets, and the overall cross-validation
performance is the average over the K repetitions. One extreme version of K-fold cross-validation is where
K = n, where n is the size of the data set. In this case, we remove each one of the observations in turn, fit
the model on the remaining data, and test on the held-out observation. This is known as leave one out
cross-validation.
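Before turning to leave one out cross-validation in detail, the following R sketch illustrates the general K-fold procedure for the intercept-only model of log house prices used later in this section; the fold assignment and the log-density scoring are our own illustrative choices:
library(tidyverse)
K <- 10
n <- nrow(housing_df)
fold <- sample(rep(1:K, length.out = n)) # random fold assignment
fold_lpd <- map_dbl(1:K, function(k) {
  train <- housing_df[fold != k, ] # fit on all folds except fold k
  test <- housing_df[fold == k, ]
  m <- lm(log(price) ~ 1, data = train)
  # log predictive density of the held-out fold
  sum(dnorm(log(test$price), mean = coef(m), sd = sigma(m), log = TRUE))
})
sum(fold_lpd) # overall out-of-sample predictive performance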

For leave one out cross-validation, the procedure is as follows, with the procedure for any K-fold cross-
validation being similarly defined. Assuming our data is D = y1 , y2 . . . yn , we divide the data into n sets:

(y1 , y¬1 ), (y2 , y¬2 ), . . . (yi , y¬i ) . . . (yn , y¬n ),

where yi is data point i and y¬i is all the remaining data except for data point i. Then, for each i, we fit
the model using y¬i and test how well the fitted model can predict yi . We then calculate the sum of the
predictive performance over all data points. In classical inference based models, for each i, we calculate
θ̂¬i , which is the maximum likelihood or other estimator of the parameter vector θ based on y¬i . We then
calculate log P(yi |θ̂¬i ), which is the logarithm of the predicted probability of yi based on the model with
parameters θ̂¬i . This then leads to
$$\mathrm{elpd} = \sum_{i=1}^{n} \log \mathrm{P}(y_i|\hat{\theta}_{\neg i})$$

as the overall measure of the model’s out-of-sample predictive performance, which we refer to as expected
log predictive density (elpd). In Bayesian approaches, an analogous procedure is followed. For each i, we
calculate the posterior distribution P(θ|y¬i ), and the model’s overall out-of-sample predictive performance is
$$\mathrm{elpd} = \sum_{i=1}^{n} \log \int \mathrm{P}(y_i|\theta)\, \mathrm{P}(\theta|y_{\neg i})\, d\theta.$$
This can be approximated by
$$\mathrm{elpd} \approx \sum_{i=1}^{n} \log\left(\frac{1}{S} \sum_{s=1}^{S} \mathrm{P}(y_i|\tilde{\theta}_s^{\neg i})\right),$$
where $\tilde{\theta}_1^{\neg i}, \tilde{\theta}_2^{\neg i}, \ldots, \tilde{\theta}_S^{\neg i}$ are S samples from P(θ|y¬i).


In order to illustrate this, let us use the price variable in the housing_df data. One model for this data,
as described in the introduction to this chapter, is the following: for i ∈ 1 ... n, yᵢ ∼ logN(µ, σ²),
where logN(µ, σ²) is a log-normal distribution whose parameters are µ and σ². Using a classical inference
based model, this can be implemented in R as follows.
m1 <- lm(log(price) ~ 1, data = housing_df)
We can remove any i ∈ 1 . . . n and fit the model with the remaining data as follows, using i = 42 as an
example.
i <- 42
m1_not_i <- lm(log(price) ~ 1,
               data = slice(housing_df, -i))
The logarithm of the probability of yᵢ is the logarithm of the normal density at log yᵢ, with the mean and
standard deviation set to their maximum likelihood estimators in the fitted model. We can extract the
maximum likelihood estimators of the mean and standard deviation from m1_not_i as follows.
mu <- coef(m1_not_i)
stdev <- sigma(m1_not_i)
Then, the logarithm of the normal density of log yᵢ based on these estimators is
y_i <- slice(housing_df, i) %>% pull(price)
dnorm(log(y_i), mean = mu, sd = stdev, log = T)
## [1] 0.03483869
We can create a function to calculate the logarithm of the prediction for any log yi as follows.
logprediction_m1 <- function(i){
  # fit the model without observation i
  m1_not_i <- lm(log(price) ~ 1,
                 data = slice(housing_df, -i))
  mu <- coef(m1_not_i)
  stdev <- sigma(m1_not_i)
  # log density of the held-out observation
  y_i <- slice(housing_df, i) %>% pull(price)
  dnorm(log(y_i), mean = mu, sd = stdev, log = T)
}
We can then apply this to all data points and sum the result.
n <- nrow(housing_df)
map_dbl(seq(n), logprediction_m1) %>%
sum()
## [1] -236.2388
Now, let us compare this log-normal model’s performance to a normal model, i.e., where we assume that
for i ∈ 1 . . . n, yi ∼ N(µ, σ 2 ). The log of the prediction of the held out data point is calculated in almost an
identical manner, except that we use the values of the price variable directly and not their logarithm.
logprediction_m0 <- function(i){
  m0_not_i <- lm(price ~ 1,
                 data = slice(housing_df, -i))
  mu <- coef(m0_not_i)
  stdev <- sigma(m0_not_i)
  y_i <- slice(housing_df, i) %>% pull(price)
  dnorm(y_i, mean = mu, sd = stdev, log = T)
}

map_dbl(seq(n), logprediction_m0) %>%
  sum()
## [1] -6342.375
It is common to multiply these sums of log predictions by −2 to put the values on a deviance scale. This is
sometimes known as the leave-one-out information criterion (looic). Thus, our measure of the out-of-sample
predictive deviance of the log-normal model is 472.48, and for the normal model, it is 12684.75. The log-normal
model is clearly the better model, with a much lower deviance score. In the next section, we will consider how
best to interpret differences in looic and related quantities.
To perform leave one out cross validation using Stan models, we can make use of an efficient implementation
of this based on Pareto-smoothed importance sampling (Vehtari, Gelman, and Gabry 2017). This means that
we do not have to manually divide the data set into subsets, as we did in the previous example. In fact,
this efficient method allows us to effectively implement the leave one out cross validation method without
repeatedly re-running the hmc models, which would be extremely computationally burdensome. All we need
do is implement the Stan model, which we can do easily using brm as follows.
m1_bayes <- brm(log(price) ~ 1, data = housing_df)
m0_bayes <- brm(price ~ 1, data = housing_df)
We may then get the elpd and the looic using the command loo.
loo(m1_bayes)$estimates
## Estimate SE
## elpd_loo -236.234322 15.9135198
## p_loo 1.888464 0.1616675
## looic 472.468644 31.8270395

loo(m0_bayes)$estimates
## Estimate SE
## elpd_loo -6341.868732 23.4359160
## p_loo 2.981645 0.6150526
## looic 12683.737464 46.8718319
Leaving aside p_loo (a measure of effective number of parameters) and the standard errors (a measure of
uncertainty of the estimates, as defined above), we see here that the elpd_loo and the looic estimates from
the Bayesian models are very similar to those calculated using the classical inference based models above.
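The loo package also provides loo_compare for comparing models directly on the elpd scale; a brief sketch:
loo_compare(loo(m1_bayes), loo(m0_bayes))
The first row of the resulting table is the model with the highest elpd_loo, and the elpd_diff column gives each other model's difference from it, along with a standard error of that difference.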

AIC
Cross-validation is an excellent method of model evaluation because it addresses the central issue of out-of-
sample generalization, rather than fit to the data, and can be applied to any model, regardless of whether
the model is based on classical or Bayesian methods of inference. On the other hand, cross-validation has
traditionally been seen as too computationally demanding to be used in all data analysis situations. This is
becoming less of a concern, both because of increases in computational power and because of the development
of efficient implementations such as the Pareto smoothed importance sampling method mentioned above.
Nonetheless, one still widely used model evaluation method, the Akaike Information Criterion (aic), originally
proposed by Akaike (1973), can be justified as a very easily computed approximation to leave one out cross
validation (see Stone 1977; Fang 2011). aic is defined as follows:
$$\mathrm{AIC} = 2k - 2\log \mathrm{P}(D|\hat{\theta}) = 2k + \text{Deviance},$$

where k is the number of parameters in the model. Obviously, for models where the log of the likelihood is
available, and where the number of parameters of the model is straightforward to count (which is not always
the case in complex models), the aic is simple to calculate. For example, for the log normal and normal
models that we evaluated above using cross-validation, both of which have k = 2 parameters, the aic is
calculated as follows.
m1 <- lm(log(price) ~ 1, data = housing_df)
m0 <- lm(price ~ 1, data = housing_df)
k <- 2
aic_1 <- as.numeric(2 * k - 2 * logLik(m1))
aic_0 <- as.numeric(2 * k - 2 * logLik(m0))

c(aic_1, aic_0)
## [1] 472.599 12682.711
Clearly, these are very similar to the cross-validation looic values (that is, the elpd on the deviance scale) for these two models.
Like elpd and other measures, a model's aic value is of little use in itself, and so we only interpret
differences in aic between models. Conventional standards (see, for example, Burnham and Anderson 2003,
Chapter 2) hold that aic differences of between 4 and 7 indicate clear superiority of the predictive
power of the model with the lower aic, while differences of 10 or more indicate that the model with the
higher value has essentially no predictive power relative to the model with the lower value. We can appreciate
why these thresholds are followed by considering the concept of Akaike weights (see Burnham and Anderson
2003, 75). Akaike weights provide probabilities for each of a set of K models that are being compared with
one another. They are defined as follows:
one another. They are defined as follows:
$$\rho_k = \frac{e^{-\frac{1}{2}\Delta \mathrm{AIC}_k}}{\sum_{k'=1}^{K} e^{-\frac{1}{2}\Delta \mathrm{AIC}_{k'}}},$$

where ∆AIC_k is the difference between the aic of model k and the lowest aic value in the set of K models.
These probabilities can be interpreted as the probabilities that any given model has better predictive performance
than the others. If we have just two models, with model k = 1 being the one with the lower aic, then
∆AIC₁ = 0 and ∆AIC₂ will be some positive quantity δ. Using Akaike weights, the probability that
the model with the lower aic has the better predictive performance is
$$\rho = \frac{1}{1 + e^{-\delta/2}}.$$
For δ values of 4, 6, and 9, the corresponding probabilities are 0.88, 0.95, and 0.99. These values provide a
justification for the thresholds proposed by Burnham and Anderson (2003).
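Akaike weights are straightforward to compute; a minimal sketch, in which akaike_weights is a helper function of our own:
akaike_weights <- function(aic) {
  delta <- aic - min(aic)
  exp(-delta / 2) / sum(exp(-delta / 2))
}
akaike_weights(c(aic_1, aic_0)) # using the aic values calculated above
For the two house-price models above, this assigns essentially all of the weight to the log-normal model.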

WAIC
aic is appealing because of its simplicity. As we have seen, despite its simplicity, it can be a highly accurate
approximation to the cross-validation elpd. However, this approximation will not hold in all models, especially
large and complex ones. In addition, in some situations, calculating the maximum of the likelihood function
is not straightforward, and nor is defining the number of free parameters in the model. A more widely
applicable version of the aic is the Watanabe Akaike Information Criterion (waic). waic was introduced
in Watanabe (2010) under the name Widely Applicable Information Criterion rather than Watanabe Akaike
Information Criterion. It has been shown that waic is more generally or widely applicable than aic, and is a
close approximation to Bayesian leave-one-out cross-validation, yet can be calculated easily from a model's
posterior samples. waic is calculated as follows:
$$\mathrm{waic} = -2\left(\sum_{i=1}^{n} \log\left(\frac{1}{S}\sum_{s=1}^{S} \mathrm{P}(y_i|\theta^s)\right) - \sum_{i=1}^{n} \mathrm{V}_{s=1}^{S}\left(\log \mathrm{P}(y_i|\theta^s)\right)\right),$$
where yᵢ, θˢ, etc. are as they are defined above for the case of the Bayesian elpd. The term $\mathrm{V}_{s=1}^{S}(\cdot)$ signifies the
variance of its argument over the S posterior samples.
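To make this definition concrete, waic can be computed by hand from the matrix of pointwise log-likelihood values that brms can return; the following sketch ignores the numerical underflow issues that exponentiating log-likelihoods can cause, which the loo package handles more carefully:
ll <- log_lik(m1_bayes)             # S x n matrix of log P(y_i | theta^s)
lppd <- sum(log(colMeans(exp(ll)))) # first summation in the definition
p_waic <- sum(apply(ll, 2, var))    # second summation: effective parameters
-2 * (lppd - p_waic)                # waic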
Using Stan based models, waic is easily calculated using the loo package.
waic(m1_bayes)$estimates
## Estimate SE
## elpd_waic -236.232503 15.9133627
## p_waic 1.886645 0.1614885
## waic 472.465006 31.8267254
waic(m0_bayes)$estimates
## Estimate SE
## elpd_waic -6341.868701 23.4362820
## p_waic 2.981614 0.6155333
## waic 12683.737402 46.8725640
As we can see, these values are very similar to the Bayesian cross-validation elpd and looic, but can be
calculated directly from the posterior samples.

References
Akaike, Hirotogu. 1973. "Information Theory and an Extension of the Maximum Likelihood Principle." In
Second International Symposium on Information Theory, edited by B. N. Petrov and F. Csáki.
Betancourt, Michael. 2017. "A Conceptual Introduction to Hamiltonian Monte Carlo." arXiv preprint arXiv:1701.02434.
Burnham, Kenneth P, and David R Anderson. 2003. Model Selection and Multimodel Inference: A Practical
Information-Theoretic Approach. Springer Science & Business Media.
Bürkner, Paul-Christian. 2018. “Advanced Bayesian Multilevel Modeling with the R Package brms.” The R
Journal 10 (1): 395–411.

Clopper, Charles J, and Egon S Pearson. 1934. “The Use of Confidence or Fiducial Limits Illustrated in the
Case of the Binomial.” Biometrika 26 (4): 404–13.
Diaconis, Persi, Susan Holmes, and Richard Montgomery. 2007. “Dynamical Bias in the Coin Toss.” SIAM
Review 49 (2): 211–35.
Fang, Yixin. 2011. “Asymptotic Equivalence Between Cross-Validations and Akaike Information Criteria in
Mixed-Effects Models.” Journal of Data Science 9 (1): 15–21.
Gelman, Andrew, and Deborah Nolan. 2002. “You Can Load a Die, but You Can’t Bias a Coin.” The
American Statistician 56 (4): 308–11.
Gelman, Andrew, Daniel Simpson, and Michael Betancourt. 2017. “The Prior Can Often Only Be Understood
in the Context of the Likelihood.” Entropy 19 (10): 555.
Hoffman, Matthew D, and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively Setting Path
Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15 (1): 1593–1623.
Kerrich, J. E. 1946. An Experimental Introduction to the Theory of Probability. E. Munksgaard.
Metropolis, Nicholas, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller.
1953. “Equation of State Calculations by Fast Computing Machines.” The Journal of Chemical Physics 21
(6): 1087–92.
Metropolis, Nicholas, and S. Ulam. 1949. “The Monte Carlo Method.” Journal of the American Statistical
Association 44 (247): 335–41.
Neal, Radford. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte Carlo,
edited by Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. CRC press.
Stone, M. 1977. “An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion.”
Journal of the Royal Statistical Society. Series B (Methodological) 39 (1): 44–47.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using
Leave-One-Out Cross-Validation and Waic.” Statistics and Computing 27 (5): 1413–32.
Watanabe, Sumio. 2010. “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable
Information Criterion in Singular Learning Theory.” Journal of Machine Learning Research 11 (Dec):
3571–94.
