Chapter 08 Inference
Mark Andrews
Contents
Introduction
Statistical inference
Model evaluation
    Deviance and log likelihood ratio tests
    Cross validation and out-of-sample predictive performance
    AIC
    WAIC
References
Introduction
As important as exploratory analysis is, it is a fundamentally different undertaking to statistical modelling
and statistical inference. Exploratory data analysis aims to describe and visualize the data and to identify
possible trends and patterns in it. Statistical models, by contrast, are mathematical models of the population
from which the data originated. The term population is here used in a technical sense that is specific to
statistics. In general, a population is the hypothetical set from which our actual data are assumed to be a
random sample. From the perspective of statistical modelling, our ultimate interest lies not in the data itself
but rather in the population, or more specifically in our mathematical models of the population.
Let us assume that we have the following data set: y1 , y2 . . . yn , where each yi is a single (scalar) value. From
the perspective of statistical modelling, y1 , y2 . . . yn is a random sample from a population. Ultimately, this
population is described by a probability distribution. More specifically, we treat the values y1 , y2 . . . yn as a
realization of the random variables Y1 , Y2 . . . Yn . A random variable has a formal mathematical definition, but
informally speaking, it is a variable that can take on different values according to some probability distribution.
The probability distribution over the variables Y1 , Y2 . . . Yn can be denoted generally by P(Y1 , Y2 . . . Yn ). This
probability distribution is a function over an n dimensional space that gives the probabilities of every possible
combination of values of the n variables Y1 , Y2 . . . Yn . Our observed data y1 , y2 . . . yn is but one realization of
the random variables Y1 , Y2 . . . Yn , and every possible realization is a random sample from the probability
distribution P(Y1 , Y2 . . . Yn ). Thus, the possible values that Y1 , Y2 . . . Yn can take and their corresponding
probabilities formally define the population.
In general, except in artificial scenarios, we do not know the nature of the population probability distribution
P(Y1 , Y2 . . . Yn ). Our aim, therefore, is to first propose a model of this probability distribution, and then
use statistical inference to infer the properties of this model. The model that we develop is a probabilistic
model, which means that it defines a probability distribution over the variables Y1 , Y2 . . . Yn . This model is
often referred to simply as the statistical model. In some contexts, it is also known as the generative model,
the probabilistic generative model, the data generating model, and so on. Almost always, as we will see, this
statistical model is defined in terms of parameters and other variables that are assumed to have fixed but
unknown values. We use statistical inference, which we will discuss at length below, to infer what the values
of these variables are.
The process of developing a statistical model and inferring its parameters and other unknown variables is, in
fact, an iterative process, and is illustrated in the following diagram.
[Diagram: a cycle in which a proposed model leads, via inference, to a fitted model, and model evaluation of the fitted model leads back to a (possibly revised) proposed model.]
First, we propose or assume a statistical model. This may be done on the basis of some exploratory data
analysis and visualization, as well as by using our scientific knowledge and understanding of the phenomenon
being studied. Having assumed some model, we then infer its parameters and other unknowns. We are
now in a position to critically evaluate the resulting fitted model. Specifically, we evaluate whether or not
the predictions of the model make sense and whether they are consistent with the data. On the basis of this
evaluation, we may need to elaborate or possibly simplify our originally proposed model, which leads to
a new proposed model. The unknowns of this model are then inferred, and the new fitted model is evaluated.
This process iterates until we are satisfied that the model is sufficient for practical purposes or that no
major alternative models have been overlooked. The final fitted model is then used as the model of the
population. As we will see through examples, this is then effectively a mathematical or probabilistic model of
the phenomenon being studied. With this model, we can explain and reason about the phenomenon, make
predictions about future data, and so on.
As a brief illustrative outline of this modelling process, consider the following housing_df data frame.
housing_df <- read_csv('data/housing.csv')
housing_df
## # A tibble: 546 x 1
## price
## <dbl>
## 1 42000
## 2 38500
## 3 49500
## 4 60500
## 5 61000
## 6 66000
## 7 66000
## 8 69000
## 9 83800
## 10 88500
## # ... with 536 more rows
This gives the prices (in Canadian dollars) of a set of 546 houses in the city of Windsor, Ontario in 1987. We
can denote these values as y1 , y2 . . . yn and treat them as a realization of the random variables Y1 , Y2 . . . Yn
whose probability distribution is P(Y1 , Y2 . . . Yn ). This defines the population. How best to conceive of this
population in real world terms is usually a matter of debate and discussion, rather than a simple matter of
fact. For example, the population might be conceived of narrowly as the set of house prices in Windsor in
1987, or more widely as the set of house prices in mid-sized cities in Canada in the late 1980s, or more widely
still as the set of house prices in North America during the time period, and so on. Regardless of how we
conceive of the population, we do not know the true nature of P(Y1 , Y2 . . . Yn ) and so begin by proposing
a model of it. Among other things, this initial proposal could be based on exploratory data analysis and
visualization, such as that shown in Figure 1a-d. From the histograms and QQ plots shown here, we see that
the logarithm of house prices appears to be distributed as a normal distribution.
Figure 1: a) A histogram of house prices (in Canadian dollars) from the city of Windsor, Ontario, Canada, in
1987. b) A histogram of the logarithm of the house prices. c) A QQ plot of the prices compared to a normal
distribution whose mean and standard deviation are set to the median and median absolute deviation (mad)
of the prices. d) The QQ plot of log of the prices, again compared to a normal distribution whose mean and
standard deviation deviation are set to the median and mad of the log prices.
This leads us to the following proposed model: for each i ∈ 1 . . . n, Yi ∼ logN(µ, σ 2 ). Note that logN(µ, σ 2 ) is
shorthand to denote a log-normal distribution whose parameters are µ and σ 2 . A log-normal distribution is a
distribution of a variable whose logarithm is normally distributed. In the model statement, the ∼ is read as
is distributed as. This model therefore states that each random variable Yi , for i ∈ 1 . . . n, is distributed as a
log-normal distribution whose parameters are µ and σ 2 . This model is an example of an independent and
identically distributed (iid) model: each of the n random variables is modelled as independent of one another,
and each one has the same probability distribution.
Having proposed an initial model, we must now infer the values of the unknown variables µ and σ. As we will
discuss at length below, there are two major approaches to doing statistical inference, but in practical terms,
both methods effectively lead to estimates of µ and σ as well as measures of the uncertainty of these estimates.
If we denote the estimates of µ and σ by µ̂ and σ̂, our fitted model is then the log-normal distribution
logN(µ̂, σ̂ 2 ). At this point, we may evaluate this fitted model and determine if its assumptions or predictions
are consistent with the observed data. If, on the basis of this evaluation, the assumed model is deemed
satisfactory for present practical purposes, we then can use the fitted model, particularly taking into account
the uncertainties in our estimates, to explain, reason about, or make predictions concerning house prices in
cities like Windsor during this period.
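To make the estimation step concrete: for the log-normal model, log(Yi ) is normally distributed, so the maximum likelihood estimates of µ and σ are just the mean and (uncorrected) standard deviation of the log of the data. The following is a minimal sketch using simulated prices, since the housing_df values are not reproduced here; with the real data, y would be housing_df$price, and the parameter values used in the simulation are illustrative assumptions only.

```r
# Sketch: maximum likelihood estimation for the iid log-normal model.
# For Y_i ~ logN(mu, sigma^2), log(Y_i) ~ N(mu, sigma^2), so the MLEs of
# mu and sigma are the sample mean and (uncorrected) sd of log(y).
set.seed(101)
y <- rlnorm(546, meanlog = 11.06, sdlog = 0.37)  # simulated stand-in for housing_df$price

mu_hat <- mean(log(y))
sigma_hat <- sqrt(mean((log(y) - mu_hat)^2))

c(mu_hat = mu_hat, sigma_hat = sigma_hat)
```

The fitted model is then logN(µ̂, σ̂ 2 ), which can be checked against the data, for example by overlaying its density on the histogram of log prices.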
Statistical inference
Statistical inference is the inference of the values of unknown variables in a statistical model. There are
two major approaches to statistical inference. The first approach is variously referred to as the classical,
frequentist, or sampling theory based approach. The second is the Bayesian approach. The classical approach
is still the dominant one in practice, particularly for introductory and intermediate level topics. It is also
the principal, or even only, approach taught in most applied statistics courses. As such, it is the only approach
that most working scientists will have been formally introduced to. The Bayesian approach, on the other
hand, although having its origins in the 18th century and being used in practice throughout the 19th century,
had a long hiatus in statistics until around the end of the 1980s. Since then, and as we will see, largely
because of the growth in computing power, it has steadily grown in terms of its popularity and widespread
usage throughout statistics and data science. Throughout the remainder of this book, we will attempt to pay
adequate attention to both the classical and Bayesian approaches wherever we discuss statistical models. In
this section, we aim to provide a brief general introduction to both approaches. For this, we will use a simple
problem, propose a statistical model for it, and then infer the values of its unknown parameters using the
classical and the Bayesian approaches.
Example problem: The problem we will consider was described in the Guardian newspaper on January
4, 2002¹ : “Polish mathematicians Tomasz Gliszczynski and Waclaw Zawadowski. . . spun a Belgian one
euro coin 250 times, and found it landed heads up 140 times.” Here, the data is the observed number of
heads, m = 140. The total number of spins n = 250 is the sample size, and is a fixed and known quantity,
and so is not modelled per se. From the perspective of statistical modelling, the observed number of heads
is treated as a sample from a population. In this case, the population is the set of all possible observed
number of heads, and their relative frequencies, that would be obtained if Gliszczynski and Zawadowski were
to infinitely repeat their study under identical circumstances with the exact same Belgian one euro coin.
The population is therefore a probability distribution over the set of possible values of the number of heads
in a sample of n = 250 trials, which is a probability distribution over 0 . . . n. Thus, m is a realization of
a random variable Y whose possible values are 0 . . . n and whose probability distribution is P(Y ). We do
not know P(Y ) and so we begin by proposing a model of it. In this case, the only viable option for this
model is the binomial distribution: Y ∼ Binomial(θ, n = 250). While more complex options are possible in
principle, they would all go beyond what the data can tell us, and so simply can not be evaluated. In general
terms, a binomial distribution gives the probability of the number of so-called “successes” in a fixed number
n of “trials”, where the probability of a success on any trial is a fixed quantity θ and all trials are independent
of one another. Translated into the terms of the current example, the binomial distribution is the probability
distribution over the observed number of heads in n = 250 spins where the probability of a heads on any trial
is θ, and where we assume the outcome on each trial is independent of every other trial. In this binomial
model, the parameter θ is a fixed but unknown quantity. The value of θ is therefore the objective of our
statistical inference.
1 https://www.theguardian.com/world/2002/jan/04/euro.eu2
Classical statistical inference
Classical statistical inference begins with an estimator of the value of θ, denoted by θ̂, and then considers the
sampling distribution of θ̂ for any hypothetical true value of θ. Informally speaking, we can see the estimator
θ̂ as an educated guess of what the true value of θ is. There are different methods of estimation available,
and each one can be evaluated, as we will see, according to certain criteria, such as bias, variance, etc. One
widely used method of estimation, perhaps the most widely used method, is maximum likelihood estimation.
The binomial distribution has the probability mass function
$$\mathrm{P}(Y = m \mid \theta, n) = \binom{n}{m} \theta^m (1-\theta)^{n-m}.$$
Why the binomial distribution has this probability mass function is not something we will derive here, and we will just take it as a given, but it is relatively straightforward to derive from the above definition of a binomial problem. Note that this probability mass function is a function that maps values of m ∈ 0 . . . n to the interval [0, 1] for fixed values of θ and n. The corresponding likelihood function takes the same formula, i.e. $\binom{n}{m}\theta^m(1-\theta)^{n-m}$, and treats it as a function of θ for fixed values of m and n. In other words, the likelihood function is as follows:
$$L(\theta \mid m, n) = \binom{n}{m} \theta^m (1-\theta)^{n-m},$$
where θ is assumed to take values in the interval Θ = [0, 1] and where n and m are fixed. In other words, the likelihood function gives the probability of the observed data for every possible value of the unknown variable θ ∈ Θ. Technically speaking, any likelihood function is defined up to a proportional constant. What this means is that multiplying, or dividing, any likelihood function by a fixed constant value results in the same likelihood function. In practice, therefore, when writing a likelihood function, we usually drop any constant multipliers. In the above likelihood function, the binomial coefficient $\binom{n}{m}$ is a constant term that does not change for any value of θ, and so it can be dropped, leading to the likelihood function being written as
$$L(\theta \mid m, n) = \theta^m (1-\theta)^{n-m}.$$
The likelihood function for m = 140 and n = 250 is shown in Figure 2a. As we can see from this plot,
the likelihood function is concentrated from around θ = 0.45 to around θ = 0.65. Outside of that range,
the likelihood of any value of θ becomes negligible. In itself, this is very informative. It tells us that the
probability of observing the data that we did, i.e., m = 140 heads in n = 250 spins, is negligible if θ is less
than around 0.45 or greater than around 0.65. Therefore, only values in the range of approximately 0.45 to
0.65 have evidential support from the data.
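This concentration is easy to check numerically. The sketch below (not the book's code) evaluates the likelihood θ^m (1 − θ)^(n−m) over a grid of θ values and scales it by its maximum, which is legitimate because likelihoods are defined only up to a constant:

```r
# Evaluate the binomial likelihood L(theta) = theta^m * (1 - theta)^(n - m)
# on a grid, working on the log scale to avoid numerical underflow.
m <- 140; n <- 250
theta <- seq(0.001, 0.999, by = 0.001)
loglik <- m * log(theta) + (n - m) * log(1 - theta)
lik <- exp(loglik - max(loglik))  # scaled so that the maximum is 1

# The relative likelihood is negligible for theta outside roughly [0.45, 0.65].
max(lik[theta < 0.45 | theta > 0.65])
```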
The value of θ that maximizes the likelihood function is obtained by the usual procedure for optimizing
functions, namely calculating the derivative of the function with respect to θ, setting the derivative equal
to zero, and solving for the value of θ. We can do this more easily by using the logarithm of the likelihood
function, rather than the likelihood function itself. The logarithm of the likelihood function is shown in
Figure 2b. Because the logarithm is a monotonic transformation, the value of θ that maximizes the logarithm
of the likelihood also maximizes the likelihood function. Setting the derivative of the log of the likelihood function to zero and solving for θ gives the maximum likelihood estimator θ̂ = m/n, which here is θ̂ = 140/250 = 0.56.
Figure 2: a) The binomial likelihood function for θ when m = 140 and n = 250. b) The logarithm of the
likelihood function.
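The maximum likelihood estimate here, θ̂ = m/n = 140/250 = 0.56, can also be confirmed numerically by maximizing the log-likelihood directly (a sketch using R's general-purpose optimize function):

```r
# Numerically maximize the binomial log-likelihood to confirm theta_hat = m / n.
m <- 140; n <- 250
loglik <- function(theta) m * log(theta) + (n - m) * log(1 - theta)
theta_hat <- optimize(loglik, interval = c(0.01, 0.99), maximum = TRUE)$maximum
theta_hat  # approximately 140 / 250 = 0.56
```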
Sampling distribution of θ̂
The maximum likelihood estimator can be seen as a random variable. It is a deterministic function of the
observed data m, but m would vary were we to repeat the experiment, even under identical circumstances. For
example, if we knew the true value of θ, let’s call this true value θ⋆ , and we spun the coin n times, and did so
ad infinitum, then we know that the distribution of m (the observed number of heads) would be the binomial
distribution with sample size n and parameter value θ⋆ . Given that the maximum likelihood estimator is always θ̂ = m/n, the probability that θ̂ takes on the value m/n for any given value of m, if the true value of θ is θ⋆ , is
$$\binom{n}{m} \theta_\star^m (1-\theta_\star)^{n-m}.$$
In general, the sampling distribution for θ̂ when the true value of θ is θ⋆ can be written as follows:
$$\mathrm{P}(\hat\theta \mid \theta_\star, n) = \binom{n}{\hat\theta n} \theta_\star^{\hat\theta n} (1-\theta_\star)^{n - \hat\theta n}.$$
The sampling distribution of θ̂ when n = 250 and for the example value of θ⋆ = 0.64 is shown in Figure 3.
The expected value of the sampling distribution of θ̂ is θ⋆ . In other words, on average, θ̂ is equal to θ⋆ . This
tells us that the binomial maximum likelihood estimator is an unbiased estimator of the true value of θ,
Figure 3: Sampling distribution of the binomial maximum likelihood estimator θ̂ when n = 250 and θ⋆ = 0.64.
Here, we limit the x axis to just beyond those values of θ̂ that have nontrivial probabilities.
i.e. θ⋆ . The variance of the sampling distribution of θ̂ is $\frac{1}{n}\theta_\star(1-\theta_\star)$. The lower the variance of the estimator, the less
sampling variability there will be in the value of the estimator. Here, we see that the variance decreases as n
increases, and so as sample size increases, there is less variability in the estimator’s values. The standard
deviation of the sampling distribution is $\frac{1}{\sqrt{n}}\sqrt{\theta_\star(1-\theta_\star)}$. This is known as the standard error, and it often
plays an important role in classical statistical inference, as we will see in later examples.
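As a concrete check, under the hypothetical value θ⋆ = 0.64 used in Figure 3 with n = 250, the standard error works out as follows (a sketch; dbinom gives the exact sampling probabilities, so the mean and sd of the sampling distribution can also be computed directly):

```r
# Standard error of the binomial MLE when theta_star = 0.64 and n = 250.
n <- 250; theta_star <- 0.64
se <- sqrt(theta_star * (1 - theta_star) / n)
se  # approximately 0.030

# The exact sampling distribution of theta_hat = m / n puts probability
# dbinom(m, n, theta_star) on each value m / n; its mean and sd should
# match theta_star (unbiasedness) and se (the standard error).
m <- 0:n
mean_theta_hat <- sum((m / n) * dbinom(m, n, theta_star))
sd_theta_hat <- sqrt(sum((m / n - mean_theta_hat)^2 * dbinom(m, n, theta_star)))
```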
p-values
Informally speaking, a p-value tells us whether the value of an estimator, or more generally, the value of
any statistic or function of the observed data, is consistent with some hypothetical value of the unknown
variable. If the p-value is low, the estimator’s value is not consistent with the hypothesized value. The higher
the p-value is, the more consistent the estimator’s value is with the hypothesized value. If the p-value is
sufficiently low, we can, practically speaking, rule out or reject the hypothesized value. In general, the p-value
takes values from 0 to 1 and so is an inherently continuous measure of support for a hypothesis. Where
we “draw the line” on this continuum between low and not-low is a matter of convention, but the almost
universally held convention is that a p-value must be lower than 0.05 to be considered sufficiently low
for the corresponding hypothesized value to be rejected. This threshold for rejection/non-rejection is usually
signified by α.
Technically speaking, p-values are tail areas of the sampling distribution of the estimator corresponding to a
particular hypothesized value of the unknown variable. Once we have the sampling distribution, p-values
are straightforward to calculate. In the current problem, the unknown variable is θ and we can in principle
hypothesize that its true value is any value between 0 and 1. If, for example, we hypothesize that the true
value of θ is θ⋆ = 0.64, the sampling distribution of θ̂ is that shown in Figure 3. On the basis of this sampling
distribution, we can see that some values of θ̂ are expected and others are not. For example, we see that the
θ̂ values are mostly from around 0.58 to around 0.7, and values of θ̂ much below or above those extremes
rarely occur. On the basis of the estimator’s value that we did obtain, namely θ̂ = 140/250 = 0.56, we can see
that this result seems to be outside the range of values of θ̂ that we would expect if θ⋆ = 0.64. In order to
be precise in our statement of whether θ̂ = 0.56 is beyond what we would expect if θ⋆ = 0.64, we calculate
the tail areas of the sampling distribution defined by values as or more extreme than θ̂ = 0.56. These tail
areas are shaded in Figure 4a. The total area in these tails defines the p-value. In other words, the p-value is
the probability of observing a value of the estimator as or more extreme than θ̂ = 0.56 if θ⋆ = 0.64. If the
p-value is low, then we know that θ̂ = 0.56 is far into the tails of the sampling distribution when θ⋆ = 0.64. In
this particular example, the area of these tails, and so therefore the p-value, is approximately 0.01. This is
clearly low according to the conventional standards mentioned above, and so therefore we say that the result
θ̂ = 0.56 is not consistent with the hypothesis that θ⋆ = 0.64, and in practical terms we can reject the
hypothesis that the true value of θ is θ⋆ = 0.64.
Figure 4: a) Sampling distribution of the binomial maximum likelihood estimator θ̂ when n = 250 and
θ⋆ = 0.64. b) Sampling distribution of the binomial maximum likelihood estimator θ̂ when n = 250 and
θ⋆ = 0.5. In both cases, as in Figure 3, we limit the x axis to just beyond those values of θ̂ that have nontrivial
probabilities. The shaded tail areas correspond to values of θ̂ that are as or more extreme, relative to the
center, than the value of the estimator we observed, which was θ̂ = 140/250 = 0.56.
In more detail, the p-value is the sum of the two tails of the sampling distribution when θ⋆ = 0.64. The lower
tail is defined by all those values from θ̂ = 0 up to θ̂ = 0.56. These are values of θ̂ that are less than or equal to
the observed θ̂ of 140/250 = 0.56. The upper tail is defined by all those values from θ̂ = 0.64 + |0.64 − 0.56| = 0.72
up to 1. These are the values of θ̂ that are as extreme relative to θ⋆ = 0.64 as is 0.56, but in the opposite direction.
Thus, the p-value is calculated as follows:
$$\text{p-value} = \underbrace{\int_0^{0.56} \binom{n}{\hat\theta n} \theta_H^{\hat\theta n} (1-\theta_H)^{n-\hat\theta n}\, d\hat\theta}_{\text{area of lower tail}} + \underbrace{\int_{0.72}^{1} \binom{n}{\hat\theta n} \theta_H^{\hat\theta n} (1-\theta_H)^{n-\hat\theta n}\, d\hat\theta}_{\text{area of upper tail}} \approx 0.01.$$
Here, for clarity, we distinguish between θ̂obs , which is the actual value of the estimator calculated from the
observed data, θH , which is the hypothesized value of θ, and θ̂, which is the estimator’s random variable
whose distribution is the sampling distribution when θ = θH .
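Because the sampling distribution is discrete, the two tail areas can be computed directly by summing binomial probabilities (a sketch: θ̂ ≤ 0.56 corresponds to m ≤ 140, and θ̂ ≥ 0.72 corresponds to m ≥ 180):

```r
# Two-tailed p-value for the hypothesis theta_H = 0.64, given m = 140 heads
# in n = 250 spins.
n <- 250; theta_h <- 0.64
lower_tail <- sum(dbinom(0:140, n, theta_h))   # theta_hat values <= 0.56
upper_tail <- sum(dbinom(180:n, n, theta_h))   # theta_hat values >= 0.72
p_value <- lower_tail + upper_tail
p_value  # approximately 0.01
```

Note that binom.test, used below, defines its two-sided p-value slightly differently, by summing the probabilities of all outcomes no more probable than the observed one, so the two calculations can differ slightly.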
We can calculate p-values for binomial problems like this using R with the binom.test command. We need
to pass in the observed number of “successes”, which are heads in this case, as the value of the x parameter,
and the sample size as the value of n. The hypothesized value of θ is passed in as the value of p.
binom_model <- binom.test(x = 140, n = 250, p = 0.64)
The value of the maximum likelihood estimator is given by the value of the estimate element of the output
object.
binom_model$estimate
## probability of success
## 0.56
The p-value is given by the value of the p.value element.
binom_model$p.value
## [1] 0.01004809
A common null hypothesis in a problem like this is that the coin is unbiased, i.e. that the true value of θ is
θH = 0.5. Note that in this case the lower tail is defined up to 0.5 − |0.5 − 0.56| = 0.44, and the upper tail from
the observed value of θ̂ = 0.56. The resulting p-value of 0.066 is not sufficiently low to reject the null hypothesis
at the conventional α = 0.05 threshold, though of course it is quite close to this threshold too. We can calculate
this p-value using binom.test.
null_binom_model <- binom.test(x = 140, n = 250, p = 0.5) # p = 0.5 is default
null_binom_model$p.value
## [1] 0.06642115
As a general point about null hypothesis testing, given that a null hypothesis is a hypothesis of no interesting
effect, if we reject that hypothesis, we say the result is significant. Saying that a result is significant in
this sense of the term, however, is not necessarily saying much. The estimated effect may be small or even
negligible in practical terms, but may still be statistically significant. Moreover, even a highly significant p-value
does not necessarily mean a large effect in practical terms. As an example, there were 731,213 live births in
the United Kingdom in 2018. The p-value for the null hypothesis that the probability of a birth being
male rather than female is 0.5 is ≈ 3.8 × 10−119 . This is a tiny p-value and so is an extremely significant result.
However, the (maximum likelihood) estimate of the probability of a male birth is 0.514. Although certainly
not practically meaningless, this is nonetheless quite a small effect: it corresponds to around 28 more males
than females in every 1000 births. As such, an extremely statistically significant result corresponds to a small
effect in practical terms. In general, because the p-value for a non-null true effect will always decrease as the
sample size increases, we can always construct cases of arbitrarily small p-values for arbitrarily small effects,
and so even practically trivial effects may be highly statistically significant.
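This last point is easy to demonstrate: fixing the estimated effect at θ̂ = 0.514 and testing against the null hypothesis θ = 0.5, the p-value shrinks steadily as the sample size grows (a sketch; the sample sizes are hypothetical round numbers, not the actual UK birth data):

```r
# p-value against the null theta = 0.5 for a fixed small effect
# (theta_hat = 0.514) at increasing sample sizes.
p_at <- function(n) binom.test(round(0.514 * n), n, p = 0.5)$p.value
sapply(c(1000, 10000, 100000), p_at)
```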
Confidence intervals
Confidence intervals are counterparts to p-values. As we’ve seen, each p-value corresponds to a particular
hypothesis about the true value of θ. If the p-value is sufficiently low, we will reject the corresponding
hypothesis. If the p-value is not sufficiently low, we can not reject the hypothesis. However, not rejecting a
hypothesis does not entail that we should accept that hypothesis; it simply means that we cannot rule it
out. The set of hypothetical values of θ that we do not reject at the α p-value threshold corresponds to the
1 − α confidence interval. Practically speaking, we can treat all values in this range as the set of plausible
values for θ.
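This correspondence between tests and intervals can be sketched directly: compute the p-value for a grid of hypothesized θ values and keep those that are not rejected (a sketch, not the book's code; binom.test's two-sided p-value is used for each hypothesis, so the endpoints can differ very slightly from the exact interval given later):

```r
# The 95% confidence interval as the set of hypothesized theta values
# that a test at alpha = 0.05 does not reject.
theta_h <- seq(0.40, 0.70, by = 0.001)
pvals <- sapply(theta_h, function(p) binom.test(140, 250, p = p)$p.value)
ci <- range(theta_h[pvals >= 0.05])
ci  # close to the interval [0.498, 0.622] reported below
```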
Figure 5: The p-value for each hypothetical value of θ from 0.46 to 0.66 in steps of 10−3 . Shaded in grey
are all those values of θ that have p-values less than α = 0.05. These values of θ would be rejected in a
hypothesis test. All other values of θ have p-values greater than α = 0.05 and so would not be rejected by a
hypothesis test. These values of θ comprise the confidence interval.
In Figure 5, we shade in yellow all those values of θ that would not be rejected by a hypothesis test at
α = 0.05. Therefore, these values of θ are in the 1 − α = 0.95 confidence interval. The lower and upper bounds
of this interval as calculated in this figure, which is based on a discretization of the θ interval into steps of
10−3 , are [0.498, 0.622]. The interval can be calculated exactly by using a relationship between the cumulative
binomial distribution and the cumulative Beta distribution, a method known as the Clopper-Pearson method
(Clopper and Pearson 1934). Using R, we can calculate this Clopper-Pearson confidence interval as follows.
c(qbeta(0.025, 140, 250 - 140 + 1),
qbeta(1-0.025, 140+1, 250 - 140))
## [1] 0.4960662 0.6224941
This interval is also returned by binom.test.
M_binom <- binom.test(x = 140, n = 250)
M_binom$conf.int
## [1] 0.4960662 0.6224941
## attr(,"conf.level")
## [1] 0.95
Confidence intervals have the following frequentist property. In an infinite repetition of an experiment (here,
we use the term experiment in the sense in which it is used in probability theory, namely a well defined procedure
that generates data according to a specified probability distribution), the 95% confidence interval will contain
the true value of the unknown variable 95% of the time. What this means in the case of the present problem
is that if we were to repeat ad infinitum the original coin spinning experiment under identical conditions,
so that the probability of a heads outcome is a fixed though unknown value of θ, and on each repetition
calculate the 95% confidence interval, then 95% of these confidence intervals would contain the true value
of θ. In Figure 6, we provide an illustration of this phenomenon. For 100 repetitions, we generate data
from a binomial distribution with n = 250 and θ = 0.6. From the data on each repetition, we calculate the
90% confidence interval. Those intervals that do not contain θ = 0.6 are shaded in grey. In this set of 100
repetitions, there are 92 intervals out of 100 that contain θ = 0.6.
Figure 6: The 90% confidence intervals in 100 repetitions of a binomial experiment with n = 250 and θ = 0.6.
Shaded in grey are any confidence intervals that do not contain θ = 0.6.
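The repetition illustrated in Figure 6 is straightforward to simulate (a sketch; with a different random seed, the exact number of covering intervals will vary around the nominal 90%):

```r
# Simulate 100 repetitions of the binomial experiment (n = 250, theta = 0.6)
# and record how many 90% confidence intervals contain the true theta.
set.seed(42)
n_reps <- 100
m <- rbinom(n_reps, size = 250, prob = 0.6)
cis <- t(sapply(m, function(x) binom.test(x, 250, conf.level = 0.90)$conf.int))
coverage <- mean(cis[, 1] <= 0.6 & cis[, 2] >= 0.6)
coverage  # should be near 0.90
```

Because the Clopper-Pearson interval is conservative, the empirical coverage is typically at or slightly above the nominal level.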
Priors
Having established our statistical model, to perform Bayesian inference on the value of θ, we must first
provide a probability distribution for the possible values that θ can take on in principle. This is known as the
prior distribution. As an example, if we assume that θ can take on any possible value in the interval [0, 1]
and each value has equal likelihood, our prior would be a uniform distribution over θ. On the other hand,
if we assume that θ values are more likely to be equal to θ = 0.5, but possibly be above or below θ = 0.5
too, our prior might be a symmetrical unimodal distribution centered at θ = 0.5. How wide this unimodal
distribution is depends on what we think are the relative probabilities of values close to and far from θ = 0.5.
Likewise, if we assume that θ is likely to correspond to a bias towards heads, then the prior might be another
symmetrical unimodal distribution but centered on some value of θ greater than 0.5. Examples of priors like
this are shown in Figure 7.
Figure 7: Examples of priors over θ. a) A uniform prior. b-c) Priors where θ is more likely to be 0.5 than
otherwise but values greater and less than 0.5 are probable too, and more so in the case of c) whose variance
is wider than the prior in b). d) A prior where θ is more likely to be θ = 0.6 than any other value and more
likely to be above rather than below 0.5.
There is a vast literature on what priors are, how to choose them, and the relative merits of, for example,
subjective, objective, informative, noninformative, weakly informative, reference, maximum entropy, and
other kinds of priors. We will not even attempt a summary of this vast literature here. Our general advice on
the topic is that the choice of a prior should not be automatic, but rather should be treated as a modelling
assumption, just like the assumptions that lead to our choice of the original statistical model. When choosing
the prior, just like when we choose the statistical model, we should follow a number of guiding principles:
we should use our general understanding of the problem at hand, and also of the data itself; our choices
should be reasonable and justifiable rather than arbitrary; our choices should be seen as tentative and subject
to revision if the assumptions or reasoning on which they were based are found to be invalid or wrong. More
specifically, following Gelman, Simpson, and Betancourt (2017), we recommend that our choices of priors
should be guided by whether the prior could generate the type of data that we expect to see in the problem
at hand, whether it covers the range of plausible values of the unknown parameter, and whether most of the
prior mass is in the parts of the parameter space that are likely to generate the type of data we are modelling.
It should be emphasised, however, that although in general the choice of priors is always important, it has
more practical consequences in some problems than in others. In particular, in relatively simple models where
the ratio of observed data to unknown variables is relatively high, as is the case with the present problem,
most choices of priors, unless they are extreme, will lead to practically similar conclusions. In complex models,
on the other hand, or models where the amount of data relative to the number of unknown variables is low,
the priors play a much more important role in the final fitted model, and so choices of priors in these contexts
need to be more careful and judicious.
For the present problem, the parameter θ is the probability that the coin lands heads up after a spin. The
value of θ is a function of both the physical properties of the coin and also the manner in which it is spun. It
is arguably difficult to physically bias a coin so that one side is more likely in a flip (Gelman and Nolan 2002),
but experimental results with different materials can lead to outcomes that are as much as 70-30 biased
(Kerrich 1946), and precise control over how it is flipped can lead to even 100% biased outcomes (Diaconis,
Holmes, and Montgomery 2007). Coin spins, as opposed to flips, can be easily biased to as much as 75-25
bias (Gelman and Nolan 2002). From this, it seems reasonable to have a unimodal and symmetric prior
centered at θ = 0.5, but with a relatively wide spread to allow for relatively extreme possibilities. For this, we
will use the prior displayed in Figure 7c. Clearly, the most likely value of θ according to this prior is θ = 0.5,
but values of θ = 0.25 or θ = 0.75 have substantial prior mass in their vicinity, and even extreme values of
greater than θ = 0.9 or less than θ = 0.1 have non-trivial prior mass.
The prior that we have chosen (displayed in Figure 7c) is a beta distribution, which is a probability distribution
over the interval from 0 to 1. There are two parameters in the beta distribution, which we call hyperparameters
to distinguish them from the parameters of the statistical model, and these are conventionally denoted by α
and β. In the particular beta distribution that we have chosen their values are α = 5 and β = 5. The density
function of any beta distribution is
P(θ|α, β) = [Γ(α + β) / (Γ(α)Γ(β))] θ^(α−1) (1 − θ)^(β−1),
and its mean and variance are
⟨θ⟩ = α / (α + β),   Var(θ) = αβ / ((α + β)² (α + β + 1)).
For our choice of prior therefore, the mean of the distribution is 0.5 and the variance is 0.023 (and so the
standard deviation is 0.151).
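These summaries are easy to verify numerically. The following R sketch (the variable names are ours) checks the mean, variance, and standard deviation of the Beta(5, 5) prior:

```r
# Mean, variance, and standard deviation of the Beta(5, 5) prior
alpha <- 5
beta <- 5
prior_mean <- alpha / (alpha + beta)
prior_var <- (alpha * beta) / ((alpha + beta)^2 * (alpha + beta + 1))
prior_sd <- sqrt(prior_var)
round(c(mean = prior_mean, var = prior_var, sd = prior_sd), 3)
##  mean   var    sd
## 0.500 0.023 0.151
```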
Y ∼ Binom(θ, n = 250),
θ ∼ Beta(α = 5, β = 5)
We can view this model as an expanded generative model of the observed data. In other words, to generate
hypothetical data sets, first, we sample a value of θ from a beta distribution with hyperparameters α = β = 5,
and then we sample a value from a binomial distribution with parameter θ and sample size n = 250. This
expanded model therefore defines a joint probability distribution over Y and θ, conditional on α, β and n,
i.e. P(Y, θ|α, β, n). We will use this joint probability distribution, coupled with the fact that we observe the
value of Y , to calculate a probability distribution over the possible values of θ conditional on all the observed
data and the assumptions of the model. This is known as the posterior distribution, and is the central result
in Bayesian inference. To understand how the posterior distribution is calculated, we must introduce some
elementary results from probability theory.
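As a sketch, this generative process can be simulated in R as follows (the variable names are ours); each value of y_sim is a draw from the prior predictive distribution of Y:

```r
# Simulate the expanded generative model:
# theta ~ Beta(5, 5), then Y ~ Binomial(n = 250, theta)
set.seed(101)
n_sims <- 10000
theta_sim <- rbeta(n_sims, shape1 = 5, shape2 = 5)
y_sim <- rbinom(n_sims, size = 250, prob = theta_sim)
summary(y_sim)
```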
If we have two random variables A and B, an elementary probability theory result shows how the joint
probability distribution of A and B can be factored into products of conditional and marginal probability
distributions:

P(A, B) = P(A|B)P(B) = P(B|A)P(A).
Counts:

          Perished  Survived  Total
Male          1364       367   1731
Female         126       344    470
Total         1490       711   2201

Joint probabilities:

          Perished  Survived  Total
Male         0.620     0.167  0.786
Female       0.057     0.156  0.214
Total        0.677     0.323  1.000
In other words, the joint probability of A and B is equal to the conditional probability of A given B times
the probability of B, or equally, it is equal to the conditional probability of B given A times the probability
of A. Another elementary result is how we calculate marginal probability distributions from summations over
joint distributions:

P(A) = ∑_{B} P(A, B) = ∑_{B} P(A|B)P(B),

P(B) = ∑_{A} P(A, B) = ∑_{A} P(B|A)P(A),
where {A} and {B} in the summations indicate the set of all values of A and B, respectively.
We can illustrate these results easily using any joint probability distribution table. For example, we can use
the numbers in the tables above, which give the number of males and females who survived or perished on the
RMS Titanic. There were 2201 people on board, and so we can divide the number in each category by 2201
to obtain the joint probability distribution table.
Note that the row and column totals are also provided: the row totals give the probabilities of a person on
board the Titanic being male or female, and the column totals give the probabilities of surviving or perishing.
From the joint probability distribution,
we can get the two conditional probability distribution tables. The first tells us the probability of surviving
or perishing given that the person is male or female:

          Perished  Survived
Male         0.788     0.212
Female       0.268     0.732

The second tells us the probability of being male or female given that the person survived or perished:

          Perished  Survived
Male         0.915     0.516
Female       0.085     0.484
Using the joint table, we can see that the probability of a person being both a male and a survivor is

P(Sex = Male, Survival = Survived) = 0.167.
This is equal to the probability of being a male given that the person survived times the probability of being
a survivor.
P(Sex = Male|Survival = Survived)P(Survival = Survived) = 0.516 × 0.323 = 0.167.
It is also equal to the probability of being a survivor given that the person is a male
P(Survival = Survived|Sex = Male)P(Sex = Male) = 0.212 × 0.786 = 0.167.
The same holds for any other element of the joint probability table. Using these tables we can also calculate
marginal probabilities.
P(Sex = Male) = P(Sex = Male, Survival = Survived) + P(Sex = Male, Survival = Perished),
= P(Sex = Male|Survival = Survived)P(Survival = Survived)
+ P(Sex = Male|Survival = Perished)P(Survival = Perished),
= 0.516 × 0.323 + 0.915 × 0.677,
= 0.786
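These calculations can be reproduced in R directly from the table of counts (the object names are ours):

```r
# Joint, conditional, and marginal probabilities from the Titanic counts
counts <- matrix(c(1364, 367,
                    126, 344),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Male", "Female"),
                                 c("Perished", "Survived")))
joint <- counts / sum(counts)              # P(Sex, Survival)
p_survival <- colSums(joint)               # P(Survival)
# P(Sex | Survival): divide each column of the joint table by its total
sex_given_survival <- sweep(joint, 2, p_survival, "/")
# Marginalization: P(Male) = sum over Survival of P(Male|Survival)P(Survival)
sum(sex_given_survival["Male", ] * p_survival)  # approximately 0.786
```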
With these elementary results from probability theory, we end up with the following uncontroversial result
known as Bayes’ rule.
P(B|A) = P(A|B)P(B) / ∑_{B} P(A|B)P(B).
We can use this result to solve elementary probability puzzles like the following².
Box A has 10 lightbulbs, of which 4 are defective. Box B has 6 lightbulbs, of which 1 is defective.
Box C has 8 lightbulbs, of which 3 are defective. A box is chosen at random and a bulb is drawn
from it. If the bulb is nondefective, what is the probability that it came from Box C?
Here, we are being asked for P(Box = C|Bulb = Working). By Bayes’ rule this is
P(Bulb = Working|Box = C)P(Box = C)
P(Box = C|Bulb = Working) = ,
P(Bulb = Working)
where the denominator here is
P(Bulb = Working) =P(Bulb = Working|Box = A)P(Box = A)
+ P(Bulb = Working|Box = B)P(Box = B)
+ P(Bulb = Working|Box = C)P(Box = C).
The conditional probabilities of drawing a working bulb from box A, B, or C are given by knowing the
number of bulbs in each box and the number of defective bulbs in each, and so are 6/10, 5/6, and 5/8,
respectively. The marginal probabilities of boxes A, B, and C are 1/3, 1/3, and 1/3. From this, we have
P(Box = C|Bulb = Working) = (5/8 × 1/3) / (6/10 × 1/3 + 5/6 × 1/3 + 5/8 × 1/3)
                          = (5/8) / (6/10 + 5/6 + 5/8)
                          = 0.304.
2 This problem is taken from Schaum’s Outlines: Probability (2nd ed., 2000), pages 87-88.
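The same answer can be obtained numerically in R (the vector names are ours):

```r
# Bayes' rule for the lightbulb puzzle
p_box <- c(A = 1/3, B = 1/3, C = 1/3)        # marginal probability of each box
p_working <- c(A = 6/10, B = 5/6, C = 5/8)   # P(working bulb | box)
posterior <- (p_working * p_box) / sum(p_working * p_box)
round(posterior["C"], 3)  # 0.304
```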
Returning now to our joint distribution over Y and θ, i.e., P(Y, θ|α, β, n), we have the following.
The left hand side is the posterior distribution. There are three components on the right hand side. First,
there is P(Y |θ, n). Here, the value of Y and n are known and θ is a free variable, and so P(Y |θ, n) is a
function over θ. This is the likelihood function over θ that we have already encountered, and which we can
also write as L(θ|Y, n). Second, there is P(θ|α, β), which is the prior. Like the likelihood function, this is
also a function over the set of all possible values of θ. The third component is the denominator, which is the
integral of the product of the likelihood function and the prior, integrated over all possible values of θ. This
integral is known as the marginal likelihood, and is a single value that gives the area under the curve of the
function that is the product of the likelihood function and the prior. We can write this as follows.
P(θ|Y, α, β, n) = L(θ|Y, n) P(θ|α, β) / ∫ L(θ|Y, n) P(θ|α, β) dθ,

where the numerator is the product of the likelihood and the prior, the denominator is the marginal
likelihood, and the left hand side is the posterior.
What this tells us is that having observed the data, the probability distribution over the possible values of θ
is the normalized product of the likelihood function and the prior over θ. The prior tells us what values θ
could take on in principle. The likelihood function effectively tells us the evidence in favour of any given value
of θ according to the data. We multiply these two functions together (the numerator above) and then divide
by the area under the curve of this product of functions (the marginal likelihood, which is the denominator).
Dividing by the marginal likelihood ensures that the area under the curve of the posterior is normalized so
that its integral is exactly 1.
Filling out the detail of this equation, we have the following.
P(θ|Y, α, β, n) = L(θ|Y, n) P(θ|α, β) / ∫ L(θ|Y, n) P(θ|α, β) dθ

                = θ^m (1 − θ)^(n−m) × [Γ(α+β)/(Γ(α)Γ(β))] θ^(α−1) (1 − θ)^(β−1) / ∫ θ^m (1 − θ)^(n−m) × [Γ(α+β)/(Γ(α)Γ(β))] θ^(α−1) (1 − θ)^(β−1) dθ

                = θ^(m+α−1) (1 − θ)^(n−m+β−1) / ∫ θ^(m+α−1) (1 − θ)^(n−m+β−1) dθ.
Figure 8: The posterior, likelihood, and prior in a binomial problem with m = 140 and n = 250 and where
the prior is a beta distribution with α = 5, β = 5. In b) the same functions are plotted as in a) but over a
limited range of the x-axis in order to make the difference between the posterior and the likelihood more
apparent. Note also that the likelihood function is scaled so that it integrates to 1. This is simply to make it
easier to visualize on the same plot as the prior and posterior. Scaling the likelihood by an arbitrary positive
quantity does not affect the calculation of the posterior.
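We can confirm numerically that normalizing the product of the likelihood and the prior recovers the analytic posterior; here is a grid-approximation sketch in R (the variable names are ours):

```r
# Normalize likelihood x prior on a grid and compare to the analytic
# Beta(m + alpha, n - m + beta) posterior
m <- 140; n <- 250; alpha <- 5; beta <- 5
theta <- seq(0.001, 0.999, length.out = 1000)
h <- theta[2] - theta[1]
unnorm <- theta^m * (1 - theta)^(n - m) * dbeta(theta, alpha, beta)
numeric_post <- unnorm / (sum(unnorm) * h)    # grid-based normalization
analytic_post <- dbeta(theta, m + alpha, n - m + beta)
max(abs(numeric_post - analytic_post))        # small; numerical error only
```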
Having a mathematical formula for the posterior distribution, as we do here, is an example of an analytic
solution to the posterior distribution. This is not always possible. In other words, it is not always possible to
have a mathematical expression with a finite number of terms and operations that describes the posterior
distribution exactly. When an analytic solution is not possible, and in fact it is only possible in a relatively
limited number of cases, we must rely on numerical methods to evaluate the posterior distribution, and in
particular, Monte Carlo methods. We will return to this important topic in detail below.
Posterior summaries
The result of any Bayesian inference of any unknown variable is the posterior distribution. On the assumption
that the statistical model and the prior are valid, the posterior distribution provides us everything that is
known about the true value of the variable. For the current problem, this is P(θ|Y = m, α = 5, β = 5, n = 250),
which is a beta distribution with hyperparameters m + α and n − m + β. From the properties of the beta
distribution mentioned above, the mean and standard deviation of this posterior are as follows:

⟨θ⟩ = (α + m) / (α + β + n) = 0.558,

SD(θ) = √[ (α + m)(β + n − m) / ((α + β + n)² (α + β + n + 1)) ] = 0.031.
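In R, these posterior summaries can be computed as follows (the variable names are ours):

```r
# Posterior mean and standard deviation of the Beta(m + alpha, n - m + beta)
n <- 250; m <- 140; alpha <- 5; beta <- 5
a_post <- alpha + m       # 145
b_post <- beta + n - m    # 115
post_mean <- a_post / (a_post + b_post)
post_sd <- sqrt((a_post * b_post) /
                ((a_post + b_post)^2 * (a_post + b_post + 1)))
round(c(mean = post_mean, sd = post_sd), 3)
##  mean    sd
## 0.558 0.031
```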
We can also use the cumulative distribution function of the beta distribution to determine the 2.5th and
97.5th percentiles of the posterior. Between these two bounds, there is 95% of the posterior probability mass.
We can calculate this using R with the qbeta command, which is the inverse of the cumulative distribution
function of the beta distribution.
n <- 250
m <- 140
alpha <- 5
beta <- 5
c(qbeta(0.025, m + alpha, n - m + beta),
qbeta(0.975, m + alpha, n - m + beta))
## [1] 0.4970679 0.6174758
This is a posterior interval, also known as a credible interval. More precisely, this interval is the central or
percentile posterior interval.
Another posterior interval of interest is known as the high posterior density (hpd) interval. The precise
definition of the hpd is as follows. The ϕ hpd interval for the posterior density function f (θ) is computed by
finding a probability density value p∗ such that
P({θ : f (θ) ≥ p∗ }) = ϕ.
In other words, we find the value p∗ such that the probability mass of the set of points whose density is
greater than p∗ is exactly ϕ. By this definition, no value of θ outside the hpd has a higher density than
any value within the hpd. It will also be the shortest posterior interval containing probability mass ϕ. The hpd as precisely
defined here is not easily calculated, and in general requires an optimization procedure to solve for p∗ . In this
model, using a numerical optimization technique, we calculate it to be (0.4974, 0.6177). This is obviously
very close to the quantile based posterior interval defined above.
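Because the posterior here is unimodal, the hpd interval coincides with the shortest interval containing 95% of the mass, which we can find with a one-dimensional optimization rather than solving for p∗ directly. A sketch in R (the function and variable names are ours):

```r
# hpd as the shortest 95% interval of the Beta(145, 115) posterior:
# minimize the interval width over the lower tail probability p
a <- 145; b <- 115; phi <- 0.95
interval_width <- function(p) qbeta(p + phi, a, b) - qbeta(p, a, b)
p_lo <- optimize(interval_width, interval = c(0, 1 - phi))$minimum
c(qbeta(p_lo, a, b), qbeta(p_lo + phi, a, b))
# approximately (0.4974, 0.6177), as reported in the text
```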
The posterior mean and the hpd interval can be compared to the maximum likelihood estimator and confidence
interval in classical inference. Recall that the maximum likelihood estimator was 0.56 and the 95% confidence
interval was (0.496, 0.622). By contrast, the posterior mean and 95% central posterior interval is 0.558 and
(0.497, 0.617), respectively. While not identical, these are very close and for practical purposes are probably
interchangeable. This illustrates that classical and Bayesian methods can, especially in simple models, lead to
essentially indistinguishable results.
Despite the similarities of these two priors, if we use the normal distribution over the log odds, the posterior
distribution is as follows:

P(θ|m, n, τ) = (1/Z) θ^m (1 − θ)^(n−m) × [1/(θ(1 − θ))] × e^(−logit(θ)²/(2τ²)),
Figure 9: a) A zero mean normal distribution with standard deviation τ = 0.5 over log(θ/(1 − θ)). b) A beta
distribution with parameters α = β = 8.42.
where

Z = ∫ θ^m (1 − θ)^(n−m) × [1/(θ(1 − θ))] × e^(−logit(θ)²/(2τ²)) dθ.
This does not simplify to an analytic expression, and it is not a known probability density function with well
documented properties. The problem here arises primarily because of the integral, which does not have an
analytic solution. In cases like this, we need to resort to numerical alternatives to the analytic solution. For
simple models such as this one, there are many options for how to do this. However, a general approach
that applies to all Bayesian models, including and especially complex and high dimensional (i.e. with large
numbers of unknown variables) models, is to use Monte Carlo sampling methods.
Monte Carlo methods, first introduced by Metropolis and Ulam (1949), can be generally defined as numerical
methods for approximating mathematical expectations of random variables. If X is a random variable of
dimensionality d whose probability distribution is P(X), and φ(X) is a function of X, then the expectation
or expected value of φ(X) is

⟨φ(X)⟩ = ∫ φ(x) P(X = x) dx.

The Monte Carlo method approximates this expectation by the average

⟨φ(X)⟩ ≈ (1/n) ∑_{i=1}^{n} φ(x_i),

where x_1, x_2 . . . x_i . . . x_n are n samples from P(X). Particularly important is the fact that the error of
this approximation decreases as a function of √n and is independent of d, the dimensionality of X.
Quantities of interest related to the probability distribution P(X) such as the mean, variance, or probabilities
of being above or below a certain value can all be expressed as expectations:
⟨X⟩ = ∫ x P(X = x) dx,   V(X) = ∫ (x − ⟨X⟩)² P(X = x) dx,   P(X > x₀) = ∫ I(x > x₀) P(X = x) dx.
What this entails for Bayesian inference is that if we can draw samples from the posterior distribution, Monte
Carlo methods can be used to calculate all quantities of interest related to the posterior. Moreover, this
can be done in high dimensional problems without encountering the exponential rise in approximation error
known as the curse of dimensionality.
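As a simple illustration of these ideas, consider a standard normal random variable; its mean, variance, and tail probabilities can all be approximated by averages over samples (a sketch, with values chosen by us):

```r
# Monte Carlo approximations of expectations of X ~ N(0, 1)
set.seed(10101)
x <- rnorm(1e6)
mean(x)       # approximates <X> = 0
mean(x^2)     # approximates V(X) = 1, since <X> = 0
mean(x > 1)   # approximates P(X > 1) = 1 - pnorm(1), about 0.159
```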
There are many Monte Carlo methods for drawing samples from posterior probability distributions. Here, we
will describe just two: the Metropolis sampler, and the Hamiltonian Monte Carlo (hmc) sampler. Both of
these are Markov Chain Monte Carlo (mcmc) samplers, as we will see. The Metropolis sampler was one of
the earliest mcmc samplers, introduced by Metropolis et al. (1953), and has traditionally been one of the
most widely used samplers in Bayesian analysis. The hmc sampler is an extension of the Metropolis sampler.
It is the main sampler used in the Stan probabilistic programming language that we will use extensively
throughout the remainder of this book.
In order to discuss this topic generally, rather than just for a specific problem or model, we will assume
that our observed data is D and that the unknown variable(s) that we are trying to infer is a d dimensional
variable θ. The posterior distribution is
P(θ|D) = L(θ|D) P(θ) / ∫ L(θ|D) P(θ) dθ,
where L(θ|D) is the likelihood function and P(θ) is the prior. We can rewrite the posterior distribution as
P(θ|D) = (1/Z) f(θ),   f(θ) = L(θ|D) g(θ),   Z = ∫ L(θ|D) g(θ) dθ.
Here, f (θ) is the unnormalized posterior distribution, where for notational simplicity we drop explicit reference
to D, Z is the normalisation constant, and g(θ) is the prior distribution.
Metropolis sampler
For the Metropolis sampler, we need only be able to evaluate f (θ) at any given value of θ, and we do not
need to know the value of Z. Evaluating f (θ) is often very straightforward even in complex models. If we can
evaluate the likelihood function at θ and the (normalized or unnormalized) prior at θ, we simply multiply their
values together to obtain f(θ). To draw samples using the Metropolis sampler, we begin with an arbitrary
initial value θ̃0, and then use a proposal distribution based around θ̃0 to propose a new point in the θ space,
which we will call θ̃′. We can write this proposal distribution as Q(θ̃′|θ̃0). In the standard implementation of
the Metropolis sampler, the proposal distribution must be symmetric such that for any two values θ̃i and θ̃j,
Q(θ̃i|θ̃j) = Q(θ̃j|θ̃i). In the Metropolis-Hastings variant of the Metropolis sampler, as we will see, the proposal
distribution need not be symmetric.

Having sampled θ̃′, if f(θ̃′) ≥ f(θ̃0), we accept the proposed new point and set θ̃1 = θ̃′. We then propose a
new point in θ space using our proposal distribution centered at θ̃1. If, on the other hand, f(θ̃′) < f(θ̃0), we
accept θ̃′ with probability

f(θ̃′) / f(θ̃0).

If we accept it, then we set θ̃1 = θ̃′ and propose a new point using our proposal distribution centered at θ̃1. If
we reject θ̃′, we then propose a new point using the proposal distribution centered at θ̃0, and repeat the
above steps.
Continuing in this way, we produce a sequence of samples θ̃0 , θ̃1 , θ̃2 . . . that are realizations of random variables
θ0 , θ1 , θ2 . . .. Because each variable is dependent on the immediately preceding variable, and not dependent
on any variable before that, these random variables form a first-order Markov chain. The marginal probability
distributions of these variables are π0 (θ0 ), π1 (θ1 ), π2 (θ2 ) . . .. The first distribution, π0 (θ0 ), is an arbitrary
starting distribution. The nature of the second distribution, π1(θ1), can be understood as follows. If θ0 = θ̃a,
then the probability that θ1 = θ̃b will be Q(θ̃′ = θ̃b|θ̃a) if f(θ̃b) ≥ f(θ̃a), and will be Q(θ̃′ = θ̃b|θ̃a) × f(θ̃b)/f(θ̃a)
otherwise. Thus, π1 (θ1 ) is
π1(θ1) = ∫ T(θ1|θ0) π0(θ0) dθ0,   where T(θ1|θ0) = min( f(θ1)/f(θ0), 1 ) Q(θ1|θ0).
It can be shown that under some general and minimal conditions3 of Markov chains, the sequence of probability
distributions π0 (θ0 ), π1 (θ1 ), π2 (θ2 ) . . . will converge upon a unique distribution that we will label π(θ). This
is known as the invariant or stationary distribution of the Markov chain. Furthermore, if we have a function
ψ(θ) that satisfies
T (θi = θ̃b |θi−1 = θ̃a )ψ(θi−1 = θ̃a ) = T (θi = θ̃a |θi−1 = θ̃b )ψ(θi−1 = θ̃b )
for any two states θ̃a and θ̃b, then ψ(θ) is this invariant distribution. We can see that ψ(θ) is the posterior
distribution as follows. First, we will assume that f(θ̃a) > f(θ̃b). If this is not the case, we simply reverse the
labels of θ̃a and θ̃b. Then we have the following.
T(θ̃b|θ̃a) ψ(θ̃a) = T(θ̃a|θ̃b) ψ(θ̃b),

Q(θ̃b|θ̃a) [f(θ̃b)/f(θ̃a)] ψ(θ̃a) = Q(θ̃a|θ̃b) ψ(θ̃b).

Because of the symmetry of the proposal distribution, Q(θ̃b|θ̃a) = Q(θ̃a|θ̃b), and so we have

[f(θ̃b)/f(θ̃a)] ψ(θ̃a) = ψ(θ̃b),

ψ(θ̃a) / ψ(θ̃b) = f(θ̃a) / f(θ̃b).
From this we have that

ψ(θ) = (1/Z) f(θ)

is the invariant distribution of the Markov chain.
In Figure 10, we show some of the sequence of distributions π0 , π1 , π2 . . . of a Metropolis sampler as it
converges to the true posterior distribution, which is also shown. For this illustration, we use a binomial
problem like above but where the data is m = 14 and n = 25, use a logit-Normal prior, and assume a uniform
proposal distribution. From this illustration, we see a relatively quick convergence to the true posterior
distribution.
The Metropolis-Hastings variant of the original Metropolis sampler allows for asymmetric proposal distri-
butions. Given a current sample θ̃i and a proposed new sample θ̃′, sampled from the proposal distribution
Q(θ̃′|θ̃i), we accept θ̃′ if

f(θ̃′) Q(θ̃i|θ̃′) ≥ f(θ̃i) Q(θ̃′|θ̃i),

and otherwise we accept it with probability

f(θ̃′) Q(θ̃i|θ̃′) / [ f(θ̃i) Q(θ̃′|θ̃i) ].
3 These are irreducibility, which is the non-zero probability of eventually transitioning from any one state to any other,
aperiodicity, which is having no perfectly cyclic state trajectories, and non-transience, which is the non-zero probability of
returning to any given state.
Figure 10: Convergence of a Metropolis sampler to the posterior distribution having started at an arbitrary
distribution. The panels show the initial distribution and the distributions after 2 and 3 iterations.
Sampling in this way, as with the original Metropolis sampler, we converge upon an invariant probability
distribution over θ that is the posterior distribution (1/Z) f(θ).
To illustrate the Metropolis algorithm, we will use our binomial problem with the logit normal distribution.
The likelihood function at any value of θ is
L(θ|m, n) = θ^m (1 − θ)^(n−m).
proposal <- function(theta_0){
  # a uniform proposal on (0, 1); it does not depend on theta_0,
  # and so is trivially symmetric
  runif(1)
}
The following code implements a Metropolis sampler that draws 100 samples.

# data and prior from our binomial problem with the logit-normal prior:
# m = 140 heads in n = 250 spins, tau = 0.5
m <- 140; n <- 250; tau <- 0.5
# unnormalized posterior: binomial likelihood times logit-normal prior
f <- function(th) th^m * (1 - th)^(n - m) *
  exp(-log(th / (1 - th))^2 / (2 * tau^2)) / (th * (1 - th))
nsamples <- 100
theta <- numeric(length = nsamples)
theta[1] <- runif(1) # initial sample from U(0,1)
for (i in 2:nsamples){
  theta_prime <- proposal(theta[i - 1])
  # accept with probability min(1, f(theta_prime)/f(theta[i - 1]));
  # on rejection, the current value is retained as the next sample
  accept <- runif(1) < f(theta_prime) / f(theta[i - 1])
  theta[i] <- if (accept) theta_prime else theta[i - 1]
}
Figure 11: a) The trajectory, or trace-plot, of 100 samples from two separate chains of a Metropolis sampler
for the binomial problem using a logit-normal prior. b) The histogram of ten thousand samples from one of
the samplers.
To appreciate hmc, we return to the fact that in any Bayesian analysis, our results are ultimately expectations
of functions according to the posterior distribution. Thus, if the posterior distribution over our unknown
variables is P(θ|D), the expected value of some function φ(θ) according to the posterior is
Z
hφ(θ)i = φ(θ)P(θ|D)dθ.
The value of this integral is largely determined by where P(θ|D) is massed. In high dimensional spaces, most
of this mass is concentrated in a thin shell far from the mode of the distribution that is known as the typical
set (see Betancourt 2017). All mcmc methods ideally sample from this typical set. However, because the
typical set is a thin shell, the random proposals made by the Metropolis sampler will be mostly to regions of
low probability density outside of it, which will be rejected, leading to inefficiency. Likewise, those proposals
that will be accepted will often be just those close to the current state of the sampler, and in this sense, the
sampler becomes stuck in one region and does not explore the typical set.
hmc tries to overcome these problems with the random walk Metropolis by using the gradient of the posterior
distribution to guide its proposals of new points to sample. For any given value of θ, the gradient of the
posterior distribution is evaluated. This indicates the directions with higher or lower posterior density. Were
we to simply follow the gradient, we would end up moving away from the typical set and toward the mode
of the distribution. While the mode will have high density, it will have infinitesimally low volume in high
dimensional problems, and so is not part of the typical set as defined above. In order to remain within the
typical set, therefore, each point in θ space is given a momentum value denoted by r, which gives a direction
and velocity to each point. This momentum speeds up as the object approaches areas of high density, and
slows and reverses in regions of low density.
In order to implement this system, hmc represents the value of θ as a point in a physical space that has a
potential energy and a kinetic energy. The potential energy is determined by the point's position, and can be
seen as the effect of gravity. The kinetic energy is determined by its momentum. Intuitively, we can imagine
this as a smooth (frictionless) hilly landscape with a ball moving over it. If we
kick the ball, we give it some momentum and it will move in a certain direction with a certain speed. If that
direction is downwards, under the effect of gravity, it will speed up and then when it gets to the bottom of
the hill, its momentum will allow it to continue moving and it will start to roll up the side of another hill.
Eventually, the effect of gravity will start to slow it down, and it will then reverse and roll back down the hill,
and so on. The movement of the ball is physically determined by its position and its momentum and their
respective potential and kinetic energies.
In more detail, the potential energy at each point is given by

U(θ) = −log f(θ),

where f(θ) is the unnormalized posterior distribution defined above, and the kinetic energy of the momentum
r is

K(r) = ½ ∑_{k=1}^{d} r_k² = ½ rᵀr.
Given this energy function, the probability distribution corresponding to it is P (r) ∝ e−K(r) , which is a
d dimensional standard normal distribution. The total energy in this system is given by the Hamiltonian
function:
H(θ, r) = U (θ) + K(r).
The change in the position and the momentum is now determined by classical mechanics and is given the
Hamiltonian equations:
dθk/dt = ∂H/∂rk = ∂K(r)/∂rk,

drk/dt = −∂H/∂θk = −∂U(θ)/∂θk,

where dθk/dt and drk/dt are the rates of change, at dimension k, of the position θ and momentum r,
respectively, and ∂H/∂rk and ∂H/∂θk are the partial derivatives of the Hamiltonian function with respect to
rk and θk, respectively.
Now, if we start at any point in θ space and give this point a momentum, the Hamiltonian equations will
determine how the point will move through this space under the actions of both the potential and kinetic
energy.
In order to simulate this continuous dynamical system, which is governed by differential equations, on a
computer, we must use a discrete approximation to its dynamics. Widely used methods such as Runge-Kutta
methods or Euler’s method are possible, but a more suitable method is the leapfrog algorithm, which involves
taking steps, or half-steps, of size δ as follows:
rk ← rk − (δ/2) ∂U(θ)/∂θk,   first half-step in r space,

θk ← θk + δ ∂K(r)/∂rk,   step in θ space,

rk ← rk − (δ/2) ∂U(θ)/∂θk,   second half-step in r space.
As a simple example of Hamiltonian dynamics, let us assume that the posterior distribution is a two
dimensional normal distribution with a mean µ and covariance matrix Σ. In this case, we have
∂U(θ)/∂θ = (θ − µ)ᵀ Σ⁻¹,   ∂K(r)/∂rk = rk,

and so our leapfrog steps are

r ← r − (δ/2) (θ − µ)ᵀ Σ⁻¹,   θ ← θ + δr.
Starting with an arbitrary point θ̃ = (0, 1.5), and choosing r at random from a 2d standard normal, we
simulate the Hamiltonian dynamics for 45 leapfrog steps with δ = 0.1. This is shown in Figure 12. As we
can see, the trajectories move smoothly back and forth through the posterior distribution.
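A sketch of this simulation in R, under the stated assumptions (we take µ = 0 and Σ to be the identity matrix; the variable names are ours):

```r
# Leapfrog simulation of Hamiltonian dynamics for a 2d normal posterior
set.seed(10)
mu <- c(0, 0)
Sigma_inv <- solve(diag(2))           # identity covariance matrix
delta <- 0.1
theta <- c(0, 1.5)                    # arbitrary starting point
r <- rnorm(2)                         # momentum from a 2d standard normal
grad_U <- function(th) as.vector(Sigma_inv %*% (th - mu))
for (step in 1:45) {
  r <- r - (delta / 2) * grad_U(theta)   # first half-step in r space
  theta <- theta + delta * r             # step in theta space
  r <- r - (delta / 2) * grad_U(theta)   # second half-step in r space
}
theta  # end point of the trajectory
```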
The key feature of hmc that differentiates it from random walk Metropolis is that it replaces the random
proposal distribution with the deterministic dynamics of the Hamiltonian system. Specifically, starting from
the current point θ̃, a value of the momentum variable r is chosen from the probability distribution
corresponding to K(r), i.e. P(r) ∝ e^(−K(r)), and the Hamiltonian system then deterministically evolves for a
period of time to arrive at a new point θ̃′ and new momentum r′. The new point θ̃′ is then accepted if
H(θ̃′, r′) ≤ H(θ̃, r). Otherwise, θ̃′ is accepted with probability

e^(H(θ̃, r) − H(θ̃′, r′)).
Figure 12: Three trajectories of a Hamiltonian dynamical system where the posterior distribution is a 2d
normal distribution, where each trajectory starts at θ̃ = (0, 1.5), and where we choose the momentum r at
random from a 2d standard normal. The system is simulated for 45 leapfrog steps with δ = 0.1.
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.23 0.13 -0.02 0.48 1.00 1539 2023
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The histogram of the 4000 posterior samples of θ is shown in Figure 13.
Figure 13: Histogram of the samples from the posterior distribution over θ from an hmc sampler of the
logit-normal prior based binomial model.
Model evaluation
Let us return to the following diagram from the introduction to this chapter.
[Diagram: a statistical model is connected to a fitted model by an inference arc (pointing right) and an
evaluation arc (pointing left).]
In our coverage of classical and then Bayesian methods of inference, we have dealt with the inference arc
to the right. This is where we begin with a statistical model that has variables or parameters whose values are
unknown, and then use one general approach or another to obtain estimates of these unknown variables,
along with measures of the uncertainty of these estimates. The result is a model that is fitted to the data.
However, whether it is a good fit, and more importantly, whether the model is able to generalize to new
data, remains to be seen. Model evaluation is the general term for this stage of the modelling process. Model
evaluation is ultimately a large and multifaceted topic, involving various quantitative and graphical methods.
Here, we will focus on the most common methods, and in particular, on model comparison, which is where we
directly compare two or more models against one another.
M_1 <- lm(log(price) ~ 1, data = housing_df)
mu_0 <- rep(log(60000), nrow(housing_df))
M_0 <- lm(log(price) ~ 0 + offset(mu_0), data = housing_df)
The logarithms of the likelihoods of M_1 and M_0 at the maximum values are as follows.
logLik(M_1)
## 'log Lik.' -234.2995 (df=2)
logLik(M_0)
## 'log Lik.' -240.6163 (df=1)
The corresponding deviances are -2 times these values, which we may also obtain as follows.
(d_1 <- -2 * logLik(M_1))
## 'log Lik.' 468.599 (df=2)
(d_0 <- -2 * logLik(M_0))
## 'log Lik.' 481.2327 (df=1)
The difference of the deviances ∆D is as follows.
delta_d <- as.numeric(d_0 - d_1)
delta_d
## [1] 12.6336
Assuming M1 and M0 predict the observed data equally well, this difference is distributed as a χ2 distribution
with 1 degree of freedom. This value for the degrees of freedom is based on the fact that one model has two
parameters and the other has just one. We can calculate the p-value for ∆D by calculating the probability of
obtaining a result as or more extreme than ∆D in a χ2 distribution with 1 degree of freedom.
pchisq(delta_d, df = 1, lower.tail = F)
## [1] 0.0003788739
Simply put, this result tells us that a value of ∆D = 12.63 is not an expected result if models M1 and M0
predict the data equally well, and so we can conclude that M1, which has the lower deviance, predicts the
data significantly better than M0.
For leave one out cross-validation, the procedure is as follows, with the procedure for any K-fold cross-
validation being similarly defined. Assuming our data is D = y1, y2 . . . yn, we divide the data into n pairs of
the form (yi, y¬i), where yi is data point i and y¬i is all the remaining data except for data point i. Then, for
each i, we fit
the model using y¬i and test how well the fitted model can predict yi . We then calculate the sum of the
predictive performance over all data points. In classical inference based models, for each i, we calculate
θ̂¬i , which is the maximum likelihood or other estimator of the parameter vector θ based on y¬i . We then
calculate log P(yi |θ̂¬i ), which is the logarithm of the predicted probability of yi based on the model with
parameters θ̂¬i . This then leads to
$$\mathrm{elpd} = \sum_{i=1}^{n} \log P(y_i | \hat{\theta}_{\neg i})$$
as the overall measure of the model’s out-of-sample predictive performance, which we refer to as expected
log predictive density (elpd). In Bayesian approaches, an analogous procedure is followed. For each i, we
calculate the posterior distribution P(θ|y¬i ), and the model’s overall out-of-sample predictive performance is
$$\mathrm{elpd} = \sum_{i=1}^{n} \log \int P(y_i | \theta) P(\theta | y_{\neg i}) \, d\theta.$$
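In practice, the integral in each term of the Bayesian elpd is approximated by averaging the predicted density over S samples from the posterior distribution. The following is a minimal sketch of that approximation for a single held-out point; the posterior draws of µ and σ here are simulated stand-ins, not samples from an actual fitted model.

```r
# Monte Carlo approximation of one term of the Bayesian elpd:
#   log integral P(y_i | theta) P(theta | y_not_i) d theta
#     ~= log( (1/S) * sum_s P(y_i | theta_s) )
set.seed(42)
S <- 1000
mu_draws <- rnorm(S, mean = 1.0, sd = 0.05)          # hypothetical posterior draws of mu
sigma_draws <- abs(rnorm(S, mean = 0.5, sd = 0.02))  # hypothetical posterior draws of sigma
y_i <- 1.2                                           # the held-out data point
# Log of the posterior-averaged predicted density of y_i
log(mean(dnorm(y_i, mean = mu_draws, sd = sigma_draws)))
```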
The log of the predicted probability of held-out data point i, based on the log-normal model fitted to the remaining data, can be calculated with the following function.
logprediction_m1 <- function(i){
  m1_not_i <- lm(log(price) ~ 1,
                 data = slice(housing_df, -i)
  )
  mu <- coef(m1_not_i)
  stdev <- sigma(m1_not_i)
  y_i <- slice(housing_df, i) %>% pull(price)
  dnorm(log(y_i), mean = mu, sd = stdev, log = TRUE)
}
We can then apply this to all data points and sum the result.
n <- nrow(housing_df)
map_dbl(seq(n), logprediction_m1) %>%
sum()
## [1] -236.2388
Now, let us compare this log-normal model’s performance to that of a normal model, i.e., where we assume that
for i ∈ 1 . . . n, yi ∼ N(µ, σ 2 ). The log of the prediction of the held-out data point is calculated in an almost
identical manner, except that we use the values of the price variable directly and not their logarithm.
logprediction_m0 <- function(i){
  m0_not_i <- lm(price ~ 1,
                 data = slice(housing_df, -i)
  )
  mu <- coef(m0_not_i)
  stdev <- sigma(m0_not_i)
  y_i <- slice(housing_df, i) %>% pull(price)
  dnorm(y_i, mean = mu, sd = stdev, log = TRUE)
}
loo(m0_bayes)$estimates
## Estimate SE
## elpd_loo -6341.868732 23.4359160
## p_loo 2.981645 0.6150526
## looic 12683.737464 46.8718319
Leaving aside p_loo (a measure of effective number of parameters) and the standard errors (a measure of
uncertainty of the estimates, as defined above), we see here that the elpd_loo and the looic estimates from
the Bayesian models are very similar to those calculated using the classical inference based models above.
AIC
Cross-validation is an excellent method of model evaluation because it addresses the central issue of out-of-
sample generalization, rather than fit to the data, and can be applied to any model, regardless of whether
the model is based on classical or Bayesian methods of inference. On the other hand, cross-validation has
traditionally been seen as too computationally demanding to be used in all data analysis situations. This is
becoming less of a concern now, both because of increased computational power and because of the development
of efficient implementations such as the Pareto smoothed importance sampling method mentioned above. Nonetheless, one
still widely used model evaluation method, the Akaike Information Criterion (aic), originally proposed by
Akaike (1973), can be justified as a very easily computed approximation to leave-one-out cross-validation (see
Stone 1977; Fang 2011). aic is defined as follows:
$$\mathrm{aic} = 2k - 2 \log P(\mathcal{D} | \hat{\theta}),$$
where k is the number of parameters in the model and log P(D|θ̂) is the maximum value of the log likelihood.
Obviously, for models where the log of the likelihood is available, and where the number of parameters of the
model is straightforward to count (which is not always the case in complex models), the aic is simple to
calculate. For example, for the log-normal and normal models that we evaluated above using cross-validation,
both of which have k = 2 parameters, the aic is calculated as follows.
m1 <- lm(log(price) ~ 1, data = housing_df)
m0 <- lm(price ~ 1, data = housing_df)
k <- 2
aic_1 <- as.numeric(2 * k - 2 * logLik(m1))
aic_0 <- as.numeric(2 * k - 2 * logLik(m0))
c(aic_1, aic_0)
## [1] 472.599 12682.711
Clearly, these values are very close to −2 times the cross-validation elpd values for these two models.
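The correspondence can be checked directly with the values reported above: the leave-one-out elpd of the log-normal model was −236.2388, and its aic was 472.599.

```r
# elpd from leave-one-out cross-validation of the log-normal model (from above)
elpd_loo_m1 <- -236.2388
# aic approximates -2 * elpd; compare with aic_1 = 472.599 computed above
-2 * elpd_loo_m1
## [1] 472.4776
```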
Like elpd and other measures, a model’s aic value is of little use in itself, and so we only interpret
differences in aic between models. Conventional standards (see, for example, Burnham and Anderson 2003,
Chapter 2) hold that aic differences of between 4 and 7 indicate clear superiority of the predictive
power of the model with the lower aic, while differences of 10 or more indicate that the model with the
higher value has essentially no predictive power relative to the model with the lower value. We can appreciate
why these thresholds are followed by considering the concept of Akaike weights (see Burnham and Anderson
2003, 75). Akaike weights provide probabilities for each of a set of K models that are being compared with
one another. They are defined as follows:
$$\rho_k = \frac{e^{-\frac{1}{2}\Delta\mathrm{AIC}_k}}{\sum_{k=1}^{K} e^{-\frac{1}{2}\Delta\mathrm{AIC}_k}},$$
where ∆AIC_k is the difference between the aic of model k and the lowest aic value in the set of K models.
These probabilities can be interpreted as the probabilities that any model has better predictive performance
than the others. If we have just two models, with model k = 1 being the one with the lower aic, then
∆AIC_1 = 0 and ∆AIC_2 will be some positive quantity δ. Using Akaike weights, the probability that
the model with the lower aic has the better predictive performance is
$$\rho = \frac{1}{1 + e^{-\delta/2}}.$$
For δ values of 4, 6, 9, the corresponding probabilities are 0.88, 0.95, 0.99. These values provide a justification
for the thresholds proposed by Burnham and Anderson (2003).
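Both the general weights and the two-model special case are simple enough to compute directly. The following sketch defines a hypothetical `akaike_weights` helper (not part of any package) and reproduces the probabilities quoted above.

```r
# Akaike weights for a vector of aic values (illustrative helper function)
akaike_weights <- function(aic) {
  delta <- aic - min(aic)    # differences from the lowest aic
  w <- exp(-delta / 2)
  w / sum(w)                 # normalize to probabilities
}

# Two-model special case: probability that the lower-aic model is better,
# as a function of delta, the aic difference
prob_better <- function(delta) 1 / (1 + exp(-delta / 2))
round(prob_better(c(4, 6, 9)), 2)
## [1] 0.88 0.95 0.99
```

Note that `akaike_weights(c(a, a + delta))[1]` equals `prob_better(delta)`, so the two-model formula is just the K = 2 case of the general weights.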
WAIC
aic is appealing because of its simplicity. As we have seen, despite this simplicity, it can be a highly accurate
approximation to the cross-validation elpd. However, this approximation will not hold in all models, especially
large and complex ones. In addition, in some situations, calculating the maximum of the likelihood function
is not straightforward, and nor is defining the number of free parameters in the model. A more widely
applicable version of the aic is the Watanabe Akaike Information Criterion (waic). waic was introduced
by Watanabe (2010) under the name Widely Applicable Information Criterion rather than Watanabe Akaike
Information Criterion. It has been shown that waic is more generally or widely applicable than aic, and is a
close approximation to the Bayesian leave-one-out cross-validation elpd, yet can be calculated easily from a model’s
posterior samples. waic is calculated as follows:
$$\mathrm{waic} = -2\left(\sum_{i=1}^{n} \log\left(\frac{1}{S}\sum_{s=1}^{S} P(y_i | \theta^s)\right) - \sum_{i=1}^{n} V_{s=1}^{S}\left(\log P(y_i | \theta^s)\right)\right),$$
where yi , θ^s etc. are as they are defined above for the case of the Bayesian elpd. The term V_{s=1}^{S}(·) signifies the
variance of its arguments.
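This formula operates on an S × n matrix of pointwise log-likelihoods, with one row per posterior sample and one column per data point. The following sketch computes waic from such a matrix; the posterior draws here are simulated stand-ins rather than samples from a fitted Stan model.

```r
# Sketch: waic from an S x n matrix log_lik, where log_lik[s, i] = log P(y_i | theta^s).
# The "posterior draws" below are hypothetical, simulated for illustration only.
set.seed(1)
S <- 1000; n <- 50
y <- rnorm(n, mean = 1, sd = 0.5)
mu_draws <- rnorm(S, mean = 1, sd = 0.05)
sigma_draws <- abs(rnorm(S, mean = 0.5, sd = 0.02))
log_lik <- sapply(seq_len(n), function(i) {
  dnorm(y[i], mean = mu_draws, sd = sigma_draws, log = TRUE)
})
# First term: log pointwise predictive density, averaging over samples
lppd <- sum(log(colMeans(exp(log_lik))))
# Second term: sum of per-point variances of the log-likelihood over samples
p_waic <- sum(apply(log_lik, 2, var))
waic <- -2 * (lppd - p_waic)
```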
With Stan-based models, waic is easily calculated using the loo package.
waic(m1_bayes)$estimates
## Estimate SE
## elpd_waic -236.232503 15.9133627
## p_waic 1.886645 0.1614885
## waic 472.465006 31.8267254
waic(m0_bayes)$estimates
## Estimate SE
## elpd_waic -6341.868701 23.4362820
## p_waic 2.981614 0.6155333
## waic 12683.737402 46.8725640
As we can see, this is very similar to the Bayesian cross-validation elpd, but can be calculated directly from
posterior samples.
References
Akaike, Hirotogu. 1973. “Information Theory and an Extension of the Maximum Likelihood Principle.”
Edited by B. N. Petrov and F. Csáki.
Betancourt, Michael. 2017. “A Conceptual Introduction to Hamiltonian Monte Carlo.”
Burnham, Kenneth P, and David R Anderson. 2003. Model Selection and Multimodel Inference: A Practical
Information-Theoretic Approach. Springer Science & Business Media.
Bürkner, Paul-Christian. 2018. “Advanced Bayesian Multilevel Modeling with the R Package brms.” The R
Journal 10 (1): 395–411.
Clopper, Charles J, and Egon S Pearson. 1934. “The Use of Confidence or Fiducial Limits Illustrated in the
Case of the Binomial.” Biometrika 26 (4): 404–13.
Diaconis, Persi, Susan Holmes, and Richard Montgomery. 2007. “Dynamical Bias in the Coin Toss.” SIAM
Review 49 (2): 211–35.
Fang, Yixin. 2011. “Asymptotic Equivalence Between Cross-Validations and Akaike Information Criteria in
Mixed-Effects Models.” Journal of Data Science 9 (1): 15–21.
Gelman, Andrew, and Deborah Nolan. 2002. “You Can Load a Die, but You Can’t Bias a Coin.” The
American Statistician 56 (4): 308–11.
Gelman, Andrew, Daniel Simpson, and Michael Betancourt. 2017. “The Prior Can Often Only Be Understood
in the Context of the Likelihood.” Entropy 19 (10): 555.
Hoffman, Matthew D, and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively Setting Path
Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15 (1): 1593–1623.
Kerrich, J. E. 1946. An Experimental Introduction to the Theory of Probability. E. Munksgaard.
Metropolis, Nicholas, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller.
1953. “Equation of State Calculations by Fast Computing Machines.” The Journal of Chemical Physics 21
(6): 1087–92.
Metropolis, Nicholas, and S. Ulam. 1949. “The Monte Carlo Method.” Journal of the American Statistical
Association 44 (247): 335–41.
Neal, Radford. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte Carlo,
edited by Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. CRC press.
Stone, M. 1977. “An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion.”
Journal of the Royal Statistical Society. Series B (Methodological) 39 (1): 44–47.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using
Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32.
Watanabe, Sumio. 2010. “Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable
Information Criterion in Singular Learning Theory.” Journal of Machine Learning Research 11 (Dec):
3571–94.