0% found this document useful (0 votes)

58 views30 pages

NUTS Gelman

The document describes the No-U-Turn Sampler (NUTS), an extension of Hamiltonian Monte Carlo (HMC) that eliminates the need to specify the number of steps (L) parameter. NUTS uses a recursive algorithm to build a set of candidate points that spans the target distribution, stopping when it starts to retrace its steps. Empirically, NUTS performs as efficiently as or more efficiently than well-tuned HMC without requiring tuning of L. The document also presents a method for automatically adapting the step size parameter in both HMC and NUTS using dual averaging.

Uploaded by

Indrajit Chatterjee

We take content rights seriously. If you suspect this is your content, claim it here.

0% found this document useful (0 votes)

58 views30 pages

NUTS Gelman

Uploaded by

Indrajit Chatterjee

We take content rights seriously. If you suspect this is your content, claim it here.

The No-U-Turn Sampler

The No-U-Turn Sampler: Adaptively Setting Path Lengths

in Hamiltonian Monte Carlo

Matthew D. Hoffman [email protected]

Department of Statistics
Columbia University
New York, NY 10027, USA
Andrew Gelman
arXiv:1111.4246v1 [stat.CO] 18 Nov 2011

[email protected]
Departments of Statistics and Political Science
Columbia University
New York, NY 10027, USA

Abstract
Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that
avoids the random walk behavior and sensitivity to correlated parameters that plague many
MCMC methods by taking a series of steps informed by first-order gradient information.
These features allow it to converge to high-dimensional target distributions much more
quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However,
HMC’s performance is highly sensitive to two user-specified parameters: a step size
and a desired number of steps L. In particular, if L is too small then the algorithm
exhibits undesirable random walk behavior, while if L is too large the algorithm wastes
computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that
eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build
a set of likely candidate points that spans a wide swath of the target distribution, stopping
automatically when it starts to double back and retrace its steps. Empirically, NUTS
perform at least as efficiently as and sometimes more efficiently than a well tuned standard
HMC method, without requiring user intervention or costly tuning runs. We also derive a
method for adapting the step size parameter on the fly based on primal-dual averaging.
NUTS can thus be used with no hand-tuning at all. NUTS is also suitable for applications
such as BUGS-style automatic inference engines that require efficient “turnkey” sampling
algorithms.
Keywords: Markov chain Monte Carlo, Hamiltonian Monte Carlo, Bayesian inference,
adaptive Monte Carlo, dual averaging.

1. Introduction

Hierarchical Bayesian models are a mainstay of the machine learning and statistics com-
munities. Exact posterior inference in such models is rarely tractable, however, and so
researchers and practitioners must usually resort to approximate statistical inference meth-
ods. Deterministic approximate inference algorithms (for example, those reviewed by Wain-
wright and Jordan (2008)) can be efficient, but introduce bias and can be difficult to apply
to some models. Rather than computing a deterministic approximation to a target poste-
rior (or other) distribution, Markov chain Monte Carlo (MCMC) methods offer schemes for
drawing a series of correlated samples that will converge in distribution to the target distri-

1
Hoffman and Gelman

bution (Neal, 1993). MCMC methods are sometimes less efficient than their deterministic
counterparts, but are more generally applicable and are asymptotically unbiased.
Not all MCMC algorithms are created equal. For complicated models with many param-
eters, simple methods such as random-walk Metropolis (Metropolis et al., 1953) and Gibbs
sampling (Geman and Geman, 1984) may require an unacceptably long time to converge
to the target distribution. This is in large part due to the tendency of these methods to
explore parameter space via inefficient random walks (Neal, 1993). When model parameters
are continuous rather than discrete, Hamiltonian Monte Carlo (HMC), also known as hybrid
Monte Carlo, is able to suppress such random walk behavior by means of a clever auxiliary
variable scheme that transforms the problem of sampling from a target distribution into the
problem of simulating Hamiltonian dynamics (Neal, 2011). The cost of HMC per indepen-
dent sample from a target distribution of dimension D is roughly O(D5/4 ), which stands in
sharp contrast with the O(D2 ) cost of random-walk Metropolis (Creutz, 1988).
HMC’s increased efficiency comes at a price. First, HMC requires the gradient of the
log-posterior. Computing the gradient for a complex model is at best tedious and at worst
impossible, but this requirement can be made less onerous by using automatic differentiation
(Griewank and Walther, 2008). Second, HMC requires that the user specify at least two
parameters: a step size and a number of steps L for which to run a simulated Hamiltonian
system. A poor choice of either of these parameters will result in a dramatic drop in HMC’s
efficiency. Methods from the adaptive MCMC literature (see Andrieu and Thoms (2008) for
a review) can be used to tune on the fly, but setting L typically requires one or more costly
tuning runs, as well as the expertise to interpret the results of those tuning runs. This hurdle
limits the more widespread use of HMC, and makes it challenging to incorporate HMC into
a general-purpose inference engine such as BUGS (Gilks and Spiegelhalter, 1992), JAGS
(http://mcmc-jags.sourceforge.net), Infer.NET (Minka et al.), HBC (Daume III, 2007), or
PyMC (Patil et al., 2010).
The main contribution of this paper is the No-U-Turn Sampler (NUTS), an MCMC
algorithm that closely resembles HMC, but eliminates the need to choose the problematic
number-of-steps parameter L. We also provide a new dual averaging (Nesterov, 2009)
scheme for automatically tuning the step size parameter in both HMC and NUTS, making
it possible to run NUTS with no hand-tuning at all. We will show that the tuning-free
version of NUTS samples as efficiently as (and sometimes more efficiently than) HMC, even
ignoring the cost of finding optimal tuning parameters for HMC. Thus, NUTS brings the
efficiency of HMC to users (and generic inference systems) that are unable or disinclined to
spend time tweaking an MCMC algorithm.

2. Hamiltonian Monte Carlo

In Hamiltonian Monte Carlo (HMC) (Neal, 2011, 1993; Duane et al., 1987), we introduce an
auxiliary momentum variable rd for each model variable θd . In the usual implementation,
these momentum variables are drawn independently from the standard normal distribution,
yielding the (unnormalized) joint density

p(θ, r) ∝ exp{L(θ) − 21 r · r}, (1)

2
The No-U-Turn Sampler

Algorithm 1 Hamiltonian Monte Carlo

Given θ0 , , L, L, M :
for m = 1 to M do
Sample r0 ∼ N (0, I).
Set θm ← θm−1 , θ̃ ← θm−1 , r̃ ← r0 .
for i = 1 to L do
Set θ̃, r̃ ← Leapfrog(θ̃, r̃, ).
end for
exp{L(θ̃)− 21 r̃·r̃}
With probability α = min 1, exp{L(θm−1 )− 1 r0 ·r0 } , set θm ← θ̃, rm ← −r̃.
2
end for

function Leapfrog(θ, r, )
Set r̃ ← r + (/2)∇θ L(θ).
Set θ̃ ← θ + r̃.
Set r̃ ← r̃ + (/2)∇θ L(θ̃).
return θ̃, r̃.

where L is the logarithm of the joint density of the variables of interest θ (up to a normalizing
constant) and x · y denotes the inner product of the vectors x and y. We can interpret this
augmented model in physical terms as a fictitious Hamiltonian system where θ denotes a
particle’s position in D-dimensional space, rd denotes the momentum of that particle in
the dth dimension, L is a position-dependent negative potential energy function, 21 r · r is
the kinetic energy of the particle, and log p(θ, r) is the negative energy of the particle. We
can simulate the evolution over time of the Hamiltonian dynamics of this system via the
“leapfrog” integrator, which proceeds according to the updates
rt+/2 = rt + (/2)∇θ L(θt ); θt+ = θt + rt+/2 ; rt+ = rt+/2 + (/2)∇θ L(θt+ ), (2)
where rt and θt denote the values of the momentum and position variables r and θ at time
t and ∇θ denotes the gradient with respect to θ. Since the update for each coordinate
depends only on the other coordinates, the leapfrog updates are volume-preserving—that
is, the volume of a region remains unchanged after mapping each point in that region to a
new point via the leapfrog integrator.
A standard procedure for drawing M samples via Hamiltonian Monte Carlo is described
in Algorithm 1. I denotes the identity matrix and N (µ, Σ) denotes a multivariate normal
distribution with mean µ and covariance matrix Σ. For each sample m, we first resample
the momentum variables from a standard multivariate normal, which can be inetpreted as
a Gibbs sampling update. We then apply L leapfrog updates to the position and momen-
tum variables θ and r, generating a proposal position-momentum pair θ̃, r̃. We propose
setting θm = θ̃ and rm = −r̃, and accept or reject this proposal according to the Metropo-
lis algorithm (Metropolis et al., 1953). This is a valid Metropolis proposal because it is
time-reversible and the leapfrog integrator is volume-preserving; using an algorithm for
simulating Hamiltonian dynamics that did not preserve volume would seriously complicate
the computation of the Metropolis acceptance probability. The negation of r̃ in the pro-
posal is theoretically necessary to produce time-reversibility, but can be omitted in practice

3
Hoffman and Gelman

if one is only interested in sampling from p(θ). The algorithm’s original name, “Hybrid
Monte Carlo,” refers to the hybrid approach of alternating between updating θ and r via
Hamiltonian simulation and updating r via Gibbs sampling.
p(θ̃,r̃)
The term log p(θ,r) , on which the acceptance probability α depends, is the negative
change in energy of the simulated Hamiltonian system from time 0 to time L. If we could
simulate the Hamiltonian dynamics exactly, then α would always be 1, since energy is con-
served in Hamiltonian systems. The error introduced by using a discrete-time simulation
p(θ̃,r̃)
depends on the step size parameter —specifically, the change in energy | log p(θ,r) | is pro-
2 3
portional to for large L, or if L = 1 (Leimkuhler and Reich, 2004). In theory the
error can grow without bound as a function of L, but in practice it typically does not when
using the leapfrog discretization. This allows us to run HMC with many leapfrog steps,
generating proposals for θ that have high probability of acceptance even though they are
distant from the previous sample.
The performance of HMC depends strongly on choosing suitable values for and L. If
is too large, then the simulation will be inaccurate and yield low acceptance rates. If
is too small, then computation will be wasted taking many small steps. If L is too small,
then successive samples will be close to one another, resulting in undesirable random walk
behavior and slow mixing. If L is too large, then HMC will generate trajectories that loop
back and retrace their steps. This is doubly wasteful, since work is being done to bring the
proposal θ̃ closer to the initial position θm−1 . Worse, if L is chosen so that the parameters
jump from one side of the space to the other each iteration, then the Markov chain may
not even be ergodic (Neal, 2011). More realistically, an unfortunate choice of L may result
in a chain that is ergodic but slow to move between regions of low and high density.

3. Eliminating the Need to Hand-Tune HMC

HMC is a powerful algorithm, but its usefulness is limited by the need to tune the step size
parameter and number of steps L. Tuning these parameters for any particular problem re-
quires some expertise, and usually one or more preliminary runs. Selecting L is particularly
problematic; it is difficult to find a simple metric for when a trajectory is too short, too long,
or “just right,” and so practitioners commonly rely on heuristics based on autocorrelation
statistics from preliminary runs (Neal, 2011).
Below, we present the No-U-Turn Sampler (NUTS), an extension of HMC that eliminates
the need to specify a fixed value of L. In section 3.2 we present schemes for setting based
on the dual averaging algorithm of Nesterov (2009).

3.1 No-U-Turn Hamiltonian Monte Carlo

Our first goal is to devise an MCMC sampler that retains HMC’s ability to suppress random
walk behavior without the need to set the number L of leapfrog steps that the algorithm
takes to generate a proposal. We need some criterion to tell us when we have simulated
the dynamics for “long enough,” i.e., when running the simulation for more steps would
no longer increase the distance between the proposal θ̃ and the initial value of θ. We use
a convenient criterion based on the dot product between r̃ (the current momentum) and
θ̃ − θ (the vector from our initial position to our current position), which is the derivative

4
The No-U-Turn Sampler

Figure 1: Example of building a binary tree via repeated doubling. Each doubling proceeds
by choosing a direction (forwards or backwards in time) uniformly at random,
then simulating Hamiltonian dynamics for 2j leapfrog steps in that direction,
where j is the number of previous doublings (and the height of the binary tree).
The figures at top show a trajectory in two dimensions (with corresponding binary
tree in dashed lines) as it evolves over four doublings, and the figures below show
the evolution of the binary tree. In this example, the directions chosen were
forward (light orange node), backward (yellow nodes), backward (blue nodes),
and forward (green nodes).

with respect to time (in the Hamiltonian system) of half the squared distance between the
initial position θ and the current position θ̃:

d (θ̃ − θ) · (θ̃ − θ) d
= (θ̃ − θ) · (θ̃ − θ) = (θ̃ − θ) · r̃. (3)
dt 2 dt

In other words, if we were to run the simulation for an infinitesimal amount of additional
time, then this quantity is proportional to the progress we would make away from our
starting point θ.
This suggests an algorithm in which one runs leapfrog steps until the quantity in equation
3 becomes less than 0; such an approach would simulate the system’s dynamics until the
proposal location θ̃ started to move back towards θ. Unfortunately this algorithm does
not guarantee time reversibility, and is therefore not guaranteed to converge to the correct
distribution. NUTS overcomes this issue by means of a recursive algorithm reminiscent of
the doubling procedure devised by Neal (2003) for slice sampling.
NUTS begins by introducing a slice variable u with conditional distribution p(u|θ, r) =
Uniform(u; [0, exp{L(θ) − 12 r · r}]), which renders the conditional distribution p(θ, r|u) =
Uniform(θ, r; {θ0 , r0 | exp{L(θ) − 12 r · r} ≥ u}). This slice sampling step is not strictly neces-
sary, but it simplifies both the derivation and the implementation of NUTS. In addition to

5
Hoffman and Gelman

0.4

0.3

0.2

0.1

−0.1
−0.1 0 0.1 0.2 0.3 0.4 0.5

Figure 2: Example of a trajectory generated during one iteration of NUTS. The blue ellipse
is a contour of the target distribution, the black open circles are the positions θ
traced out by the leapfrog integrator and associated with elements of the set of
visited states B, the black solid circle is the starting position, the red solid circles
are positions associated with states that must be excluded from the set C of
possible next samples because their joint probability is below the slice variable u,
and the positions with a red “x” through them correspond to states that must be
excluded from C to satisfy detailed balance. The blue arrow is the vector from the
positions associated with the leftmost to the rightmost leaf nodes in the rightmost
height-3 subtree, and the magenta arrow is the (normalized) momentum vector
at the final state in the trajectory. The doubling process stops here, since the
blue and magenta arrows make an angle of more than 90 degrees. The crossed-
out nodes with a red “x” are in the right half-tree, and must be ignored when
choosing the next sample.

being more complicated, the analogous algorithm that eliminates the slice variable seems
empirically to be slightly less efficient than the algorithm presented in this paper.

6
The No-U-Turn Sampler

At a high level, after resampling u|θ, r, NUTS uses the leapfrog integrator to trace out a
path forwards and backwards in fictitious time, first running forwards or backwards 1 step,
then forwards or backwards 2 steps, then forwards or backwards 4 steps, etc. This doubling
process implicitly builds a balanced binary tree whose leaf nodes correspond to position-
momentum states, as illustrated in Figure 1. The doubling is halted when the subtrajectory
from the leftmost to the rightmost nodes of any balanced subtree of the overall binary tree
starts to double back on itself (i.e., the fictional particle starts to make a “U-turn”). At
this point NUTS stops the simulation and samples from among the set of points computed
during the simulation, taking care to preserve detailed balance. Figure 2 illustrates an
example of a trajectory computed during an iteration of NUTS.
Pseudocode implementing a efficient version of NUTS is provided in Algorithm 3. A
detailed derivation follows below, along with a simplified version of the algorithm that
motivates and builds intuition about Algorithm 3 (but uses much more memory and makes
smaller jumps).

3.1.1 Derivation of simplified NUTS algorithm

NUTS further augments the model p(θ, r) ∝ exp{L(θ) − 12 r · r} with a slice variable u (Neal,
2003). The joint probability of θ, r, and u is

p(θ, r, u) ∝ I[u ∈ [0, exp{L(θ) − 21 r · r}]], (4)

where I[·] is 1 if the expression in brackets is true and 0 if it is false. The (unnormalized)
marginal probability of θ and r (integrating over u) is

p(θ, r) ∝ exp{L(θ) − 21 r · r}, (5)

as in standard HMC. The conditional probabilities p(u|θ, r) and p(θ, r|u) are each uniform,
so long as the condition u ≤ exp{L(θ) − 21 r · r} is satisfied.
We also add a finite set C of candidate position-momentum states and another finite set
B ⊇ C to the model. B will be the set of all position-momentum states that the leapfrog
integrator traces out during a given NUTS iteration, and C will be the subset of those
states to which we can transition without violating detailed balance. B will be built up by
randomly taking forward and backward leapfrog steps, and C will selected deterministically
from B. The random procedure for building B and C given θ, r, u, and will define a
conditional distribution p(B, C|θ, r, u, ), upon which we place the following conditions:
C.1: All elements of C must be chosen in a way that preserves volume. That is, any
deterministic transformations of θ, r used to add a state θ0 , r0 to C must have a Jacobian
with unit determinant.

C.2: p((θ, r) ∈ C|θ, r, u, ) = 1.

C.3: p(u ≤ exp{L(θ0 ) − 12 r0 · r0 }|(θ0 , r0 ) ∈ C) = 1.

C.4: If (θ, r) ∈ C and (θ0 , r0 ) ∈ C then for any B, p(B, C|θ, r, u, ) = p(B, C|θ0 , r0 , u, ).
C.1 ensures that p(θ, r|(θ, r) ∈ C) ∝ p(θ, r), i.e. if we restrict our attention to the elements
of C then we can treat the unnormalized probability density of a particular element of C as

7
Hoffman and Gelman

an unnormalized probability mass. C.2 says that the current state θ, r must be included in
C. C.3 requires that any state in C be in the slice defined by u, i.e., that any state (θ0 , r0 ) ∈ C
must have equal (and positive) conditional probability density p(θ0 , r0 |u). C.4 states that B
and C must have equal probability of being selected regardless of the current state θ, r as
long as (θ, r) ∈ C (which it must be by C.2).
Deferring for the moment the question of how to construct and sample from a distribu-
tion p(B, C|θ, r, u, ) that satisfies these conditions, we will now show that the the following
procedure leaves the joint distribution p(θ, r, u, B, C|) invariant:

1. sample r ∼ N (0, I),

2. sample u ∼ Uniform([0, exp{L(θt ) − 21 r · r}]),

3. sample B, C from their conditional distribution p(B, C|θt , r, u, ),

4. sample θt+1 , r ∼ T (θt , r, C),

where T (θ0 , r0 |θ, r, C) is a transition kernel that leaves the uniform distribution over C in-
variant, i.e., T must satisfy

1 X I[(θ0 , r0 ) ∈ C]
T (θ0 , r0 |θ, r, C) = (6)
|C| |C|
(θ,r)∈C

for any θ0 , r0 . The notation θt+1 , r ∼ T (θt , r, C) denotes that we are resampling r in a way
that depends on its current value.
Steps 1, 2, and 3 resample r, u, B, and C from their conditional joint distribution given
t
θ , and therefore together constitute a valid Gibbs sampling update. Step 4 is valid because
the joint distribution of θ and r given u, B, C, and is uniform on the elements of C:

p(θ, r|u, B, C, ) ∝ p(B, C|θ, r, u, )p(θ, r|u)

∝ p(B, C|θ, r, u, )I[u ≤ exp{L(θ) − 21 r · r}] (7)
∝ I[(θ, r) ∈ C].

Condition C.1 allows us to treat the unnormalized conditional density p(θ, r|u) ∝ I[u ≤
exp{L(θ) − 12 r · r}] as an unnormalized conditional probability mass function. Conditions
C.2 and C.4 ensure that p(B, C|θ, r, u, ) ∝ I[(θ, r) ∈ C] because by C.2 (θ, r) must be in C,
and by C.4 for any B, C pair p(B, C|θ, r, u, ) is constant as a function of θ and r as long as
(θ, r) ∈ C. Condition C.3 ensures that (θ, r) ∈ C ⇒ u ≤ exp{L(θ) − 21 r · r} (so the p(θ, r|u, )
term is redundant). Thus, equation 7 implies that the joint distribution of θ and r given u
and C is uniform on the elements of C, and we are free to choose a new θt+1 , rt+1 from any
transition kernel that leaves this uniform distribution on C invariant.
We now turn our attention to the specific form for p(B, C|θ, r, u, ) used by NUTS.
Conceptually, the generative process for building B proceeds by repeatedly doubling the
size of a binary tree whose leaves correspond to position-momentum states. These states
will constitute the elements of B. The initial tree has a single node corresponding to the
initial state. Doubling proceeds by choosing a random direction vj ∼ Uniform({−1, 1}) and
taking 2j leapfrog steps of size vj (i.e., forwards in fictional time if vj = 1 and backwards in

8
The No-U-Turn Sampler

fictional time if vj = −1), where j is the current height of the tree. (The initial single-node
tree is defined to have height 0.) For example, if vj = 1, the left half of the new tree is the
old tree and the right half of the new tree is a balanced binary tree of height j whose leaf
nodes correspond to the 2j position-momentum states visited by the new leapfrog trajectory.
This doubling process is illustrated in Figure 1. Given the initial state θ, r and the step size
, there are 2j possible trees of height j that can be built according to this procedure, each
of which is equally likely. Conversely, the probability of reconstructing a particular tree of
height j starting from any leaf node of that tree is 2−j regardless of which leaf node we
start from.
We cannot keep expanding the tree forever, of course. We want to continue expanding B
until one end of the trajectory we are simulating makes a “U-turn” and begins to loop back
towards another position on the trajectory. At that point continuing the simulation is likely
to be wasteful, since the trajectory will retrace its steps and visit locations in parameter
space close to those we have already visited. We also want to stop expanding B if the
error in the simulation becomes extremely large, indicating that any states discovered by
continuing the simulation longer are likely to have astronomically low probability. (This
may happen if we use a step size that is too large, or if the target distribution includes
hard constraints that make the log-density L go to −∞ in some regions.)
The second rule is easy to formalize—we simply stop doubling if the tree includes a leaf
node whose state θ, r satisfies
1
L(θ) − r · r − log u < −∆max (8)
2
for some nonnegative ∆max . We recommend setting ∆max to a large value like 1000 so
that it does not interfere with the algorithm so long as the simulation is even moderately
accurate.
We must be careful when defining the first rule so that we can build a sampler that
neither violates detailed balance nor introduces excessive computational overhead. To de-
termine whether to stop doubling the tree at height j, NUTS considers the 2j − 1 balanced
binary subtrees of the height-j tree that have height greater than 0. NUTS stops the dou-
bling process when for one of these subtrees the states θ− , r− and θ+ , r+ associated with
the leftmost and rightmost leaves of that subtree satisfies

(θ+ − θ− ) · r− < 0 or (θ+ − θ− ) · r+ < 0. (9)

That is, we stop if continuing the simulation an infinitesimal amount either forward or back-
ward in time would reduce the distance between the position vectors θ− and θ+ . Evaluating
the condition in equation 9 for each balanced subtree of a tree of height j requires 2j+1 − 2
inner products, which is comparable to the number of inner products required by the 2j − 1
leapfrog steps needed to compute the trajectory. Except for very simple models with very
little data, the cost of these inner products is usually negligible compared to the cost of
computing gradients.
This doubling process defines a distribution p(B|θ, r, u, ). We now define a deterministic
process for deciding which elements of B go in the candidate set C, taking care to satisfy
conditions C.1–C.4 on p(B, C|θ, r, u, ) laid out above. C.1 is automatically satisfied, since
leapfrog steps are volume preserving and any element of C must be within some number

9
Hoffman and Gelman

of leapfrog steps of every other element of C. C.2 is satisfied as long as we include the
initial state θ, r in C, and C.3 is satisfied if we exclude any element θ0 , r0 of B for which
exp{L(θ0 ) − 21 r0 · r0 } < u. To satisfy condition C.4, we must ensure that p(B, C|θ, r, u, ) =
p(B, C|θ0 , r0 , u, ) for any (θ0 , r0 ) ∈ C. For any start state (θ0 , r0 ) ∈ B, there is at most one
series of directions {v0 , . . . , vj } for which the doubling process will reproduce B, so as long
as we choose C deterministically given B either p(B, C|θ0 , r0 , u, ) = 2−j = p(B, C|θ, r, u, )
or p(B, C|θ0 , r0 , u, ) = 0. Thus, condition C.4 will be satisfied as long as we exclude from
C any state θ0 , r0 that could not have generated B. The only way such a state can arise is
if starting from θ0 , r0 results in the stopping conditions in equations 8 or 9 being satisfied
before the entire tree has been built, causing the doubling process to stop too early. There
are two cases to consider:

1. The doubling procedure was stopped because either equation 8 or equation 9 was
satisfied by a state or subtree added during the final doubling iteration. In this case
we must exclude from C any element of B that was added during this final doubling
iteration, since starting the doubling process from one of these would lead to a stopping
condition being satisfied before the full tree corresponding to B has been built.

2. The doubling procedure was stopped because equation 9 was satisfied for the leftmost
and rightmost leaves of the full tree corresponding to B. In this case no stopping
condition was met by any state or subtree until B had been completed, and condition
C.4 is automatically satisfied.

Algorithm 2 shows how to construct C incrementally while building B. After resam-

pling the initial momentum and slice variables, it uses a recursive procedure resembling a
depth-first search that eliminates the need to explicitly store the tree used by the doubling
procedure. The BuildTree() function takes as input an initial position θ and momentum r,
a slice variable u, a direction v ∈ {−1, 1}, a depth j, and a step size . It takes 2j leapfrog
steps of size v (i.e. forwards in time if v = 1 and backwards in time if v = −1), and returns

1. the backwardmost and forwardmost position-momentum states θ− , r− and θ+ , r+

among the 2j new states visited;

2. a set C 0 of position-momentum states containing each newly visited state θ0 , r0 for

which exp{L(θ0 ) − 21 r0 · r0 } > u; and

3. an indicator variable s; s = 0 indicates that a stopping criterion was met by some state
or subtree of the subtree corresponding to the 2j new states visited by BuildTree().

At the top level, NUTS repeatedly calls BuildTree() to double the number of points that
have been considered until either BuildTree() returns s = 0 (in which case doubling stops
and the new set C 0 that was just returned must be ignored) or equation 9 is satisfied for
the new backwardmost and forwardmost position-momentum states θ− , r− and θ+ , r+ yet
considered (in which case doubling stops but we can use the new set C 0 ). Finally, we select
the next position and momentum θm , r uniformly at random from C, the union of all of the
valid sets C 0 that have been returned, which clearly leaves the uniform distribution over C
invariant.

10
The No-U-Turn Sampler

Algorithm 2 Naive No-U-Turn Sampler

Given θ0 , , L, M :
for m = 1 to M do
Resample r0 ∼ N (0, I).
Resample u ∼ Uniform([0, exp{L(θm−1 − 12 r0 · r0 }])
Initialize θ− = θm−1 , θ+ = θm−1 , r− = r0 , r+ = r0 , j = 0, C = {(θm−1 , r0 )}, s = 1.
while s = 1 do
Choose a direction vj ∼ Uniform({−1, 1}).
if vj = −1 then
θ− , r− , −, −, C 0 , s0 ← BuildTree(θ− , r− , u, vj , j, ).
else
−, −, θ+ , r+ , C 0 , s0 ← BuildTree(θ+ , r+ , u, vj , j, ).
end if
if s0 = 1 then
C ← C ∪ C0.
end if
s ← s0 I[(θ+ − θ− ) · r− ≥ 0]I[(θ+ − θ− ) · r+ ≥ 0].
j ← j + 1.
end while
Sample θm , r uniformly at random from C.
end for

function BuildTree(θ, r, u, v, j, )
if j = 0 then
Base case—take one leapfrog step in the direction v.
θ0 , r0 ←
Leapfrog(θ, r, v).
0 0
{(θ , r )} if u ≤ exp{L(θ0 ) − 12 r0 · r0 }
C0 ←
∅ else
s0 ← I[u < exp{∆max + L(θ0 ) − 12 r0 · r0 }].
return θ0 , r0 , θ0 , r0 , C 0 , s0 .
else
Recursion—build the left and right subtrees.
θ− , r− , θ+ , r+ , C 0 , s0 ← BuildTree(θ, r, u, v, j − 1, ).
if v = −1 then
θ− , r− , −, −, C 00 , s00 ← BuildTree(θ− , r− , u, v, j − 1, ).
else
−, −, θ+ , r+ , C 00 , s00 ← BuildTree(θ+ , r+ , u, v, j − 1, ).
end if
s0 ← s0 s00 I[(θ+ − θ− ) · r− ≥ 0]I[(θ+ − θ− ) · r+ ≥ 0].
C 0 ← C 0 ∪ C 00 .
return θ− , r− , θ+ , r+ , C 0 , s0 .
end if

To summarize, Algorithm 2 defines a transition kernel that leaves p(θ, r, u, B, C|) invari-
ant, and therefore leaves the target distribution p(θ) ∝ exp{L(θ)} invariant. It does so by
resampling the momentum and slice variables r and u, simulating a Hamiltonian trajectory
forwards and backwards in time until that trajectory either begins retracing its steps or
encounters a state with very low probability, carefully selecting a subset C of the states
encountered on that trajectory that lie within the slice defined by the slice variable u, and

11
Hoffman and Gelman

finally choosing the next position and momentum variables θm and r uniformly at random
from C. Figure 2 shows an example of a trajectory generated by an iteration of NUTS where
equation 9 is satisfied by the height-3 subtree at the end of the trajectory. Below, we will
introduce some improvements to algorithm 2 that boost the algorithm’s memory efficiency
and allow it to make larger jumps on average.

3.1.2 Efficient NUTS

Algorithm 2 requires 2j − 1 evaluations of L(θ) and its gradient (where j is the number
of times BuildTree() is called), and O(2j ) additional operations to determine when to stop
doubling. In practice, for all but the smallest problems the cost of computing L and its
gradient still dominates the overhead costs, so the computational cost of algorithm 2 per
leapfrog step is comparable to that of a standard HMC algorithm. However, Algorithm
2 also requires that we store 2j position and momentum vectors, which may require an
unacceptably large amount of memory. Furthermore, there are alternative transition kernels
that satisfy detailed balance with respect to the uniform distribution on C that produce
larger jumps on average than simple uniform sampling. Finally, if a stopping criterion
is satisfied in the middle of the final doubling iteration then there is no point in wasting
computation to build up a set C 0 that will never be used.
The third issue is easily addressed—if we break out of the recursion as soon as we
encounter a zero value for the stop indicator s then the correctness of the algorithm is
unaffected and we save some computation. We can address the second issue by using a more
sophisticated transition kernel to move from one state (θ, r) ∈ C to another state (θ0 , r0 ) ∈ C
while leaving the uniform distribution over C invariant. This kernel admits a memory-
efficient implementation that only requires that we store O(j) position and momentum
vectors, rather than O(2j ).
Consider the transition kernel

 I[w0 ∈C new ]
if |C new | > |C old |,
0 |C new |
T (w |w, C) = new 0 new ]

|C new |
, (10)
 |C old | I[w ∈Cnew + 1 − I[w 0 = w] if |C new | ≤ |C old |
|C | |C | old
|C |

where w and w0 are shorthands for position-momentum states (θ, r), C new and C old are disjoint
subsets of C such that C new ∪ C old = C, and w ∈ C old . In English, T proposes a move from C old
|C new |
to a random state in C new
and accepts the move with probability |C old | . This is equivalent
to a Metropolis-Hastings kernel with proposal distribution q(w0 , C old 0 , C new 0 |w, C old , C new ) ∝
I[w0 ∈ C new ]I[C old 0 = C new ]I[C new 0 = C old ], and it is straightforward to show that it satisfies
detailed balance with respect to the uniform distribution on C, i.e.

p(w|C)T (w0 |w, C) = p(w0 |C)T (w|w0 , C), (11)

and that T therefore leaves the uniform distribution over C invariant. If we let C new be
the (possibly empty) set of elements added to C during the final iteration of the doubling
(i.e. those returned by the final call to BuildTree() and C old be the older elements of C,
then we can replace the uniform sampling of C at the end of Algorithm 2 with a draw
from T (θt , rt , C) and leave the uniform distribution on C invariant. In fact, we can apply T
after every doubling, proposing a move to each new half-tree in turn. Doing so leaves the

12
The No-U-Turn Sampler

uniform distribution on each partially built C invariant, and therefore does no harm to the
invariance of the uniform distribution on the fully built set C. Repeatedly applying T in
this way increases the probability that we will jump to a state θt+1 far from the initial state
θt ; considering the process in reverse, it is as though we first tried to jump to the other
side of C, then if that failed tried to make a more modest jump, and so on. This transition
kernel is thus akin to delayed-rejection MCMC methods (Tierney and Mira, 1999), but in
this setting we can avoid the usual costs associated with evaluating new proposals.
The transition kernel above still requires that we be able to sample uniformly from the
set C 0 returned by BuildTree(), which may contain as many as 2j−1 elements. In fact, we
can sample from C 0 without maintaining the full set C 0 in memory by exploiting the binary
tree structure in Figure 1. Consider a subtree of the tree explored in a call to BuildTree(),
and let Csubtree denote the set of its leaf states that are in C 0 : we can factorize the probability
that a state (θ, r) ∈ Csubtree will be chosen uniformly at random from C 0 as

1 |Csubtree | 1
p(θ, r|C 0 ) = 0
= 0
(12)
|C | |C | |Csubtree |
= p((θ, r) ∈ Csubtree |C)p(θ, r|(θ, r) ∈ Csubtree , C).

That is, p(θ, r|C 0 ) is the product of the probability of choosing some node from the subtree
multiplied by the probability of choosing θ, r uniformly at random from Csubtree . We use
this observation to sample from C 0 incrementally as we build up the tree. Each subtree
above the bottom layer is built of two smaller subtrees. For each of these smaller subtrees,
we sample a θ, r pair from p(θ, r|(θ, r) ∈ Csubtree ) to represent that subtree. We then choose
between these two pairs, giving the pair representing each subtree weight proportional to
how many elements of C 0 are in that subtree. This continues until we have completed the
subtree associated with C 0 and we have returned a sample θ0 from C 0 and an integer weight
n0 encoding the size of C 0 , which is all we need to apply T . This procedure only requires that
we store O(j) position and momentum vectors in memory, rather than O(2j ), and requires
that we generate O(2j ) extra random numbers (a cost that again is usually very small
compared with the 2j − 1 gradient computations needed to run the leapfrog algorithm).
Algorithm 3 implements all of the above improvements in pseudocode. Matlab code im-
plementing the algorithm is also available at http://www.cs.princeton.edu/~mdhoffma,
and a C++ implementation will also be available as part of the soon-to-be-released Stan
inference package.

3.2 Adaptively Tuning

Having addressed the issue of how to choose the number of steps L, we now turn our
attention to the step size parameter . To set for both NUTS and HMC, we propose using
stochastic optimization with vanishing adaptation (Andrieu and Thoms, 2008), specifically
an adaptation of the primal-dual algorithm of Nesterov (2009).
Perhaps the most commonly used vanishing adaptation algorithm in MCMC is the
stochastic approximation method of Robbins and Monro (1951). Suppose we have a statistic
Ht that describes some aspect of the behavior of an MCMC algorithm at iteration t ≥ 1,

13
Hoffman and Gelman

Algorithm 3 Efficient No-U-Turn Sampler

Given θ0 , , L, M :
for m = 1 to M do
Resample r0 ∼ N (0, I).
Resample u ∼ Uniform([0, exp{L(θm−1 − 12 r0 · r0 }])
Initialize θ− = θm−1 , θ+ = θm−1 , r− = r0 , r+ = r0 , j = 0, θm = θm−1 , n = 1, s = 1.
while s = 1 do
Choose a direction vj ∼ Uniform({−1, 1}).
if vj = −1 then
θ− , r− , −, −, θ0 , n0 , s0 ← BuildTree(θ− , r− , u, vj , j, ).
else
−, −, θ+ , r+ , θ0 , n0 , s0 ← BuildTree(θ+ , r+ , u, vj , j, ).
end if
if s0 = 1 then
0
With probability min{1, nn }, set θm ← θ0 .
end if
n ← n + n0 .
s ← s0 I[(θ+ − θ− ) · r− ≥ 0]I[(θ+ − θ− ) · r+ ≥ 0].
j ← j + 1.
end while
end for

function BuildTree(θ, r, u, v, j, )
if j = 0 then
Base case—take one leapfrog step in the direction v.
θ0 , r0 ← Leapfrog(θ, r, v).
n0 ← I[u ≤ exp{L(θ0 ) − 12 r0 · r0 }].
s0 ← I[u < exp{∆max + L(θ0 ) − 12 r0 · r0 }].
return θ0 , r0 , θ0 , r0 , θ0 , n0 , s0 .
else
Recursion—implicitly build the left and right subtrees.
θ− , r− , θ+ , r+ , θ0 , n0 , s0 ← BuildTree(θ, r, u, v, j − 1, ).
if s0 = 1 then
if v = −1 then
θ− , r− , −, −, θ00 , n00 , s00 ← BuildTree(θ− , r− , u, v, j − 1, ).
else
−, −, θ+ , r+ , θ00 , n00 , s00 ← BuildTree(θ+ , r+ , u, v, j − 1, ).
end if
00
With probability n0n+n00 , set θ0 ← θ00 .
s0 ← s00 I[(θ+ − θ− ) · r− ≥ 0]I[(θ+ − θ− ) · r+ ≥ 0]
n0 ← n0 + n00
end if
return θ− , r− , θ+ , r+ , θ0 , n0 , s0 .
end if

and define its expectation h(x) as

T
1X
h(x) ≡ Et [Ht |x] ≡ lim E[Ht |x], (13)
T →∞ T
t=1

14
The No-U-Turn Sampler

where x ∈ R is a tunable parameter to the MCMC algorithm. For example, if αt is the

Metropolis acceptance probability for iteration t, we might define Ht = δ − αt , where δ is
the desired average acceptance probability. If h is a nondecreasing function of x and a few
other conditions such as boundedness of the iterates xt are met (see Andrieu and Thoms
(2008) for details), the update
xt+1 ← xt − ηt Ht (14)
is guaranteed to cause h(xt ) to converge to 0 as long as the step size schedule defined by ηt
satisfies the conditions X X
ηt = ∞; ηt2 < ∞. (15)
t t

These conditions are satisfied by schedules of the form ηt ≡ t−κ for κ ∈ (0.5, 1]. As long
as the per-iteration impact of the adaptation goes to 0 (as it will if ηt ≡ t−κ and κ > 0)
the asymptotic behavior of the sampler is unchanged. That said, in practice x often gets
“close enough” to an optimal value well before the step size η has gotten close enough to
0 to avoid disturbing the Markov chain’s stationary distribution. A common practice is
therefore to adapt any tunable MCMC parameters during the burn-in phase, and freeze the
tunable parameters afterwards (e.g., (Gelman et al., 2004)).
Dual averaging: The optimal values of the parameters to an MCMC algorithm dur-
ing the burn-in phase and the stationary phase are often quite different. Ideally those
parameters would therefore adapt quickly as we shift from the sampler’s initial, transient
regime to its stationary regime. However, the diminishing step sizes of Robbins-Monro give
disproportionate weight to the early iterations, which is the opposite of what we want.
Similar issues motivate the dual averaging scheme of Nesterov (2009), an algorithm
for nonsmooth and stochastic convex optimization. Since solving an unconstrained convex
optimization problem is equivalent to finding a zero of a nondecreasing function (i.e., the
(sub)gradient of the cost function), it is straightforward to adapt dual averaging to the
problem of MCMC adaptation by replacing stochastic gradients with the statistics Ht .
Again assuming that we want to find a setting of a parameter x ∈ R such that h(x) ≡
Et [Ht |x] = 0, we can apply the updates
√ t
t
1 X
xt+1 ←µ− Hi ; x̄t+1 ← ηt xt+1 + (1 − ηt )x̄t , (16)
γ t + t0
i=1

where µ is a freely chosen point that the iterates xt are shrunk towards, γ > 0 is a free
parameter that controls the amount of shrinkage towards µ, t0 ≥ 0 is a free parameter that
stabilizes the initial iterations of the algorithm, ηt ≡ t−κ is a step size schedule obeying the
conditions in equation 15, and we define x̄1 = x1 . As in Robbins-Monro, the per-iteration
impact of these updates on x goes to 0 as t goes to infinity. Specifically, for large t we have

xt+1 − xt = O(−Ht t−0.5 ), (17)

which clearly goes to 0 as long as the statistic Ht is bounded. The sequence of averaged
iterates x̄t is guaranteed to converge to a value such that h(x̄t ) converges to 0.
The update scheme in equation 16 is slightly more elaborate than the update scheme
of Nesterov (2009), which implicitly has t0 ≡ 0 and κ ≡ 1. Introducing these parameters

15
Hoffman and Gelman

addresses issues that are more important in MCMC adaptation than in more conventional
stochastic convex optimization settings. Setting t0 > 0 improves the stability of the algo-
rithm in early iterations, which prevents us from wasting computation by trying out extreme
values. This is particularly important for NUTS, and for HMC when simulation lengths are
specified in terms of the overall simulation length L instead of a fixed number of steps L.
In both of these cases, lower values of result in more work being done per sample, so we
want to avoid casually trying out extremely low values of . Setting the parameter κ < 1
allows us to give higher weight to more recent iterates and more quickly forget the iterates
produced during the early burn-in stages. The benefits of introducing these parameters are
less apparent in the settings originally considered by Nesterov, where the cost of a stochastic
gradient computation is assumed to be constant and the stochastic gradients are assumed
to be drawn i.i.d. given the parameter x.
Allowing t0 > 0 and κ ∈ (0.5, 1] does not affect the asymptotic convergence of the dual
averaging algorithm. For any κ ∈ (0.5, 1], x̄t will√ eventually converge to the same value
1 Pt
√
t 1 t t 1
√
t t
√
t i=1 xt . We can rewrite the term γ t+t0 as γ(t+t0 ) t ; γ(t+t0 ) is still O( t), which is the
only feature needed to guarantee convergence.
We used the values γ = 0.05, t0 = 10, and κ = 0.75 for all our experiments. We arrived
at these values by trying a few settings for each parameter by hand with NUTS and HMC
(with simulation lengths specified in terms of L) on the stochastic volatility model described
below and choosing a value for each parameter that seemed to produce reasonable behavior.
Better results might be obtained with further tweaking, but these default parameters seem
to work consistently well for both NUTS and HMC for all of the models that we tested. It
is entirely possible that these parameter settings may not work as well for other sampling
algorithms or for H statistics other than the ones described below.
Setting in HMC: In HMC we want to find a value for the step size that is neither
too small (which would waste computation by taking needlessly tiny steps) nor too large
(which would waste computation by causing high rejection rates). A standard approach is
to tune so that HMC’s average Metropolis acceptance probability is equal to some value
δ. Indeed, it has been shown that (under fairly strong assumptions) the optimal value of
for a given simulation length L is the one that produces an average Metropolis acceptance
probability of approximately 0.65 (Beskos et al., 2010; Neal, 2011). For HMC, we define a
criterion hHMC () so that
( )
p(θ̃ t , r̃ t )
HtHMC ≡ min 1, ; hHMC () ≡ Et [HtHMC |], (18)
p(θt−1 , rt,0 )

where θ̃t and r̃t are the proposed position and momentum at the tth iteration of the Markov
chain, θt−1 and rt,0 are the initial position and (resampled) momentum for the tth iteration
of the Markov chain, HtHMC is the acceptance probability of this tth HMC proposal and
hHMC is the expected average acceptance probability of the chain in equilibrium for a fixed
. Assuming that hHMC is nonincreasing as a function of , we can apply the updates in
equation 16 with Ht ≡ δ − HtHMC and x ≡ log to coerce hHMC = δ for any δ ∈ (0, 1).
Setting in NUTS: Since there is no single accept/reject step in NUTS we must define
an alternative statistic to Metropolis acceptance probability. For each iteration we define

16
The No-U-Turn Sampler

Algorithm 4 Heuristic for choosing an initial value of

function FindReasonableEpsilon(θ)
Initialize = 1, r ∼ N (0, I).
Set θ0 , rh0 ← Leapfrog(θ,
i r, ).
0 0
a ← 2I p(θ ,r )
> 0.5 − 1.
p(θ,r)
0 0
a
while p(θ ,r )
p(θ,r) > 2−a do
← 2a .
Set θ0 , r0 ← Leapfrog(θ, r, ).
end while
return .

Algorithm 5 Hamiltonian Monte Carlo with Dual Averaging

Given θ0 , δ, λ, L, M, M adapt :
Set 0 = FindReasonableEpsilon(θ), µ = log(100 ), ¯0 = 1, H̄0 = 0, γ = 0.05, t0 = 10, κ = 0.75.
for m = 1 to M do
Reample r0 ∼ N (0, I).
Set θm ← θm−1 , θ̃ ← θm−1 , r̃ ← r0 , Lm = max{1, Round(λ/m−1 )}.
for i = 1 to Lm do
Set θ̃, r̃ ← Leapfrog(θ̃, r̃, m−1 ).
end for
exp{L(θ̃)− 1 r̃·r̃}
n o
With probability α = min 1, exp{L(θm−1 )−2 1 r0 ·r0 } , set θm ← θ̃, rm ← −r̃.
2
if m ≤ M adapt
then
1 1
Set H̄m = 1 − m+t0
H̄m−1 + m+t0 (δ − α).
√
m
Set log m = µ − γ H̄m , log
¯m = m−κ log m + (1 − m−κ ) log ¯m−1 .
else
Set m = ¯M adapt .
end if
end for

the statistic HtNUTS and its expectation when the chain has reached equilibrium as

NUTS 1 X p(θ, r)
Ht ≡ final min 1, t−1 , r t,0 )
; hNUTS ≡ Et [HtNUTS ], (19)
|Bt | final
p(θ
θ,r∈Bt

where Btfinal is the set of all states explored during the final doubling of iteration t of the
Markov chain and θt−1 and rt,0 are the initial position and (resampled) momentum for the
tth iteration of the Markov chain. H NUTS can be understood as the average acceptance
probability that HMC would give to the position-momentum states explored during the final
doubling iteration. As above, assuming that H NUTS is nonincreasing in , we can apply the
updates in equation 16 with Ht ≡ δ − H NUTS and x ≡ log to coerce hNUTS = δ for any
δ ∈ (0, 1).
Finding a good initial value of : The dual averaging scheme outlined above should
work for any initial value 1 and any setting of the shrinkage target µ. However, convergence
will be faster if we start from a reasonable setting of these parameters. We recommend
choosing an initial value 1 according to the simple heuristic described in Algorithm 4. In

17
Hoffman and Gelman

Algorithm 6 No-U-Turn Sampler with Dual Averaging

Given θ0 , δ, L, M, M adapt :
Set 0 = FindReasonableEpsilon(θ), µ = log(100 ), ¯0 = 1, H̄0 = 0, γ = 0.05, t0 = 10, κ = 0.75.
for m = 1 to M do
Sample r0 ∼ N (0, I).
Resample u ∼ Uniform([0, exp{L(θm−1 − 12 r0 · r0 }])
Initialize θ− = θm−1 , θ+ = θm−1 , r− = r0 , r+ = r0 , j = 0, θm = θm−1 , n = 1, s = 1.
while s = 1 do
Choose a direction vj ∼ Uniform({−1, 1}).
if vj = −1 then
θ− , r− , −, −, θ0 , n0 , s0 , α, nα ← BuildTree(θ− , r− , u, vj , j, m−1 θm−1 , r0 ).
else
−, −, θ+ , r+ , θ0 , n0 , s0 , α, nα ← BuildTree(θ+ , r+ , u, vj , j, m−1 , θm−1 , r0 ).
end if
if s0 = 1 then
0
With probability min{1, nn }, set θm ← θ0 .
end if
n ← n + n0 .
s ← s0 I[(θ+ − θ− ) · r− ≥ 0]I[(θ+ − θ− ) · r+ ≥ 0].
j ← j + 1.
end while
if m ≤ M adapt then
1 1
Set H̄m = 1 − m+t 0
H̄m−1 + m+t 0
(δ − nαα ).
√
m
Set log m = µ − γ
H̄m , log ¯m = m−κ log m + (1 − m−κ ) log ¯m−1 .
else
Set m = ¯M adapt .
end if
end for

function BuildTree(θ, r, u, v, j, , θ0 , r0 )
if j = 0 then
Base case—take one leapfrog step in the direction v.
θ0 , r0 ← Leapfrog(θ, r, v).
n0 ← I[u ≤ exp{L(θ0 ) − 12 r0 · r0 }].
s0 ← I[u < exp{∆max + L(θ0 ) − 21 r0 · r0 }].
return θ0 , r0 , θ0 , r0 , θ0 , n0 , s0 , min{1, exp{L(θ0 ) − 12 r0 · r0 − L(θ0 ) + 21 r0 · r0 }}, 1.
else
Recursion—implicitly build the left and right subtrees.
θ− , r− , θ+ , r+ , θ0 , n0 , s0 , α0 , n0α ← BuildTree(θ, r, u, v, j − 1, , θ0 , r0 ).
if s0 = 1 then
if v = −1 then
θ− , r− , −, −, θ00 , n00 , s00 , α00 , n00α ← BuildTree(θ− , r− , u, v, j − 1, , θ0 , r0 ).
else
−, −, θ+ , r+ , θ00 , n00 , s00 , α00 , n00α ← BuildTree(θ+ , r+ , u, v, j − 1, , θ0 , r0 ).
end if
00
With probability n0n+n00 , set θ0 ← θ00 .
Set α0 ← α0 + α00 , n0α ← n0α + n00α .
s0 ← s00 I[(θ+ − θ− ) · r− ≥ 0]I[(θ+ − θ− ) · r+ ≥ 0]
n0 ← n0 + n00
end if
return θ− , r− , θ+ , r+ , θ0 , n0 , s0 , α0 , n0α .
end if

18
The No-U-Turn Sampler

English, this heuristic repeatedly doubles or halves the value of 1 until the acceptance
probability of the Langevin proposal with step size 1 crosses 0.5. The resulting value of 1
will typically be small enough to produce reasonably accurate simulations but large enough
to avoid wasting large amounts of computation. We recommend setting µ = log(101 ), since
this gives the dual averaging algorithm a preference for testing values of that are larger
than the initial value 1 . Large values of cost less to evaluate than small values of , and
so erring on the side of trying large values can save computation.
Algorithms 5 and 6 show how to implement HMC (with simulation length specified in
terms of L rather than L) and NUTS while incorporating the dual averaging algorithm
derived in this section, with the above initialization scheme. Algorithm 5 requires as input
a target simulation length λ ≈ L, a target mean acceptance probability δ, and a num-
ber of iterations M adapt after which to stop the adaptation. Algorithm 6 requires only a
target mean acceptance probability δ and a number of iterations M adapt . Matlab code im-
plementing both algorithms can be found at http://www.cs.princeton.edu/~mdhoffma,
and C++ implementations will be available as part of the Stan inference package.

4. Empirical Evaluation

In this section we examine the effectiveness of the dual averaging algorithm outlined in
section 3.2, examine what values of the target δ in the dual averaging algorithm yield
efficient samplers, and compare the efficiency of NUTS and HMC.
For each target distribution, we ran HMC (as implemented in algorithm 5) and NUTS (as
implemented in algorithm 6) with four target distributions for 2000 iterations, allowing the
step size to adapt via the dual averaging updates described in section 3.2 for the first 1000
iterations. In all experiments the dual averaging parameters were set to γ = 0.05, t0 = 10,
and κ = 0.75. We evaluated HMC with 10 logarithmically spaced target simulation lengths
λ per target distribution. For each target distribution the largest value of λ that we tested
was 40 times the smallest value of λ that we tested, meaning that each successive λ is
401/9 ≈ 1.5 times larger than the previous λ. We tried 15 evenly spaced values of the dual
averaging target δ between 0.25 and 0.95 for NUTS and 8 evenly spaced values of the dual
averaging target δ between 0.25 and 0.95 for HMC. For each sampler-simulation length-δ-
target distribution combination we ran 10 iterations with different random seeds. In total,
we ran 3,200 experiments with HMC and 600 experiments with NUTS.
We measure the efficiency of each algorithm in terms of effective sample size (ESS)
normalized by the number of gradient evaluations used by each algorithm. The ESS of
a set of M correlated samples θ1:M with respect to some function f (θ) is the number of
independent draws from the target distribution p(θ) that would give a Monte Carlo estimate
of the mean under p of f (θ) with the same level of precision as the estimate given by the mean
of f for the correlated samples θ1:M . That is, the ESS of a sample is a measure of how many
independent samples a set of correlated samples is worth for the purposes of estimating the
mean of some function; a more efficient sampler will give a larger ESS for less computation.
We use the number of gradient evaluations performed by an algorithm as a proxy for the
total amount of computation performed; in all of the models and distributions we tested the
computational overhead of both HMC and NUTS is dominated by the cost of computing

19
Hoffman and Gelman

gradients. Details of the method we use to estimate ESS are provided in appendix A. In
each experiment, we discarded the first 1000 samples as burn-in when estimating ESS.
ESS is inherently a univariate statistic, but all of the distributions we test HMC and
NUTS on are multivariate. Following Girolami and Calderhead (2011) we compute ESS
separately for each dimension and report the minimum ESS across all dimensions, since we
want our samplers to effectively explore all dimensions of the target distribution. For each
dimension we compute ESS in terms of the variance of the estimator of that dimension’s
mean and second central moment (where the estimate of the mean used to compute the
second central moment is taken from a separate long run of 50,000 iterations of NUTS with
δ = 0.5), reporting whichever statistic has a lower effective sample size. We include the
second central moment as well as the mean because for simulation lengths L that hit a
resonance of the target distribution HMC can produce samples that are anti-correlated.
These samples yield low-variance estimators of parameter means, but very high-variance
estimators of parameter variances, so computing ESS only in terms of the mean of θ can be
misleading.

4.1 Models and Datasets

To evaluate NUTS and HMC, we used the two algorithms to sample from four target distri-
butions, one of which was synthetic and the other three of which are posterior distributions
arising from real datasets.
250-dimensional multivariate normal (MVN): In these experiments the target dis-
tribution was a zero-mean 250-dimensional multivariate normal with known precision matrix
A, i.e.,
p(θ) ∝ exp{− 12 θT Aθ}. (20)
The matrix A was generated from a Wishart distribution with identity scale matrix and
250 degrees of freedom. This yields a target distribution with many strong correlations.
The same matrix A was used in all experiments.
Bayesian logistic regression (LR): In these experiments the target distribution is the
posterior of a Bayesian logistic regression model fit to the German credit dataset (available
from the UCI repository (Frank and Asuncion, 2010)). The target distribution is

p(α, β|x, y) ∝ p(y|x, α, β)p(α)p(β)

1 1 (21)
α2
P
∝ exp{− i log(1 + exp{−yi (α + xi · β}) − 2σ 2
− 2σ 2
β · β},

where xi is a 24-dimensional vector of numerical predictors associated with a customer

i, yi is −1 if customer i should be denied credit and 1 if that customer should receive
credit, α is an intercept term, and β is a vector of 24 regression coefficients. All predictors
are normalized to have zero mean and unit variance. α and each element of β are given
weak zero-mean normal priors with variance σ 2 = 100. The dataset contains predictor and
response data for 1000 customers.
Hierarchical Bayesian logistic regression (HLR): In these experiments the target
distribution is again the posterior of a Bayesian logistic regression model fit to the German
credit dataset, but this time the variance parameter in the prior on α and β is given an

20
The No-U-Turn Sampler

250-D Normal
NUTS HMC εL≈1.00 HMC εL≈1.51 HMC εL≈2.27 HMC εL≈3.42 HMC εL≈5.15 HMC εL≈7.76 HMC εL≈11.70 HMC εL≈17.62 HMC εL≈26.55 HMC εL≈40.00
0.4
h - delta
0.2
0.0
-0.2
0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9
delta

Logistic Regression
NUTS HMC εL≈0.05 HMC εL≈0.08 HMC εL≈0.11 HMC εL≈0.17 HMC εL≈0.26 HMC εL≈0.39 HMC εL≈0.58 HMC εL≈0.88 HMC εL≈1.33 HMC εL≈2.00
0.4
h - delta

0.2
0.0
-0.2
0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9
delta

Hierarchical Logistic Regression

NUTS HMC εL≈0.03 HMC εL≈0.04 HMC εL≈0.06 HMC εL≈0.09 HMC εL≈0.13 HMC εL≈0.19 HMC εL≈0.29 HMC εL≈0.44 HMC εL≈0.66 HMC εL≈1.00
0.4
h - delta

0.2
0.0
-0.2
0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9
delta

Stochastic Volatility
NUTS HMC εL≈0.10 HMC εL≈0.15 HMC εL≈0.23 HMC εL≈0.34 HMC εL≈0.52 HMC εL≈0.78 HMC εL≈1.17 HMC εL≈1.76 HMC εL≈2.65 HMC εL≈4.00
0.4
h - delta

0.2
0.0
-0.2
0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9
delta

Figure 3: Discrepancies between the realized average acceptance probability statistic h and
its target δ for the multivariate normal, logistic regression, hierarchical logistic
regression, and stochastic volatility models. Each point’s distance from the x-
axis shows how effectively the dual averaging algorithm tuned the step size for
a single experiment. Leftmost plots show experiments run with NUTS, other
plots show experiments run with HMC with a different setting of L.

exponential prior and estimated as well.

Also, we expand the predictor vectors by including
two-way interactions, resulting in 24
2 + 24 = 300-dimensional vectors of predictors x and
a 300-dimensional vector of coefficients β. These elaborations on the model make for a
more challenging problem; the posterior is in higher dimensions, and the variance term σ 2
interacts strongly with the remaining 301 variables. The target distribution for this problem
is

p(α, β, σ 2 |x, y) ∝ p(y|x, α, β)p(β|σ 2 )p(α|σ 2 )p(σ 2 )

1 1 N
α2 log σ 2 − λσ 2 },
P
∝ exp{− i log(1 + exp{−yi xi · β}) − 2σ 2
− 2σ 2
β ·β− 2
(22)

where N = 1000 is the number of customers and λ is the rate parameter to the prior on
σ 2 . We set λ = 0.01, yielding a weak exponential prior distribution on σ 2 whose mean and
standard deviation are 100.

Stochastic volatility (SV): In the final set of experiments the target distribution is the
posterior of a relatively simple stochastic volatility model fit to 3000 days of returns from
the S&P 500 index. The model assumes that the observed values of the index are generated

21
Hoffman and Gelman

MVN LR HLR SV
2.0

1.5
ε / final ε

1.0

0.5

0.0

0 200 400 600 800 0 200 400 600 800 0 200 400 600 800 0 200 400 600 800
iteration

Figure 4: Plots of the convergence of ¯ as a function of the number of iterations of NUTS

with dual averaging with δ = 0.65 applied to the multivariate normal (MVN),
logistic regression (LR), hierarchical logistic regression (HLR), and stochastic
volatility (SV) models. Each trace is from an independent run. The y-axis shows
the value of ¯, divided by one of the final values of ¯ so that the scale of the traces
for each problem can be readily compared.

by the following generative process:

τ ∼ Exponential(100); ν ∼ Exponential(100); s1 ∼ Exponential(100);

log yi −log yi−1
log si>1 ∼ Normal(log si−1 , τ −1 ); si ∼ tν , (23)

where si>1 refers to a scale parameter si where i > 1. We integrate out the precision
parameter τ to speed mixing, leading to the 3001-dimensional target distribution

Q3000
p(s, ν|y) ∝ e−0.01ν e−0.01s1 ( i=1 tν (s−1
i (log yi − log yi−1 )))×

(0.01 + 0.5 3000 2 − 3001

P
i=2 (log si − log si−1 ) )
2 . (24)

4.2 Convergence of Dual Averaging

Figure 3 plots the realized versus target values of the statistics hHMC and hNUTS . The h
statistics were computed from the 1000 post-burn-in samples. The dual averaging algorithm
of section 3.2 usually does a good job of coercing the statistic h to its desired value δ. It
performs somewhat worse for the stochastic volatility model, which we attribute to the
longer burn-in period needed for this model; since it takes more samples to reach the
stationary regime for the stochastic volatility model, the adaptation algorithm has less time
to tune to be appropriate for the stationary distribution. This is particularly true for
HMC with small values of δ, since the overly high rejection rates caused by setting δ too
small lead to slower convergence.
Figure 4 plots the convergence of the averaged iterates ¯m as a function of the number of
dual averaging updates for NUTS with δ = 0.65. Except for the stochastic volatility model,
which requires longer to burn in, ¯ roughly converges within a few hundred iterations.

22
The No-U-Turn Sampler

MVN LR HLR SV

δ=0.25
8000
4000
0

δ=0.30
8000
4000
0

δ=0.35
8000
4000
0

δ=0.40
8000
4000
0

δ=0.45
8000
4000
0

δ=0.50
8000
4000
0

δ=0.55
8000
4000
0
8000

δ=0.60
Count

4000
0
8000

δ=0.65
4000
0
8000

δ=0.70
4000
0
8000

δ=0.75
4000
0
8000

δ=0.80
4000
0
8000

δ=0.85
4000
0

δ=0.90
8000
4000
0
8000

δ=0.95
4000
0

22 24 26 28 210 212 22 24 26 28 210 212 22 24 26 28 210 212 22 24 26 28 210 212

NUTS Trajectory Length

Figure 5: Histograms of the trajectory lengths generated by NUTS with various accep-
tance rate targets δ for the multivariate normal (MVN), logistic regression (LR),
hierarchical logistic regression (HLR), and stochastic volatility (SV) models.

4.3 NUTS Trajectory Lengths

Figure 5 shows histograms of the trajectory lengths generated by NUTS. Most of the tra-
jectory lengths are integer powers of two, indicating that the U-turn criterion in equation
9 is usually satisfied only after a doubling is complete and not by one of the intermedi-
ate subtrees generated during the doubling process. This behavior is desirable insofar as it
means that we only occasionally have to throw out entire half-trajectories to satisfy detailed
balance.

23
Hoffman and Gelman

250-D Normal
NUTS HMC εL≈1.00 HMC εL≈1.51 HMC εL≈2.27 HMC εL≈3.42 HMC εL≈5.15 HMC εL≈7.76 HMC εL≈11.70 HMC εL≈17.62 HMC εL≈26.55 HMC εL≈40.00
ESS per gradient 0.00012

0.00010

0.00008

0.00006

0.00004

0.00002

0.00000

0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9
delta

Logistic Regression
NUTS HMC εL≈0.05 HMC εL≈0.08 HMC εL≈0.11 HMC εL≈0.17 HMC εL≈0.26 HMC εL≈0.39 HMC εL≈0.58 HMC εL≈0.88 HMC εL≈1.33 HMC εL≈2.00
ESS per gradient

0.03000

0.02000

0.01000

0.00000

0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9
delta

Hierarchical Logistic Regression

NUTS HMC εL≈0.03 HMC εL≈0.04 HMC εL≈0.06 HMC εL≈0.09 HMC εL≈0.13 HMC εL≈0.19 HMC εL≈0.29 HMC εL≈0.44 HMC εL≈0.66 HMC εL≈1.00

0.00300
ESS per gradient

0.00250
0.00200
0.00150
0.00100
0.00050
0.00000

0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9
delta

Stochastic Volatility
NUTS HMC εL≈0.10 HMC εL≈0.15 HMC εL≈0.23 HMC εL≈0.34 HMC εL≈0.52 HMC εL≈0.78 HMC εL≈1.17 HMC εL≈1.76 HMC εL≈2.65 HMC εL≈4.00
0.00050
ESS per gradient

0.00040

0.00030

0.00020

0.00010

0.00000

0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9
delta

Figure 6: Effective sample size (ESS) as a function of δ and (for HMC) simulation length
L for the multivariate normal, logistic regression, hierarchical logistic regression,
and stochastic volatility models. Each point shows the ESS divided by the number
of gradient evaluations for a separate experiment; lines denote the average of the
points’ y-values for a particular δ. Leftmost plots are NUTS’s performance, each
other plot shows HMC’s performance for a different setting of L.

The trajectory length (measured in number of states visited) grows as the acceptance
rate target δ grows, which is to be expected since a higher δ will lead to a smaller step
size , which in turn will mean that more leapfrog steps are necessary before the trajectory
doubles back on itself and satisfies equation 9.

24
The No-U-Turn Sampler

Metropolis Gibbs NUTS Independent

-1

-2

-3

-15 -10 -5 0 5 10 15 -15 -10 -5 0 5 10 15 -15 -10 -5 0 5 10 15 -15 -10 -5 0 5 10 15

Figure 7: Samples generated by random-walk Metropolis, Gibbs sampling, and NUTS. The plots
compare 1,000 independent draws from a highly correlated 250-dimensional distribu-
tion (right) with 1,000,000 samples (thinned to 1,000 samples for display) generated by
random-walk Metropolis (left), 1,000,000 samples (thinned to 1,000 samples for display)
generated by Gibbs sampling (second from left), and 1,000 samples generated by NUTS
(second from right). Only the first two dimensions are shown here.

4.4 Comparing the Efficiency of HMC and NUTS

Figure 6 compares the efficiency of HMC (with various simulation lengths λ ≈ L) and
NUTS (which chooses simulation lengths automatically). The x-axis in each plot is the
target δ used by the dual averaging algorithm from section 3.2 to automatically tune the step
size . The y-axis is the effective sample size (ESS) generated by each sampler, normalized by
the number of gradient evaluations used in generating the samples. HMC’s best performance
seems to occur around δ = 0.65, suggesting that this is indeed a reasonable default value
for a variety of problems. NUTS’s best performance seems to occur around δ = 0.6, but
does not seem to depend strongly on δ within the range δ ∈ [0.45, 0.65]. δ = 0.6 therefore
seems like a reasonable default value for NUTS.
On the two logistic regression problems NUTS is able to produce effectively indepen-
dent samples about as efficiently as HMC can. On the multivariate normal and stochastic
volatility problems, NUTS with δ = 0.6 outperforms HMC’s best ESS by about a factor of
three.
As expected, HMC’s performance degrades if an inappropriate simulation length is cho-
sen. Across the four target distributions we tested, the best simulation lengths λ for HMC
varied by about a factor of 100, with the longest optimal λ being 17.62 (for the multivari-
ate normal) and the shortest optimal λ being 0.17 (for the simple logistic regression). In
practice, finding a good simulation length for HMC will usually require some number of
preliminary runs. The results in Figure 6 suggest that NUTS can generate samples at least
as efficiently as HMC, even discounting the cost of any preliminary runs needed to tune
HMC’s simulation length.

25
Hoffman and Gelman

4.5 Qualitative Comparison of NUTS, Random-Walk Metropolis, and Gibbs

In section 4.4, we compared the efficiency of NUTS and HMC. In this section, we informally
demonstrate the advantages of NUTS over the popular random-walk Metropolis (RWM)
and Gibbs sampling algorithms. We ran NUTS, RWM, and Gibbs sampling on the 250-
dimensional multivariate normal distribution described in section 4.1. NUTS was run with
δ = 0.5 for 2,000 iterations, with the first 1,000 iterations being used as burn-in and to adapt
. This required about 1,000,000 gradient and likelihood evaluations in total. We ran RWM
for 1,000,000 iterations with an isotropic normal proposal distribution whose variance was
selected beforehand to produce the theoretically optimal acceptance rate of 0.234 (Gelman
et al., 1996). The cost per iteration of RWM is effectively identical to the cost per gradient
evaluation of NUTS, and the two algorithms ran for about the same amount of time. We
ran Gibbs sampling for 1,000,000 sweeps over the 250 parameters. This took longer to run
than NUTS and RWM, since for the multivariate normal each Gibbs sweep costs more than
a single gradient evaluation; we chose to nonetheless run the same number of Gibbs sweeps
as RWM iterations, since for some other models Gibbs sweeps can be done more efficiently.
Figure 7 visually compares independent samples (projected onto the first two dimen-
sions) from the target distribution with samples generated by the three MCMC algorithms.
RWM has barely begun to explore the space. Gibbs does better, but still has left parts
of the space unexplored. NUTS, on the other hand, is able to generate many effectively
independent samples.
We use this simple example to visualize the relative performance of NUTS, Gibbs,
and RWM on a moderately high-dimensional distribution exhibiting strong correlations.
For the multivariate normal, Gibbs or RWM would of course work much better after an
appropriate rotation of the parameter space. But finding and applying an appropriate
rotation can be expensive when the number of parameters D gets large, and RWM and
Gibbs both require O(D2 ) operations per effectively independent sample even under the
highly optimistic assumption that a transformation can be found that renders all parameters
i.i.d. and can be applied cheaply (e.g. in O(D) rather than the usual O(D2 ) cost of matrix-
vector multiplication and the O(D3 ) cost of matrix inversion). This is shown for RWM by
Creutz (1988), and for Gibbs is the result of needing to apply a transformation requiring
O(D) operations D times per Gibbs sweep. For complicated models, even more expensive
transformations often cannot render the parameters sufficiently independent to make RWM
and Gibbs run efficiently. NUTS, on the other hand, is able to efficiently sample from
high-dimensional target distributions without needing to be tuned to the shape of those
distributions.

5. Discussion
We have presented the No-U-Turn Sampler (NUTS), a variant of the powerful Hamilto-
nian Monte Carlo (HMC) Markov chain Monte Carlo (MCMC) algorithm that eliminates
HMC’s dependence on a number-of-steps parameter L but retains (and in some cases im-
proves upon) HMC’s ability to generate effectively independent samples efficiently. We also
developed a method for automatically adapting the step size parameter shared by NUTS
and HMC via an adaptation of the dual averaging algorithm of Nesterov (2009), making
it possible to run NUTS with no hand tuning at all. The dual averaging approach we

26
The No-U-Turn Sampler

developed in this paper could also be applied to other MCMC algorithms in place of more
traditional adaptive MCMC approaches based on the Robbins-Monro stochastic approxi-
mation algorithm (Andrieu and Thoms, 2008; Robbins and Monro, 1951).
In this paper we have only compared NUTS with the basic HMC algorithm, and not its
extensions, several of which are reviewed by Neal (2011). We only considered simple kinetic
energy functions of the form 12 r · r, but both NUTS and HMC can benefit from introducing
a “mass” matrix M and using the kinetic energy function 12 rT M −1 r. If M −1 approximates
the covariance matrix of p(θ), then this kinetic energy function will reduce the negative
impacts strong correlations and bad scaling have on the efficiency of both NUTS and HMC.
Another extension of HMC introduced by Neal (1994) considers windows of proposed states
rather than simply the state at the end of the trajectory to allow for larger step sizes without
sacrificing acceptance rates (at the expense of introducing a window size parameter that
must be tuned). The effectiveness of the windowed HMC algorithm suggests that NUTS’s
lack of a single accept/reject step may be responsible for some of its performance gains over
vanilla HMC.
Girolami and Calderhead (2011) recently introduced Riemannian Manifold Hamilto-
nian Monte Carlo (RMHMC), a variant on HMC that simulates Hamiltonian dynamics in
Riemannian rather than Euclidean spaces, effectively allowing for position-dependent mass
matrices. Although the worst-case O(D3 ) matrix inversion costs associated with this al-
gorithm often make it expensive to apply in high dimensions, when these costs are not
too onerous RMHMC’s ability to adapt its kinetic energy function makes it very efficient.
There are no technical obstacles that stand in the way of combining NUTS’s ability to adapt
its trajectory lengths with RMHMC’s ability to adapt its mass matrices; exploring such a
hybrid algorithm seems like a natural direction for future research.
Like HMC, NUTS can only be used to resample unconstrained continuous-valued vari-
ables with respect to which the target distribution is differentiable almost everywhere. HMC
and NUTS can deal with simple constraints such as nonnegativity or restriction to the sim-
plex by an appropriate change of variable, but discrete variables must either be summed out
or handled by other algorithms such as Gibbs sampling. In models with discrete variables,
NUTS’s ability to automatically choose a trajectory length may make it more effective than
HMC when discrete variables are present, since it is not tied to a single simulation length
that may be appropriate for one setting of the discrete variables but not for others.
Some models include hard constraints that are too complex to eliminate by a simple
change of variables. Such models will have regions of the parameter space with 0 posterior
probability. When HMC encounters such a region, the best it can do is stop short and restart
with a new momentum vector, wasting any work done before violating the constraints (Neal,
2011). By contrast, when NUTS encounters a 0-probability region it stops short and samples
from the set of points visited up to that point, making at least some progress.
NUTS with dual averaging makes it possible for Bayesian data analysts to obtain the
efficiency of HMC without spending time and effort hand-tuning HMC’s parameters. This
is desirable even for those practitioners who have experience using and tuning HMC, but it
is especially valuable for those who lack this experience. In particular, NUTS’s ability to
operate efficiently without user intervention makes it well suited for use in generic inference
engines in the mold of BUGS (Gilks and Spiegelhalter, 1992), which until now have largely
relied on much less efficient algorithms such as Gibbs sampling. We are currently devel-

27
Hoffman and Gelman

oping an automatic Bayesian inference system called Stan, which uses NUTS as its core
inference algorithm for continuous-valued parameters. Stan promises to be able to generate
effectively independent samples from complex models’ posteriors orders of magnitude faster
than previous systems such as BUGS and JAGS.
In summary, NUTS makes it possible to efficiently perform Bayesian posterior inference
on a large class of complex, high-dimensional models with minimal human intervention. It is
our hope that NUTS will allow researchers and data analysts to spend more time developing
and testing models and less time worrying about how to fit those models to data.

Acknowledgments

This work was partially supported by Institute of Education Sciences grant ED-GRANTS-
032309-005, Department of Energy grant de-sc0002099, National Science Foundation grant
ATM-0934516, and National Science Foundation grant SES-1023189.

Appendix A. Estimating Effective Sample Size

For a function f (θ), a target distribution p(θ), and a Markov chain Monte Carlo (MCMC)
sampler that produces a set of M correlated samples drawn from some distribution q(θ1:M )
such that q(θm ) = p(θm ) for any m ∈ {1, . . . , M }, the effective sample size (ESS) of θ1:M is
the number of independent samples that would be needed to obtain a Monte Carlo estimate
of the mean of f with equal variance to the MCMC estimate of the mean of f :
1 PM s
1:M Vq [ M s=1 f (θ )] M
ESSq,f (θ )=M = PM −1 ;
Vp [f (θ)]
M
1 + 2 s=1 (1 − M s
)ρfs
Eq [(f (θt ) − Ep [f (θ)])(f (θt−s ) − Ep [f (θ)])]
ρfs ≡ , (25)
Vp [f (θ)]

where ρfs denotes the autocorrelation under q of f at lag s and Vp [x] denotes the variance
of a random variable x under the distribution p(x).
To estimate ESS, we first compute the following estimate of the autocorrelation spectrum
for the function f (θ):
M
1 X
ρ̂fs = 2 (f (θm ) − µ̂f )(f (θm−s ) − µ̂f ), (26)
σ̂f (M − s) m=s+1

where the estimates µ̂f and σ̂f2 of the mean and variance of the function f are computed with
high precision from a separated 50,000-sample run of NUTS with δ = 0.5. We do not take
these estimates from the chain whose autocorrelations we are trying to estimate—doing
so can lead to serious underestimates of the level of autocorrelation (and thus a serious
overestimate of the number of effective samples) if the chain has not yet converged or has
not yet generated a fair number of effectively independent samples.
Any estimator of ρfs is necessarily noisy for large lags s, so using the naive estimator
ˆ q,f (θ1:M ) =
ESS M
PM −1 f will yield bad results. Instead, we truncate the sum over
s
1+2 s=1 (1− M )ρ̂s

28
The No-U-Turn Sampler

the autocorrelations when the autocorrelations first dip below 0.05, yielding the estimator

ˆ q,f (θ1:M ) = M
ESS PMfcutoff ; Mfcutoff ≡ min s s.t. ρ̂fs < 0.05. (27)
s f s
1+2 s=1 (1 − M )ρ̂s

We found that this method for estimating ESS gave more reliable confidence intervals
for MCMC estimators than the autoregressive approach used by CODA (Plummer et al.,
2006). (The more accurate estimator comes at the expense of needing to compute a costly
high-quality estimate of the true mean and variance of the target distribution.) The 0.05
cutoff is somewhat arbitrary; in our experiments we did not find the results to be very
sensitive to the precise value of this cutoff.

References
C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18
(4):343–373, 2008.

A. Beskos, N. Pillai, G. Roberts, J. Sanz-Serna, and A. Stuart. Optimal tuning of the

hybrid monte-carlo algorithm. Arxiv preprint arXiv:1001.4460, 2010.

M. Creutz. Global Monte Carlo algorithms for many-fermion systems. Physical Review D,
38(4):1228–1238, 1988.

H. Daume III. HBC: Hierarchical Bayes compiler, 2007. URL http://hal3.name/HMC.

A. Duane, A. Kennedy, B. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics letters
B, 195(2):216–222, 1987.

A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL http://archive.
ics.uci.edu/ml.

A. Gelman, G. Roberts, and W. Gilks. Efficient Metropolis jumping rules. Bayesian statis-
tics, 5:599–608, 1996.

A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis. Chapman & Hall,
2004.

S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence,
6:721–741, 1984.

W. Gilks and D. Spiegelhalter. A language and program for complex Bayesian modelling.
The Statistician, 3:169–177, 1992.

M. Girolami and B. Calderhead. Riemann manifold langevin and hamiltonian monte carlo
methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73
(2):123–214, 2011.

A. Griewank and A. Walther. Evaluating derivatives: principles and techniques of algorith-

mic differentiation. Society for Industrial and Applied Mathematics (SIAM), 2008.

29
Hoffman and Gelman

B. Leimkuhler and S. Reich. Simulating Hamiltonian dynamics, volume 14. Cambridge

University Press, 2004.

N. Metropolis, A. Rosenbluth, M. Rosenbluth, M. Teller, and E. Teller. Equations of state

calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092,
1953.

T. Minka, J. Winn, J. Guiver, and D. Knowles. Infer.NET 2.4, Microsoft Research Cam-
bridge, 2010. http://research.microsoft.com/infernet.

R. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report
CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.

R. Neal. An improved acceptance procedure for the hybrid Monte Carlo algorithm. Journal
of Computational Physics, 111:194–203, 1994.

R. Neal. Slice sampling. Annals of Statistics, 31(3):705–741, 2003.

R. Neal. Handbook of Markov Chain Monte Carlo, chapter 5: MCMC Using Hamiltonian
Dynamics. CRC Press, 2011.

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical pro-

gramming, 120(1):221–259, 2009.

A. Patil, D. Huard, and C. Fonnesbeck. PyMC: Bayesian stochastic modelling in python.

Journal of Statistical Software, 35(4):1–81, 2010.

M. Plummer, N. Best, K. Cowles, and K. Vines. CODA: Convergence diagnosis and output
analysis for MCMC. R News, 6(1):7–11, March 2006.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical

Statistics, pages 400–407, 1951.

L. Tierney and A. Mira. Some adaptive Monte Carlo methods for Bayesian inference.
Statistics in Medicine, 18:2507–2515, 1999.

M. Wainwright and M. Jordan. Graphical models, exponential families, and variational

inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

An Efficient Implementation of Riemannian Manifold Hamiltonian Monte Carlo For Gaussian Process Models
No ratings yet
An Efficient Implementation of Riemannian Manifold Hamiltonian Monte Carlo For Gaussian Process Models
9 pages
Hamiltonian Monte Carlo For Efficient Gaussian Sampling: Long and Random Steps
No ratings yet
Hamiltonian Monte Carlo For Efficient Gaussian Sampling: Long and Random Steps
30 pages
Variational Langevin Hamiltonian Monte Carlo For Distant Multi-Modal Sampling
No ratings yet
Variational Langevin Hamiltonian Monte Carlo For Distant Multi-Modal Sampling
25 pages
G H M C N N: Eneralizing Amiltonian Onte Arlo With Eural Etworks
No ratings yet
G H M C N N: Eneralizing Amiltonian Onte Arlo With Eural Etworks
15 pages
Markov Chain Monte Carlo:: A Workhorse For Modern Scientific Computation
No ratings yet
Markov Chain Monte Carlo:: A Workhorse For Modern Scientific Computation
22 pages
Markov Chain Monte Carlo
No ratings yet
Markov Chain Monte Carlo
13 pages
Mcmc-A Comparative Study
No ratings yet
Mcmc-A Comparative Study
29 pages
The Metropolis-Hastings Algorithm: C.P. Robert
No ratings yet
The Metropolis-Hastings Algorithm: C.P. Robert
15 pages
An Introduction To MCMC For Machine Learning: Abstract
No ratings yet
An Introduction To MCMC For Machine Learning: Abstract
39 pages
Intro to MCMC in Machine Learning
No ratings yet
Intro to MCMC in Machine Learning
39 pages
Monte Carlo
No ratings yet
Monte Carlo
59 pages
Cra I U Rosenthal Ann Rev
No ratings yet
Cra I U Rosenthal Ann Rev
40 pages
Bayesian Modelling Tuts-12-15
No ratings yet
Bayesian Modelling Tuts-12-15
4 pages
MCMC Final Edition
No ratings yet
MCMC Final Edition
17 pages
2017-The Geometric Foundations of Hamiltonian Monte Carlo
No ratings yet
2017-The Geometric Foundations of Hamiltonian Monte Carlo
42 pages
CPSC 540: Machine Learning: Monte Carlo Methods
No ratings yet
CPSC 540: Machine Learning: Monte Carlo Methods
32 pages
Advanced MCMC Methods for Researchers
No ratings yet
Advanced MCMC Methods for Researchers
70 pages
1 s2.0 S0304414912002621 Main
No ratings yet
1 s2.0 S0304414912002621 Main
39 pages
SI Nonlin
No ratings yet
SI Nonlin
14 pages
Bayesian Inference
No ratings yet
Bayesian Inference
28 pages
NeurIPS 2019 Sample Adaptive MCMC Paper
No ratings yet
NeurIPS 2019 Sample Adaptive MCMC Paper
12 pages
ML - Unit-V-1
No ratings yet
ML - Unit-V-1
42 pages
Monte Carlo Method
0% (1)
Monte Carlo Method
23 pages
CPSC 440: Advanced Machine Learning: Monte Carlo Methods
No ratings yet
CPSC 440: Advanced Machine Learning: Monte Carlo Methods
30 pages
Jun S, Liu 2001 - Monte Carlo Strategies in Scientific Cotnputing
No ratings yet
Jun S, Liu 2001 - Monte Carlo Strategies in Scientific Cotnputing
350 pages
Markov Chain Monte Carlo For Dummies: Anada
No ratings yet
Markov Chain Monte Carlo For Dummies: Anada
65 pages
Annurev Statistics 022513 115540
No ratings yet
Annurev Statistics 022513 115540
26 pages
Stochastic Gradient MCMC Recipe
No ratings yet
Stochastic Gradient MCMC Recipe
16 pages
Unit V Graphical Models
No ratings yet
Unit V Graphical Models
23 pages
Understanding Monte Carlo Methods
No ratings yet
Understanding Monte Carlo Methods
5 pages
Particle Gibbs Without The Gibbs Bit
No ratings yet
Particle Gibbs Without The Gibbs Bit
12 pages
CSE291D Lecture 6: Monte Carlo Methods 2: Markov Chain Monte Carlo
No ratings yet
CSE291D Lecture 6: Monte Carlo Methods 2: Markov Chain Monte Carlo
66 pages
Chib Greenberg 1995
No ratings yet
Chib Greenberg 1995
12 pages
UNIT-5 Markov Chain Monte Carlo Methods
No ratings yet
UNIT-5 Markov Chain Monte Carlo Methods
17 pages
Monte Carlo Simulation Technique
No ratings yet
Monte Carlo Simulation Technique
48 pages
Monte Carlo Method
No ratings yet
Monte Carlo Method
7 pages
Metropolis-Hastings Algorithm - Wikipedia
No ratings yet
Metropolis-Hastings Algorithm - Wikipedia
10 pages
Structural Breaks in Path Dependent Models
No ratings yet
Structural Breaks in Path Dependent Models
45 pages
Accelerating MCMC with Modern Hardware
No ratings yet
Accelerating MCMC with Modern Hardware
26 pages
Journal of Statistical Software
No ratings yet
Journal of Statistical Software
41 pages
Scalable Monte Carlo For Bayesian Learning: Paul Fearnhead, Christopher Nemeth, Chris J. Oates and Chris Sherlock
No ratings yet
Scalable Monte Carlo For Bayesian Learning: Paul Fearnhead, Christopher Nemeth, Chris J. Oates and Chris Sherlock
244 pages
Big Data JPM
No ratings yet
Big Data JPM
31 pages
MCMC Convergence Diagnostics Review
No ratings yet
MCMC Convergence Diagnostics Review
52 pages
Lecture 3 9.66
No ratings yet
Lecture 3 9.66
45 pages
Statistical Computation Algorithm Based On Markov Chain Monte Carlo Sampling To Solve Multivariable Nonlinear Optimization
No ratings yet
Statistical Computation Algorithm Based On Markov Chain Monte Carlo Sampling To Solve Multivariable Nonlinear Optimization
5 pages
Hogg 2018 ApJS 236 11
No ratings yet
Hogg 2018 ApJS 236 11
18 pages
PDF Sampling: Markov Chain Monte Carlo: X N I I
No ratings yet
PDF Sampling: Markov Chain Monte Carlo: X N I I
13 pages
Montepython: Implementing Quantum Monte Carlo Using Python
No ratings yet
Montepython: Implementing Quantum Monte Carlo Using Python
17 pages
Chib UnderstandingMetropolisHastingsAlgorithm 1995
No ratings yet
Chib UnderstandingMetropolisHastingsAlgorithm 1995
10 pages
Hidden Markov
No ratings yet
Hidden Markov
42 pages
Four Lectures On Computational Statistical Physics: February 2009
No ratings yet
Four Lectures On Computational Statistical Physics: February 2009
38 pages
Markov Chain Monte Carlo For Computing Rare Event Probabilities For A Heavy Tailed Random Walk
No ratings yet
Markov Chain Monte Carlo For Computing Rare Event Probabilities For A Heavy Tailed Random Walk
18 pages
03 Markov Chain Monte Carlo
No ratings yet
03 Markov Chain Monte Carlo
4 pages
Maxent 00 B
No ratings yet
Maxent 00 B
25 pages
Linear Momentum in Physics Form 4
100% (1)
Linear Momentum in Physics Form 4
41 pages
Appeal Judgment: Tay Kok Wah Case
No ratings yet
Appeal Judgment: Tay Kok Wah Case
16 pages
Chapter 01
No ratings yet
Chapter 01
26 pages
ProblemSet9 Solutions
No ratings yet
ProblemSet9 Solutions
8 pages
Bartlett 1932 Memory War of The Ghosts
No ratings yet
Bartlett 1932 Memory War of The Ghosts
6 pages
Propp Function's Types:: 1-Absentation: One of The Members of A Family Absents Himself From Home
No ratings yet
Propp Function's Types:: 1-Absentation: One of The Members of A Family Absents Himself From Home
10 pages
Using A Multifaceted Approach To Working With Children Who Have Differences in Sensory Processing and Integration, 2017
No ratings yet
Using A Multifaceted Approach To Working With Children Who Have Differences in Sensory Processing and Integration, 2017
11 pages
Creating Multi User Role Based Admin Using PHP Mysql and Bootstrap
No ratings yet
Creating Multi User Role Based Admin Using PHP Mysql and Bootstrap
8 pages
Consumer Attitudes: Motorola in Bangladesh
No ratings yet
Consumer Attitudes: Motorola in Bangladesh
28 pages
Sulzer ZA40S Engine Training Course
No ratings yet
Sulzer ZA40S Engine Training Course
4 pages
Peter Deunov - Methods For Self Improvement
100% (2)
Peter Deunov - Methods For Self Improvement
49 pages
CUSTOMER ANALYSIS - Report
No ratings yet
CUSTOMER ANALYSIS - Report
10 pages
ANSYS CFX, Release 10.0
No ratings yet
ANSYS CFX, Release 10.0
62 pages
Lesson 7.2 - Determining Textual Evidence
No ratings yet
Lesson 7.2 - Determining Textual Evidence
14 pages
Front Cover Dac 21903
No ratings yet
Front Cover Dac 21903
3 pages
Fundamental Food Microbiology Fourth Edition Ray Instant Access 2025
No ratings yet
Fundamental Food Microbiology Fourth Edition Ray Instant Access 2025
59 pages
West Bengal Environment Report
No ratings yet
West Bengal Environment Report
404 pages
Introduction To Marketing Chapter-1: Meaning of Market
No ratings yet
Introduction To Marketing Chapter-1: Meaning of Market
6 pages
Coaching for Hotel Staff Development
No ratings yet
Coaching for Hotel Staff Development
52 pages
Translation Techniques Explained
No ratings yet
Translation Techniques Explained
8 pages
TO THE APPLICANT: This Form Should Be Given To Professors Who Are Able To Evaluate You For Graduate Study
No ratings yet
TO THE APPLICANT: This Form Should Be Given To Professors Who Are Able To Evaluate You For Graduate Study
2 pages
Crossword
No ratings yet
Crossword
1 page
Parts of A Microscope and Their Functions
100% (1)
Parts of A Microscope and Their Functions
3 pages
Vampire The Masquerade - Paris by Night
No ratings yet
Vampire The Masquerade - Paris by Night
38 pages
Og March Price With Mop List 2025
No ratings yet
Og March Price With Mop List 2025
2 pages
D. A. Knowles
No ratings yet
D. A. Knowles
7 pages
Présentation
No ratings yet
Présentation
58 pages
Evaluating A Case Study Using Bloom's Taxonomy of Education: ASE Tudy
No ratings yet
Evaluating A Case Study Using Bloom's Taxonomy of Education: ASE Tudy
10 pages
Key Đề Cương Học Kì 2 Thpt Phan Châu Trinh
No ratings yet
Key Đề Cương Học Kì 2 Thpt Phan Châu Trinh
21 pages
collegeListP2 PDF
No ratings yet
collegeListP2 PDF
71 pages