Bayesian Optimization
ROMAN GARNETT
Bayesian optimization is a methodology for optimizing expensive objective functions that has
proven success in the sciences, engineering, and beyond. This timely text provides a self-contained
and comprehensive introduction to the subject, starting from scratch and carefully developing all
the key ideas along the way. This bottom-up approach illuminates unifying themes in the design of
Bayesian optimization algorithms and builds a solid theoretical foundation for approaching novel
situations.
The core of the book is divided into three main parts, covering theoretical and practical aspects of Gaussian process modeling, the Bayesian approach to sequential decision making, and the realization and computation of practical and effective optimization policies.
Following this foundational material, the book provides an overview of theoretical convergence
results, a survey of notable extensions, a comprehensive history of Bayesian optimization, and an
extensive annotated bibliography of applications.
Shaftesbury Road, Cambridge CB2 8EA, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
103 Penang Road, #05β06/07, Visioncrest Commercial, Singapore 238467
[Link]
Information on this title: [Link]/9781108425780
DOI: 10.1017/9781108348973
© Roman Garnett 2023
This publication is in copyright. Subject to statutory exception and to the provisions
of relevant collective licensing agreements, no reproduction of any part may take
place without the written permission of Cambridge University Press & Assessment.
First published 2023
Printed in the United Kingdom by TJ Books Limited, Padstow Cornwall
A catalogue record for this publication is available from the British Library.
ISBN 978-1-108-42578-0 Hardback
Cambridge University Press & Assessment has no responsibility for the persistence
or accuracy of URLs for external or third-party internet websites referred to in this
publication and does not guarantee that any content on such websites is, or will
remain, accurate or appropriate.
CONTENTS
preface ix
notation xiii
1 introduction 1
1.1 Formalization of Optimization 2
1.2 The Bayesian Approach 5
2 gaussian processes 15
2.1 Definition and Basic Properties 16
2.2 Inference with Exact and Noisy Observations 18
2.3 Overview of Remainder of Chapter 26
2.4 Joint Gaussian Processes 26
2.5 Continuity 28
2.6 Differentiability 30
2.7 Existence and Uniqueness of Global Maxima 33
2.8 Inference with Non-Gaussian Observations and Constraints 35
2.9 Summary of Major Ideas 41
9 implementation 201
9.1 Gaussian Process Inference, Scaling, and Approximation 201
9.2 Optimizing Acquisition Functions 207
9.3 Starting and Stopping Optimization 210
9.4 Summary of Major Ideas 212
c gradients 307
references 331
index 353
PREFACE
Can I ask a dumb question about gps? Let's say that I'm doing function approximation on an interval with a gp. So I've got this mean function μ(x) and a variance function v(x). Is it true that if I pick a particular point x, then f(x) ∼ N(μ(x), v(x))? Please say yes.
If this is true, then I think the idea of doing Bayesian optimization using gps is, dare I say, trivial.
The nuances of some applications require modifications to the basic sequential optimization scheme that is the focus of the bulk of the book, and Chapter 11 introduces several notable extensions to this basic setup. Each is systematically presented through the unifying lens of Bayesian decision theory to illustrate how one might proceed when facing a novel situation.
Finally, Chapter 12 provides a brief and standalone history of Bayesian optimization. This was perhaps the most fun chapter for me to write,
if only because it forced me to plod through old Soviet literature (in an
actual library! What a novelty these days!). To my surprise I was able to
antedate many Bayesian optimization policies beyond their commonly
attested origin, including expected improvement, knowledge gradient,
probability of improvement, and upper confidence bound. (A reader
familiar with the literature may be surprised to learn the last of these
was actually the first policy discussed by kushner in his 1962 paper.)
Despite my best efforts, there may still be stones left to be overturned
before the complete history is revealed.
Dependencies between the main chapters are illustrated in the margin. There are two natural linearizations of the material. The first is the one I adopted and personally prefer, which covers modeling prior to decision making. However, one could also proceed in the other order, reading Chapters 5–7 first, then looping back to Chapter 2. After covering the material in these chapters (in either order), the remainder of the book can be perused at will. Logical partial paths through the book include:
Roman Garnett
St. Louis, Missouri, November 2022
NOTATION
All vectors are column vectors and are denoted in lowercase bold: x ∈ ℝⁿ.
Matrices are denoted in uppercase bold: A.
We adopt the "numerator layout" convention for matrix calculus: the derivative of a vector by a scalar is a (column) vector, whereas the derivative of a scalar by a vector is a row vector. This results in the chain rule proceeding from left to right; for example, if a vector x(θ) depends on a scalar parameter θ, then for a function f(x), we have:

∂f/∂θ = (∂f/∂x)(∂x/∂θ).
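As a quick sanity check of this convention (an illustrative example, not from the original text): take f(x) = ½ xᵀx and x(θ) = θa for a fixed vector a. Then ∂f/∂x = xᵀ is a row vector, ∂x/∂θ = a is a column vector, and the left-to-right product gives the scalar

∂f/∂θ = xᵀa = θ aᵀa,

as expected.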
When an indicator function is required, we use the Iverson bracket notation. For a statement s, we have:

[s] = 1 if s is true; 0 otherwise.
symbol   description
≡   identical equality of functions; for a constant c, f ≡ c is a constant function
∇   gradient operator
∅   termination option: the action of immediately terminating optimization
≺   either Pareto dominance or the Löwner order: for symmetric A, B, A ≺ B if and only if B − A is positive definite
∼   is sampled according to: ω ∼ p(ω) means ω is a realization of a random variable with probability density p(ω)
⊎ᵢ Xᵢ   disjoint union of {Xᵢ}: ⊎ᵢ Xᵢ = ∪ᵢ {(x, i) | x ∈ Xᵢ}
|A|   determinant of square matrix A
|x|   Euclidean norm of vector x; |x − y| is thus the Euclidean distance between vectors x and y
‖f‖_HK   norm of function f in reproducing kernel Hilbert space H_K
A⁻¹   inverse of square matrix A
xᵀ   transpose of vector x
0   vector or matrix of zeros
A   action space for a decision
α(x; D)   acquisition function evaluating x given data D
α_τ(x; D)   expected marginal gain in u(D) after observing at x then making τ − 1 additional optimal observations given the outcome
α*_τ(D)   value of D with horizon τ: expected marginal gain in u(D) from τ additional optimal observations
α_EI   expected improvement
α_f*   mutual information between y and f*
α_KG   knowledge gradient
α_PI   probability of improvement
α_x*   mutual information between y and x*
α_UCB   upper confidence bound
α_TS   Thompson sampling "acquisition function:" a draw f ∼ p(f | D)
β   confidence parameter in Gaussian process upper confidence bound policy
β(x; D)   batch acquisition function evaluating x given data D; may have modifiers analogous to α
C   prior covariance matrix of observed values y: C = cov[y]
c(D)   cost of acquiring data D
chol A   Cholesky decomposition of positive definite matrix A: if Λ = chol A, then A = ΛΛᵀ
corr[ω, ψ]   correlation of random variables ω and ψ; with a single argument, corr[ω] = corr[ω, ω]
cov[ω, ψ]   covariance of random variables ω and ψ; with a single argument, cov[ω] = cov[ω, ω]
D   set of observed data, D = (x, y)
D′, D₁   set of observed data after observing at x: D′ = D ∪ {(x, y)} = (x′, y′)
D_τ   set of observed data after τ observations
D_KL[p ‖ q]   Kullback–Leibler divergence between distributions with probability densities p and q
Δ(x, y)   marginal gain in utility after acquiring observation (x, y): Δ(x, y) = u(D′) − u(D)
δ(ω − a)   Dirac delta distribution on ω with point mass at a
diag x   diagonal matrix with diagonal x
𝔼, 𝔼_ω   expectation, expectation with respect to ω
ε   measurement error associated with an observation at x: ε = y − ϕ
f   objective function; f: X → ℝ
f|_Y   the restriction of f onto the subdomain Y ⊂ X
f*   globally maximal value of the objective function: f* = max f
γ_τ   information capacity of an observation process given τ iterations
GP(f; μ, K)   Gaussian process on f with mean function μ and covariance function K
H_K   reproducing kernel Hilbert space associated with kernel K
H_K[B]   ball of radius B in H_K: {f | ‖f‖_HK ≤ B}
H[ω]   discrete or differential entropy of random variable ω
H[ω | D]   discrete or differential entropy of random variable ω after conditioning on D
I(ω; ψ)   mutual information between random variables ω and ψ
I(ω; ψ | D)   mutual information between random variables ω and ψ after conditioning on D
I   identity matrix
K   prior covariance function: K = cov[f]
K_D   posterior covariance function given data D: K_D = cov[f | D]
K_M   Matérn covariance function
K_SE   squared exponential covariance function
κ   cross-covariance between f and observed values y: κ(x) = cov[y, ϕ | x]
ℓ   either a length-scale parameter or the lookahead horizon
λ   output-scale parameter
M   space of models indexed by the hyperparameter vector θ
m   prior expected value of observed values y: m = 𝔼[y]
μ   either the prior mean function, μ = 𝔼[f], or the predictive mean of ϕ: μ = 𝔼[ϕ | x, D] = μ_D(x)
μ_D   posterior mean function given data D: μ_D = 𝔼[f | D]
N(x; μ, Σ)   multivariate normal distribution on x with mean vector μ and covariance matrix Σ
N   measurement error covariance corresponding to observed values y
O   is asymptotically bounded above by: for nonnegative functions f, g of τ, f = O(g) if f/g is asymptotically bounded by a constant as τ → ∞
O*   as above with logarithmic factors suppressed: f = O*(g) if f(τ) = O(g(τ) logᵏ τ) for some k
Ω   is asymptotically bounded below by: f = Ω(g) if g = O(f)
p   probability density
q   either an approximation to probability density p or a quantile function
Φ(z)   standard normal cumulative distribution function: Φ(z) = ∫_{−∞}^{z} φ(z′) dz′
τ   either decision horizon (in the context of decision making) or number of optimization iterations passed (in the context of asymptotic analysis)
Θ   is asymptotically bounded above and below by: f = Θ(g) if f = O(g) and f = Ω(g)
θ   vector of hyperparameters indexing a model space M
tr A   trace of square matrix A
u(D)   utility of data D
var[ω]   variance of random variable ω
x   putative input location of the objective function
x   either a sequence of observed locations x = {xᵢ} or (when the distinction is important) a vector-valued input location
x*   a location attaining the globally maximal value of f: x* ∈ arg max f; f(x*) = f*
X   domain of objective function
y   value resulting from an observation at x
y   observed values resulting from observations at locations x
z   z-score of measurement y at x: z = (y − μ)/σ
1
INTRODUCTION
Optimization policy
Directly solving for the location of global optima is infeasible except in
exceptional circumstances. The tools of traditional calculus are virtually
powerless in this setting; for example, enumerating and classifying every
stationary point in the domain would be tedious at best and perhaps
even impossible. Mathematical optimization instead takes an indirect
approach: we design a sequence of experiments to probe the objective
function for information that, we hope, will reveal the solution to (1.1).
The iterative procedure in Algorithm 1.1 formalizes this process. We begin with an initial (possibly empty) dataset D that we grow incrementally through a sequence of observations of our design. In each iteration, an optimization policy inspects the available data and selects a point x ∈ X where we make our next observation.5 This action in turn reveals a corresponding value y provided by the system under study. We append the newly observed information to our dataset and finally decide whether to continue with another observation or terminate and return the current data. When we inevitably do choose to terminate, the returned data can be used by an external consumer as desired, for example to inform a subsequent decision (terminal recommendations: § 5.1, p. 90).

5 Here "policy" has the same meaning as in other decision-making contexts: it maps our state (indexed by our data, D) to an action (the location of our next observation, x).
We place no restrictions on how an optimization policy is imple-
mented beyond mapping an arbitrary dataset to some point in the do-
main for evaluation. A policy may be deterministic or stochastic, as
demonstrated respectively by the prototypical examples of grid search
and random search. In fact, these popular policies are nonadaptive and
completely ignore the observed data. However, when observations only
come at significant cost, we will naturally prefer policies that adapt their
behavior in light of evolving information. The primary challenge in opti-
Observation model
For optimization to be feasible, the observations we obtain must provide
information about the objective function that can guide our search and in
aggregate determine the solution to (1.1). A near-universal assumption in
mathematical optimization is that observations yield exact evaluations of
the objective function at our chosen locations. However, this assumption
is unduly restrictive: many settings feature inexact measurements due
to noisy sensors, imperfect simulation, or statistical approximation. A
typical example featuring additive observation noise is shown in the margin. Although the objective function is not observed directly, the noisy measurements nonetheless constrain the plausible options due to strong dependence on the objective. (Margin figure: inexact observations of an objective function corrupted by additive noise.)
We thus relax the assumption of exact observation and instead assume that observations are realized by a stochastic mechanism depending on the objective function. Namely, we assume that the value y resulting from an observation at some point x is distributed according to an observation model depending on the underlying objective function value ϕ = f(x):

p(y | x, ϕ). (1.2)

Through judicious design of the observation model, we may consider a wide range of observation mechanisms.

As with the optimization policy, we do not make any assumptions about the nature of the observation model, save one. Unless otherwise mentioned, we assume that a set of multiple measurements y are conditionally independent given the corresponding observation locations x and objective function values ϕ = f(x):

p(y | x, ϕ) = ∏ᵢ p(yᵢ | xᵢ, ϕᵢ). (1.3)

This is not strictly necessary but is overwhelmingly common in practice and will simplify our presentation considerably.
One particular observation model will enjoy most of our attention in this book: additive Gaussian noise. Here we model the value y observed at x as

y = ϕ + ε,

where ε represents measurement error. Errors are assumed to be Gaussian distributed with mean zero, implying a Gaussian observation model:

p(y | x, ϕ, σₙ) = N(y; ϕ, σₙ²). (1.4)

Here the observation noise scale σₙ may optionally depend on x, allowing us to model both homoskedastic and heteroskedastic errors (heteroskedastic noise: § 2.2, p. 25). (Margin figure: additive Gaussian noise – the distribution of the value y observed at x is Gaussian, centered on the objective function value ϕ.)
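The following minimal sketch (not from the text; the objective and noise scales are made-up examples) simulates observations under the additive Gaussian model (1.4), covering both the homoskedastic and heteroskedastic cases:

import numpy as np

def observe(f, x, noise_scale, rng=np.random.default_rng(0)):
    """Simulate observations under the additive Gaussian noise model (1.4).

    f           -- objective function (hypothetical example)
    x           -- array of observation locations
    noise_scale -- a scalar (homoskedastic) or a function of x
                   (heteroskedastic), giving the noise scale sigma_n
    """
    phi = f(x)                                              # latent objective values
    sn = noise_scale(x) if callable(noise_scale) else noise_scale
    return phi + rng.normal(scale=sn, size=phi.shape)       # y = phi + eps

# usage: a toy objective with homoskedastic and heteroskedastic noise
f = lambda x: np.sin(3 * x)
x = np.linspace(0, 2, 5)
y_homo = observe(f, x, noise_scale=0.1)
y_hetero = observe(f, x, noise_scale=lambda x: 0.05 + 0.2 * x)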
Termination
The final decision we make in each iteration of optimization is whether
to terminate immediately or continue with another observation. As with
the optimization policy, we do not assume any particular mechanism by
which this decision is made. Termination may be deterministic – such as stopping after reaching a certain optimization goal or exhausting a preallocated observation budget – or stochastic, and may optionally depend on the observed data. In many cases, the time of termination may in fact not be under the control of the optimization routine at all but instead decided by an external agent. However, we will also consider scenarios where the optimization procedure can dynamically choose when to return based upon inspection of the available data (optimal termination: § 5.4, p. 103; practical termination: § 9.3, p. 210).
observation with some measure of faith that the outcome will ultimately
prove beneficial and justify the cost of obtaining it. The sequential nature
of optimization further compounds the weight of this uncertainty, as the
outcome of each observation not only has an immediate impact, but also
forms the basis on which all future decisions are made. Developing an
effective policy requires somehow addressing this uncertainty.
The Bayesian approach systematically relies on probability and Bayes-
ian inference to reason about the uncertain quantities arising during
optimization. This critically includes the objective function itself, which
is treated as a random variable to be inferred in light of our prior ex-
pectations and any available data. In Bayesian optimization, this belief
then takes an active role in decision making by guiding the optimiza-
tion policy, which may evaluate the merit of a proposed observation
location according to our belief about the value we might observe. We
introduce the key ideas of this process with examples below, starting
with a refresher on Bayesian inference.
Bayesian inference
To frame the following discussion, we offer a quick overview of Bayesian inference as a reminder to the reader. This introduction is far from complete, but there are numerous excellent references available.6

Bayesian inference is a framework for inferring uncertain features of a system of interest from observations grounded in the laws of probability. To illustrate the basic ideas, we may begin by identifying some unknown feature of a given system that we wish to reason about. In the context of optimization, this might represent, for example, the value of the objective function at a given location, or the location x* or value f* of the global optimum (1.1). We will take the first of these as a running example: inferring the value of an objective function at some arbitrary point x, ϕ = f(x). We will shortly extend this example to inference about the entire objective function.

6 The literature is vast. The following references are excellent, but no list can be complete:
d. j. c. mackay (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
a. o'hagan and j. forster (2004). Kendall's Advanced Theory of Statistics. Vol. 2b: Bayesian Inference. Arnold.
j. o. berger (1985). Statistical Decision Theory and Bayesian Analysis. Springer–Verlag.
In the Bayesian approach to inference, all unknown quantities are
treated as random variables. This is a powerful convention as it allows us
to represent beliefs about these quantities with probability distributions
reflecting their plausible values. Inference then takes the form of an
inductive process where these beliefs are iteratively refined in light of
observed data by appealing to probabilistic identities.
As with any induction, we must start somewhere. Here we begin with a so-called prior distribution (or simply prior) p(ϕ | x), which encodes what we consider to be plausible values for ϕ before observing any data.7 The prior distribution allows us to inject our knowledge about and experience with the system of interest into the inferential process, saving us from having to begin "from scratch" or entertain patently absurd possibilities. The left panel of Figure 1.1 illustrates a prior distribution for our example, indicating support over a range of values.

7 Here we assume the location of interest x is known, hence our conditioning the prior on its value.
Once a prior has been established, the next stage of inference is
to refine our initial beliefs in light of observed data. Suppose in our
Figure 1.1: Bayesian inference for an unknown function value ϕ = f(x). Left: a prior distribution over ϕ; middle: the likelihood of the marked observation y according to an additive Gaussian noise observation model (1.4) (prior shown for reference); right: the posterior distribution in light of the observation and the prior (prior and likelihood shown for reference).
p(ϕ | x, y) = p(ϕ | x) p(y | x, ϕ) / p(y | x). (1.5)

The posterior is proportional to the prior weighted by the likelihood of the observed value. The denominator is a constant with respect to ϕ that ensures normalization:

p(y | x) = ∫ p(y | x, ϕ) p(ϕ | x) dϕ. (1.6)
The right panel of Figure 1.1 shows the posterior resulting from the
measurement in the middle panel. The posterior represents a compromise
between our experience (encoded in the prior) and the information
contained in the data (encoded in the likelihood).
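As a concrete instance of (1.5)–(1.6) (a sketch added here, not part of the original text): if the prior and observation model are both Gaussian, p(ϕ | x) = N(ϕ; μ, σ²) and p(y | x, ϕ) = N(y; ϕ, σₙ²) as in (1.4), the update is available in closed form:

p(ϕ | x, y) = N(ϕ; μ′, σ′²);   μ′ = μ + [σ²/(σ² + σₙ²)](y − μ);   σ′² = σ² − σ⁴/(σ² + σₙ²).

The posterior mean moves from the prior mean toward the observation in proportion to the relative precision of prior and noise, which is exactly the compromise described above.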
Throughout this book we will use the catchall notation D to represent all the information influencing a posterior belief; here the relevant information is D = (x, y), and the posterior distribution is then p(ϕ | D).
As mentioned previously, Bayesian inference is an inductive process
whereby we can continue to refine our beliefs through additional ob-
servation. At this point, the induction is trivial: to incorporate a new
observation, what was our posterior serves as the prior in the context of the new information, and multiplying by the likelihood and renormalizing yields a new posterior. We may continue in this manner as desired.
The posterior distribution is not usually the end result of Bayesian inference but rather a springboard enabling follow-on tasks such as prediction or decision making, both of which are integral to Bayesian optimization. To address the former, suppose that after deriving the posterior (1.5), we wish to predict the result of an independent, repeated noisy observation at x, y′. Treating the outcome as a random variable, we may derive its distribution by integrating our posterior belief about ϕ against the observation model (1.2):8

p(y′ | x, D) = ∫ p(y′ | x, ϕ) p(ϕ | x, D) dϕ; (1.7)

this is known as the posterior predictive distribution for y′. By integrating over all possible values of ϕ weighted by their plausibility, the posterior predictive distribution naturally accounts for uncertainty in the unknown objective function value; see the figure in the margin. (Margin figure: posterior predictive distribution for a repeated measurement at x for our running example. The location of our first measurement y and the posterior distribution of ϕ are shown for reference. There is more uncertainty in y′ than ϕ due to the effect of observation noise.)

8 This expression takes the same form as (1.6), which is simply the (prior) predictive distribution evaluated at the actual observed value.

The Bayesian approach to decision making also relies on a posterior belief about unknown features affecting the outcomes of our decisions, as we will discuss shortly.
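For the additive Gaussian noise model (1.4) this integral has a closed form; a brief sketch (using the Gaussian posterior N(ϕ; μ′, σ′²) from the conjugate-update example above):

p(y′ | x, D) = ∫ N(y′; ϕ, σₙ²) N(ϕ; μ′, σ′²) dϕ = N(y′; μ′, σ′² + σₙ²).

The predictive variance is the posterior variance inflated by the noise variance, which is why y′ is more uncertain than ϕ in the margin figure.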
Figure 1.2: An example prior process for an objective defined on an interval. We illustrate the marginal belief at every
location with its mean and a 95% credible interval and also show three example functions sampled from the
prior process.
We summarize the marginal belief of the model, for each point in the
domain showing the prior mean and a 95% credible interval for the cor-
responding function value. We also show three functions sampled from
the prior process, each exhibiting the assumed behavior. We encourage
the reader to become comfortable with this plotting convention, as we
will use it throughout this book. In particular we eschew axis labels, as
they are always the same: the horizontal axis represents the domain X
and the vertical axis the function value. Further, we do not mark units on
axes to stress relative rather than absolute behavior, as scale is arbitrary
in this illustration.
We can encode a vast array of information into the prior process
and can model significantly more complex structure than in this sim-
ple example. We will explore the world of possibilities in Chapter 3,
including interaction at different scales, nonstationarity, low intrinsic dimensionality, and more (nonstationarity, warping: § 3.4, p. 56; low intrinsic dimensionality: § 3.5, p. 61).
With the prior process in hand, suppose we now make a set of
observations at some locations x, revealing corresponding values y; we
aggregate this information into a dataset D = (x, y). Bayesian inference accounts for these observations by forming the posterior process p(f | D).
The derivation of the posterior process can be understood as a two-stage process. First we consider the impact of the data on the corresponding function values ϕ alone (1.5):

p(ϕ | D) ∝ p(ϕ | x) p(y | x, ϕ). (1.9)

The quantities on the right-hand side are known: the first term is given by the prior process (1.8), and the second by the observation model (1.3), which serves the role of a likelihood. We now extend the posterior on ϕ to all of f:9

p(f | D) = ∫ p(f | x, ϕ) p(ϕ | D) dϕ. (1.10)

9 The given expression sweeps some details under the rug. A careful derivation of the posterior process proceeds by finding the posterior of an arbitrary finite-dimensional vector ϕ* = f(x*):

p(ϕ* | x*, D) = ∫ p(ϕ* | x*, x, ϕ) p(ϕ | D) dϕ,
Figure 1.3: The posterior process for our example scenario in Figure 1.2 conditioned on three exact observations.
Figure 1.4: A prototypical acquisition function corresponding to our example posterior from Figure 1.3.
(Figure 1.5 panels: iterations 1–19 of the policy, with selected iterations annotated "exploitation" or "exploration"; legend: objective function.)
Figure 1.5: The posterior after the indicated number of steps of an example Bayesian optimization policy, starting from
the posterior in Figure 1.4. The marks show the points chosen by the policy, progressing from top to bottom.
Observations sufficiently close to the optimum are marked in bold; the optimum was located on iteration 7.
In this chapter we will introduce Gaussian processes (gps), a convenient class of nonparametric4 regression models widely used in Bayesian optimization. We will begin by defining Gaussian processes and deriving some basic properties, then demonstrate how to perform inference from observations. In the case of exact observation and additive Gaussian noise, we can perform this inference exactly, resulting in an updated posterior Gaussian process. We will continue by considering some theoretical properties of Gaussian processes relevant to optimization and inference with non-Gaussian observation models.

4 The term "nonparametric" is something of a misnomer. A nonparametric objective function model has parameters, but their dimension is infinite – we effectively parameterize the objective by its value at every point.
p(f) = GP(f; μ, K)

Let us pause to consider the implications of this choice. First, note that var[ϕ | x] = K(x, x) = 1 at every point x ∈ X, and thus the covariance function (2.4) also measures the correlation between the function values ϕ and ϕ′. This correlation decreases with the distance between x and x′, falling from unity to zero as these points become increasingly separated; see the illustration in the margin. We can loosely interpret this as a statistical consequence of continuity: function values at nearby locations are highly correlated, whereas function values at distant locations are effectively independent. This assumption also implies that observing the function at some point x provides nontrivial information about the function at sufficiently nearby locations (roughly when |x − x′| < 3). We will explore this implication further shortly. (Margin figure: the squared exponential covariance (2.4) as a function of the distance between inputs, falling from 1 at distance 0 toward 0 by distance 3.)
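A small sketch of this correlation decay (the squared exponential form with unit output and length scales is assumed here for (2.4)):

import numpy as np

def k_se(x1, x2, length_scale=1.0):
    """Unit-amplitude squared exponential covariance (assumed form of (2.4))."""
    d = np.subtract.outer(x1, x2)
    return np.exp(-0.5 * (d / length_scale) ** 2)

# correlation between f(x) and f(x') as a function of |x - x'|
for dist in [0.0, 1.0, 2.0, 3.0]:
    print(dist, k_se(np.array([0.0]), np.array([dist]))[0, 0])
# beyond roughly three length scales the correlation is near zero,
# matching the "roughly |x - x'| < 3" remark in the text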
Figure 2.1: Our example Gaussian process on the domain X = [0, 30]. We illustrate the marginal belief at every location
with its mean and a 95% credible interval and also show three example functions sampled from the process.
For a Gaussian process, the marginal distribution of any single function value is univariate normal (2.2):

p(ϕ | x) = N(ϕ; μ, σ²);   μ = μ(x);   σ² = K(x, x).
Sampling

We may gain more insight by inspecting samples drawn from our example process reflecting the joint distribution of function values. Although it is impossible to represent an arbitrary function on X in finite memory, we can approximate the sampling process by taking a dense grid x ⊂ X and sampling the corresponding function values from their joint multivariate normal distribution (2.2) (sampling from a multivariate normal distribution: § a.2, p. 299). Plotting the sampled vectors against the chosen grid reveals curves approximating draws from the Gaussian process. Figure 2.1 illustrates this procedure for our example using a grid of 1000 equally spaced points. Each sample is smooth and has several local optima distributed throughout the domain – for some applications, this might be a reasonable model for an objective function on X.
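A minimal sketch of this sampling procedure (a zero mean and a unit squared exponential covariance are assumed here in place of the book's exact example model):

import numpy as np

k_se = lambda a, b: np.exp(-0.5 * np.subtract.outer(a, b) ** 2)

grid = np.linspace(0, 30, 1000)                      # dense grid on X = [0, 30]
mean = np.zeros_like(grid)
cov = k_se(grid, grid) + 1e-10 * np.eye(grid.size)   # small jitter for stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=3) # three approximate sample paths
# each row of `samples`, plotted against `grid`, approximates a draw from the GP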
Figure 2.2: The posterior for our example scenario in Figure 2.1 conditioned on three exact observations.
p(f, y) = GP([f, y]; [μ, m], [[K, κᵀ], [κ, C]]). (2.6)

This notation, analogous to (a.12), extends the Gaussian process on f to include the entries of y; that is, we assume the distribution of any finite subset of function and/or observed values is multivariate normal. We specify the joint distribution via the marginal distribution of y:12

p(y) = N(y; m, C), (2.7)

and the cross-covariance function between y and f:

κ(x) = cov[y, ϕ | x]. (2.8)

12 We assume C is positive definite; if it were only positive semidefinite, there would be wasteful linear dependence among observations.
Although it may seem absurd that we could identify and observe a vector satisfying such strong restrictions on its distribution, we can already deduce several examples from first principles, including:
• any vector of function values (2.2) (inference from exact observations: § 2.2, p. 22),
• any affine transformation of function values (a.10) (affine transformations: § a.2, p. 298), and
• limits of such quantities, such as partial derivatives or expectations (derivatives and expectations: § 2.6, p. 30).
Further, we may condition on any of the above even if corrupted by independent additive Gaussian noise, as we will shortly demonstrate.
We may condition the joint distribution (2.6) on y analogously to the finite-dimensional case (a.14) (conditioning a multivariate normal distribution: § a.2, p. 299), resulting in a Gaussian process posterior on f. Writing D = y for the observed data, we have:

p(f | D) = GP(f; μ_D, K_D), (2.9)

where

μ_D(x) = μ(x) + κ(x)ᵀ C⁻¹ (y − m);
K_D(x, x′) = K(x, x′) − κ(x)ᵀ C⁻¹ κ(x′). (2.10)

This can be verified by computing the joint distribution of an arbitrary finite set of function values and y and conditioning on the latter (a.14).13

13 This is a useful exercise! The result will be a stochastic process with multivariate normal finite-dimensional distributions, a Gaussian process by definition (2.5).

The above result provides a simple procedure for gp posterior inference from any vector of observations satisfying (2.6):
1. compute the marginal distribution of y (2.7),
2. derive the cross-covariance function κ (2.8), and
3. find the posterior distribution of f via (2.9–2.10).
We will realize this procedure for several special cases below. However, we will first demonstrate how we may seamlessly handle measurements corrupted by additive Gaussian noise and build intuition for the posterior distribution by dissecting its moments in terms of the statistics of the observations and the correlation structure of the prior.
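A minimal sketch of this three-step recipe, specialized (as an assumption) to observations corrupted by independent additive Gaussian noise, so that m = μ(x_obs), C = K(x_obs, x_obs) + σₙ²I, and κ(x) = K(x_obs, x):

import numpy as np

def k_se(a, b):
    return np.exp(-0.5 * np.subtract.outer(a, b) ** 2)

def gp_posterior(x_obs, y_obs, x_query, mu=lambda x: np.zeros_like(x),
                 k=k_se, noise=0.1):
    m = mu(x_obs)
    C = k(x_obs, x_obs) + noise**2 * np.eye(x_obs.size)  # step 1: p(y) = N(m, C)
    kappa = k(x_obs, x_query)                             # step 2: cross-covariance
    w = np.linalg.solve(C, kappa)                         # C^{-1} kappa
    mu_post = mu(x_query) + w.T @ (y_obs - m)             # step 3: posterior mean (2.10)
    K_post = k(x_query, x_query) - kappa.T @ w            #         posterior covariance (2.10)
    return mu_post, K_post

# usage on a toy dataset
x_obs = np.array([2.0, 10.0, 20.0])
y_obs = np.array([0.5, -0.3, 1.0])
mu_post, K_post = gp_posterior(x_obs, y_obs, np.linspace(0, 30, 200))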
where S is diagonal with S_ii = √(C_ii) = std[y_i] and P = corr[y] is the observation correlation matrix. We may then rewrite the posterior mean of ϕ as

μ + σ ρᵀ P⁻¹ z,

where z and ρ represent the vectors of measurement z-scores and the cross-correlations between ϕ and y, respectively:

z_i = (y_i − m_i) / s_i;   ρ_i = [κ(x)]_i / (s_i σ).

The posterior mean is now in the same form as the scalar case (2.12), with the introduction of the observation correlation matrix moderating the z-scores to account for dependence between the observed values.15 The posterior standard deviation of ϕ in the vector-valued case is

σ √(1 − ρᵀ P⁻¹ ρ),

15 It can be instructive to contrast the behavior of the posterior when conditioning on two highly correlated values versus two independent ones. In the former case, the posterior does not change much as a result of the second measurement, as dependence reduces the effective number of measurements.
p(φ | 𝐱) = N(φ; μ, Σ),

p(f | D) = GP(f; μ_D, K_D),

where

μ_D(x) = μ(x) + K(x, 𝐱)Σ⁻¹(φ − μ);
K_D(x, x′) = K(x, x′) − K(x, 𝐱)Σ⁻¹K(𝐱, x′). (2.14)
Our previous Figure 2.2 illustrates the posterior resulting from conditioning our gp prior in Figure 2.1 on three exact measurements, with high-level analysis of its behavior in the accompanying text.
N = σₙ² I, (2.16)

and independent heteroskedastic noise with scale depending on location according to a function σₙ: X → ℝ≥0:

N = diag σₙ(𝐱)². (2.17)

For a given observation location x, we will simply write σₙ for the associated noise scale, leaving any dependence on x implicit.
Figure 2.3: Posteriors for our example gp from Figure 2.1 conditioned on 15 noisy observations with independent ho-
moskedastic noise (2.16). The signal-to-noise ratio is 10 for the top example, 3 for the middle example, and 1 for
the bottom example.
Figure 2.3 shows a sequence of posterior distributions resulting from
conditioning our example gp on data corrupted by increasing levels of
homoskedastic noise (2.16). As the noise level increases, the observations
have diminishing influence on our belief, with some extreme values
eventually being partially explained away as outliers. As measurements
are assumed to be inexact, the posterior mean is not compelled to inter-
polate perfectly through the observations, as in the exact case (Figure
Figure 2.4: The posterior distribution for our example gp from Figure 2.1 conditioned on 50 observations with heteroskedas-
tic observation noise (2.17). We show predictive credible intervals for both the latent objective function and
noisy observations; the standard deviation of the observation noise increases linearly from left-to-right.
2.2). Further, with increasing levels of noise, our posterior belief reflects
significant residual uncertainty in the function, even in regions with
multiple nearby observations.
We illustrate an example of Gaussian process inference with heteroskedastic noise (2.17) in Figure 2.4, where the signal-to-noise ratio
decreases smoothly from left-to-right over the domain. Although the
observations provide relatively even coverage, our posterior uncertainty
is minimal on the left-hand side of the domain β where the measure-
ments provide maximal information β and increases as our observations
become more noisy and less informative.
We will often require the posterior predictive distribution for a noisy measurement y that would result from observing at a given location x. The posterior distribution on f (2.19) provides the posterior predictive distribution for the latent function value ϕ = f(x) (2.5), but does not account for the effect of observation noise. In the case of independent additive Gaussian noise (2.16–2.17), deriving the posterior predictive distribution is trivial: we simply add the noise variance to the predictive variance of ϕ (a.15). Writing n for the covariance function of the observation noise, n(x, x′) = cov[ε, ε′ | x, x′], the posterior of the observation process is then a gp with
Definition
To elaborate, consider a set of functions {fᵢ: Xᵢ → ℝ} we wish to model.21 We define the disjoint union of these functions ⊎ᵢ fᵢ, defined on the disjoint union22 of their domains X = ⊎ᵢ Xᵢ, by insisting its restriction to each domain be compatible with the corresponding function. We will call this construction a joint Gaussian process on {fᵢ}.

It is often convenient to decompose the moments of a joint gp into their restrictions on relevant subspaces. For example, consider a joint gp (2.22) on f: F → ℝ and g: G → ℝ. After defining

μ_f ≡ μ|_F;   μ_g ≡ μ|_G;
K_f ≡ K|_{F×F};   K_g ≡ K|_{G×G};   K_fg ≡ K|_{F×G};   K_gf ≡ K|_{G×F},

we can see that f and g in fact have marginal gp distributions:23

p(f) = GP(f; μ_f, K_f);   p(g) = GP(g; μ_g, K_g). (2.23)

When convenient we will notate a joint gp in terms of these decomposed functions, here writing:24

p(f, g) = GP([f, g]; [μ_f, μ_g], [[K_f, K_fg], [K_gf, K_g]]). (2.25)

23 In fact, any restriction of a gp-distributed function has a gp (or multivariate normal) distribution.
24 We also used this notation in (2.6), where the "domain" of the vector y can be taken to be some finite index set of appropriate size.
Example
We can demonstrate the behavior of a joint Gaussian process by extending our running example gp on f: [0, 30] → ℝ. Recall the prior on f has
Figure 2.5: A joint Gaussian process over two functions on the shared domain X = [0, 30]. The marginal belief over both
functions is the same as our example gp from Figure 2.1, but the cross-covariance (2.26) between the functions
strongly couples their behavior. We also show a sample from the joint distribution illustrating the strong
correlation induced by the joint prior.
2.5 continuity
In this and the following sections we will establish some important
properties of Gaussian processes determined by the properties of their
Figure 2.6: The joint posterior for our example joint gp prior in Figure 2.5 conditioned on five exact observations of each
function.
2.6 differentiability
We can approach the question of differentiability by again reasoning
about the limiting behavior of linear transformations of function values.
Suppose f: X → ℝ with X ⊂ ℝⁿ has distribution GP(f; μ, K), and consider the ith partial derivative of f at x, if it exists:

∂f/∂xᵢ (x) = lim_{h→0} [f(x + h eᵢ) − f(x)] / h,

where eᵢ is the ith standard basis vector. For h > 0, the value in the limit is Gaussian distributed as a linear transformation of Gaussian-distributed random variables (a.9). Assuming the corresponding partial derivative of the mean exists at x and the corresponding partial derivative with respect to each input of the covariance function exists at x = x′, then as h → 0 the partial derivative converges in distribution to a Gaussian (sequences of normal rvs: § a.2, p. 300):

p(∂f/∂xᵢ (x) | x) = N(∂f/∂xᵢ (x); ∂μ/∂xᵢ (x), ∂²K/∂xᵢ∂xᵢ′ (x, x)).

If this property holds for each coordinate 1 ≤ i ≤ n, then f is said to be differentiable in mean square at x.

If f is differentiable in mean square everywhere in the domain, the process itself is called differentiable in mean square, and we have the remarkable result that the function and its gradient have a joint Gaussian process distribution:

p(f, ∇f) = GP([f, ∇f]; [μ, ∇μ], [[K, K∇ᵀ], [∇K, ∇K∇ᵀ]]). (2.28)
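A small 1D sketch of working with this joint distribution (a squared exponential kernel with unit length scale and toy data are assumed): the cross-covariances in (2.28) follow from differentiating the kernel, and the posterior over the derivative given exact function observations is obtained just as in (2.9–2.10):

import numpy as np

ell = 1.0
k    = lambda a, b: np.exp(-0.5 * np.subtract.outer(a, b)**2 / ell**2)
k_df = lambda a, b: k(a, b) * (-np.subtract.outer(a, b)) / ell**2   # cov[f'(a), f(b)]
k_dd = lambda a, b: k(a, b) * (1/ell**2 - np.subtract.outer(a, b)**2 / ell**4)

# posterior over the derivative at query points given exact function values
x_obs = np.array([1.0, 2.5, 4.0]); y_obs = np.sin(x_obs)     # toy data
x_q = np.linspace(0.0, 5.0, 101)
C = k(x_obs, x_obs) + 1e-10 * np.eye(x_obs.size)              # jitter for stability
cross = k_df(x_q, x_obs)                                      # cov[f'(x_q), f(x_obs)]
deriv_mean = cross @ np.linalg.solve(C, y_obs)                # roughly cos(x) near the data
deriv_cov  = k_dd(x_q, x_q) - cross @ np.linalg.solve(C, cross.T)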
Figure 2.7: The joint posterior of the function and its derivative ∂f/∂x for our example Gaussian process from Figure 2.2. The dashed line in the lower plot corresponds to a derivative of zero.
and K∇ᵀ: X × X → (ℝⁿ)* maps pairs of points to row vectors (the transpose of the covariance between f(x) and ∇f(x′)):

K∇ᵀ(x, x′) = ∇K(x′, x)ᵀ.

Finally, the function ∇K∇ᵀ: X × X → ℝⁿˣⁿ represents the result of applying both operations, mapping a pair of points to the covariance matrix between the entries of the corresponding gradients:

[∇K∇ᵀ(x, x′)]ᵢⱼ = cov[∂f/∂xᵢ (x), ∂f/∂xⱼ (x′) | x, x′] = ∂²K/∂xᵢ∂xⱼ′ (x, x′).
Figure 2.8: The joint posterior of the derivative ∂f/∂x of our example Gaussian process after adding a new observation nearby another suggesting a large positive slope. The dashed line in the lower plot corresponds to a derivative of zero.

Figure 2.9: The joint posterior of the derivative ∂f/∂x of our example Gaussian process after adding an exact observation of the derivative at the indicated location. The dashed line in the lower plot corresponds to a derivative of zero.
g = ∫ f(x) p(x) dx,

then (under mild conditions) we again have a joint Gaussian process distribution over f and g.33 This enables both inference about g and conditioning on noisy observations of integrals, such as a Monte Carlo estimate of an expectation. The former is the basis for Bayesian quadrature, an analog of Bayesian optimization bringing Bayesian experimental design to bear on numerical integration.32, 34, 35

33 This can be shown, for example, by considering the limiting distribution of Riemann sums.
34 a. o'hagan (1991). Bayes–Hermite Quadrature. Journal of Statistical Planning and Inference 29(3):245–260.
35 c. e. rasmussen and z. ghahramani (2002). Bayesian Monte Carlo. neurips 2002.
As f is unknown, these quantities are random variables. Many Bayesian optimization algorithms operate by reasoning about the distributions of (and uncertainties in) these quantities induced by our belief on f (mutual information and entropy search: § 7.6, p. 135).

There are two technical issues we must address. The first is whether we can be certain that a globally optimal value f* exists when the objective function is random. If existence is not guaranteed, then its distribution is meaningless. The second issue is one of uniqueness: assuming the objective does attain a maximal value, can we be certain the optimum is unique? In general x* is a set-valued random variable, and thus its distribution might have support over arbitrary subsets of the domain, rendering it complicated to reason about. However, if we could ensure the uniqueness of x*, its distribution would have support on X rather than its power set, allowing more straightforward inference.

Both the existence of f* and uniqueness of x* are tacitly assumed throughout the Bayesian optimization literature when building algorithms based on distributions of these quantities, but these properties are not guaranteed for arbitrary Gaussian processes. However, we can ensure these properties hold almost surely under mild conditions.
Counterexamples

Although the above conditions for ensuring existence of f* and uniqueness of x* are fairly mild, it is easy to construct counterexamples.

Consider a function on the closed unit interval, which we note is compact: f: [0, 1] → ℝ. We endow f with a "white noise"41 Gaussian process with

μ(x) ≡ 0;   K(x, x′) = [x = x′].

Now f almost surely does not have a maximum. Roughly, because the value of f at every point in the domain is independent of every other, there will almost always be a point with value exceeding any putative maximum.42 However, the conditions of sample path continuity were violated as the covariance is discontinuous at x = x′.

We may also construct a Gaussian process that almost surely achieves a maximum that is not unique. Consider a random function f defined on the (compact) interval [0, 4π] defined by the parametric model

41 It turns out this naïve model of white noise has horrible mathematical properties, but it is sufficient for this counterexample.
42 Let Q = ℚ ∩ [0, 1] = {qᵢ} be the rationals in the domain and let f* be a putative maximum. Defining ϕᵢ = f(qᵢ), we must have ϕᵢ ≤ f* for every i; call this event A. Define the event Aₙ by f* exceeding the first n elements of Q. From independence,

Pr(Aₙ) = ∏ᵢ₌₁ⁿ Pr(ϕᵢ ≤ f*) = Φ(f*)ⁿ,
Figure 2.10: Regression with observations corrupted with heavy-tailed noise. The triangular marks indicate observations
lying beyond the plotted range. Shown is the posterior distribution of an objective function (along with
ground truth) modeling the errors as Gaussian. The posterior is heavily affected by the outliers.
One obvious limitation is an incompatibility with naturally non-Gaussian observations. A scenario particularly relevant to optimization is heavy-tailed noise. Consider the data shown in Figure 2.10, where some observations represent extreme outliers. These errors are poorly modeled as Gaussian, and attempting to infer the underlying objective function with the additive Gaussian noise model leads to overfitting and poor predictive performance. A Student-t error model with ν ≈ 4 degrees of freedom provides a robust alternative:43

p(y | x, ϕ) = T(y; ϕ, σₙ², ν). (2.31)

The heavier tails of this model can better explain large outliers; unfortunately, the non-Gaussian nature of this model also renders exact inference impossible. We will demonstrate how to overcome this impasse. (Margin figure: a Student-t error model (solid) with a Gaussian error model (dashed) for reference; the heavier tails of the Student-t model can better explain large outliers.)

43 k. l. lange et al. (1989). Robust Statistical Modeling Using the t Distribution. Journal of the American Statistical Association 84(408):881–896.
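A tiny numerical illustration (the values are made up for the example) of why the heavier tails matter: the log likelihood each error model assigns to an observation ten noise scales from the latent value:

from scipy import stats

phi, scale, nu = 0.0, 0.1, 4       # latent value, noise scale, degrees of freedom
outlier = 1.0                      # observation ten noise scales away

print(stats.norm.logpdf(outlier, loc=phi, scale=scale))     # roughly -48.6
print(stats.t.logpdf(outlier, df=nu, loc=phi, scale=scale)) # roughly -6.8
# the Student-t model assigns the outlier vastly higher likelihood, so the
# posterior need not contort the objective to explain it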
Constraints on an objective function, such as bounds on given function values, can also provide valuable information during optimization, but many natural constraints cannot be reduced to observations that can be handled in closed form. Several Bayesian optimization policies impose hypothetical constraints on the objective function when designing each observation, requiring inference from intractable constraints even when the observations themselves pose no difficulties.
To see how constraints might arise in optimization, consider a Gaussian process belief on a one-dimensional objective f, and suppose we wish to condition on f having a local maximum at a given location x (differentiability, derivative observations: § 2.6, p. 30). Assuming the function is twice differentiable, we can invoke the second-derivative test to encode this information in two constraints:

f′(x) = 0;   f″(x) < 0.
Figure 2.11: The probability density function of an example distribution along with 50 samples drawn independently from the distribution. In Monte Carlo approaches, the distribution is effectively approximated by a mixture of Dirac delta distributions at the sample locations.
The functions {tᵢ} are called factors or local functions that may comprise a likelihood augmented by any desired (hard or soft) constraints. The term "local functions" arises because each factor often depends only on a low-dimensional subspace of y, often a single entry.45

The posterior on y (2.33) in turn induces a posterior on f:

p(f | D) = ∫ p(f | y) p(y | D) dy. (2.34)

45 For example, when observations are conditionally independent given the corresponding function values, the likelihood factorizes into a product of one-dimensional factors (1.3): p(y | x, ϕ) = ∏ᵢ p(yᵢ | xᵢ, ϕᵢ).
Figure 2.12: Regression with observations corrupted with heavy-tailed noise. The triangular marks indicate observations lying beyond the plotted range. Shown is the posterior distribution of an objective function (along with ground truth) modeling the errors as Student-t distributed with ν = 4 degrees of freedom. The posterior was approximated from 100 000 Monte Carlo samples. Comparing with the additive Gaussian noise model from Figure 2.10, this model effectively ignores the outliers and the fit is excellent.
p(f | D) ≈ (1/s) Σᵢ₌₁ˢ p(f | yᵢ) = (1/s) Σᵢ₌₁ˢ GP(f; μ_Dᵢ, K_Dᵢ). (2.35)

p(ϕ | x, D) ≈ (1/s) Σᵢ₌₁ˢ N(ϕ; μᵢ, σᵢ²);   μᵢ = μ_Dᵢ(x);   σᵢ² = K_Dᵢ(x, x). (2.36)

Although slightly more complex than the Gaussian marginals of a Gaussian process, this is often convenient enough for most needs.
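A minimal sketch of evaluating the mixture marginal (2.36) at a single query point, given per-sample posterior means and variances (hypothetical values below): if a single Gaussian summary is wanted, moment matching via the law of total variance gives its mean and variance.

import numpy as np

def mixture_moments(mu_i, sigma2_i):
    """Mean and variance of the equally weighted Gaussian mixture (2.36)."""
    mean = np.mean(mu_i)
    # law of total variance: average within-sample variance plus
    # the variance of the per-sample means
    var = np.mean(sigma2_i) + np.var(mu_i)
    return mean, var

mu_i = np.array([0.2, 0.3, 0.1, 0.25])          # hypothetical per-sample means
sigma2_i = np.array([0.05, 0.04, 0.06, 0.05])   # hypothetical per-sample variances
print(mixture_moments(mu_i, sigma2_i))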
A Monte Carlo approximation to the posterior for the heavy-tailed dataset from Figure 2.10 is shown in Figure 2.12. The observations were modeled as corrupted by Student-t errors with ν = 4 degrees of freedom. The posterior was approximated using a truly excessive number of samples (100 000, with a burn-in of 10 000) from the y posterior drawn using elliptical slice sampling.47 The outliers in the data are ignored and the predictive performance is excellent.
We are free to design this approximation as we see fit. There are several general-purpose approaches available, distinguished by how they approach maximizing the fidelity of fitting the true posterior (2.33). These include the Laplace approximation (§ b.1, p. 301), Gaussian expectation propagation (§ b.2, p. 302), and variational Bayesian inference. The first two of these methods are covered in Appendix b, and nickisch and rasmussen provide an extensive survey of these and other approaches in the context of Gaussian process binary classification.48

Regardless of the details of the approximation scheme, the high-level result is the same – the normal approximation (2.37) in turn induces an approximate Gaussian process posterior on f. To demonstrate this, we consider the posterior on f that would arise from a direct observation of y (2.9–2.10) and integrate against the approximate posterior (2.37):

48 h. nickisch and c. e. rasmussen (2008). Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research 9(Oct):2035–2078.
p(f | D) ≈ ∫ p(f | y) q(y | D) dy = GP(f; μ_D, K_D), (2.38)

where

μ_D(x) = μ(x) + κ(x)ᵀ C⁻¹ (m̃ − m);
K_D(x, x′) = K(x, x′) − κ(x)ᵀ C⁻¹ (C − C̃) C⁻¹ κ(x′). (2.39)

C̃ = C − C (C + N)⁻¹ C, (2.40)
To demonstrate the power of approximate inference, we return to our motivating scenario of conditioning a one-dimensional process on having a local maximum at an identified point x, which we can achieve by
Figure 2.13: Approximately conditioning a Gaussian process to have a local maximum at the marked point x. We show each stage of the conditioning process with a sample drawn from the corresponding posterior. We begin with the unconstrained process (left), which we condition on the first derivative being zero at x using exact inference (middle). Finally we use Gaussian expectation propagation to approximately condition on the second derivative being negative at x.
p(f″) = N(f″; μ, σ²).

p(f″ | D) ≈ q(f″ | D) = N(f″; μ̃, σ̃²).
Figure 2.14: A demonstration of Gaussian expectation propagation. On the left we have a Gaussian belief on the second derivative, p(f″). We wish to constrain this value to be negative, introducing a step-function factor encoding the constraint, [f″ < 0]. The resulting distribution is non-Gaussian (right), but we can approximate it with a Gaussian, which induces an updated gp posterior on the function approximately incorporating the constraint.
Going beyond this example, we may use the approach outlined above to realize a general framework for Bayesian nonlinear regression by combining a gp prior on a latent function with an observation model appropriate for the task at hand, then approximating the posterior as desired. The convenience and modeling flexibility offered by Gaussian processes can easily justify any extra effort required for approximating the posterior. This can be seen as a nonlinear extension of the well-known family of generalized linear models.49

This approach is quite popular and has been realized countless times. Notable examples include binary classification using a logistic or probit observation model,50 modeling point processes as a nonhomogeneous Poisson process with unknown intensity,51, 52 and robust regression with heavy-tailed additive noise such as Laplace53 or Student-t54, 55 distributed errors. With regard to the latter and our previous heavy-tailed noise example, a Laplace approximation to the posterior for the data in Figures 2.10–2.12 with the Student-t observation model produces an approximate posterior in excellent agreement with the Monte Carlo approximation in Figure 2.12; see Figure 2.15. The cost of approximate inference in this case was dramatically (several orders of magnitude) cheaper than Monte Carlo sampling.

50 h. nickisch and c. e. rasmussen (2008). Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research 9(Oct):2035–2078.
51 j. møller et al. (1998). Log Gaussian Cox Processes. Scandinavian Journal of Statistics 25(3):451–482.
52 r. p. adams et al. (2009). Tractable Nonparametric Bayesian Inference in Poisson Processes with Gaussian Process Intensities. icml 2009.
53 m. kuss (2006). Gaussian Process Models for Robust Regression, Classification, and Reinforcement Learning. Ph.D. thesis. Technische Universität Darmstadt. [§ 5.4]
54 r. m. neal (1997). Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification. Technical report (9702). Department of Statistics, University of Toronto.
55 p. jylänki et al. (2011). Robust Gaussian Process Regression with a Student-t Likelihood. Journal of Machine Learning Research 12(99):3227–3257.
2.9 summary of major ideas

Gaussian processes have been studied – in one form or another – for over 100 years.56 Although we have covered a lot of ground in this chapter, we have only scratched the surface of an expansive body of literature. A good entry point to that literature is rasmussen and williams's monograph, which focuses on machine learning applications of Gaussian processes but also covers their theoretical underpinnings and properties in depth.57 A good companion to this work is the book of adler and taylor, which takes a deep dive into the properties and geometry of sample paths, including statistical properties of their maxima.58

56 diaconis identified an early application of gps by poincaré for nonlinear regression:
p. diaconis (1988). Bayesian Numerical Analysis. In: Statistical Decision Theory and Related Topics iv.
h. poincaré (1912). Calcul des probabilités. Gauthier–Villars.
57 c. e. rasmussen and c. k. i. williams (2006). Gaussian Processes for Machine Learning. mit Press.
58 r. j. adler and j. e. taylor (2007). Random Fields and Geometry. Springer–Verlag.
This joint distribution allows us to condition a Gaussian process on (potentially noisy) derivative observations (§ 2.6, p. 32).
• The existence and uniqueness of global maxima for Gaussian process sample paths can be guaranteed under mild assumptions on the mean and covariance functions (§ 2.7, p. 33). Establishing these properties ensures that the location x* and value f* of the global maximum are well-founded random variables, which will be critical for some optimization methods introduced later in the book.59
• Inference from non-Gaussian observations and constraints is possible via Monte Carlo sampling or Gaussian approximate inference (§ 2.8, p. 35).

59 In particular, policies grounded in information theory under the umbrella of "entropy search." See § 7.6, p. 135 for more.
3 modeling with gaussian processes

The development of some system of a priori distributions suitable for different classes of the function f is probably the most important problem in the application of [the] Bayesian approach to . . . global optimization.

1 j. mockus (1974). On Bayesian Methods for Seeking the Extremum. Optimization Techniques: ifip Technical Conference.
Figure 3.1: The importance of the prior mean function in determining sample path behavior. The models in the first two
panels differ in their mean function but share the same covariance function. Sample path behavior is identical
up to translation. The model in the third panel features the same mean function as the first panel but a different
covariance function. Samples exhibit dramatically different behavior.
$\mu(x; c) \equiv c. \qquad (3.1)$
Placing a normal prior on the unknown constant, $p(c) = \mathcal{N}(c; a, b^2)$, and marginalizing yields the marginal prior process
$p(f) = \mathcal{GP}(f; m \equiv a, K),$
where the uncertainty in the mean has been absorbed into the prior covariance function. We may now use this prior directly, avoiding any estimation of c. The unknown mean will be automatically marginalized in both the prior and posterior process, and we may additionally derive the posterior belief over c given data if it is of interest.4

4 Noting that c and f form a joint Gaussian process, we may perform inference as described in § 2.4, p. 26 to reveal their joint posterior.
More generally, we may take the prior mean to be a linear combination of basis functions (3.4),
$\mu(x; \boldsymbol{\beta}) = \boldsymbol{\beta}^\top \boldsymbol{\psi}(x),$
where the vector-valued function $\boldsymbol{\psi}: \mathcal{X} \to \mathbb{R}^n$ defines the basis functions and $\boldsymbol{\beta}$ is a vector of weights.5

Now consider a parametric Gaussian process prior with a mean function of this form (3.4) and arbitrary covariance function K. Placing a multivariate normal prior on $\boldsymbol{\beta}$, $p(\boldsymbol{\beta}) = \mathcal{N}(\boldsymbol{\beta}; \mathbf{a}, \mathbf{B})$, and marginalizing yields the marginal prior,6, 7
$p(f) = \mathcal{GP}(f; m, C),$ where
$m(x) = \mathbf{a}^\top \boldsymbol{\psi}(x); \qquad C(x, x') = K(x, x') + \boldsymbol{\psi}(x)^\top \mathbf{B}\, \boldsymbol{\psi}(x'). \qquad (3.6)$

6 a. o'hagan (1978). Curve Fitting and Optimal Design for Prediction. Journal of the Royal Statistical Society Series B (Methodological) 40(1):1–42.
7 c. e. rasmussen and c. k. i. williams (2006). Gaussian Processes for Machine Learning. mit Press. [§ 2.7]
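As a concrete illustration, the following sketch computes the marginal prior moments in (3.6) for a finite set of inputs. The names psi, a, B, and K are hypothetical stand-ins to be supplied by the user; this is a minimal sketch rather than any particular library's interface.

import numpy as np

def marginal_prior_moments(X, psi, a, B, K):
    # Marginal prior moments (3.6) after integrating out the weights beta ~ N(a, B):
    # m(x) = a' psi(x);  C(x, x') = K(x, x') + psi(x)' B psi(x').
    Psi = np.stack([psi(x) for x in X])      # n x k matrix of basis function values
    m = Psi @ a                              # prior mean at the inputs
    C = K(X, X) + Psi @ B @ Psi.T            # prior covariance at the inputs
    return m, C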
The covariance function determines fundamental properties of sample path behavior, including continuity (§ 2.5, p. 28), differentiability (§ 2.6, p. 30), and aspects of the global optima (§ 2.7, p. 33), as we have already seen. Perhaps more so than the mean function, careful design of the covariance function is critical to ensure
fidelity in modeling. We will devote considerable discussion to this topic,
beginning with some important properties and moving on to useful
examples and mechanisms for systematically modifying and composing
multiple covariance functions together to model complex behavior.
After appropriate normalization, a covariance function K may be loosely interpreted as a measure of similarity between points in the domain. Namely, given $x, x' \in \mathcal{X}$, the correlation between the corresponding function values is
$$\rho = \mathrm{corr}[\phi, \phi' \mid x, x'] = \frac{K(x, x')}{\sqrt{K(x, x)\, K(x', x')}}, \qquad (3.9)$$
and we may interpret the strength of this dependence as a measure of similarity between the input locations. This intuition can be useful, but some caveats are in order. To begin, note that correlation may be negative, which might be interpreted as indicating dissimilarity, as the function values react to information with opposite sign.
Further, for a proposed covariance function K to be admissible, it must satisfy two global consistency properties ensuring that the collection of random variables comprising f are able to satisfy the purported relationships. First, we can immediately deduce from its definition (3.8) that K must be symmetric in its inputs. Second, the covariance function must be positive semidefinite; that is, given any finite set of points x ⊂ X, the Gram matrix K(x, x) must have only nonnegative eigenvalues.10

10 Symmetry guarantees the eigenvalues are real.
To illustrate how positive semidefiniteness ensures statistical validity, note that a direct consequence is that $K(x, x) = \mathrm{var}[\phi \mid x] \ge 0$, and thus marginal variance is always nonnegative. On a slightly less trivial level, consider a pair of points $\mathbf{x} = (x, x')$ and normalize the corresponding Gram matrix $\boldsymbol{\Sigma} = K(\mathbf{x}, \mathbf{x})$ to yield the correlation matrix:
$$\mathbf{P} = \mathrm{corr}[\boldsymbol{\phi} \mid \mathbf{x}] = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}. \qquad (3.10)$$
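These consistency requirements are easy to check numerically for any particular Gram matrix; a minimal sketch (assuming numpy) tests symmetry and nonnegativity of the eigenvalues:

import numpy as np

def is_valid_gram(K, tol=1e-10):
    # A candidate covariance is admissible only if every Gram matrix it produces is
    # symmetric and positive semidefinite (all eigenvalues nonnegative).
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

# example: the 2 x 2 correlation matrix in (3.10) is a valid Gram matrix only when |rho| <= 1
rho = 0.5
print(is_valid_gram(np.array([[1.0, rho], [rho, 1.0]])))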
The symmetry of the spectral measure implies a similar symmetry in the spectral density, which takes the same value at $\xi$ and $-\xi$ for all $\xi \in \mathbb{R}^d$.

bochner's theorem is surprisingly useful in practice, allowing us to approximate an arbitrary stationary covariance function by approximating (e.g., by modeling or sampling from) its spectral density. This is the basis of the spectral mixture covariance described in the next section, as well as the sparse spectrum approximation scheme (§ 8.7, p. 178), which facilitates the computation of some Bayesian optimization policies.
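To illustrate the idea, the following sketch samples frequencies from the (Gaussian) spectral density of the squared exponential covariance and builds random features whose inner products approximate the covariance. This is a minimal, assumption-laden stand-in in the spirit of the sparse spectrum approximation referenced above, not a transcription of any particular scheme.

import numpy as np

def spectral_features(X, n_features=100, ell=1.0, rng=None):
    # Approximate the squared exponential covariance with length scale ell via its
    # spectral density, a Gaussian with scale 1/ell; K(x, x') ~ phi(x) @ phi(x').
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / ell, size=(d, n_features))   # frequencies sampled from the spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)      # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)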
We will refer to the Matérn and the limiting case of the squared exponential covariance functions together as the Matérn family. The squared exponential covariance is without a doubt the most prevalent covariance function in the statistical and machine learning literature. However, it may not always be a good choice in practice. Sample paths from a centered Gaussian process with squared exponential covariance are infinitely differentiable, which has been ridiculed as an absurd assumption for most physical processes.17 stein does not mince words on this, starting off a three-sentence "summary of practical suggestions" with "use the Matérn model" and devoting significant effort to discouraging the use of the squared exponential in the context of geostatistics.

[Margin figure: samples from a centered Gaussian process with squared exponential covariance K_SE.]

17 m. l. stein (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer–Verlag. [§ 1.7]
Between these extremes are the cases ν = 3/2 and ν = 5/2, which respectively model once- and twice-differentiable functions:
$$K_{\mathrm{M}3/2}(x, x') = \big(1 + \sqrt{3}\,d\big) \exp\big(-\sqrt{3}\,d\big); \qquad (3.13)$$
$$K_{\mathrm{M}5/2}(x, x') = \big(1 + \sqrt{5}\,d + \tfrac{5}{3}\,d^2\big) \exp\big(-\sqrt{5}\,d\big). \qquad (3.14)$$
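For reference, a direct transcription of (3.13), (3.14), and the squared exponential limit as Python functions of the scaled distance d = |x − x'|/ℓ:

import numpy as np

def matern32(d):
    # (3.13): once-differentiable sample paths
    return (1.0 + np.sqrt(3.0) * d) * np.exp(-np.sqrt(3.0) * d)

def matern52(d):
    # (3.14): twice-differentiable sample paths
    return (1.0 + np.sqrt(5.0) * d + 5.0 * d**2 / 3.0) * np.exp(-np.sqrt(5.0) * d)

def squared_exponential(d):
    # limiting case: infinitely differentiable sample paths
    return np.exp(-0.5 * d**2)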
Figure 3.4 illustrates samples from centered Gaussian processes with different values of the smoothness parameter ν. The ν = 5/2 case in particular has been singled out as a prudent off-the-shelf choice for Bayesian optimization when no better alternative is obvious.18

18 j. snoek et al. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. neurips 2012.
In this context the parameter λ is known as an output scale, or when the base covariance is stationary with K(x, x) = 1, the signal variance, as it determines the variance of any function value: var[φ | x, λ] = λ². The illustration in the margin shows the effect of scaling the squared exponential covariance function by a series of increasing output scales (3.20).
We can also of course consider nonlinear transformations of the function output as well. This can be useful for modeling constraints – such as nonnegativity or boundedness – that are not compatible with the Gaussian assumption. However, a nonlinear transformation of a Gaussian process is no longer Gaussian, so it is often more convenient to model the transformed function after "removing the constraint."
We may use the general form of this scaling result (3.19) to transform
a stationary covariance into a nonstationary one, as any nonconstant
scaling is sufficient to break translation invariance. We show an example
of such a transformation in Figure 3.5, where we have scaled a stationary
covariance by a bump function to create a prior on smooth functions
with compact support.
Taking this one step further, we may consider dilating each axis by a separate factor:
$$d_{\mathbf{A}} = |\mathbf{A}x - \mathbf{A}x'|.$$
Nonlinear warping

When using a covariance function with an inherent length scale, such as a Matérn or squared exponential covariance, some linear transformation of the domain is almost always considered, whether it be simple dilation (3.22), anisotropic scaling (3.25), or a general transformation (3.26). However, nonlinear transformations can also be useful for imposing structure on the domain, a process commonly referred to as warping.

To provide an example that may not often be useful in optimization but is illustrative nonetheless, suppose we wish to model a function f: ℝ → ℝ that we believe to be smooth and periodic with period p. None of the covariance functions introduced thus far would be able to induce the periodic correlations that this assumption would entail.23 A construction due to mackay is to compose a map onto a circle of radius r = p/(2π):24
$$x \mapsto \begin{bmatrix} r \cos x \\ r \sin x \end{bmatrix} \qquad (3.27)$$
with a covariance function on that space reflecting any desired properties of f.25 As this map identifies points separated by any multiple of the period, the corresponding function values are perfectly correlated, as desired. A sample from a Gaussian process employing this construction with a Matérn covariance after warping is shown in the margin.

23 We did see a periodic gp in the previous chapter (2.30); however, that model only had support on perfectly sinusoidal functions.
24 d. j. c. mackay (1998). Introduction to Gaussian Processes. In: Neural Networks and Machine Learning. [§ 5.2]
25 The covariance on the circle is usually inherited from a covariance on ℝ². The result of composing with the squared exponential covariance in particular is often called "the" periodic covariance, but we stress that any other covariance on ℝ² could be used instead.
A compelling use of warping is to build nonstationary models by composing a nonlinear map with a stationary covariance, an idea snoek et al. explored in the context of Bayesian optimization.26 Many objective functions exhibit different behavior depending on the proximity to the optimum, suggesting that nonstationary models may sometimes be worth exploring. snoek et al. proposed a flexible family of warping functions for optimization problems with box-bounded constraints, where we may take the domain to be the unit cube by scaling and translating as necessary: X = [0, 1]^d. The idea is to warp each coordinate of the input via the cumulative distribution function of a beta distribution:
$$x_i \mapsto I(x_i; \alpha_i, \beta_i), \qquad (3.28)$$
where (αᵢ, βᵢ) are shape parameters and I is the regularized beta function. This represents a monotonic bijection on the unit interval that can assume several shapes; see the marginal figure for examples. The map may contract portions of the domain and expand others, effectively decreasing and increasing the length scale in those regions. Finally, taking α = β = 1 recovers the identity map, allowing us to degrade gracefully to the unwarped case if desired.

[Margin figures: a sample path of a centered gp with Matérn covariance with ν = 5/2 (3.14) after applying the periodic warping function (3.27); some examples of beta cdf warping functions (3.28).]

26 j. snoek et al. (2014). Input Warping for Bayesian Optimization of Non-Stationary Functions. icml 2014.
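A minimal sketch of this warping, using the beta cdf from scipy as the regularized beta function; the shape parameters below are arbitrary illustrative choices rather than any values used in the book's figures.

import numpy as np
from scipy.stats import beta as beta_dist

def beta_warp(x, alpha, beta):
    # (3.28): warp each coordinate of x in the unit cube through the beta cdf;
    # alpha = beta = 1 recovers the identity map.
    return beta_dist.cdf(x, alpha, beta)

# example: with alpha = beta = 2 the warp expands the center of the domain,
# effectively shortening the length scale there and extending it near the boundary
x = np.linspace(0.0, 1.0, 5)
print(beta_warp(x, 2.0, 2.0))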
In Figure 3.7 we combine a beta warping on a one-dimensional do-
main with a stationary covariance on the output. The chosen warping
shortens the length scale near the center of the domain and extends it
near the boundary, which might be reasonable for an objective expected
to exhibit the most βinterestingβ behavior on the interior of its domain.
A recent innovation is to use sophisticated artificial neural networks as warping maps for modeling functions of high-dimensional data with complex structure. Notable examples of this approach include the families of manifold Gaussian processes introduced by calandra et al.27 and deep kernels introduced contemporaneously by wilson et al.28 Here the warping function was taken to be an arbitrary neural network, the output layer of which was fed into a suitable stationary covariance function. This gives a highly parameterized covariance function where the parameters of the base covariance and the neural map become parameters of the resulting model. In the context of Bayesian optimization, this can be especially useful when there is sufficient data to learn a useful representation of the domain via unsupervised methods (neural representation learning: § 8.11, p. 198).

27 r. calandra et al. (2016). Manifold Gaussian Processes for Regression. ijcnn 2016.
28 a. g. wilson et al. (2016). Deep Kernel Learning. aistats 2016.
Figure 3.8: Samples from centered Gaussian processes with different covariance functions: (left) a squared exponential
covariance, (middle) a squared exponential covariance with smaller output scale and shorter length scale, and
(right) the sum of the two. Samples from the process with the sum covariance show smooth variation on two
different scales.
and thus covariance functions are closed under addition and pointwise multiplication.29 Combining this result with (3.20), we have that any polynomial of covariance functions with nonnegative coefficients forms a valid covariance. This enables us to construct infinite families of increasingly complex covariance functions from simple components.

We may use a sum of covariance functions to model a function with independent additive contributions, such as random behavior on several length scales. Precisely such a construction is illustrated in Figure 3.8. If the covariance functions are nonnegative and have roughly the same scale, the effect of addition is roughly one of logical disjunction: the sum will assume nontrivial values whenever any one of its constituents does.

29 The assumption of the processes being centered is needed for the product result only; otherwise, there would be additional terms involving scaled versions of each individual covariance as in (3.19). The sum result does not depend on any assumptions regarding the mean functions.
Meanwhile, a product of covariance functions can loosely be interpreted in terms of logical conjunction, with function values having appreciable covariance only when every individual covariance function does. A prototypical example of this effect is a covariance function modeling functions that are "almost periodic," formed by the product of a bump-shaped isotropic covariance function such as a squared exponential (3.12) with a warped version modeling perfectly periodic functions (3.27). The former moderates the influence of the latter by driving the correlation between function values to zero for inputs that are sufficiently separated, regardless of their positions in the periodic cycle. We show a sample from such a covariance in the margin, where the length scale of the modulation term is three times the period.

[Margin figure: a sample from a centered Gaussian process with an "almost periodic" covariance function.]
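The following sketch illustrates both constructions: a sum of squared exponentials on two scales as in Figure 3.8, and an "almost periodic" product. The numerical scales are arbitrary choices, and the periodic factor here embeds the scaled angle 2πx/p so points one period apart coincide exactly – a small variant of the map in (3.27), stated as an assumption rather than the book's exact construction.

import numpy as np

def k_se(x, xp, ell=1.0):
    # squared exponential covariance (3.12) with length scale ell
    return np.exp(-0.5 * ((x - xp) / ell) ** 2)

def k_sum(x, xp):
    # sum of two squared exponentials: smooth variation on two length scales (Figure 3.8)
    return k_se(x, xp, ell=1.0) + 0.1**2 * k_se(x, xp, ell=0.1)

def k_almost_periodic(x, xp, period=1.0, ell_mod=3.0):
    # product of a bump-shaped modulation term and a covariance on the circle:
    # correlation decays for well-separated inputs regardless of phase
    theta, theta_p = 2.0 * np.pi * x / period, 2.0 * np.pi * xp / period
    circle_dist = np.hypot(np.cos(theta) - np.cos(theta_p), np.sin(theta) - np.sin(theta_p))
    return k_se(x, xp, ell=ell_mod) * k_se(circle_dist, 0.0, ell=1.0)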
3.5 modeling functions on high-dimensional domains
There is a consensus in the high-dimensional data analysis community that the only reason any methods work in very high dimensions is that, in fact, the data are not truly high dimensional.

33 e. levina and p. j. bickel (2004). Maximum Likelihood Estimation of Intrinsic Dimension. neurips 2004.
Neural embeddings

Given the success of deep learning in designing feature representations for complex, high-dimensional objects, neural embeddings – as used in the family of deep kernels35 – present a tantalizing option. Neural embeddings have shown some success in Bayesian optimization, where they can facilitate optimization over complex structured objects by providing a nice continuous latent space to work in (neural representation learning: § 8.11, p. 198).

snoek et al. demonstrated excellent performance on hyperparameter tuning tasks by interpreting the output layer of a deep neural network as a set of custom nonlinear basis functions for Bayesian linear regression, as in (3.6).36 An advantage of this particular construction is that Gaussian process inference and prediction is accelerated dramatically by adopting the linear covariance (3.6) – the cost of inference scales linearly with the number of observations, rather than cubically as in the general case (cost of Gaussian process inference: § 9.1, p. 201).

35 a. g. wilson et al. (2016). Deep Kernel Learning. aistats 2016.
36 j. snoek et al. (2015). Scalable Bayesian Optimization Using Deep Neural Networks. icml 2015.
Linear embeddings

Another line of attack is to search for a low-dimensional linear subspace of the domain encompassing the relevant variation in inputs and model the function after projection onto that space. For an objective f on a high-dimensional domain X ⊂ ℝᵈ, we consider models of the form
Several specific schemes have been proposed for building such decompositions. One convenient approach is to partition the coordinates of the input into disjoint groups and add a contribution defined on each subset.43, 44 Figure 3.10 shows an example, where a two-dimensional objective is the sum of independent axis-aligned components. We might use such a model when every feature of the input is likely to be relevant but only through interaction with a limited number of additional variables.

43 k. kandasamy et al. (2015). High Dimensional Bayesian Optimisation and Bandits via Additive Models. icml 2015.
44 j. r. gardner et al. (2017). Discovering and Exploiting Additive Structure for Bayesian Optimization. aistats 2017.
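A sketch of such an additive decomposition, with the coordinate groups and the base covariance as user-supplied assumptions:

import numpy as np

def additive_covariance(x, xp, groups, base):
    # A sum of base covariances, each depending only on one disjoint group of coordinates.
    return sum(base(x[g], xp[g]) for g in groups)

# example: a two-dimensional objective as the sum of independent axis-aligned components
se = lambda u, v: np.exp(-0.5 * np.sum((u - v) ** 2))
groups = [np.array([0]), np.array([1])]
x, xp = np.array([0.3, 0.7]), np.array([0.4, 0.1])
print(additive_covariance(x, xp, groups, se))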
4 model assessment, selection, and averaging
All models we consider below will be of this composite form (4.1), but the assessment framework we describe will accommodate arbitrary models. We have indexed the space by a vector θ, the entries of which jointly specify any necessary parameters from their joint range Θ. The entries of θ are known as hyperparameters of the model structure, as they parameterize the prior distribution for the observations, p(y | x, θ) (4.2).

In many cases we may be happy with a single suitably flexible model structure for the data, in which case we can proceed with the corresponding space (4.3) as the set of candidate models. We may also consider multiple model structures for the data by taking a discrete union of such spaces, an idea we will return to later in this chapter (§ 4.5, p. 78).
Example

Let us momentarily take a step back from abstraction and create an explicit model space for optimization on the interval X = [a, b].3 Suppose our initial beliefs are that the objective will exhibit stationary behavior with a constant trend near zero, and that our observations will be corrupted by additive noise with unknown signal-to-noise ratio.

For the observation model, we take homoskedastic additive Gaussian noise with unknown scale, a reasonable choice when there is no obvious alternative, and select the Matérn covariance function with ν = 5/2 (3.14) with unknown output scale λ (3.20) and unknown length scale ℓ (3.22). Following our discussion in the last chapter, we may eliminate one of the parameters above by marginalizing the unknown constant mean under its assumed prior (§ 3.1, p. 47),4 leaving us with the identically zero mean function and an additive contribution to the covariance function (3.3):
$$\mu(x) \equiv 0; \qquad K(x, x'; \lambda, \ell) = b^2 + \lambda^2 K_{\mathrm{M}5/2}(d/\ell). \qquad (4.5)$$

3 The interval can be arbitrary; our discussion will be purely qualitative.
4 We would ideally marginalize the other parameters as well, but it would not result in a Gaussian process, as we will discuss shortly.
Figure 4.1 shows samples from the joint prior over the objective function and the observed values y that would result from measurements at 15 locations x (4.2) for a range of these hyperparameters. Even this simple model space is quite flexible, offering degrees of freedom for the variation in the objective function and the precision of our measurements.
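For concreteness, the covariance (4.5) of this example model space can be transcribed directly as a function of the two remaining hyperparameters; the value of b below is an arbitrary placeholder for the assumed prior standard deviation of the marginalized constant mean.

import numpy as np

def k_example(x, xp, out_scale, ell, b=1.0):
    # Prior covariance (4.5): a constant contribution b**2 from the marginalized
    # constant mean plus a scaled Matern nu = 5/2 term with length scale ell.
    d = np.abs(x - xp) / ell
    return b**2 + out_scale**2 * (1.0 + np.sqrt(5.0) * d + 5.0 * d**2 / 3.0) * np.exp(-np.sqrt(5.0) * d)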
$$p(\theta) \propto 1 \qquad (4.6)$$
may be used, in which case the model prior may not be explicitly acknowledged at all. However, it can be helpful to express at least weakly informative prior beliefs – especially when working with small datasets – as it can offer gentle regularization away from patently absurd choices. This should be possible for most hyperparameters in practice. For example, when modeling a physical system, it would be unlikely that interaction length scales of say one nanometer and one kilometer would be equally plausible a priori; we might capture this intuition with a wide prior on the logarithm of the length scale.
Model posterior

Given a set of observations D = (x, y), we may appeal to Bayes' theorem to derive the posterior distribution over the candidate models:
$$p(\theta \mid \mathcal{D}) \propto p(\theta)\, p(\mathbf{y} \mid \mathbf{x}, \theta). \qquad (4.7)$$
For a Gaussian process model with additive Gaussian noise, the likelihood here is the Gaussian process log marginal likelihood of the data:
$$\log p(\mathbf{y} \mid \mathbf{x}, \theta) = -\tfrac{1}{2}\Big[(\mathbf{y} - \mathbf{m})^\top (\boldsymbol{\Sigma} + \mathbf{N})^{-1} (\mathbf{y} - \mathbf{m}) + \log |\boldsymbol{\Sigma} + \mathbf{N}| + n \log 2\pi\Big]. \qquad (4.8)$$
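A numerically stable evaluation of (4.8) via a Cholesky factorization might look as follows; this is a minimal sketch assuming m, Σ, and N have already been computed for the hyperparameters of interest.

import numpy as np

def log_marginal_likelihood(y, m, Sigma, N):
    # Gaussian log marginal likelihood (4.8); Sigma is the prior Gram matrix of the
    # observations and N the noise covariance, both functions of the hyperparameters.
    A = Sigma + N
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - m))
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    n = y.size
    return -0.5 * ((y - m) @ alpha + logdet + n * np.log(2.0 * np.pi))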
The first term of (4.8) is the sum of the squared Mahalanobis norms (a.8) of the observations under the prior and represents a measure of data fit. The second term serves as a complexity penalty: the volume of any confidence ellipsoid under the prior is proportional to |Σ + N|, and thus this term scales according to the volume of the model's support in observation space. The third term simply ensures normalization.

Return to example

Let us return to our example scenario and model space. We invite the reader to consider the hypothetical set of 15 observations in Figure 4.2 from our example system of interest and contemplate which models from our space of candidates in Figure 4.1 might be the most compatible with these observations.9

9 The dataset was realized using a moderate length scale (30 length scales spanning the domain) and a small amount of additive noise, shown below. But this is impossible to know from inspection of the data alone, and many alternative explanations are just as plausible according to the model posterior!
We illustrate the model posterior given this data in Figure 4.3, where,
in the interest of visualization, we have fixed the covariance output
scale to its true value and set the range of the axes to be compatible with the samples from Figure 4.1. The model prior was designed to be weakly informative regarding the expected order of magnitude of the hyperparameters by taking independent, wide Gaussian priors on the logarithm of the observation noise and covariance length scale.10

The first observation we can make regarding the model posterior is that it is remarkably broad, with many settings of the model hyperparameters remaining plausible after observing the data. However, the model posterior does express a preference for models with either low noise and short length scale or high noise combined with a range of compatible length scales. Figure 4.4 provides examples of objective function and observation posteriors corresponding to the hyperparameters indicated in Figure 4.3. Although each is equally plausible in the posterior,11 their explanations of the data are diverse.

10 Both parameters are nonnegative, so the prior has support on the entire parameter range.
11 The posterior probability density of these points is approximately 10% of the maximum.
When the model prior is flat (4.6), the map model corresponds to the maximum likelihood estimate (mle) of the model hyperparameters. Figure 4.5 shows the predictions made by the map model for our running example; in this case, the map hyperparameters are in fact a reasonable match to the parameters used to generate the example dataset.
When the model space is defined over a continuous space of hyperparameters, computation of the map model can be significantly accelerated via gradient-based optimization. Here it is advisable to work in the log domain, where the objective becomes the unnormalized log posterior:
$$\log p(\theta) + \log p(\mathbf{y} \mid \mathbf{x}, \theta). \qquad (4.10)$$
The log marginal likelihood is given in (4.8), noting that m, Σ, and N are all implicitly functions of the hyperparameters θ. This objective (4.10) is differentiable with respect to θ (§ c.1, p. 307) assuming the Gaussian process prior moments, the noise covariance, and the model prior are as well, in which case we may appeal to off-the-shelf gradient methods for solving (4.9). However, a word of warning is in order: the model posterior is not guaranteed to be concave and may have multiple local maxima, so multistart optimization is prudent.
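A minimal sketch of such a multistart search using an off-the-shelf optimizer; neg_log_posterior (the negative of (4.10)) and the hyperparameter bounds are user-supplied assumptions.

import numpy as np
from scipy.optimize import minimize

def map_hyperparameters(neg_log_posterior, bounds, n_restarts=10, rng=None):
    # Multistart gradient-based search for the map hyperparameters (4.9); the model
    # posterior may be multimodal, so we restart from several random initializations.
    rng = np.random.default_rng(rng)
    best = None
    for _ in range(n_restarts):
        theta0 = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        result = minimize(neg_log_posterior, theta0, bounds=bounds)
        if best is None or result.fun < best.fun:
            best = result
    return best.x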
Figure 4.6: A Monte Carlo estimate of the model-marginal predictive distribution (4.11) for our example scenario using
100 samples drawn from the model posterior in Figure 4.3 (4.14–4.15); see illustration in margin. Samples
from the objective function posterior display a variety of behavior due to being associated with different
hyperparameters.
some special cases,14 so we must resort to approximation if we wish to pursue this approach. In fact, maximum a posteriori estimation can be interpreted as one rather crude approximation scheme where the model posterior is replaced by a Dirac delta distribution at the map hyperparameters:
$$p(\theta \mid \mathcal{D}) \approx \delta(\theta - \hat{\theta}).$$
This can be defensible when the dataset is large compared to the number of hyperparameters, in which case the model posterior is often unimodal with little residual uncertainty. However, large datasets are the exception rather than the rule in Bayesian optimization, and more sophisticated approximations can pay off when model uncertainty is significant.

14 A notable example is marginalizing the coefficients of a linear prior mean against a Gaussian prior: § 3.1, p. 47.
Given samples {θᵢ} drawn from the model posterior, a Monte Carlo approximation is:
$$p(f \mid \mathcal{D}) \approx \frac{1}{s} \sum_{i=1}^{s} \mathcal{GP}\big(f; \mu_{\mathcal{D}}(\theta_i), K_{\mathcal{D}}(\theta_i)\big); \qquad (4.14)$$
$$p(y \mid x, \mathcal{D}) \approx \frac{1}{s} \sum_{i=1}^{s} \int p(y \mid x, \phi, \theta_i)\, p(\phi \mid x, \mathcal{D}, \theta_i)\, \mathrm{d}\phi. \qquad (4.15)$$
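A sketch of forming the corresponding mixture predictive distribution for a scalar observation, assuming a user-supplied predict(x, theta) returning the Gaussian predictive mean and standard deviation under each sampled model:

import numpy as np

def mc_predictive(x, theta_samples, predict):
    # Monte Carlo model averaging (4.14-4.15): an equally weighted mixture of the
    # per-sample Gaussian predictions; moments follow from the law of total variance.
    moments = np.array([predict(x, theta) for theta in theta_samples])
    mus, sigmas = moments[:, 0], moments[:, 1]
    mean = mus.mean()
    var = (sigmas**2 + mus**2).mean() - mean**2
    return mean, np.sqrt(var)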
Figure 4.7: An approximation to the model-marginal posterior (4.11) using the central composite design approach proposed
by rue et al. A total of nine hyperparameter samples are used for the approximation, illustrated in the margin
below.
Hamiltonian Monte Carlo (hmc) methods such as the no u-turn sampler (nuts) would be a reasonable choice when the gradient of the log posterior (4.10) is available, as they can exploit this information to accelerate mixing.15

Figure 4.6 demonstrates a Monte Carlo approximation to the model-marginal posterior (4.11–4.12) for our running example. Comparing with the map approximation in Figure 4.5, the predictive uncertainty of both objective function values and observations has increased considerably due to accounting for model uncertainty in the predictive distributions.

Unfortunately this integral remains intractable due to the nonlinear dependence of the posterior moments on the hyperparameters, but reducing to this common form allows us to derive deterministic approximations against a single assumed posterior. rue et al. introduced several approximation schemes representing different tradeoffs between efficiency and fidelity.16 Notable among these is a simple, sample-efficient procedure grounded in classical experimental design. Here a central composite design17 in hyperparameter

15 m. d. hoffman and a. gelman (2014). The No-U-turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15(4):1593–1623.
16 h. rue et al. (2009). Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations. Journal of the Royal Statistical Society Series B (Methodological) 71(2):319–392.
17 g. e. p. box and k. b. wilson (1951). On the Experimental Attainment of Optimum Conditions. Journal of the Royal Statistical Society Series B (Methodological) 13(1):1–45.
Figure 4.8: The approximation to the model-marginal posterior (4.11) for our running example using the approach proposed
by osborne et al.
We build a model space comprising several model structures by augmenting our previous space (initial model structure: p. 69) with structures incorporating additional covariance functions. The treatment of the prior mean (unknown constant marginalized against a Gaussian prior) and observation model (additive Gaussian noise with unknown scale) will remain the same for all. The model structures reflect a variety of hypotheses positing potential linear or quadratic behavior:

m5: the Matérn ν = 5/2 covariance (3.14) from our previous example;
lin: the linear covariance (3.16), where the prior on the slope is vague and centered at zero and the prior on the intercept agrees with the m5 model;
lin × lin: the product of two linear covariances designed as above, modeling a latent quadratic function with unknown coefficients;
m5 + lin: the sum of a Matérn ν = 5/2 and linear covariance designed as in the corresponding individual model structures; and
m5 × lin: the product of a Matérn ν = 5/2 and linear covariance designed as in the corresponding individual model structures.
Objective function samples from models in each of these structures are
shown in Figure 4.10. Among these, the model structure closest to the
truth is arguably m5 + lin.
Following the above discussion, we find the map hyperparameters for each of these model structures separately and use a Laplace approximation (4.16) to approximate the hyperparameter posterior on each space, along with the normalizing constant (4.22). Normalizing over the structures provides an approximate model structure posterior:

Pr(m5 | D) ≈ 10.8%;
Pr(m5 + lin | D) ≈ 71.8%;
Pr(m5 × lin | D) ≈ 17.0%,

with the remaining model structures (lin and lin × lin) sharing the remaining 0.4%.
Figure 4.11: An approximation to the model-marginal posterior (4.23) for our multiple-model example. The posterior on
each model structure is approximated separately as a mixture of Gaussian processes following rue et al. (see
Figure 4.7); these are then combined by weighting by an approximation of the model structure posterior (4.21).
We show the result with three superimposed, transparent credible intervals, which are shaded with respect to
their weight in contributing to the final approximation.
The m5 + lin model structure is the clear winner, and there is strong evidence that the purely polynomial models are insufficient for explaining the data alone.
Figure 4.11 illustrates an approximation to the model-marginal posterior (4.23), obtained by applying rue et al.'s central composite design approach to each of the model structures, then combining these into a Gaussian process mixture by weighting by the approximate model structure posterior. The highly asymmetric credible intervals reflect the diversity in explanations for the data offered by the chosen model structures, and the combined model makes reasonable predictions of our example objective function sampled from the true model.
For this example, averaging over the model structure has important implications regarding the behavior of the resulting optimization policy (averaging over a space of Gaussian processes in policy computation: § 8.10, p. 192). Figure 4.12 illustrates a common acquisition function25 built from the off-the-shelf m5 model, as well as from the structure-marginal model. The former chooses to exploit near what it believes is a local optimum, but the latter has a strong belief in an underlying linear trend and chooses to explore the right-hand side of the domain instead. For our example objective function sample, this would in fact reveal the global optimum with the next observation.

25 To be specific, expected improvement: § 7.3, p. 127.
Figure 4.12: Optimization policies built from the map m5 model (left) and the structure-marginal posterior (right). The m5
model chooses to exploit near the local optimum, but the structure-marginal model is aware of the underlying
linear trend and chooses to explore the right-hand side as a result.
$$K \to B; \qquad K \to K + K; \qquad K \to K\,K; \qquad K \to (K).$$
The symbol B in the first rule represents any desired base kernel. The five model structures considered in our multiple-structure example above, for instance, can all be generated by this grammar.
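To make the search space concrete, here is a simplified sketch of generating neighboring structures, restricted to growing the incumbent by adding or multiplying in a base kernel (a restriction of the addition and multiplication rules above); structures are represented as plain strings purely for illustration.

def neighbors(structure, base_kernels):
    # One step of structure search: combine the current structure with a base kernel
    # by addition or multiplication.
    for b in base_kernels:
        yield f"({structure} + {b})"
        yield f"({structure} * {b})"

# example: candidate structures one step away from the Matern nu = 5/2 structure
print(list(neighbors("M5", ["M5", "LIN"])))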
End-to-end automation

Follow-on work demonstrated a completely automated Bayesian optimization system built on this structure search procedure, avoiding any manual modeling at all.29 The key idea was to dynamically maintain a set of plausible model structures throughout optimization. Predictions are made via model averaging over this set, offering robustness to model misspecification when computing the outer optimization policy. Every time a new observation is obtained, the set of model structures is then updated via a continual Bayesian optimization in model space given the new data. This interleaving of Bayesian optimization in data space and model space offered promising performance.

29 g. malkomes and r. garnett (2018). Automating Bayesian Optimization with Bayesian Optimization. neurips 2018.
Finally, gardner et al. offered an alternative to optimization over model structures by constructing a Markov chain Monte Carlo routine to sample model structures from their posterior (4.21).30 The proposed sampler was a realization of the Metropolis–Hastings algorithm with a custom proposal distribution making minor modifications to the incumbent structure. In the case of the additive decompositions considered in that work, this step consisted of applying random atomic operations such as merging or splitting components of the existing decomposition. Despite the absolutely enormous number of possible additive decompositions, this mcmc routine was able to quickly locate promising structures, and averaging over the sampled structures for prediction resulted in superior optimization performance as well.

30 j. r. gardner et al. (2017). Discovering and Exploiting Additive Structure for Bayesian Optimization. aistats 2017.
5 decision theory for optimization
Isolated decisions

A decision problem under uncertainty has two defining characteristics. The first is the action space A, the set of all available decisions. Our task is to select an action from this space. For example, in sequential optimization, an optimization policy decision must select a point in the domain X for observation, and so we have A = X.

The second critical feature is the presence of uncertain elements of the world influencing the outcomes of our actions, complicating our decision. Let ψ represent a random variable encompassing any relevant uncertain elements when making and evaluating a decision. Although we may lack perfect knowledge, Bayesian inference allows us to reason about ψ in light of data via the posterior distribution p(ψ | D), and we will use this belief to inform our decision.

Suppose now we must select a decision from an action space A under uncertainty in ψ, informed by a set of observed data D. To guide our choice, we select a real-valued utility function u(a, ψ, D). This function measures the quality of selecting the action a if the true state of the world were revealed to be ψ, with higher utilities indicating more favorable outcomes. The arguments to a utility function comprise everything required to judge the quality of a decision in hindsight: the proposed action a, what we know (the data D), and what we don't know (the uncertain elements ψ).4

4 Typical presentations of Bayesian decision theory omit the data from the utility function, but including it offers more generality, and this allowance will be important when we turn our attention to optimization policies.
We cannot know the exact utility that would result from selecting any given action a priori, due to our incomplete knowledge of ψ. We can, however, compute the expected utility that would result from selecting an action a, according to our posterior belief:
$$\mathbb{E}\big[u(a, \psi, \mathcal{D}) \mid a, \mathcal{D}\big] = \int u(a, \psi, \mathcal{D})\, p(\psi \mid \mathcal{D})\, \mathrm{d}\psi. \qquad (5.2)$$
This expected utility maps each available action to a real value, inducing a total order and providing a straightforward mechanism for making our decision. We pick an action maximizing the expected utility:
$$a \in \operatorname*{arg\,max}_{a' \in \mathcal{A}} \mathbb{E}\big[u(a', \psi, \mathcal{D}) \mid a', \mathcal{D}\big]. \qquad (5.3)$$
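A sketch of this procedure for a finite action space, using posterior samples of ψ to estimate the expectation (5.2) and maximizing as in (5.3); utility, psi_samples, and data are user-supplied stand-ins rather than anything defined in the text.

import numpy as np

def optimal_action(actions, utility, psi_samples, data):
    # Bayesian optimal decision making: estimate each action's expected utility by
    # averaging over posterior samples of the unknown psi, then maximize.
    expected = [np.mean([utility(a, psi, data) for psi in psi_samples]) for a in actions]
    return actions[int(np.argmax(expected))]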
This decision is optimal in the sense that no other action results in greater expected utility. (By definition!) This procedure for acting optimally under uncertainty – computing expected utility with respect to relevant unknown variables and maximizing to select an action – is the central tenet of Bayesian decision making.5

Example: recommending a point for use after optimization

With this abstract decision-making framework established, let us analyze an example decision that might be faced in the context of optimization. Consider a scenario where the purpose of optimization is to identify a single point x ∈ X for perpetual use in a production system, preferring

5 One may question whether this framework is complete in some sense: is it possible to make rational decisions in some other manner? The von Neumann–Morgenstern theorem shows that the answer is, surprisingly, no. Assuming a certain set of rationality axioms, any rational preferences over uncertain outcomes can be captured by the expectation of some utility function. Thus every rational decision maximizes an expected utility: j. von neumann and o. morgenstern (1944). Theory of Games and Economic Behavior. Princeton University Press. [appendix a]
$$u(x, \psi) = f(x) = \phi,$$
which rewards points for achieving high values of the objective function. Now if our optimization procedure returned a dataset D, the expected utility from recommending a point x is simply the posterior mean of the corresponding function value:
$$\mathbb{E}\big[u(x, \psi) \mid x, \mathcal{D}\big] = \mathbb{E}[\phi \mid x, \mathcal{D}] = \mu_{\mathcal{D}}(x). \qquad (5.4)$$
Therefore, an optimal recommendation maximizes the posterior mean:
$$x \in \operatorname*{arg\,max}_{x' \in \mathcal{X}} \mu_{\mathcal{D}}(x').$$

[Margin figure: optimal terminal recommendation. Above: posterior belief about an objective function given the data returned by an optimizer, p(f | D). Below: the expected utility for our example, the posterior mean μ_D(x). The optimal recommendation maximizes the expected utility.]
To reason about uncertainty in the objective function, we follow the path laid out in the preceding chapters (Bayesian inference of the objective function: § 1.2, p. 8) and maintain a probabilistic belief throughout optimization, p(f | D). We make no assumptions regarding the nature of this distribution, and in particular it need not be a Gaussian process. Equipped with this belief, we may reason about the result of making an observation at some point x via the posterior predictive distribution p(y | x, D) (1.7), which will play a key role below.
The ultimate purpose of optimization is to collect and return a dataset D. Before we can reason about what data we should acquire, we must first clarify what data we would like to acquire. Following the previous section, we will accomplish this by defining a utility function u(D) to evaluate the quality of data returned by an optimizer. This utility function will serve to establish preferences over optimization outcomes: all other things being equal, we would prefer to return a dataset with higher utility than any dataset with lower utility. As before, we will use this utility to guide the design of policies, by making observations that, in expectation, promise the biggest improvement in utility. We will define and motivate several utility functions used for optimization in the next chapter (Chapter 6: Utility Functions for Optimization, p. 109), and some readers may wish to jump ahead to that discussion for explicit examples before continuing. In the following, we will develop the general theory in terms of an arbitrary utility function.
In this section we will consider the construction of optimization policies assuming that we have a fixed and known budget on the number of observations we will make. This scenario is both common in practice and convenient for analysis, as we can for now ignore the question of when to terminate optimization. Note that this assumption effectively implies that every observation has a constant acquisition cost, which may not always be reasonable. We will address variable observation costs and the question of when to stop optimization later in this chapter (cost-aware optimization: § 5.4, p. 103).

Assuming a fixed observation budget allows us to reason about optimization policies in terms of the number of observations remaining to termination, which will always be known. The problem we will consider in this section then becomes the following: provided an arbitrary set of data, how should we design our next evaluation location when exactly τ observations remain before termination? In sequential decision making, this value τ is known as the decision horizon, as it indicates how far we must look ahead into the future when reasoning about the present.
To facilitate our discussion, we pause to define notation for future data that will be encountered during optimization relative to the present. When considering an observation at some point x, we will call the value resulting from an observation there y. We will then call the dataset available at the next stage of optimization D₁ = D ∪ {(x, y)}, where the subscript indicates the number of future observations incorporated into the current data. We will write (x₂, y₂) for the following observation, which when acquired will form D₂, etc. Our final observation τ steps in the future will then be (x_τ, y_τ), and the dataset returned by our optimization procedure will be D_τ, with utility u(D_τ).

This utility of the data we return is our ultimate concern and will serve as the utility function used to design every observation. Note we may write this utility in the same form we introduced in our general discussion:
$$u(\mathcal{D}_\tau) = u\big(\underbrace{\mathcal{D}}_{\text{known}},\ \underbrace{x}_{\text{action}},\ \underbrace{y, x_2, y_2, \dots, x_\tau, y_\tau}_{\text{unknown}}\big),$$
Explicitly writing out the expectation over the future data in (5.5) yields the following expression:
$$\int \!\cdots\! \int u(\mathcal{D}_\tau)\, p(y \mid x, \mathcal{D}) \prod_{i=2}^{\tau} p(x_i, y_i \mid \mathcal{D}_{i-1})\ \mathrm{d}y\ \mathrm{d}(x_2, y_2) \cdots \mathrm{d}(x_\tau, y_\tau). \qquad (5.7)$$
for this quantity, which is simply the expected terminal utility (5.5) shifted by the utility of our existing data, u(D). It is no coincidence this notation echoes our notation for acquisition functions (defining a policy by maximizing an acquisition function: § 5, p. 88)! We will characterize the optimal optimization policy by a family of acquisition functions defined in this manner.
Here we have defined the symbol α*_τ(D) to represent the expected increase in utility when starting with D and continuing optimally for τ additional observations. This is called the value of the dataset with a horizon of τ and will serve a central role below. We have now shown how to compute the value of any dataset with a horizon of τ = 1 (5.10) and how to identify a corresponding optimal action (5.9). This completes the base case of our argument.
We illustrate the optimal optimization policy with one observation remaining in Figure 5.1. In this scenario the belief over the objective function p(f | D) is a Gaussian process, and for simplicity we assume our observations reveal exact values of the objective. We consider an intuitive utility function: the maximal objective value contained in the data, u(D) = max f(x).9 The marginal gain in utility offered by a putative final observation (x, y) is then a piecewise linear function of the observed value:
$$u(\mathcal{D}_1) - u(\mathcal{D}) = \max\big(y - u(\mathcal{D}),\, 0\big); \qquad (5.11)$$
that is, the utility increases linearly if we exceed the previously best-seen value and otherwise remains constant. To design the optimal final observation, we compute the expectation of this quantity over the domain and choose the point maximizing it, as shown in the top panels. We also illustrate the computation of this expectation for the optimal choice and

9 This is a special case of the simple reward utility function, which we discuss further in the next chapter (§ 6.1, p. 109). The corresponding expected marginal gain is the well-known expected improvement acquisition function (§ 7.3, p. 127).
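When the predictive distribution of y is Gaussian – as in this Gaussian process example with exact observations, an assumption rather than a requirement of the general theory – the expectation of (5.11) has the familiar closed form of expected improvement, sketched below.

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best):
    # Expectation of max(y - u(D), 0) when y ~ N(mu, sigma^2); "best" is u(D).
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)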
Figure 5.3: The posterior of the objective function given two possible observations (y = y' and y = y'') resulting from the optimal two-step observation x illustrated on the facing page. The relatively low value y' offers no immediate reward, but reveals a new local optimum and the expected future reward from the optimal final decision x₂ is high. The relatively high value y'' offers a large immediate reward and respectable prospects from the optimal final decision as well.
which yields
$$\alpha_2(x; \mathcal{D}) = \alpha_1(x; \mathcal{D}) + \mathbb{E}\big[\alpha_1(x_2; \mathcal{D}_1) \mid x, \mathcal{D}\big].$$
That is, the expected increase in utility after two observations can be decomposed as the expected increase after our first observation x – the expected immediate gain – plus the expected additional increase from the final observation x₂ – the expected future gain.

It is still not clear how to address the second term in this expression. However, from our analysis of the base case, we can reason as follows. Given y (and thus knowledge of D₁), the optimal final decision x₂ (5.9) results in an expected marginal gain of α₁*(D₁), a quantity we know how to compute (5.10). Therefore, assuming optimal future behavior, we have:
$$\alpha_2(x; \mathcal{D}) = \alpha_1(x; \mathcal{D}) + \mathbb{E}\big[\alpha_1^*(\mathcal{D}_1) \mid x, \mathcal{D}\big], \qquad (5.13)$$
We pause to note that the value of any dataset with null horizon is α₀*(D) = 0, and thus the expressions in (5.15–5.17) are valid for any horizon and compactly express the proposed policy. Further, we have actually shown that this policy is optimal in the sense of maximizing expected terminal utility over the space of all policies, at least with respect to our model of the objective function and observations. This follows from our induction: the base case is established in (5.9), and the inductive case by the sequential maximization in (5.16).11

11 Since ties in (5.16) may be broken arbitrarily, this argument does not rule out the possibility of there being multiple, equally good optimal policies.
Bellman optimality and the Bellman equation

Substituting (5.15) into (5.17), we may derive the following recursive definition of the value in terms of the value of future data:
$$\alpha_\tau^*(\mathcal{D}) = \max_{x' \in \mathcal{X}} \Big\{ \alpha_1(x'; \mathcal{D}) + \mathbb{E}\big[\alpha_{\tau-1}^*(\mathcal{D}_1) \mid x', \mathcal{D}\big] \Big\}. \qquad (5.18)$$
This is known as the Bellman equation and is a central result in the theory of optimal sequential decisions.12 The treatment of future decisions in this equation – recursively assuming that we will always act to maximize expected terminal utility given the available data – reflects bellman's principle of optimality, which characterizes optimal sequential decision policies in terms of the optimality of subpolicies:13

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

That is, to make a sequence of optimal decisions, we make the first decision optimally, then make all following decisions optimally given the outcome!

12 r. bellman (1952). On the Theory of Dynamic Programming. Proceedings of the National Academy of Sciences 38(8):716–719.
13 r. bellman (1957). Dynamic Programming. Princeton University Press.
Figure 5.4: The optimal optimization policy as a decision tree. Squares indicate decisions (the choice of each observation), and circles represent expectations with respect to random variables (the outcomes of observations). Only one possible optimization path is shown; dangling edges lead to different futures, and all possibilities are always considered. We maximize the expected terminal utility u(D_τ), recursively assuming optimal future behavior.
$$x \in \operatorname*{arg\,max} \alpha_\tau;$$
$$\begin{aligned}
\alpha_\tau &= \alpha_1 + \mathbb{E}[\alpha_{\tau-1}^*] \\
&= \alpha_1 + \mathbb{E}[\max \alpha_{\tau-1}] \\
&= \alpha_1 + \mathbb{E}\big[\max\{\alpha_1 + \mathbb{E}[\alpha_{\tau-2}^*]\}\big] \\
&= \alpha_1 + \mathbb{E}\Big[\max\Big\{\alpha_1 + \mathbb{E}\big[\max\{\alpha_1 + \mathbb{E}[\max\{\alpha_1 + \cdots
\end{aligned}$$
Limited lookahead

One widespread and surprisingly effective approximation is to simply limit how many future observations we consider in each decision. This is practical as decisions closer to termination require substantially less computation than earlier decisions.

With this in mind, we can construct a natural family of approximations to the optimal policy defined by artificially limiting the horizon used throughout optimization to some computationally feasible maximum ℓ. When faced with an infeasible decision horizon τ, we make the crude approximation α_τ ≈ α_min{ℓ,τ}, and by maximizing this score, we act optimally under the incorrect but convenient assumption that only ℓ observations remain. This effectively assumes u(D_τ) ≈ u(D_ℓ).15 This may be reasonable if we expect decreasing marginal gains, implying a significant fraction of potential gains can be attained within the truncated horizon. This scheme is often described (sometimes disparagingly) as myopic, as we limit our sight to only the next few observations rather than looking ahead to the full horizon.

15 Equivalently, we approximate the true future value α*_{τ−1} with α*_{ℓ−1}.
A policy that designs each observation to maximize the limited-horizon acquisition function α_min{ℓ,τ} is called an ℓ-step lookahead policy.16 This is also called a rolling horizon strategy, as the fixed horizon "rolls along" with us as we go. By limiting the horizon, we bound the computational effort required for each decision to at most O(nℓ qℓ) time with the implementation described above. This can be a considerable speedup when the observation budget is much greater than the selected lookahead. A lookahead policy is illustrated as a decision tree in Figure 5.5. Comparing to the optimal policy in Figure 5.4, we simply "cut off" and ignore any portion of the tree lying deeper than ℓ steps in the future.

16 We take the minimum to ensure we don't look beyond the true horizon, which would be nonsense.
Figure 5.6: A decision tree representing a rollout policy. Comparing to the optimal policy in Figure 5.4, we simulate future decisions starting with x₂ using an efficient but suboptimal heuristic policy, rather than the intractable optimal policy. We maximize the expected terminal utility u(D_τ), assuming potentially suboptimal future behavior.
Rollout

The optimal policy evaluates a potential observation location by simulating the entire remainder of optimization following that choice, recursively assuming we will use the optimal policy for every future decision. Although sensible, this is clearly intractable. Rollout is an approach to approximate policy design that emulates the structure of the optimal policy, but using a tractable suboptimal policy to simulate future decisions.

A rollout policy is illustrated as a decision tree in Figure 5.6. Given a putative next observation (x, y), we use an inexpensive so-called base or heuristic policy to simulate a plausible – but perhaps suboptimal – realization of the following decision x₂. Note there is no branching in the tree corresponding to this decision, as it does not depend on the exhaustively enumerated subtree required by the optimal policy. We then take an expectation with respect to the unknown value y₂ as usual. Given a putative value of y₂, we use the base policy to select x₃ and continue in this manner until reaching the decision horizon. We use the terminal utilities in the resulting pruned tree to estimate the expected marginal gain α_τ, which we maximize as a function of x.
There are no constraints on the design of the base policy used in rollout; however, for this approximation to be sensible, we must choose something relatively efficient. One common and often effective choice is to simulate future decisions with one-step lookahead. If we again use off-the-shelf optimization and quadrature routines to traverse the rollout decision tree in Figure 5.6 with this particular choice, the running time of the policy with a horizon of τ is O(τ²nq), significantly faster
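To make the structure of rollout concrete, the following sketch estimates the expected marginal gain of a candidate observation by simulating the remaining decisions with a user-supplied base policy. For simplicity it replaces the quadrature over unknown observed values described above with Monte Carlo sampling; sample_y, base_policy, and utility are hypothetical stand-ins.

import numpy as np

def rollout_gain(x, data, sample_y, base_policy, utility, horizon, n_samples=32, rng=None):
    # Estimate the expected increase in utility from observing at x, simulating the
    # remaining decisions with an inexpensive base policy (e.g., one-step lookahead).
    rng = np.random.default_rng(rng)
    gains = []
    for _ in range(n_samples):
        d = list(data)
        x_next = x
        for _ in range(horizon):
            y_next = sample_y(x_next, d, rng)   # draw y from the posterior predictive at x_next
            d.append((x_next, y_next))
            x_next = base_policy(d)             # choose the next simulated observation
        gains.append(utility(d) - utility(data))
    return float(np.mean(gains))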
5.4 cost-aware optimization and termination as a decision
$$\mathcal{A} = \mathcal{X} \cup \{\varnothing\}. \qquad (5.19)$$
For the sake of analysis, after the termination action has been selected, it is convenient to model the decision process as not actually terminating, but rather continuing with the collapsed action space A = {∅} – once you terminate, there's no going back.
As before, we may derive the optimal optimization policy in the adaptive termination case via induction on the decision horizon τ. However, we must address one technical issue: the base case of the induction, which analyzes the "final" decision, breaks down if we allow the possibility of a nonterminating sequence of decisions. To sidestep this issue, we assume there is a fixed and known upper bound τ_max on the total number of observations we may make, at which point optimization is compelled to terminate regardless of any other concern. This is not an overly restrictive assumption in the context of Bayesian optimization. Because observations are assumed to be expensive, we can adopt some suitably absurd upper bound without issue; for example, τ_max = 1 000 000 would suffice for an overwhelming majority of plausible scenarios.18
18 It is possible to consider unbounded sequential decision problems, but this is probably not of practical interest in Bayesian optimization: m. h. degroot (1970). Optimal Statistical Decisions. McGraw–Hill. [§ 12.7]

After assuming the decision process is bounded, our previous inductive argument carries through after we demonstrate how to compute the value of the termination action. Fortunately, this is straightforward:
termination does not augment our data, and once this action is taken, no
other action will ever again be allowed. Therefore the expected marginal
gain from termination is always zero:

α_τ(∅; D) = 0. (5.20)

With this, substituting A for X in (5.15–5.17) now gives the optimal policy.
Intuitively, the result in (5.20) implies that termination is only the optimal decision if there is no observation offering positive expected gain in utility. For the utility functions described in the next chapter – all of which are agnostic to costs and measure optimization progress alone – reaching this state is actually impossible.19 However, explicitly accounting for observation costs in addition to optimization progress in the utility function resolves this issue, as we will demonstrate.
19 This can be proven through various "information never hurts" (in expectation) results.
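Concretely, the resulting policy reduces each round to a simple check: by (5.20), we should terminate as soon as no candidate location offers positive expected gain. A minimal sketch, assuming a hypothetical acquisition function alpha(x, data) returning the expected marginal gain and a finite candidate set standing in for X:

```python
def next_action(data, candidates, alpha):
    """Return the next observation location, or None to signal termination.

    Termination has zero expected marginal gain (5.20), so it is optimal
    exactly when no candidate offers positive expected gain."""
    best = max(candidates, key=lambda x: alpha(x, data))
    return best if alpha(best, data) > 0 else None
```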
Figure 5.8: Illustration of one-step lookahead with the option to terminate. With a linear utility and additive costs, the expected marginal gain α₁ is the expected marginal gain to the data utility α₁′ adjusted for the cost of acquisition c. For some points, the cost-adjusted expected gain is negative, in which case we would prefer immediate termination to observing there. However, continuing with the chosen point is expected to increase the utility of the current data.
Consider the objective function belief in the top panel of Figure 5.8 (which is identical to that from our running example from Figures 5.1–5.3) and suppose that the cost of observation now depends on location according to a known cost function c(x),20 illustrated in the middle panel.
20 We will consider unknown and stochastic costs in § 11.1, p. 245.

If we wish to reason about observation costs in the optimization policy, we must account for them somehow, and the most natural place to do so is in the utility function. Depending on the situation, there are many ways we could proceed;21 however, one natural approach is to first select a utility function measuring the quality of a returned dataset alone, ignoring any costs incurred to acquire it. We call this quantity the data utility and notate it with u′(D). The data utility is akin to the cost-agnostic utility from the known-budget case, and any one of the options described in the next chapter (Chapter 6: Utility Functions for Optimization, p. 109) could reasonably fill this role.
21 We wish to stress this point – there is considerable flexibility beyond the scheme we describe.
We now adjust the data utility to account for the cost of data acquisition. In many applications, these costs are additive, so that the total cost of gathering a dataset D is simply

c(D) = ∑_{x ∈ D} c(x). (5.21)

If the acquisition cost can be expressed in the same units as the data utility – for example, if both can be expressed in monetary terms22 – then we might reasonably evaluate a dataset D by the cost-adjusted utility:

u(D) = u′(D) − c(D). (5.22)

22 Some additional discussion on this natural approach can be found in: h. raiffa and r. schlaifer (1961). Applied Statistical Decision Theory. Division of Research, Graduate School of Business Administration, Harvard University. [chapter 4]
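With additive costs (5.21) and the cost-adjusted utility (5.22), the one-step expected marginal gain decomposes exactly as in Figure 5.8: the cost-agnostic gain in the data utility minus the cost of the observation. A minimal sketch, where alpha_data(x, data) (the expected gain in the data utility) and cost(x) are assumed, hypothetical callables:

```python
def cost_adjusted_gain(x, data, alpha_data, cost):
    """One-step expected marginal gain under the cost-adjusted utility (5.22).

    Observing at x increases the total acquisition cost (5.21) by cost(x),
    so the expected gain is the expected gain in the data utility minus
    that (known, deterministic) cost."""
    return alpha_data(x, data) - cost(x)
```

A negative value indicates a point at which immediate termination would be preferred to observing, as in the figure.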
In the next chapter we will discuss several prominent utility functions for measuring the quality of a dataset returned by an optimization procedure. In the following chapter (Chapter 7: Common Bayesian Optimization Policies, p. 123), we will demonstrate how many common acquisition functions for Bayesian optimization may be realized by performing one-step lookahead with these utility functions.
6
UTILITY FUNCTIONS FOR OPTIMIZATION
Simple reward

Suppose an optimization routine returned data D = (x, y) to inform a terminal recommendation, and that we will make this decision using the risk-neutral utility function υ(ϕ) = ϕ (6.2). If we limit the action space of this recommendation to only the locations evaluated during optimization x, the expected utility of the optimal recommendation is the so-called simple reward:6, 7

u(D) = max μ_D(x). (6.3)

6 This name contrasts with the cumulative reward: § 6.2, p. 114.
7 One technical caveat is in order: when the dataset is empty, the maximum degenerates and we have u(∅) = −∞.

In the special case of exact observations, where y = f(x) = ϕ, the simple reward reduces to the maximal function value encountered during optimization:

u(D) = max ϕ. (6.4)
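For reference, both forms of the simple reward are immediate to compute given the posterior mean at the observed locations; the helper below is a minimal sketch, not from the text:

```python
import numpy as np

def simple_reward(posterior_mean_at_observed):
    """Simple reward (6.3): the maximum of the posterior mean over the
    locations already observed; -inf for an empty dataset (footnote 7).
    With exact observations this is just the largest value seen (6.4)."""
    mu = np.asarray(posterior_mean_at_observed, dtype=float)
    return float(mu.max()) if mu.size else float("-inf")
```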
Figure 6.1: The terminal recommendations corresponding to the simple reward and global reward for an example dataset comprising five observations. The prior distribution for the objective for this demonstration is illustrated in Figure 6.3.
One-step lookahead with the simple reward utility function produces a widely used acquisition function known as expected improvement (§ 7.3, p. 127), which we will discuss in detail in the next two chapters.
Global reward

Another prominent utility is the global reward.8 Here we again consider a risk-neutral terminal recommendation, but now expand the action space for this recommendation to the entire domain X. The expected utility of this recommendation is the global maximum of the posterior mean:

u(D) = max_{x ∈ X} μ_D(x). (6.5)

8 "Global simple reward" would be a more accurate (but annoyingly bulky) name.
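Unlike the simple reward, evaluating the global reward requires an auxiliary global optimization of the posterior mean over the domain. A minimal sketch for a one-dimensional box-bounded domain, assuming a hypothetical posterior_mean(x, data) callable; multistart local optimization stands in for whatever global routine is preferred:

```python
import numpy as np
from scipy.optimize import minimize

def global_reward(data, posterior_mean, lo, hi, n_starts=10, seed=0):
    """Global reward (6.5): the maximum of the posterior mean over the whole
    domain [lo, hi], estimated by multistart local optimization."""
    rng = np.random.default_rng(seed)
    best = -np.inf
    for x0 in rng.uniform(lo, hi, n_starts):
        res = minimize(lambda x: -posterior_mean(x[0], data),
                       x0=[x0], bounds=[(lo, hi)], method="L-BFGS-B")
        best = max(best, float(-res.fun))
    return best
```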
Figure 6.2: The utility u(D) = max y would prefer the excessively noisy dataset on the left to the less-noisy dataset on the right with smaller maximum value. The data on the left reveal little information about the objective function, and the maximum observed value is very likely to be an outlier, whereas the data on the right indicate reasonable progress.
In the noisy setting, one might instead consider rewarding the maximal value actually observed, u(D) = max y (6.6). However, large-but-noisy observations are not necessarily preferable outcomes. This is simple to demonstrate by contemplating the preferences encoded in this utility, which may not align with intuition. The disparity is especially notable in situations with excessively noisy observations, where the maximum value observed will likely reflect spurious noise rather than actual optimization progress.
Figure 6.2 shows an extreme but illustrative example. We consider
two optimization outcomes over the same domain, one with excessively
noisy observations and the other with exact measurements. The noisy
dataset contains a large observation on the right-hand side of the domain,
but this is almost certainly the result of noise, as indicated by the objective
function posterior. Although the other dataset has a lower maximal value,
the observations are more trustworthy and represent a plainly better
outcome. But the proposed utility (6.6) prefers the noisier dataset! On
the other hand, both the simple and global reward utilities prefer the
noiseless dataset, as the data produce a larger effect on the posterior
mean – and thus yield more promising recommendations.
Of course, errors in noisy measurements are not always as extreme as in this example. When the signal-to-noise ratio is relatively high, the utility (6.6) can serve as a reasonable approximation to the simple reward. We will discuss this approximation scheme further in the context of expected improvement (§ 7.3, p. 127).
One notable use of cumulative reward is in active search (§ 11.11, p. 282), a simple
mathematical model of scientific discovery. Here, we successively select
points for investigation seeking novel members of a rare, valuable class
V ⊂ X. Observing at a point x ∈ X yields a binary observation indicating membership in the desired class: y = [x ∈ V]. Most studies of active
search seek to maximize the cumulative reward (6.7) of the gathered
data, hoping to discover as many valuable items as possible.
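In this setting the cumulative reward has an especially simple reading: with binary labels it counts the valuable items discovered so far. The helper below is ours, purely for illustration:

```python
def cumulative_reward(y):
    """Cumulative reward (6.7): the sum of the observed values. For active
    search with binary labels y = [x in V], this counts the valuable items found."""
    return float(sum(y))
```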
The information gain offered by a dataset D is then the reduction in entropy when moving from the prior to the posterior distribution:

u(D) = H[ω] − H[ω | D], (6.8)

where H[ω | D] is the differential entropy of the posterior:11

H[ω | D] = − ∫ p(ω | D) log p(ω | D) dω.

11 A caveat is in order regarding this notation, which is not standard. In information theory H[ω | D] denotes the conditional entropy of ω given D, which is the expectation of the given quantity over the observed values y. For our purposes it will be more useful for this to signify the differential entropy of the notationally parallel posterior p(ω | D). When needed, we will write conditional entropy with an explicit expectation: 𝔼[H[ω | D] | x].
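As a concrete special case: if the belief about the quantity of interest ω (say, the value of the optimum f*) is approximated by a Gaussian both before and after conditioning on the data, the differential entropy has the closed form ½ log 2πeσ², and the information gain (6.8) reduces to the log-ratio of prior to posterior standard deviations. The Gaussian approximation here is our own, purely for illustration:

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)

def information_gain_gaussian(sigma_prior, sigma_post):
    """Information gain (6.8) when both p(omega) and p(omega | D) are
    approximated as Gaussians: H[omega] - H[omega | D] = log(sigma_prior / sigma_post)."""
    return gaussian_entropy(sigma_prior) - gaussian_entropy(sigma_post)
```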
Figure 6.3: The objective function prior used throughout our utility function comparison. Marginal beliefs of function values are shown, as well as the induced beliefs over the location of the global optimum, p(x*), and the value of the global optimum, p(f*). Note that there is a significant probability that the global optimum is achieved on the boundary of the domain, reflected by large point masses.
Figure 6.4: An example dataset of five observations and the resulting posterior belief of the objective function. This dataset exhibits relatively low simple reward (6.3) but relatively high global reward (6.5) and information gain (6.8) about the location x* and value f* of the optimum.
covariance (3.12). This prior is illustrated in Figure 6.3, along with the induced beliefs about the location x* and value f* of the global optimum. Both distributions reflect considerable uncertainty.15 We will examine two datasets that might be returned by an optimizer using this model and discuss how different utility functions would evaluate these outcomes.
15 For this model, a unique optimum will exist with probability one; see § 2.7, p. 34 for more details.
Figure 6.5: An example dataset containing a single observation and the resulting posterior belief of the objective function. This dataset exhibits relatively high simple reward (6.3) but relatively low global reward (6.5) and information gain (6.8) about the location x* and value f* of the optimum.
7
COMMON BAYESIAN OPTIMIZATION POLICIES
Figure 7.1: The scenario we will consider for illustrating optimization policies. The objective function prior is a Gaussian process with constant mean and Matérn covariance with ν = 5/2 (3.14). We show the marginal predictive distributions and three samples from the posterior conditioned on the indicated observations.
Figure 7.2: The true objective function f used for simulating optimization policies.
of the returned data (5.16). This policy is optimal in the average case: it maximizes the expected utility of the returned dataset over the space of all possible policies.5 Unfortunately, optimality comes at a great cost. Computing the optimal policy requires recursive simulation of the entire remainder of optimization, a random process due to uncertainty in the outcomes of our observations. In general, the cost of computing the optimal policy grows exponentially with the horizon, the number of observations remaining before termination.
5 To be precise, optimality is defined with respect to a model for the objective function p(f), an observation model p(y | x, ϕ), a utility function u(D), and an upper bound on the number of observations allowed τ. Bayesian decision theory provides a policy achieving the maximal expected utility at termination with respect to these choices.
However, the structure of the optimal policy suggests a natural family of lookahead approximations based on fixing a computationally tractable maximum horizon throughout optimization (running time of the optimal policy and efficient approximations: § 5.3, p. 99; limited lookahead: § 5.3, p. 101). This line of reasoning has led to many of the practical policies available for Bayesian optimization. In fact, most popular algorithms represent one-step lookahead, where in each iteration we greedily maximize the expected utility after obtaining only a single additional observation. Although these policies are maximally myopic, they are also maximally efficient among lookahead approximations and have delivered impressive empirical performance in a wide range of settings.
It may seem surprising that such dramatically myopic policies have any use at all. There is a huge difference between the scale of reasoning in one-step lookahead compared with the optimal procedure, which may consider hundreds of future decisions or more when designing an observation. However, the situation is somewhat more nuanced than it might appear. In a seminal paper, Kushner argued that myopic policies may in fact show better empirical performance than a theoretically optimal policy, and his argument remains convincing:6

Since a mathematical model of [f] is available, it is theoretically possible, once a criterion of optimality is given, to determine the mathematically optimal sampling policy. However... determination of the optimum sampling policies is extremely difficult. Because of this, the development of our sampling laws has been guided primarily by heuristic considerations.7 There are some advantages to the approximate approach... [and] its use may yield better results than would a procedure that is optimum for the model. Although the model selected for [f] is the best we have found for our purposes, it is sometimes too general...

6 h. j. kushner (1964). A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise. Journal of Basic Engineering 86(1):97–106.
7 Specifically, maximizing probability of improvement: § 7.5, p. 131.
One-step lookahead
Let us review the generic procedure for developing a one-step lookahead policy and adopt standard notation to facilitate their description. Suppose we have selected an arbitrary utility function u(D) to evaluate a returned dataset. Suppose further that we have already gathered an arbitrary dataset D = (x, y) and wish to select the next evaluation location. This is the fundamental role of an optimization policy.

If we were to choose some point x, we would observe a corresponding value y and update our dataset, forming D′ = (x′, y′) = D ∪ (x, y). Note that in our discussion on decision theory in Chapter 5, we notated this updated dataset with the symbol D₁, as we needed to be able to distinguish between datasets after the incorporation of a variable number of additional observations. As our focus in this chapter will be on one-step lookahead, we can simplify notation by dropping subscripts indicating time. Instead, we will systematically use the prime symbol to indicate future quantities after the acquisition of the next observation.

In one-step lookahead, we evaluate a proposed point x via the expected marginal gain in utility after incorporating an observation there (5.8):

α(x; D) = 𝔼[u(D′) | x, D] − u(D),

which serves as an acquisition function (§ 5, p. 88) inducing preferences over possible observation locations. We design each observation by maximizing this score:

x ∈ arg max_{x′ ∈ X} α(x′; D). (7.1)
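In practice the maximization in (7.1) is itself delegated to an off-the-shelf optimizer. A minimal sketch for a one-dimensional box-bounded domain, assuming a hypothetical acquisition callable alpha(x, data); random multistart L-BFGS-B stands in for whatever routine is preferred:

```python
import numpy as np
from scipy.optimize import minimize

def maximize_acquisition(alpha, data, lo, hi, n_starts=20, seed=0):
    """Select the next observation location by maximizing alpha(x, data)
    over [lo, hi] with random multistart local optimization (7.1)."""
    rng = np.random.default_rng(seed)
    best_x, best_val = None, -np.inf
    for x0 in rng.uniform(lo, hi, n_starts):
        res = minimize(lambda x: -alpha(x[0], data), x0=[x0],
                       bounds=[(lo, hi)], method="L-BFGS-B")
        if -res.fun > best_val:
            best_x, best_val = float(res.x[0]), float(-res.fun)
    return best_x
```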
Expected improvement

Expected improvement is the one-step lookahead policy induced by the simple reward utility (6.3):

u(D) = max μ_D(x).

Suppose we have already gathered observations D = (x, y) and wish to choose the next evaluation location. Expected improvement is derived by measuring the expected marginal gain in utility, or the instantaneous improvement, u(D′) − u(D),10 offered by making the next observation at a proposed location x:11

α_EI(x; D) = ∫ max μ_D′(x′) p(y | x, D) dy − max μ_D(x). (7.2)

10 This reasoning is the same for all one-step lookahead policies, which could all be described as maximizing "expected improvement." But this name has been claimed for the simple reward utility alone.
11 As mentioned in the last chapter, simple reward degenerates with an empty dataset; expected improvement does as well. In that case we can simply ignore the second term and compute the first, which for zero-mean additive noise becomes the mean function of the prior process.
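The integral in (7.2) is a one-dimensional expectation over the unknown value y and is easy to estimate numerically. A minimal sketch using Gauss–Hermite quadrature, assuming hypothetical callables predictive(x, data) (the predictive mean and standard deviation of y) and updated_mean_max(x, y, data) (the value of max μ_D′(x′) after fantasizing the observation (x, y)):

```python
import numpy as np

def expected_improvement_noisy(x, data, predictive, updated_mean_max,
                               current_mean_max, n_quad=16):
    """Quadrature estimate of expected improvement (7.2) with noisy observations.
    All model-dependent pieces are passed in as hypothetical callables."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    mean, std = predictive(x, data)
    vals = [updated_mean_max(x, mean + np.sqrt(2.0) * std * z, data) for z in nodes]
    return float(np.dot(weights, vals) / np.sqrt(np.pi) - current_mean_max)
```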
Figure 7.3: The expected improvement acquisition function (7.2) corresponding to our running example.

Expected improvement reduces to a particularly nice expression in the case of exact observations of the objective, where the utility takes a simpler form (6.4). Suppose that, when we elect to make an observation at a location x, we observe the exact objective value ϕ = f(x). Consider a dataset D = (x, ϕ), and define ϕ* = max ϕ to be the so-called incumbent: the maximal objective value yet seen.12 As a consequence of exact observation, we have

u(D) = ϕ*;    u(D′) = max(ϕ*, ϕ);

and thus

u(D′) − u(D) = max(ϕ − ϕ*, 0).

12 The value ϕ* is incumbent as it is currently "holding office" as our standing recommendation until it is deposed by a better candidate.
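When the predictive distribution of ϕ is Gaussian with mean μ and standard deviation σ, the expectation of max(ϕ − ϕ*, 0) has the familiar closed form (μ − ϕ*)Φ(z) + σφ(z) with z = (μ − ϕ*)/σ, where Φ and φ are the standard normal cdf and pdf. This standard result is sketched below for reference:

```python
from scipy.stats import norm

def expected_improvement_exact(mu, sigma, incumbent):
    """Closed-form expected improvement for exact observations, assuming a
    Gaussian predictive distribution N(mu, sigma^2) for the observed value:
    E[max(phi - incumbent, 0)] = (mu - incumbent) * cdf(z) + sigma * pdf(z)."""
    if sigma <= 0.0:                      # degenerate case: no predictive uncertainty
        return max(mu - incumbent, 0.0)
    z = (mu - incumbent) / sigma
    return (mu - incumbent) * norm.cdf(z) + sigma * norm.pdf(z)
```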
Expected improvement is illustrated for our running example in
Figure 7.3. In this case, maximizing expected improvement will select
a point near the previous best point found, an example of exploitation.
Notice that the expected improvement vanishes near regions where
we have existing observations. Although these locations may be likely
to yield values higher than ϕ* due to relatively high expected value,
the relatively narrow credible intervals suggest that the magnitude of
any improvement is likely to be small. Expected improvement is thus
considering the explorationβexploitation dilemma in the selection of the
next observation location, and the tradeoff between these two concerns
is considered automatically.
Figure 7.4 shows the posterior belief of the objective after sequentially
maximizing expected improvement to gather 20 additional observations
of our example objective function. The global optimum was efficiently
located. The distribution of the sample locations, with more evaluations
in the most promising regions, reflects consideration of the explorationβ
exploitation dilemma. However, there seems to have been a focus on exploitation throughout the entire process; the first ten observations, for example, never strayed from the initially known local optimum. This
behavior is a reflection of the simple reward utility function underlying
the policy, which only rewards the discovery of high objective func-
tion values at observed locations. As a result, one-step lookahead may
Figure 7.4: The posterior after 10 (top) and 20 (bottom) steps of the optimization policy induced by the expected improvement acquisition function (7.2) on our running example. The tick marks show the points chosen by the policy, progressing from top to bottom, during iterations 1–10 (top) and 11–20 (bottom). Observations within 0.2 length scales of the optimum have thicker marks; the optimum was located on iteration 19.
Knowledge gradient

Figure 7.5: The knowledge gradient acquisition function (7.4) corresponding to our running example.
Figure 7.7: The posterior after 10 (top) and 20 (bottom) steps of the optimization policy induced by the knowledge gradient acquisition function (7.4) on our running example. The tick marks show the points chosen by the policy, progressing from top to bottom, during iterations 1–10 (top) and 11–20 (bottom). Observations within 0.2 length scales of the optimum have thicker marks; the optimum was located on iteration 15.
of these new maxima coincide with local maxima of the expected im-
provement acquisition function; see Figure 7.3 for comparison. This is
not a coincidence! One way to interpret this relation is that, due to re-
warding large values of the posterior mean at observed locations only,
expected improvement must essentially guess on which side the hidden
local optimum of the objective lies and hope to be correct. The knowl-
edge gradient, on the other hand, considers identifying this maximum
on either side a success, and guessing is not necessary.
Figure 7.7 illustrates the behavior of the knowledge gradient policy on our example optimization scenario. The global optimum was located efficiently. Comparing the decisions made by the knowledge gradient to those made by expected improvement (see Figure 7.4), we can observe a somewhat more even exploration of the domain, including in local maxima – a consequence of the more-relaxed utility. The knowledge gradient policy does not necessarily need to expend observations to verify a suspected maximum, instead putting more trust into the model to have correct beliefs in these regions.
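The knowledge gradient is the one-step expected gain in the global reward (6.5). Unlike expected improvement it admits no general closed form, and a common approach is Monte Carlo estimation over fantasized observations. A minimal sketch, assuming hypothetical callables predictive(x, data) (the predictive mean and standard deviation of y) and posterior_mean_max(data) (the maximum of the posterior mean over the domain, as in (6.5)), with datasets as lists of (x, y) pairs:

```python
import numpy as np

def knowledge_gradient_mc(x, data, predictive, posterior_mean_max,
                          n_samples=64, seed=0):
    """Monte Carlo estimate of the knowledge gradient: the expected increase
    in the global maximum of the posterior mean after observing at x,
    averaged over fantasized values of y drawn from the predictive distribution."""
    rng = np.random.default_rng(seed)
    mean, std = predictive(x, data)
    current = posterior_mean_max(data)
    gains = [posterior_mean_max(data + [(x, y)]) - current
             for y in rng.normal(mean, std, n_samples)]
    return float(np.mean(gains))
```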
Probability of improvement

Figure 7.8: An illustrative example comparing the behavior of probability of improvement with expected improvement computed with respect to the dashed improvement target. The predictive distributions p(ϕ | x, D) and p(ϕ′ | x′, D) for two points x and x′ are shown. The distributions have equal mean, but the distribution at x′ has larger predictive standard deviation. The shaded regions represent the region of improvement. The relatively safe x is preferred by probability of improvement, whereas the more-risky x′ is preferred by expected improvement.
Consider the simple reward (§ 6.1, p. 109) of an already gathered dataset D = (x, y):

u(D) = max μ_D(x).
Figure 7.9: The probability of improvement acquisition function (7.5) corresponding to our running example for two values of the target improvement, ε = 0.1 and ε = 1. The target is expressed as a fraction of the range of the posterior mean over the space. Increasing the target improvement leads to increasingly exploratory behavior.
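Under a Gaussian predictive distribution with mean μ and standard deviation σ, the probability of exceeding the current utility by at least the target ε has the standard closed form Φ((μ − u(D) − ε)/σ). A minimal sketch of this computation (the Gaussian assumption and argument names are ours):

```python
from scipy.stats import norm

def probability_of_improvement(mu, sigma, current, epsilon=0.0):
    """Probability of improvement under a Gaussian predictive distribution
    N(mu, sigma^2): the probability of exceeding current + epsilon."""
    if sigma <= 0.0:                      # degenerate case: no predictive uncertainty
        return float(mu > current + epsilon)
    return float(norm.sf((current + epsilon - mu) / sigma))
```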
Figure 7.11: The posterior after 20 steps of the optimization policy induced by probability of improvement with ε = 0 (7.5) on our running example. The tick marks show the points chosen by the policy, progressing from top to bottom.
Evidently we must carefully select the desired improvement threshold to achieve ideal behavior. Jones provided some simple, data-driven advice for choosing improvement thresholds that remains sound.14
14 d. r. jones (2001). A Taxonomy of Global Optimization Methods Based on Response Surfaces. Journal of Global Optimization 21(4):345–383.
Define