
BAYESIAN OPTIMIZATION
ROMAN GARNETT

Bayesian Optimization

Bayesian optimization is a methodology for optimizing expensive objective functions, with proven success in the sciences, engineering, and beyond. This timely text provides a self-contained
and comprehensive introduction to the subject, starting from scratch and carefully developing all
the key ideas along the way. This bottom-up approach illuminates unifying themes in the design of
Bayesian optimization algorithms and builds a solid theoretical foundation for approaching novel
situations.
The core of the book is divided into three main parts, covering theoretical and practical aspects of Gaussian process modeling, the Bayesian approach to sequential decision making, and the realization and computation of practical and effective optimization policies.
Following this foundational material, the book provides an overview of theoretical convergence
results, a survey of notable extensions, a comprehensive history of Bayesian optimization, and an
extensive annotated bibliography of applications.

Roman Garnett is Associate Professor in Computer Science and Engineering at Washington University in St. Louis. He has been a leader in the Bayesian optimization community since 2011,
when he cofounded a long-running workshop on the subject at the NeurIPS conference. His
research focus is developing Bayesian methods – including Bayesian optimization – for automating
scientific discovery.
ROMAN GARNETT
Washington University in St Louis

BAYESIAN OPTIMIZATION
Shaftesbury Road, Cambridge CB2 8EA, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
103 Penang Road, #05–06/07, Visioncrest Commercial, Singapore 238467

Cambridge University Press is part of Cambridge University Press & Assessment, a department of the University of Cambridge.
We share the University's mission to contribute to society through the pursuit of education, learning and research at the highest international levels of excellence.

[Link]
Information on this title: [Link]/9781108425780
DOI: 10.1017/9781108348973
© Roman Garnett 2023
This publication is in copyright. Subject to statutory exception and to the provisions
of relevant collective licensing agreements, no reproduction of any part may take
place without the written permission of Cambridge University Press & Assessment.
First published 2023
Printed in the United Kingdom by TJ Books Limited, Padstow Cornwall
A catalogue record for this publication is available from the British Library.
ISBN 978-1-108-42578-0 Hardback
Cambridge University Press & Assessment has no responsibility for the persistence
or accuracy of URLs for external or third-party internet websites referred to in this
publication and does not guarantee that any content on such websites is, or will
remain, accurate or appropriate.
CONTENTS

preface ix

notation xiii

1 introduction 1
1.1 Formalization of Optimization 2
1.2 The Bayesian Approach 5

2 gaussian processes 15
2.1 Definition and Basic Properties 16
2.2 Inference with Exact and Noisy Observations 18
2.3 Overview of Remainder of Chapter 26
2.4 Joint Gaussian Processes 26
2.5 Continuity 28
2.6 Differentiability 30
2.7 Existence and Uniqueness of Global Maxima 33
2.8 Inference with Non-Gaussian Observations and Constraints 35
2.9 Summary of Major Ideas 41

3 modeling with gaussian processes 45


3.1 The Prior Mean Function 46
3.2 The Prior Covariance Function 49
3.3 Notable Covariance Functions 51
3.4 Modifying and Combining Covariance Functions 54
3.5 Modeling Functions on High-Dimensional Domains 61
3.6 Summary of Major Ideas 64

4 model assessment, selection, and averaging 67


4.1 Models and Model Structures 68
4.2 Bayesian Inference over Parametric Model Spaces 70
4.3 Model Selection via Posterior Maximization 73
4.4 Model Averaging 74
4.5 Multiple Model Structures 78
4.6 Automating Model Structure Search 81
4.7 Summary of Major Ideas 84

5 decision theory for optimization 87


5.1 Introduction to Bayesian Decision Theory 89
5.2 Sequential Decisions with a Fixed Budget 91
5.3 Cost and Approximation of the Optimal Policy 99
5.4 Cost-Aware Optimization and Termination as a Decision 103
5.5 Summary of Major Ideas 106

6 utility functions for optimization 109


6.1 Expected Utility of Terminal Recommendation 109
6.2 Cumulative Reward 114


6.3 Information Gain 115


6.4 Dependence on Model of Objective Function 116
6.5 Comparison of Utility Functions 117
6.6 Summary of Major Ideas 119

7 common bayesian optimization policies 123


7.1 Example Optimization Scenario 124
7.2 Decision-Theoretic Policies 124
7.3 Expected Improvement 127
7.4 Knowledge Gradient 129
7.5 Probability of Improvement 131
7.6 Mutual Information and Entropy Search 135
7.7 Multi-Armed Bandits and Optimization 141
7.8 Maximizing a Statistical Upper Bound 145
7.9 Thompson Sampling 148
7.10 Other Ideas in Policy Construction 150
7.11 Summary of Major Ideas 156

8 computing policies with gaussian processes 157


8.1 Notation for Objective Function Model 157
8.2 Expected Improvement 158
8.3 Probability of Improvement 167
8.4 Upper Confidence Bound 170
8.5 Approximate Computation for One-Step Lookahead 171
8.6 Knowledge Gradient 172
8.7 Thompson Sampling 176
8.8 Mutual Information with x* 180
8.9 Mutual Information with f* 187
8.10 Averaging over a Space of Gaussian Processes 192
8.11 Alternative Models: Bayesian Neural Networks, etc. 196
8.12 Summary of Major Ideas 200

9 implementation 201
9.1 Gaussian Process Inference, Scaling, and Approximation 201
9.2 Optimizing Acquisition Functions 207
9.3 Starting and Stopping Optimization 210
9.4 Summary of Major Ideas 212

10 theoretical analysis 213


10.1 Regret 213
10.2 Useful Function Spaces for Studying Convergence 215
10.3 Relevant Properties of Covariance Functions 220
10.4 Bayesian Regret with Observation Noise 224
10.5 Worst-Case Regret with Observation Noise 232
10.6 The Exact Observation Case 237
10.7 The Effect of Unknown Hyperparameters 241
10.8 Summary of Major Ideas 243

11 extensions and related settings 245


11.1 Unknown Observation Costs 245
11.2 Constrained Optimization and Unknown Constraints 249
11.3 Synchronous Batch Observations 252
11.4 Asynchronous Observation with Pending Experiments 262
11.5 Multifidelity Optimization 263
11.6 Multitask Optimization 266
11.7 Multiobjective Optimization 269
11.8 Gradient Observations 276
11.9 Stochastic and Robust Optimization 277
11.10 Incremental Optimization of Sequential Procedures 281
11.11 Non-Gaussian Observation Models and Active Search 282
11.12 Local Optimization 285

12 a brief history of bayesian optimization 287


12.1 Historical Precursors and Optimal Design 287
12.2 Sequential Analysis and Bayesian Experimental Design 287
12.3 The Rise of Bayesian Optimization 289
12.4 Later Rediscovery and Development 290
12.5 Multi-Armed Bandits to Infinite-Armed Bandits 292
12.6 What's Next? 294

a the gaussian distribution 295

b methods for approximate bayesian inference 301

c gradients 307

d annotated bibliography of applications 313

references 331

index 353
PREFACE

My interest in Bayesian optimization began in 2007 at the start of my doctoral studies. I was frustrated that there seemed to be a Bayesian approach to every task I cared about, except optimization. Of course, as was often the case at that time (not to mention now!), I was mistaken in this belief, but one should never let ignorance impede inspiration.
Meanwhile, my labmate and soon-to-be frequent collaborator Mike Osborne had a fresh copy of Rasmussen and Williams's Gaussian Processes for Machine Learning and just would not stop talking about GPs at our lab meetings. Through sheer brute force of repetition, I slowly built a hand-wavy intuition for Gaussian processes – my mental model was the "sausage plot" – without even being sure about their precise definition. However, I was pretty sure that marginals were Gaussian (what else?), and one day it occurred to me that one could achieve Bayesian optimization by maximizing the probability of improvement. This was the algorithm I was looking for! In my excitement I shot off an email to Mike that kicked off years of fruitful collaboration:

Can I ask a dumb question about GPs? Let's say that I'm doing function approximation on an interval with a GP. So I've got this mean function m(x) and a variance function v(x). Is it true that if I pick a particular point x, then p(f(x)) ∼ N(m(x), v(x))? Please say yes.
If this is true, then I think the idea of doing Bayesian optimization using GPs is, dare I say, trivial.

The hubris of youth!

Well, it turned out I was 45 years too late in proposing this algorithm,¹ and that it only seemed "trivial" because I had no appreciation for its theoretical foundation. However, truly great ideas are rediscovered many times, and my excitement did not fade. Once I developed a deeper understanding of Gaussian processes and Bayesian decision theory, I came to see them as a "Bayesian crank" I could turn to realize adaptive algorithms for any task. I have been repeatedly astonished to find that the resulting algorithms – seemingly by magic – automatically display intuitive emergent behavior as a result of their careful design. My goal with this book is to paint this grand picture. In effect, it is a gift to my former self: the book I wish I had in the early years of my career.

1 H. J. Kushner (1962). A Versatile Stochastic Model of a Function of Unknown and Time Varying Form. Journal of Mathematical Analysis and Applications 5(1):150–167.
In the context of machine learning, Bayesian optimization is an ancient idea – Kushner's paper appeared only three years after the term "machine learning" was coined! Despite its advanced age, Bayesian optimization has been enjoying a period of revitalization and rapid progress over the past ten years. The primary driver of this renaissance has been advances in computation, which have enabled increasingly sophisticated tools for Bayesian modeling and inference.
Ironically, however, perhaps the most critical development was not Bayesian at all, but the rise of deep neural networks, another old idea
granted new life by modern computation. The extreme cost of training these models demands efficient routines for hyperparameter tuning, and in a timely and influential paper, Snoek et al. demonstrated (dramatically!) that Bayesian optimization was up to the task.² Hyperparameter tuning proved to be a "killer app" for Bayesian optimization, and the ensuing surge of interest has yielded a mountain of publications developing new algorithms and improving old ones, exploring countless variations on the basic setup, establishing theoretical guarantees on performance, and applying the framework to a huge range of domains.

2 J. Snoek et al. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. NeurIPS 2012.
Due to the nature of the computer science publication model, these recent developments are scattered across dozens of brief papers, and the pressure to establish novelty in a limited space can obscure the big picture in favor of minute details. This book aims to provide a self-contained and comprehensive introduction to Bayesian optimization, starting "from scratch" and carefully developing all the key ideas along the way. This bottom-up approach allows us to identify unifying themes in Bayesian optimization algorithms that may be lost when surveying the literature.
The intended audience is graduate students and researchers in machine learning, statistics, and related fields. However, it is also my sincere hope that practitioners from more distant fields wishing to harness the power of Bayesian optimization will also find some utility here.
For the bulk of the text, I assume the reader is comfortable with differential and integral calculus, probability, and linear algebra. On occasion the discussion will meander to more esoteric areas of mathematics, and these passages can be safely ignored and returned to later if desired. A good working knowledge of the Gaussian distribution is also essential, and I provide an abbreviated but sufficient introduction in Appendix A.
The book is divided into three main parts. Chapters 2–4 cover theoretical and practical aspects of modeling with Gaussian processes. This class of models is the overwhelming favorite in the Bayesian optimization literature, and the material contained within is critical for several following chapters. It was daunting to write this material in light of the many excellent references already available, in particular the aforementioned Gaussian Processes for Machine Learning. However, I heavily biased the presentation in light of the needs of optimization, and even experts may find something new.
Chapters 5–7 develop the theory of sequential decision making and its application to optimization. Although this theory requires a model of the objective function and our observations of it, the presentation is agnostic to the choice of model and may be read independently from the preceding chapters on Gaussian processes.
These threads are unified in Chapters 8–10, which discuss the particulars of Bayesian optimization with Gaussian process models. Chapters 8–9 cover details of computation and implementation, and Chapter 10 discusses theoretical performance bounds on Bayesian optimization algorithms, where most results depend intimately on a Gaussian process model of the objective function or the associated reproducing kernel Hilbert space.

The nuances of some applications require modifications to the basic sequential optimization scheme that is the focus of the bulk of the book, and Chapter 11 introduces several notable extensions to this basic setup. Each is systematically presented through the unifying lens of Bayesian decision theory to illustrate how one might proceed when facing a novel situation.
Finally, Chapter 12 provides a brief and standalone history of Bayesian optimization. This was perhaps the most fun chapter for me to write, if only because it forced me to plod through old Soviet literature (in an actual library! What a novelty these days!). To my surprise I was able to antedate many Bayesian optimization policies beyond their commonly attested origin, including expected improvement, knowledge gradient, probability of improvement, and upper confidence bound. (A reader familiar with the literature may be surprised to learn the last of these was actually the first policy discussed by Kushner in his 1962 paper.) Despite my best efforts, there may still be stones left to be overturned before the complete history is revealed.
Dependencies between the main chapters are illustrated in the margin. There are two natural linearizations of the material. The first is the one I adopted and personally prefer, which covers modeling prior to decision making. However, one could also proceed in the other order, reading Chapters 5–7 first, then looping back to Chapter 2. After covering the material in these chapters (in either order), the remainder of the book can be perused at will. Logical partial paths through the book include:

β€’ a minimal but self-contained introduction: Chapters 1–2, 5–7
β€’ a shorter introduction requiring leaps of faith: Chapters 1 and 7
β€’ a crash course on the underlying theory: Chapters 1–2, 5–7, 10
β€’ a head start on implementing a software package: Chapters 1–9

A reader already quite comfortable with Gaussian processes might wish to skip over Chapters 2–4 entirely.

[Margin figure: a dependency graph for Chapters 2–11. Chapter 1 is a universal dependency.]
I struggled for some time over whether to include a chapter on applications. On the one hand, Bayesian optimization ultimately owes its popularity to its success in optimizing a growing and diverse set of difficult objectives. However, these applications often require extensive technical background to appreciate, and an adequate coverage would be tedious to write and tedious to read. As a compromise, I provide an annotated bibliography (Appendix D, p. 313) outlining the optimization challenges involved in notable domains of interest and pointing to studies where these challenges were successfully overcome with the aid of Bayesian optimization.
The sheer size of the Bayesian optimization literature – especially the output of the previous decade – makes it impossible to provide a complete survey of every recent development. This is especially true for the extensions discussed in Chapter 11 and even more so for the bibliography on applications, where work has proliferated in myriad branching directions. Instead I settled for presenting what I considered to be the most important ideas and providing pointers to entry points for the relevant literature. The reader should not read anything into any omissions; there is simply too much high-quality work to go around.
Additional information about the book, including a list of errata as they are discovered, may be found at the companion webpage:
[Link]
I encourage the reader to report any errata or other issues to the companion GitHub repository for discussion and resolution:
[Link]/bayesoptbook/[Link]
Preparation of this manuscript was facilitated tremendously by numerous free and open source projects, and the creators, developers, and maintainers of these projects have my sincere gratitude. Thank you! The manuscript was typeset in LaTeX using the excellent and extremely flexible memoir class. The typeface is Linux Libertine. Figures were laid out in MATLAB and converted to TikZ/pgf/pgfplots for further tweaking and typesetting via the matlab2tikz script. The colors used in figures were based on [Link] by Cynthia A. Brewer, and I endeavored to the best of my ability to ensure that the figures are colorblind friendly. The colormap used in heat maps is a slight modification of the Matplotlib viridis colormap where the "bright" end is pure white.
I would like to thank Eric Brochu, Nando de Freitas, Matt Hoffman,
Frank Hutter, Mike Osborne, Bobak Shahriari, Jasper Snoek, Kevin Swer-
sky, and Ziyu Wang, who jointly provided the activation energy for
this undertaking. I would also like to thank Eytan Bakshy, Ivan Barri-
entos, George De Ath, Neil Dhir, Peter Frazier, Lukas FrΓΆhlich, Ashok
Gautam, Jake Gardner, Javier GonzΓ‘lez, Ryan-Rhys Griffiths, Philipp
Hennig, Eugen Hotaj, Jungtaek Kim, Simon Kruse, Jack Liu, Bryan Low,
Ruben Martinez-Cantin, Keita Mori, Kevin Murphy, Matthias Poloczeck,
Jon Scarlett, Sebastian Tay, Sattar Vakili, Jiangyan Zhao, Qiuyi Zhang,
Xiaowei Zhang, and GitHub users cgoble001 and chaos-and-patterns
for their suggestions, corrections, and valuable discussions along the
way, as well as everyone at Cambridge University Press for their support
and patience as I continually missed deadlines. Finally, special thanks are due to the students of two seminars run at Washington University for reading, discussing, and ultimately improving the book.
Funding support was provided by the United States National Science Foundation (NSF) under award number 1845434. Any opinions, findings, and conclusions or recommendations expressed in this book are those of the author and do not necessarily reflect the views of the NSF.
This book took far more time than I initially anticipated, and I would
especially like to thank my wife Marion, my son Max (arg Max?), and
my daughter Matilda (who escaped being named Minnie!) for their un-
derstanding and support during this long journey.

Roman Garnett
St. Louis, Missouri, November 2022
NOTATION

All vectors are column vectors and are denoted in lowercase bold: x ∈ ℝᵈ. Matrices are denoted in uppercase bold: A.

We adopt the "numerator layout" convention for matrix calculus: the derivative of a vector by a scalar is a (column) vector, whereas the derivative of a scalar by a vector is a row vector. This results in the chain rule proceeding from left to right; for example, if a vector x(θ) depends on a scalar parameter θ, then for a function f(x), we have:

∂f/∂θ = (∂f/∂x)(∂x/∂θ).

When an indicator function is required, we use the Iverson bracket notation. For a statement s, we have:

[s] = 1 if s is true; 0 otherwise.

The statement may depend on a parameter: [x ∈ A], [x ≥ 0], etc.

Logarithms are taken with respect to their natural base, e. Quantities in log units such as log likelihoods or entropy thus have units of nats, the base-e analogue of the more familiar base-2 bits.

symbols with implicit dependence on location

There is one notational innovation in this book compared with the Gaussian process and Bayesian optimization literature at large: we make heavy use of symbols for quantities that depend implicitly on a putative (and arbitrary) input location x. Most importantly, to refer to the value of an objective function f at a given location x, we introduce the symbol φ = f(x). This avoids a clash with the name of the function itself, f, while avoiding an extra layer of brackets. We use this scheme throughout the book, including variations such as:

φ′ = f(x′);  𝝓 = f(x);  γ = g(x);  etc.

To refer to the outcome of a (possibly inexact) measurement at x, we use the symbol y; the distribution of y presumably depends on φ.
We also allocate symbols to describe properties of the marginal predictive distributions for the objective function value φ and observed value y, all of which also have implicit dependence on x. These appear in the following table.

comprehensive list of symbols

A list of important symbols appears on the following pages, arranged roughly in alphabetical order.


symbol    description
≡    identical equality of functions; for a constant c, f ≡ c is a constant function
∇    gradient operator
∅    termination option: the action of immediately terminating optimization
≺    either Pareto dominance or the Löwner order: for symmetric A, B, A ≺ B if and only if B − A is positive definite
ω ∼ p(ω)    is sampled according to: ω is a realization of a random variable with probability density p(ω)
∐ᵢ Xᵢ    disjoint union of {Xᵢ}: ∐ᵢ Xᵢ = ⋃ᵢ {(x, i) | x ∈ Xᵢ}
|A|    determinant of square matrix A
|x|    Euclidean norm of vector x; |x − y| is thus the Euclidean distance between vectors x and y
‖f‖_H_K    norm of function f in reproducing kernel Hilbert space H_K
A⁻¹    inverse of square matrix A
x⊤    transpose of vector x
0    vector or matrix of zeros
A    action space for a decision
α(x; D)    acquisition function evaluating x given data D
α_τ(x; D)    expected marginal gain in u(D) after observing at x then making τ − 1 additional optimal observations given the outcome
α*_τ(D)    value of D with horizon τ: expected marginal gain in u(D) from τ additional optimal observations
α_EI    expected improvement
α_f*    mutual information between y and f*
α_KG    knowledge gradient
α_PI    probability of improvement
α_x*    mutual information between y and x*
α_UCB    upper confidence bound
α_TS    Thompson sampling "acquisition function": a draw f ∼ p(f | D)
β    confidence parameter in Gaussian process upper confidence bound policy
β(x; D)    batch acquisition function evaluating x given data D; may have modifiers analogous to α
C    prior covariance matrix of observed values y: C = cov[y]
c(D)    cost of acquiring data D
chol A    Cholesky decomposition of positive definite matrix A: if Λ = chol A, then A = ΛΛ⊤
corr[ω, ψ]    correlation of random variables ω and ψ; with a single argument, corr[ω] = corr[ω, ω]
cov[ω, ψ]    covariance of random variables ω and ψ; with a single argument, cov[ω] = cov[ω, ω]
D    set of observed data, D = (x, y)
D′, D₁    set of observed data after observing at x: D′ = D ∪ {(x, y)} = (x′, y′)
D_τ    set of observed data after τ observations
D_KL[p ‖ q]    Kullback–Leibler divergence between distributions with probability densities p and q
Δ(x, y)    marginal gain in utility after acquiring observation (x, y): Δ(x, y) = u(D′) − u(D)
δ(ω − a)    Dirac delta distribution on ω with point mass at a
diag x    diagonal matrix with diagonal x
E, E_ω    expectation, expectation with respect to ω
ε    measurement error associated with an observation at x: ε = y − φ
f    objective function; f : X → ℝ
f|_Y    the restriction of f onto the subdomain Y ⊂ X
f*    globally maximal value of the objective function: f* = max f
γ_τ    information capacity of an observation process given τ iterations
GP (𝑓 ; πœ‡, 𝐾) Gaussian process on 𝑓 with mean function πœ‡ and covariance function 𝐾
H𝐾 reproducing kernel Hilbert space associated with kernel 𝐾
H𝐾 [𝐡] ball of radius 𝐡 in H𝐾 : {𝑓 | βˆ₯𝑓 βˆ₯ H𝐾 ≀ 𝐡}
𝐻 [πœ”] discrete or differential entropy of random variable πœ”
𝐻 [πœ” | D] discrete or differential of random variable πœ” after conditioning on D
𝐼 (πœ”;πœ“ ) mutual information between random variables πœ” and πœ“
𝐼 (πœ”;πœ“ | D) mutual information between random variables πœ” and πœ“ after conditioning on D
I identity matrix
𝐾 prior covariance function: 𝐾 = cov[𝑓 ]
𝐾D posterior covariance function given data D: 𝐾D = cov[𝑓 | D]
𝐾m Matérn covariance function
𝐾se squared exponential covariance function
πœ… cross-covariance between 𝑓 and observed values y: πœ… (π‘₯) = cov[y, πœ™ | π‘₯]
β„“ either a length-scale parameter or the lookahead horizon
πœ† output-scale parameter
M space of models indexed by the hyperparameter vector 𝜽
m prior expected value of observed values y, m = 𝔼[y]
πœ‡ either the prior mean function, πœ‡ = 𝔼[𝑓 ], or the predictive mean of πœ™: πœ‡ = 𝔼[πœ™ | π‘₯, D] = πœ‡D (π‘₯)
πœ‡D posterior mean function given data D: πœ‡D = 𝔼[𝑓 | D]
N (𝝓; 𝝁, 𝚺) multivariate normal distribution on 𝝓 with mean vector 𝝁 and covariance matrix 𝚺
N measurement error covariance corresponding to observed values y
O is asymptotically bounded above by: for nonnegative functions 𝑓, 𝑔 of 𝜏, 𝑓 = O(𝑔) if 𝑓 /𝑔 is
asymptotically bounded by a constant as 𝜏 β†’ ∞
Oβˆ— as above with logarithmic factors suppressed: 𝑓 = Oβˆ— (𝑔) if 𝑓 (𝜏) (log 𝜏)π‘˜ = O(𝑔) for some π‘˜
Ξ© is asymptotically bounded below by: 𝑓 = Ξ©(𝑔) if 𝑔 = O(𝑓 )
𝑝 probability density
π‘ž either an approximation to probability density 𝑝 or a quantile
βˆ«π‘§ function
Ξ¦(𝑧) standard normal cumulative density function: Ξ¦(𝑧) = βˆ’βˆž πœ™ (𝑧 ) d𝑧 β€²
β€²

πœ™ value of the objective function at π‘₯: πœ™ = 𝑓 (π‘₯) √


πœ™ (𝑧) standard normal probability density function: πœ™ (𝑧) = ( 2πœ‹) βˆ’1 exp(βˆ’ 21 𝑧 2 )
Pr probability
ℝ set of real numbers
π‘…πœ cumulative regret after 𝜏 iterations
π‘…Β―πœ [𝐡] worst-case cumulative regret after 𝜏 iterations on the rkhs ball H𝐾 [𝐡]
π‘Ÿπœ simple regret after 𝜏 iterations
π‘ŸΒ―πœ [𝐡] worst-case simple regret after 𝜏 iterations on the rkhs ball H𝐾 [𝐡]
P a correlation matrix
𝜌 a scalar correlation
𝜌𝜏 instantaneous regret on iteration 𝜏
𝑠2 predictive variance of 𝑦; for additive Gaussian noise, 𝑠 2 = var[𝑦 | π‘₯, D] = 𝜎 2 + πœŽπ‘›2
𝚺 a covariance matrix, usually the Gram matrix associated with x: 𝚺 = 𝐾D (x, x)
𝜎2 predictive variance of πœ™: 𝜎 2 = 𝐾D (π‘₯, π‘₯)
πœŽπ‘›2 variance of measurement error at π‘₯: πœŽπ‘›2 = var[πœ€ | π‘₯]
std[πœ”] standard deviation of random variable πœ”
T (πœ™; πœ‡, 𝜎 ,2 𝜈) Student-𝑑 distribution on πœ™ with 𝜈 degrees of freedom, mean πœ‡, and variance 𝜎 2
TN (πœ™; πœ‡, 𝜎 ,2 𝐼 ) truncated normal distribution, N (πœ™; πœ‡, 𝜎 2 ) truncated to interval 𝐼
τ    either decision horizon (in the context of decision making) or number of optimization iterations passed (in the context of asymptotic analysis)
Θ    is asymptotically bounded above and below by: f = Θ(g) if f = O(g) and f = Ω(g)
θ    vector of hyperparameters indexing a model space M
tr A    trace of square matrix A
u(D)    utility of data D
var[ω]    variance of random variable ω
x    putative input location of the objective function
x (bold)    either a sequence of observed locations x = {xᵢ} or (when the distinction is important) a vector-valued input location
x*    a location attaining the globally maximal value of f: x* ∈ arg max f; f(x*) = f*
X    domain of objective function
y    value resulting from an observation at x
y (bold)    observed values resulting from observations at locations x
z    z-score of measurement y at x: z = (y − μ)/s
1 INTRODUCTION

Optimization is an innate human behavior. On an individual level, we all strive to better ourselves and our surroundings. On a collective level, societies struggle to allocate limited resources seeking to improve the welfare of their members, and optimization has been an engine of societal progress since the domestication of crops through selective breeding over 12 000 years ago – an effort that continues to this day.
Given its pervasiveness, it should perhaps not be surprising that optimization is also difficult. While searching for an optimal design, we must spend – sometimes quite significant – resources evaluating suboptimal alternatives along the way. This observation compels us to seek methods of optimization that, when necessary, can carefully allocate resources to identify optimal parameters as efficiently as possible. This is the goal of mathematical optimization.
Since the 1960s, the statistics and machine learning communities have refined a Bayesian approach to optimization that we will develop and explore in this book. Bayesian optimization routines rely on a statistical model of the objective function, whose beliefs guide the algorithm in making the most fruitful decisions. These models can be quite sophisticated, and maintaining them throughout optimization may entail significant cost of its own. However, the reward for this effort is unparalleled sample efficiency. For this reason, Bayesian optimization has found a niche in optimizing objectives that:
β€’ are costly to compute, precluding exhaustive evaluation,
β€’ lack a useful expression, causing them to function as "black boxes,"
β€’ cannot be evaluated exactly, but only through some indirect or noisy mechanism, and/or
β€’ offer no efficient mechanism for estimating their gradient.
Let us consider an example setting motivating the machine learning community's recent interest in Bayesian optimization. Consider a data scientist crafting a complex machine learning model – say a deep neural network – from training data. To ensure success, the scientist must carefully tune the model's hyperparameters, including the network architecture and details of the training procedure, which have massive influence on performance. Unfortunately, effective settings can only be identified via trial and error: by training several networks with different settings and evaluating their performance on a validation dataset.
The search for the best hyperparameters is of course an exercise in optimization. Mathematical optimization has been under continual development for centuries, and numerous off-the-shelf procedures are available. However, these procedures usually make assumptions about the objective function that may not always be valid. For example, we might assume that the objective is cheap to evaluate, that we can easily compute its gradient, or that it is convex, allowing us to reduce from global to local optimization.


In hyperparameter tuning, all of these assumptions are invalid. Training a deep neural network can be extremely expensive in terms of both time and energy. When some hyperparameters are discrete – as many features of network architecture naturally are – the gradient does not even exist. Finally, the mapping from hyperparameters to performance may be highly complex and multimodal, so local refinement may not yield an acceptable result.
The Bayesian approach to optimization allows us to relax all of these assumptions when necessary, and Bayesian optimization algorithms can deliver impressive performance even when optimizing complex "black box" objectives under severely limited observation budgets. Bayesian optimization has proven successful in settings spanning science, engineering, and beyond, including of course hyperparameter tuning.¹ In light of this broad success, Gelman and Vehtari identified adaptive decision analysis – and Bayesian optimization in particular – as one of the eight most important statistical ideas of the past 50 years.²

1 R. Turner et al. (2021). Bayesian Optimization Is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the Black-Box Optimization Challenge 2020. Proceedings of the NeurIPS 2020 Competition and Demonstration Track.
2 A. Gelman and A. Vehtari (2021). What Are the Most Important Statistical Ideas of the Past 50 Years? Journal of the American Statistical Association 116(536):2087–2097.
Covering all these applications and their nuances could easily fill a separate volume (although we do provide an overview of some important application domains in an annotated bibliography; see Appendix D, p. 313), so in this book we will settle for developing the mathematical foundation of Bayesian optimization underlying its success. In the remainder of this chapter we will lay important groundwork for this discussion. We will first establish the precise formulation of optimization we will consider and important conventions of our presentation, then outline and illustrate the key aspects of the Bayesian approach. The reader may find an outline of and reading guide for the chapters to come in the Preface.

1.1 formalization of optimization


Throughout this book we will consider a simple but flexible formulation of sequential global optimization outlined below. There is nothing inherently Bayesian about this model, and countless solutions are possible.
We begin with a real-valued objective function defined on some domain X; f : X → ℝ. We make no assumptions regarding the nature of the domain. In particular, it need not be Euclidean but might instead, for example, comprise a space of complex structured objects. The goal of optimization is to systematically search the domain for a point x* attaining the globally maximal value f*:³

x* ∈ arg max_{x ∈ X} f(x);    f* = max_{x ∈ X} f(x) = f(x*).    (1.1)

[Margin figure: an objective function with the location, x*, and value, f*, of the global optimum marked.]

Before we proceed, we note that our focus on maximization rather than minimization is entirely arbitrary; the author simply judges maximization to be the more optimistic choice. If desired, we can freely transform one problem to the other by negating the objective function. We caution the reader that some translation may be required when comparing expressions derived here to what may appear in parallel texts focusing on minimization.

3 A skeptical reader may object that, without further assumptions, a global maximum may not exist at all! We will sidestep this issue for now and pick it up again in § 2.7, p. 34.

Algorithm 1.1: Sequential optimization.

input: initial dataset D                      ▶ can be empty
repeat
    x ← policy(D)                             ▶ select the next observation location
    y ← observe(x)                            ▶ observe at the chosen location
    D ← D ∪ {(x, y)}                          ▶ update dataset
until termination condition reached           ▶ e.g., budget exhausted
return D

In a significant departure from classical mathematical optimization, we do not require that the objective function have a known functional form or even be computable directly. Rather, we only require access to a mechanism revealing some information about the objective function at identified points on demand. By amassing sufficient information from this mechanism, we may hope to infer the solution to (1.1). Avoiding the need for an explicit expression for f allows us to consider so-called "black box" optimization, where a system is optimized through indirect measurements of its quality. This is one of the greatest strengths of Bayesian optimization.⁴

4 Of course, we do not require but merely allow that the objective function act as a black box. Access to a closed-form expression does not preclude a Bayesian approach!

Optimization policy
Directly solving for the location of global optima is infeasible except in exceptional circumstances. The tools of traditional calculus are virtually powerless in this setting; for example, enumerating and classifying every stationary point in the domain would be tedious at best and perhaps even impossible. Mathematical optimization instead takes an indirect approach: we design a sequence of experiments to probe the objective function for information that, we hope, will reveal the solution to (1.1).
The iterative procedure in Algorithm 1.1 formalizes this process. We begin with an initial (possibly empty) dataset D that we grow incrementally through a sequence of observations of our design. In each iteration, an optimization policy inspects the available data and selects a point x ∈ X where we make our next observation.⁵ This action in turn reveals a corresponding value y provided by the system under study. We append the newly observed information to our dataset and finally decide whether to continue with another observation or terminate and return the current data. When we inevitably do choose to terminate, the returned data can be used by an external consumer as desired, for example to inform a subsequent decision (§ 5.1, p. 90).
We place no restrictions on how an optimization policy is implemented beyond mapping an arbitrary dataset to some point in the domain for evaluation. A policy may be deterministic or stochastic, as demonstrated respectively by the prototypical examples of grid search and random search. In fact, these popular policies are nonadaptive and completely ignore the observed data. However, when observations only come at significant cost, we will naturally prefer policies that adapt their behavior in light of evolving information. The primary challenge in optimization is designing policies that can rapidly optimize a broad class of objective functions, and intelligent policy design will be our focus for the majority of this book.

5 Here "policy" has the same meaning as in other decision-making contexts: it maps our state (indexed by our data, D) to an action (the location of our next observation, x).

Observation model
For optimization to be feasible, the observations we obtain must provide information about the objective function that can guide our search and in aggregate determine the solution to (1.1). A near-universal assumption in mathematical optimization is that observations yield exact evaluations of the objective function at our chosen locations. However, this assumption is unduly restrictive: many settings feature inexact measurements due to noisy sensors, imperfect simulation, or statistical approximation. A typical example featuring additive observation noise is shown in the margin. Although the objective function is not observed directly, the noisy measurements nonetheless constrain the plausible options due to strong dependence on the objective.

[Margin figure: inexact observations of an objective function corrupted by additive noise.]

We thus relax the assumption of exact observation and instead assume that observations are realized by a stochastic mechanism depending on the objective function. Namely, we assume that the value y resulting from an observation at some point x is distributed according to an observation model depending on the underlying objective function value φ = f(x):

p(y | x, φ).    (1.2)

Through judicious design of the observation model, we may consider a wide range of observation mechanisms.
As with the optimization policy, we do not make any assumptions about the nature of the observation model, save one. Unless otherwise mentioned, we assume that a set of multiple measurements y are conditionally independent given the corresponding observation locations x and objective function values 𝝓 = f(x):

p(y | x, 𝝓) = ∏ᵢ p(yᵢ | xᵢ, φᵢ).    (1.3)

This is not strictly necessary but is overwhelmingly common in practice and will simplify our presentation considerably.
One particular observation model will enjoy most of our attention in this book: additive Gaussian noise. Here we model the value y observed at x as

y = φ + ε,

where ε represents measurement error. Errors are assumed to be Gaussian distributed with mean zero, implying a Gaussian observation model:

p(y | x, φ, σ_n) = N(y; φ, σ_n²).    (1.4)

Here the observation noise scale σ_n may optionally depend on x, allowing us to model both homoskedastic and heteroskedastic errors (§ 2.2, p. 25).

[Margin figure: additive Gaussian noise – the distribution of the value y observed at x is Gaussian, centered on the objective function value φ.]
If we take the noise scale to be identically zero, we recover the special case of exact observation, where we simply have y = φ and the observation model collapses to a Dirac delta distribution:

p(y | φ) = δ(y − φ).

Although not universally applicable, many settings do feature exact observations, such as optimizing the output of a deterministic computer simulation. We will sometimes consider the exact case separately as some results simplify considerably in the absence of measurement error.

[Margin figure: exact observations – every value measured equals the corresponding function value, yielding a Dirac delta observation model.]

We will focus on additive Gaussian noise as it is a reasonably faithful model for many systems and offers considerable mathematical convenience. This observation model will be most prevalent in our discussion on Gaussian processes in the next three chapters and on the explicit computation of Bayesian optimization policies with this model class in Chapter 8. However, the general methodology we will build in the remainder of this book is not contingent on this choice, and we will occasionally address alternative observation mechanisms (§ 2.8, p. 35; § 11.11, p. 282).

Termination
The final decision we make in each iteration of optimization is whether to terminate immediately or continue with another observation. As with the optimization policy, we do not assume any particular mechanism by which this decision is made. Termination may be deterministic – such as stopping after reaching a certain optimization goal or exhausting a preallocated observation budget – or stochastic, and may optionally depend on the observed data. In many cases, the time of termination may in fact not be under the control of the optimization routine at all but instead decided by an external agent. However, we will also consider scenarios where the optimization procedure can dynamically choose when to return based upon inspection of the available data (§ 5.4, p. 103; § 9.3, p. 210).

1.2 the bayesian approach


Bayesian optimization does not refer to one particular algorithm but
rather to a philosophical approach to optimization grounded in Bayes-
ian inference from which an extensive family of algorithms have been
derived. Although these algorithms display significant diversity in their
details, they are bound by common themes in their design.
Optimization is fundamentally a sequence of decisions: in each it-
eration, we must choose where to make our next observation and then
whether to terminate depending on the outcome. As the outcomes of
these decisions are governed by the system under study and outside our
control, the success of optimization rests entirely on effective decision
making.
Increasing the difficulty of these decisions is that they must be made
under uncertainty, as it is impossible to know the outcome of an observa-
tion before making it. The optimization policy must therefore design each
observation with some measure of faith that the outcome will ultimately
prove beneficial and justify the cost of obtaining it. The sequential nature
of optimization further compounds the weight of this uncertainty, as the
outcome of each observation not only has an immediate impact, but also
forms the basis on which all future decisions are made. Developing an
effective policy requires somehow addressing this uncertainty.
The Bayesian approach systematically relies on probability and Bayes-
ian inference to reason about the uncertain quantities arising during
optimization. This critically includes the objective function itself, which
is treated as a random variable to be inferred in light of our prior ex-
pectations and any available data. In Bayesian optimization, this belief
then takes an active role in decision making by guiding the optimiza-
tion policy, which may evaluate the merit of a proposed observation
location according to our belief about the value we might observe. We
introduce the key ideas of this process with examples below, starting
with a refresher on Bayesian inference.

Bayesian inference
To frame the following discussion, we offer a quick overview of Bayesian inference as a reminder to the reader. This introduction is far from complete, but there are numerous excellent references available.⁶

6 The literature is vast. The following references are excellent, but no list can be complete: D. J. C. MacKay (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press; A. O'Hagan and J. Forster (2004). Kendall's Advanced Theory of Statistics. Vol. 2B: Bayesian Inference. Arnold; J. O. Berger (1985). Statistical Decision Theory and Bayesian Analysis. Springer–Verlag.

Bayesian inference is a framework for inferring uncertain features of a system of interest from observations grounded in the laws of probability. To illustrate the basic ideas, we may begin by identifying some unknown feature of a given system that we wish to reason about. In the context of optimization, this might represent, for example, the value of the objective function at a given location, or the location x* or value f* of the global optimum (1.1). We will take the first of these as a running example: inferring the value of an objective function at some arbitrary point x, φ = f(x). We will shortly extend this example to inference about the entire objective function.
In the Bayesian approach to inference, all unknown quantities are treated as random variables. This is a powerful convention as it allows us to represent beliefs about these quantities with probability distributions reflecting their plausible values. Inference then takes the form of an inductive process where these beliefs are iteratively refined in light of observed data by appealing to probabilistic identities.
As with any induction, we must start somewhere. Here we begin with a so-called prior distribution (or simply prior) p(φ | x), which encodes what we consider to be plausible values for φ before observing any data.⁷ The prior distribution allows us to inject our knowledge about and experience with the system of interest into the inferential process, saving us from having to begin "from scratch" or entertain patently absurd possibilities. The left panel of Figure 1.1 illustrates a prior distribution for our example, indicating support over a range of values.

7 Here we assume the location of interest x is known, hence our conditioning the prior on its value.

Figure 1.1: Bayesian inference for an unknown function value φ = f(x). Left: a prior distribution over φ; middle: the likelihood of the marked observation y according to an additive Gaussian noise observation model (1.4) (prior shown for reference); right: the posterior distribution in light of the observation and the prior (prior and likelihood shown for reference).

Once a prior has been established, the next stage of inference is to refine our initial beliefs in light of observed data. Suppose in our example we make an observation of the objective function at x, revealing a measurement y. In our model of optimization, the distribution of this measurement is assumed to be determined by the value of interest φ through the observation model p(y | x, φ) (1.2). In the context of Bayesian inference, a distribution explaining the observed values (here y) in terms of the values of interest (here φ) is known as a likelihood function or simply a likelihood. The middle panel of Figure 1.1 shows the likelihood – as a function of φ – for a given measurement y, here assumed to be generated by additive Gaussian noise (1.4).
Finally, given the observed value y, we may derive the updated posterior distribution (or simply posterior) of φ by appealing to Bayes' theorem:

p(φ | x, y) = p(φ | x) p(y | x, φ) / p(y | x).    (1.5)

The posterior is proportional to the prior weighted by the likelihood of the observed value. The denominator is a constant with respect to φ that ensures normalization:

p(y | x) = ∫ p(y | x, φ) p(φ | x) dφ.    (1.6)

The right panel of Figure 1.1 shows the posterior resulting from the measurement in the middle panel. The posterior represents a compromise between our experience (encoded in the prior) and the information contained in the data (encoded in the likelihood).
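For a Gaussian prior on φ combined with the Gaussian observation model (1.4), Bayes' theorem (1.5) has a closed form, and the posterior can be computed directly. The following sketch works through this conjugate special case; the prior parameters and the measured value are made up purely for illustration.

```python
def gaussian_posterior(mu0, sigma0, y, sigma_n):
    """Posterior of phi for a N(mu0, sigma0^2) prior and a measurement
    y | phi ~ N(phi, sigma_n^2), via Bayes' theorem (1.5)."""
    precision = 1.0 / sigma0**2 + 1.0 / sigma_n**2   # precisions add
    var = 1.0 / precision
    mean = var * (mu0 / sigma0**2 + y / sigma_n**2)  # precision-weighted compromise
    return mean, var

# made-up numbers: a broad prior and a single noisy measurement
mean, var = gaussian_posterior(mu0=0.0, sigma0=1.0, y=1.2, sigma_n=0.5)
print(mean, var)  # (0.96, 0.2): pulled toward the data, with reduced uncertainty
```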
Throughout this book we will use the catchall notation D to represent all the information influencing a posterior belief; here the relevant information is D = (x, y), and the posterior distribution is then p(φ | D).
As mentioned previously, Bayesian inference is an inductive process whereby we can continue to refine our beliefs through additional observation. At this point, the induction is trivial: to incorporate a new observation, what was our posterior serves as the prior in the context of the new information, and multiplying by the likelihood and renormalizing yields a new posterior. We may continue in this manner as desired.
The posterior distribution is not usually the end result of Bayesian inference but rather a springboard enabling follow-on tasks such as prediction or decision making, both of which are integral to Bayesian optimization. To address the former, suppose that after deriving the posterior (1.5), we wish to predict the result of an independent, repeated noisy observation at x, y′. Treating the outcome as a random variable, we may derive its distribution by integrating our posterior belief about φ against the observation model (1.2):⁸

p(y′ | x, D) = ∫ p(y′ | x, φ) p(φ | x, D) dφ;    (1.7)

this is known as the posterior predictive distribution for y′. By integrating over all possible values of φ weighted by their plausibility, the posterior predictive distribution naturally accounts for uncertainty in the unknown objective function value; see the figure in the margin.

[Margin figure: posterior predictive distribution for a repeated measurement at x for our running example. The location of our first measurement y and the posterior distribution of φ are shown for reference. There is more uncertainty in y′ than in φ due to the effect of observation noise.]

8 This expression takes the same form as (1.6), which is simply the (prior) predictive distribution evaluated at the actual observed value.

The Bayesian approach to decision making also relies on a posterior belief about unknown features affecting the outcomes of our decisions, as we will discuss shortly.
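In the same conjugate Gaussian setting, the integral (1.7) is also available in closed form: the predictive distribution of a repeated noisy measurement y′ is Gaussian with the posterior mean and with variance inflated by the noise variance. A minimal sketch, continuing the made-up numbers from the earlier example:

```python
def gaussian_posterior_predictive(post_mean, post_var, sigma_n):
    """Posterior predictive (1.7): integrating N(y'; phi, sigma_n^2) against
    N(phi; post_mean, post_var) gives y' | D ~ N(post_mean, post_var + sigma_n^2)."""
    return post_mean, post_var + sigma_n**2

# posterior moments from the earlier illustrative update
pred_mean, pred_var = gaussian_posterior_predictive(post_mean=0.96, post_var=0.2, sigma_n=0.5)
print(pred_mean, pred_var)  # variance exceeds that of phi alone, as in the margin figure
```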

Bayesian inference of the objective function


At the heart of any Bayesian optimization routine is a probabilistic belief over the objective function. This takes the form of a stochastic process, a probability distribution over an infinite collection of random variables – here the objective function value at every point. The reasoning behind this inference is, in essence, the same as our single-point example above.
We begin by encoding any assumptions we may have about the objective function, such as smoothness or other features, in a prior process p(f). Conveniently, we can specify a stochastic process via the distribution of the function values 𝝓 corresponding to an arbitrary finite set of locations x:

p(𝝓 | x).    (1.8)

The family of Gaussian processes – where these finite-dimensional distributions are multivariate Gaussian – is especially convenient and widely used in Bayesian optimization. We will explore this model class in depth in the following three chapters; here we provide a motivating illustration.
Figure 1.2 shows a Gaussian process prior on a one-dimensional objective function, constructed to reflect a minimal set of assumptions we will elaborate on later in the book:
β€’ that the objective function is smooth (that is, infinitely differentiable),
β€’ that correlations among function values have a characteristic scale, and
β€’ that the function's expected behavior does not depend on location (that is, the prior process is stationary).
Figure 1.2: An example prior process for an objective defined on an interval. We illustrate the marginal belief at every location with its mean and a 95% credible interval and also show three example functions sampled from the prior process.

We summarize the marginal belief of the model, for each point in the
domain showing the prior mean and a 95% credible interval for the cor-
responding function value. We also show three functions sampled from
the prior process, each exhibiting the assumed behavior. We encourage
the reader to become comfortable with this plotting convention, as we
will use it throughout this book. In particular we eschew axis labels, as
they are always the same: the horizontal axis represents the domain X
and the vertical axis the function value. Further, we do not mark units on
axes to stress relative rather than absolute behavior, as scale is arbitrary
in this illustration.
We can encode a vast array of information into the prior process
and can model significantly more complex structure than in this sim-
ple example. We will explore the world of possibilities in Chapter 3,
including interaction at different scales, nonstationarity and warping (Β§ 3.4, p. 56),
low intrinsic dimensionality (Β§ 3.5, p. 61), and more.
With the prior process in hand, suppose we now make a set of
observations at some locations x, revealing corresponding values y; we
aggregate this information into a dataset D = (x, y). Bayesian inference
accounts for these observations by forming the posterior process 𝑝 (𝑓 | D).
The derivation of the posterior process can be understood as a two-
stage process. First we consider the impact of the data on the correspond-
ing function values 𝝓 alone (1.5):

    𝑝 (𝝓 | D) ∝ 𝑝 (𝝓 | x) 𝑝 (y | x, 𝝓).          (1.9)

The quantities on the right-hand side are known: the first term is given
by the prior process (1.8), and the second by the observation model (1.3),
which serves the role of a likelihood. We now extend the posterior on 𝝓
to all of 𝑓 :9

    𝑝 (𝑓 | D) = ∫ 𝑝 (𝑓 | x, 𝝓) 𝑝 (𝝓 | D) d𝝓.          (1.10)

9 The given expression sweeps some details under the rug. A careful derivation of the posterior process proceeds by finding the posterior of an arbitrary finite-dimensional vector 𝝓 βˆ— = 𝑓 (xβˆ—): 𝑝 (𝝓 βˆ— | xβˆ—, D) = ∫ 𝑝 (𝝓 βˆ— | xβˆ—, x, 𝝓) 𝑝 (𝝓 | D) d𝝓, which specifies the process. The distributions on the right-hand side are known: the posterior on 𝝓 is in (1.9), and the posterior on 𝝓 βˆ— given the exact function values 𝝓 can be found by computing their joint prior (1.8) and conditioning.

The posterior encapsulates our belief regarding the objective in light of
the data, incorporating both the assumptions of the prior process and
the information contained in the observations.
We illustrate an example posterior in Figure 1.3, where we have
conditioned our prior from Figure 1.2 on three exact observations. As the

[Figure 1.3: The posterior process for our example scenario in Figure 1.2 conditioned on three exact observations.]

[Figure 1.4: A prototypical acquisition function corresponding to our example posterior from Figure 1.3.]

observations are assumed to be exact, the objective function posterior


collapses onto the observed values. The posterior mean interpolates
through the data, and the posterior credible intervals reflect increased
certainty regarding the function near the observed locations. Further,
the posterior continues to reflect the structural assumptions encoded
in the prior, demonstrated by comparing the behavior of the samples
drawn from the posterior process to those drawn from the prior.

Uncertainty-aware optimization policies


Bayesian inference provides an elegant means of reasoning about an
uncertain objective function, but the success of optimization is measured
not by the fidelity of our beliefs but by the outcomes of our actions.
These actions are determined by the optimization policy, which exam-
ines available data to design each successive observation location. Each
of these decisions is fraught with uncertainty, as we must commit to each
observation before knowing its result, which will form the context of all
following decisions. Bayesian inference enables us to express this uncer-
tainty, but effective decision making additionally requires us to establish
preferences over outcomes and act to maximize those preferences.
To proceed we need to establish a framework for decision making
under uncertainty, an expansive subject with a world of possibilities.
A natural and common choice is Bayesian decision theory, the subject
of Chapters 5–6 (Chapter 5: Decision Theory for Optimization, p. 87;
Chapter 6: Utility Functions for Optimization, p. 109). We will discuss
this and other approaches to policy construction at length in Chapter 7
(Common Bayesian Optimization Policies, p. 123) and derive popular
optimization policies from first principles.
Ignoring details in policy design, a thread running through all Bayes-
ian optimization policies is a uniform handling of uncertainty in the
objective function and the outcomes of observations via Bayesian infer-

ence. Instrumental in connecting our beliefs about the objective function


to decision making is the posterior predictive distribution (1.7), represent-
ing our belief about the outcomes of proposed observations. Bayesian
optimization policies are designed with reference to this distribution,
which guides the policy in discriminating between potential actions.
In practice, Bayesian optimization policies are defined indirectly by
optimizing a so-called acquisition function (Β§ 5, p. 88) assigning a score to potential
observation locations commensurate with their perceived ability to ben-
efit the optimization process. Acquisition functions tend to be cheap to
evaluate with analytically tractable gradients, allowing the use of off-
the-shelf optimizers to efficiently design each observation. Numerous
acquisition functions have been proposed for Bayesian optimization, each
derived from different considerations. However, all notable acquisition
functions address the classic tension between exploitation – sampling
where the objective function is expected to be high – and exploration –
sampling where we are uncertain about the objective function to inform
future decisions. These opposing concerns must be carefully balanced
for effective global optimization.
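To sketch this interface concretely: given posterior summaries over a dense grid of candidate locations, the policy simply scores each candidate and observes where the score is highest. The snippet below uses a simple upper confidence bound as the acquisition function and entirely hypothetical posterior summaries; it is an illustration of the mechanics, not any specific policy derived later in the book.

    import numpy as np

    def ucb(mu, sigma, beta=2.0):
        # One simple acquisition function (an upper confidence bound): reward a
        # high posterior mean (exploitation) and high uncertainty (exploration).
        return mu + beta * sigma

    # Stand-in posterior summaries on a dense grid of candidate locations.
    xs = np.linspace(0, 30, 1000)
    mu = np.sin(xs / 3)                       # hypothetical posterior mean
    sigma = 0.2 + 0.1 * np.cos(xs / 5)**2     # hypothetical posterior standard deviation

    scores = ucb(mu, sigma)
    x_next = xs[np.argmax(scores)]            # the policy's next observation location
    print(x_next)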
An example acquisition function is shown in Figure 1.4, correspond-
ing to the posterior from Figure 1.3. Consideration of the exploitation–
exploration tradeoff is apparent: this example acquisition function attains
relatively large values both near local maxima of the posterior mean
and in regions with significant marginal uncertainty. Local maxima of
the acquisition function represent optimal compromises between these
concerns. Note that the acquisition function vanishes at the location of
the current observations: the objective function values at these locations
are already known, so observing there would be pointless. Maximizing
the acquisition function determines the policy; here the policy chooses
to search around the local optimum on the right-hand side.
Figure 1.5 demonstrates an entire session of Bayesian optimization,
beginning from the belief and initial decision from Figure 1.4 and pro-
gressing iteratively following Algorithm 1.1. The true (unknown) objec-
tive function is also shown for reference; its maximum is near the center
of the domain. The running marks below each posterior show the loca-
tions of each measurement made, progressing in sequence from top to
bottom, and we show the objective function posterior at four waypoints.
Dynamic consideration of the exploitation–exploration tradeoff is
evident in the algorithm’s behavior. The first two observations map out
the neighborhood of the initially best-seen point, exhibiting exploitation.
Once sufficiently explored, the policy continues exploitation around the
second best-seen point, discovering and refining the global optimum in
iterations 7–8. Finally, the policy switches to exploration in iterations
13–19, systematically covering the domain to ensure nothing has been
missed. At termination, there is clear bias in the collected data toward
higher objective values, and all remaining uncertainty is in regions where
the credible intervals indicate the optimum is unlikely to reside.
[Figure 1.5: The posterior after the indicated number of steps of an example Bayesian optimization policy, starting from the posterior in Figure 1.4. The marks show the points chosen by the policy, progressing from top to bottom. Observations sufficiently close to the optimum are marked in bold; the optimum was located on iteration 7. The panels (iterations 1–19) are annotated with the prevailing behavior in each phase: exploitation, exploitation, and finally exploration.]

The β€œmagic” of Bayesian optimization is that the intuitive behavior
of this optimization policy is not the result of ad hoc design, but rather
emerges automatically through the machinery of Gaussian processes and
Bayesian decision theory that we will develop over the coming chapters.
In this framework, building an optimization policy boils down to:
β€’ choosing a model of the objective function,
β€’ deciding what sort of data we seek to obtain, and
β€’ systematically transforming these beliefs and preferences into an opti-
mization policy.
Over the following chapters, we will develop tools for achieving each
of these goals: Gaussian processes (Chapters 2–4) for expressing what
we believe about the objective function, utility functions (Chapter 6)
for expressing what we value in data, and Bayesian decision theory
(Chapter 5) for building optimization policies aware of the uncertainty
encoded in the model and guided by the preferences encoded in the utility
function. In Chapter 7 we will combine these fundamental components
to realize complete Bayesian optimization policies, at which point we
will be equipped to replicate this example from first principles.
2
GAUSSIAN PROCESSES

The central object in optimization is an objective function 𝑓 : X β†’ ℝ, and


the primary challenge in algorithm design is inherent uncertainty about
this function: most importantly, where is the function maximized and
what is its maximal value? Prior to optimization, we may very well have
no idea. Optimization affords us the opportunity to acquire information
about the objective – through observations of our own design – to shed
light on these questions. However, this process is itself fraught with
uncertainty, as we cannot know the outcomes and implications of these
observations at the time of their design. Notably, we face this uncertainty
even when we have a closed-form expression for the objective function,
a favorable position as many objectives act as β€œblack boxes.”
Reflecting on this situation, diaconis posed an intriguing question:1
β€œwhat does it mean to β€˜know’ a function?” The answer is unclear when an
analytic expression, which might at first glance seem to encapsulate the
essence of the function, is insufficient to determine features of interest.

1 p. diaconis (1988). Bayesian Numerical Analysis. In: Statistical Decision Theory and Related Topics iv.


However, diaconis argued that although we may not know everything
about a function, we often have some prior knowledge that can facilitate
a numerical procedure such as optimization. For example, we may expect
an objective function to be smooth (or rough), or to assume values in
a given range, or to feature a relatively simple underlying trend, or to
depend on some hidden low-dimensional representation we hope to
uncover.2 All of this knowledge could be instrumental in accelerating 2 We will explore all of these possibilities in the
optimization if it could be systematically captured and exploited. next chapter, p. 45.
Having identifiable information about an objective function prior to
optimization motivates the Bayesian approach we will explore through-
out this book. We will address uncertainty in the objective function
through the unifying framework of Bayesian inference (Β§ 1.2, p. 8), treating 𝑓 – as
well as ancillary quantities such as π‘₯ βˆ— and 𝑓 βˆ— (1.1) – as random variables
to be inferred from observations revealed during optimization.
To pursue this approach, we must first determine how to build useful
prior distributions for objective functions and how to compute a poste-
rior belief given observations. If the system under investigation is well
understood, we may be able to identify an appropriate parametric form
𝑓 (π‘₯; 𝜽 ) and infer the parameters 𝜽 directly. This approach is likely the
best course of action when possible;3 however, many objective functions
have no obvious parametric form, and most models used in Bayesian
optimization are thus nonparametric to avoid undue assumptions.4

3 v. dalibard et al. (2017). boat: Building Auto-Tuners with Structured Bayesian Optimization. www 2017.
4 The term β€œnonparametric” is something of a misnomer. A nonparametric objective function model has parameters but their dimension is infinite – we effectively parameterize the objective by its value at every point.

In this chapter we will introduce Gaussian processes (gps), a conve-
nient class of nonparametric regression models widely used in Bayesian
optimization. We will begin by defining Gaussian processes and deriving
some basic properties, then demonstrate how to perform inference from
observations. In the case of exact observation and additive Gaussian
noise, we can perform this inference exactly, resulting in an updated
posterior Gaussian process. We will continue by considering some the-
oretical properties of Gaussian processes relevant to optimization and
inference with non-Gaussian observation models.


The literature on Gaussian processes is vast, and we do not intend this
chapter to serve as a standalone introduction but rather as companion to
the existing literature. Although our discussion will be comprehensive,
our focus on optimization will sometimes bias its scope. For a broad
overview, the interested reader may consult rasmussen and williams’s
classic monograph.5

5 c. e. rasmussen and c. k. i. williams (2006). Gaussian Processes for Machine Learning. mit Press.

2.1 definition and basic properties


A Gaussian process is an extension of the familiar multivariate normal dis-
tribution (Β§ a.2, p. 296) suitable for modeling functions on infinite domains. Gaussian
processes inherit the convenient mathematical properties of the multi-
variate normal distribution without sacrificing computational tractability.
Further, by modifying the structure of a gp, we can model functions with
a rich variety of behavior; we will explore this capability in the next
chapter (Chapter 3: Modeling with Gaussian Processes, p. 45). This
combination of mathematical elegance and flexibility in modeling has
established Gaussian processes as the workhorse of Bayesian approaches
to numerical tasks, including optimization.6, 7

6 p. hennig et al. (2015). Probabilistic Numerics and Uncertainty in Computations. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 471(2179):20150142.
7 p. hennig et al. (2022). Probabilistic Numerics: Computation as Machine Learning. Cambridge University Press.

Definition
Consider an objective function 𝑓 : X β†’ ℝ of interest over an arbitrary
infinite domain X.8 We will take a nonparametric approach and reason
about the function as an infinite collection of random variables, one
corresponding to the function value at every point in the domain. Mutual
dependence between these random variables will then determine the
statistical properties of the function’s shape.

8 If X is finite, there is no distinction between a Gaussian process and a multivariate normal distribution, so only the infinite case is interesting for this discussion.
It is perhaps not immediately clear how we can specify a useful distri-
bution over infinitely many random variables, a construction known as a
stochastic process. However, a result known as the Kolmogorov extension
theorem allows us to construct a stochastic process by defining only the
distribution of arbitrary finite sets of function values, subject to natural
consistency constraints.9 For a Gaussian process, these finite-dimensional
distributions are all multivariate Gaussian, hence its name.

9 b. ΓΈksendal (2013). Stochastic Differential Equations: An Introduction with Applications. Springer–Verlag. [Β§ 2.1]
In this light, we build a Gaussian process by replacing the parameters
in the finite-dimensional case – a mean vector and a positive semidefinite
covariance matrix – by analogous functions over the domain. We specify
a Gaussian process on 𝑓:10

    𝑝 (𝑓 ) = GP (𝑓 ; πœ‡, 𝐾)

by a mean function πœ‡ : X β†’ ℝ and a positive semidefinite covariance
function (or kernel) 𝐾 : X Γ— X β†’ ℝ. The mean function determines the
expected function value πœ™ = 𝑓 (π‘₯) at any location π‘₯:

    πœ‡ (π‘₯) = 𝔼[πœ™ | π‘₯],

thus serving as a location parameter representing the function’s central
tendency. The covariance function determines how deviations from the
mean are structured, encoding expected properties of the function’s
behavior. Defining πœ™ β€² = 𝑓 (π‘₯ β€²), we have:

    𝐾 (π‘₯, π‘₯ β€²) = cov[πœ™, πœ™ β€² | π‘₯, π‘₯ β€²].          (2.1)

10 Writing the process as if it were a function-valued probability density function is an abuse of notation, but a useful and harmless one.

The mean and covariance functions of the process allow us to com-


pute any finite-dimensional marginal distribution on demand. Let x βŠ‚ X
be finite and let 𝝓 = 𝑓 (x) be the corresponding function values, a vector-
valued random variable. For the Gaussian process (2.1), the distribution
of 𝝓 is multivariate normal with parameters determined by the mean
and covariance functions:

    𝑝 (𝝓 | x) = N (𝝓; 𝝁, 𝚺),          (2.2)

where

    𝝁 = 𝔼[𝝓 | x] = πœ‡ (x);    𝚺 = cov[𝝓 | x] = 𝐾 (x, x).          (2.3)

Here 𝐾 (x, x) is the matrix formed by evaluating the covariance function
for each pair of points: Σ𝑖𝑗 = 𝐾 (π‘₯𝑖 , π‘₯ 𝑗 ), also called the Gram matrix of x.
In many ways, Gaussian processes behave like β€œreally big” Gaussian
distributions, and one can intuit many of their properties from this
heuristic alone. For example, the Gaussian marginal property in (2.2–
2.3) corresponds precisely with the analogous formula in the finite-
dimensional case (a.13). Further, this property automatically ensures
global consistency in the following sense.11 If x is an arbitrary set of
points and xβ€² βŠƒ x is a superset, then we arrive at the same belief about
𝝓 whether we compute it directly from (2.2–2.3) or indirectly by first
computing 𝑝 (𝝓 β€² | xβ€²) then marginalizing 𝝓 β€² \ 𝝓 (a.13; marginalizing multivariate normal distributions, Β§ a.2, p. 299).

11 In fact, this is precisely the consistency required by the Kolmogorov extension theorem mentioned on the facing page.

Example and basic properties


Let us construct and explore an explicit Gaussian process for a function
on the interval X = [0, 30]. For the mean function we take the zero
function πœ‡ ≑ 0, indicating a constant central tendency. For the covariance
function, we take the prototypical squared exponential covariance (Β§ 3.3, p. 51):

    𝐾 (π‘₯, π‘₯ β€²) = exp(βˆ’Β½ |π‘₯ βˆ’ π‘₯ β€²|Β²).          (2.4)

Let us pause to consider the implications of this choice. First, note that
var[πœ™ | π‘₯] = 𝐾 (π‘₯, π‘₯) = 1 at every point π‘₯ ∈ X, and thus the covariance
function (2.4) also measures the correlation between the function values
πœ™ and πœ™ β€². This correlation decreases with the distance between π‘₯ and π‘₯ β€²,
falling from unity to zero as these points become increasingly separated;
see the illustration in the margin. [Margin figure: the squared exponential
covariance (2.4) as a function of the distance between inputs, decaying
from 1 at zero separation toward 0 by |π‘₯ βˆ’ π‘₯ β€²| β‰ˆ 3.] We can loosely interpret this as a
statistical consequence of continuity: function values at nearby locations
are highly correlated, whereas function values at distant locations are
effectively independent. This assumption also implies that observing
the function at some point π‘₯ provides nontrivial information about the
function at sufficiently nearby locations (roughly when |π‘₯ βˆ’ π‘₯ β€²| < 3). We
will explore this implication further shortly.
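A minimal sketch of this covariance and the Gram matrix it induces (2.3), using arbitrary example locations, may make the correlation decay tangible:

    import numpy as np

    def k_se(x1, x2):
        # squared exponential covariance (2.4)
        return np.exp(-0.5 * np.abs(x1 - x2)**2)

    x = np.array([0.0, 1.0, 5.0])             # an arbitrary finite set of locations
    Sigma = k_se(x[:, None], x[None, :])      # Gram matrix: Sigma[i, j] = K(x_i, x_j)

    print(np.diag(Sigma))                     # unit marginal variance everywhere
    print(k_se(0.0, 1.0))                     # ~0.61: nearby values strongly correlated
    print(k_se(0.0, 5.0))                     # ~4e-6: distant values effectively independent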
[Figure 2.1: Our example Gaussian process on the domain X = [0, 30]. We illustrate the marginal belief at every location with its mean and a 95% credible interval and also show three example functions sampled from the process.]

For a Gaussian process, the marginal distribution of any single func-
tion value is univariate normal (2.2):

𝑝 (πœ™ | π‘₯) = N (πœ™; πœ‡, 𝜎 2 ); πœ‡ = πœ‡ (π‘₯); 𝜎 2 = 𝐾 (π‘₯, π‘₯), (2.5)

where we have abused notation slightly by overloading the symbol πœ‡.


This allows us to derive pointwise credible intervals; for example, the
familiar πœ‡ Β± 1.96𝜎 is a 95% credible interval for πœ™. Examining our example
gp, the marginal distribution of every function value is in fact standard
normal. We provide a rough visual summary of the process via its mean
function and pointwise 95% predictive credible intervals in Figure 2.1.
There is nothing terribly exciting we can glean from these marginal
distributions alone, and no interesting structure in the process is yet
apparent.

Sampling
We may gain more insight by inspecting samples drawn from our exam-
ple process reflecting the joint distribution of function values. Although
it is impossible to represent an arbitrary function on X in finite memory,
we can approximate the sampling process by taking a dense grid x βŠ‚ X
and sampling the corresponding function values from their joint multi-
variate normal distribution (2.2; sampling from a multivariate normal distribution: Β§ a.2, p. 299). Plotting the sampled vectors against
the chosen grid reveals curves approximating draws from the Gaussian
process. Figure 2.1 illustrates this procedure for our example using a grid
of 1000 equally spaced points. Each sample is smooth and has several
local optima distributed throughout the domain – for some applications,
this might be a reasonable model for an objective function on X.
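A minimal sketch of this sampling procedure, assuming the zero mean and squared exponential covariance above; the small β€œjitter” added to the diagonal is a standard numerical safeguard and not part of the model:

    import numpy as np

    def k_se(x1, x2):
        # squared exponential covariance (2.4)
        return np.exp(-0.5 * np.abs(x1 - x2)**2)

    xs = np.linspace(0, 30, 1000)                   # dense grid on X = [0, 30]
    mu = np.zeros_like(xs)                          # zero mean function
    Sigma = k_se(xs[:, None], xs[None, :])          # Gram matrix of the grid (2.3)
    Sigma += 1e-8 * np.eye(xs.size)                 # jitter for numerical stability

    rng = np.random.default_rng(0)
    samples = rng.multivariate_normal(mu, Sigma, size=3)   # three approximate draws
    # each row of `samples`, plotted against `xs`, approximates a draw from the gp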

2.2 inference with exact and noisy observations


We now turn our attention to inference: given a Gaussian process
prior on an objective function, how can we condition this initial belief
on observations obtained during optimization?

[Figure 2.2: The posterior for our example scenario in Figure 2.1 conditioned on three exact observations.]

Let us look at an example to build intuition before diving into the
details. Figure 2.2 shows the effect of conditioning our example gp from
the previous section on three exact measurements of the function. The


updated belief reflects both our prior assumptions and the information
contained in the data, the hallmark of Bayesian inference. To elaborate,
the posterior mean smoothly interpolates through the observed values,
agreeing with both the measured values and the smoothness encoded
in the prior covariance function. The posterior credible intervals are
reduced in the neighborhood of the measured locations – where the
prior covariance function encodes nontrivial dependence on at least one
observed value – and vanish where the function value has been exactly
determined. On the other hand, our marginal belief remains effectively
unchanged from the prior in regions sufficiently isolated from the data,
where the prior covariance function encodes effectively no correlation.
Conveniently, inference is straightforward for the pervasive observa-
tion models of exact measurement and additive Gaussian noise, where
the self-conjugacy of the normal distribution yields a Gaussian process
posterior with updated parameters we can compute in closed form. The
reasoning underlying inference for both observation models is identical
and is subsumed by a flexible general argument we will present first.

Inference from arbitrary jointly Gaussian observations


We may exactly condition a Gaussian process 𝑝 (𝑓 ) = GP (𝑓 ; πœ‡, 𝐾) on the
observation of any vector y sharing a joint Gaussian distribution with 𝑓 :

    𝑝 (𝑓, y) = GP ([𝑓 ; y]; [πœ‡; m], [𝐾, πœ…βŠ€; πœ…, C]).          (2.6)

This notation, analogous to (a.12), extends the Gaussian process on 𝑓 to
include the entries of y; that is, we assume the distribution of any finite
subset of function and/or observed values is multivariate normal. We
specify the joint distribution via the marginal distribution of y:12

    𝑝 (y) = N (y; m, C)          (2.7)

and the cross-covariance function between y and 𝑓 :

    πœ… (π‘₯) = cov[y, πœ™ | π‘₯].          (2.8)

12 We assume C is positive definite; if it were only positive semidefinite, there would be wasteful linear dependence among observations.

Although it may seem absurd that we could identify and observe a vector
satisfying such strong restrictions on its distribution, we can already
deduce several examples from first principles, including:
β€’ any vector of function values (2.2; inference from exact observations: Β§ 2.2, p. 22),
β€’ any affine transformation of function values (a.10; affine transformations: Β§ a.2, p. 298), and
β€’ limits of such quantities, such as partial derivatives or expectations (Β§ 2.6, p. 30).
Further, we may condition on any of the above even if corrupted by
independent additive Gaussian noise, as we will shortly demonstrate.
We may condition the joint distribution (2.6) on y analogously to the
finite-dimensional case (a.14; conditioning a multivariate normal distribution: Β§ a.2, p. 299), resulting in a Gaussian process posterior
on 𝑓. Writing D = y for the observed data, we have:

    𝑝 (𝑓 | D) = GP (𝑓 ; πœ‡D , 𝐾D ),          (2.9)

where

    πœ‡D (π‘₯) = πœ‡ (π‘₯) + πœ… (π‘₯)⊀ Cβˆ’1 (y βˆ’ m);
    𝐾D (π‘₯, π‘₯ β€²) = 𝐾 (π‘₯, π‘₯ β€²) βˆ’ πœ… (π‘₯)⊀ Cβˆ’1 πœ… (π‘₯ β€²).          (2.10)

This can be verified by computing the joint distribution of an arbitrary
finite set of function values and y and conditioning on the latter (a.14).13

13 This is a useful exercise! The result will be a stochastic process with multivariate normal finite-dimensional distributions, a Gaussian process by definition (2.5).
The above result provides a simple procedure for gp posterior infer-
ence from any vector of observations satisfying (2.6):
1. compute the marginal distribution of y (2.7),
2. derive the cross-covariance function πœ… (2.8), and
3. find the posterior distribution of 𝑓 via (2.9–2.10).
We will realize this procedure for several special cases below. However,
we will first demonstrate how we may seamlessly handle measurements
corrupted by additive Gaussian noise and build intuition for the posterior
distribution by dissecting its moments in terms of the statistics of the
observations and the correlation structure of the prior.
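A minimal sketch of this procedure, assuming the observation mean m and covariance C (2.7), the cross-covariance function πœ… (2.8), and the prior moments are available as plain callables and arrays; in practice one would factorize C once (for example with a Cholesky decomposition) rather than solving repeatedly:

    import numpy as np

    def condition(mu, K, kappa, m, C, y):
        # Posterior moments (2.10) given observations y with prior moments m, C (2.7)
        # and cross-covariance function kappa (2.8). `mu` and `K` are the prior mean
        # and covariance functions; `kappa(x)` returns the vector cov[y, phi | x].
        resid = np.linalg.solve(C, y - m)
        def post_mean(x):
            return mu(x) + kappa(x) @ resid
        def post_cov(x1, x2):
            return K(x1, x2) - kappa(x1) @ np.linalg.solve(C, kappa(x2))
        return post_mean, post_cov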

Corruption by additive Gaussian noise


We pause to make one observation of immense practical importance:
any vector satisfying (2.6) would continue to suffice even if corrupted
by independent additive Gaussian noise, and thus we can use the above
result to condition a Gaussian process on noisy observations as well.
Suppose that rather than observing y exactly, our measurement
mechanism only allowed observing z = y + 𝜺 instead, where 𝜺 is a vector
of random errors independent of y. If the errors are normally distributed
with mean zero and known (arbitrary) covariance N:

    𝑝 (𝜺 | N) = N (𝜺; 0, N),          (2.11)

then we have (sums of normal vectors: Β§ a.2, p. 300)

    𝑝 (z | N) = N (z; m, C + N);    cov[z, πœ™ | π‘₯] = cov[y, πœ™ | π‘₯] = πœ… (π‘₯).
Thus we can condition on an observation of the corrupted vector z by
simply replacing C with C + N in the prior (2.6) and posterior (2.10).14
Note that the posterior converges to that from a direct observation of y
if we take the noise covariance N β†’ 0 in the positive semidefinite cone,
a reassuring result.

14 Assuming zero-mean errors is not strictly necessary but is overwhelmingly common in practice. A nonzero mean 𝔼[𝜺] = n is possible by further replacing (y βˆ’ m) with y βˆ’ [m + n] in (2.10), where m + n = 𝔼[y].

Interpretation of posterior moments


The moments of the posterior Gaussian process (2.10) contain update
terms adjusting the prior moments in light of the data. These updates
have intuitive interpretations in terms of the nature of the prior process
and the observed values, which we may unravel with some care.
We can gain some initial insight by considering the case where we
observe a single value 𝑦 with distribution N (𝑦; π‘š, 𝑠²) and breaking down
its impact on our belief. Consider an arbitrary function value πœ™ with
prior distribution N (πœ™; πœ‡, 𝜎²) (2.5) and define

    𝑧 = (𝑦 βˆ’ π‘š) / 𝑠

to be the 𝑧-score of the observed value 𝑦 and

    𝜌 = corr[𝑦, πœ™ | π‘₯] = πœ… (π‘₯) / (πœŽπ‘ )

to be the correlation between 𝑦 and πœ™. Then the posterior mean and
standard deviation of πœ™ are, respectively:

    πœ‡ + πœŽπœŒπ‘§;    𝜎 √(1 βˆ’ 𝜌²).          (2.12)

The 𝑧-score of the posterior mean, with respect to the prior distri-
bution of πœ™, is πœŒπ‘§. An independent measurement with 𝜌 = 0 thus leaves
the prior mean unchanged, whereas a perfectly dependent measurement
with |𝜌 | = 1 shifts the mean up or down by 𝑧 standard deviations (de-
pending on the sign of the correlation) to match the magnitude of the
measurement’s 𝑧-score. Measurements with partial dependence result in
outcomes between these extremes. Further, surprising measurements –
that is, those with large |𝑧| – yield larger shifts in the mean, whereas an
entirely expected measurement with 𝑦 = π‘š leaves the mean unchanged.
Turning to the posterior standard deviation, the measurement re-
duces our uncertainty in πœ™ by a factor depending on the correlation
𝜌, but not on the value observed. An independent measurement again
leaves the prior intact, whereas a perfectly dependent measurement col-
lapses the posterior standard deviation to zero as the value of πœ™ would be
completely determined. The relative reduction in posterior uncertainty
as a function of the absolute correlation is illustrated in the margin.
[Margin figure: the posterior standard deviation of πœ™, 𝜎 √(1 βˆ’ 𝜌²), as a
function of the strength of relationship with 𝑦, |𝜌 |.]
In the case of vector-valued observations, we can interpret similar
structure in the posterior, although dependence between entries of y
must also be accounted for. We may factor the observation covariance
matrix as
    C = SPS,          (2.13)

where S is diagonal with 𝑆𝑖𝑖 = √𝐢𝑖𝑖 = std[𝑦𝑖 ] and P = corr[y] is the
observation correlation matrix. We may then rewrite the posterior mean
of πœ™ as

    πœ‡ + πœŽπ†βŠ€ Pβˆ’1 z,

where z and 𝝆 represent the vectors of measurement 𝑧-scores and the
cross-correlation between πœ™ and y, respectively:

    𝑧𝑖 = (𝑦𝑖 βˆ’ π‘šπ‘– ) / 𝑠𝑖 ;    πœŒπ‘– = [πœ… (π‘₯)]𝑖 / (πœŽπ‘ π‘– ).

The posterior mean is now in the same form as the scalar case (2.12),
with the introduction of the observation correlation matrix moderating
the 𝑧-scores to account for dependence between the observed values.15
The posterior standard deviation of πœ™ in the vector-valued case is

    𝜎 √(1 βˆ’ π†βŠ€ Pβˆ’1 𝝆),

again analogous to (2.12). Noting that the inverse correlation matrix
Pβˆ’1 is positive definite,16 the posterior covariance again reflects a global
reduction in the marginal uncertainty of every function value. In fact, the
joint distribution of any set of function values has reduced uncertainty
in the posterior in terms of the differential entropy (a.16), as17

    |𝐾 (x, x) βˆ’ πœ… (x)⊀ Cβˆ’1 πœ… (x)| ≀ |𝐾 (x, x)|.

The reduction of uncertainty again depends on the strength of depen-
dence between function values and the observed data, with independence
(𝝆 = 0) resulting in no change. The reduction also depends on the pre-
cision of the measurements: all other things held equal, observations
with greater precision in terms of the LΓΆwner order18 on the precision
matrix Cβˆ’1 provide a globally better informed posterior. In particular, as
(C + N)βˆ’1 β‰Ί Cβˆ’1 for any noise covariance N, noisy measurements (2.11)
categorically provide less information about the function than direct
observations, as one should hope.

15 It can be instructive to contrast the behavior of the posterior when conditioning on two highly correlated values versus two independent ones. In the former case, the posterior does not change much as a result of the second measurement, as dependence reduces the effective number of measurements.
16 P is congruent to C (2.13) and is thus positive definite from Sylvester’s law of inertia.
17 For positive semidefinite A, B, |A| ≀ |A + B|.
18 The LΓΆwner order is the partial order induced by the convex cone of positive-semidefinite matrices. For symmetric A, B, we define A β‰Ί B if and only if B βˆ’ A is positive definite: k. lΓΆwner (1934). Über monotone Matrixfunktionen. Mathematische Zeitschrift 38:177–216.

Inference with exact function evaluations


We will now explicitly demonstrate the general process of Gaussian
process inference for important special cases, beginning with the simplest
possible observation mechanism: exact observation.
Suppose we have observed 𝑓 at some set of locations x, revealing the
corresponding function values 𝝓 = 𝑓 (x), and let D = (x, 𝝓) denote this
dataset. The observed vector shares a joint Gaussian distribution with
any other set of function values by the gp assumption on 𝑓 (2.2), so we
may follow the above procedure to derive the posterior. The marginal
distribution of 𝝓 is Gaussian (2.3):

𝑝 (𝝓 | x) = N (𝝓; 𝝁, 𝚺),

and the cross-covariance between an arbitrary function value and 𝝓 is


by definition given by the covariance function:

πœ… (π‘₯) = cov[𝝓, πœ™ | x, π‘₯] = 𝐾 (x, π‘₯).

Appealing to (2.9–2.10) we have:

𝑝 (𝑓 | D) = GP (𝑓 ; πœ‡ D , 𝐾D ),
where

    πœ‡D (π‘₯) = πœ‡ (π‘₯) + 𝐾 (π‘₯, x) πšΊβˆ’1 (𝝓 βˆ’ 𝝁);
    𝐾D (π‘₯, π‘₯ β€²) = 𝐾 (π‘₯, π‘₯ β€²) βˆ’ 𝐾 (π‘₯, x) πšΊβˆ’1 𝐾 (x, π‘₯ β€²).          (2.14)

Our previous Figure 2.2 illustrates the posterior resulting from con-
ditioning our gp prior in Figure 2.1 on three exact measurements, with
high-level analysis of its behavior in the accompanying text.
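A minimal sketch of these formulas for the running one-dimensional example (zero prior mean, squared exponential covariance), with hypothetical observation locations and values:

    import numpy as np

    def k_se(x1, x2):
        # squared exponential covariance (2.4)
        return np.exp(-0.5 * np.abs(x1 - x2)**2)

    x_obs = np.array([3.0, 12.0, 22.0])       # hypothetical observation locations
    phi = np.array([0.5, -1.0, 0.8])          # hypothetical exact observed values

    Sigma = k_se(x_obs[:, None], x_obs[None, :])
    alpha = np.linalg.solve(Sigma, phi)       # with zero prior mean, phi - mu = phi

    def posterior_mean(x):                    # first line of (2.14)
        return k_se(x, x_obs) @ alpha

    def posterior_var(x):                     # second line of (2.14) with x' = x
        k_star = k_se(x, x_obs)
        return k_se(x, x) - k_star @ np.linalg.solve(Sigma, k_star)

    print(posterior_mean(3.0), posterior_var(3.0))   # interpolates: ~0.5 with ~zero variance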

Inference with function evaluations corrupted by additive Gaussian noise


With the notable exception of optimizing the output of a deterministic
computer program or simulation, observations of an objective function
are typically corrupted by noise due to measurement limitations or statis-
tical approximation; we must be able to handle such noisy observations to
maximize utility. Fortunately, in the important case of additive Gaussian
noise, we may perform exact inference following the general procedure
described above. In fact, the derivation below follows directly from our
previous discussion on arbitrary additive Gaussian noise (Β§ 2.2, p. 20), but the case of
additive Gaussian noise in function evaluations is important enough to
merit its own discussion.
Suppose we make observations of 𝑓 at locations x, revealing cor-
rupted values y = 𝝓 + 𝜺. Suppose the measurement errors 𝜺 are indepen-
dent of 𝝓 and normally distributed with mean zero and covariance N,
which may optionally depend on x:
𝑝 (𝜺 | x, N) = N (𝜺; 0, N). (2.15)

As before we aggregate the observations into a dataset D = (x, y).


The observation noise covariance can in principle be arbitrary;19
however, the most common models in practice are independent ho-
moskedastic noise with scale πœŽπ‘› :

    N = πœŽπ‘›Β² I,          (2.16)

and independent heteroskedastic noise with scale depending on location
according to a function πœŽπ‘› : X β†’ ℝβ‰₯0 :

    N = diag πœŽπ‘›Β² (x).          (2.17)

For a given observation location π‘₯, we will simply write πœŽπ‘› for the
associated noise scale, leaving any dependence on π‘₯ implicit.

19 Allowing nondiagonal N departs from our typical convention of assuming conditional independence between observations (1.3), but doing so does not complicate inference, so there is no harm in this generality.
[Figure 2.3: Posteriors for our example gp from Figure 2.1 conditioned on 15 noisy observations with independent homoskedastic noise (2.16). The signal-to-noise ratio is 10 for the top example, 3 for the middle example, and 1 for the bottom example.]

The prior distribution of the observations is now multivariate normal


(2.3, a.15):
𝑝 (y | x, N) = N (y; 𝝁, 𝚺 + N). (2.18)
Due to independence of the noise, the cross-covariance remains the same
as in the exact observation case:

πœ… (π‘₯) = cov[y, πœ™ | x, π‘₯] = 𝐾 (x, π‘₯).

Conditioning on the observed value now yields a gp posterior with

    πœ‡D (π‘₯) = πœ‡ (π‘₯) + 𝐾 (π‘₯, x) (𝚺 + N)βˆ’1 (y βˆ’ 𝝁);
    𝐾D (π‘₯, π‘₯ β€²) = 𝐾 (π‘₯, π‘₯ β€²) βˆ’ 𝐾 (π‘₯, x) (𝚺 + N)βˆ’1 𝐾 (x, π‘₯ β€²).          (2.19)

Figure 2.3 shows a sequence of posterior distributions resulting from
conditioning our example gp on data corrupted by increasing levels of
homoskedastic noise (2.16). As the noise level increases, the observations
have diminishing influence on our belief, with some extreme values
eventually being partially explained away as outliers. As measurements
are assumed to be inexact, the posterior mean is not compelled to inter-
polate perfectly through the observations, as in the exact case (Figure
[Figure 2.4: The posterior distribution for our example gp from Figure 2.1 conditioned on 50 observations with heteroskedastic observation noise (2.17). We show predictive credible intervals for both the latent objective function and noisy observations; the standard deviation of the observation noise increases linearly from left-to-right.]

2.2). Further, with increasing levels of noise, our posterior belief reflects
significant residual uncertainty in the function, even in regions with
multiple nearby observations.
We illustrate an example of Gaussian process inference with het-
eroskedastic noise (2.17) in Figure 2.4, where the signal-to-noise ratio
decreases smoothly from left-to-right over the domain. Although the
observations provide relatively even coverage, our posterior uncertainty
is minimal on the left-hand side of the domain – where the measure-
ments provide maximal information – and increases as our observations
become more noisy and less informative.
We will often require the posterior predictive distribution for a noisy
measurement 𝑦 that would result from observing at a given location π‘₯.
The posterior distribution on 𝑓 (2.19) provides the posterior predictive
distribution for the latent function value πœ™ = 𝑓 (π‘₯) (2.5):

𝑝 (πœ™ | π‘₯, D) = N (πœ™; πœ‡, 𝜎 2 ); πœ‡ = πœ‡D (π‘₯); 𝜎 2 = 𝐾D (π‘₯, π‘₯),

but does not account for the effect of observation noise. In the case of
independent additive Gaussian noise (2.16–2.17), deriving the posterior
predictive distribution is trivial; we have (a.15):

𝑝 (𝑦 | π‘₯, D, πœŽπ‘› ) = N (𝑦; πœ‡, 𝜎 2 + πœŽπ‘›2 ). (2.20)

This predictive distribution is illustrated in Figure 2.4; the credible inter-


vals for noisy measurements reflect inflation of the credible intervals for
the underlying function value commensurate with the scale of the noise.
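A minimal sketch of inference with independent homoskedastic noise (2.16, 2.19) and the resulting noisy predictive distribution (2.20); the observation locations and values below are synthetic stand-ins:

    import numpy as np

    def k_se(x1, x2):
        # squared exponential covariance (2.4)
        return np.exp(-0.5 * np.abs(x1 - x2)**2)

    rng = np.random.default_rng(0)
    x_obs = np.linspace(0, 30, 15)            # hypothetical observation locations
    sigma_n = 0.3                             # homoskedastic noise scale (2.16), assumed known
    y = np.sin(x_obs / 3) + sigma_n * rng.standard_normal(x_obs.size)   # synthetic data

    Sigma = k_se(x_obs[:, None], x_obs[None, :])
    A = Sigma + sigma_n**2 * np.eye(x_obs.size)          # Sigma + N in (2.19)
    alpha = np.linalg.solve(A, y)

    def predict(x):
        k_star = k_se(x, x_obs)
        mu = k_star @ alpha                                       # posterior mean of phi
        var = k_se(x, x) - k_star @ np.linalg.solve(A, k_star)    # posterior variance of phi
        return mu, var, var + sigma_n**2                          # noisy predictive variance (2.20)

    print(predict(10.0))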
If the noise contains nondiagonal correlation structure, we must
account for dependence between training and test errors in the predictive
distribution. The easiest way to proceed is to recognize that the noisy
observation process 𝑦 = πœ™ + πœ€, as a function of π‘₯, is itself a Gaussian
process with mean function πœ‡ and covariance function 𝐢:

𝐢 (π‘₯, π‘₯ β€²) = cov[𝑦, 𝑦 β€² | π‘₯, π‘₯ β€²] = 𝐾 (π‘₯, π‘₯ β€²) + 𝑁 (π‘₯, π‘₯ β€²),


where 𝑁 is the noise covariance: 𝑁 (π‘₯, π‘₯ β€²) = cov[πœ€, πœ€ β€² | π‘₯, π‘₯ β€²]. The poste-
rior of the observation process is then a gp with

    𝔼[𝑦 | π‘₯, D] = πœ‡ (π‘₯) + 𝐢 (π‘₯, x) (𝚺 + N)βˆ’1 (y βˆ’ 𝝁);
    cov[𝑦, 𝑦 β€² | π‘₯, π‘₯ β€², D] = 𝐢 (π‘₯, π‘₯ β€²) βˆ’ 𝐢 (π‘₯, x) (𝚺 + N)βˆ’1 𝐢 (x, π‘₯ β€²),          (2.21)

from which we can derive predictive distributions via (2.2).

2.3 overview of remainder of chapter


In the remainder of this chapter we will cover some additional, somewhat
niche and/or technical aspects of Gaussian processes that see occasional
use in Bayesian optimization. Modulo mathematical nuances irrelevant
in practical settings, an intuitive (but not entirely accurate!) summary
follows:20
β€’ a joint Gaussian process (discussed below) allows us to model multiple
related functions simultaneously, which is critical for some scenarios
such as multifidelity and multiobjective optimization (Β§ 11.5, p. 263; Β§ 11.7, p. 269);
β€’ gp sample paths are continuous if the mean function is continuous and
the covariance function is continuous along the β€œdiagonal” π‘₯ = π‘₯ β€² (Β§ 2.5, p. 28);
β€’ gp sample paths are differentiable if the mean function is differentiable
and the covariance function is differentiable along the β€œdiagonal” π‘₯ = π‘₯ β€² (Β§ 2.6, p. 30);
β€’ a function with a sufficiently smooth gp distribution shares a joint gp
distribution with its gradient; among other things, this allows us to con-
dition on (potentially noisy) derivative observations via exact inference (Β§ 2.6, p. 32);
β€’ gp sample paths attain a maximum when sample paths are continuous
and the domain is compact (Β§ 2.7, p. 34);
β€’ gp sample paths attain a unique maximum under the additional condition
that no two unique function values are perfectly correlated (Β§ 2.7, p. 34); and
β€’ several methods are available for approximating the posterior process of
a gp conditioned on information incompatible with exact inference (Β§ 2.8, p. 35).

20 In particular the claims regarding continuity and differentiability are slightly more complicated than stated below.
If satisfied with the above summary, the reader may safely skip this
material for now and move on with the next chapter. For those who wish
to see the gritty details, dive in below!

2.4 joint gaussian processes


In some settings, we may wish to reason jointly about two or more
related functions, such as an objective function and its gradient or an
expensive objective function and a cheaper surrogate. To this end we can
extend Gaussian processes to yield a joint distribution over the values
assumed by multiple functions. The key to the construction is to β€œpaste
together” a collection of functions into a single function on a larger
domain, then construct a standard gp on this combined function.

Definition
To elaborate, consider a set of functions {𝑓𝑖 : X𝑖 β†’ ℝ} we wish to
model.21 We define the disjoint union of these functions βŠ”π‘“ – defined on
the disjoint union22 of their domains X = ∐ X𝑖 – by insisting its restric-
tion to each domain be compatible with the corresponding function:

    βŠ”π‘“ : X β†’ ℝ;    βŠ”π‘“ | X𝑖 ≑ 𝑓𝑖 .

21 The domains need not be equal, but they often are in practice.
22 A disjoint union represents a point π‘₯ ∈ X𝑖 by the pair (π‘₯, 𝑖), thereby combining the domains while retaining their identities.

We can now define a gp on βŠ”π‘“ by choosing mean and covariance func-
tions on X as desired:
𝑝 (βŠ”π‘“ ) = GP (βŠ”π‘“ ; πœ‡, 𝐾). (2.22)

We will call this construction a joint Gaussian process on {𝑓𝑖 }.
It is often convenient to decompose the moments of a joint gp into
their restrictions on relevant subspaces. For example, consider a joint gp
(2.22) on 𝑓 : F β†’ ℝ and 𝑔 : G β†’ ℝ. After defining

πœ‡ 𝑓 ≑ πœ‡| F ; πœ‡π‘” ≑ πœ‡| G ;
𝐾 𝑓 ≑ 𝐾 | F Γ—F ; 𝐾𝑔 ≑ 𝐾 | GΓ—G ; 𝐾 𝑓𝑔 ≑ 𝐾 | F Γ—G ; 𝐾𝑔𝑓 ≑ 𝐾 | GΓ—F ,

we can see that 𝑓 and 𝑔 in fact have marginal gp distributions (in fact,
any restriction of a gp-distributed function has a gp or multivariate
normal distribution):

    𝑝 (𝑓 ) = GP (𝑓 ; πœ‡ 𝑓 , 𝐾 𝑓 );    𝑝 (𝑔) = GP (𝑔; πœ‡π‘” , 𝐾𝑔 ),          (2.23)

that are coupled by the cross-covariance functions 𝐾 𝑓𝑔 and 𝐾𝑔𝑓 . Given


vectors x βŠ‚ F and xβ€² βŠ‚ G, these compute the covariance between the
corresponding function values 𝝓 = 𝑓 (x) and 𝜸 = 𝑔(xβ€²):

    𝐾 𝑓𝑔 (x, xβ€²) = cov[𝝓, 𝜸 | x, xβ€²];
    𝐾𝑔𝑓 (x, xβ€²) = cov[𝜸, 𝝓 | x, xβ€²] = 𝐾 𝑓𝑔 (x, xβ€²)⊀.          (2.24)

When convenient we will notate a joint gp in terms of these decomposed
functions, here writing:24

    𝑝 (𝑓, 𝑔) = GP ([𝑓 ; 𝑔]; [πœ‡ 𝑓 ; πœ‡π‘” ], [𝐾 𝑓 , 𝐾 𝑓𝑔 ; 𝐾𝑔𝑓 , 𝐾𝑔 ]).          (2.25)

24 We also used this notation in (2.6), where the β€œdomain” of the vector y can be taken to be some finite index set of appropriate size.

With this notation, the marginal gp property (2.23) is perfectly analogous


to the marginal property of the multivariate Gaussian distribution (a.13).
We can also use this construction to define a gp on a vector-valued
function f: X β†’ ℝ𝑑 by defining a joint Gaussian process on its 𝑑
coordinate functions {𝑓𝑖 } : X β†’ ℝ. In this case we typically write the
resulting model using the standard notation GP (f; πœ‡, 𝐾), where the mean
and covariance functions are now understood to map to ℝ𝑑 and ℝ𝑑×𝑑 .

Example
We can demonstrate the behavior of a joint Gaussian process by extend-
ing our running example gp on 𝑓 : [0, 30] β†’ ℝ. Recall the prior on 𝑓 has
[Figure 2.5: A joint Gaussian process over two functions on the shared domain X = [0, 30]. The marginal belief over both functions is the same as our example gp from Figure 2.1, but the cross-covariance (2.26) between the functions strongly couples their behavior. We also show a sample from the joint distribution illustrating the strong correlation induced by the joint prior.]

zero mean function πœ‡ ≑ 0 and squared exponential covariance function


(2.4). We augment our original function with a companion function 𝑔,
defined on the same domain, that has exactly the same marginal gp
distribution. However, we couple the distribution of 𝑓 and 𝑔 by defining
a nontrivial cross-covariance function 𝐾 𝑓𝑔 (2.24):

𝐾 𝑓𝑔 (π‘₯, π‘₯ β€²) = 0.9𝐾 (π‘₯, π‘₯ β€²), (2.26)

where 𝐾 is the marginal covariance function of 𝑓 and 𝑔. A consequence


of this choice is that for any given point π‘₯ ∈ X, the correlation of the
corresponding function values πœ™ = 𝑓 (π‘₯) and 𝛾 = 𝑔(π‘₯) is quite strong:

corr[πœ™, 𝛾 | π‘₯] = 0.9. (2.27)

We illustrate the resulting joint gp in Figure 2.5. The marginal credible


intervals for 𝑓 (and now 𝑔) have not changed from our original example
in Figure 2.1. However, drawing a sample of the functions from their joint
distribution reveals the strong coupling encoded in the prior (2.26–2.27).
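A minimal sketch of this joint prior: on a finite grid, the construction amounts to assembling a block Gram matrix with the marginal covariance on the diagonal blocks and the cross-covariance (2.26) off the diagonal; the empirical check of (2.27) at the end averages over repeated draws.

    import numpy as np

    def k_se(x1, x2):
        # squared exponential covariance (2.4)
        return np.exp(-0.5 * np.abs(x1 - x2)**2)

    xs = np.linspace(0, 30, 200)
    K = k_se(xs[:, None], xs[None, :])
    cov = np.block([[K, 0.9 * K],                   # cross-covariance (2.26)
                    [0.9 * K, K]])
    cov += 1e-8 * np.eye(cov.shape[0])              # jitter for numerical stability

    rng = np.random.default_rng(0)
    samples = rng.multivariate_normal(np.zeros(2 * xs.size), cov, size=2000)
    f0, g0 = samples[:, 0], samples[:, xs.size]     # co-located values of f and g at xs[0]
    print(np.corrcoef(f0, g0)[0, 1])                # approximately 0.9, as in (2.27)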

Inference for joint Gaussian processes


The construction in (2.22) allows us to reason about a joint Gaussian
process as if it were a single gp. This allows us to condition a joint gp
on observations of jointly Gaussian distributed values following the
procedure outlined previously (inference from jointly Gaussian distributed observations: Β§ 2.2, p. 18). In Figure 2.6, we condition the joint gp
prior from Figure 2.5 on ten observations: five exact observations of 𝑓 on
the left-hand side of the domain and five exact observations of 𝑔 on the
right-hand side. Due to the strong correlation between the two functions,
an observation of either function strongly informs our belief about the
other, even in regions where there are no direct observations.

2.5 continuity
In this and the following sections we will establish some important
properties of Gaussian processes determined by the properties of their
[Figure 2.6: The joint posterior for our example joint gp prior in Figure 2.5 conditioned on five exact observations of each function. The two panels show 𝑝 (𝑓 | D) and 𝑝 (𝑔 | D).]

moments. As a gp is completely specified by its mean and covariance


functions, it should not be surprising that the nature of these functions
has far-reaching implications regarding properties of the function being
modeled. A good familiarity with these implications can help guide
model design in practice – the focus of the next two chapters.
To begin, a fundamental question regarding Gaussian processes is
whether sample paths are almost surely continuous, and if so how many
times differentiable they may be. This is obviously an important consider-
ation for modeling and is also critical to ensure that global optimization
is a well-posed problem, as we will discuss later in this chapter (existence of global maxima: Β§ 2.7, p. 34). Fortu-
nately, continuity of Gaussian processes is a well-understood property
that can be guaranteed almost surely under simple conditions on the
mean and covariance functions.
Suppose 𝑓 : X β†’ ℝ has distribution GP (𝑓 ; πœ‡, 𝐾). Recall that 𝑓 is
continuous at π‘₯ if 𝑓 (π‘₯) βˆ’ 𝑓 (π‘₯ β€²) = πœ™ βˆ’ πœ™ β€² β†’ 0 when π‘₯ β€² β†’ π‘₯. Continuity
is thus a limiting property of differences in function values. But under
the Gaussian process assumption, this difference is Gaussian distributed
(2.5, a.9)! We have
𝑝 (πœ™ βˆ’ πœ™ β€² | π‘₯, π‘₯ β€²) = N (πœ™ βˆ’ πœ™ β€²; π‘š, 𝑠 2 ),
where
π‘š = πœ‡ (π‘₯) βˆ’ πœ‡ (π‘₯ β€²); 𝑠 2 = 𝐾 (π‘₯, π‘₯) βˆ’ 2𝐾 (π‘₯, π‘₯ β€²) + 𝐾 (π‘₯ ,β€² π‘₯ β€²).
Now if πœ‡ is continuous at π‘₯ and 𝐾 is continuous at π‘₯ = π‘₯ ,β€² then both
π‘š β†’ 0 and 𝑠 2 β†’ 0 as π‘₯ β†’ π‘₯ ,β€² and thus πœ™ βˆ’πœ™ β€² converges in probability to
0. This intuitive condition of continuous moments is known as continuity continuity in mean square
in mean square at π‘₯; if πœ‡ and 𝐾 are both continuous over the entire domain
(the latter along the β€œdiagonal” π‘₯ = π‘₯ β€²), then we say the entire process is
continuous in mean square.

It turns out that continuity in mean square is not quite sufficient


to guarantee that 𝑓 is simultaneously continuous at every π‘₯ ∈ X with
probability one, a property known as sample path continuity. However,
very slightly stronger conditions on the moments of a gp are sufficient
to guarantee sample path continuity.25 The following result is adequate
for most settings arising in practice and may be proven as a corollary to
the slightly weaker (and slightly more complicated) conditions assumed
in adler and taylor’s theorem 1.4.1.

Theorem. Suppose X βŠ‚ ℝ𝑑 is compact and 𝑓 : X β†’ ℝ has Gaussian
process distribution GP (𝑓 ; πœ‡, 𝐾), where πœ‡ is continuous and 𝐾 is HΓΆlder
continuous.26 Then 𝑓 is almost surely continuous on X.

The condition that X βŠ‚ ℝ𝑑 be compact is equivalent to the domain
being closed and bounded, by the Heine–Borel theorem.27 Applying this
result to our example gp in Figure 2.1, we conclude that samples from the
process are continuous with probability one as the domain X = [0, 30] is
compact and the squared exponential covariance function (2.4) is HΓΆlder
continuous. Indeed, the generated samples are very smooth.28
Sample path continuity can also be guaranteed on non-Euclidean
domains under similar smoothness conditions.25

25 r. j. adler and j. e. taylor (2007). Random Fields and Geometry. Springer–Verlag. [Β§Β§ 1.3–1.4]
26 HΓΆlder continuity is a generalization of Lipschitz continuity. Effectively, the covariance function must, in some sense, be β€œpredictably” continuous.
27 w. rudin (1976). Principles of Mathematical Analysis. McGraw–Hill. [theorem 2.41]
28 Following the discussion in the next section, they are in fact infinitely differentiable.

2.6 differentiability
We can approach the question of differentiability by again reasoning
about the limiting behavior of linear transformations of function values.
Suppose 𝑓 : X β†’ ℝ with X βŠ‚ ℝ𝑑 has distribution GP (𝑓 ; πœ‡, 𝐾), and
consider the 𝑖th partial derivative of 𝑓 at x, if it exists:
πœ•π‘“ 𝑓 (x + β„Že𝑖 ) βˆ’ 𝑓 (x)
(x) = lim ,
πœ•π‘₯𝑖 β„Žβ†’0 β„Ž
where e𝑖 is the 𝑖th standard basis vector. For β„Ž > 0, the value in the limit
is Gaussian distributed as a linear transformation of Gaussian-distributed
random variables (a.9). Assuming the corresponding partial derivative
of the mean exists at x and the corresponding partial derivative with
respect to each input of the covariance function exists at x = xβ€², then as
β„Ž β†’ 0 the partial derivative converges in distribution to a Gaussian (sequences of normal rvs: Β§ a.2, p. 300):

    𝑝 (βˆ‚π‘“/βˆ‚π‘₯𝑖 (x) | x) = N (βˆ‚π‘“/βˆ‚π‘₯𝑖 (x); βˆ‚πœ‡/βˆ‚π‘₯𝑖 (x), βˆ‚Β²πΎ/(βˆ‚π‘₯𝑖 βˆ‚π‘₯𝑖′) (x, x)).

If this property holds for each coordinate 1 ≀ 𝑖 ≀ 𝑑, then 𝑓 is said to be
differentiable in mean square at x.
If 𝑓 is differentiable in mean square everywhere in the domain, the
process itself is called differentiable in mean square, and we have the
remarkable result that the function and its gradient have a joint Gaussian
process distribution:

    𝑝 (𝑓, βˆ‡π‘“ ) = GP ([𝑓 ; βˆ‡π‘“ ]; [πœ‡; βˆ‡πœ‡], [𝐾, 𝐾 βˆ‡βŠ€; βˆ‡πΎ, βˆ‡πΎ βˆ‡βŠ€]).          (2.28)
[Figure 2.7: The joint posterior of the function and its derivative 𝑑𝑓/𝑑π‘₯ for our example Gaussian process from Figure 2.2. The dashed line in the lower plot corresponds to a derivative of zero.]

Here by writing the gradient operator βˆ‡ on the left-hand side of 𝐾 we


mean the result of taking the gradient with respect to its first input,
and by writing βˆ‡βŠ€ on the right-hand side of 𝐾 we mean taking the
gradient with respect to its second input and transposing the result. Thus
βˆ‡πΎ : X Γ— X β†’ ℝ𝑑 maps pairs of points to column vectors: covariance between βˆ‡π‘“ (x) and 𝑓 (xβ€² ), βˆ‡πΎ
 
  πœ•π‘“ πœ•πΎ
βˆ‡πΎ (x, xβ€²) 𝑖 = cov (x), 𝑓 (xβ€²) x, xβ€² = (x, xβ€²),
πœ•π‘₯𝑖 πœ•π‘₯𝑖

and 𝐾 βˆ‡βŠ€: X Γ— X β†’ (ℝ𝑑 ) βˆ— maps pairs of points to row vectors: transpose of covariance between 𝑓 (x) and
 
βˆ‡π‘“ (xβ€² ), 𝐾 βˆ‡βŠ€
𝐾 βˆ‡βŠ€(x, xβ€²) = βˆ‡πΎ (x,β€² x) ⊀ .

Finally, the function βˆ‡πΎ βˆ‡βŠ€: X Γ— X β†’ ℝ𝑑×𝑑 represents the result of covariance between βˆ‡π‘“ (x) and βˆ‡π‘“ (xβ€² ),
applying both operations, mapping a pair of points to the covariance βˆ‡πΎ βˆ‡βŠ€
matrix between the entries of the corresponding gradients:
 
  πœ•π‘“ πœ•π‘“ β€² πœ•2𝐾
βˆ‡πΎβˆ‡ (x, x ) 𝑖 𝑗 = cov
⊀ β€² β€²
(x), β€² (x ) x, x = (x, xβ€²).
πœ•π‘₯𝑖 πœ•π‘₯𝑗 πœ•π‘₯𝑖 πœ•π‘₯𝑗′

As the gradient of 𝑓 has a Gaussian process marginal distribution (2.28),
we can reduce the question of continuous differentiability to sample path
continuity of the gradient process following the discussion above.
Figure 2.7 shows the posterior distribution for the derivative of our
example Gaussian process alongside the posterior for the function itself.
We can observe a clear correspondence between the two distributions;
for example, the posterior mean of the derivative vanishes at critical
points of the posterior mean of the function. Notably, we have a great
deal of residual uncertainty about the derivative, even at the observed
locations. That is because the relatively high spacing between the exist-
ing observations limits our ability to accurately estimate the derivative

Figure 2.8: The joint posterior of the derivative (𝑑𝑓/𝑑π‘₯) of our example Gaussian process after adding a new observation
nearby another suggesting a large positive slope. The dashed line in the lower plot corresponds to a derivative of zero.

anywhere. Adding an observation immediately next to a previous one
significantly reduces the uncertainty in the derivative in that region by
effectively providing a finite-difference approximation; see Figure 2.8.

Conditioning on derivative observations


However, we can be more direct in specifying derivatives than finite
differencing. We can instead condition the joint gp (2.28) directly on a
derivative observation, as described previously (inference from jointly
Gaussian distributed observations: Β§ 2.2, p. 18). Figure 2.9 shows the
joint posterior after conditioning on an exact observation of the deriva-
tive at the left-most observation location, where the uncertainty in the
derivative now vanishes entirely. This capability allows the seamless
incorporation of derivative information into an objective function model.
Notably, we can even condition a Gaussian process on noisy derivative
observations as well, as we might obtain in stochastic gradient descent.
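To make this mechanism concrete, here is a minimal sketch (our addition; Python with NumPy,
with invented observation values) that conditions a centered one-dimensional gp with squared
exponential covariance on an exact function value and an exact derivative value at the same
location, using the joint covariance structure above together with the standard conditioning
formulas (2.9–2.10).

    import numpy as np

    # one-dimensional squared exponential covariance and its derivatives
    k      = lambda x, y: np.exp(-0.5 * (x - y) ** 2)   # cov[f(x), f(y)]
    k_dy   = lambda x, y: (x - y) * k(x, y)             # cov[f(x), f'(y)]
    k_dxdy = lambda x, y: (1 - (x - y) ** 2) * k(x, y)  # cov[f'(x), f'(y)]

    # exact observations: a function value and a derivative value at x0 (hypothetical)
    x0, f0, df0 = 1.0, 0.5, 2.0
    xs = 1.2                                            # test location

    # joint prior covariance of y = [f(x0), f'(x0)] and cross-covariance with f(xs)
    C = np.array([[k(x0, x0),    k_dy(x0, x0)],
                  [k_dy(x0, x0), k_dxdy(x0, x0)]])
    kappa = np.array([k(xs, x0), k_dy(xs, x0)])

    # standard Gaussian conditioning (zero prior mean)
    y = np.array([f0, df0])
    post_mean = kappa @ np.linalg.solve(C, y)
    post_var  = k(xs, xs) - kappa @ np.linalg.solve(C, kappa)
    print(post_mean, post_var)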
We can reason about derivatives past the first recursively. For exam-
ple, if πœ‡ and 𝐾 are twice differentiable,29 then the (e.g., half-vectorized30)
Hessian of 𝑓 will also have a joint gp distribution with 𝑓 and its gradient.
Defining h to be the operator mapping a function to its half-vectorized
Hessian:
\[
  \mathbf{h} f = \operatorname{vech} \nabla\nabla^{\!\top}\! f,
\]
for a Gaussian process with suitably differentiable moments, we have
\[
  p(\mathbf{h} f) = \mathcal{GP}\bigl(\mathbf{h} f;\, \mathbf{h}\mu,\, \mathbf{h} K \mathbf{h}^{\!\top}\bigr),
  \tag{2.29}
\]
where we have used the same notational convention for the transpose.
Further, 𝑓, βˆ‡π‘“, and h𝑓 will have a joint Gaussian process distribution
given by augmenting (2.28) with the marginal in (2.29) and the cross-
covariance functions
\[
  \operatorname{cov}[\mathbf{h} f, f] = \mathbf{h} K;
  \qquad
  \operatorname{cov}[\mathbf{h} f, \nabla f] = \mathbf{h} K \nabla^{\!\top}.
\]

29 For 𝐾 we again only need to consider the β€œdiagonal” x = xβ€².
30 Recall the Hessian is symmetric (assuming the second partial derivatives are continuous) and thus redundant. The
half-vectorization operator vech A maps the upper triangular part of a square, symmetric matrix A to a vector.

Figure 2.9: The joint posterior of the derivative (𝑑𝑓/𝑑π‘₯) of our example Gaussian process after adding an exact observation
of the derivative at the indicated location. The dashed line in the lower plot corresponds to a derivative of zero.

We can continue further in this vein if needed; however, we rarely reason
about derivatives of third or higher order in Bayesian optimization.31

31 This is true in classical optimization as well!

Other linear transformations


The joint gp distribution between a suitably smooth gp-distributed func-
tion and its gradient (2.28) is simply an infinite-dimensional analog of
the general result that Gaussian random variables are jointly Gaussian
distributed with arbitrary linear transformations (a.10), after noting that
differentiation is a linear operator. We can extend this result to reason
about other linear transformations of gp-distributed functions. diaconis’s
original motivation for studying Bayesian numerical methods was
quadrature, the numerical estimation of intractable integrals.32 It turns
out that Gaussian processes are a rather convenient model for this task:
if 𝑝 (𝑓 ) = GP (𝑓 ; πœ‡, 𝐾) and we want to reason about the expectation
\[
  Z = \int f(x)\, p(x)\, \mathrm{d}x,
\]
then (under mild conditions) we again have a joint Gaussian process
distribution over 𝑓 and 𝑍.33 This enables both inference about 𝑍 and
conditioning on noisy observations of integrals, such as a Monte Carlo
estimate of an expectation. The former is the basis for Bayesian quadra-
ture, an analog of Bayesian optimization bringing Bayesian experimental
design to bear on numerical integration.32, 34, 35

32 p. diaconis (1988). Bayesian Numerical Analysis. In: Statistical Decision Theory and Related Topics iv.
33 This can be shown, for example, by considering the limiting distribution of Riemann sums.
34 a. o’hagan (1991). Bayes–Hermite Quadrature. Journal of Statistical Planning and Inference 29(3):245–260.
35 c. e. rasmussen and z. ghahramani (2002). Bayesian Monte Carlo. neurips 2002.
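For concreteness (a standard calculation, added here for illustration), the moments of this joint
distribution follow by exchanging expectation and integration:
\[
  \mathbb{E}[Z] = \int \mu(x)\, p(x)\, \mathrm{d}x; \qquad
  \operatorname{var}[Z] = \iint K(x, x')\, p(x)\, p(x')\, \mathrm{d}x\, \mathrm{d}x'; \qquad
  \operatorname{cov}[Z, f(x)] = \int K(x, x')\, p(x')\, \mathrm{d}x',
\]
so conditioning on a (possibly noisy) observation of 𝑍 proceeds exactly as for any other quantity
sharing a joint Gaussian distribution with 𝑓.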

2.7 existence and uniqueness of global maxima


The primary use of gps in Bayesian optimization is to inform optimiza-
tion decisions, which will be our focus for the majority of this book.
Before continuing down this path, we pause to consider whether global
optimization of a gp-distributed function is a well-posed problem, in par-
ticular, whether the model guarantees the existence of a global maximum
at all.
Consider a function 𝑓 : X β†’ ℝ with distribution GP (𝑓 ; πœ‡, 𝐾), and
consider the location and value of its global optimum, if one exists:
\[
  x^* = \operatorname*{arg\,max}_{x \in \mathcal{X}} f(x);
  \qquad
  f^* = \max_{x \in \mathcal{X}} f(x) = f(x^*).
\]

As 𝑓 is unknown, these quantities are random variables. Many Bayesian
optimization algorithms operate by reasoning about the distributions of
(and uncertainties in) these quantities induced by our belief on 𝑓 (mutual
information and entropy search: Β§ 7.6, p. 135).
There are two technical issues we must address. The first is whether
we can be certain that a globally optimal value 𝑓 βˆ— exists when the objec-
tive function is random. If existence is not guaranteed, then its distribu-
tion is meaningless. The second issue is one of uniqueness: assuming the
objective does attain a maximal value, can we be certain the optimum
is unique? In general π‘₯ βˆ— is a set-valued random variable, and thus its
distribution might have support over arbitrary subsets of the domain,
rendering it complicated to reason about. However, if we could ensure
the uniqueness of π‘₯ βˆ—, its distribution would have support on X rather
than its power set, allowing more straightforward inference.
Both the existence of 𝑓 βˆ— and uniqueness of π‘₯ βˆ— are tacitly assumed
throughout the Bayesian optimization literature when building algo-
rithms based on distributions of these quantities, but these properties
are not guaranteed for arbitrary Gaussian processes. However, we can
ensure these properties hold almost surely under mild conditions.

Existence of global maxima


To begin, guaranteeing the existence of an optimal value is straightfor-
ward if we suppose the domain X is compact, a pervasive assumption
in optimization. This is no coincidence! In this case, if 𝑓 is continuous
then it achieves a global optimum by the extreme value theorem.36 Thus
sample path continuity of 𝑓 and compactness of X is sufficient to ensure
that 𝑓 βˆ— exists almost surely. Both conditions can be readily established:
sample path continuity by following our previous discussion (Β§ 2.5, p. 28),
and compactness of the domain by standard arguments (for example,
ensuring that X βŠ‚ ℝ𝑑 be closed and bounded).

36 w. rudin (1976). Principles of Mathematical Analysis. McGraw–Hill. [theorem 4.16]

Uniqueness of global maxima


We now turn to the question of uniqueness of π‘₯ βˆ—, which obviously only
becomes a meaningful question after presupposing that 𝑓 βˆ— exists. Again,
this condition is easy to ensure almost surely under simple conditions
on the covariance function of a Gaussian process.
kim and pollard considered this issue and provided straightforward
conditions under which the uniqueness of π‘₯ βˆ— is guaranteed for a centered
Gaussian process.37, 38 Namely, no two distinct points in the domain can
have perfectly correlated function values, a natural condition that can
be easily verified.

37 A centered Gaussian process has identically zero mean function πœ‡ ≑ 0.
38 j. kim and d. pollard (1990). Cube Root Asymptotics. The Annals of Statistics 18(1):191–219. [lemma 2.6]
Theorem (kim and pollard, 1990). Let X be a compact metric space.39
Suppose 𝑓 : X β†’ ℝ has distribution GP (𝑓 ; πœ‡ ≑ 0, 𝐾), and that 𝑓 is sample
path continuous. If for all π‘₯, π‘₯ β€² ∈ X with π‘₯ β‰  π‘₯ β€² we have
\[
  \operatorname{var}[\phi - \phi' \mid x, x'] = K(x, x) - 2K(x, x') + K(x', x') \neq 0,
\]
then 𝑓 almost surely has a unique maximum on X.

arcones provided slightly weaker conditions for uniqueness of the
supremum, avoiding the requirement of sample path continuity.40

39 Although unlikely to matter in practice, kim and pollard allow X to be 𝜎-compact and show that the supremum
(rather than the maximum) is unique under the same conditions.
40 m. a. arcones (1992). On the arg max of a Gaussian Process. Statistics & Probability Letters 15(5):373–374.

Counterexamples
Although the above conditions for ensuring existence of 𝑓 βˆ— and unique-
ness of π‘₯ βˆ— are fairly mild, it is easy to construct counterexamples.
Consider a function on the closed unit interval, which we note is
compact: 𝑓 : [0, 1] β†’ ℝ. We endow 𝑓 with a β€œwhite noise”41 Gaussian
process with
\[
  \mu(x) \equiv 0; \qquad K(x, x') = [x = x'].
\]
Now 𝑓 almost surely does not have a maximum. Roughly, because the
value of 𝑓 at every point in the domain is independent of every other,
there will almost always be a point with value exceeding any putative
maximum.42 However, the conditions of sample path continuity were
violated as the covariance is discontinuous at π‘₯ = π‘₯ β€².
We may also construct a Gaussian process that almost surely achieves
a maximum that is not unique. Consider a random function 𝑓 defined
on the (compact) interval [0, 4πœ‹] defined by the parametric model
\[
  f(x) = \alpha \cos x + \beta \sin x,
\]
where 𝛼 and 𝛽 are independent standard normal random variables. Then
𝑓 has a Gaussian process distribution with
\[
  \mu(x) \equiv 0; \qquad K(x, x') = \cos(x - x').
  \tag{2.30}
\]
Here πœ‡ is continuous and 𝐾 is HΓΆlder continuous, and thus 𝑓 is sample
path continuous and almost surely achieves a global maximum. However,
𝑓 is also periodic with period 2πœ‹ with probability one and will thus almost
surely achieve its maximum twice. Note that the covariance function does
not satisfy the conditions outlined in the above theorem, as any input
locations separated by 2πœ‹ have perfectly correlated function values: for
such a pair, 𝐾 (π‘₯, π‘₯ β€²) = cos 2πœ‹ = 1 and the variance of the difference above vanishes.

41 It turns out this naΓ―ve model of white noise has horrible mathematical properties, but it is sufficient for this
counterexample.
42 Let 𝑄 = β„š ∩ [0, 1] = {π‘žπ‘– } be the rationals in the domain and let 𝑓 βˆ— be a putative maximum. Defining πœ™π‘– = 𝑓 (π‘žπ‘– ),
we must have πœ™π‘– ≀ 𝑓 βˆ— for every 𝑖; call this event 𝐴. Define the event π΄π‘˜ by 𝑓 βˆ— exceeding the first π‘˜ elements of 𝑄.
From independence, Pr(π΄π‘˜ ) = ∏_{𝑖=1}^{π‘˜} Pr(πœ™π‘– ≀ 𝑓 βˆ— ) = Ξ¦(𝑓 βˆ— )^π‘˜, so Pr(π΄π‘˜ ) β†’ 0 as π‘˜ β†’ ∞. But {π΄π‘˜ } β†˜ 𝐴, so
Pr(𝐴) = 0, and 𝑓 βˆ— is almost surely not the maximum.

Our counterexample gp without a unique maximum: every sample achieves its maximum twice.

2.8 inference with non-gaussian observations and constraints


Gaussian process inference is tractable when the observed values are
jointly Gaussian distributed with the function of interest (2.6) (inference
from jointly Gaussian distributed observations: Β§ 2.2, p. 18). However,
this may not always hold for all relevant information we may receive.

Figure 2.10: Regression with observations corrupted with heavy-tailed noise. The triangular marks indicate observations
lying beyond the plotted range. Shown is the posterior distribution of an objective function (along with ground truth)
modeling the errors as Gaussian. The posterior is heavily affected by the outliers.

𝑝 (𝑦 | π‘₯, πœ™)
One obvious limitation is an incompatibility with naturally non-
Gaussian observations. A scenario particularly relevant to optimization
is heavy-tailed noise. Consider the data shown in Figure 2.10, where
some observations represent extreme outliers. These errors are poorly
modeled as Gaussian, and attempting to infer the underlying objective
𝑦 function with the additive Gaussian noise model leads to overfitting
and poor predictive performance. A Student-𝑑 error model with 𝜈 β‰ˆ 4
πœ™ degrees of freedom provides a robust alternative:43
A Student-𝑑 error model (solid) with a Gaus-
sian error model (dashed) for reference. The 𝑝 (𝑦 | π‘₯, πœ™) = T (𝑦; πœ™, πœŽπ‘›2 , 𝜈). (2.31)
heavier tails of the Student-𝑑 model can better
explain large outliers. The heavier tails of this model can better explain large outliers; un-
fortunately, the non-Gaussian nature of this model also renders exact
43 k. l. lange et al. (1989). Robust Statistical Mod- inference impossible. We will demonstrate how to overcome this impasse.
eling Using the 𝑑 Distribution. Journal of the Constraints on an objective function, such as bounds on given func-
American Statistical Association 84(408):881–
896. tion values, can also provide valuable information during optimization,
but many natural constraints cannot be reduced to observations that can
be handled in closed form. Several Bayesian optimization policies impose
hypothetical constraints on the objective function when designing each
observation, requiring inference from intractable constraints even when
the observations themselves pose no difficulties.
To see how constraints might arise in optimization, consider a Gaus-
sian process belief on a one-dimensional objective 𝑓, and suppose we
wish to condition on 𝑓 on having a local maximum at a given loca-
differentiability, derivative observations: Β§ 2.6, tion π‘₯. Assuming the function is twice differentiable, we can invoke the
p. 30 second-derivative test to encode this information in two constraints:

𝑓 β€² (π‘₯) = 0; 𝑓 β€²β€² (π‘₯) < 0. (2.32)

We can condition a gp on the first of these conditions by following
our previous discussion. However, no gp is compatible with the second

Figure 2.11: The probability density function of an example distribution (the true distribution) along with 50 samples
drawn independently from the distribution. In Monte Carlo approaches, the distribution is effectively approximated by a
mixture of Dirac delta distributions at the sample locations.

condition as 𝑓 β€²β€² (π‘₯) would necessarily have a Gaussian distribution with


unbounded support (2.29). We need some other means to proceed.

Non-Gaussian observations: general case


We can address both non-Gaussian observations and constraints with
the following general case, which is flexible enough to handle a large
range of information. As in our discussion on exact inference, suppose
there is some vector y sharing a joint Gaussian process distribution with
a function of interest 𝑓 (2.6):
\[
  p(f, \mathbf{y}) = \mathcal{GP}\left(
    \begin{bmatrix} f \\ \mathbf{y} \end{bmatrix};
    \begin{bmatrix} \mu \\ \mathbf{m} \end{bmatrix},
    \begin{bmatrix} K & \kappa^\top \\ \kappa & \mathbf{C} \end{bmatrix}
  \right).
\]
Suppose we receive some information about y in the form of infor-
mation D inducing a non-Gaussian posterior on y. Here, it is convenient
to adopt the language of factor graphs44 and write the resulting posterior
as proportional to the prior weighted by a function 𝑑 (y) encoding the
available information, which may factorize:
\[
  p(\mathbf{y} \mid \mathcal{D}) \propto p(\mathbf{y})\, t(\mathbf{y})
  = \mathcal{N}(\mathbf{y}; \mathbf{m}, \mathbf{C}) \prod_i t_i(\mathbf{y}).
  \tag{2.33}
\]
The functions {𝑑𝑖 } are called factors or local functions that may comprise
a likelihood augmented by any desired (hard or soft) constraints. The
term β€œlocal functions” arises because each factor often depends only on
a low-dimensional subspace of y, often a single entry.45
The posterior on y (2.33) in turn induces a posterior on 𝑓 :
\[
  p(f \mid \mathcal{D}) = \int p(f \mid \mathbf{y})\, p(\mathbf{y} \mid \mathcal{D})\, \mathrm{d}\mathbf{y}.
  \tag{2.34}
\]
At first glance, we may hope to resolve this posterior easily as 𝑝 (𝑓 | y) is
a Gaussian process (2.9–2.10). Unfortunately, the non-Gaussian posterior
on y usually renders the posterior on 𝑓 intractable.

44 f. r. kschischang et al. (2001). Factor Graphs and the Sum–Product Algorithm. ieee Transactions on Information
Theory 47(2):498–519.
45 For example, when observations are conditionally independent given the corresponding function values, the
likelihood factorizes into a product of one-dimensional factors (1.3): 𝑝 (y | x, 𝝓) = ∏_𝑖 𝑝 (𝑦𝑖 | π‘₯𝑖 , πœ™π‘– ).

Monte Carlo sampling


A Monte Carlo approach to approximating the 𝑓 posterior (2.34) begins
by drawing samples from the y posterior (2.33):
\[
  \{\mathbf{y}_i\}_{i=1}^{s} \sim p(\mathbf{y} \mid \mathcal{D}).
\]

Figure 2.12: Regression with observations corrupted with heavy-tailed noise. The triangular marks indicate observations
lying beyond the plotted range. Shown is the posterior distribution of an objective function (along with ground truth)
modeling the errors as Student-𝑑 distributed with 𝜈 = 4 degrees of freedom. The posterior was approximated from
100 000 Monte Carlo samples. Comparing with the additive Gaussian noise model from Figure 2.10, this model effectively
ignores the outliers and the fit is excellent.

We may generate these by appealing to one of numerous Markov chain
Monte Carlo (mcmc) routines.46 One natural choice would be elliptical
slice sampling,47 which is specifically tailored for latent Gaussian models
of this form. Samples from a one-dimensional toy example distribution
are shown in Figure 2.11.
Given posterior samples of y, we may then approximate (2.34) via
the standard Monte Carlo estimator
\[
  p(f \mid \mathcal{D}) \approx \frac{1}{s} \sum_{i=1}^{s} p(f \mid \mathbf{y}_i)
  = \frac{1}{s} \sum_{i=1}^{s} \mathcal{GP}(f; \mu_{\mathcal{D}}^{i}, K_{\mathcal{D}}).
  \tag{2.35}
\]
This is a mixture of Gaussian processes, each of the form in (2.9–2.10).
The posterior mean functions depend on the corresponding y samples,
whereas the posterior covariance functions are identical as there is no
dependence on the observed values. In this approximation, the marginal
belief about any function value is then a mixture of univariate Gaussians:
\[
  p(\phi \mid x, \mathcal{D}) \approx \frac{1}{s} \sum_{i=1}^{s} \mathcal{N}(\phi; \mu_i, \sigma^2);
  \qquad
  \mu_i = \mu_{\mathcal{D}}^{i}(x); \quad \sigma^2 = K_{\mathcal{D}}(x, x).
  \tag{2.36}
\]
Although slightly more complex than the Gaussian marginals of a Gaus-
sian process, this is often convenient enough for most needs.

46 Handbook of Markov Chain Monte Carlo (2011). Chapman & Hall.
47 i. murray et al. (2010). Elliptical Slice Sampling. aistats 2010.
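A minimal sketch of working with this mixture marginal (our addition; Python with NumPy and
SciPy, where the per-sample posterior means are invented placeholders standing in for values
computed from posterior samples of y):

    import numpy as np
    from scipy.stats import norm

    def mixture_marginal_pdf(phi, sample_means, sigma):
        # evaluate (2.36): an equally weighted mixture of normals sharing a common
        # standard deviation but with per-sample means
        return np.mean([norm.pdf(phi, loc=m, scale=sigma) for m in sample_means], axis=0)

    def mixture_moments(sample_means, sigma):
        # first two moments of the mixture: within-sample plus between-sample variance
        return np.mean(sample_means), sigma**2 + np.var(sample_means)

    # hypothetical posterior means at a fixed x from s = 4 posterior samples of y
    pdf = mixture_marginal_pdf(np.linspace(-3, 3, 101),
                               sample_means=[0.1, 0.3, -0.2, 0.0], sigma=0.5)
    mean, var = mixture_moments([0.1, 0.3, -0.2, 0.0], sigma=0.5)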
A Monte Carlo approximation to the posterior for the heavy-tailed
dataset from Figure 2.10 is shown in Figure 2.12. The observations were
modeled as corrupted by Student-𝑑 errors with 𝜈 = 4 degrees of freedom.
The posterior was approximated using a truly excessive number of sam-
ples (100 000, with a burn-in of 10 000) from the y posterior drawn using
elliptical slice sampling.47 The outliers in the data are ignored and the
predictive performance is excellent.

Gaussian approximate inference


An alternative to sampling is approximate inference, where we make
a parametric approximation to the y posterior that yields a tractable
posterior on 𝑓. In particular, if the posterior (2.33) were actually normal,
it would induce a Gaussian process posterior on 𝑓. This insight is the
basis for most approximation schemes.
In this vein, we proceed by first – somehow – approximating the
true posterior over y with a multivariate Gaussian distribution:
\[
  p(\mathbf{y} \mid \mathcal{D}) \approx q(\mathbf{y} \mid \mathcal{D})
  = \mathcal{N}(\mathbf{y}; \tilde{\mathbf{m}}, \tilde{\mathbf{C}}).
  \tag{2.37}
\]
We are free to design this approximation as we see fit. There are several
general-purpose approaches available, distinguished by how they ap-
proach maximizing the fidelity of fitting the true posterior (2.33). These
include the Laplace approximation (Β§ b.1, p. 301), Gaussian expectation
propagation (Β§ b.2, p. 302), and variational Bayesian inference. The first
two of these methods are covered in Appendix b, and nickisch and
rasmussen provide an extensive survey of these and other approaches in
the context of Gaussian process binary classification.48
Regardless of the details of the approximation scheme, the high-level
result is the same – the normal approximation (2.37) in turn induces an
approximate Gaussian process posterior on 𝑓. To demonstrate this, we
consider the posterior on 𝑓 that would arise from a direct observation of
y (2.9–2.10) and integrate against the approximate posterior (2.37):
\[
  p(f \mid \mathcal{D}) \approx \int p(f \mid \mathbf{y})\, q(\mathbf{y} \mid \mathcal{D})\, \mathrm{d}\mathbf{y}
  = \mathcal{GP}(f; \mu_{\mathcal{D}}, K_{\mathcal{D}}),
  \tag{2.38}
\]
where
\[
  \mu_{\mathcal{D}}(x) = \mu(x) + \kappa(x)^\top \mathbf{C}^{-1} (\tilde{\mathbf{m}} - \mathbf{m});
  \qquad
  K_{\mathcal{D}}(x, x') = K(x, x') - \kappa(x)^\top \mathbf{C}^{-1} (\mathbf{C} - \tilde{\mathbf{C}}) \mathbf{C}^{-1} \kappa(x').
  \tag{2.39}
\]
For most approximation schemes, the posterior covariance on 𝑓
simplifies to a nicer, more familiar form. Most approximations to the y
posterior (2.37) yield an approximate posterior covariance of the form
\[
  \tilde{\mathbf{C}} = \mathbf{C} - \mathbf{C} (\mathbf{C} + \mathbf{N})^{-1} \mathbf{C},
  \tag{2.40}
\]
where N is positive definite. Although this might appear mysterious, it
is actually a natural form: it is the posterior covariance that would result
from observing y corrupted by additive Gaussian noise with covariance N
(2.19), except we are now free to design the noise covariance to maximize
the fit. For approximations of this form (2.40), the approximate posterior
covariance function on 𝑓 simplifies to the more familiar
\[
  K_{\mathcal{D}}(x, x') = K(x, x') - \kappa(x)^\top (\mathbf{C} + \mathbf{N})^{-1} \kappa(x').
  \tag{2.41}
\]

48 h. nickisch and c. e. rasmussen (2008). Approximations for Binary Gaussian Process Classification. Journal of
Machine Learning Research 9(Oct):2035–2078.

To demonstrate the power of approximate inference, we return to
our motivating scenario of conditioning a one-dimensional process on
having a local maximum at an identified point π‘₯, which we can achieve by

Figure 2.13: Approximately conditioning a Gaussian process to have a local maximum at the marked point π‘₯. We show
each stage of the conditioning process with a sample drawn from the corresponding posterior. We begin with the
unconstrained process (left), which we condition on the first derivative being zero at π‘₯ using exact inference (middle:
𝑓 β€² (π‘₯) = 0). Finally we use Gaussian expectation propagation to approximately condition on the second derivative being
negative at π‘₯ (right: 𝑓 β€²β€² (π‘₯) < 0).

conditioning the first derivative to be zero and constraining the second
derivative to be negative at π‘₯ (2.32). We illustrate an approximation to
the resulting posterior step-by-step in Figure 2.13, beginning with the
example Gaussian process in the left-most panel. We first condition
the process on the first derivative observation 𝑓 β€² (π‘₯) = 0 using exact
inference (derivative observations: Β§ 2.6, p. 32); the result is shown in the
middle panel. Both the updated posterior mean and the sample reflect
this information; however, the sample displays a local minimum at π‘₯, as
the second-derivative constraint has not yet been addressed.
To incorporate the second-derivative constraint, we begin with this
updated gp and consider the second derivative β„Ž = 𝑓 β€²β€² (π‘₯), which is
Gaussian distributed prior to the constraint (2.29):
\[
  p(h) = \mathcal{N}(h; m, s^2).
\]
The negativity constraint induces a posterior on β„Ž incorporating the
factor [β„Ž < 0] (2.33); see Figure 2.14:
\[
  p(h \mid \mathcal{D}) \propto p(h)\, [h < 0].
\]
The result is a truncated normal posterior on β„Ž. We may use Gaussian
expectation propagation, which is especially convenient for handling
bound constraints of this form, to produce a Gaussian approximation:
\[
  p(h \mid \mathcal{D}) \approx q(h \mid \mathcal{D}) = \mathcal{N}(h; \tilde{m}, \tilde{s}^2).
\]
Incorporating the updated belief on β„Ž into the Gaussian process (2.39)
yields the approximate posterior in the right-most panel of Figure 2.13.
Although there is still some residual probability that the second derivative
is positive at π‘₯ in the approximate posterior (approximately 8%; see Figure
2.14), the belief reflects the desired information reasonably faithfully.
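For a single constraint factor of this form, the Gaussian approximation amounts to matching the
moments of the truncated normal; a minimal sketch (our addition; Python with SciPy, with an
invented prior belief about β„Ž) computing mΜƒ and 𝑠̃²:

    import numpy as np
    from scipy.stats import norm

    def truncated_normal_moments(m, s):
        # moments of N(h; m, s^2) restricted to h < 0 -- the moment-matched Gaussian
        # approximation to the constrained posterior
        beta = -m / s                      # standardized truncation point
        lam = norm.pdf(beta) / norm.cdf(beta)
        m_tilde = m - s * lam
        s2_tilde = s**2 * (1 - beta * lam - lam**2)
        return m_tilde, s2_tilde

    # hypothetical prior belief about h = f''(x) before imposing the constraint
    m_tilde, s2_tilde = truncated_normal_moments(m=0.5, s=1.0)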

Figure 2.14: A demonstration of Gaussian expectation propagation. On the left we have a Gaussian belief on the second
derivative, 𝑝 (β„Ž). We wish to constrain this value to be negative, introducing a step-function factor encoding the
constraint, [β„Ž < 0]. The resulting distribution, 𝑝 (β„Ž | D) ∝ 𝑝 (β„Ž) [β„Ž < 0], is non-Gaussian (right), but we can approximate
it with a Gaussian (the Gaussian ep approximation), which induces an updated gp posterior on the function approximately
incorporating the constraint.

Going beyond this example, we may use the approach outlined above
to realize a general framework for Bayesian nonlinear regression by
combining a gp prior on a latent function with an observation model
appropriate for the task at hand, then approximating the posterior as
desired. The convenience and modeling flexibility offered by Gaussian
processes can easily justify any extra effort required for approximating
the posterior. This can be seen as a nonlinear extension of the well-known
family of generalized linear models.49
This approach is quite popular and has been realized countless times.
Notable examples include binary classification using a logistic or probit
observation model,50 modeling point processes as a nonhomogeneous
Poisson process with unknown intensity,51, 52 and robust regression with
heavy-tailed additive noise such as Laplace53 or Student-𝑑54, 55 distributed
errors. With regard to the latter and our previous heavy-tailed noise
example, a Laplace approximation to the posterior for the data in Figures
2.10–2.12 with the Student-𝑑 observation model produces an approximate
posterior in excellent agreement with the Monte Carlo approximation
in Figure 2.12; see Figure 2.15. The cost of approximate inference in this
case was dramatically (several orders of magnitude) cheaper than Monte
Carlo sampling.

49 p. mccullagh and j. a. nelder (1989). Generalized Linear Models. Chapman & Hall.
50 h. nickisch and c. e. rasmussen (2008). Approximations for Binary Gaussian Process Classification. Journal of
Machine Learning Research 9(Oct):2035–2078.
51 j. mΓΈller et al. (1998). Log Gaussian Cox Processes. Scandinavian Journal of Statistics 25(3):451–482.
52 r. p. adams et al. (2009). Tractable Nonparametric Bayesian Inference in Poisson Processes with Gaussian Process
Intensities. icml 2009.
53 m. kuss (2006). Gaussian Process Models for Robust Regression, Classification, and Reinforcement Learning.
Ph.D. thesis. Technische UniversitΓ€t Darmstadt. [Β§ 5.4]
54 r. m. neal (1997). Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and
Classification. Technical report (9702). Department of Statistics, University of Toronto.
55 p. jylΓ€nki et al. (2011). Robust Gaussian Process Regression with a Student-𝑑 Likelihood. Journal of Machine
Learning Research 12(99):3227–3257.

2.9 summary of major ideas

Gaussian processes have been studied – in one form or another – for over
100 years.56 Although we have covered a lot of ground in this chapter, we
have only scratched the surface of an expansive body of literature. A good
entry point to that literature is rasmussen and williams’s monograph,
which focuses on machine learning applications of Gaussian processes
but also covers their theoretical underpinnings and properties in depth.57
A good companion to this work is the book of adler and taylor, which
takes a deep dive into the properties and geometry of sample paths,
including statistical properties of their maxima.58

56 diaconis identified an early application of gps by poincarΓ© for nonlinear regression: p. diaconis (1988). Bayesian
Numerical Analysis. In: Statistical Decision Theory and Related Topics iv; h. poincarΓ© (1912). Calcul des probabilitΓ©s.
Gauthier–Villars.
57 c. e. rasmussen and c. k. i. williams (2006). Gaussian Processes for Machine Learning. mit Press.
58 r. j. adler and j. e. taylor (2007). Random Fields and Geometry. Springer–Verlag.

Figure 2.15: A Laplace approximation to the posterior from Figure 2.12.

Fortunately, the basic definitions and properties covered in Β§ 2.1 and the
exact inference procedure covered in Β§ 2.2 already provide a sufficient
foundation for the majority of practical applications of Bayesian opti-
mization. This material also provides sufficient background knowledge
for the majority of the remainder of the book. However, we wish to
underscore the major results from this chapter at a high level.
β€’ Gaussian processes extend the multivariate normal distribution to model
functions on infinite domains. As in the finite-dimensional case, Gaussian
processes are specified by their first two moments – a mean function
and a positive-definite covariance function – which endow any finite set
of function values with a multivariate normal distribution (2.2–2.3).
β€’ Conditioning a Gaussian process on function observations that are either
exact or corrupted by additive Gaussian noise yields a Gaussian process
posterior with updated moments reflecting the assumptions in the prior
and the information in the observations (2.9–2.10).
β€’ In fact, we may condition a Gaussian process on any observations sharing
a joint Gaussian distribution with the function of interest (inference from
arbitrary joint Gaussian observations: Β§ 2.2, p. 22).
β€’ In the case of exact inference, the posterior moments of a Gaussian
process can be rewritten in terms of correlations among function values
and 𝑧-scores of the observed values in a manner that may be more
intuitive than the standard formulas (interpretation of posterior moments:
Β§ 2.2, p. 21).
β€’ We may extend Gaussian processes to jointly model multiple correlated
functions via careful bookkeeping, a construction known as a joint
Gaussian process (Β§ 2.4, p. 26). Joint gps are widely used in optimization
settings involving multiple objectives and/or cheaper surrogates for an
expensive objective (Extensions and Related Settings: Chapter 11, p. 245).
β€’ Continuity (Β§ 2.5, p. 28) and differentiability (Β§ 2.6, p. 30) of Gaussian
process sample paths can be guaranteed under mild assumptions on the
mean and covariance functions. When these functions are sufficiently
differentiable, a gp-distributed function shares a joint gp distribution
with its gradient (2.28). This joint distribution allows us to condition a
Gaussian process on (potentially noisy) derivative observations (Β§ 2.6, p. 32).
β€’ The existence and uniqueness of global maxima for Gaussian process
sample paths can be guaranteed under mild assumptions on the mean
and covariance functions (Β§ 2.7, p. 33). Establishing these properties
ensures that the location π‘₯ βˆ— and value 𝑓 βˆ— of the global maximum are
well-founded random variables, which will be critical for some
optimization methods introduced later in the book.59
β€’ Inference from non-Gaussian observations and constraints is possible
via Monte Carlo sampling or Gaussian approximate inference (Β§ 2.8, p. 35).

59 In particular, policies grounded in information theory under the umbrella of β€œentropy search.” See Β§ 7.6, p. 135 for more.

Looking forward, the focus of this chapter has been on theoretical
rather than practical properties of Gaussian processes. A huge outstand-
ing question is how to actually design a Gaussian process to model a
given system. This will be our focus for the next two chapters. In the
next chapter, we will explore model construction, and in the following
chapter we will consider model assessment in light of available data.
Finally, we have not yet discussed any computational issues inherent
to Gaussian process inference, including, most importantly, how the
cost of computing the posterior grows with respect to the number of
observations. We will discuss implementation details and scaling in a
dedicated chapter later in the book (implementation and scaling of
Gaussian process inference: Β§ 9.1, p. 201).
3
MODELING WITH GAUSSIAN PROCESSES

Bayesian optimization relies on a faithful model of the system of in-
terest to make well-informed decisions. In fact, even more so than the
details of the optimization policy, the fidelity of the underlying model of
the objective function is the most decisive factor determining optimiza-
tion performance. This has been long acknowledged, with mockus for
example commenting in his seminal work that:1

    The development of some system of a priori distributions suitable
    for different classes of the function 𝑓 is probably the most important
    problem in the application of [the] Bayesian approach to. . . global
    optimization.

1 j. mockus (1974). On Bayesian Methods for Seeking the Extremum. Optimization Techniques: ifip Technical Conference.

The importance of careful modeling has not waned in the intervening
years, but our capacity for building sophisticated models has improved.
Recall our approach to modeling observations obtained during opti-
mization combines a prior process for a (perhaps not directly observable)
objective function (1.8) and an observation model linking the values of
the objective to measured values (1.2) (Bayesian inference of the objective
function: Β§ 1.2, p. 8). Both distributions must be specified before we can
derive a posterior belief about the objective function (1.10) and predictive
distribution for proposed observations (1.7), which together serve as the
key enablers of Bayesian optimization policies.
In practice, the choice of observation model is often noncontrover-
sial,2 and our running prototypes of exact observation and additive
Gaussian noise suffice for many systems. The bulk of modeling effort
is thus spent crafting the prior process. Although specifying a Gaus-
sian process is seemingly as simple as choosing a mean and covariance
function, it can be difficult to intuit appropriate choices without a great
deal of knowledge about the system of interest. As an alternative to
prior knowledge, we may appeal to a data-driven approach, where we
establish a space of candidate models and search through this space for
those offering the best explanation of available data. Almost all Gaussian
process models used in practice are designed in this manner, and we will
lay the groundwork for this approach in this chapter and the next.
As a Gaussian process is specified by its first two moments, data-
driven model design boils down to searching for the prior mean and
covariance functions most harmonious with our observations. This can
be a daunting task as the space of possibilities is limitless. However, we do
not need to begin from scratch: there are mean and covariance functions
available off-the-shelf for modeling a range of behavioral archetypes, and
by systematically combining these components we may model functions
with a rich variety of behavior. We will explore the world of possibilities
in this chapter, while addressing details important to optimization.
Once we have established a space of candidate models, we will require
some mechanism to differentiate possible choices based on their merits, a
process known as model assessment that we will explore at length in the
next chapter (Chapter 4: Model Assessment, Selection, and Averaging, p. 67).

2 However, we may not be certain about some details, such as the scale of observation noise, an issue we will address
in the next chapter.

We will begin the present discussion by revisiting the topic


Figure 3.1: The importance of the prior mean function in determining sample path behavior. The models in the first two
panels differ in their mean function but share the same covariance function. Sample path behavior is identical
up to translation. The model in the third panel features the same mean function as the first panel but a different
covariance function. Samples exhibit dramatically different behavior.

of prior mean and covariance functions with an eye toward practical utility.

3.1 the prior mean function


Recall the mean function of a Gaussian process specifies the expected
value of an arbitrary function value πœ™ = 𝑓 (π‘₯):
\[
  \mu(x) = \mathbb{E}[\phi \mid x].
\]
Although this is obviously a fundamental concern, the choice of prior
mean function has received relatively little consideration in the Bayesian
optimization literature.
There are several reasons for this. To begin, it is actually the covari-
ance function rather than the mean function that largely determines the
behavior of sample paths. This should not be surprising: the mean func-
tion only affects the marginal distribution of function values, whereas
the covariance function can further modify the joint distribution of
function values. To elaborate, consider an arbitrary Gaussian process
GP (𝑓 ; πœ‡, 𝐾). Its sample paths are distributed identically to those from
the corresponding centered process 𝑓 βˆ’ πœ‡, after shifting pointwise by
πœ‡. Therefore the sample paths of any Gaussian process with the same
covariance function are effectively the same up to translation, and it is
the covariance function determining their behavior otherwise; see the
demonstration in Figure 3.1.
It is also important to understand the role of the prior mean func-
tion in the posterior process. Suppose we condition a Gaussian process
GP (𝑓 ; πœ‡, 𝐾) on the observation of a vector y with marginal distribution
(2.7) and cross-covariance function (2.24):
\[
  p(\mathbf{y}) = \mathcal{N}(\mathbf{y}; \mathbf{m}, \mathbf{C});
  \qquad
  \kappa(x) = \operatorname{cov}[\mathbf{y}, \phi \mid x].
\]
The prior mean influences the posterior process only through the poste-
rior mean (2.10):
\[
  \mu_{\mathcal{D}}(x) = \mu(x) + \kappa(x)^\top \mathbf{C}^{-1} (\mathbf{y} - \mathbf{m}).
\]

Figure 3.2: The influence of the prior mean on the posterior mean. We show two Gaussian process posteriors differing
only in their prior mean functions, shown as dashed lines. In the β€œinterpolatory” region between the observations, the
posterior means are mostly determined by the data, but devolve to the respective prior means when extrapolating
outside this region.

We can roughly understand the behavior of the posterior mean by iden-
tifying two regimes determined by the strength of correlation between a
given function value and the observations. In β€œinterpolatory” regions,
where function values have significant correlation with one or more
observed values, the posterior mean is mostly determined by the data
rather than the prior mean. On the other hand, in β€œextrapolatory” regions,
where πœ… (π‘₯) β‰ˆ 0, the data have little influence and the posterior mean
effectively equals the prior mean. Figure 3.2 demonstrates this effect.

Constant mean function


The primary impact of the prior mean on our predictions – and on an
optimization policy informed by these predictions – is in the extrapo-
latory regime. However, extrapolation without strong prior knowledge
can be a dangerous business. As a result, in Bayesian optimization, the
prior mean is often taken to be a constant:
\[
  \mu(x; c) \equiv c,
  \tag{3.1}
\]
in order to avoid any unwanted bias on our decisions caused by spurious
structure in the prior process. This simple choice is supported empiri-
cally by a study comparing optimization performance across a range of
problems as a function of the choice of prior mean.3

3 g. de ath et al. (2020). What Do You Mean? The Role of the Mean Function in Bayesian Optimization. gecco 2020.

When adopting a constant mean, the value of the constant 𝑐 is usually
treated as a parameter to be estimated or (approximately) marginalized,
as we will discuss in the next chapter (model selection and averaging:
§§ 4.3–4.4, p. 73). However, we can actually do better in some cases.
Consider a parametric Gaussian process prior with constant mean (3.1)
and arbitrary covariance function:
\[
  p(f \mid c) = \mathcal{GP}(f; \mu \equiv c, K),
\]
and suppose we place a normal prior on 𝑐:
\[
  p(c) = \mathcal{N}(c; a, b^2).
  \tag{3.2}
\]
Then we can marginalize the unknown constant mean exactly to derive
the marginal Gaussian process
\[
  p(f) = \int p(f \mid c)\, p(c)\, \mathrm{d}c = \mathcal{GP}(f; \mu \equiv a, K + b^2),
  \tag{3.3}
\]

where the uncertainty in the mean has been absorbed into the prior
covariance function. We may now use this prior directly, avoiding any
estimation of 𝑐. The unknown mean will be automatically marginalized
in both the prior and posterior process, and we may additionally derive
the posterior belief over 𝑐 given data if it is of interest.4

4 Noting that 𝑐 and 𝑓 form a joint Gaussian process, we may perform inference as described in Β§ 2.4, p. 26 to reveal
their joint posterior.
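To see where the extra 𝑏² term in (3.3) comes from, a one-line check (our addition) using the
laws of total expectation and covariance:
\[
  \mathbb{E}[\phi] = \mathbb{E}_c\bigl[\mathbb{E}[\phi \mid c]\bigr] = \mathbb{E}[c] = a;
  \qquad
  \operatorname{cov}[\phi, \phi']
  = \mathbb{E}_c\bigl[\operatorname{cov}[\phi, \phi' \mid c]\bigr]
  + \operatorname{cov}_c\bigl[\mathbb{E}[\phi \mid c], \mathbb{E}[\phi' \mid c]\bigr]
  = K(x, x') + b^2.
\]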

Linear combination of basis functions


We may extend the above result to marginalize the weights of an arbitrary
linear combination of basis functions under a normal prior, making this
a particularly convenient class of mean functions. Namely, consider a
parametric mean function of the form
\[
  \mu(x; \boldsymbol{\beta}) = \boldsymbol{\beta}^\top \boldsymbol{\psi}(x),
  \tag{3.4}
\]
where the vector-valued function 𝝍 : X β†’ ℝ𝑛 defines the basis functions
and 𝜷 is a vector of weights.5
Now consider a parametric Gaussian process prior with a mean
function of this form (3.4) and arbitrary covariance function 𝐾. Placing
a multivariate normal prior on 𝜷,
\[
  p(\boldsymbol{\beta}) = \mathcal{N}(\boldsymbol{\beta}; \mathbf{a}, \mathbf{B}),
  \tag{3.5}
\]
and marginalizing yields the marginal prior,6, 7
\[
  p(f) = \mathcal{GP}(f; m, C),
\]
where
\[
  m(x) = \mathbf{a}^\top \boldsymbol{\psi}(x);
  \qquad
  C(x, x') = K(x, x') + \boldsymbol{\psi}(x)^\top \mathbf{B}\, \boldsymbol{\psi}(x').
  \tag{3.6}
\]
We may recover the constant mean case above by taking πœ“ (π‘₯) ≑ 1.

5 The basis functions can be arbitrarily complex, such as the output layer of a deep neural network: j. snoek et al.
(2015). Scalable Bayesian Optimization Using Deep Neural Networks. icml 2015.
6 a. o’hagan (1978). Curve Fitting and Optimal Design for Prediction. Journal of the Royal Statistical Society Series B
(Methodological) 40(1):1–42.
7 c. e. rasmussen and c. k. i. williams (2006). Gaussian Processes for Machine Learning. mit Press. [Β§ 2.7]
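A minimal sketch of the marginal prior (3.6) in practice (our addition; Python with NumPy, where
the kernel, basis functions, and prior on the weights are placeholder choices of ours):

    import numpy as np

    def se_kernel(X1, X2):
        # squared exponential covariance matrix (unit length scale and output scale)
        d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
        return np.exp(-0.5 * d2)

    def psi(X):
        # hypothetical basis functions: a constant term plus a linear term per dimension
        return np.hstack([np.ones((len(X), 1)), X])

    def marginal_prior(X, a, B):
        # mean vector and covariance matrix of f(X) under the marginal prior (3.6)
        P = psi(X)                            # n points x m basis functions
        mean = P @ a
        cov = se_kernel(X, X) + P @ B @ P.T   # K + psi^T B psi on the grid
        return mean, cov

    # example: 1d inputs, vague prior on the constant and slope weights
    X = np.linspace(0, 1, 5)[:, None]
    mean, cov = marginal_prior(X, a=np.zeros(2), B=np.eye(2))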


Other options

We stress that a constant or linear mean function is by no means nec-
essary, and when a system is understood sufficiently well to suggest a
plausible alternative – perhaps the output of a baseline predictive model
– it should be strongly considered. However, it is hard to provide general
advice, as this modeling will be situation dependent.8
One option that might be a reasonable choice in some optimization
contexts is a concave quadratic mean:
\[
  \mu(\mathbf{x}; \mathbf{A}, \mathbf{b}, c) = (\mathbf{x} - \mathbf{b})^\top \mathbf{A}^{-1} (\mathbf{x} - \mathbf{b}) + c,
  \tag{3.7}
\]
where A β‰Ί 0.9 This mean encodes that values near b (according to the
Mahalanobis distance (a.8)) are expected to be higher than those farther
away and could reasonably model an objective function expected to be
β€œbowl-shaped” to a first approximation. The middle panel of Figure 3.1
incorporates a mean of this form; note that the maxima of sample paths
are of course not constrained to agree with that of the prior mean.

8 For an example of such modeling in physics, where the mean function was taken to be the output of a physically
informed model, see: m. a. ziatdinov et al. (2021). Physics Makes the Difference: Bayesian Optimization and Active
Learning via Augmented Gaussian Process. arXiv: 2108.10280 [[Link]-ph].
9 This mean function was proposed in the context of Bayesian optimization (with diagonal A) by j. snoek et al. (2015).
Scalable Bayesian Optimization Using Deep Neural Networks. icml 2015, who also proposed appropriate priors for A
and b. The mean was also proposed in the related context of Bayesian quadrature (see Β§ 2.6, p. 33) by l. acerbi (2018).
Variational Bayesian Monte Carlo. neurips 2018.

3.2 the prior covariance function


The prior covariance function determines the covariance between the
function values corresponding to a pair of input locations π‘₯ and π‘₯ β€²:
\[
  K(x, x') = \operatorname{cov}[\phi, \phi' \mid x, x'].
  \tag{3.8}
\]
The covariance function determines fundamental properties of sample
path behavior, including continuity (Β§ 2.5, p. 28), differentiability (Β§ 2.6,
p. 30), and aspects of the global optima (Β§ 2.7, p. 33), as we have already
seen. Perhaps more so than the mean function, careful design of the
covariance function is critical to ensure fidelity in modeling. We will
devote considerable discussion to this topic, beginning with some
important properties and moving on to useful examples and mechanisms
for systematically modifying and composing multiple covariance functions
together to model complex behavior.
After appropriate normalization, a covariance function 𝐾 may be
loosely interpreted as a measure of similarity between points in the
domain. Namely, given π‘₯, π‘₯ β€² ∈ X, the correlation between the corre-
sponding function values is
\[
  \rho = \operatorname{corr}[\phi, \phi' \mid x, x']
  = \frac{K(x, x')}{\sqrt{K(x, x)\, K(x', x')}},
  \tag{3.9}
\]
and we may interpret the strength of this dependence as a measure of
similarity between the input locations. This intuition can be useful, but
some caveats are in order. To begin, note that correlation may be negative,
which might be interpreted as indicating dis-similarity as the function
values react to information with opposite sign.
Further, for a proposed covariance function 𝐾 to be admissible, it
must satisfy two global consistency properties ensuring that the collec-
tion of random variables comprising 𝑓 are able to satisfy the purported
relationships. First, we can immediately deduce from its definition (3.8)
that 𝐾 must be symmetric in its inputs. Second, the covariance function
must be positive semidefinite; that is, given any finite set of points x βŠ‚ X,
the Gram matrix 𝐾 (x, x) must have only nonnegative eigenvalues.10
To illustrate how positive semidefiniteness ensures statistical validity,
note that a direct consequence is that 𝐾 (π‘₯, π‘₯) = var[πœ™ | π‘₯] β‰₯ 0, and thus
marginal variance is always nonnegative. On a slightly less trivial level,
consider a pair of points x = (π‘₯, π‘₯ β€²) and normalize the corresponding
Gram matrix 𝚺 = 𝐾 (x, x) to yield the correlation matrix:
\[
  \mathbf{P} = \operatorname{corr}[\boldsymbol{\phi} \mid \mathbf{x}]
  = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix},
\]
where 𝜌 is given by (3.9). For this matrix to be valid, we must have
𝜌 ∈ [βˆ’1, 1]. This happens precisely when P is positive semidefinite, as
its eigenvalues are 1 ± 𝜌. Finally, noting that P is congruent to 𝚺,11 we
conclude the implied correlations are consistent if and only if 𝚺 is positive
semidefinite. With more than two points, the positive semidefiniteness
of 𝐾 ensures similar consistency at higher orders.

10 Symmetry guarantees the eigenvalues are real.
11 We have 𝚺 = SPS, where S is diagonal with 𝑆𝑖𝑖 = βˆšΞ£π‘–π‘– .
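A minimal sanity check of this criterion (our addition; Python with NumPy, where both candidate
functions are toy examples of ours rather than anything from the text):

    import numpy as np

    def is_valid_gram(K_fn, X, tol=1e-10):
        # check symmetry and positive semidefiniteness of the Gram matrix of a
        # candidate covariance function on a finite set of points
        G = np.array([[K_fn(x, y) for y in X] for x in X])
        if not np.allclose(G, G.T):
            return False
        return np.min(np.linalg.eigvalsh(G)) >= -tol

    # the squared exponential passes on any point set...
    se = lambda x, y: np.exp(-0.5 * (x - y) ** 2)
    print(is_valid_gram(se, np.linspace(0, 1, 20)))       # True

    # ...whereas an intuitive "similarity" need not be admissible
    bad = lambda x, y: 1.0 - abs(x - y)
    print(is_valid_gram(bad, np.linspace(0, 10, 20)))     # False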

Figure 3.3: Left: a sample from a stationary Gaussian process in two dimensions (stationary and anisotropic). The joint
distribution of function values is translation- but not rotation-invariant, as the function tends to vary faster in some
directions than others. Right: a sample from an isotropic process (stationary and isotropic). The joint distribution of
function values is both translation- and rotation-invariant.

Stationarity, isotropy, and Bochner’s theorem


Some covariance functions exhibit structure giving rise to certain com-
putational benefits. Namely, a covariance function 𝐾 (π‘₯, π‘₯ β€²) that only
depends on the difference π‘₯ βˆ’ π‘₯ β€² is called stationary.12 When convenient,
we will abuse notation and write a stationary covariance function in
terms of a single input, writing 𝐾 (π‘₯ βˆ’ π‘₯ β€²) for 𝐾 (π‘₯, π‘₯ β€²) = 𝐾 (π‘₯ βˆ’ π‘₯ β€², 0). If a
gp has a stationary covariance function and constant mean function (3.1),
then the process itself is also called stationary. A consequence of station-
arity is that the distribution of any set of function values is invariant
under translation; that is, the function β€œacts the same” everywhere from
a statistical viewpoint. The left panel of Figure 3.3 shows a sample from
a 2𝑑 stationary gp, demonstrating this translation-invariant behavior.
Stationarity is a convenient assumption when modeling, as defining
the local behavior around a single point suffices to specify the global
behavior of an entire function. Many common covariance functions
have this property as a result. However, this may not always be a valid
assumption in the context of optimization, as an objective function may
for example exhibit markedly different behavior near the optimum than
elsewhere. We will shortly see some general approaches for addressing
nonstationarity when appropriate.
If X βŠ‚ ℝ𝑛, a covariance function 𝐾 (π‘₯, π‘₯ β€²) only depending on the
Euclidean distance 𝑑 = |π‘₯ βˆ’ π‘₯ β€² | is called isotropic. Again, when convenient,
we will notate such a covariance with 𝐾 (𝑑). Isotropy is a more restrictive
assumption than stationarity – indeed it trivially implies stationarity –
as it implies the covariance is invariant to both translation and rotation,
and thus the function has identical behavior in every direction from
every point. An example sample from a 2𝑑 isotropic gp is shown in the
right panel of Figure 3.3. Many of the standard covariance functions
we will define shortly will be isotropic on first definition, but we will
again develop generic mechanisms to modify them in order to induce
anisotropic behavior when desired.
bochner’s theorem is a landmark result characterizing stationary
covariance functions in terms of their Fourier transforms:13, 14

12 Of course this definition requires π‘₯ βˆ’ π‘₯ β€² to be well defined. This is trivial in Euclidean spaces; a fairly general
treatment for more exotic spaces would assume an abelian group structure on X with binary operation + and
inverse βˆ’ and define π‘₯ βˆ’ π‘₯ β€² = π‘₯ + (βˆ’π‘₯ β€²).
13 s. bochner (1933). Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse. Mathematische Annalen
108:378–410.
14 We do not quote the most general version of the theorem here; the result can be extended to complex-valued
covariance functions on arbitrary locally compact abelian groups if necessary. It is remarkably universal.

Theorem (bochner, 1933). A continuous function 𝐾 : ℝ𝑛 β†’ ℝ is positive
semidefinite (that is, represents a stationary covariance function) if and
only if we have
\[
  K(\mathbf{x}) = \int \exp(2\pi i\, \mathbf{x}^\top \boldsymbol{\xi})\, \mathrm{d}\nu,
\]
where 𝜈 is a finite, positive Borel measure on ℝ𝑛. Further, this measure
is symmetric around the origin; that is, 𝜈 (𝐴) = 𝜈 (βˆ’π΄) for any Borel set
𝐴 βŠ‚ ℝ𝑛, where βˆ’π΄ is the β€œnegation” of 𝐴: βˆ’π΄ = {βˆ’π‘Ž | π‘Ž ∈ 𝐴}.

To summarize, bochner’s theorem states that the Fourier transform of
any stationary covariance function on ℝ𝑛 is proportional to a probability
measure and vice versa; the constant of proportionality is 𝐾 (0). The
measure 𝜈 corresponding to 𝐾 is called the spectral measure of 𝐾. When
a corresponding density function πœ… exists, it is called the spectral density
of 𝐾 and forms a Fourier pair with 𝐾:
\[
  K(\mathbf{x}) = \int \exp(2\pi i\, \mathbf{x}^\top \boldsymbol{\xi})\, \kappa(\boldsymbol{\xi})\, \mathrm{d}\boldsymbol{\xi};
  \qquad
  \kappa(\boldsymbol{\xi}) = \int \exp(-2\pi i\, \mathbf{x}^\top \boldsymbol{\xi})\, K(\mathbf{x})\, \mathrm{d}\mathbf{x}.
  \tag{3.10}
\]
The symmetry of the spectral measure implies a similar symmetry in
the spectral density: πœ… (𝝃 ) = πœ… (βˆ’πƒ ) for all 𝝃 ∈ ℝ𝑛.
bochner’s theorem is surprisingly useful in practice, allowing us to
approximate an arbitrary stationary covariance function by approximat-
ing (e.g., by modeling or sampling from) its spectral density. This is the
basis of the spectral mixture covariance described in the next section, as
well as the sparse spectrum approximation scheme (Β§ 8.7, p. 178), which
facilitates the computation of some Bayesian optimization policies.
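As a quick worked example of this correspondence (our addition, not part of the original text): in
one dimension the squared exponential covariance 𝐾se (𝑑) = exp(βˆ’π‘‘Β²/2) has spectral density
\[
  \kappa_{\mathrm{se}}(\xi)
  = \int \exp(-2\pi i\, x \xi)\, e^{-x^2/2}\, \mathrm{d}x
  = \sqrt{2\pi}\, \exp(-2\pi^2 \xi^2)
  = \mathcal{N}\bigl(\xi; 0, \tfrac{1}{4\pi^2}\bigr),
\]
a Gaussian density – consistent with the theorem, since 𝐾se (0) = 1 means the spectral density
integrates to one.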

3.3 notable covariance functions


It can be difficult to define a new covariance function for a given sce-
nario de novo, as the positive-semidefinite criterion can be nontrivial to
guarantee for what might otherwise be an intuitive notion of similarity.
In practice, it is common to instead construct covariance functions by
combining and transforming established β€œbuilding blocks” modeling var-
ious atomic behaviors while following rules guaranteeing the result will
be valid. We describe several useful examples below.15
Our presentation will depart from most in that several of the covari-
ance functions below will initially be defined without parameters that
some readers may be expecting. We will shortly demonstrate how cou-
pling these covariance functions with particular transformations of the
function domain and output gives rise to common covariance function
parameters such as characteristic length scales and output scales.

15 For a more complete survey, see c. e. rasmussen and c. k. i. williams (2006). Gaussian Processes for Machine
Learning. mit Press. [chapter 4]

The MatΓ©rn family


If there is one class of covariance functions to be familiar with, it is
the MatΓ©rn family. This is a versatile family of covariance functions
for modeling isotropic behavior on Euclidean domains X βŠ‚ ℝ𝑛 of any

Figure 3.4: Samples from centered Gaussian processes with the MatΓ©rn covariance function with different values of the
smoothness parameter 𝜈 (left to right: 𝜈 = 1/2 (3.11), 𝜈 = 3/2 (3.13), 𝜈 = 5/2 (3.14)). Sample paths with 𝜈 = 1/2 are
continuous but not differentiable; incrementing this parameter by one unit increases the number of continuous
derivatives by one.

desired degree of smoothness, in terms of the differentiability of sample


sample path differentiability: Β§ 2.6, p. 30 paths. The MatΓ©rn covariance 𝐾M(𝜈) depends on a parameter 𝜈 ∈ ℝ>0
determining this smoothness; sample paths from a centered Gaussian
process with this covariance are βŒˆπœˆβŒ‰ βˆ’ 1 times continuously differentiable.
16 In theoretical contexts, general values for the In practice 𝜈 is almost always taken to be a half-integer,16 in which case
smoothness parameter 𝜈 ∈ 𝑅 >0 are consid- the expression for the covariance assumes a simple form as a function
ered, but lead to unwieldy expressions (10.12).
of the Euclidean distance 𝑑 = |π‘₯ βˆ’ π‘₯ β€² |.
To begin with the extremes, the case 𝜈 = 1/2 yields the so-called
exponential covariance function exponential covariance:

𝐾M1/2 (π‘₯, π‘₯ β€²) = exp(βˆ’π‘‘). (3.11)

Sample paths from a centered Gaussian process with exponential co-


variance are continuous but nowhere differentiable, which is perhaps
too rough to be interesting in most optimization contexts. However,
this covariance is often encountered in historical literature. In the one-
dimensional case X ⊂ ℝ, a Gaussian process with this covariance is
known as an Ornstein–Uhlenbeck (ou) process and satisfies a continuous-
time Markov property that renders its posterior moments particularly
convenient.
Taking the limit of increasing smoothness ν → ∞ yields the squared
exponential covariance from the previous chapter:

K_se(x, x′) = exp(−½ d²).    (3.12)

We will refer to the Matérn and the limiting case of the squared expo-
nential covariance functions together as the Matérn family. The squared
exponential covariance is without a doubt the most prevalent covari-
ance function in the statistical and machine learning literature. However,
it may not always be a good choice in practice. Sample paths from a
centered Gaussian process with squared exponential covariance are in-
finitely differentiable, which has been ridiculed as an absurd assumption
for most physical processes.17 stein does not mince words on this, start-
ing off a three-sentence "summary of practical suggestions" with "use
the Matérn model" and devoting significant effort to discouraging the
use of the squared exponential in the context of geostatistics.

Samples from a centered Gaussian process with squared exponential covariance K_se.

17 m. l. stein (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer–Verlag. [§ 1.7]

Between these extremes are the cases ν = 3/2 and ν = 5/2, which
respectively model once- and twice-differentiable functions:

K_M3/2(x, x′) = (1 + √3 d) exp(−√3 d);    (3.13)
K_M5/2(x, x′) = (1 + √5 d + 5/3 d²) exp(−√5 d).    (3.14)
Figure 3.4 illustrates samples from centered Gaussian processes with
different values of the smoothness parameter ν. The ν = 5/2 case in
particular has been singled out as a prudent off-the-shelf choice for
Bayesian optimization when no better alternative is obvious.18

18 j. snoek et al. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. neurips 2012.
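To make the preceding expressions concrete, the following Python sketch (ours, not part of the original text) evaluates the Matérn half-integer covariances (3.11), (3.13), (3.14) and the squared exponential limit (3.12) as functions of the Euclidean distance d; the function names are our own.

import numpy as np

def matern12(d):
    # exponential covariance (3.11)
    return np.exp(-d)

def matern32(d):
    # Matérn covariance with nu = 3/2 (3.13)
    return (1 + np.sqrt(3) * d) * np.exp(-np.sqrt(3) * d)

def matern52(d):
    # Matérn covariance with nu = 5/2 (3.14)
    return (1 + np.sqrt(5) * d + 5 / 3 * d**2) * np.exp(-np.sqrt(5) * d)

def squared_exponential(d):
    # limiting case nu -> infinity (3.12)
    return np.exp(-0.5 * d**2)

Each function returns the correlation between two points separated by distance d, assuming unit output and length scales; scaling to other scales is discussed in § 3.4.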

The spectral mixture covariance


Covariance functions in the Matérn family express fairly simple corre-
lation structure, with the covariance dropping monotonically to zero
as the distance d = |x − x′| increases. All differences in sample path
behavior such as differentiability, etc. are expressed entirely through
nuances in the tail behavior of the covariance functions; see the figure
in the margin.

Some members of the Matérn family and the squared exponential covariance as a function of the distance between inputs. All decay to zero correlation as distance increases.

The Fourier transforms of these covariances are also broadly com-
parable: all are proportional to unimodal distributions centered on the
origin. However, bochner's theorem indicates that there is a vast world
of stationary covariance functions indexed by the entire space of sym-
metric spectral measures, which may have considerably more complex
structure. Several authors have sought to exploit this characterization to
build stationary covariance functions with virtually unlimited flexibility.
A notable contribution in this direction is the spectral mixture covari-
ance function proposed by wilson and adams.19 The idea is simple but
powerful: we parameterize a space of stationary covariance functions
by some suitable family of mixture distributions in the Fourier domain
representing their spectral density. The parameters of this spectral mix-
ture distribution specify a covariance function via the correspondence
in (3.10), and we can make the resulting family as rich as desired by
adjusting the number of components in the mixture. wilson and adams
proposed Gaussian mixtures for the spectral density, which are universal
approximators and have a convenient Fourier transform. We define a
Gaussian mixture spectral density κ as

k(ξ) = ∑_i w_i N(ξ; μ_i, Σ_i);    κ(ξ) = ½ [k(ξ) + k(−ξ)],

where the indirect construction via k ensures the required symmetry.

19 a. g. wilson and r. p. adams (2013). Gaussian Process Kernels for Pattern Discovery and Extrapolation. icml 2013.


Note that the weights {w_i} must be positive but need not sum to unity.
Taking the inverse Fourier transform (3.10), the corresponding covariance
function is

K_sm(x, x′; {w_i}, {μ_i}, {Σ_i}) = ∑_i w_i exp(−2π² (x − x′)⊤ Σ_i (x − x′)) cos(2π (x − x′)⊤ μ_i).    (3.15)

Samples from centered Gaussian processes with two realizations of a Gaussian spectral mixture covariance function, offering a glimpse into the flexibility of this class.

Inspecting this expression, we can see that every covariance function


induced by a Gaussian mixture spectral density is infinitely differentiable,
and one might object to this choice on the grounds of overly smooth sam-
ple paths. This can be mitigated by using enough mixture components
to induce sufficiently complex structure in the covariance (on the order
of ∼5 is common). Another option would be to use a different family of
spectral distributions; for example, a mixture of Cauchy distributions
would induce a family of continuous but nondifferentiable covariance
functions analogous to the exponential covariance (3.11), but this idea
has not been explored.
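The expression (3.15) is straightforward to evaluate directly. The sketch below (ours, not the book's) computes the spectral mixture covariance given the mixture weights w, means mu, and covariances Sigma of the spectral density; all names and shapes are our own assumptions.

import numpy as np

def spectral_mixture(x, xp, w, mu, Sigma):
    """x, xp: (n,) inputs; w: (m,) weights; mu: (m, n) means; Sigma: (m, n, n) covariances."""
    tau = np.asarray(x) - np.asarray(xp)
    k = 0.0
    for w_i, mu_i, Sigma_i in zip(w, mu, Sigma):
        # each component contributes a Gaussian envelope times a cosine, as in (3.15)
        k += w_i * np.exp(-2 * np.pi**2 * tau @ Sigma_i @ tau) \
                 * np.cos(2 * np.pi * tau @ mu_i)
    return k

Adding more components (on the order of five is common, as noted above) enriches the family without changing this evaluation.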

Linear covariance function


Another useful covariance function arises from a Bayesian realization
of linear regression. Let the domain be Euclidean, X ⊂ ℝⁿ, and consider
the model

f(x) = 𝛽 + 𝜷⊤x,

where we have abused notation slightly to distinguish the constant term
from the remaining coefficients. Following our discussion on linear basis
functions (§ 3.1, p. 48), if we take independent20 normal priors on 𝛽 and 𝜷:

p(𝛽) = N(𝛽; a, b²);    p(𝜷) = N(𝜷; a, B),

we arrive at the so-called linear covariance:

K_lin(x, x′; b, B) = b² + x⊤Bx′.    (3.16)

20 Independence is usual but not necessary; an arbitrary joint prior would add a term of 2b⊤x to (3.16), where b = cov[𝜷, 𝛽].

Although this covariance is unlikely to be of any direct use in Bayesian
optimization (linear programming is much simpler!), it can be a useful
component of more complex composite covariance structures.

Samples from a centered Gaussian process with linear covariance K_lin.

3.4 modifying and combining covariance functions


With the notable exception of the spectral mixture covariance, which
can approximate any stationary covariance function, several of the co-
variances introduced in the last section are still too rigid to be useful.
In particular, consider any of the MatΓ©rn family (3.11–3.14). Each
of these covariances encodes several explicit and possibly dubious as-
sumptions about the function of interest. To begin, each prescribes unit
variance for every function value:

var[πœ™ | π‘₯] = 𝐾 (π‘₯, π‘₯) = 1, (3.17)

which is an arbitrary, possibly inappropriate choice of scale. Further, each


of these covariance functions fixes an isotropic characteristic length scale
of correlation of approximately one unit:21 at a separation of |x − x′| = 1,
the correlation between the corresponding function values drops to
roughly
corr[φ, φ′ | x, x′] ≈ 0.5,    (3.18)

21 Although an important concept, there is no clear-cut definition of characteristic length scale. It is simply a convenient separation distance for which correlation remains appreciable, but beyond which correlation begins to noticeably decay.

Figure 3.5: Scaling a stationary covariance by a nonconstant function (here, a smooth bump function of compact support) to yield a nonstationary covariance. Shown: prior mean, prior 95% credible interval, samples, and the scaling function a.

and this correlation continues to drop effectively to zero at a separation


of approximately five units. Again, this choice of scale is arbitrary, and
the assumption of isotropy is particularly restrictive.
In general, a Gaussian process encodes strong assumptions regarding
the joint distribution of function values (2.5), which may not be com-
patible with a given function β€œout of the box.” However, we can often
improve model fit by appropriate transformations of the objective. In fact,
linear transformations of function inputs and outputs are almost uni-
versally considered, although only implicitly by introducing parameters
conveying the effects of these transformations. We will show how both
linear and nonlinear transformations of function input and output lead
to more expressive models and give rise to common model parameters.

Scaling function outputs


We first address the issue of scale in function output (3.17) by considering
the statistical effects of arbitrary scaling. Consider a random function
𝑓 : X β†’ ℝ with covariance function 𝐾 and let π‘Ž : X β†’ ℝ be a known
scaling function.22 Then the pointwise product af : x ↦ a(x)f(x) has
covariance function

cov[af | a] = a(x)K(x, x′)a(x′),    (3.19)

by the bilinearity of covariance. If the scaling function is constant, a ≡ λ,
then we have
cov[λf | λ] = λ²K.    (3.20)
This simple result allows us to extend a "base" covariance K with fixed
scale, as in (3.17), to a parametric family with arbitrary scale:
K′(x, x′; λ) = λ²K(x, x′).

22 For this result f need not have a gp distribution.
In this context the parameter λ is known as an output scale, or when
the base covariance is stationary with K(x, x) = 1, the signal variance,
as it determines the variance of any function value: var[φ | x, λ] = λ².
The illustration in the margin shows the effect of scaling the squared
exponential covariance function by a series of increasing output scales.

The squared exponential covariance K_se scaled by a range of output scales λ (3.20).

We can also of course consider nonlinear transformations of the
function output as well. This can be useful for modeling constraints –
such as nonnegativity or boundedness – that are not compatible with
the Gaussian assumption. However, a nonlinear transformation of a
Gaussian process is no longer Gaussian, so it is often more convenient
to model the transformed function after "removing the constraint."
We may use the general form of this scaling result (3.19) to transform
a stationary covariance into a nonstationary one, as any nonconstant
scaling is sufficient to break translation invariance. We show an example
of such a transformation in Figure 3.5, where we have scaled a stationary
covariance by a bump function to create a prior on smooth functions
with compact support.

Transforming the domain and length scale parameters


We now address the issue of the scaling of correlation as a function of
distance (3.18) by introducing a powerful tool: transforming the domain
of the function of interest into a more convenient space for modeling.

Sample paths from centered gps with smaller (top) and larger (bottom) output scales.

Namely, suppose we wish to reason about a function f : X → ℝ,
and let g : X → Z be a map from the domain to some arbitrary space
Z, which might also be X. If K_Z is a covariance function on Z, then the
composition
K_X(x, x′) = K_Z(g(x), g(x′))    (3.21)
is trivially a covariance function on X. This allows us to define a covari-
ance for f indirectly by jointly designing a map g to another space and
a corresponding covariance K_Z (and mean μ_Z) on that space. This ap-
proach offers a lot of flexibility, as we are free to design these components
as we see fit to impose any desired structure.
We will spend some time exploring this idea, beginning with the
relatively simple but immensely useful case of combining a linear trans-
formation on a Euclidean domain X ⊂ ℝⁿ with an isotropic covariance
on the output. Perhaps the simplest example is the dilation x ↦ x/ℓ,
which simply scales distance by ℓ⁻¹. Incorporating this transformation
into an isotropic base covariance K(d) on X yields a parametric family
of dilated versions:
K′(x, x′; ℓ) = K(d/ℓ).    (3.22)
If the base covariance has a characteristic length scale of one unit, the
length scale of the dilated version will be ℓ; for this reason, this parameter
is simply called the characteristic length scale of the parameterized
covariance (3.22). Adjusting the length scale allows us to model functions
with a range of "wiggliness," where shorter length scale implies more
wiggly behavior; see the margin for examples.

The squared exponential covariance K_se dilated by a range of length scales ℓ (3.22).

Sample paths from centered gps with shorter (top) and longer (bottom) characteristic length scales.
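The two transformations just described are easy to combine in code. The following sketch (ours, under the stated constructions) attaches an output scale (3.20) and a characteristic length scale (3.22) to an isotropic base covariance K(d); the function names are hypothetical.

import numpy as np

def matern52(d):
    # Matérn nu = 5/2 base covariance (3.14), unit output and length scales
    return (1 + np.sqrt(5) * d + 5 / 3 * d**2) * np.exp(-np.sqrt(5) * d)

def scaled_covariance(base, x, xp, output_scale, length_scale):
    d = np.abs(x - xp) / length_scale     # dilate the input, then measure distance (3.22)
    return output_scale**2 * base(d)      # scale the output (3.20)

# e.g. K'(0.3, 0.5) with output scale 2 and length scale 0.1:
value = scaled_covariance(matern52, 0.3, 0.5, output_scale=2.0, length_scale=0.1)

The same pattern applies to any isotropic base covariance from the Matérn family.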

Figure 3.6: Left: a sample from a cen-


tered Gaussian process in two
dimensions with isotropic
squared exponential covari-
ance. Right: a sample from
a centered Gaussian process
with an ard squared exponen-
tial covariance. The length of
the lines on each axis are pro-
portional to the length scale
along that axis.

Taking this one step further, we may consider dilating each axis by a
separate factor:

π‘₯𝑖 ↦→ π‘₯𝑖 /ℓ𝑖 ; x ↦→ [diag β„“] βˆ’1 x, (3.23)

which induces the weighted Euclidean distance

d_ℓ = √( ∑_i ((x_i − x_i′)/ℓ_i)² ).    (3.24)

Geometrically, the effect of this map is to transform surfaces of equal dis-


tance around each point – which represent curves of constant covariance
for an isotropic covariance – from spheres into axis-aligned ellipsoids;
see the figure in the margin. Incorporating into an isotropic base covari-
ance K(d) produces a parametric family of anisotropic covariances with
different characteristic length scales along each axis, corresponding to
the parameters ℓ:
K′(x, x′; ℓ) = K(d_ℓ).    (3.25)

Possible surfaces of equal covariance with the center when combining separate dilation of each axis with an isotropic covariance.
When the length scale parameters are inferred from data, this construc-
tion is known as automatic relevance determination (ard). The motivation
for the name is that if the function has only weak dependence on some
mostly irrelevant dimension of the input, we could hope to infer a very
long length scale for that dimension. The contribution to the weighted
distance (3.24) for that dimension would then be effectively nullified,
and the resulting covariance would effectively β€œignore” that dimension.
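As a concrete sketch (ours, not the book's), an ard covariance simply feeds the weighted distance (3.24) to an isotropic base covariance (3.25); names below are our own.

import numpy as np

def ard_covariance(base, x, xp, length_scales, output_scale=1.0):
    # weighted Euclidean distance with per-dimension length scales (3.24)
    d = np.sqrt(np.sum(((x - xp) / length_scales) ** 2))
    return output_scale**2 * base(d)      # anisotropic covariance (3.25)

# a dimension with a very long length scale contributes almost nothing to d,
# effectively "turning off" its influence on the covariance:
k = ard_covariance(lambda d: np.exp(-0.5 * d**2),
                   np.array([0.1, 0.4]), np.array([0.3, 0.5]),
                   length_scales=np.array([0.2, 10.0]))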
Figure 3.6 shows samples from 2d centered Gaussian processes, com-
paring behavior with an isotropic covariance and an ard modified ver-
sion that contracts the horizontal and expands the vertical axis (see
curves of constant covariance in the margin). The result is anisotropic
behavior with a longer characteristic length scale in the vertical direction
than in the horizontal direction, but with the behavior of local features
remaining aligned with the axes overall.

Surfaces of equal covariance with the center for the examples in Figure 3.6: the isotropic covariance in the left panel (the smaller circle), and the ard covariance in the right panel (the elongated ellipse).

Finally, we may also consider an arbitrary linear transformation
g : x ↦ Ax, which induces the Mahalanobis distance (a.8)

d_A = |Ax − Ax′|.

As before, we may incorporate this map into an isotropic base covariance


𝐾 to realize a family of anisotropic covariance functions:
𝐾 β€² (π‘₯, π‘₯ β€²; A) = 𝐾 (𝑑 A ). (3.26)
Geometrically, an arbitrary linear map can transform surfaces of constant
covariance from spheres into arbitrary ellipsoids; see the figure in the
margin. The sample from the left-hand side of Figure 3.3 was generated by
composing an isotropic covariance with a map inducing both anisotropic
scaling and rotation. The effect of the underlying transformation can be
seen in the shapes of local features, which are not aligned with the axes.
Due to the inherent number of parameters required to specify a
general transformation, this construction is perhaps most useful when
the map is to a much lower-dimensional space: ℝⁿ → ℝᵏ, k ≪ n. This
has been promoted as one strategy for modeling functions on high-
dimensional domains (§ 3.5, p. 61) suspected of having hidden low-dimensional struc-
ture – if this low-dimensional structure is along a linear subspace of the
domain, we could capture it by an appropriately designed projection A.
We will discuss this idea further in the next section.

Possible surfaces of equal covariance with the center when combining an arbitrary linear transformation with an isotropic covariance.

Nonlinear warping
When using a covariance function with an inherent length scale, such as
a Matérn or squared exponential covariance, some linear transformation
of the domain is almost always considered, whether it be simple dilation
(3.22), anisotropic scaling (3.25), or a general transformation (3.26). How-
ever, nonlinear transformations can also be useful for imposing structure
on the domain, a process commonly referred to as warping.
To provide an example that may not often be useful in optimization
but is illustrative nonetheless, suppose we wish to model a function
f : ℝ → ℝ that we believe to be smooth and periodic with period p.
None of the covariance functions introduced thus far would be able to
induce the periodic correlations that this assumption would entail.23 A
construction due to mackay is to compose a map onto a circle of radius
r = p/(2π):24

x ↦ [r cos x, r sin x]⊤    (3.27)

with a covariance function on that space reflecting any desired properties
of f.25 As this map identifies points separated by any multiple of the
period, the corresponding function values are perfectly correlated, as
desired. A sample from a Gaussian process employing this construction
with a Matérn covariance after warping is shown in the margin.
A compelling use of warping is to build nonstationary models by
composing a nonlinear map with a stationary covariance, an idea snoek
et al. explored in the context of Bayesian optimization.26 Many objec-
tive functions exhibit different behavior depending on the proximity to
the optimum, suggesting that nonstationary models may sometimes be
worth exploring. snoek et al. proposed a flexible family of warping func-
tions for optimization problems with box-bounded constraints, where

A sample path of a centered gp with Matérn covariance with ν = 5/2 (3.14) after applying the periodic warping function (3.27).

23 We did see a periodic gp in the previous chapter (2.30); however, that model only had support on perfectly sinusoidal functions.
24 d. j. c. mackay (1998). Introduction to Gaussian Processes. In: Neural Networks and Machine Learning. [§ 5.2]
25 The covariance on the circle is usually inherited from a covariance on ℝ². The result of composing with the squared exponential covariance in particular is often called "the" periodic covariance, but we stress that any other covariance on ℝ² could be used instead.
26 j. snoek et al. (2014). Input Warping for Bayesian Optimization of Non-Stationary Functions. icml 2014.

Figure 3.7: An example of the beta warping method proposed by snoek et al. We show three samples of the stationary Gaussian process prior from Figure 2.1 (above) after applying a nonlinear warping through a beta cdf (3.28) with (α, β) = (4, 4) (right). The length scale is compressed in the center of the domain and expanded near the boundary.

we may take the domain to be the unit cube by scaling and translating as
necessary: X = [0, 1]ⁿ. The idea is to warp each coordinate of the input
via the cumulative distribution function of a beta distribution:

x_i ↦ I(x_i; α_i, β_i),    (3.28)

where (α_i, β_i) are shape parameters and I is the regularized beta func-
tion. This represents a monotonic bijection on the unit interval that can
assume several shapes; see the marginal figure for examples. The map
may contract portions of the domain and expand others, effectively de-
creasing and increasing the length scale in those regions. Finally, taking
α = β = 1 recovers the identity map, allowing us to degrade gracefully
to the unwarped case if desired.

Some examples of beta cdf warping functions (3.28).
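A minimal sketch of the beta cdf warping (3.28), ours rather than the book's: the regularized incomplete beta function I is available as scipy.special.betainc, and the function name below is our own.

import numpy as np
from scipy.special import betainc

def beta_warp(x, alpha, beta):
    """Warp each coordinate of x in [0, 1]^n through a beta cdf (3.28)."""
    return betainc(alpha, beta, np.asarray(x))

# (alpha, beta) = (1, 1) recovers the identity map; (4, 4) compresses the
# length scale in the center of the domain and expands it near the boundary,
# as in Figure 3.7.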
In Figure 3.7 we combine a beta warping on a one-dimensional do-
main with a stationary covariance on the output. The chosen warping
shortens the length scale near the center of the domain and extends it
near the boundary, which might be reasonable for an objective expected
to exhibit the most β€œinteresting” behavior on the interior of its domain.
A recent innovation is to use sophisticated artificial neural networks
as warping maps for modeling functions of high-dimensional data with
complex structure. Notable examples of this approach include the fami-
lies of manifold Gaussian processes introduced by calandra et al.27 and
deep kernels introduced contemporaneously by wilson et al.28 Here the
warping function was taken to be an arbitrary neural network, the output
layer of which was fed into a suitable stationary covariance function.
This gives a highly parameterized covariance function where the pa-
rameters of the base covariance and the neural map become parameters
of the resulting model. In the context of Bayesian optimization, this
can be especially useful when there is sufficient data to learn a useful
representation of the domain via unsupervised methods (neural representation learning: § 8.11, p. 198).

27 r. calandra et al. (2016). Manifold Gaussian Processes for Regression. ijcnn 2016.
28 a. g. wilson et al. (2016). Deep Kernel Learning. aistats 2016.

𝐾1 𝐾2 𝐾1 + 𝐾2
Figure 3.8: Samples from centered Gaussian processes with different covariance functions: (left) a squared exponential
covariance, (middle) a squared exponential covariance with smaller output scale and shorter length scale, and
(right) the sum of the two. Samples from the process with the sum covariance show smooth variation on two
different scales.

Combining covariance functions


In addition to modifying covariance functions via scaling the output
and/or transforming the domain, we may also combine multiple co-
variance functions together to model functions influenced by multiple
random processes.
Let 𝑓, 𝑔 : X β†’ ℝ be two centered, independent (not necessarily Gaus-
sian) random functions with covariance functions 𝐾 𝑓 and 𝐾𝑔 , respectively.
By the properties of covariance, the sum and pointwise product of these
functions have covariance functions with the same structure:

cov[𝑓 + 𝑔] = 𝐾 𝑓 + 𝐾𝑔 ; cov[𝑓 𝑔] = 𝐾 𝑓 𝐾𝑔 , (3.29)

and thus covariance functions are closed under addition and pointwise
multiplication.29 Combining this result with (3.20), we have that any
polynomial of covariance functions with nonnegative coefficients forms
a valid covariance. This enables us to construct infinite families of in-
creasingly complex covariance functions from simple components.

29 The assumption of the processes being centered is needed for the product result only; otherwise, there would be additional terms involving scaled versions of each individual covariance as in (3.19). The sum result does not depend on any assumptions regarding the mean functions.

We may use a sum of covariance functions to model a function with
independent additive contributions, such as random behavior on several
length scales. Precisely such a construction is illustrated in Figure 3.8.
If the covariance functions are nonnegative and have roughly the same
scale, the effect of addition is roughly one of logical disjunction: the sum
will assume nontrivial values whenever any one of its constituents does.
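The sum construction of Figure 3.8 is simple to reproduce; the sketch below (ours, with hypothetical scale choices) adds two squared exponential covariances with different output and length scales, which remains a valid covariance by (3.29).

import numpy as np

def se(d):
    return np.exp(-0.5 * d**2)   # squared exponential (3.12)

def sum_of_se(x, xp):
    d = np.abs(x - xp)
    k_long = 1.0**2 * se(d / 1.0)     # larger output scale, longer length scale
    k_short = 0.3**2 * se(d / 0.1)    # smaller output scale, shorter length scale
    return k_long + k_short           # still a valid covariance by (3.29)

Samples from a process with this covariance show smooth variation on two scales at once, as in the right panel of Figure 3.8.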
Meanwhile, a product of covariance functions can loosely be in-
terpreted in terms of logical conjunction, with function values having
appreciable covariance only when every individual covariance function
does. A prototypical example of this effect is a covariance function mod-
eling functions that are β€œalmost periodic,” formed by the product of a
bump-shaped isotropic covariance function such as a squared exponen-
tial (3.12) with a warped version modeling perfectly periodic functions
(3.27). The former moderates the influence of the latter by driving the cor-
relation between function values to zero for inputs that are sufficiently
separated, regardless of their positions in the periodic cycle. We show a
sample from such a covariance in the margin, where the length scale of
the modulation term is three times the period.

A sample from a centered Gaussian process with an "almost periodic" covariance function.

3.5 modeling functions on high-dimensional domains


Optimization on a high-dimensional domain can be challenging, as we
can succumb to the curse of dimensionality if we are not careful. As an
example, consider optimizing an objective function on the unit cube
[0, 1]ⁿ. Suppose we model this function with an isotropic covariance
from the Matérn family, taking the length scale to be ℓ = 1/10 so that
ten length scales span the domain along each axis.30 This choice implies
that function values on the corners of the domain would be effectively
independent, as exp(−10) < 10⁻⁴ (3.11) and exp(−50) is smaller still
(3.12). If we were to demand even a modicum of confidence in these
regions at termination, say by having a measurement within one length
scale of every corner, we would need 2ⁿ observations! This exponential
growth in the number of observations required to cover the domain is
the tyrannical curse of dimensionality.

30 This is far from excessive: the domain for the marginal sampling examples in this chapter spans 15 length scales and there's just enough room for interesting behavior to emerge.

However, compelling objectives do not tend to have so many degrees
of freedom; if they did, we should perhaps give up on the idea of global
optimization altogether. Rather, many authors have noted a tendency
toward low intrinsic dimensionality in real-world problems: that is, most
of the variation in the objective is confined to a low-dimensional sub-
space of the domain. This phenomenon has been noted for example in
hyperparameter optimization31 and optimizing the parameters of neural
networks.32 levina and bickel suggested that "hidden" low-dimensional
structure is actually a universal requirement for success on any task:33

There is a consensus in the high-dimensional data analysis commu-
nity that the only reason any methods work in very high dimensions
is that, in fact, the data are not truly high dimensional.

31 j. bergstra and y. bengio (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research 13:281–305.
32 c. li et al. (2018a). Measuring the Intrinsic Dimension of Objective Landscapes. iclr 2018. arXiv: 1804.08838 [[Link]].
33 e. levina and p. j. bickel (2004). Maximum Likelihood Estimation of Intrinsic Dimension. neurips 2004.

The global optimization community shares a similar consensus: typical


high-dimensional objectives are not β€œtruly” high dimensional. This intu-
ition presents us with an opportunity: if we could only identify inherent
low-dimensional structure during optimization, we could sidestep the
curse of dimensionality by restricting our search accordingly.
Several strategies are available for capturing low intrinsic dimension
with Gaussian process models. The general approach closely follows our
discussion from the previous section: we identify some appropriate map-
ping from the high-dimensional domain to a lower-dimensional space,
then model the objective function after composing with this embedding
(3.21). This is one realization of the general class of manifold Gaussian
processes,34 where the sought-after manifold is low dimensional. Adopt-
ing this approach then raises the issue of identifying useful families of
mappings that can suitably reduce dimension while preserving enough
structure of the objective to keep optimization feasible.

34 r. calandra et al. (2016). Manifold Gaussian Processes for Regression. ijcnn 2016.

Neural embeddings
Given the success of deep learning in designing feature representations
for complex, high-dimensional objects, neural embeddings – as used in

Figure 3.9: An objective function on a two-dimensional domain (left) with intrinsic dimension 1. The entire variation of the objective is determined on the one-dimensional linear subspace Z corresponding to the diagonal black line, which we can model in its inherent dimension (right).

the family of deep kernels35 – present a tantalizing option. Neural embed-
dings have shown some success in Bayesian optimization, where they
can facilitate optimization over complex structured objects by providing
a nice continuous latent space to work in (neural representation learning: § 8.11, p. 198).
snoek et al. demonstrated excellent performance on hyperparameter
tuning tasks by interpreting the output layer of a deep neural network as
a set of custom nonlinear basis functions for Bayesian linear regression,
as in (3.6).36 An advantage of this particular construction is that Gaussian
process inference and prediction is accelerated dramatically by adopting
the linear covariance (3.6) – the cost of inference scales linearly with the
number of observations, rather than cubically as in the general case (cost of Gaussian process inference: § 9.1, p. 201).

35 a. g. wilson et al. (2016). Deep Kernel Learning. aistats 2016.
36 j. snoek et al. (2015). Scalable Bayesian Optimization Using Deep Neural Networks. icml 2015.

Linear embeddings
Another line of attack is to search for a low-dimensional linear subspace
of the domain encompassing the relevant variation in inputs and model
the function after projection onto that space. For an objective f on a
high-dimensional domain X ⊂ ℝⁿ, we consider models of the form

f(x) = g(Ax);    A ∈ ℝᵏˣⁿ,    (3.30)

where g : ℝᵏ → ℝ is a (k ≪ n)-dimensional surrogate for f.
The simplest such approach is automatic relevance determination
(3.25), where we learn separate length scales along each dimension.37
Although the corresponding linear transformation (3.23) does not reduce
dimension, axes with sufficiently long length scales are effectively elimi-
nated, as they do not have strong influence on the covariance. This can
be effective when some dimensions are likely to be irrelevant, but limits
us to axis-aligned subspaces only.
A more flexible option is to consider arbitrary linear transforma-
tions in the model (3.26, 3.30), an idea that has seen significant attention
for Gaussian process modeling in general38 and for Bayesian optimiza-
tion in particular.39 Figure 3.9 illustrates a simple example where a one-
dimensional objective function is embedded in two dimensions in a non-
axis-aligned manner. Both axes would appear important for explaining
the function when using ard, but a one-dimensional subspace suffices

37 j. bergstra and y. bengio (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research 13:281–305.
38 f. vivarelli and c. k. i. williams (1998). Discovering Hidden Features with Gaussian Process Regression. neurips 1998.
39 z. wang et al. (2016b). Bayesian Optimization in a Billion Dimensions via Random Embeddings. Journal of Artificial Intelligence Research 55:361–387.

Figure 3.10: A sample from a gp in two dimensions with the decomposition f(x) = g₁(x₁) + g₂(x₂). Here a nominally two-dimensional function is actually the sum of two one-dimensional components defined along each axis with no interaction. The maximum of the function is achieved at the point (x₁*, x₂*) corresponding to the maxima of the individual components.

if chosen carefully. This approach offers considerably more modeling


flexibility than ard at the expense of a k-fold increase in the number
of parameters that must be specified. However, several algorithms have
been proposed for efficiently identifying a suitable map A,40, 41 and wang
et al. demonstrated success in optimizing objectives in extremely high
dimension by simply searching along a random low-dimensional sub-
space. The authors also provided theoretical guarantees regarding the
recoverability of the global optimum with this approach, assuming the
hypothesis of low intrinsic dimensionality holds.

40 j. djolonga et al. (2013). High-Dimensional Gaussian Process Bandits. neurips 2013.
41 r. garnett et al. (2014). Active Learning of Linear Embeddings for Gaussian Processes. uai 2014.
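The random-subspace idea is easy to sketch. The code below is our own illustration in the spirit of this approach, not a reference implementation: a random projection A plays the role of the map in (3.30), and an isotropic covariance is evaluated on the projected inputs.

import numpy as np

def embedded_covariance(base, x, xp, A, length_scale=1.0):
    # covariance of the composition g(Ax) under an isotropic base covariance
    d = np.linalg.norm(A @ x - A @ xp) / length_scale
    return base(d)

rng = np.random.default_rng(0)
n, k = 100, 2                            # ambient and embedding dimensions, k << n
A = rng.standard_normal((k, n))          # random projection matrix
x, xp = rng.uniform(size=n), rng.uniform(size=n)
value = embedded_covariance(lambda d: np.exp(-0.5 * d**2), x, xp, A)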


If more flexibility is desired, we may represent an objective function
as a sum of contributions on multiple relevant linear subspaces:

f(x) = ∑_i g_i(A_i x).    (3.31)

This decomposition is similar in spirit to the classical family of gener-
alized additive models,42 where the linear maps can be arbitrary and of
variable dimension. If we assume the additive components in (3.31) are
independent, each with Gaussian process prior GP(g_i; μ_i, K_i), then the
resulting model for f is a Gaussian process with additive moments (3.29):

μ(x) = ∑_i μ_i(A_i x);    K(x, x′) = ∑_i K_i(A_i x, A_i x′).

Several specific schemes have been proposed for building such de-
compositions. One convenient approach is to partition the coordinates
of the input into disjoint groups and add a contribution defined on each
subset.43, 44 Figure 3.10 shows an example, where a two-dimensional ob-
jective is the sum of independent axis-aligned components. We might use
such a model when every feature of the input is likely to be relevant but
only through interaction with a limited number of additional variables.

42 t. hastie and r. tibshirani (1986). Generalized Additive Models. Statistical Science 1(3):297–318.
43 k. kandasamy et al. (2015). High Dimensional Bayesian Optimisation and Bandits via Additive Models. icml 2015.
44 j. r. gardner et al. (2017). Discovering and Exploiting Additive Structure for Bayesian Optimization. aistats 2017.
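For the special case of disjoint coordinate groups, the additive covariance in the display above reduces to a sum of low-dimensional kernels. A minimal sketch (ours, with hypothetical names) is:

import numpy as np

def additive_covariance(base, x, xp, groups, length_scales):
    """groups: list of index arrays partitioning the coordinates of x."""
    return sum(
        base(np.linalg.norm((x[g] - xp[g]) / ell))   # one component per group (3.31)
        for g, ell in zip(groups, length_scales)
    )

# e.g. a two-dimensional input split into its two coordinates, as in Figure 3.10:
k = additive_covariance(lambda d: np.exp(-0.5 * d**2),
                        np.array([0.2, 0.7]), np.array([0.4, 0.1]),
                        groups=[np.array([0]), np.array([1])],
                        length_scales=[0.3, 0.3])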

An advantage of a disjoint partition is that we may reduce optimization
of the high-dimensional objective to separate optimization of each of its
lower-dimensional components (3.31). Several other additive schemes
have been proposed as well, including partitions with (perhaps sparsely)
overlapping groups45, 46, 47 and decompositions of the general form (3.31)
with arbitrary projection matrices.48, 49

45 p. rolland et al. (2018). High-Dimensional Bayesian Optimization via Additive Models with Overlapping Groups. aistats 2018.
46 m. mutný and a. krause (2018). Efficient High Dimensional Bayesian Optimization with Additivity and Quadrature Fourier Features. neurips 2018.
47 t. n. hoang et al. (2018). Decentralized High-Dimensional Bayesian Optimization with Factor Graphs. aaai 2018.
48 e. gilboa et al. (2013). Scaling Multidimensional Gaussian Processes Using Projected Additive Approximations. icml 2013.
49 c.-l. li et al. (2016). High Dimensional Bayesian Optimization via Restricted Projection Pursuit Models. aistats 2016.

3.6 summary of major ideas

Specifying a Gaussian process entails choosing a mean and covariance
function for the function of interest. As we saw in the previous chapter,
the structure of these functions has important implications regarding
sample path behavior, and as we will see in the next chapter, important
implications regarding its ability to explain a given set of data.
In practice, the design of a Gaussian process model is usually data-
driven: we establish some space of candidate models to consider, then
search this space for the models providing the best explanation of avail-
able data. In this chapter we offered some guidance for the construction
of models – or parametric spaces of models – as possible explanations of
a given system. We will continue the discussion in the next chapter by
taking up the question of assessing model quality in light of data. Below
we summarize the important ideas arising in the present discussion.
• The mean function of a Gaussian process determines the expected value
of function values (prior mean function: § 3.1, p. 46). Although an important concern, the mean function
can only affect sample path behavior through pointwise translation,
and most interesting properties of sample paths are determined by the
covariance function instead (impact on sample path behavior: Figure 3.1, p. 46 and surrounding discussion).
• Nonetheless, the mean function has important implications for predic-
tion, namely, in extrapolation (impact on extrapolation: Figure 3.2, p. 47 and surrounding discussion). When making predictions in locations
poorly explained by available data – that is, locations where function
values are not strongly correlated with any observation – the prior mean
function effectively determines the posterior predictive mean.
• There are no restrictions on the mean function of a Gaussian process,
and we are free to use any sensible choice in a given scenario. In practice,
unless a better option is apparent, the mean function is usually taken to
have some relatively simple parametric form, such as a constant (3.1) or
a low-order polynomial (3.7). Such choices are both simple and unlikely
to cause grossly undesirable extrapolatory behavior.
• When the mean function includes a linear combination of basis functions,
we may exactly marginalize the coefficients under a multivariate normal
prior (3.5). The result is a marginal Gaussian process where uncertainty
in the linear terms of the mean is absorbed into the covariance function
(3.6). As an important special case, we may marginalize the value of a
constant mean (3.3) under a normal prior (3.2).
• The covariance function of a Gaussian process is critical to determining
the behavior of its sample paths (prior covariance function: § 3.2, p. 49). To be valid, a covariance function must

be symmetric and positive semidefinite. The latter condition can be
difficult to guarantee for arbitrary "similarity measures," but covariance
functions are closed under several natural operations, allowing us to
build complex covariance functions from simple building blocks.
• In particular, sums and pointwise products of covariance functions are
valid covariance functions, and by extension any polynomial expression
of covariance functions with positive coefficients (sums and products of covariance functions: § 3.4, p. 55).
• Many common covariance functions are invariant to translation of their
inputs, a property known as stationarity (§ 3.2, p. 50). An important result known
as bochner's theorem (§ 3.2, p. 51) provides a useful representation for the space of
stationary covariance functions: their Fourier transforms are symmetric,
finite measures, and vice versa. This result has important implications for
modeling and computation, as the Fourier representation can be much
easier to work with than the covariance function itself.
• Numerous useful covariance functions are available "off-the-shelf." The
family of Matérn covariances – and its limiting case the squared expo-
nential covariance (§ 3.3, p. 51) – can model functions with any desired degree of
smoothness (3.11–3.14). A notable special case is the Matérn covariance
with ν = 5/2 (3.14), which has been promoted as a reasonable default.
• The spectral mixture covariance (3.15; § 3.3, p. 53) appeals to bochner's theorem to
provide a parametric family of covariance functions able to approximate
any stationary covariance.
• Covariance functions can be modified by arbitrary scaling of function
outputs (3.19; § 3.4, p. 55) and/or arbitrary transformation of function inputs (3.21; § 3.4, p. 56).
This ability allows us to create parametric families of covariance functions
with tunable behavior.
• Considering arbitrary constant scaling of function outputs gives rise to
parameters known as output scales (3.20).
• Considering arbitrary dilations of function inputs gives rise to parame-
ters known as characteristic length scales (3.22). Taking the dilation to
be anisotropic introduces a characteristic length scale for each input
dimension, a construction known as automatic relevance determination
(ard). With an ard covariance, setting a given dimension's length scale
very high effectively "turns off" its influence on the model.
• Nonlinear warping of function inputs is also possible (Figure 3.7, p. 59 and surrounding discussion). This enables us to
easily build custom nonstationary covariance functions by combining a
nonlinear warping with a stationary base covariance.
• Optimization can be especially challenging in high dimensions due to
the curse of dimensionality. However, if an objective function has intrin-
sic low-dimensional structure (modeling functions on high-dimensional domains: § 3.5, p. 61), we can avoid some of the challenges by
finding a structure-preserving mapping to a lower-dimensional space
and modeling the function on the "smaller" space. This idea has repeat-
edly proven successful, and several general-purpose constructions are
available.
4
MODEL ASSESSMENT, SELECTION, AND AVERAGING

The previous chapter offered a glimpse into the flexibility of Gaussian


processes, which can evidently model functions with a wide range of
behavior. However, a critical question remains: how can we identify
which models are most appropriate in a given situation?
The difficulty of this question is compounded by several factors. To
begin, the number of possible choices is staggering. Any function can
serve as a mean function for a Gaussian process (prior mean function: § 3.1, p. 46), and we may construct
arbitrary complex covariance functions through a variety of mechanisms (prior covariance function: § 3.2, p. 49).
Even if we fix the general form of the moment functions, introducing
natural parameters such as output and length scales (§ 3.4, p. 54) yields an infinite
spectrum of possible models.
Further, many systems of interest act as β€œblack boxes,” about which
we may have little prior knowledge. Before optimization, we may have
only a vague notion of which models might be reasonable for a given
objective function or how any parameters of these models should be set.
We might even be uncertain about aspects of the observation process,
such as the nature or precise scale of observation noise. Therefore, we
may find ourselves in the unfavorable position of having infinitely many
possible models to choose from and no idea how to choose!
Acquiring data, however, provides a way out of this conundrum.
After obtaining some observations of the system, we may determine
which models are the most compatible with the data and thereby establish
preferences over possible choices, a process known as model assessment.
Model assessment is a surprisingly complex and nuanced subject – even
if we limit the scope to Bayesian methods – and no method can rightfully
be called "the" Bayesian approach.1 In this chapter we will present one
convenient framework for model assessment via Bayesian inference over
models, which are evaluated based on their ability to explain observed
data and our prior beliefs.

1 The interested reader can find an overview of this rich subject in: a. vehtari and j. ojanen (2012). A Survey of Bayesian Predictive Methods for Model Assessment, Selection and Comparison. Statistics Surveys 6:142–228.

We will begin our presentation by carefully defining the models we
will be assessing and discussing how we may build useful spaces of mod-
els for consideration (models and model structures: § 4.1, p. 68). With Gaussian processes, these spaces will most
often be built from what we will call model structures, comprising a para-
metric mean function, covariance function, and observation model; in
the context of model assessment, the parameters of these model compo-
nents are known as hyperparameters. We will then show how to perform
Bayesian inference over the hyperparameters of a model structure from
observations (§ 4.2, p. 70), resulting in a model posterior enabling model assessment
and other tasks. We will later extend this process to multiple model
structures (§ 4.5, p. 78) and show how we can even automatically search for better
model structures (automating model structure search: § 4.6, p. 81).
Central to this approach is a fundamental measure of model fit known
as the marginal likelihood of the data or model evidence (§ 4.2, p. 71). Gaussian pro-
cess models are routinely selected by maximizing this score (model selection via map inference: § 4.3, p. 73), which can
produce excellent results when sufficient data are available to unambigu-
ously determine the best-fitting model. However, model construction


in the context of Bayesian optimization is unusual as the expense of


gathering observations relegates us to the realm of small data. Effective
modeling with small datasets requires careful consideration of model
uncertainty: models explaining the data equally well may disagree dras-
tically in their predictions, and committing to a single model may yield
biased predictions with poorly calibrated uncertainty – and disappoint-
ing optimization performance as a result. Model averaging (§ 4.4, p. 74) is one solution
that has proven effective in Bayesian optimization, where the predictions
of multiple models are combined in the interest of robustness.

4.1 models and model structures


In model assessment, we seek to evaluate a space of models according to
their ability to explain a set of observations D = (x, y). Before taking up
this problem in earnest, let us establish exactly what we mean by β€œmodel”
in this context, which is a model for the given observations, rather than
of a latent function alone as was our focus in the previous chapter.
For this discussion we will define a model to be a prior probability
distribution over the measured values y that would result from observing
at a set of locations x: p(y | x). In the overarching approach we have
adopted for this book, a model is specified indirectly via a prior process
on a latent function f and an observation model linking this function to
the observed values:

(p(f), p(y | x, φ)).    (4.1)

Given explicit choices for these components, we may form the desired dis-
tribution by marginalizing the latent function values 𝝓 = f(x) through
the observation model:

p(y | x) = ∫ p(y | x, 𝝓) p(𝝓 | x) d𝝓.    (4.2)

All models we consider below will be of this composite form (4.1), but the
assessment framework we describe will accommodate arbitrary models.

Spaces of candidate models


To proceed, we must establish some space of candidate models we wish
to consider as possible explanations of the observed data.2 Although
this space can in principle be arbitrary, with Gaussian process models
it is convenient to consider parametric collections of models defined
by parametric forms for the observation model and the prior mean and
covariance functions of the latent function. We invested significant effort
in the last chapter laying the groundwork to enable this approach: a
running theme was the introduction of flexible parametric mean and
covariance functions that can assume a wide range of different shapes –
perfect building blocks for expressive model spaces.

2 Although defining a space of candidate models may seem natural and innocuous, this is actually a major point of contention between different approaches to Bayesian model assessment. If we subscribe to the maxim "all models are wrong," we might conclude that the true model will never be contained in any space we define, no matter how expansive. However, some are likely "more wrong" than others, and we can still reasonably establish preferences over the given space.
We will call a particular combination of observation model, prior
model structure mean function πœ‡, and prior covariance function 𝐾 a model structure.

Corresponding to each model structure is a natural model space formed


by exhaustively traversing the joint parameter space:

M = { (p(f | θ), p(y | x, φ, θ)) | θ ∈ Θ },    (4.3)

where

p(f | θ) = GP(f; μ(x; θ), K(x, x′; θ)).

We have indexed the space by a vector θ, the entries of which jointly
specify any necessary parameters from their joint range Θ. The entries
of θ are known as hyperparameters of the model structure, as they pa-
rameterize the prior distribution for the observations, p(y | x, θ) (4.2).
In many cases we may be happy with a single suitably flexible model
structure for the data, in which case we can proceed with the correspond-
ing space (4.3) as the set of candidate models. We may also consider
multiple model structures for the data by taking a discrete union of such
spaces, an idea we will return to later in this chapter (§ 4.5, p. 78).

Example
Let us momentarily take a step back from abstraction and create an ex-
plicit model space for optimization on the interval X = [a, b].3 Suppose
our initial beliefs are that the objective will exhibit stationary behav-
ior with a constant trend near zero, and that our observations will be
corrupted by additive noise with unknown signal-to-noise ratio.

3 The interval can be arbitrary; our discussion will be purely qualitative.

For the observation model, we take homoskedastic additive Gaussian
noise, a reasonable choice when there is no obvious alternative:

p(y | φ, σₙ) = N(y; φ, σₙ²),    (4.4)

and leave the scale of the observation noise σₙ as a hyperparameter.


Turning to the prior process, we assume a constant mean function (3.1)
with a zero-mean normal prior on the unknown constant:

μ(x; c) ≡ c;    p(c) = N(c; 0, b²),

and select the Matérn covariance function with ν = 5/2 (3.14) with un-
known output scale λ (3.20) and unknown length scale ℓ (3.22):

K(x, x′; λ, ℓ) = λ²K_M5/2(d/ℓ).

Following our discussion in the last chapter (§ 3.1, p. 47), we may eliminate one
of the parameters above by marginalizing the unknown constant mean
under its assumed prior,4 leaving us with the identically zero mean
function and an additive contribution to the covariance function (3.3):

μ(x) ≡ 0;    K(x, x′; λ, ℓ) = b² + λ²K_M5/2(d/ℓ).    (4.5)

4 We would ideally marginalize the other parameters as well, but it would not result in a Gaussian process, as we will discuss shortly.

This, combined with (4.4), completes the specification of a model struc-
ture with three hyperparameters: θ = [σₙ, λ, ℓ]⊤. Figure 4.1 illustrates

Figure 4.1: Samples from our example model space for a range of the hyperparameters: σₙ, the observation noise scale, and ℓ, the characteristic length scale. The output scale λ is fixed for each example. Each example demonstrates a sample of the latent function and observations resulting from measurements at a fixed set of 15 locations x. Elements of the model space can model functions with short- or long-scale correlations that are observed with a range of fidelity from virtually exact observation to extreme noise.

samples from the joint prior over the objective function and the observed
values y that would result from measurements at 15 locations x (4.2) for
a range of these hyperparameters. Even this simple model space is quite
flexible, offering degrees of freedom for the variation in the objective
function and the precision of our measurements.

4.2 bayesian inference over parametric model spaces


Given a space of candidate models, we now turn to the question of
assessing the quality of these models in light of data. There are multiple
paths forward,5 but Bayesian inference offers one effective solution. By
accepting that we can never be absolutely certain regarding which model
is the most faithful representation of a given system, we can – as with
anything unknown in the Bayesian approach – treat that "best model"
as a random variable to be inferred from data and prior beliefs.
We will limit this initial discussion to parametric model spaces built
from a single model structure (4.1), which will simplify notation and
allow us to conflate models and their corresponding hyperparameters
θ as convenient. We will consider more complex spaces comprising
multiple alternative model structures presently.

5 a. vehtari and j. ojanen (2012). A Survey of Bayesian Predictive Methods for Model Assessment, Selection and Comparison. Statistics Surveys 6:142–228.
6 As it is most likely that no model among the candidates actually generated the data, some authors have suggested that any choice of prior is dubious. If this bothers the reader, it can help to frame the inference as being over the model "closest to the truth" rather than over the "true model" itself.

Model prior

We first endow the model space with a prior encoding which models are
more plausible a priori, p(θ).6 For convenience, it is common to design
the model hyperparameters such that the uninformative (and possibly
improper) "uniform prior"

p(θ) ∝ 1    (4.6)

Figure 4.2: The dataset for our model assessment example, generated using a hidden model from the space on the facing page.

may be used, in which case the model prior may not be explicitly ac-
knowledged at all. However, it can be helpful to express at least weakly
informative prior beliefs – especially when working with small datasets
– as it can offer gentle regularization away from patently absurd choices.
This should be possible for most hyperparameters in practice. For ex-
ample, when modeling a physical system, it would be unlikely that
interaction length scales of say one nanometer and one kilometer would
be equally plausible a priori; we might capture this intuition with a wide
prior on the logarithm of the length scale.

Model posterior
Given a set of observations D = (x, y), we may appeal to Bayes' theorem
to derive the posterior distribution over the candidate models:

𝑝 (𝜽 | D) ∝ 𝑝 (𝜽 ) 𝑝 (y | x, 𝜽 ). (4.7)

The model posterior provides support to the models most consistent


with our prior beliefs and the observed data. Consistency with the data
is encapsulated by the p(y | x, θ) term, the prior pdf over observations
evaluated on the actual data.7 This value is known as the model evidence
or the marginal likelihood of the data, as it serves as a likelihood in Bayes'
theorem (4.7) and, in our class of latent function models, is computed by
marginalizing the latent function values at the observed locations (4.2).

7 Recall that this distribution is precisely what a model defines: § 4.1, p. 68.

Marginal likelihood and Bayesian Occam's razor

Model assessment becomes trivial in light of the model posterior if we
simply establish preferences over models according to their posterior
probability. When using the uniform model prior (4.6) (perhaps implic-
itly), the model posterior is proportional to the marginal likelihood alone,
which can then be used directly for model assessment.
It is commonly argued that the model evidence encodes automatic
penalization for model complexity, a phenomenon known as Bayesian
Occam's razor.8 mackay outlines a simple argument for this effect by
noting that a model p(y | x) must integrate to unity over all possible
measurements y. Thus if a "simpler" model wishes to become more
"complex" by putting support over a wider range of possible observations,
it can only do so by reducing the support for the datasets that are already
well explained; see the illustration in the margin.
The marginal likelihood of a given dataset can be conveniently com-
puted in closed form for Gaussian process models with additive Gaussian

8 d. j. c. mackay (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press. [chapter 28]

A cartoon of the Bayesian Occam's razor effect due to mackay. Interpreting models as pdfs over measurements y, a "simple" model explains datasets in S well, but not elsewhere. The "complex" alternative model explains datasets outside S better, but in S worse; the probability density must be lower there to explain a broader range of data.

Figure 4.3: The posterior distribution over the model space from Figure 4.1 (the axes – noise scale πœŽπ‘› against length scale β„“ – have ranges compatible with that figure) conditioned on the dataset in Figure 4.2. The output scale is fixed (to its true value) for the purposes of illustration. Significant uncertainty remains in the exact values of the hyperparameters, but the model posterior favors models featuring either short length scales with low noise or long length scales with high noise. The points marked 1–3 are referenced in Figure 4.4; the point marked βˆ— is the map (Figure 4.5).

The marginal likelihood of a given dataset can be conveniently computed in closed form for Gaussian process models with additive Gaussian noise or exact observation. In this case, we have (2.18):

    𝑝 (y | x, 𝜽 ) = N (y; 𝝁, 𝚺 + N),

where 𝝁 and 𝚺 are the prior mean and covariance of the latent objective function values 𝝓 (2.3), and N is the observation noise covariance matrix (the zero matrix for exact observation) – all of which may depend on 𝜽. As this value can be exceptionally small and have high dynamic range, the logarithm of the marginal likelihood is usually preferred for computational purposes (a.6–a.7):

    log 𝑝 (y | x, 𝜽 ) = βˆ’Β½ [ (y βˆ’ 𝝁)⊀ (𝚺 + N)⁻¹ (y βˆ’ 𝝁) + log |𝚺 + N| + 𝑛 log 2πœ‹ ].    (4.8)
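
Before interpreting the terms of this expression, a minimal computational sketch may be helpful. The following Python code evaluates (4.8) via a Cholesky factorization, assuming, purely for illustration, a constant prior mean and a squared-exponential covariance; the function and parameter names and the toy data are our own illustrative assumptions.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def log_marginal_likelihood(x, y, mean, output_scale, length_scale, noise_scale):
        """Evaluate (4.8) for a GP with constant prior mean, a squared-exponential
        covariance, and independent additive Gaussian observation noise."""
        n = len(y)
        mu = np.full(n, mean)                                  # prior mean vector
        d = x[:, None] - x[None, :]                            # pairwise differences
        Sigma = output_scale**2 * np.exp(-0.5 * (d / length_scale)**2)
        N = noise_scale**2 * np.eye(n)                         # noise covariance
        L, lower = cho_factor(Sigma + N, lower=True)           # Cholesky of Sigma + N
        r = y - mu
        alpha = cho_solve((L, lower), r)                       # (Sigma + N)^{-1} (y - mu)
        data_fit = r @ alpha                                   # squared Mahalanobis norm
        log_det = 2 * np.sum(np.log(np.diag(L)))               # log |Sigma + N|
        return -0.5 * (data_fit + log_det + n * np.log(2 * np.pi))

    # toy data, only to show the call signature
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 15)
    y = np.sin(3 * x) + 0.1 * rng.standard_normal(15)
    print(log_marginal_likelihood(x, y, mean=0.0, output_scale=1.0,
                                  length_scale=0.3, noise_scale=0.1))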
The first term of this expression is the sum of the squared Mahalanobis norms (a.8) of the observations under the prior and represents a measure of data fit. The second term serves as a complexity penalty: the volume of any confidence ellipsoid under the prior is proportional to |𝚺 + N|, and thus this term scales according to the volume of the model’s support in observation space. The third term simply ensures normalization.

Return to example

Let us return to our example scenario and model space. We invite the reader to consider the hypothetical set of 15 observations in Figure 4.2 from our example system of interest and contemplate which models from our space of candidates in Figure 4.1 might be the most compatible with these observations.9

9 The dataset was realized using a moderate length scale (30 length scales spanning the domain) and a small amount of additive noise, shown below. But this is impossible to know from inspection of the data alone, and many alternative explanations are just as plausible according to the model posterior!
We illustrate the model posterior given this data in Figure 4.3, where,
in the interest of visualization, we have fixed the covariance output

scale to its true value and set the range of the axes to be compatible with the samples from Figure 4.1. The model prior was designed to be weakly informative regarding the expected order of magnitude of the hyperparameters by taking independent, wide Gaussian priors on the logarithm of the observation noise and covariance length scale.10

10 Both parameters are nonnegative, so the prior has support on the entire parameter range.

Figure 4.4: Posterior distributions (posterior mean; 95% credible intervals for 𝑦 and πœ™) given the observed data corresponding to the three settings of the model hyperparameters marked in Figure 4.3. Although remarkably different in their interpretations, each model represents an equally plausible explanation in the model posterior. Model 1 favors near-exact observations with a short length scale, and models 2–3 favor large observation noise with a range of length scales.
The first observation we can make regarding the model posterior is
that it is remarkably broad, with many settings of the model hyperparam-
eters remaining plausible after observing the data. However, the model
posterior does express a preference for models with either low noise and
short length scale or high noise combined with a range of compatible
length scales. Figure 4.4 provides examples of objective function and
observation posteriors corresponding to the hyperparameters indicated
in Figure 4.3. Although each is equally plausible in the posterior,11 their explanations of the data are diverse.

11 The posterior probability density of these points is approximately 10% of the maximum.

4.3 model selection via posterior maximization


Winnowing down a space of candidate models to a single model for use
in inference and prediction is known as model selection. Model selection
becomes straightforward if we agree to rank candidates according to the
model posterior, as we may then select the maximum a posteriori (map) model (4.7):12

    πœ½Λ† = arg max_𝜽 𝑝 (𝜽 ) 𝑝 (y | x, 𝜽 ).    (4.9)

12 If we only wish to find the maximum, there is no benefit to normalizing the posterior.

When the model prior is flat (4.6), the map model corresponds to the maximum likelihood estimate (mle) of the model hyperparameters.

Figure 4.5: The predictions of the maximum a posteriori (map) model from the example data in Figure 4.2.

Figure 4.5 shows the predictions made by the map model for our running example; in this case, the map hyperparameters are in fact a reasonable match to the parameters used to generate the example dataset.

When the model space is defined over a continuous space of hyperparameters, computation of the map model can be significantly accelerated via gradient-based optimization. Here it is advisable to work in the log domain, where the objective becomes the unnormalized log posterior:

    log 𝑝 (𝜽 ) + log 𝑝 (y | x, 𝜽 ).    (4.10)

The log marginal likelihood is given in (4.8), noting that 𝝁, 𝚺, and N are all implicitly functions of the hyperparameters 𝜽 (the gradient of the log marginal likelihood with respect to 𝜽 is given in Β§ c.1, p. 307). This objective (4.10) is differentiable with respect to 𝜽 assuming the Gaussian process prior moments, the noise covariance, and the model prior are as well, in which case we may appeal to off-the-shelf gradient methods for solving (4.9). However, a word of warning is in order: the model posterior is not guaranteed to be concave and may have multiple local maxima, so multistart optimization is prudent.
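
A minimal sketch of such a multistart scheme follows. It assumes a zero-mean, squared-exponential model with wide Gaussian priors on the log hyperparameters (all illustrative assumptions), and it relies on numerical gradients inside scipy's L-BFGS-B routine, whereas in practice one would supply the analytic gradient (Β§ c.1, p. 307).

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve
    from scipy.optimize import minimize

    def log_posterior(theta, x, y):
        """Unnormalized log posterior (4.10): wide Gaussian priors on the log
        hyperparameters plus the log marginal likelihood (4.8), zero prior mean."""
        log_out, log_ell, log_noise = theta                    # log hyperparameters
        n = len(y)
        d = x[:, None] - x[None, :]
        Sigma = np.exp(2 * log_out) * np.exp(-0.5 * (d * np.exp(-log_ell))**2)
        K = Sigma + np.exp(2 * log_noise) * np.eye(n)
        L, lower = cho_factor(K, lower=True)
        alpha = cho_solve((L, lower), y)
        lml = -0.5 * (y @ alpha + 2 * np.sum(np.log(np.diag(L))) + n * np.log(2 * np.pi))
        log_prior = np.sum(-0.5 * (theta / 3.0)**2)            # wide prior on each log parameter
        return lml + log_prior

    def map_hyperparameters(x, y, n_restarts=10, seed=0):
        """Multistart gradient-based maximization of the unnormalized log posterior."""
        rng = np.random.default_rng(seed)
        best = None
        for _ in range(n_restarts):
            theta0 = rng.normal(scale=1.0, size=3)             # random restart
            res = minimize(lambda t: -log_posterior(t, x, y), theta0, method="L-BFGS-B")
            if best is None or res.fun < best.fun:
                best = res
        return best.x                                          # map log hyperparameters

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 15)
    y = np.sin(3 * x) + 0.1 * rng.standard_normal(15)
    print("map (log output scale, log length scale, log noise):", map_hyperparameters(x, y))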

4.4 model averaging


Reliance on a single model is questionable when the model posterior is
not well determined by the data. For example, in our running example,
a diverse range of models is consistent with the data (Figures 4.3–4.4).
Committing to a single model in this case may systematically bias our
predictions and underestimate predictive uncertainty – note how the
diversity in predictions from Figure 4.4 is lost in the map model (4.5).
An alternative is to marginalize the model with respect to the model posterior, a process known as model averaging, yielding the model-marginal objective posterior and predictive distribution:

    𝑝 (𝑓 | D) = ∫ 𝑝 (𝑓 | D, 𝜽 ) 𝑝 (𝜽 | D) d𝜽 ;    (4.11)
    𝑝 (𝑦 | π‘₯, D) = ∬ 𝑝 (𝑦 | π‘₯, πœ™, 𝜽 ) 𝑝 (πœ™ | π‘₯, D, 𝜽 ) 𝑝 (𝜽 | D) dπœ™ d𝜽,    (4.12)

where we have marginalized the hyperparameters of both the objective and observation models. Model averaging is more consistent with the ideal Bayesian convention of marginalizing nuisance parameters when possible13 and promises robustness to model misspecification, at least over the chosen model space.

13 Although it may be unusual to consider the choice of model a β€œnuisance!”

Figure 4.6: A Monte Carlo estimate of the model-marginal predictive distribution (4.11) for our example scenario using 100 samples drawn from the model posterior in Figure 4.3 (4.14–4.15); see illustration in margin. Samples from the objective function posterior display a variety of behavior due to being associated with different hyperparameters.

Unfortunately, neither of these model-marginal distributions (4.11–4.12) can be computed exactly for Gaussian process models except in some special cases,14 so we must resort to approximation if we wish to pursue this approach. In fact, maximum a posteriori estimation can be interpreted as one rather crude approximation scheme where the model posterior is replaced by a Dirac delta distribution at the map hyperparameters:

    𝑝 (𝜽 | D) β‰ˆ 𝛿 (𝜽 βˆ’ πœ½Λ† ).

This can be defensible when the dataset is large compared to the number of hyperparameters, in which case the model posterior is often unimodal with little residual uncertainty. However, large datasets are the exception rather than the rule in Bayesian optimization, and more sophisticated approximations can pay off when model uncertainty is significant.

14 A notable example is marginalizing the coefficients of a linear prior mean against a Gaussian prior: Β§ 3.1, p. 47.

Monte Carlo approximation


Monte Carlo approximation is one straightforward path forward. Draw-
ing a set of hyperparameter samples from the model posterior,

{πœ½π‘– }𝑠𝑖=1 ∼ 𝑝 (𝜽 | D), (4.13)

yields the following simple Monte Carlo estimates:

    𝑝 (𝑓 | D) β‰ˆ (1/𝑠) βˆ‘_{𝑖=1}^{𝑠} GP(𝑓 ; πœ‡D (πœ½π‘– ), 𝐾D (πœ½π‘– ));    (4.14)
    𝑝 (𝑦 | π‘₯, D) β‰ˆ (1/𝑠) βˆ‘_{𝑖=1}^{𝑠} ∫ 𝑝 (𝑦 | π‘₯, πœ™, πœ½π‘– ) 𝑝 (πœ™ | π‘₯, D, πœ½π‘– ) dπœ™.    (4.15)

The objective function posterior is approximated by a mixture of Gaussian processes corresponding to the sampled hyperparameters, and the posterior predictive distribution for observations is then derived by integrating a Gaussian mixture (2.36) against the observation model.
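
The following sketch shows the moment-matching step for such a mixture at a single test location, assuming we have already computed the conditional predictive mean and variance under each hyperparameter sample (e.g., from any GP implementation); the numerical values here are made up for illustration.

    import numpy as np

    def mixture_predictive(means, variances, weights=None):
        """Moment-matched mean and variance of a weighted mixture of Gaussians,
        one component per hyperparameter sample (cf. 4.14-4.15 and 2.36)."""
        means = np.asarray(means)
        variances = np.asarray(variances)
        if weights is None:
            weights = np.full(len(means), 1.0 / len(means))   # simple Monte Carlo: equal weights
        m = np.sum(weights * means)                            # mixture mean
        v = np.sum(weights * (variances + means**2)) - m**2    # law of total variance
        return m, v

    # Illustrative per-sample predictive moments at one test point x, one entry
    # per sampled hyperparameter vector theta_i (made-up numbers for the demo).
    per_sample_means = [0.21, 0.35, 0.18, 0.40]
    per_sample_variances = [0.04, 0.10, 0.03, 0.12]
    print(mixture_predictive(per_sample_means, per_sample_variances))

The same routine accepts unequal weights, which will be useful for the weighted mixtures arising from the deterministic approximation schemes discussed below.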
Any Markov chain Monte Carlo procedure could be used to generate the hyperparameter samples (4.13); a variation on Hamiltonian Monte Carlo (hmc) such as the no u-turn sampler (nuts) would be a reasonable choice when the gradient of the log posterior (4.10) is available, as it can exploit this information to accelerate mixing.15 The 100 hyperparameter samples used to produce Figure 4.6 are shown in the margin.

15 m. d. hoffman and a. gelman (2014). The No-U-turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15(4):1593–1623.

Figure 4.6 demonstrates a Monte Carlo approximation to the model-marginal posterior (4.11–4.12) for our running example. Comparing with the map approximation in Figure 4.5, the predictive uncertainty of both objective function values and observations has increased considerably due to accounting for model uncertainty in the predictive distributions.

Figure 4.7: An approximation to the model-marginal posterior (4.11) using the central composite design approach proposed by rue et al. A total of nine hyperparameter samples are used for the approximation, illustrated in the margin below.

Deterministic approximation schemes


The downside of Monte Carlo approximation is relatively inefficient use
of the hyperparameter samples – the price of random sampling rather
than careful design. This inefficiency in turn leads to an increased com-
putational burden for inference and prediction from having to derive
a gp posterior for each sample. Several more efficient (but less accu-
rate) alternative approximations for hyperparameter marginalization
have also been proposed. A common simplifying tactic taken by these
cheaper procedures is to approximate the hyperparameter posterior with a multivariate normal via a Laplace approximation (Β§ b.1, p. 301):

    𝑝 (𝜽 | D) β‰ˆ N (𝜽 ; πœ½Λ†, C),    (4.16)

where πœ½Λ† is the map (4.9). Integrating this approximation into (4.11) gives

    𝑝 (𝑓 | D) β‰ˆ ∫ GP(𝑓 ; πœ‡D (𝜽 ), 𝐾D (𝜽 )) N (𝜽 ; πœ½Λ†, C) d𝜽 .    (4.17)

Unfortunately this integral remains intractable due to the nonlinear dependence of the posterior moments on the hyperparameters, but reducing to this common form allows us to derive deterministic approximations against a single assumed posterior.

rue et al. introduced several approximation schemes representing different tradeoffs between efficiency and fidelity.16 Notable among these is a simple, sample-efficient procedure grounded in classical experimental design.

16 h. rue et al. (2009). Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations. Journal of the Royal Statistical Society Series B (Methodological) 71(2):319–392.

Figure 4.8: The approximation to the model-marginal posterior (4.11) for our running example using the approach proposed by osborne et al.

Here a central composite design17 in hyperparameter space is transformed to agree with the moments of (4.16), then used as nodes in a numerical quadrature approximation to (4.17). The resulting approximation again takes the form of a (now weighted) mixture of Gaussian processes (4.14): the map model augmented by a small number of additional models designed to reflect the important variation in the hyperparameter posterior. The number of hyperparameter samples required by this scheme grows relatively slowly with the dimension of the hyperparameter space: less than 100 for |𝜽 | ≀ 8 and less than 1000 for |𝜽 | ≀ 21.18 The nine samples required for our running example are shown in the marginal figure. Figure 4.7 shows the resulting approximate posterior; comparing with the gold-standard Monte Carlo approximation from Figure 4.6, the agreement is excellent.

A Laplace approximation to the model posterior (performed in the log domain) and hyperparameter settings corresponding to the central composite design proposed by rue et al. The samples do a good job covering the support of the true posterior.

17 g. e. p. box and k. b. wilson (1951). On the Experimental Attainment of Optimum Conditions. Journal of the Royal Statistical Society Series B (Methodological) 13(1):1–45.
18 s. m. sanchez and p. j. sanchez (2005). Very Large Fractional Factorial and Central Composite Designs. acm Transactions on Modeling and Computer Simulation 15(4):362–377.

An even more lightweight approximation was proposed by osborne et al., which despite its crudeness is arguably still preferable to map estimation and can be used as a drop-in replacement.19

19 m. a. osborne et al. (2012). Active Learning of Model Evidence Using Bayesian Quadrature. neurips 2012.
20 This is analogous to the linearization step in the extended Kalman filter, whereas the central composite design approach is closer to the unscented Kalman filter in pushing samples through the nonlinear transformation.

This approach again relies on a Laplace approximation to the hyperparameter posterior (4.16–4.17). The key observation is that under the admittedly strong assumption that the posterior mean were in fact linear in 𝜽 and the posterior covariance independent of 𝜽, we could resolve (4.17) in closed form. We proceed by taking the best linear approximation to the posterior mean around the map:20

    πœ‡D (π‘₯; 𝜽 ) β‰ˆ πœ‡D (π‘₯; πœ½Λ† ) + g(π‘₯)⊀ (𝜽 βˆ’ πœ½Λ† );    g(π‘₯) = βˆ‚πœ‡D (π‘₯; 𝜽 )/βˆ‚πœ½ evaluated at πœ½Λ†,

and assuming the map posterior covariance is universal: 𝐾D (𝜽 ) β‰ˆ 𝐾D (πœ½Λ† ). The result is a single Gaussian process approximation to the posterior:

𝑝 (𝑓 | D) β‰ˆ GP (𝑓 ; πœ‡Λ†D , 𝐾ˆD ), (4.18)

where

πœ‡Λ†D (π‘₯) = πœ‡D (π‘₯; πœ½Λ† ); 𝐾ˆD (π‘₯, π‘₯ β€²) = 𝐾D (π‘₯, π‘₯ β€²; πœ½Λ† ) + g(π‘₯) ⊀Cg(π‘₯ β€²).



This is the map model with covariance inflated by a term determined by


the dependence of the posterior mean on the hyperparameters, g, and
the uncertainty in the hyperparameters, C.21
21 This term vanishes if the hyperparameters are completely determined by the data, in which case the approximation regresses gracefully to the map estimate.

osborne et al. did not address how to account for uncertainty in observation model parameters when approximating 𝑝 (𝑦 | π‘₯, D), but we can derive a natural approach for independent additive Gaussian noise with unknown scale πœŽπ‘› . Given π‘₯, let 𝑝 (πœ™ | π‘₯, D) β‰ˆ N (πœ™; πœ‡, 𝜎 2 ) as in (4.18). We must approximate22

    𝑝 (𝑦 | π‘₯, D) β‰ˆ ∫ N (𝑦; πœ‡, 𝜎 2 + πœŽπ‘›2 ) 𝑝 (πœŽπ‘› | π‘₯, D) dπœŽπ‘› .

22 In general we have 𝑝 (𝑦 | π‘₯, D) = ∬ 𝑝 (𝑦 | π‘₯, πœ™, πœŽπ‘› ) 𝑝 (πœ™, πœŽπ‘› | π‘₯, D) dπœ™ dπœŽπ‘› , and we have resolved the integral on πœ™ using the single-gp approximation.

A moment-matched approximation 𝑝 (𝑦 | π‘₯, D) β‰ˆ N (𝑦; π‘š, 𝑠 2 ) is possible by appealing to the law of total variance:

    π‘š = 𝔼[𝑦 | π‘₯, D] β‰ˆ πœ‡;    𝑠 2 = var[𝑦 | π‘₯, D] β‰ˆ 𝜎 2 + 𝔼[πœŽπ‘›2 | π‘₯, D].

If the noise scale is parameterized by its logarithm, then the Laplace approximation (4.16) in particular yields

    𝑝 (log πœŽπ‘› | π‘₯, D) β‰ˆ N (log πœŽπ‘› ; log πœŽΛ†π‘› , 𝑠 2 );    𝔼[πœŽπ‘›2 | π‘₯, D] β‰ˆ πœŽΛ†π‘›2 exp(2𝑠 2 ).

Thus we predict with the map estimate πœŽΛ†π‘› inflated by a factor commensurate with the residual uncertainty in the noise contribution.
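
A small sketch of the resulting predictive computation follows, assuming we are handed the map predictive moments at π‘₯, the gradient g(π‘₯), the Laplace covariance C, and the Laplace approximation for log πœŽπ‘› ; all numerical inputs below are illustrative placeholders rather than quantities from the running example.

    import numpy as np

    def inflated_predictive(mu, sigma2, g, C, log_noise_map, log_noise_var):
        """Approximate predictive moments for an observation y at x: the map
        latent variance is inflated by the hyperparameter-uncertainty term
        g(x)' C g(x) (4.18) and by the expected noise variance under a Laplace
        approximation on log sigma_n (moment matching as described above)."""
        phi_var = sigma2 + g @ C @ g                                   # inflated latent variance
        noise_var = np.exp(2 * log_noise_map) * np.exp(2 * log_noise_var)  # E[sigma_n^2 | D]
        m = mu                                                          # moment-matched mean
        s2 = phi_var + noise_var                                        # moment-matched variance
        return m, s2

    # Illustrative (made-up) inputs: map predictive moments, the gradient of the
    # posterior mean with respect to two hyperparameters, the Laplace covariance
    # over those hyperparameters, and the noise-scale Laplace approximation.
    mu, sigma2 = 0.3, 0.05
    g = np.array([0.2, -0.1])
    C = np.array([[0.09, 0.01], [0.01, 0.04]])
    print(inflated_predictive(mu, sigma2, g, C,
                              log_noise_map=np.log(0.1), log_noise_var=0.25))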
Figure 4.8 shows the resulting approximation for our running exam-
ple. Although not perfect, the predictive uncertainty in the observations
is more faithful than the map model from Figure 4.5, which severely
underestimates the scale of observation noise in the posterior.

4.5 multiple model structures


We have now covered model inference, selection, and averaging with
a single parametric model space (4.1). With a bit of extra bookkeeping,
we may extend this framework to handle multiple model structures
comprising different combinations of parametric prior moments and
observation models.
To begin, we may build a space of candidate models by taking a
discrete union of parametric spaces as in (4.1), with one built from each
desired model structure: {M𝑖 }. It is natural to index this space by (𝜽, M),
where 𝜽 is understood to be a vector of hyperparameters associated with
the specified model structure; the size and interpretation of this vector
may differ across structures. All that remains is to derive our previous
results while managing this compound structure–hyperparameter index.
We may define a model prior over this compound space by com-
bining a prior over the chosen model structures with priors over the
hyperparameters of each:
𝑝 (𝜽, M) = Pr(M) 𝑝 (𝜽 | M). (4.19)
Given data, the model posterior has a similar form as before (4.7):
𝑝 (𝜽, M | D) = Pr(M | D) 𝑝 (𝜽 | D, M). (4.20)

Figure 4.9: The objective and dataset for our multiple-model example.

The structure-conditional hyperparameter posterior 𝑝 (𝜽 | D, M) is as


in (4.7) and may be reasoned about following our previous discussion.
The model structure posterior is then given by

    Pr(M | D) ∝ Pr(M) 𝑝 (y | x, M);    (4.21)
    𝑝 (y | x, M) = ∫ 𝑝 (y | x, 𝜽, M) 𝑝 (𝜽 | M) d𝜽 .    (4.22)

The expression in (4.22) is the normalizing constant of the structure-conditional hyperparameter posterior (4.7), which we could ignore when there was only a single model structure. This integral is in general intractable, but several approximations are feasible. One effective choice is the Laplace approximation (4.16) (Β§ b.1, p. 301), which provides an approximation to the integral as a side effect (b.2). The classical Bayesian information criterion (bic) may be seen as an approximation to this approximation.23

23 s. konishi and g. kitagawa (2008). Information Criteria and Statistical Modeling. Springer–Verlag. [chapter 9]

Model selection may now be pursued by maximizing the model posterior over the model space as before, although we may no longer appeal to gradient methods as the model space is not continuous with multiple model structures. A simple approach would be to find the map hyperparameters for each of the model structures separately, then use these map points to approximate (4.22) for each structure via the Laplace approximation or bic. This would be sufficient to estimate (4.20–4.21) and maximize over the map models.
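
The normalization step itself is simple; the following sketch combines per-structure approximations of the log evidence, however obtained (e.g., Laplace or bic), with a structure prior to form the approximate posterior (4.21). The structure names and log-evidence values below are made up for illustration.

    import numpy as np

    def structure_posterior(log_evidence, log_prior=None):
        """Combine per-structure approximations of log p(y | x, M) with a prior
        over structures to form an approximate structure posterior (4.21)."""
        log_evidence = np.asarray(log_evidence, dtype=float)
        if log_prior is None:
            log_prior = np.zeros_like(log_evidence)            # uniform structure prior
        log_post = log_evidence + log_prior
        log_post -= np.max(log_post)                            # stabilize before normalizing
        post = np.exp(log_post)
        return post / post.sum()

    # Illustrative (made-up) approximate log evidence for five candidate structures.
    structures = ["m5", "lin", "lin x lin", "m5 + lin", "m5 x lin"]
    log_Z = [-24.3, -31.0, -30.2, -22.4, -23.9]
    for name, p in zip(structures, structure_posterior(log_Z)):
        print(f"Pr({name} | D) ~ {p:.1%}")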
Turning to model averaging, the model-marginal posterior over the objective function is:

    𝑝 (𝑓 | D) = βˆ‘_{𝑖} Pr(M𝑖 | D) 𝑝 (𝑓 | D, M𝑖 ).    (4.23)

The structure-conditional, hyperparameter-marginal distribution on


each space 𝑝 (𝑓 | D, M) is as before (4.11) and may be approximated
following our previous discussion. These are now combined in a mixture
distribution weighted by the model structure posterior (4.21).

Multiple structure example


We now present an example of model inference, selection, and averaging
over multiple model structures using the dataset in Figure 4.9.24 The data were sampled from a Gaussian process with linear prior mean (a linear trend with positive slope is evident) and MatΓ©rn 𝜈 = 3/2 prior covariance (3.13), with a small amount of additive Gaussian noise. We also show a sample from the objective function posterior corresponding to the true model generating the data for reference.

24 The data are used as a demo in the code released with: c. e. rasmussen and c. k. i. williams (2006). Gaussian Processes for Machine Learning. mit Press.

Figure 4.10: Sample paths from our example model structures (panels: m5, lin, lin Γ— lin, m5 + lin, m5 Γ— lin).

We build a model space comprising several model structures by augmenting our previous space (the initial model structure of p. 69) with structures incorporating additional
covariance functions. The treatment of the prior mean (unknown con-
stant marginalized against a Gaussian prior) and observation model
(additive Gaussian noise with unknown scale) will remain the same
for all. The model structures reflect a variety of hypotheses positing
potential linear or quadratic behavior:
m5: the Matérn 𝜈 = 5/2 covariance (3.14) from our previous example;
lin: the linear covariance (3.16), where the prior on the slope is vague
and centered at zero and the prior on the intercept agrees with the m5
model;
lin Γ— lin: the product of two linear covariances designed as above, modeling a
latent quadratic function with unknown coefficients;
m5 + lin: the sum of a Matérn 𝜈 = 5/2 and linear covariance designed as in the
corresponding individual model structures; and
m5 Γ— lin: the product of a MatΓ©rn 𝜈 = 5/2 and linear covariance designed as in the
corresponding individual model structures.
Objective function samples from models in each of these structures are
shown in Figure 4.10. Among these, the model structure closest to the
truth is arguably m5 + lin.
Following the above discussion, we find the map hyperparameters
for each of these model structures separately and use a Laplace approx-
imation (4.16) to approximate the hyperparameter posterior on each
space, along with the normalizing constant (4.22). Normalizing over the
structures provides an approximate model structure posterior:

Pr(m5 | D) β‰ˆ 10.8%;
Pr(m5 + lin | D) β‰ˆ 71.8%;
Pr(m5 Γ— lin | D) β‰ˆ 17.0%,

with the remaining model structures (lin and lin Γ— lin) sharing the remaining 0.4%.

Figure 4.11: An approximation to the model-marginal posterior (4.23) for our multiple-model example. The posterior on each model structure is approximated separately as a mixture of Gaussian processes following rue et al. (see Figure 4.7); these are then combined by weighting by an approximation of the model structure posterior (4.21). We show the result with three superimposed, transparent credible intervals, which are shaded with respect to their weight in contributing to the final approximation.

The m5 + lin model structure is the clear winner, and there is strong evidence that the purely polynomial models are insufficient for explaining the data alone.

Figure 4.11 illustrates an approximation to the model-marginal posterior (4.23), approximated by applying rue et al.’s central composite design approach to each of the model structures, then combining these into a Gaussian process mixture by weighting by the approximate model structure posterior. The highly asymmetric credible intervals reflect the diversity in explanations for the data offered by the chosen model structures, and the combined model makes reasonable predictions of our example objective function sampled from the true model.

For this example, averaging over the model structure has important implications regarding the behavior of the resulting optimization policy (on averaging over a space of Gaussian processes in policy computation, see Β§ 8.10, p. 192). Figure 4.12 illustrates a common acquisition function25 built from the off-the-shelf m5 model, as well as from the structure-marginal model. The former chooses to exploit near what it believes is a local optimum, but the latter has a strong belief in an underlying linear trend and chooses to explore the right-hand side of the domain instead. For our example objective function sample, this would in fact reveal the global optimum with the next observation.

25 To be specific, expected improvement: Β§ 7.3, p. 127.

4.6 automating model structure search


We now have a comprehensive framework for reasoning about model
uncertainty, including methods for model assessment, selection, and
averaging across one or multiple model structures. However, it is still not
clear how we should determine which model structures to consider for a
given system. This is critical as our model inference procedure requires

Figure 4.12: Optimization policies built from the map m5 model (left) and the structure-marginal posterior (right). The m5 model chooses to exploit near the local optimum, but the structure-marginal model is aware of the underlying linear trend and chooses to explore the right-hand side as a result.

the space of candidate models to be predefined; equivalently, the model


prior (4.19) is implicitly set to zero for every model outside this space.
Ideally, we would simply enumerate every possible model structure and
average over all of them, but even a naΓ―ve approximation of this ideal
would entail overwhelming computational effort.
However, the set of model structures we consider for a given dataset
can be adaptively tailored as we gather data. One powerful idea is to
appeal to metaheuristics such as local search: by establishing a suitable
space of candidate model structures, we can dynamically explore this
space for the best explanations of available data.

Spaces of candidate model structures


To enable this approach, we must first establish a sufficiently rich space of candidate model structures. duvenaud et al. proposed one convenient mechanism for defining such a space via a simple productive grammar.26

26 d. duvenaud et al. (2013). Structure Discovery in Nonparametric Regression through Compositional Kernel Search. icml 2013.

The idea is to appeal to the closure of covariance functions under addition and pointwise multiplication (Β§ 3.4, p. 60) to systematically build up families of increasingly complex models from simple components. We begin by choosing a set of so-called base kernels, B, modeling relatively simple behavior, then extend this set to an infinite family of compositions via the following context-free grammar:

    𝐾 β†’ 𝐡
    𝐾 β†’ 𝐾 + 𝐾
    𝐾 β†’ 𝐾𝐾
    𝐾 β†’ (𝐾).

The symbol 𝐡 in the first rule represents any desired base kernel. The five
model structures considered in our multiple-structure example above in

fact represent five members of the language generated by this grammar


with the base kernels B = {𝐾M5/2, 𝐾lin } or simply B = {m5, lin}. The
grammar however also generates arbitrarily more complicated expres-
sions such as

m5 + (m5 + lin)m5 (m5 + m5) + lin. (4.24)
We are free to design the base kernels to capture any potential atomic behavior in the objective function. For example, if the domain is high dimensional and we suspect that the objective may depend only on sparse interactions of mostly independent variables, we might design the base kernels to model variations in single variables at a time, then rely on the grammar to generate an array of possible interaction structures.

Samples from objective function models incorporating the example covariance structure (4.24) are shown in the margin.
Other spaces of model structures have also been proposed for automated structure search. With an eye toward high-dimensional domains, gardner et al. for example considered spaces of additive model structures (additive decompositions: Β§ 3.5, p. 61) indexed by every possible partition of the input variables.27 This is an expressive class of model structures, but the number of partitions grows so rapidly that exhaustive search is not feasible.

27 j. r. gardner et al. (2017). Discovering and Exploiting Additive Structure for Bayesian Optimization. aistats 2017.

Searching over model structures


Once a space of candidate model structures has been established, we
may develop a search procedure seeking the most promising structures
to explain a given dataset. Several approaches have been proposed for
this search with a range of complexity, all of which frame the problem in
terms of optimizing some figure of merit over the space. Although any
score could be used in this context, a natural choice is an approximation
to the (unnormalized) model structure posterior (4.21) such as the Laplace
approximation or the Bayesian information criterion, and every method
described below uses one of these two scores.
duvenaud et al. suggested traversing their kernel grammar via greedy search – here, we first evaluate the base kernels, then subject the best among them to the productive rules to generate similar structures to search next.26 We continue in this manner as desired, alternating between
evaluating the newly proposed structures, then using the grammar to
expand around the best-seen structure to generate new proposals. This
simple procedure is easy to implement and offers a strong baseline.
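
The following sketch captures the flavor of this greedy search. Structures are represented as strings and, for brevity, the production rules are applied only at the top level of the incumbent expression; the scoring function is a crude stand-in for an approximation to the structure posterior, and every name here is an illustrative assumption.

    BASE_KERNELS = ["m5", "lin"]                  # base kernels B

    def expand(expr):
        """One-step neighbors under the grammar: replace the whole expression K
        by K + B or K * B for each base kernel B (top-level expansion only)."""
        return [f"({expr} + {b})" for b in BASE_KERNELS] + \
               [f"({expr} * {b})" for b in BASE_KERNELS]

    def greedy_structure_search(score, rounds=3):
        """Greedy traversal: score the base kernels, then repeatedly expand
        around the best structure seen so far and score the new proposals.
        `score` maps a structure to, e.g., an approximate log unnormalized
        structure posterior; higher is better."""
        scored = {k: score(k) for k in BASE_KERNELS}
        for _ in range(rounds):
            best = max(scored, key=scored.get)
            for proposal in expand(best):
                if proposal not in scored:
                    scored[proposal] = score(proposal)
        best = max(scored, key=scored.get)
        return best, scored[best]

    # A stand-in score favoring structures that mention both base kernels while
    # penalizing expression length as a crude complexity proxy; in practice this
    # would be a Laplace or bic approximation to the structure posterior.
    def toy_score(expr):
        return 5.0 * len(set(b for b in BASE_KERNELS if b in expr)) - 0.1 * len(expr)

    print(greedy_structure_search(toy_score))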
malkomes et al. refined this approach by replacing greedy search with Bayesian optimization over the space of model structures.28

28 g. malkomes et al. (2016). Bayesian Optimization for Automated Model Selection. neurips 2016.

As in the duvenaud et al. procedure, the authors pose the problem in terms of maximizing a score over model structures: a Laplace approximation of the (log) unnormalized structure posterior (4.21). This objective function was then modeled using a Gaussian process, which informed a sequential Bayesian optimization procedure seeking to effectively manage the exploration–exploitation tradeoff in the space of candidate structures. The Gaussian process in model space requires a covariance function over model structures, and the authors proposed an exotic β€œkernel kernel” evaluating the similarity of proposed structures in terms of the overlap

between their hyperparameter-marginal priors for the given dataset. The


resulting optimization procedure was found to rapidly locate promising
models across a range of regression tasks.

End-to-end automation
Follow-on work demonstrated a completely automated Bayesian optimization system built on this structure search procedure avoiding any manual modeling at all.29 The key idea was to dynamically maintain a set of plausible model structures throughout optimization. Predictions are made via model averaging over this set, offering robustness to model misspecification when computing the outer optimization policy. Every time a new observation is obtained, the set of model structures is then updated via a continual Bayesian optimization in model space given the new data. This interleaving of Bayesian optimization in data space and model space offered promising performance.

Finally, gardner et al. offered an alternative to optimization over model structures by constructing a Markov chain Monte Carlo routine to sample model structures from their posterior (4.21).30 The proposed sampler was a realization of the Metropolis–Hastings algorithm with a custom proposal distribution making minor modifications to the incumbent structure. In the case of the additive decompositions considered in that work, this step consisted of applying random atomic operations such as merging or splitting components of the existing decomposition. Despite the absolutely enormous number of possible additive decompositions, this mcmc routine was able to quickly locate promising structures, and averaging over the sampled structures for prediction resulted in superior optimization performance as well.

29 g. malkomes and r. garnett (2018). Automating Bayesian Optimization with Bayesian Optimization. neurips 2018.
30 j. r. gardner et al. (2017). Discovering and Exploiting Additive Structure for Bayesian Optimization. aistats 2017.

4.7 summary of major ideas


We have presented a convenient framework for model assessment, se-
lection, and averaging grounded in Bayesian inference; this is the pre-
dominant approach with Gaussian process models. In the context of
Bayesian optimization, perhaps the most important development was the notion of model averaging, which has proven beneficial to empirical performance31 and has become standard practice.

31 j. snoek et al. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. neurips 2012.

β€’ Model assessment entails deriving preferences over a space of candidate


models of a given system in light of available data.
β€’ In its purest form, a model in this context is a prior distribution over observed values y arising from observations at a given set of locations x, 𝑝 (y | x). A convenient mechanism for specifying a model is via a prior process for a latent function, 𝑝 (𝑓 ), and an observation model conditioned on this function, 𝑝 (𝑦 | π‘₯, πœ™) (4.1–4.2) (models and model structures: Β§ 4.1, p. 68).
β€’ With Gaussian process models, it is convenient to work with combinations of parametric forms for the prior mean function, prior covariance function, and observation model, a construct we call a model structure. A model structure defines a space of corresponding models by traversing its parameter space (4.3), allowing us to build expressive model spaces.
β€’ Once we delineate a space of candidate models, model assessment becomes straightforward if we make the – perhaps dubious but nonetheless practical – assumption that the mechanism generating our data is contained within this space. This allows us to treat that true model as a random variable and proceed via Bayesian inference over the chosen model space (Β§ 4.2, p. 70).
β€’ This inference proceeds as usual. We first define a model prior capturing any initial beliefs over the model space. Then, given a set of observations D = (x, y), the model posterior is proportional to the model prior and a measure of model fit known as the marginal likelihood or model evidence, the probability (density) of the observed data under the model.
β€’ In addition to quantifying fit, the model evidence encodes an automatic penalty for model complexity, an effect known as Bayesian Occam’s razor (Β§ 4.2, p. 71).
β€’ Model evidence can be computed in closed form for Gaussian process models with additive Gaussian observation noise (4.8).
β€’ Model inference is especially convenient when the model space is a single model structure, but can be extended to spaces built from multiple model structures with a bit of extra bookkeeping (Β§ 4.5, p. 78).
β€’ The model posterior provides a simple means of model assessment by establishing preferences according to posterior probability. If we must commit to a single model to explain the data – a task known as model selection – we then select the maximum a posteriori (map) model (Β§ 4.3, p. 73).
β€’ Model selection may not be prudent when the model posterior is very flat, which is common when observations are scarce. In this case many models may be compatible with the data but incompatible in their predictions, which should be accounted for in the interest of robustness. Model averaging is a natural solution, where we marginalize the unknown model when making predictions according to the model posterior (Β§ 4.4, p. 74).
β€’ Model averaging cannot in general be performed in closed form for Gaussian process models; however, we may proceed via mcmc sampling (4.14–4.15) or by appealing to more lightweight approximation schemes (see Figures 4.6–4.8 and surrounding text).
β€’ Appealing to metaheuristics allows us to automatically search a space of candidate model structures to explain a given dataset (Β§ 4.6, p. 81). Once sufficiently mature, such schemes may some day enable fully automated Bayesian optimization pipelines that sidestep explicit modeling altogether.
The next chapter marks a major departure from our discussion thus
far, which has focused on modeling and making predictions from data. We
will now shift our attention from inference to decision making, with the
goal of building effective optimization policies informed by the models
we have now fully developed. This endeavor will consume the bulk of
the remainder of the book.

The first step will be to develop a framework for optimal decision


making under uncertainty. Our work to this point will serve as an essential
component of this framework, as every such decision will be made with
reference to a posterior belief about what might happen as a result. In
the context of optimization, this belief will take the form of a posterior
predictive distribution for proposed observations given data, and our
investment in building faithful models will pay off in spades.
5
DECISION THEORY FOR OPTIMIZATION

Optimization entails a series of decisions. Most obviously, we must re-


peatedly decide where to make each observation guided by the available
data. Some settings also demand we decide when to terminate opti-
mization, weighing the potential benefit from continuing optimization
against any costs that may be incurred. It is not obvious how we should
make these decisions, especially in the face of incomplete and constantly
evolving knowledge about the objective function that is only refined via
the outcomes of our own actions.
In the previous four chapters, we established Bayesian inference as a
framework for reasoning about uncertainty that offers partial guidance.
The primary obstacle to decision making during optimization is uncer-
tainty about the objective function, and, by extension, the outcomes of
proposed observations. Bayesian inference allows us to reason about an
unknown objective function with a probability distribution over plausible
functions that we may seamlessly update as we gather new informa-
tion. This belief over the objective function in turn enables prediction of
proposed observations via the posterior predictive distribution.
How can we use these beliefs to guide our decisions? Bayesian in-
ference offers no direct answer, but in this chapter we will bridge this
gap. We will develop Bayesian decision theory as a principled means of decision making under uncertainty and apply this approach in the context of optimization, demonstrating how to use a probabilistic belief about an objective function to inform intelligent optimization policies.

Recall our model of sequential optimization outlined in Algorithm 1.1 (Β§ 1.1, p. 2), repeated for convenience on the following page. We begin with an arbitrary set of data, which we build upon through a sequence of observations of our own design. The core of the procedure is an optimization policy, which examines any already gathered data and makes the fundamental decision of where to make the next observation. With a policy in hand, optimization proceeds by repeating a straightforward pattern: the policy selects the next observation location, then we acquire the requested measurement and update our data accordingly. We repeat this process until satisfied, at which point we return the collected data.

Barring the question of termination, the behavior of this procedure is entirely determined by the policy, and constructing optimization policies will be our primary concern in this and the following chapters. We will begin with sheer audacity: we will derive the optimal policy – in terms of maximizing the expected quality of the returned data – in a generic setting (optimal optimization policies: Β§ 5.2, p. 91). The reader may wonder why this book is so long if the optimal policy is apparently so simple. As it turns out, this theoretically optimal procedure is usually impossible to compute and rarely of practical value (running time and approximation: Β§ 5.3, p. 99). However, our careful derivation will shed light on how we might derive effective approximations. This is a common theme in Bayesian optimization and will be our focus in Chapters 7 and 8 (Common Bayesian Optimization Policies, p. 123; Computing Policies with Gaussian Processes, p. 157).
The question of when to terminate optimization also represents
a decision that can be of critical importance in some applications. A


Algorithm 1.1: Sequential optimization.

    input: initial dataset D            β–Ά can be empty
    repeat
        π‘₯ ← policy(D)                   β–Ά select the next observation location
        𝑦 ← observe(π‘₯)                  β–Ά observe at the chosen location
        D ← D βˆͺ {(π‘₯, 𝑦)}                β–Ά update dataset
    until termination condition reached β–Ά e.g., budget exhausted
    return D
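
The following Python sketch is a direct transcription of this loop, with a fixed observation budget serving as the termination condition; the toy objective and the random-search policy standing in for a genuine optimization policy are illustrative assumptions only.

    import numpy as np

    def sequential_optimization(policy, observe, initial_data, budget):
        """Algorithm 1.1 with a fixed observation budget as the termination
        condition. `policy` maps the current dataset to the next observation
        location; `observe` performs the (expensive) measurement."""
        data = list(initial_data)                 # dataset D as a list of (x, y) pairs
        for _ in range(budget):
            x = policy(data)                      # select the next observation location
            y = observe(x)                        # observe at the chosen location
            data.append((x, y))                   # update dataset
        return data

    # A toy instantiation: a noisy objective on [0, 1] and a random-search policy.
    rng = np.random.default_rng(0)
    objective = lambda x: -(x - 0.7)**2 + 0.01 * rng.standard_normal()
    random_policy = lambda data: rng.uniform(0.0, 1.0)
    result = sequential_optimization(random_policy, objective, initial_data=[], budget=10)
    print(max(result, key=lambda pair: pair[1]))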

procedure for inspecting an observed dataset and deciding whether to


stop or continue optimization is called a stopping rule. In optimization, the stopping rule is often fixed and known before we begin, in which case we do not need to worry over its design.1 However, in some scenarios, we may wish instead to consider our evolving understanding of the objective function and the expected cost of further observations to dynamically decide when to stop, requiring more subtle adaptive stopping rules. We will also address termination decisions in this chapter and will again begin by deriving the optimal – but intractable – stopping procedure (optimal stopping rules: Β§ 5.4, p. 103), which will inspire efficient and effective approximations (practical stopping rules: Β§ 9.3, p. 210).

1 A predominant example is a preallocated budget on the number of allowed observations, in which case we are compelled to stop after exhausting the budget regardless of progress.

Practical optimization routines will return datasets that reflect significant progress on our global optimization problem (1.1) in some way. For example, we may wish to return datasets containing near-optimal values of the objective function. Alternatively, we may be satisfied returning datasets that indirectly reveal likely locations of the global optimum or achieve some other related goal. We will formalize this notion of a returned dataset’s utility shortly (Chapter 6: Utility Functions for Optimization, p. 109) and use it to guide optimization. First, we pause to introduce a useful and pervasive technique for implicitly defining an optimization policy by maximizing a score function over the domain.

Defining optimization policies via acquisition functions


A convenient mechanism for defining an optimization policy is by first
specifying an intermediate so-called acquisition function (also called an
infill function or figure of merit) that provides a score to each potential observation location commensurate with its propensity for aiding the
optimization task. We may then define a policy by observing at a point
judged most promising by the acquisition function. Nearly all Bayesian
optimization policies are defined in this manner, and this relationship
is so intimate that the phrase β€œacquisition function” is often used in-
terchangeably with β€œpolicy” in the literature and conversation, with
maximization of the acquisition function understood.
Specifically, an acquisition function 𝛼 : X β†’ ℝ assigns a score to
each point in the domain reflecting our preferences over locations for the
next observation. Of course, these preferences will presumably depend
on the data we have already observed. To make this dependence explicit,
acquisition function, 𝛼 (π‘₯; D) we adopt the notation 𝛼 (π‘₯; D) for a general acquisition function, where
available data serve as parameters shaping our preferences. In the Bayes-

approach, acquisition functions are invariably defined by deriving


the posterior belief of the objective function given the data, 𝑝 (𝑓 | D),
then defining preferences with respect to this belief.
An acquisition function 𝛼 encodes preferences over potential observation locations by inducing a total order over the domain: given data D, observing at a point π‘₯ is preferred over another point π‘₯ β€² whenever 𝛼 (π‘₯; D) > 𝛼 (π‘₯ β€²; D). Thus a rational action in light of these preferences is (any) one maximizing the acquisition function:2

2 Ties may be broken arbitrarily.

    π‘₯ ∈ arg max_{π‘₯ β€² ∈ X} 𝛼 (π‘₯ β€²; D).    (5.1)

Solving (5.1) maps a set of observed data D to a point π‘₯ ∈ X to observe


next, exactly the role of an optimization policy.
At first this idea may sound absurd: we have proposed solving a global optimization problem (1.1) by repeatedly solving global optimization problems (5.1)! To resolve this apparent paradox, we note that acquisition
functions in common use have properties rendering their optimization
considerably more tractable than the problem we ultimately wish to
solve. Typical acquisition functions are both cheap to evaluate and ana-
lytically differentiable, allowing the use of off-the-shelf optimizers when
computing the policy (5.1). The objective function, on the other hand, is
assumed to be expensive to evaluate, and its gradient is often unavailable.
Therefore we can reduce a difficult, expensive problem to a series of
simpler, inexpensive problems – a reasonable pursuit!
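
A minimal sketch of this reduction follows: the policy (5.1) is computed by multistart optimization of a cheap acquisition surface over a box-bounded domain. The acquisition function here is a toy stand-in (it merely rewards distance from previous observations) rather than one of the principled choices developed in Chapter 7, and all names are illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize

    def policy(acquisition, bounds, data, n_restarts=20, seed=0):
        """Select the next observation location by maximizing alpha(x; D) over a
        box-bounded domain via multistart optimization of the cheap acquisition."""
        rng = np.random.default_rng(seed)
        lo, hi = np.asarray(bounds).T
        best_x, best_val = None, -np.inf
        for _ in range(n_restarts):
            x0 = rng.uniform(lo, hi)
            res = minimize(lambda x: -acquisition(x, data), x0,
                           bounds=list(zip(lo, hi)), method="L-BFGS-B")
            if -res.fun > best_val:
                best_x, best_val = res.x, -res.fun
        return best_x

    # A stand-in acquisition function rewarding distance from observed locations,
    # encouraging exploration; not one of the acquisition functions of Chapter 7.
    def toy_acquisition(x, data):
        if not data:
            return 0.0
        return min(np.linalg.norm(x - np.asarray(xi)) for xi, _ in data)

    data = [(np.array([0.2, 0.5]), 1.3), (np.array([0.8, 0.1]), 0.7)]
    print(policy(toy_acquisition, bounds=[(0.0, 1.0), (0.0, 1.0)], data=data))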
Numerous acquisition functions have been proposed for Bayesian
optimization, and we will describe many popular choices in detail in
Chapter 7 (Common Bayesian Optimization Policies, p. 123). The most prominent means of constructing acquisition functions is Bayesian decision theory, an approach to optimal decision making we will discuss over the remainder of the chapter.

5.1 introduction to bayesian decision theory


Bayesian decision theory is a framework for decision making under uncertainty that is flexible enough to handle effectively any scenario. Instead of presenting the entire theory in complete abstraction, we will introduce the essential concepts with an eye to the context of optimization. For a more in-depth and theoretical treatment, the interested reader may refer to numerous comprehensive reviews of the subject.3 A good familiarity with this material can demystify some key ideas that are often glossed over in the Bayesian optimization literature, as it serves as the β€œhidden origin” of many common acquisition functions.

3 The following would be excellent companion texts: m. h. degroot (1970). Optimal Statistical Decisions. McGraw–Hill. j. o. berger (1985). Statistical Decision Theory and Bayesian Analysis. Springer–Verlag.
In this section we will introduce the Bayesian approach to decision
making and demonstrate how to make optimal decisions in the case of a
single isolated decision. Ultimately, we will require a theory for making
a sequence of decisions to reason over an entire optimization session. In
the next section, we will extend the line of reasoning presented below to
address sequential decision making and the construction of optimization
policies.

Isolated decisions
A decision problem under uncertainty has two defining characteristics.
The first is the action space A, the set of all available decisions. Our task is to select an action from this space. For example, in sequential optimization, an optimization policy decision must select a point in the domain X for observation, and so we have A = X.

The second critical feature is the presence of uncertain elements of the world influencing the outcomes of our actions, complicating our decision. Let πœ“ represent a random variable encompassing any relevant uncertain elements when making and evaluating a decision. Although we may lack perfect knowledge, Bayesian inference allows us to reason about πœ“ in light of relevant observed data D via the posterior distribution 𝑝 (πœ“ | D), and we will use this belief to inform our decision.

Suppose now we must select a decision from an action space A under uncertainty in πœ“, informed by a set of observed data D. To guide our choice, we select a real-valued utility function 𝑒 (π‘Ž, πœ“, D). This function measures the quality of selecting the action π‘Ž if the true state of the world were revealed to be πœ“, with higher utilities indicating more favorable outcomes. The arguments to a utility function comprise everything required to judge the quality of a decision in hindsight: the proposed action π‘Ž, what we know (the data D), and what we don’t know (the uncertain elements πœ“ ).4

4 Typical presentations of Bayesian decision theory omit the data from the utility function, but including it offers more generality, and this allowance will be important when we turn our attention to optimization policies.
We cannot know the exact utility that would result from selecting
any given action a priori, due to our incomplete knowledge of πœ“. We can,
however, compute the expected utility that would result from selecting
an action π‘Ž, according to our posterior belief:

    𝔼[𝑒 (π‘Ž, πœ“, D) | π‘Ž, D] = ∫ 𝑒 (π‘Ž, πœ“, D) 𝑝 (πœ“ | D) dπœ“.    (5.2)

This expected utility maps each available action to a real value, inducing a total order and providing a straightforward mechanism for making our decision. We pick an action maximizing the expected utility:

    π‘Ž ∈ arg max_{π‘Ž β€² ∈ A} 𝔼[𝑒 (π‘Ž β€², πœ“, D) | π‘Ž β€², D].    (5.3)

This decision is optimal in the sense that no other action results in greater expected utility. (By definition!) This procedure for acting optimally under uncertainty – computing expected utility with respect to relevant unknown variables and maximizing to select an action – is the central tenet of Bayesian decision making.5

5 One may question whether this framework is complete in some sense: is it possible to make rational decisions in some other manner? The von Neumann–Morgenstern theorem shows that the answer is, surprisingly, no. Assuming a certain set of rationality axioms, any rational preferences over uncertain outcomes can be captured by the expectation of some utility function. Thus every rational decision maximizes an expected utility: j. von neumann and o. morgenstern (1944). Theory of Games and Economic Behavior. Princeton University Press. [appendix a]

Example: recommending a point for use after optimization

With this abstract decision-making framework established, let us analyze an example decision that might be faced in the context of optimization. Consider a scenario where the purpose of optimization is to identify a
single point π‘₯ ∈ X for perpetual use in a production system, preferring

locations achieving higher values of the objective function. If we run an


optimizer and it returns some dataset D, which point should we select
for our final recommendation?
We may model this choice as a decision problem with action space
A = X, where we must reason under uncertainty about the objective
function 𝑓. We first select a utility function quantifying the quality of a
given recommendation π‘₯ in hindsight. One natural choice would be

𝑒 (π‘₯, 𝑓 ) = 𝑓 (π‘₯) = πœ™,

which rewards points for achieving high values of the objective function.
Now if our optimization procedure returned a dataset D, the expected utility from recommending a point π‘₯ is simply the posterior mean of the corresponding function value:

    𝔼[𝑒 (π‘₯, 𝑓 ) | π‘₯, D] = 𝔼[πœ™ | π‘₯, D] = πœ‡D (π‘₯).    (5.4)

Therefore, an optimal recommendation maximizes the posterior mean:

    π‘₯ ∈ arg max_{π‘₯ β€² ∈ X} πœ‡D (π‘₯ β€²).

Optimal terminal recommendation. Above: posterior belief about an objective function given the data returned by an optimizer, 𝑝 (𝑓 | D). Below: the expected utility for our example, the posterior mean πœ‡D (π‘₯). The optimal recommendation maximizes the expected utility.

Of course, other considerations in a given scenario such as risk aversion might suggest some other utility function or action space would be more appropriate, in which case we are free to select any alternative as we see fit. We will discuss terminal recommendations at length in the next chapter (Β§ 6.1, p. 109), including alternative utility functions and action spaces.
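
As a minimal illustration of this recommendation decision, the sketch below maximizes a made-up posterior mean over a dense grid of candidates; with a Gaussian process model one would instead use the actual posterior mean πœ‡D and, ideally, a continuous optimizer rather than a grid.

    import numpy as np

    def recommend(posterior_mean, candidates):
        """Optimal terminal recommendation under the utility u(x, f) = f(x):
        return the candidate location maximizing the posterior mean (5.4)."""
        values = np.array([posterior_mean(x) for x in candidates])
        return candidates[int(np.argmax(values))]

    # Illustrative posterior mean over a grid of candidate locations (placeholder
    # function standing in for the mean of the GP posterior p(f | D)).
    grid = np.linspace(0, 1, 201)
    toy_posterior_mean = lambda x: np.exp(-30 * (x - 0.62)**2) + 0.2 * x
    print("recommended x:", recommend(toy_posterior_mean, grid))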

5.2 sequential decisions with a fixed budget


We have now introduced Bayesian decision theory as a framework for
computing optimal decisions informed by data. The key idea is to mea-
sure the post hoc quality of a decision with an appropriately designed
utility function, then choose actions maximizing expected utility accord-
ing to our beliefs. We will now apply this idea to the construction of
optimization policies. This setting is considerably more complicated be-
cause each decision we make over the course of optimization will shape
the context of all future decisions.

Modeling policy decisions


To define an optimization routine, we must design a policy to adaptively
design a sequence of observations seeking the optimum. Following our
discussion in the previous section, we will model each of these choices
as a decision problem under uncertainty. Some aspects of this modeling
will be straightforward and others will take some care. To begin, the
action space of each decision is the domain X, and we must act under
uncertainty about the objective function 𝑓, which induces uncertainty
about the outcomes of proposed observations. Fortunately, we may make
each decision guided by any data obtained from previous decisions.

Bayesian inference of the objective function: To reason about uncertainty in the objective function, we follow
Β§ 1.2, p. 8 the path laid out in the preceding chapters and maintain a probabilistic
belief throughout optimization, 𝑝 (𝑓 | D). We make no assumptions
regarding the nature of this distribution, and in particular it need not
be a Gaussian process. Equipped with this belief, we may reason about
the result of making an observation at some point π‘₯ via the posterior
predictive distribution 𝑝 (𝑦 | π‘₯, D) (1.7), which will play a key role below.
The ultimate purpose of optimization is to collect and return a dataset
D. Before we can reason about what data we should acquire, we must
first clarify what data we would like to acquire. Following the previous
section, we will accomplish this by defining a utility function 𝑒 (D) to
evaluate the quality of data returned by an optimizer. This utility function
will serve to establish preferences over optimization outcomes: all other
things being equal, we would prefer to return a dataset with higher
utility than any dataset with lower utility. As before, we will use this
utility to guide the design of policies, by making observations that, in
expectation, promise the biggest improvement in utility. We will define and motivate several utility functions used for optimization in the next chapter (Chapter 6: Utility Functions for Optimization, p. 109), and some readers may wish to jump ahead to that discussion for
explicit examples before continuing. In the following, we will develop
the general theory in terms of an arbitrary utility function.

Uncertainty faced during optimization


Suppose D is a dataset of previous observations and that we must select
the next observation location π‘₯. This is the core decision defining an
optimization policy, and we will make all such decisions in the same
manner: by maximizing the expected utility of the data we will return.
Although this sounds straightforward, let us consider the uncertainty
faced when contemplating this decision in more detail. When evaluat-
ing a potential action π‘₯, uncertainty in the objective function induces
uncertainty in the corresponding value 𝑦 we will observe. Bayesian infer-
ence allows us to reason about this uncertain outcome via the posterior
predictive distribution (1.7), and we may hope to be able to address this
uncertainty without much trouble. However, we must also consider that
evaluating at π‘₯ would add the unknown observation (π‘₯, 𝑦) to our dataset,
and that the contents of this updated dataset would be consulted for all
future decisions. Thus we must reason not only about the outcome of
the present observation but also its impact on the entire remainder of
optimization. This requires special attention and distinguishes sequential
decisions from the isolated decisions discussed in the last section.
Intuitively, we might suspect that decisions made closer to termina-
tion should be easier, as fewer future decisions depend on their outcomes.
This is indeed the case, and it will be prudent to define optimization policies in reverse.6 We will first reason about the final decision – when we are freed from the burden of having to ponder any future observations – and proceed backwards to the choice of the first observation location, working out optimal behavior every step along the way.
6 In fact, we have already begun by analyzing a decision after optimization has completed!

In this section we will consider the construction of optimization
policies assuming that we have a fixed and known budget on the number
of observations we will make. This scenario is both common in practice
and convenient for analysis, as we can for now ignore the question of
when to terminate optimization. Note that this assumption effectively
implies that every observation has a constant acquisition cost, which
may not always be reasonable. We will address variable observation costs and the question of when to stop optimization later in this chapter (cost-aware optimization: Β§ 5.4, p. 103).
Assuming a fixed observation budget allows us to reason about opti-
mization policies in terms of the number of observations remaining to
termination, which will always be known. The problem we will consider
in this section then becomes the following: provided an arbitrary set of
data, how should we design our next evaluation location when exactly 𝜏 observations remain before termination? In sequential decision making,
this value is known as the decision horizon, as it indicates how far we
must look ahead into the future when reasoning about the present.
To facilitate our discussion, we pause to define notation for future
data that will be encountered during optimization relative to the present.
When considering an observation at some point π‘₯, we will call the value
resulting from an observation there 𝑦. We will then call the dataset available at the next stage of optimization D1 = D βˆͺ {(π‘₯, 𝑦)}, where the subscript indicates the number of future observations incorporated into the current data. We will write (π‘₯ 2, 𝑦2 ) for the following observation, which when acquired will form D2 , etc. Our final observation 𝜏 steps in the future will then be (π‘₯𝜏 , π‘¦πœ ), and the dataset returned by our optimization procedure will be D𝜏 , with utility 𝑒 (D𝜏 ).
This utility of the data we return is our ultimate concern and will
serve as the utility function used to design every observation. Note we
may write this utility in the same form we introduced in our general
discussion:
    𝑒 (D𝜏 ) = 𝑒 (D, π‘₯, 𝑦, π‘₯ 2, 𝑦2, . . . , π‘₯𝜏 , π‘¦πœ ),

which expresses the terminal utility in terms of a proposed current
action π‘₯, the known data D, and the unknown future data to be obtained:
the not-yet observed value 𝑦, and the locations {π‘₯ 2, . . . , π‘₯𝜏 } and values
{𝑦2, . . . , π‘¦πœ } of any following observations.
 
Following our treatment of isolated decisions, we evaluate a potential observation location π‘₯ via the expected utility at termination ultimately obtained if we observe at that point next:

    𝔼[𝑒 (D𝜏 ) | π‘₯, D],    (5.5)

and define an optimization policy via maximization:

    π‘₯ ∈ arg max_{π‘₯ β€² ∈ X} 𝔼[𝑒 (D𝜏 ) | π‘₯ β€², D].    (5.6)

On its surface, this proposal is relatively simple. However, we must
now consider how to actually compute the expected terminal utility (5.5).

Explicitly writing out the expectation over the future data in (5.5) yields
the following expression:
    ∫ Β· Β· Β· ∫ 𝑒 (D𝜏 ) 𝑝 (𝑦 | π‘₯, D) ∏_{𝑖=2}^{𝜏} 𝑝 (π‘₯𝑖 , 𝑦𝑖 | Dπ‘–βˆ’1 ) d𝑦 d(π‘₯𝑖 , 𝑦𝑖 ).    (5.7)

This integral certainly appears unwieldy! In particular, it is unclear how
to reason about uncertainty in our future actions, as we should hope that
these actions are made to maximize our welfare rather than generated
by a random process. We will show how to compute this expression under the bold but rational assumption that we make all future decisions optimally,7 and this analysis will reveal the optimal optimization policy.
7 This is known as bellman’s principle of optimality, and will be discussed further later in this section.
We will proceed via induction on the number of evaluations remaining before termination, 𝜏. We will first determine optimal behavior when only one observation remains and then inductively consider increasingly long horizons.8
8 This procedure is often called β€œbackward induction,” where we consider the last decision first and work backward in time. Our approach of a forward induction on the horizon is equivalent.
For this analysis it will be useful to introduce notation for the expected increase in utility achieved when beginning from an arbitrary dataset D, making an observation at π‘₯, and then continuing optimally until termination 𝜏 steps in the future. We will write
 
π›Όπœ (π‘₯; D) = 𝔼 𝑒 (D𝜏 ) | π‘₯, D βˆ’ 𝑒 (D)

for this quantity, which is simply the expected terminal utility (5.5) shifted
by the utility of our existing data, 𝑒 (D). It is no coincidence this notation
defining a policy by maximizing an echoes our notation for acquisition functions! We will characterize the
acquisition function: Β§ 5, p. 88 optimal optimization policy by a family of acquisition functions defined
in this manner.

Fixed budget: one observation remaining


We first consider the case where only one observation remains before
termination; that is, the horizon is 𝜏 = 1. In this case the terminal
dataset will be the current dataset augmented with a single additional
observation. As there are no following decisions to consider, we may
analyze the decision using the framework we have already developed for
isolated decisions (Β§ 5.1, p. 89). The marginal gain in utility from a final evaluation at
π‘₯ is an expectation over the corresponding value 𝑦 with respect to the
posterior predictive distribution:
    𝛼 1 (π‘₯; D) = ∫ 𝑒 (D1 ) 𝑝 (𝑦 | π‘₯, D) d𝑦 βˆ’ 𝑒 (D).    (5.8)

The optimal observation maximizes the expected marginal gain:

π‘₯ ∈ arg max 𝛼 1 (π‘₯ β€²; D), (5.9)


π‘₯ β€² ∈X

and leads to our returning a dataset with expected utility

𝑒 (D) + 𝛼 1βˆ— (D); 𝛼 1βˆ— (D) = max


β€²
𝛼 1 (π‘₯ β€²; D). (5.10)
π‘₯ ∈X

[Figure 5.1: Illustration of the optimal optimization policy with a horizon of one. Above: we compute the expected marginal gain 𝛼 1 over the domain and design our next observation π‘₯ by maximizing this score. Left: the computation of the expected marginal gain for the optimal point π‘₯ and a suboptimal point π‘₯ β€² indicated above. In this example the marginal gain is a simple piecewise linear function of the observed 𝑦 value (5.11), and the optimal point maximizes its expectation.]

Here we have defined the symbol π›Όπœβˆ— (D) to represent the expected
increase in utility when starting with D and continuing optimally for
𝜏 additional observations. This is called the value of the dataset with a
horizon of 𝜏 and will serve a central role below. We have now shown
how to compute the value of any dataset with a horizon of 𝜏 = 1 (5.10)
and how to identify a corresponding optimal action (5.9). This completes
the base case of our argument.
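As an illustrative sketch (not code from the text), the one-step expected marginal gain (5.8) can be estimated numerically when the predictive distribution 𝑝 (𝑦 | π‘₯, D) is Gaussian; `predictive` and `utility` below are hypothetical callables supplied by the user, and the dataset is represented as a list of (π‘₯, 𝑦) pairs.

```python
import numpy as np

def alpha_1(x, D, predictive, utility, n_quad=20):
    """One-step expected marginal gain (5.8), assuming p(y | x, D) is
    Gaussian with moments given by `predictive`.

    D          : list of (x, y) pairs observed so far
    predictive : callable (x, D) -> (mu, s), predictive mean and std
    utility    : callable D -> u(D)
    """
    mu, s = predictive(x, D)
    # Gauss-Hermite quadrature nodes/weights for a standard normal
    z, w = np.polynomial.hermite_e.hermegauss(n_quad)
    expected_u = sum(wi * utility(D + [(x, mu + s * zi)]) for zi, wi in zip(z, w))
    expected_u /= np.sqrt(2 * np.pi)
    return expected_u - utility(D)
```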
We illustrate the optimal optimization policy with one observation remaining in Figure 5.1. In this scenario the belief over the objective function 𝑝 (𝑓 | D) is a Gaussian process, and for simplicity we assume our observations reveal exact values of the objective. We consider an intuitive utility function: the maximal objective value contained in the data, 𝑒 (D) = max 𝑓 (x).9 The marginal gain in utility offered by a putative final observation (π‘₯, 𝑦) is then a piecewise linear function of the observed value:
9 This is a special case of the simple reward utility function, which we discuss further in the next chapter (Β§ 6.1, p. 109). The corresponding expected marginal gain is the well-known expected improvement acquisition function (Β§ 7.3, p. 127).

    𝑒 (D1 ) βˆ’ 𝑒 (D) = max{𝑦 βˆ’ 𝑒 (D), 0};    (5.11)
that is, the utility increases linearly if we exceed the previously best-
seen value and otherwise remains constant. To design the optimal final
observation, we compute the expectation of this quantity over the domain
and choose the point maximizing it, as shown in the top panels. We also
illustrate the computation of this expectation for the optimal choice and

[Figure 5.2: Illustration of the optimal optimization policy with a horizon of two. Above: the expected two-step marginal gain 𝛼 2 ; the next observation location π‘₯ maximizes this score. Right: computation of 𝛼 2 for the optimal point π‘₯. The marginal gain is decomposed into two components (5.13): the immediate gain 𝑒 (D1 ) βˆ’ 𝑒 (D) and the expected future gain 𝔼[𝑒 (D2 ) βˆ’ 𝑒 (D1 )]. The chosen point offers a high expected future reward even if the immediate reward is zero; see the facing page for the scenarios resulting from the marked values 𝑦 β€² and 𝑦 β€²β€².]

a suboptimal alternative in the bottom panel. We expect an observation
at the chosen location to improve utility by a greater amount than any
alternative.

Fixed budget: two observations remaining


Rather than proceeding immediately to the inductive case, let us consider
the specific case of two observations remaining: 𝜏 = 2. Suppose we have
obtained an arbitrary dataset D and must decide where to make the
penultimate observation π‘₯. The reasoning for this special case presents
the inductive argument most clearly.
We again consider the expected increase in utility by termination,
now after two observations:
 
𝛼 2 (π‘₯; D) = 𝔼 𝑒 (D2 ) | π‘₯, D βˆ’ 𝑒 (D). (5.12)

Nominally this expectation requires marginalizing the observation 𝑦, as
well as the final observation location π‘₯ 2 and its value 𝑦2 (5.7). However, if
we assume optimal future behavior, we can simplify our treatment of the
final decision π‘₯ 2 . First, we rewrite the two-step expected gain 𝛼 2 in terms
of the one-step expected gain 𝛼 1 , a function for which we have already
established a good understanding. We write the two-step difference in

[Figure 5.3: The posterior of the objective function given two possible observations resulting from the optimal two-step observation π‘₯ illustrated on the facing page. The relatively low value 𝑦 β€² offers no immediate reward, but reveals a new local optimum and the expected future reward from the optimal final decision π‘₯ 2 is high. The relatively high value 𝑦 β€²β€² offers a large immediate reward and respectable prospects from the optimal final decision as well.]

utility as a telescoping sum:


    𝑒 (D2 ) βˆ’ 𝑒 (D) = [𝑒 (D1 ) βˆ’ 𝑒 (D)] + [𝑒 (D2 ) βˆ’ 𝑒 (D1 )],

which yields

    𝛼 2 (π‘₯; D) = 𝛼 1 (π‘₯; D) + 𝔼[𝛼 1 (π‘₯ 2 ; D1 ) | π‘₯, D].

That is, the expected increase in utility after two observations can be
decomposed as the expected increase after our first observation π‘₯ – the
expected immediate gain – plus the expected additional increase from
the final observation π‘₯ 2 – the expected future gain.
It is still not clear how to address the second term in this expression.
However, from our analysis of the base case, we can reason as follows.
Given 𝑦 (and thus knowledge of D1 ), the optimal final decision π‘₯ 2 (5.9)
results in an expected marginal gain of 𝛼 1βˆ— (D1 ), a quantity we know how
to compute (5.10). Therefore, assuming optimal future behavior, we have:
 
𝛼 2 (π‘₯; D) = 𝛼 1 (π‘₯; D) + 𝔼 𝛼 1βˆ— (D1 ) | π‘₯, D , (5.13)

which expresses the desired quantity as an expectation with respect
to the current observation 𝑦 only – the future value 𝛼 1βˆ— (5.10) does not
depend on either π‘₯ 2 (due to maximization) or 𝑦2 (due to expectation).
The optimal penultimate observation location maximizes the expected
gain as usual:
π‘₯ ∈ arg max 𝛼 2 (π‘₯ β€²; D), (5.14)
π‘₯ β€² ∈X

and provides an expected terminal utility of

𝑒 (D) + 𝛼 2βˆ— (D); 𝛼 2βˆ— (D) = max


β€²
𝛼 2 (π‘₯ β€²; D).
π‘₯ ∈X

This demonstrates we can achieve optimal behavior for a horizon of 𝜏 = 2 and compute the value of any dataset with this horizon.
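Continuing the sketch above (same assumptions about `predictive` and `utility`), the two-step score (5.13) can be computed by averaging the one-step value of the putative dataset D1 over quadrature samples of 𝑦, with the inner maximization approximated over a finite candidate set.

```python
import numpy as np

def alpha_2(x, D, candidates, predictive, utility, n_quad=10):
    """Two-step expected marginal gain (5.13), reusing alpha_1 above.

    candidates : finite set of locations approximating the inner
                 maximization that defines alpha_1^*(D_1) in (5.10)
    """
    mu, s = predictive(x, D)
    z, w = np.polynomial.hermite_e.hermegauss(n_quad)
    future = 0.0
    for zi, wi in zip(z, w):
        D1 = D + [(x, mu + s * zi)]                # putative next dataset
        alpha_1_star = max(alpha_1(xp, D1, predictive, utility)
                           for xp in candidates)   # value of D1 (5.10)
        future += wi * alpha_1_star
    return alpha_1(x, D, predictive, utility) + future / np.sqrt(2 * np.pi)
```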
The optimal policy with two observations remaining is illustrated in Figures 5.2 and 5.3. The former shows the expected two-step marginal
gain 𝛼 2 and the optimal action. This quantity depends both on the imme-
diate gain from the next observation and the expected future gain from
the optimal final action. The chosen observation appears quite promis-
ing: even if the result offers no immediate gain, it will likely provide
information that can be exploited with the optimal final decision π‘₯ 2 . We
show the situation that would be faced in the final stage of optimization
for two potential values in Figure 5.3. The relatively low value 𝑦 β€² offers
no immediate gain but sets up an encouraging final decision, whereas the
relatively high value 𝑦 β€²β€² offers a significant immediate gain with some
chance of further improvement.

Fixed budget: inductive case


We now present the general inductive argument, which closely follows
the 𝜏 = 2 analysis above. Let 𝜏 be an arbitrary decision horizon, and for
the sake of induction assume we can compute the value of any dataset
with a horizon of 𝜏 βˆ’ 1. Suppose we have an arbitrary dataset D and
must decide where to make the next observation. We will show how to
do so optimally and how to compute its value with a horizon of 𝜏.
Consider the 𝜏-step expected gain in utility from observing at some
point π‘₯:  
π›Όπœ (π‘₯; D) = 𝔼 𝑒 (D𝜏 ) | π‘₯, D βˆ’ 𝑒 (D),
10 Namely: which we seek to maximize. We decompose this expression in terms of
shorter-horizon quantities through a telescoping sum:10
𝑒 (D𝜏 ) βˆ’ 𝑒 (D) =
     
𝑒 (D1 ) βˆ’ 𝑒 (D) + 𝑒 (D𝜏 ) βˆ’ 𝑒 (D1 ) . π›Όπœ (π‘₯; D) = 𝛼 1 (π‘₯; D) + 𝔼 π›Όπœβˆ’1 (π‘₯ 2 ; D1 ) | π‘₯, D .

Now if we knew 𝑦 (and thus D1 ), optimal continued behavior would


provide an expected further gain of π›Όπœβˆ’1
βˆ—
(D1 ), a quantity we can compute
via the inductive hypothesis. Therefore, assuming optimal behavior for
all remaining decisions, we have:
 βˆ— 
π›Όπœ (π‘₯; D) = 𝛼 1 (π‘₯; D) + 𝔼 π›Όπœβˆ’1 (D1 ) | π‘₯, D , (5.15)

which is an expectation with respect to 𝑦 of a function we can compute. To find the optimal decision and the 𝜏-step value of the data, we maximize:

    π‘₯ ∈ arg max_{π‘₯ β€² ∈ X} π›Όπœ (π‘₯ β€²; D);    (5.16)
    π›Όπœβˆ— (D) = max_{π‘₯ β€² ∈ X} π›Όπœ (π‘₯ β€²; D).    (5.17)

This demonstrates we can achieve optimal behavior for a horizon of 𝜏
given an arbitrary dataset and compute its corresponding value, estab-
lishing the inductive case and completing our analysis.
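The same pattern extends to any horizon. The recursion below is a deliberately naΓ―ve transcription of (5.15–5.17) under the same assumptions as the earlier sketches, again reusing `alpha_1` and the placeholder callables; its cost grows exponentially with 𝜏, exactly the issue taken up in the next section.

```python
import numpy as np

def alpha_tau(x, D, tau, candidates, predictive, utility, n_quad=10):
    """tau-step expected marginal gain (5.15), assuming all future
    decisions are made optimally over the finite candidate set."""
    a1 = alpha_1(x, D, predictive, utility)
    if tau <= 1:
        return a1
    mu, s = predictive(x, D)
    z, w = np.polynomial.hermite_e.hermegauss(n_quad)
    future = 0.0
    for zi, wi in zip(z, w):
        D1 = D + [(x, mu + s * zi)]
        # alpha*_{tau-1}(D1): the value of the putative dataset (5.17)
        future += wi * max(alpha_tau(xp, D1, tau - 1, candidates,
                                     predictive, utility, n_quad)
                           for xp in candidates)
    return a1 + future / np.sqrt(2 * np.pi)
```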

We pause to note that the value of any dataset with null horizon is 𝛼 0βˆ— (D) = 0, and thus the expressions in (5.15–5.17) are valid for any horizon and compactly express the proposed policy. Further, we have actually shown that this policy is optimal in the sense of maximizing expected terminal utility over the space of all policies, at least with respect to our model of the objective function and observations. This follows from our induction: the base case is established in (5.9), and the inductive case by the sequential maximization in (5.16).11
11 Since ties in (5.16) may be broken arbitrarily, this argument does not rule out the possibility of there being multiple, equally good optimal policies.
Bellman optimality and the Bellman equation
Substituting (5.15) into (5.17), we may derive the following recursive definition of the value in terms of the value of future data:

    π›Όπœβˆ— (D) = max_{π‘₯ β€² ∈ X} { 𝛼 1 (π‘₯ β€²; D) + 𝔼[π›Όπœβˆ’1βˆ— (D1 ) | π‘₯ β€², D] }.    (5.18)

This is known as the Bellman equation and is a central result in the theory of optimal sequential decisions.12 The treatment of future decisions in this equation – recursively assuming that we will always act to maximize expected terminal utility given the available data – reflects bellman’s principle of optimality, which characterizes optimal sequential decision policies in terms of the optimality of subpolicies:13

    An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

That is, to make a sequence of optimal decisions, we make the first decision optimally, then make all following decisions optimally given the outcome!

12 r. bellman (1952). On the Theory of Dynamic Programming. Proceedings of the National Academy of Sciences 38(8):716–719.
13 r. bellman (1957). Dynamic Programming. Princeton University Press.

5.3 cost and approximation of the optimal policy


Although the framework presented in the previous section is concep-
tually simple and theoretically attractive, the optimal policy is unfortu-
nately prohibitive to compute except for very short decision horizons.
To demonstrate the key computational barrier, consider the selection
of the penultimate observation location. The expected two-step marginal
gain to be maximized is (5.13):
 
𝛼 2 (π‘₯; D) = 𝛼 1 (π‘₯; D) + 𝔼 𝛼 1βˆ— (D1 ) | π‘₯, D .
The second term appears to be a straightforward expectation over the
one-dimensional random variable 𝑦. However, evaluating the integrand
in this expectation requires solving a nontrivial global optimization
problem (5.10)! Even with only two evaluations remaining, we must
solve a doubly nested global optimization problem, an onerous task.
Close inspection of the recursively defined optimal policy (5.15–5.16)
reveals that when faced with a horizon of 𝜏, we must solve 𝜏 nested

𝑦 𝑦2 π‘₯𝜏 𝑒 (D𝜏 )
π‘₯ π‘₯2 Β·Β·Β· π‘¦πœ

arg max 𝔼𝑦 max 𝔼𝑦2 Β·Β·Β· max π”Όπ‘¦πœ 𝑒 (D𝜏 )


π‘₯ π‘₯2 π‘₯𝜏

Figure 5.4: The optimal optimization policy as a decision tree. Squares indicate decisions (the choice of each observation),
and circles represent expectations with respect to random variables (the outcomes of observations). Only one
possible optimization path is shown; dangling edges lead to different futures, and all possibilities are always
considered. We maximize the expected terminal utility 𝑒 (D𝜏 ), recursively assuming optimal future behavior.

optimization problems to find the optimal decision. Temporarily adopting compact notation, we may β€œunroll” the optimal policy as follows:

    π‘₯ ∈ arg max π›Όπœ ;
    π›Όπœ = 𝛼 1 + 𝔼[π›Όπœβˆ’1βˆ— ]
        = 𝛼 1 + 𝔼[max π›Όπœβˆ’1 ]
        = 𝛼 1 + 𝔼[max{𝛼 1 + 𝔼[π›Όπœβˆ’2βˆ— ]}]
        = 𝛼 1 + 𝔼[max{𝛼 1 + 𝔼[max{𝛼 1 + Β· Β· Β· .

The design of each optimal decision requires repeated maximization over the domain and expectation over unknown observations until the horizon is reached. This computation is visualized as a decision tree in
Figure 5.4, where it is clear that each unknown quantity contributes a
significant branching factor. Computing the expected utility at π‘₯ exactly
requires a complete traversal of this tree.
The cost of computing the optimal policy clearly grows with the horizon. Let us perform a careful running time analysis for a naΓ―ve implementation via exhaustive traversal of the decision tree in Figure 5.4 with off-the-shelf procedures. Suppose we use an optimization routine for each maximization and a numerical quadrature routine for each expectation encountered in this computation. If we allow 𝑛 evaluations of the objective for each call to the optimizer and π‘ž observations of the integrand for each call to the quadrature routine, then each decision along the horizon will contribute a multiplicative factor of O(π‘›π‘ž) to the total running time. Computing the optimal decision with a horizon of 𝜏 thus requires O(𝑛^πœπ‘ž^𝜏 ) work, an exponential growth in running time with respect to the horizon.
14 Detailed references are provided by: w. b. powell (2011). Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons; d. p. bertsekas (2017). Dynamic Programming and Optimal Control. Vol. 1. Athena Scientific.
Evidently, the computational effort required for realizing the optimal policy quickly becomes intractable, and we must find some alternative mechanism for designing effective optimization policies. General approximation schemes for the optimal policy have been studied in depth under the name approximate dynamic programming,14 and usually operate as

[Figure 5.5: A lookahead approximation to the optimal optimization policy. We choose the optimal decision for a limited horizon of β„“ β‰ͺ 𝜏 decisions, ignoring any observations that would follow.]

follows. We begin with the intractable optimal expected marginal gain (5.15):

    π›Όπœ (π‘₯; D) = 𝛼 1 (π‘₯; D) + 𝔼[π›Όπœβˆ’1βˆ— (D1 ) | π‘₯, D],

and substitute a tractable approximation for the β€œhard” part of the expression: the recursively defined future value 𝛼 βˆ— (5.18). The result is an acquisition function inducing a suboptimal – but rationally guided – approximate policy. Two particular approximation schemes have proven useful in Bayesian optimization: limited lookahead and rollout.

Limited lookahead
One widespread and surprisingly effective approximation is to simply
limit how many future observations we consider in each decision. This
is practical as decisions closer to termination require substantially less
computation than earlier decisions.
With this in mind, we can construct a natural family of approxima-
tions to the optimal policy defined by artificially limiting the horizon
used throughout optimization to some computationally feasible maxi-
mum β„“. When faced with an infeasible decision horizon 𝜏, we make the
crude approximation

π›Όπœ (π‘₯; D) β‰ˆ 𝛼 β„“ (π‘₯; D),

and by maximizing this score, we act optimally under the incorrect but
convenient assumption that only β„“ observations remain. This effectively assumes 𝑒 (D𝜏 ) β‰ˆ 𝑒 (Dβ„“ ).15 This may be reasonable if we expect decreasing marginal gains, implying a significant fraction of potential gains can be attained within the truncated horizon. This scheme is often described (sometimes disparagingly) as myopic, as we limit our sight to only the next few observations rather than looking ahead to the full horizon.
15 Equivalently, we approximate the true future value π›Όπœβˆ’1βˆ— with 𝛼 β„“βˆ’1βˆ— .
A policy that designs each observation to maximize the limited-horizon acquisition function 𝛼 min{β„“,𝜏 } is called an β„“-step lookahead policy.16 This is also called a rolling horizon strategy, as the fixed horizon β€œrolls along” with us as we go. By limiting the horizon, we bound the computational effort required for each decision to at most O(𝑛^β„“π‘ž^β„“ ) time with the implementation described above. This can be a considerable speedup when the observation budget is much greater than the selected lookahead. A lookahead policy is illustrated as a decision tree in Figure 5.5. Comparing to the optimal policy in Figure 5.4, we simply β€œcut off” and ignore any portion of the tree lying deeper than β„“ steps in the future.
16 We take the minimum to ensure we don’t look beyond the true horizon, which would be nonsense.

𝑦 𝑦2 π‘₯𝜏 𝑒 (D𝜏 )
π‘₯ π‘₯2 Β·Β·Β· π‘¦πœ

Figure 5.6: A decision tree representing a rollout policy. Comparing to the optimal policy in Figure 5.4, we simulate future
decisions starting with π‘₯ 2 using an efficient but suboptimal heuristic policy, rather than the intractable optimal
policy. We maximize the expected terminal utility 𝑒 (D𝜏 ), assuming potentially suboptimal future behavior.

Particularly important in Bayesian optimization is the special case of one-step lookahead, which successively maximizes the expected marginal gain after acquiring a single additional observation, 𝛼 1 . One-step lookahead is the most efficient lookahead approximation (barring the absurdity that would be β€œzero-step” lookahead), and it is often possible to derive closed-form, analytically differentiable expressions for 𝛼 1 , enabling efficient implementation. Many well-known acquisition functions represent one-step lookahead approximations for some implicit choice of utility function, as we will see in Chapter 7 (Common Bayesian Optimization Policies, p. 123).

Rollout
The optimal policy evaluates a potential observation location by simu-
lating the entire remainder of optimization following that choice, recur-
sively assuming we will use the optimal policy for every future decision.
Although sensible, this is clearly intractable. Rollout is an approach to ap-
proximate policy design that emulates the structure of the optimal policy,
but using a tractable suboptimal policy to simulate future decisions.
A rollout policy is illustrated as a decision tree in Figure 5.6. Given a
putative next observation (π‘₯, 𝑦), we use an inexpensive so-called base
or heuristic policy to simulate a plausible – but perhaps suboptimal –
realization of the following decision π‘₯ 2 . Note there is no branching in
the tree corresponding to this decision, as it does not depend on the
exhaustively enumerated subtree required by the optimal policy. We
then take an expectation with respect to the unknown value 𝑦2 as usual.
Given a putative value of 𝑦2 , we use the base policy to select π‘₯ 3 and
continue in this manner until reaching the decision horizon. We use the
terminal utilities in the resulting pruned tree to estimate the expected
marginal gain π›Όπœ , which we maximize as a function of π‘₯.
There are no constraints on the design of the base policy used in
rollout; however, for this approximation to be sensible, we must choose
something relatively efficient. One common and often effective choice
is to simulate future decisions with one-step lookahead. If we again
use off-the-shelf optimization and quadrature routines to traverse the
rollout decision tree in Figure 5.6 with this particular choice, the running
time of the policy with a horizon of 𝜏 is O(𝑛²π‘ž^𝜏 ), significantly faster

[Figure 5.7: A batch rollout policy as a decision tree. Given a putative value for the next evaluation (π‘₯, 𝑦), we design all remaining decisions {π‘₯ 2, . . . , π‘₯𝜏 } simultaneously using a batch base policy and take the expectation of the terminal utility 𝑒 (D𝜏 ) with respect to their values {𝑦2, . . . , π‘¦πœ }.]
than the optimal policy. Although there is still exponential growth with
respect to π‘ž, we typically have π‘ž β‰ͺ 𝑛,17 so we can usually entertain
farther horizons with rollout than with limited lookahead with the same amount of computational effort.
17 For estimating a one-dimensional expectation we might take π‘ž on the order of roughly 10, but for optimizing a nonconvex acquisition function over the domain we might take 𝑛 on the order of thousands or more.
Due to the flexibility in the design of the base policy, rollout is a remarkably flexible approximation scheme. For example, we can combine rollout with the idea of limiting the decision horizon to yield approximate policies with tunable running time. In fact, we can interpret β„“-step lookahead as a special case of rollout, where the base policy designs the next β„“ βˆ’ 1 decisions optimally assuming a myopic horizon and then simply terminates early, discarding any remaining budget.
We may also adopt a base policy that designs all remaining observa-
tions simultaneously. Ignoring the dependence between these decisions
can provide a computational advantage while retaining awareness of the
evolving decision horizon, and such batch rollout schemes have proven
useful in Bayesian optimization. A batch rollout policy is illustrated as a
decision tree in Figure 5.7. Although we account for the entire horizon,
the tree depth is reduced dramatically compared to the optimal policy.

5.4 cost-aware optimization and termination as a decision


Thus far we have only considered the construction of optimization poli-
cies under a known budget on the total number of observations. Although
this scenario is pervasive, it is not universal. In some situations, we might
wish instead to use our evolving beliefs about the objective function to
decide dynamically when termination is the best course of action.
Dynamic termination can be especially prudent when we want to
reason explicitly about the cost of data acquisition during optimization.
For example, if this cost were to vary across the domain, it would not
be sensible to define a budget in terms of function evaluations. How-
ever, by accounting for observation costs in the utility function, we can
reason about cost–benefit tradeoffs during optimization and seek to ter-
minate whenever the expected cost of further observation outweighs
any expected benefit it might provide.

Modeling termination decisions and the optimal policy


We consider a modification to the sequential decision problem we an-
alyzed in the known-budget case, wherein we now allow ourselves to

terminate optimization at any time of our choosing. Suppose we are at an
arbitrary point of optimization and have already obtained data D. We face
the following decision: should we terminate optimization immediately
and return D? If not, where should we make our next observation?
We model this scenario as a decision problem under uncertainty with an action space A equal to the domain X, representing potential observation locations if we decide to continue, augmented with a special additional action βˆ… representing immediate termination:

    A = X βˆͺ {βˆ…}.    (5.19)

For the sake of analysis, after the termination action has been selected, it
is convenient to model the decision process as not actually terminating,
but rather continuing with the collapsed action space A = {βˆ…} – once
you terminate, there’s no going back.
As before, we may derive the optimal optimization policy in the
adaptive termination case via induction on the decision horizon 𝜏. How-
ever, we must address one technical issue: the base case of the induction,
which analyzes the β€œfinal” decision, breaks down if we allow the possi-
bility of a nonterminating sequence of decisions. To sidestep this issue,
we assume there is a fixed and known upper bound 𝜏max on the total number of observations we may make, at which point optimization is compelled to terminate regardless of any other concern. This is not an overly restrictive assumption in the context of Bayesian optimization. Because observations are assumed to be expensive, we can adopt some suitably absurd upper bound without issue; for example, 𝜏max = 1 000 000 would suffice for an overwhelming majority of plausible scenarios.18
18 It is possible to consider unbounded sequential decision problems, but this is probably not of practical interest in Bayesian optimization: m. h. degroot (1970). Optimal Statistical Decisions. McGraw–Hill. [Β§ 12.7]
After assuming the decision process is bounded, our previous inductive argument carries through after we demonstrate how to compute the value of the termination action. Fortunately, this is straightforward:
termination does not augment our data, and once this action is taken, no
other action will ever again be allowed. Therefore the expected marginal
gain from termination is always zero:

π›Όπœ (βˆ…; D) = 0. (5.20)

With this, substituting A for X in (5.15–5.17) now gives the optimal policy.
Intuitively, the result in (5.20) implies that termination is only the
optimal decision if there is no observation offering positive expected
gain in utility. For the utility functions described in the next chapter –
all of which are agnostic to costs and measure optimization progress
alone – reaching this state is actually impossible.19 However, explicitly accounting for observation costs in addition to optimization progress in the utility function resolves this issue, as we will demonstrate.
19 This can be proven through various β€œinformation never hurts” (in expectation) results.

Example: cost-aware optimization


To illustrate the behavior of a policy allowing early termination, we
return to our motivating scenario of accounting for observation costs.

[Figure 5.8: Illustration of one-step lookahead with the option to terminate. With a linear utility and additive costs, the expected marginal gain 𝛼 1 is the expected marginal gain to the data utility 𝛼 1β€² adjusted for the cost of acquisition 𝑐. For some points, the cost-adjusted expected gain is negative, in which case we would prefer immediate termination to observing there. However, continuing with the chosen point is expected to increase the utility of the current data.]

Consider the objective function belief in the top panel of Figure 5.8 (which is identical to that from our running example from Figures 5.1–5.3) and suppose that the cost of observation now depends on location according to a known cost function 𝑐 (π‘₯),20 illustrated in the middle panel.
20 We will consider unknown and stochastic costs in Β§ 11.1, p. 245.
If we wish to reason about observation costs in the optimization policy, we must account for them somehow, and the most natural place to do so is in the utility function. Depending on the situation, there are many ways we could proceed;21 however, one natural approach is to first select a utility function measuring the quality of a returned dataset alone, ignoring any costs incurred to acquire it. We call this quantity the data utility and notate it with 𝑒 β€²(D). The data utility is akin to the cost-agnostic utility from the known-budget case, and any one of the options described in the next chapter (Chapter 6: Utility Functions for Optimization, p. 109) could reasonably fill this role.
21 We wish to stress this point – there is considerable flexibility beyond the scheme we describe.
We now adjust the data utility to account for the cost of data acquisition. In many applications, these costs are additive, so that the total cost of gathering a dataset D is simply

    𝑐 (D) = βˆ‘_{π‘₯ ∈ D} 𝑐 (π‘₯).    (5.21)

If the acquisition cost can be expressed in the same units as the data utility – for example, if both can be expressed in monetary terms22 – then we might reasonably evaluate a dataset D by the cost-adjusted utility:

    𝑒 (D) = 𝑒 β€²(D) βˆ’ 𝑐 (D).    (5.22)

22 Some additional discussion on this natural approach can be found in: h. raiffa and r. schlaifer (1961). Applied Statistical Decision Theory. Division of Research, Graduate School of Business Administration, Harvard University. [chapter 4]

Demonstration: one-step lookahead with cost-aware utility


Returning to the scenario in Figure 5.8, let us adopt a cost-aware utility
function of the above form (5.22) and consider the behavior of a one-step
lookahead approximation to the optimal optimization policy.
For these choices, if we were to continue optimization by evaluating
at a point π‘₯, the resulting one-step marginal gain in utility would be:
 
    𝑒 (D1 ) βˆ’ 𝑒 (D) = [𝑒 β€²(D1 ) βˆ’ 𝑒 β€²(D)] βˆ’ 𝑐 (π‘₯),

the cost-adjusted marginal gain in the data utility alone. Therefore the expected marginal gain in utility is:

    𝛼 1 (π‘₯; D) = 𝛼 1β€² (π‘₯; D) βˆ’ 𝑐 (π‘₯),
where 𝛼 1β€² is the one-step expected gain in the data utility (5.8). That
is, we simply adjust what would have been the acquisition function in
the cost-agnostic setting by subtracting the cost of data acquisition. To
prefer evaluating at π‘₯ to immediate termination, this quantity must have
positive expected value (5.20).
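A minimal sketch of this policy under the same placeholder interfaces as the earlier sketches: the cost-adjusted score is the cost-agnostic one-step gain minus a (hypothetical) cost function, and termination is chosen whenever no candidate has positive expected value.

```python
import numpy as np

def next_action(D, candidates, predictive, data_utility, cost):
    """One-step lookahead with a termination option (5.19-5.20).

    Returns the location maximizing alpha_1'(x; D) - c(x), or None if
    immediate termination is preferred to every candidate observation.
    """
    scores = [alpha_1(xp, D, predictive, data_utility) - cost(xp)
              for xp in candidates]
    best = int(np.argmax(scores))
    return candidates[best] if scores[best] > 0 else None
```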
The resulting policy is illustrated in Figure 5.8. The middle panel
shows the cost-agnostic acquisition function 𝛼 1β€² (from Figure 5.1), which
is then adjusted for observation cost in the bottom panel. This renders the
expected marginal gain negative in some locations, where observations
are not expected to be worth their cost. However, in this case there are still
regions where observation is favored to termination, and optimization
continues at the selected location. Comparing with the cost-agnostic
setting in Figure 5.1, the optimal observation has shifted from the right-
hand side to the left-hand side of the previously best-seen point, as an
observation there is more cost-effective.

5.5 summary of major ideas


β€’ Optimization policies can be conveniently defined via an acquisition function assigning a score to each potential observation location. We then design observations by maximizing the acquisition function (5.1). (defining optimization policies via acquisition functions: p. 88)
β€’ Bayesian decision theory is a general framework for optimal decision making under uncertainty, through which we can derive optimal optimization policies and stopping rules.
β€’ The key elements of a decision problem under uncertainty are (introduction to Bayesian decision theory: Β§ 5.1, p. 89):
  – an action space A, from which we must choose an action π‘Ž,
  – uncertainty in elements πœ“ relevant to the decision, represented by a posterior belief 𝑝 (πœ“ | D), and
  – a utility function 𝑒 (π‘Ž,πœ“, D) quantifying the quality of the action π‘Ž assuming a given realization of the uncertain elements πœ“.
  Given these, an optimal decision maximizes the expected utility (5.2–5.3).
β€’ Optimization policy decisions may be cast in this framework by defining a utility function for the data returned by an optimizer, then designing each observation location to maximize the expected utility with respect to all future data yet to be obtained (5.5–5.6). (modeling policy decisions: Β§ 5.2, p. 91)
β€’ To ensure the optimality of a sequence of decisions, we must recursively assume the optimality of all future decisions. This is known as bellman’s principle of optimality (Β§ 5.2, p. 99). Under this assumption, the optimal policy can be derived inductively and assumes a simple recursive form (5.15–5.17).
β€’ The cost of computing the optimal policy grows exponentially with the decision horizon, but several techniques under the umbrella approximate dynamic programming provide tractable approximations (Β§ 5.3, p. 99). Two notable examples are limited lookahead, where the decision horizon is artificially limited, and rollout, where future decisions are simulated suboptimally.
β€’ Through careful accounting, we may explicitly account for the (possibly nonuniform) cost of data acquisition in the utility function (termination as a decision: Β§ 5.4, p. 103). Offering a termination option and computing the resulting optimal policy then allows us to adaptively terminate optimization when continuing optimization becomes a losing battle of cost versus expected gain.

In the next chapter we will discuss several prominent utility functions for
measuring the quality of a dataset returned by an optimization procedure.
In the following chapter (Chapter 7: Common Bayesian Optimization Policies, p. 123), we will demonstrate how many common acquisition functions for Bayesian optimization may be realized by performing one-step lookahead with these utility functions.
6
UTILITY FUNCTIONS FOR OPTIMIZATION

In the last chapter we introduced Bayesian decision theory, a framework
for decision making under uncertainty through which we can derive
theoretically optimal optimization policies. Central to this approach
is the notion of a utility function evaluating the quality of a dataset
returned from an optimization routine. Given a model of the objective
function – conveying our beliefs in the face of uncertainty – and a utility
function – expressing our preferences over outcomes – computing the
optimal policy is purely mechanical: we design every observation to
maximize the expected utility of the returned dataset (5.15–5.17). Setting
aside computational issues, adopting this approach entails only two
major hurdles: building an objective function model consistent with our
beliefs and designing a utility function consistent with our preferences.
Neither of these tasks is trivial! Beliefs and preferences are so innate
to the human experience that distilling them down to mathematical
symbols can be challenging. Fortunately, expressive and mathematically
convenient options for both are readily available. We devoted significant
attention to model building in the first part of this book, and we will
address the construction of utility functions in this chapter. We will in-
troduce a number of common utility functions designed for optimization,
each carrying a different perspective on how optimization performance
should be quantified. We hope that the underlying motivation for these
utility functions may inspire the design of novel alternatives when called
for. In the next chapter (Chapter 7: Common Bayesian Optimization Policies, p. 123), we will demonstrate how approximating the optimal optimization policy corresponding to the utility functions described here yields many widespread Bayesian optimization algorithms.
Although we will be using Gaussian process models in our illustra-
tions throughout the chapter, we will not assume the objective function
model is a Gaussian process in our discussion. As in the previous chap-
ters, we will use the notation πœ‡D (π‘₯) = 𝔼[πœ™ | π‘₯, D] for the posterior mean
of the objective function; this should not be interpreted as implying any
particular model structure beyond admitting a posterior mean.

6.1 expected utility of terminal recommendation


The purpose of optimization is often to explore a space of possibilities in
search of the single best alternative, and after investing in optimization,
we commit to using some chosen point in a subsequent procedure. In
this context, the only purpose of the data collected during optimization
is to help select this final point. For example, in hyperparameter tuning,
we may evaluate numerous hyperparameters during model development,
only to use the apparently best settings found in a production system.
Selecting a point for permanent use represents a decision, which we may analyze using Bayesian decision theory (Chapter 5: Decision Theory for Optimization, p. 87). If the sole purpose of
optimization is to inform a final decision, it is natural to design the policy
to maximize the expected utility of the terminal decision directly, and
several popular policies are defined in this manner.


Formalization of terminal recommendation decision


Suppose we have run an optimization routine, which returned a dataset D = (x, y), and suppose we now wish to recommend a point π‘₯ ∈ X for use in some task, with performance determined by the underlying objective function value πœ™ = 𝑓 (π‘₯).1 This represents a decision under uncertainty about πœ™, informed by the predictive distribution, 𝑝 (πœ™ | π‘₯, D).
1 Dependence on πœ™ alone is not strictly necessary. For example, in the interest of robustness we might wish to ensure that function values are high in the neighborhood of our recommendation as well. This would be possible in the same framework by redefining the utility function as desired.
To completely specify the decision problem, we must identify an action space A βŠ‚ X for our recommendation and a utility function 𝜐 (πœ™) evaluating a recommendation in hindsight according to its objective value πœ™. Given these, a rational recommendation maximizes the expected utility:

    π‘₯ ∈ arg max_{π‘₯ β€² ∈ A} 𝔼[𝜐 (πœ™ β€²) | π‘₯ β€², D].

The expected utility of the optimal recommendation only depends on the data returned by the optimizer; it does not depend on the optimal recommendation π‘₯ (due to maximization) nor its objective value πœ™ (due to expectation). This suggests a natural utility for use in optimization: the expected quality of an optimal terminal recommendation given the data,

    𝑒 (D) = max_{π‘₯ β€² ∈ A} 𝔼[𝜐 (πœ™ β€²) | π‘₯ β€², D].    (6.1)

In the context of the sequential decision tree from Figure 5.4, this utility function effectively β€œcollapses” the expected utility of a final decision into a utility for the returned data; see the illustration in the margin. We are free to select the action space and utility function for the final recommendation as we see fit; we provide some advice below.
[Margin note: We may also interpret this class of utility functions as augmenting the decision tree in Figure 5.4 with a final layer corresponding to the terminal decision. The utility of the data is then the expected utility of this subtree, assuming optimal behavior.]
optimal behavior. recommendation as we see fit; we provide some advice below.

Choosing an action space


We begin with the action space A βŠ‚ X. One extreme option is to restrict
our choice to only the visited points x. This ensures at least some knowl-
edge of the objective function at the recommended point, which may be
prudent when the objective function model may be misspecified. The
other extreme is the maximally permissive alternative: the entire domain
X, allowing us to recommend any point, including those arbitrarily far
from our observations. The wisdom of recommending an unvisited point
for perpetual use is ultimately a question of faith in the model’s beliefs.
Compromises between these extremes have also been occasionally suggested in the literature. osborne et al. for example proposed restricting the choice of final recommendation to only those points where the objective function is known with acceptable tolerance.2 Such a scheme can limit unwanted surprise from recommending points where the objective function value is not known with sufficient certainty. One might accomplish this in several ways; osborne et al. adopted a parametric, data-dependent action space of the form
2 m. a. osborne et al. (2009). Gaussian Processes for Global Optimization. lion 3.

    A(πœ€; D) = {π‘₯ | std[πœ™ | π‘₯, D] ≀ πœ€},

where πœ€ is a threshold specifying the largest acceptable uncertainty.

Choosing a utility function and risk tolerance


In addition to selecting an action space, we must also select a utility function 𝜐 (πœ™) evaluating a recommendation at π‘₯ in light of the corresponding function value πœ™. As our focus is on maximization (1.1), it is clear that the utility should be monotonically increasing in πœ™, but it is not necessarily clear what shape this function should assume. The answer depends on our risk tolerance, a concept demonstrated in the margin. When making our final recommendation, we may wish to consider not only the expected function value of a given point but also our uncertainty in this value, as points with greater uncertainty may result in more surprising and potentially disappointing results.
[Margin figure: Consider the illustrated beliefs about the objective function value corresponding to two possible recommendations. One option has a higher expected value, but also greater uncertainty, and proposing it entails some risk. The alternative has a lower expected value but is perhaps a safer option. A risk-averse agent might prefer the latter option, whereas a risk-tolerant agent might prefer the former.]
By controlling the shape of the utility function 𝜐 (πœ™), we may induce different behavior with respect to risk. The simplest and most common option encountered in Bayesian optimization is a linear utility:

    𝜐 (πœ™) = πœ™ .    (6.2)

[Margin figure: A risk-neutral (linear) utility function.]
In this case, the expected utility from recommending π‘₯ is simply the posterior mean of πœ™, as we have already seen (5.4):

    𝔼[𝜐 (πœ™) | π‘₯, D] = πœ‡D (π‘₯),

and an optimal recommendation maximizes the posterior mean over the action space:

    π‘₯ = arg max_{π‘₯ β€² ∈ A} πœ‡D (π‘₯ β€²).

Uncertainty in the objective function is not considered in this decision at all! Rather, we are indifferent between points with equal expected value, regardless of their uncertainty – that is, we are risk neutral.
Risk neutrality is computationally convenient due to the simple form of the expected utility, but may not always reflect our true preferences. In the margin we show beliefs over the objective values for two potential recommendations with equal expected value but significantly different risk. In many scenarios we would have a clear preference between the two alternatives, but a risk-neutral utility induces complete indifference.
[Margin figure: Beliefs over two recommendations with equal expected value. A risk-neutral agent would be indifferent between these alternatives, a risk-averse agent would prefer the more certain option, and a risk-seeking agent would prefer the more uncertain option.]
A useful concept when reasoning about risk preferences is the so-called certainty equivalent. Consider a risky potential recommendation π‘₯, that is, a point for which we do not know the objective value exactly. The certainty equivalent for π‘₯ is the value of a hypothetical risk-free alternative for which our preferences would be indifferent. That is, the certainty equivalent for π‘₯ corresponds to an objective function value πœ™ β€² such that

    𝜐 (πœ™ β€²) = 𝔼[𝜐 (πœ™) | π‘₯, D].

Under a risk-neutral utility function, the certainty equivalent of a point π‘₯ is simply its expected value: πœ™ β€² = πœ‡D (π‘₯). Thus we would abandon a potential recommendation for another only if it had greater expected value, independent of risk. However, we may encode risk-aware preferences with appropriately designed nonlinear utility functions.

If our preferences indicate risk aversion, we might be willing to rec-


𝜐 (πœ™) ommend a point with lower expected value if it also entailed less risk.
We may induce risk-averse preferences by adopting a utility function
that is a concave function of the objective value. In this case, by Jensen’s
inequality we have
   
𝜐 (πœ™ β€²) = 𝔼 𝜐 (πœ™) | π‘₯, D ≀ 𝜐 𝔼[πœ™ | π‘₯, D] = 𝜐 πœ‡D (π‘₯) ,

and thus the certainty equivalent of a risky recommendation is less


πœ™
than its expected value; see the example in the margin. Similarly, we
may induce risk-seeking preferences with a convex utility function, in
A risk-averse (concave) utility function.
which case the certainty equivalent of a risky recommendation is greater
than its expected value – our preferences encode an inclination toward
gambling. Risk-averse and risk-seeking utilities are rarely encountered
πœ™ in the Bayesian optimization literature; however, they may be preferable
in some practical settings, as risk neutrality is often questionable.
A risk-averse agent may be indifferent be- Numerous risk-averse utility functions have been proposed in the
tween a risky recommendation (the wide economics and decision theory literature,3 and a full discussion is beyond
distribution) and its risk-free certainty the scope of this book. However, one natural approach is to quantify the
equivalent with lower expected value (the
Dirac delta).
Numerous risk-averse utility functions have been proposed in the economics and decision theory literature,3 and a full discussion is beyond the scope of this book. However, one natural approach is to quantify the risk associated with recommending an uncertain value πœ™ by its standard deviation:
𝜎 = std[πœ™ | π‘₯, D].
Now we may establish preferences over potential recommendations consistent with4 a weighted combination of a point π‘₯’s expected reward, πœ‡ = πœ‡D (π‘₯), and its risk, 𝜎:5
πœ‡ + π›½πœŽ.
Here 𝛽 serves as a tunable risk-tolerance parameter: values 𝛽 < 0 penalize risk and induce risk-averse behavior, values 𝛽 > 0 reward risk and induce risk-seeking behavior, and 𝛽 = 0 induces risk neutrality (6.2).
Two particular utility functions from this general framework are widely encountered in Bayesian optimization, both representing the expected utility of a risk-neutral optimal terminal recommendation.

3 One flexible family is the hyperbolic absolute risk aversion (hara) class, which includes many popular choices as special cases: j. e. ingersoll jr. (1987). Theory of Financial Decision Making. Rowman & Littlefield. [chapter 1]
4 Under Gaussian beliefs on function values, one can find a family of concave (or convex) utility functions inducing equivalent recommendations. See Β§ 8.26, p. 171 for related discussion.
5 Note that expected reward (πœ‡) and risk (𝜎) have compatible units in this formulation, so this weighted combination is sensible.
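As a quick illustration of this parameter’s effect, the sketch below (with made-up numbers) scores a safe and a risky candidate with equal expected value under several settings of 𝛽; negative values prefer the safe option, positive values the risky one.

```python
# A minimal sketch with hypothetical candidates; the (mu, sigma) pairs and
# beta values are arbitrary choices for illustration.
candidates = {"safe": (1.0, 0.1), "risky": (1.0, 0.8)}   # (expected reward, risk)

for beta in (-0.5, 0.0, 0.5):
    scores = {name: mu + beta * sigma for name, (mu, sigma) in candidates.items()}
    # beta < 0 favors "safe", beta > 0 favors "risky", beta = 0 is indifferent
    print(beta, scores)
```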

Simple reward
Suppose an optimization routine returned data D = (x, y) to inform a
terminal recommendation, and that we will make this decision using
the risk-neutral utility function 𝜐 (πœ™) = πœ™ (6.2). If we limit the action
space of this recommendation to only the locations evaluated during
optimization x, the expected utility of the optimal recommendation is
the so-called simple reward:6, 7
𝑒 (D) = max πœ‡D (x). (6.3)
In the special case of exact observations, where y = 𝑓 (x) = 𝝓, the simple reward reduces to the maximal function value encountered during optimization:
𝑒 (D) = max 𝝓. (6.4)

6 This name contrasts with the cumulative reward: Β§ 6.2, p. 114.
7 One technical caveat is in order: when the dataset is empty, the maximum degenerates and we have 𝑒 (βˆ…) = βˆ’βˆž.

Figure 6.1: The terminal recommendations corresponding to the simple reward and global reward for an example dataset comprising five observations. The prior distribution for the objective for this demonstration is illustrated in Figure 6.3.

One-step lookahead with the simple reward utility function produces a widely used acquisition function known as expected improvement (Β§ 7.3, p. 127), which we will discuss in detail in the next two chapters.

Global reward
Another prominent utility is the global reward.8 Here we again consider a risk-neutral terminal recommendation, but now expand the action space for this recommendation to the entire domain X. The expected utility of this recommendation is the global maximum of the posterior mean:
𝑒 (D) = max_{π‘₯ ∈X} πœ‡D (π‘₯). (6.5)
An example dataset exhibiting a large discrepancy between the simple reward (6.3) and global reward (6.5) utilities is illustrated in Figure 6.1. The larger action space underlying global reward leads to a markedly different and somewhat riskier recommendation.
One-step lookahead with global reward (6.5) yields the knowledge gradient acquisition function (Β§ 7.4, p. 129), which we will also consider at length in the following chapters.

8 β€œGlobal simple reward” would be a more accurate (but annoyingly bulky) name.
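The following self-contained sketch (not code from the text) contrasts the two utilities on a toy one-dimensional problem: a zero-mean Gaussian process with a squared exponential covariance is conditioned on a handful of exact observations, and the two rewards are computed from its posterior mean. The kernel, length scale, and data are arbitrary assumptions for illustration; a dense grid stands in for the continuous domain.

```python
import numpy as np

def se_kernel(a, b, length_scale=0.15):
    # squared exponential covariance between two sets of 1d inputs
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

# exact (noiseless) observations of the objective
x_obs = np.array([0.1, 0.35, 0.6, 0.62, 0.9])
y_obs = np.array([-0.3, 0.2, 0.1, 0.4, -0.1])

# posterior mean of a zero-mean GP conditioned on the exact observations
K = se_kernel(x_obs, x_obs) + 1e-10 * np.eye(len(x_obs))
weights = np.linalg.solve(K, y_obs)

def posterior_mean(x):
    return se_kernel(x, x_obs) @ weights

# simple reward (6.3): maximal posterior mean over the observed locations;
# with exact observations this coincides with max(y_obs) as in (6.4)
simple_reward = posterior_mean(x_obs).max()

# global reward (6.5): maximal posterior mean over the whole domain,
# here approximated by a dense grid
x_grid = np.linspace(0, 1, 1001)
global_reward = posterior_mean(x_grid).max()

print(simple_reward, global_reward)   # the global reward can only be larger
```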

A tempting, but nonsensical alternative


There is an alternative utility deceptively similar to the simple reward that is sometimes encountered in the Bayesian optimization literature, namely, the maximum noisy observed value contained in the dataset:9
𝑒 (D) =? max y. (6.6)
In the case of exact observations of the objective function, this value coincides with the simple reward (6.4), which has a natural interpretation as the expected utility of a particular optimal terminal recommendation. However, this correspondence does not hold in the case of inexact or noisy observations, and the proposed utility is rendered absurd.

9 The β€œquestionable equality” symbol =? is reserved for this single dubious equation.
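A tiny synthetic sketch (not the data behind Figure 6.2) makes the failure mode plain: when measurements are very noisy, max y is typically dominated by a lucky noise realization rather than by any genuinely high objective value.

```python
import numpy as np

# Hypothetical numbers for illustration only.
rng = np.random.default_rng(1)

f = rng.uniform(-1.0, 0.0, size=20)            # modest true objective values
y_noisy = f + rng.normal(0.0, 2.0, size=20)    # very noisy measurements of them
y_exact = rng.uniform(0.0, 0.5, size=20)       # exact measurements of better values

# max y will typically rank the noisy dataset above the exact one, even though
# its underlying function values are uniformly worse.
print(y_noisy.max(), y_exact.max(), f.max())
```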


Figure 6.2: The utility 𝑒 (D) = max y would prefer the excessively noisy dataset on the left to the less-noisy dataset
on the right with smaller maximum value. The data on the left reveal little information about the objective
function, and the maximum observed value is very likely to be an outlier, whereas the data on the right indicate
reasonable progress.

This is simple to demonstrate by contemplating the preferences over outcomes encoded in the utility, which may not align with intuition.
This disparity is especially notable in situations with excessively noisy
observations, where the maximum value observed will likely reflect
spurious noise rather than actual optimization progress.
Figure 6.2 shows an extreme but illustrative example. We consider
two optimization outcomes over the same domain, one with excessively
noisy observations and the other with exact measurements. The noisy
dataset contains a large observation on the right-hand side of the domain,
but this is almost certainly the result of noise, as indicated by the objective
function posterior. Although the other dataset has a lower maximal value,
the observations are more trustworthy and represent a plainly better
outcome. But the proposed utility (6.6) prefers the noisier dataset! On
the other hand, both the simple and global reward utilities prefer the
noiseless dataset, as the data produce a larger effect on the posterior
mean – and thus yield more promising recommendations.
Of course, errors in noisy measurements are not always as extreme
as in this example. When the signal-to-noise ratio is relatively high, the
utility (6.6) can serve as a reasonable approximation to the simple reward.
We will discuss this approximation scheme further in the context of
expected improvement (Β§ 7.3, p. 127).

6.2 cumulative reward


Simple and global reward are motivated by supposing that the goal of
optimization is to discover the best single point from a space of alterna-
tives. To this end, we evaluate data according to the highest function
value revealed and assume that the values of any suboptimal points
encountered are irrelevant.
In other settings, the value of every individual observation might
be significant, for example, if the optimization procedure is controlling
a critical external system. If the consequences of these decisions are
nontrivial, we might wish to discourage observing where we might
encounter unexpectedly low objective function values.

Cumulative reward encourages obtaining observations with large average value. For a dataset D = (x, y), its cumulative reward is simply the sum of the observed values:
𝑒 (D) = βˆ‘π‘– 𝑦𝑖 . (6.7)

One notable use of cumulative reward is in active search (Β§ 11.11, p. 282), a simple
mathematical model of scientific discovery. Here, we successively select
points for investigation seeking novel members of a rare, valuable class
V βŠ‚ X. Observing at a point π‘₯ ∈ X yields a binary observation indicating
membership in the desired class: 𝑦 = [π‘₯ ∈ V]. Most studies of active
search seek to maximize the cumulative reward (6.7) of the gathered
data, hoping to discover as many valuable items as possible.

6.3 information gain


Simple, global, and cumulative reward judge optimization performance based solely on having found high objective function values, a natural and pragmatic concern. Information theory10 provides an alternative approach to measuring utility that is often used in Bayesian optimization. An information-theoretic approach to sequential experimental design (including optimization) identifies some random variable that we wish to learn about through our observations. We then evaluate performance by quantifying the amount of information about this random variable revealed by data, favoring datasets containing more information. This line of reasoning gives rise to the notion of information gain.
Let πœ” be a random variable of interest that we wish to determine through the observation of data. The choice of πœ” is open-ended and should be guided by the application at hand. Natural choices aligned with optimization include the location of the global optimum, π‘₯βˆ—, and the maximal value of the objective, 𝑓 βˆ— (1.1), each of which has been considered in depth in this context.
We may quantify our initial uncertainty about πœ” via the (differential) entropy of its prior distribution, 𝑝 (πœ”):
𝐻 [πœ”] = βˆ’βˆ« 𝑝 (πœ”) log 𝑝 (πœ”) dπœ”.
The information gain offered by a dataset D is then the reduction in entropy when moving from the prior to the posterior distribution:
𝑒 (D) = 𝐻 [πœ”] βˆ’ 𝐻 [πœ” | D], (6.8)
where 𝐻 [πœ” | D] is the differential entropy of the posterior:11
𝐻 [πœ” | D] = βˆ’βˆ« 𝑝 (πœ” | D) log 𝑝 (πœ” | D) dπœ”.

10 A broad introduction to information theory is provided by the classical text: t. m. cover and j. a. thomas (2006). Elements of Information Theory. John Wiley & Sons, and a treatment focusing on the connections to Bayesian inference can be found in: d. j. c. mackay (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
11 A caveat is in order regarding this notation, which is not standard. In information theory 𝐻 [πœ” | D] denotes the conditional entropy of πœ” given D, which is the expectation of the given quantity over the observed values y. For our purposes it will be more useful for this to signify the differential entropy of the notationally parallel posterior 𝑝 (πœ” | D). When needed, we will write conditional entropy with an explicit expectation: 𝔼[𝐻 [πœ” | D] | x].

Somewhat confusingly, some authors use an alternative definition of information gain – the Kullback–Leibler (kl) divergence between the posterior distribution and the prior distribution:
𝑒 (D) = 𝐷 kl[𝑝 (πœ” | D) βˆ₯ 𝑝 (πœ”)] = ∫ 𝑝 (πœ” | D) log (𝑝 (πœ” | D) / 𝑝 (πœ”)) dπœ”. (6.9)
That is, we quantify the information contained in data by how much our belief in πœ” changes as a result of collecting it. This definition has some convenient properties compared to the previous one (6.8); namely, the expression in (6.9) is invariant to reparameterization of πœ” and always nonnegative, whereas β€œsurprising” observations may cause the information gain in (6.8) to become negative.12 However, the previous definition as the direct reduction in entropy may be more intuitive.
Fortunately (and perhaps surprisingly!), there is a strong connection between these two β€œinformation gains” (6.8–6.9) in the context of sequential decision making. Namely, their expected values with respect to observed values are equal, and thus maximizing expected utility with either leads to identical decisions.13 For this reason, the reader may simply choose whichever definition they find more intuitive.

12 A simple example: suppose πœ” ∈ (0, 1) is the unknown bias of a coin, with prior 𝑝 (πœ”) = Beta(πœ”; 2, 1); 𝐻 β‰ˆ βˆ’0.193. After flipping and observing β€œtails,” the posterior becomes 𝑝 (πœ” | D) = Beta(πœ”; 2, 2); 𝐻 β‰ˆ βˆ’0.125. The information β€œgained” was 𝐻 [πœ”] βˆ’ 𝐻 [πœ” | D] β‰ˆ βˆ’0.068 < 0. Of course, the most likely outcome of the flip a priori was β€œheads,” so the outcome was surprising. Indeed the expected information gain before the experiment was 𝐻 [πœ”] βˆ’ 𝔼[𝐻 [πœ” | D]] β‰ˆ 0.137 > 0.
13 See p. 138 for a proof.
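The margin example can be verified directly; the sketch below (assuming scipy is available) computes both versions of the information gain for the coin-flip scenario.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# The coin-flip example from the margin note: prior Beta(2, 1) on the bias,
# posterior Beta(2, 2) after observing a single "tails."
prior = stats.beta(2, 1)
posterior = stats.beta(2, 2)

# definition (6.8): reduction in differential entropy -- negative here,
# because the observed outcome was surprising
gain_entropy = prior.entropy() - posterior.entropy()

# definition (6.9): KL divergence between posterior and prior -- nonnegative
gain_kl, _ = quad(lambda w: posterior.pdf(w) * np.log(posterior.pdf(w) / prior.pdf(w)), 0, 1)

print(gain_entropy)   # approximately -0.068
print(gain_kl)        # strictly positive
```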
One-step lookahead with (either) information gain yields an acquisition function known as mutual information (Β§ 7.6, p. 135). This is the basis for a family of related Bayesian optimization procedures sharing the moniker entropy search, which we will discuss further in the following chapters.
Unlike the other utility functions discussed thus far, information gain is not intimately linked to optimization, and may be adapted to a wide
variety of tasks by selecting the random variable πœ” appropriately. Rather,
this scheme of refining knowledge through experiment is effectively a
mathematical formulation of scientific inquiry.

6.4 dependence on model of objective function


One striking feature of most of the utility functions defined in this chapter
is implicit dependence on an underlying model of the objective function.
Both the simple and global reward are defined in terms of the posterior
mean function πœ‡D , and information gain about the location or value of
the optimum is defined in terms of the posterior belief about these values,
𝑝 (π‘₯ ,βˆ— 𝑓 βˆ— | D); both of these quantities are byproducts of the objective
function posterior.
model averaging: Β§Β§ 4.4–4.5, p. 74 One way to mitigate model dependence in the computation of utility
model-agnostic alternatives is via model averaging (4.11, 4.23).14 We may also attempt to define purely
model-agnostic utility functions in terms of the data alone, without
14 The effect on simple and global reward is to reference to a model; however, the possibilities are somewhat limited if
maximize a model-marginal posterior mean, we wish the resulting utility to be sensible. Cumulative reward (6.7) is
and the effect on information gain is to evalu-
ate changes in model-marginal beliefs about
one example, as it depends only on the observed values y. The maximum
πœ”. function value observed is another possibility (6.6), but, as we have
shown, it is dubious when observations are corrupted by noise. Other
similarly defined alternatives may suffer the same fate – for additive noise
with zero mean, the expected contribution from noise to the cumulative
reward is zero; however, noise will bias many other natural measures
such as order statistics (including the maximum) of the observations.


Figure 6.3: The objective function prior used throughout our utility function comparison. Marginal beliefs of function
values are shown, as well as the induced beliefs over the location of the global optimum, 𝑝 (π‘₯ βˆ— ), and the value
of the global optimum, 𝑝 (𝑓 βˆ— ). Note that there is a significant probability that the global optimum is achieved
on the boundary of the domain, reflected by large point masses.

6.5 comparison of utility functions


We have now presented several utility functions for evaluating a dataset
returned by an optimization routine. Each utility quantifies progress
on our model optimization problem (1.1) in some way, but it may be
difficult at this point to appreciate their, sometimes subtle, differences in
approach. Here we will present and discuss example datasets for which
different utility functions diverge in their opinion of quality.
We particularly wish to contrast the behavior of the simple reward
(6.3) with other utility functions. Simple reward is probably the most
prevalent utility in the Bayesian optimization literature (especially in
applications), as it corresponds to the widespread expected improvement acquisition function (Β§ 7.3, p. 127). A distinguishing feature of simple reward
is that it evaluates data based only on local properties of the objective
function posterior. This locality is both computationally convenient and
pragmatic. Simple reward is derived from the premise that we will be
recommending one of the points observed during the course of opti-
mization for permanent use, and thus it is sensible to judge performance
based on the objective function values at the observed locations alone.
Several alternatives instead measure global properties of the objective
function posterior. The global reward (6.5), for example, considers the
entire posterior mean function, reflecting a willingness to recommend
an unevaluated point after termination. Information gain (6.8) about
the location or value of the optimum considers the posterior entropy of
these quantities, again a global property. The consequences of reasoning
about local or global properties of the posterior can sometimes lead to
significant disagreement between the simple reward and other utilities.
In the following examples, we consider optimization on an interval with exact measurements. We model the objective function with
a Gaussian process with constant mean (3.1) and squared exponential


Figure 6.4: An example dataset of five observations and the resulting posterior belief of the objective function. This dataset
exhibits relatively low simple reward (6.3) but relatively high global reward (6.5) and information gain (6.8)
about the location π‘₯ βˆ— and value 𝑓 βˆ— of the optimum.

covariance (3.12). This prior is illustrated in Figure 6.3, along with the
induced beliefs about the location π‘₯ βˆ— and value 𝑓 βˆ— of the global optimum.
Both distributions reflect considerable uncertainty.15 We will examine two datasets that might be returned by an optimizer using this model and discuss how different utility functions would evaluate these outcomes.

15 For this model, a unique optimum will exist with probability one; see Β§ 2.7, p. 34 for more details.

Good global outcome but poor local outcome


Consider the dataset in Figure 6.4 and the resulting posterior belief about
the objective and its optimum. In this example, the simple reward is
relatively low as the posterior mean at our observations is unremarkable.
In fact, every observation was lower than the prior mean, a seemingly
unlucky outcome. However, the global reward is relatively high: the data
imply a steep derivative in one location, inducing high values of the
posterior mean away from our data. This is a significant accomplishment
from the point of view of the global reward, as the model expects a
terminal recommendation in that region to be especially valuable.
Figure 6.1 shows the optimal final recommendations associated with
these two utility functions. The simple reward recommendation prior-
itizes safety over reward, whereas the global reward recommendation
reflects more risk tolerance. Neither is inherently better: although the
global reward recommendation has a larger expected value, this ex-
pectation is computed using a model that might be mistaken. Further,
comparing the posterior distribution in Figure 6.4 with the prior in Figure


Figure 6.5: An example dataset containing a single observation and the resulting posterior belief of the objective function.
This dataset exhibits relatively high simple reward (6.3) but relatively low global reward (6.5) and information
gain (6.8) about the location π‘₯ βˆ— and value 𝑓 βˆ— of the optimum.

6.3, we see this example dataset also induces a significant reduction in


our uncertainty about both the location and value of the global optimum,
despite not containing any particularly notable values itself. Therefore,
despite a somewhat low simple reward, observing this dataset results in
relatively high information gain about these quantities.

Good local outcome but poor global outcome


We illustrate a different example dataset in Figure 6.5. The dataset con-
tains a single observation with value somewhat higher than the prior
mean. Although this dataset may not appear particularly impressive, its
simple reward is higher than the previous dataset, as this observation
exceeds every value seen in that scenario.
However, this dataset has lower value than the previous dataset
when evaluated by other utility functions. Its global reward is lower
than in the first scenario, as the global maximum of the posterior mean
is lower. This can be verified by visual inspection of Figures 6.4 and
6.5, whose vertical axes are compatible. Further, the single observation
in this scenario provides nearly no information regarding the location
nor the value of the global maximum. The observation of a moderately
high value provides only weak evidence that the global optimum may
be located nearby, barely influencing our posterior belief about π‘₯ βˆ—. The
observation does truncate our belief of the value of the global optimum
𝑓 βˆ—, but only rules out a relatively small portion of its lower tail.

6.6 summary of major ideas


In Bayesian decision theory, preferences over outcomes are encoded by
a utility function, which in the context of optimization policy design,

assesses the quality of data returned by an optimization routine, 𝑒 (D).


The optimization policy then seeks to design observations to maximize the expected utility of the returned data. The general theory presented in the last chapter makes no assumptions regarding the utility function.16 However, in the context of optimization, some utility functions are particularly easy to motivate.

16 Just like human taste, there is no right or wrong when it comes to preferences, at least not over certain outcomes. The von Neumann–Morgenstern theorem mentioned on p. 90 entails rationality axioms, but these only apply to preferences over uncertain outcomes.
β€’ In many cases there is a decision following optimization in which we
must recommend a single point in the domain for perpetual use. In this
case, it is sensible to define an optimization utility function in terms of
the expected utility of the optimal terminal recommendation (Β§ 6.1, p. 109) informed by the returned data. This requires fully specifying that terminal recom-
mendation, including its action space and utility function, after which
we may β€œpass through” the optimal expected utility (6.1).
β€’ When designing a terminal recommendation (see risk tolerance: Β§ 6.1, p. 111) – especially when we may
recommend points with residual uncertainty in their underlying objec-
tive value – it may be prudent to consider our risk tolerance. Careful
design of the terminal utility allows for us to tune our appetite for risk,
in terms of trading off a point’s expected value against its uncertainty.
Most utilities encountered in Bayesian optimization are risk neutral, but
this need not necessarily be the case.
β€’ Two notable realizations of this scheme are simple reward (6.3; Β§ 6.1, p. 112) and global reward (6.5; Β§ 6.1, p. 113), both of which represent the expected utility of an optimal
terminal recommendation with a risk-neutral utility. The action space for
simple reward is the points visited during optimization, and the action
space for global reward is the entire domain.
β€’ The simple reward simplifies when observations are exact (6.4).
β€’ An alternative to the simple reward is the cumulative reward (6.7; Β§ 6.2, p. 114), which
evaluates a dataset based on the average, rather than maximum, value
observed. This does not see too much direct use in policy design, but is
an important concept for the analysis of algorithms.
β€’ Information gain (Β§ 6.3, p. 115) provides an information-theoretic approach to quanti-
fying the value of data in terms of the information provided by the data
regarding some quantity of interest. This can be quantified by either
measuring the reduction in differential entropy moving from the prior
to the posterior (6.8) or by the kl divergence between the posterior and
prior (6.9) – either induces the same one-step lookahead policy.
β€’ In the context of optimization, information gain regarding either the lo-
cation π‘₯ βˆ— or value 𝑓 βˆ— of the global optimum (1.1) are judicious realizations
of this general approach to utility design.
β€’ An important feature distinguishing simple reward from most other
utility functions is its dependence on the posterior belief at the observed
locations alone, rather than the posterior belief over the entire objective
function. Even in relatively simple examples, this may lead to disagree-
ment between simple reward and other utility functions in judging the
quality of a given dataset.

The utility functions presented in this chapter form the backbone


of the most popular Bayesian optimization algorithms. In particular,
many common policies are realized by maximizing the one-step expected
marginal gain to one of these utilities, as we will show in the next chapter.
7
COMMON BAYESIAN OPTIMIZATION POLICIES

The heart of an optimization routine is its policy, which sequentially


designs each observation in light of available data.1 In the Bayesian
approach to optimization, policies are designed with reference to a prob-
abilistic belief about the objective function, with this belief guiding the
policy in making decisions likely to yield beneficial outcomes. Numer-
ous Bayesian optimization policies have been proposed in the literature,
many of which enjoy widespread use. In this chapter we will present an overview of popular Bayesian optimization policies and emphasize common themes in their construction. In the next chapter we will provide explicit computational details for implementing these policies with Gaussian process models of the objective function (Chapter 8: Computing Policies with Gaussian Processes, p. 157).

1 The reader may wish to recall our model optimization procedure: Algorithm 1.1, p. 3.
Nearly all Bayesian optimization algorithms result from one of two
primary approaches to policy design. The most popular is Bayesian decision theory, the focus of the previous two chapters (Chapter 5: Decision Theory for Optimization, p. 87). In Chapter 5 we
introduced Bayesian decision theory as a general framework for deriving
optimal, but computationally prohibitive, optimization policies. In this
chapter, we will apply the ideas underlying these optimal procedures to
realize computationally tractable and practically useful policies. We will
see that a majority of popular Bayesian optimization algorithms can be interpreted in a uniform manner as performing one-step lookahead (Β§ 5.3, p. 101) for some underlying utility function (Chapter 6: Utility Functions for Optimization, p. 109).
Another avenue for policy design is to adapt algorithms for multi-armed bandits to the optimization setting. A multi-armed bandit is a finite-
dimensional model of sequential optimization with noisy observations.
We consider an agent faced with a finite set of alternatives (β€œarms”), who
is compelled to select a sequence of items from this set. Choosing a given item yields a stochastic reward drawn from an unknown distribution associated with that arm. We seek a sequential policy for selecting arms maximizing the expected cumulative reward (6.7).2

2 The name references a gambler contemplating how to allocate their bankroll among a wall of slot machines. Slot machines are known as β€œone-armed bandits” in American vernacular, as they eventually steal all your money.
Multi-armed bandits have seen decades of sustained study, and some
policies have strong theoretical guarantees on their performance, sug-
gesting these policies may also be useful for optimization. To this end, we
may model optimization as an infinite-armed bandit, where each point in
the domain π‘₯ ∈ X represents an arm with uncertain reward depending
on the objective function value πœ™ = 𝑓 (π‘₯). Our belief about the objective
function then provides a mechanism to reason about these rewards and
derive a policy. This analogy has inspired several Bayesian optimization
policies, many of which enjoy strong performance guarantees.
A central concern in bandit problems is the exploration–exploitation dilemma: we must repeatedly decide whether to allocate resources to an arm already known to yield high reward (β€œexploitation”) or to an arm with uncertain reward to learn about its reward distribution (β€œexploration”). Exploitation may yield a high instantaneous reward, but exploration may provide valuable information for improving future rewards. This tradeoff between instant payoff and learning for the future has been called β€œa conflict evident in all human action.”3

3 p. whittle (1982). Optimization over Time: Dynamic Programming and Stochastic Control. Vol. 1. John Wiley & Sons.

Margin figure: Exploration vs. exploitation. We show reward distributions for two possible options. The more certain option returns higher expected reward, but the alternative reflects more uncertainty and may actually be superior. Which should we prefer?


Figure 7.1: The scenario we will consider for illustrating optimization policies. The objective function prior is a Gaussian process with constant mean and MatΓ©rn covariance with 𝜈 = 5/2 (3.14). We show the marginal predictive distributions and three samples from the posterior conditioned on the indicated observations.

A similar choice is faced throughout optimization, as we must continually decide whether to focus


on a suspected local maximum (exploitation) or to explore unknown
regions of the domain seeking new maxima (exploration). We will see
that typical Bayesian optimization policies reflect consideration of this
dilemma in some way, whether by explicit design or as a consequence
of decision-theoretic reasoning.
Before diving into policy design, we pause to introduce a running
example we will carry through the chapter and notation to facilitate our
discussion. We will then derive a series of policies stemming from Bayes-
ian decision theory, and finally consider bandit-inspired algorithms.

7.1 example optimization scenario


Throughout this chapter we will demonstrate the behavior of optimiza-
tion policies on an example scenario illustrated in Figure 7.1.4 We consider a one-dimensional objective function observed without noise and adopt a Gaussian process prior belief about this function (Chapter 2: Gaussian Processes, p. 15). The prior mean function is constant (3.1), and the prior covariance function is a MatΓ©rn covariance with 𝜈 = 5/2 (3.14). The parameters are fixed so that the domain spans exactly 30 length scales. We condition this prior on three observations, inducing two local maxima in the posterior mean and a range of marginal predictive uncertainty.

4 Take note of the legend; it will not be repeated.
We will illustrate the behavior of policies by simulating optimization
to design a sequence of additional observations for this running example.
The ground truth objective function we will use for these simulations is
shown in Figure 7.2 and was drawn from the corresponding model. The
objective features numerous undiscovered local maxima and exhibits an
unusually high global maximum on the left-hand side of the domain.

7.2 decision-theoretic policies


Central to decision-theoretic optimization is a utility function 𝑒 (D) measuring the quality of a dataset returned by an optimizer (Chapter 6, p. 109). After selecting a utility function and a model of the objective function and our observations, we may design each observation to maximize the expected utility


Figure 7.2: The true objective function used for simulating optimization policies.

of the returned data (5.16). This policy is optimal in the average case: it
maximizes the expected utility of the returned dataset over the space of all possible policies.5 Unfortunately, optimality comes at a great cost. Computing the optimal policy requires recursive simulation of the entire remainder of optimization, a random process due to uncertainty in the outcomes of our observations. In general, the cost of computing the optimal policy grows exponentially with the horizon, the number of observations remaining before termination.

5 To be precise, optimality is defined with respect to a model for the objective function 𝑝 (𝑓 ), an observation model 𝑝 (𝑦 | π‘₯, πœ™), a utility function 𝑒 (D), and an upper bound on the number of observations allowed 𝜏. Bayesian decision theory provides a policy achieving the maximal expected utility at termination with respect to these choices.
However, the structure of the optimal policy suggests a natural family of lookahead approximations based on fixing a computationally tractable maximum horizon throughout optimization (limited lookahead: Β§ 5.3, p. 101). This line of reasoning has
led to many of the practical policies available for Bayesian optimization.
In fact, most popular algorithms represent one-step lookahead, where
in each iteration we greedily maximize the expected utility after obtain-
ing only a single additional observation. Although these policies are
maximally myopic, they are also maximally efficient among lookahead
approximations and have delivered impressive empirical performance in
a wide range of settings.
It may seem surprising that such dramatically myopic policies have
any use at all. There is a huge difference between the scale of reasoning
in one-step lookahead compared with the optimal procedure, which
may consider hundreds of future decisions or more when designing an
observation. However, the situation is somewhat more nuanced than it
might appear. In a seminal paper, kushner argued that myopic policies
may in fact show better empirical performance than a theoretically
optimal policy, and his argument remains convincing:6

Since a mathematical model of [𝑓 ] is available, it is theoretically possible, once a criterion of optimality is given, to determine the mathematically optimal sampling policy. However. . . determination of the optimum sampling policies is extremely difficult. Because of this, the development of our sampling laws has been guided primarily by heuristic considerations.7 There are some advantages to the approximate approach. . . [and] its use may yield better results than would a procedure that is optimum for the model. Although the model selected for [𝑓 ] is the best we have found for our purposes, it is sometimes too general. . .

6 h. j. kushner (1964). A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise. Journal of Basic Engineering 86(1):97–106.
7 Specifically, maximizing probability of improvement: Β§ 7.5, p. 131.

What could possibly cause such a seemingly contradictory finding? As


kushner suggests, one possible reason could be model misspecification.
The optimal policy is only defined with respect to a chosen model of the
objective function and our observations, which is bound to be imperfect.
By relying less on the model’s belief, we may gain some robustness
alongside considerable computational savings.
The intimate relationship between many Bayesian optimization meth-
ods and one-step lookahead is often glossed over, with a policy often
introduced ex nihilo and the implied choice of utility function left un-
stated. This disconnect can sometimes lead to policies that are nonsen-
sical from a decision-theoretic perspective or that incorporate implicit
approximations that may not always be appropriate. We intend to clarify
these connections here. We hope that our presentation can help guide
practitioners in navigating the increasingly crowded space of available
policies when presented with a novel scenario.

One-step lookahead
Let us review the generic procedure for developing a one-step lookahead
policy and adopt standard notation to facilitate their description. Suppose
we have selected an arbitrary utility function 𝑒 (D) to evaluate a returned
dataset. Suppose further that we have already gathered an arbitrary
dataset D = (x, y) and wish to select the next evaluation location. This
is the fundamental role of an optimization policy.
If we were to choose some point π‘₯, we would observe a corresponding value 𝑦 and update our dataset, forming Dβ€² = (xβ€², yβ€²) = D βˆͺ (π‘₯, 𝑦).
Note that in our discussion on decision theory in Chapter 5, we notated
this updated dataset with the symbol D1 , as we needed to be able to
distinguish between datasets after the incorporation of a variable number
of additional observations. As our focus in this chapter will be on one-step
lookahead, we can simplify notation by dropping subscripts indicating
time. Instead, we will systematically use the prime symbol to indicate
future quantities after the acquisition of the next observation.
In one-step lookahead, we evaluate a proposed point π‘₯ via the expected marginal gain in utility after incorporating an observation there (5.8):
𝛼 (π‘₯; D) = 𝔼[𝑒 (Dβ€²) | π‘₯, D] βˆ’ 𝑒 (D),
which serves as an acquisition function (Β§ 5, p. 88) inducing preferences over possible observation locations. We design each observation by maximizing this score:
π‘₯ ∈ arg max_{π‘₯β€² ∈X} 𝛼 (π‘₯β€²; D). (7.1)

When the utility function 𝑒 (D) represents the expected utility of a decision informed by the data, such as a terminal recommendation following optimization, the expected marginal gain is also known as the value of sample information from observing at π‘₯. This term originates
from the study of decision making in an economic context. Consider

Table 7.1: Summary of one-step lookahead optimization policies.

utility function, 𝑒 (D) | expected one-step marginal gain
simple reward (6.3) | expected improvement, Β§ 7.3
global reward (6.5) | knowledge gradient, Β§ 7.4
improvement to simple reward | probability of improvement, Β§ 7.5
information gain (6.8) or (6.9) | mutual information, Β§ 7.6
cumulative reward (6.7) | posterior mean, Β§ 7.10

an agent who must make a decision under uncertainty, and suppose they have access to a third party who is willing to provide potentially insightful advice in exchange for a fee. By reasoning about the potential impact of this advice on the ultimate decision, we may quantify the expected value of the information,8, 9 and determine whether the offered advice is worth the investment.
Due to its simplicity and inherent computational efficiency, one-step lookahead is a pervasive approximation scheme in Bayesian optimization. Table 7.1 provides a list of common acquisition functions, each representing the expected one-step marginal gain to a corresponding utility function. We will discuss each in detail below.

8 j. marschak and r. radner (1972). Economic Theory of Teams. Yale University Press. [Β§ 2.12]
9 h. raiffa and r. schlaifer (1961). Applied Statistical Decision Theory. Division of Research, Graduate School of Business Administration, Harvard University. [Β§ 4.5]
utility function. We will discuss each in detail below.

7.3 expected improvement


Adopting the simple reward utility function (6.3; Β§ 6.1, p. 109) and performing one-
step lookahead defines the expected improvement acquisition function.
Sequential maximization of expected improvement is perhaps the most
widespread policy in all of Bayesian optimization.
Suppose that we wish to locate a single location in the domain with
the highest possible objective value and ultimately wish to recommend
one of the points investigated during optimization for permanent use.
The simple reward utility function evaluates a dataset D precisely by
the expected value of an optimal final recommendation informed by the
data, assuming risk neutrality (Β§ 6.1, p. 109):
𝑒 (D) = max πœ‡D (x).
Suppose we have already gathered observations D = (x, y) and wish to choose the next evaluation location. Expected improvement is derived by measuring the expected marginal gain in utility, or the instantaneous improvement, 𝑒 (Dβ€²) βˆ’ 𝑒 (D),10 offered by making the next observation at a proposed location π‘₯:11
𝛼 ei (π‘₯; D) = ∫ max πœ‡Dβ€² (xβ€²) 𝑝 (𝑦 | π‘₯, D) d𝑦 βˆ’ max πœ‡D (x). (7.2)

10 This reasoning is the same for all one-step lookahead policies, which could all be described as maximizing β€œexpected improvement.” But this name has been claimed for the simple reward utility alone.
11 As mentioned in the last chapter, simple reward degenerates with an empty dataset; expected improvement does as well. In that case we can simply ignore the second term and compute the first, which for zero-mean additive noise becomes the mean function of the prior process.
Expected improvement reduces to a particularly nice expression in
the case of exact observations of the objective, where the utility takes a
simpler form (6.4). Suppose that, when we elect to make an observation
at a location π‘₯, we observe the exact objective value πœ™ = 𝑓 (π‘₯). Consider


Figure 7.3: The expected improvement acquisition function (7.2) corresponding to our running example.

a dataset D = (x, 𝝓), and define πœ™βˆ— = max 𝝓 to be the so-called incumbent: the maximal objective value yet seen.12 As a consequence of exact observation, we have
𝑒 (D) = πœ™βˆ—; 𝑒 (Dβ€²) = max(πœ™βˆ—, πœ™);
and thus
𝑒 (Dβ€²) βˆ’ 𝑒 (D) = max(πœ™ βˆ’ πœ™βˆ—, 0).

12 The value πœ™βˆ— is incumbent as it is currently β€œholding office” as our standing recommendation until it is deposed by a better candidate.

Substituting into (7.2), in the noiseless case we have
𝛼 ei (π‘₯; D) = ∫ max(πœ™ βˆ’ πœ™βˆ—, 0) 𝑝 (πœ™ | π‘₯, D) dπœ™. (7.3)
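When the predictive distribution 𝑝 (πœ™ | π‘₯, D) is Gaussian – as it is for the Gaussian process models used throughout – the integral (7.3) has a well-known closed form in terms of the standard normal pdf and cdf; the book develops such computations in Chapter 8. The sketch below states that closed form and checks it against a direct Monte Carlo estimate of the definition; the particular numbers are arbitrary.

```python
import numpy as np
from scipy import stats

def expected_improvement(mu, sigma, phi_star):
    # closed form of (7.3) for a Gaussian predictive distribution N(mu, sigma^2)
    z = (mu - phi_star) / sigma
    return (mu - phi_star) * stats.norm.cdf(z) + sigma * stats.norm.pdf(z)

# Monte Carlo check of the definition E[max(phi - phi_star, 0)]
rng = np.random.default_rng(0)
mu, sigma, phi_star = 0.2, 0.5, 0.4
phi = rng.normal(mu, sigma, size=1_000_000)

print(expected_improvement(mu, sigma, phi_star))   # closed form
print(np.maximum(phi - phi_star, 0).mean())        # nearly identical estimate
```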
Expected improvement is illustrated for our running example in
Figure 7.3. In this case, maximizing expected improvement will select
a point near the previous best point found, an example of exploitation.
Notice that the expected improvement vanishes near regions where
we have existing observations. Although these locations may be likely
to yield values higher than πœ™ βˆ— due to relatively high expected value,
the relatively narrow credible intervals suggest that the magnitude of
any improvement is likely to be small. Expected improvement is thus
considering the exploration–exploitation dilemma in the selection of the
next observation location, and the tradeoff between these two concerns
is considered automatically.
Figure 7.4 shows the posterior belief of the objective after sequentially
maximizing expected improvement to gather 20 additional observations
of our example objective function. The global optimum was efficiently
located. The distribution of the sample locations, with more evaluations
in the most promising regions, reflects consideration of the exploration–
exploitation dilemma. However, there seems to have been a focus on
exploitation throughout the entire process; the first ten observations
for example never strayed from the initially known local optimum. This
behavior is a reflection of the simple reward utility function underlying
the policy, which only rewards the discovery of high objective func-
tion values at observed locations. As a result, one-step lookahead may

Figure 7.4: The posterior after 10 (top) and 20 (bottom) steps of the optimization policy induced by the expected improve-
ment acquisition function (7.2) on our running example. The tick marks show the points chosen by the policy,
progressing from top to bottom, during iterations 1–10 (top) and 11–20 (bottom). Observations within 0.2
length scales of the optimum have thicker marks; the optimum was located on iteration 19.

rationally choose to make marginal improvements to the value of the


best-seen point, even if the underlying function value is known with a
fair amount of confidence.

7.4 knowledge gradient


Adopting the global reward utility (6.5; Β§ 6.1, p. 109) and performing one-step lookahead
yields an acquisition function known as the knowledge gradient.
Assume that, just as in the situation leading to the derivation of
expected improvement, we again wish to identify a single point in the
domain maximizing the objective function. However, imagine that at
termination we are willing to commit to a location possibly never evalu-
ated during optimization. To this end, we adopt the global reward utility
function to measure our progress:
𝑒 (D) = max πœ‡D (π‘₯),
π‘₯ ∈X

which rewards data for increasing the posterior mean, irrespective of


location. Computing the one-step marginal gain to this utility results in
the knowledge gradient acquisition function:
𝛼 kg (π‘₯; D) = ∫ max_{π‘₯β€² ∈X} πœ‡Dβ€² (π‘₯β€²) 𝑝 (𝑦 | π‘₯, D) d𝑦 βˆ’ max_{π‘₯β€² ∈X} πœ‡D (π‘₯β€²). (7.4)
The knowledge gradient moniker was coined by frazier and powell,13 who interpreted the global reward as the amount of β€œknowledge” about the global maximum offered by a dataset D. The knowledge gradient 𝛼 kg (π‘₯; D) can then be interpreted as the expected change in knowledge (that is, a discrete-time gradient) offered by a measurement at π‘₯.

13 p. frazier and w. powell (2007). The Knowledge Gradient Policy for Offline Learning with Independent Normal Rewards. adprl 2007.

Figure 7.5: The knowledge gradient acquisition function (7.4) corresponding to our running example.

Figure 7.6: Samples of the updated posterior mean when evaluating at the location chosen by the knowledge gradient, illustrated in Figure 7.5. Only the right-hand section of the domain is shown.
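The expectation in (7.4) is over the unknown observed value 𝑦 and generally has no closed form, but it is straightforward to estimate by simulation. The sketch below does so for a toy Gaussian model on a grid (the kernel, noise level, and prior are assumptions made for illustration): for each candidate index we sample hypothetical observations, update the posterior mean, and average the resulting improvement in its maximum.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(0, 1, 101)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.1) ** 2)   # prior covariance on the grid
m = np.zeros_like(x)                                        # prior mean on the grid
noise = 0.1 ** 2                                            # observation noise variance

def knowledge_gradient(i, n_samples=2000):
    # Monte Carlo estimate of (7.4) for an observation at grid index i
    v = K[i, i] + noise
    total = 0.0
    for y in rng.normal(m[i], np.sqrt(v), size=n_samples):
        m_new = m + K[:, i] * (y - m[i]) / v   # posterior mean given the sampled y
        total += m_new.max()
    return total / n_samples - m.max()

alpha_kg = np.array([knowledge_gradient(i) for i in range(len(x))])
print(x[np.argmax(alpha_kg)])   # next observation location under this policy
```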
The knowledge gradient is illustrated for our running example in
Figure 7.5. Perhaps surprisingly, the chosen observation location is re-
markably close to the previously best-seen point. At first glance, this
may seem wasteful, as we are already fairly confident about the value
we might observe.
However, the knowledge gradient seeks to maximize the global max-
imum of the posterior mean, regardless of its location. With this in mind,
we may reason as follows. There must be a local maximum of the objec-
tive function in the neighborhood of the best-seen point, but our current
knowledge is insufficient to pinpoint its location. Further, as the rele-
vant local maximum is probably not located precisely at this point, the
objective function is either increasing or decreasing as it passes through.
If we were to learn the derivative of the objective at this point, we would
adjust our posterior belief to reflect that knowledge. Regardless of the
sign or exact value of the derivative, our updated belief would reflect
the discovery of a new, higher local maximum of the posterior mean
in the indicated direction. By evaluating at the location selected by the
knowledge gradient, we can effectively estimate the derivative of the
objective; this is the principle behind finite differencing.
In Figure 7.6, we show samples of the updated posterior mean func-
tion πœ‡Dβ€² (π‘₯) derived from sampling from the predictive distribution at
the chosen evaluation location and conditioning. Indeed, these samples
exhibit newly located global maxima on either side of the selected point,
depending on the sign of the implied derivative. Note that the locations

Figure 7.7: The posterior after 10 (top) and 20 (bottom) steps of the optimization policy induced by the knowledge gradient
acquisition function (7.4) on our running example. The tick marks show the points chosen by the policy,
progressing from top to bottom, during iterations 1–10 (top) and 11–20 (bottom). Observations within 0.2 length
scales of the optimum have thicker marks; the optimum was located on iteration 15.

of these new maxima coincide with local maxima of the expected im-
provement acquisition function; see Figure 7.3 for comparison. This is
not a coincidence! One way to interpret this relation is that, due to re-
warding large values of the posterior mean at observed locations only,
expected improvement must essentially guess on which side the hidden
local optimum of the objective lies and hope to be correct. The knowl-
edge gradient, on the other hand, considers identifying this maximum
on either side a success, and guessing is not necessary.
Figure 7.7 illustrates the behavior of the knowledge gradient policy
on our example optimization scenario. The global optimum was located
efficiently. Comparing the decisions made by the knowledge gradient to
those made by expected improvement (see Figure 7.4), we can observe a somewhat more even exploration of the domain, including in local
maxima. The knowledge gradient policy does not necessarily need to
expend observations to verify a suspected maximum, instead putting
more trust into the model to have correct beliefs in these regions.

7.5 probability of improvement


As its name suggests, the probability of improvement acquisition function
computes the probability of an observed value to improve upon some
chosen threshold, regardless of the magnitude of this improvement.


Figure 7.8: An illustrative example comparing the behavior of probability of improvement with expected improvement
computed with respect to the dashed target. The predictive distributions for two points π‘₯ and π‘₯ β€² are shown. The
distributions have equal mean but the distribution at π‘₯ β€² has larger predictive standard deviation. The shaded
regions represent the region of improvement. The relatively safe π‘₯ is preferred by probability of improvement,
whereas the more-risky π‘₯ β€² is preferred by expected improvement.

Consider the simple reward (Β§ 6.1, p. 109) of an already gathered dataset D = (x, y):

𝑒 (D) = max πœ‡D (x).

The probability of improvement acquisition function scores a proposed


observation location π‘₯ according to the probability that an observation
there will improve this utility by at least some margin πœ€. Let us denote the desired utility threshold with 𝜏 = 𝑒 (D) + πœ€; we will use both the absolute
threshold 𝜏 and the marginal threshold πœ€ in the following discussion as
convenient. The probability of improvement is then the probability that
the updated utility 𝑒 (D β€²) exceeds the chosen threshold:

𝛼 pi (π‘₯; D, 𝜏) = Pr 𝑒 (D β€²) > 𝜏 | π‘₯, D . (7.5)

We may interpret probability of improvement in the Bayesian decision-


theoretic framework as computing the expected one-step marginal gain
in a peculiar choice of utility function: a utility offering unit reward for
each observation increasing the simple reward by the desired amount.
In the case of exact observation, we have

𝑒 (D) = max 𝑓 (x) = πœ™βˆ—; 𝑒 (Dβ€²) = max(πœ™βˆ—, πœ™),

and we may write the probability of improvement in the somewhat


simpler form
𝛼 pi (π‘₯; D, 𝜏) = Pr(πœ™ > 𝜏 | π‘₯, D). (7.6)
In this case, the probability of improvement is simply the complementary
cumulative distribution function of the predictive distribution evaluated
at the improvement threshold 𝜏. This form of probability of improvement
is sometimes encountered in the literature, but our modification in terms
of the simple reward allows for inexact observations as well.
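For a Gaussian predictive distribution, the probability (7.6) is simply the complementary normal cdf evaluated at the threshold 𝜏 = πœ™βˆ— + πœ€. The sketch below (with illustrative numbers, not those behind Figure 7.8) also previews the risk-averse character discussed next: between two points with equal mean, the one with smaller predictive uncertainty receives the higher probability of improvement.

```python
from scipy import stats

def probability_of_improvement(mu, sigma, phi_star, epsilon=0.0):
    # Pr(phi > tau) for a Gaussian predictive distribution N(mu, sigma^2)
    tau = phi_star + epsilon
    return stats.norm.sf(tau, loc=mu, scale=sigma)

# equal means, different risk: the safer point is preferred by this score
print(probability_of_improvement(mu=1.0, sigma=0.1, phi_star=0.8))   # ~0.98
print(probability_of_improvement(mu=1.0, sigma=0.5, phi_star=0.8))   # ~0.66
```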
It can be illustrative to compare the preferences over observation
locations implied by the probability of improvement and expected im-
provement acquisition functions. In general, probability of improvement
is somewhat more risk-averse than expected improvement, because prob-
ability of improvement would prefer a certain improvement of modest

Panels (top to bottom): 𝛼 pi with πœ€ = 0, πœ€ = 0.1, and πœ€ = 1, each marking the next observation location.

Figure 7.9: The probability of improvement acquisition function (7.5) corresponding to our running example for different
values of the target improvement πœ€. The target is expressed as a fraction of the range of the posterior mean
over the space. Increasing the target improvement leads to increasingly exploratory behavior.

magnitude to an uncertain improvement of potentially large magnitude.


Figure 7.8 illustrates this phenomenon. Shown are the predictive distri-
butions for the objective function values at two points π‘₯ and π‘₯ .β€² Both
points have equal predictive means; however, π‘₯ β€² has a significantly larger
predictive standard deviation. We consider improvement with respect
to the illustrated target. The shaded regions represent the regions of
improvement; the probability mass of these regions equal the probabil-
ities of improvement. Improvement is near certain at π‘₯ (𝛼 pi = 99.9%),
whereas it is somewhat smaller at π‘₯ β€² (𝛼 pi = 72.6%), and thus probability
of improvement would prefer to observe at π‘₯. The expected improvement
at π‘₯, however, is small compared to π‘₯ β€² with its longer tail:
𝛼 ei (π‘₯ β€²; D)
= 1.28.
𝛼 ei (π‘₯; D)
The expected improvement at π‘₯ β€² is 28% larger than at π‘₯, indicating a
preference for a less-certain but potentially larger payout.

The role of the improvement target


The magnitude of the required improvement plays a crucial role in shap-
ing the behavior of probability of improvement policies. By adjusting
this parameter, we may encourage exploration (with large πœ€) or exploita-
tion (with small πœ€). Figure 7.9 shows the probability of improvement for

Figure 7.10: The posterior after 10 (top) and 20 (bottom) steps of the optimization policy induced by probability of improvement with πœ€ = 0.1 [max πœ‡D (π‘₯) βˆ’ min πœ‡D (π‘₯)] (7.5) on our running example. The tick marks show the points chosen by the policy, progressing from top to bottom, during iterations 1–10 (top) and 11–20 (bottom). Observations within 0.2 length scales of the optimum have thicker marks; the optimum was located on iteration 15.

our example scenario with thresholds corresponding to infinitesimal im-


provement, a modest improvement, and a significant improvement. The
shift toward exploratory behavior for larger improvement thresholds
can be clearly seen.
In Figure 7.10, we see 20 evaluations chosen by maximizing probabil-
ity of improvement with the target dynamically set to 10% of the range
of the posterior mean function. The global optimum was located, and
the domain appears sufficiently explored. Although performance was
quite reasonable here, the improvement threshold was set somewhat
arbitrarily, and it is not always clear how one should set this parameter.
the πœ€ = 0 case On one extreme, some authors define a parameter-free (and prob-
ably too literal) version of probability of improvement by fixing the
improvement target to πœ€ = 0, rewarding even infinitesimal improvement
to the current data. Intuitively, this low bar can induce overly exploita-
tive behavior. Examining the probability of improvement with πœ€ = 0
for our running example in Figure 7.9, we see that the acquisition func-
tion is maximized directly next to the previously best-found point. This
decision represents extreme exploitation and potentially undesirable
behavior. The situation after applying probability of improvement with
πœ€ = 0 to select 20 additional observation locations, shown in Figure 7.11,
clearly demonstrates a drastic focus on exploitation. Notably, the global
optimum was not identified.

Figure 7.11: The posterior after 20 steps of the optimization policy induced by probability of improvement with πœ€ = 0
(7.5) on our running example. The tick marks show the points chosen by the policy, progressing from top to
bottom.

Evidently we must carefully select the desired improvement threshold to achieve ideal behavior. jones provided some simple, data-driven advice for choosing improvement thresholds that remains sound.14 Define

14 d. r. jones (2001). A Taxonomy of Global Optimization Methods Based on Response Surfaces. Journal of Global Optimization 21(4):345–383.