Algorithms For Validation
Mykel J. Kochenderfer
Sydney M. Katz
Anthony L. Corso
Robert J. Moss
Stanford, California
© 2024 Kochenderfer, Katz, Corso, and Moss
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means
(including photocopying, recording or information storage and retrieval) without permission in writing from
the publisher.
This book was set in TeX Gyre Pagella by the authors in LaTeX.
Printed and bound in the United States of America.
ISBN:
10 9 8 7 6 5 4 3 2 1
To our families.
Contents
Acknowledgments xi
Preface xiii
1 Introduction 1
1.1 Validation 1
1.2 History 3
1.3 Societal Consequences 6
1.4 Validation Algorithms 8
1.5 Challenges 14
1.6 Overview 16
2 System Modeling 19
2.1 Model Building 19
2.2 Probability 20
2.3 Parameter Learning 27
2.4 Agent Models 39
2.5 Model Validation 42
2.6 Summary 51
3 Property Specification 53
3.1 Properties of Systems 53
3.2 Metrics for Stochastic Systems 54
3.3 Composite Metrics 56
3.4 Logical Specifications 62
3.5 Temporal Logic 65
3.6 Reachability Specifications 73
3.7 Summary 77
Appendices
A Systems 317
A.1 Default Implementations 317
A.2 Simple Gaussian System 318
A.3 Multivariate Gaussian System 318
A.4 Mass-Spring-Damper System 319
A.5 Inverted Pendulum System 321
A.6 Grid World System 322
A.7 Continuum World System 322
A.8 Aircraft Collision Avoidance System 325
B Mathematical Concepts 327
B.1 Measure Spaces 327
B.2 Probability Spaces 328
B.3 Metric Spaces 328
B.4 Normed Vector Spaces 328
B.5 Positive Definiteness 330
B.6 Information Content 330
B.7 Entropy 330
B.8 Cross Entropy 331
B.9 Relative Entropy 331
B.10 Taylor Expansion 331
C Neural Representations 335
D Julia 341
D.1 Types 341
D.2 Functions 354
D.3 Control Flow 357
D.4 Packages 359
D.5 Convenience Functions 363
References 365
Index 377
Acknowledgments
We wish to thank the many individuals who have provided valuable feedback
on early drafts of our manuscript, including Matthias Althoff, Stephen Boyd,
Emmanuel Candès, Francois Chaubard, Harrison Delecki, Hanna Krasowski,
Liam Kruse, Alexandros Tzikas, Romeo Valentin, and Jun Wang. Many of the
algorithms discussed in this book were explored during the development of
the ACAS X aircraft collision avoidance systems with the generous support and
leadership of Neal Suchy of the Federal Aviation Administration. The participants
of Dagstuhl Seminar 24361 provided valuable input to the topics included in the
book. It has been a pleasure working with Elizabeth Swayze and the editing team
from the MIT Press in preparing this manuscript for publication.
The style of this book was inspired by Edward Tufte. Among other stylistic
elements, we adopted his wide margins and use of small multiples. The type-
setting of this book is based on the Tufte-LaTeX package by Kevin Godby, Bil
Kleb, and Bill Wood. The book’s color scheme was adapted from the Monokai
theme by Jon Skinner of Sublime Text (sublimetext.com) and a palette that better
accommodates individuals with color blindness.1 For plots, we use the viridis
color map defined by Stéfan van der Walt and Nathaniel Smith.

1 B. Wong, "Points of View: Color Blindness," Nature Methods, vol. 8, no. 6, pp. 441–442, 2011.
We have also benefited from the various open-source packages on which this
textbook depends (see appendix D). The authors thank Tor Fjelde for his help
with Turing.jl. The typesetting of the code was done with the help of pythontex,
which is maintained by Geoffrey Poore. The typeface used for the algorithms is
JuliaMono (github.com/cormullion/juliamono). The plotting was handled by
pgfplots, which is maintained by Christian Feuersänger.
Preface
Mykel J. Kochenderfer
Sydney M. Katz
Anthony L. Corso
Robert J. Moss
Stanford, California
December 23, 2024
1 Introduction
1.1 Validation
example, aircraft designers validate the structural integrity of the wings through
extensive stress testing, and medical device manufacturers validate the safety
of their devices through clinical trials. In this book, we present an algorithmic
perspective on validation and focus specifically on the validation of decision-
making agents.
Decision-making agents interact with the environment and make decisions
based on the information they receive. These agents range from fully automated
systems that operate independently within their environment to decision-support
systems that inform human decision-makers.5 Examples include aircraft collision
avoidance systems, adaptive cruise control systems, hiring assistants, disaster
response systems, and other cyberphysical systems.6 While the algorithms presented
in this book can be applied to many different types of decision-making
agents, we place a particular emphasis on sequential decision-making agents,
which make a series of decisions over time. For example, an autonomous vehicle
must make a sequence of decisions to navigate from one location to another.

5 Autonomy and automation have different definitions in different communities. Autonomy is often defined as the automation of high-level tasks such as driving. The algorithms in this book can be applied to decision-making systems with any level of automation or autonomy.
6 Cyberphysical systems are computational systems that interact with the physical world.

1.2 History
(Figure 1.2 appears here: the waterfall and V models of the software development life cycle, with phases including system requirements, detailed design, implementation, unit testing, system validation, deployment, and maintenance.)
During World War II, production volume increased to the point where it was no
longer possible to inspect every product. This increase in production output led
to the adoption of statistical quality control methods, which relied on sampling
to speed up inspection. These ideas were developed by W. Edwards Deming9
(1900–1993) and Joseph M. Juran10 (1904–2008) and marked the beginning of
the field of statistical process control. Deming and Juran introduced these ideas
to Japanese manufacturers after World War II, which played a key role in the
post-war economic recovery of Japan.

9 W. M. Tsutsui, "W. Edwards Deming and the Origins of Quality Control in Japan," Journal of Japanese Studies, vol. 22, no. 2, pp. 295–325, 1996.
10 D. Phillips-Donaldson, "100 Years of Juran," Quality Progress, vol. 37, no. 5, pp. 25–31, 2004.
The advancements in computing technology in the latter half of the 20th cen-
tury increased our ability to use statistical methods to validate complex systems.
In the late 1940s, scientists at Los Alamos National Laboratory developed the
Monte Carlo method, which uses random sampling to solve complex mathemati-
cal problems.11 These methods were later used to validate complex systems in
a variety of domains such as aviation and finance. Progress in computing technology
also led to new challenges in validation. The development of software
systems required new validation techniques and best practices to ensure that the
software operated correctly.

11 A. F. Bielajew, "History of Monte Carlo," in Monte Carlo Techniques in Radiation Therapy, CRC Press, 2021, pp. 3–15.
In the 1970s, software engineers began formalizing the software development
life cycle into phases that supported rigorous testing and validation. The water-
fall model of software development, introduced in 1970, divided the software
development process into distinct phases including requirements, design, implementation,
testing, and maintenance.12 In the 1990s, the waterfall model was
refined into the V model, which emphasizes the importance of testing and validation
throughout the software development process.13 The V model aligns testing
and validation activities with the corresponding development activities, ensuring
that the system is validated at each stage of development. Figure 1.2 compares
the waterfall and V models of the software development life cycle.

12 W. W. Royce, "Managing the Development of Large Software Systems: Concepts and Techniques," IEEE WESCON, 1970.
13 K. Forsberg and H. Mooz, "The Relationship of System Engineering to the Project Cycle," Center for Systems Management, vol. 5333, 1991.

The 20th century also saw the emergence of regulatory bodies to guide the safe
model to account for validation of the learning process. In general, the validation
of AI systems is still an active area of research.
1.3 Societal Consequences

1.3.1 Safety
Validation is necessary for ensuring the safety of systems that interact with the
physical world. Failures of safety-critical systems can result in catastrophic accidents
that cause injury or loss of life. For example, unintended behavior of the
safety-critical software used by the Therac-25 radiation therapy machine caused
radiation overdoses that resulted in death or serious injury to six patients.20 Safety
is also important for transportation systems such as aircraft and cars. In 2002,
a mid-air collision over Überlingen, Germany resulted in 71 fatalities when the
traffic alert and collision avoidance system (TCAS) and air traffic control (ATC)
systems issued conflicting instructions to the pilots.21 Furthermore, it is important
to ensure that autonomous vehicles make safe decisions in a wide range of scenarios
to prevent potential accidents. Since their introduction, autonomous vehicles
have been involved in accidents that have resulted in injuries or fatalities.22

20 N. G. Leveson and C. S. Turner, "An Investigation of the Therac-25 Accidents," Computer, vol. 26, no. 7, pp. 18–41, 1993.
21 J. Kuchar and A. C. Drumm, "The Traffic Alert and Collision Avoidance System," Lincoln Laboratory Journal, vol. 16, no. 2, p. 277, 2007.
22 R. L. McCarthy, "Autonomous Vehicle Accident Data Analysis: California OL 316 Reports: 2015–2020," ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part B: Mechanical Engineering, vol. 8, no. 3, p. 034502, 2022.

1.3.2 Fairness
When agents make decisions that affect the lives of large groups of people, we must
ensure that their decisions are fair and unbiased. Validation helps researchers
and organizations identify and correct biases in decision-making systems before
deployment. If these biases are not addressed, they can have serious consequences
for individuals and society as a whole. For example, an automated hiring system
developed by Amazon was ultimately discontinued after it was found to be
biased against women due to biases in the historical data it was trained on.23 In
another case, a software system designed to predict recidivism rates in criminal
defendants called COMPAS was found to be biased toward certain demographics
based on empirical data.24 Using the outputs of these systems to make decisions
can result in the unfair treatment of individuals. Validating these systems before
deployment can help prevent this type of failure.

23 A. L. Hunkenschroer and A. Kriebitz, "Is AI Recruiting (Un)ethical? A Human Rights Perspective on the Use of AI for Hiring," AI and Ethics, vol. 3, no. 1, pp. 199–213, 2023.
24 Other research has argued that the system is fair under a different definition of fairness. A detailed discussion is provided in J. Kleinberg, S. Mullainathan, and M. Raghavan, "Inherent Trade-Offs in the Fair Determination of Risk Scores," in Innovations in Theoretical Computer Science (ITCS) Conference, 2017.

1.3.3 Public Trust

Public trust in autonomous systems is critical for their widespread adoption, and
validation plays a key role in developing this trust. For example, trust has been
1.4 Validation Algorithms

(Figure 1.3 appears here: a system and a specification are provided as inputs to a validation algorithm.)
Validation algorithms require two inputs, as shown in figure 1.3. The first input is
the system under test, which we will refer to as the system. The system represents
a decision-making agent operating in an environment. The agent makes decisions
based on information from the environment that it receives from sensors.30 The
second input is a specification, which expresses an operating requirement for the
system. Specifications often pertain to safety, but they may also address other key
design objectives. Given these inputs, validation algorithms output metrics to
help us understand the scenarios in which the system does or does not satisfy
the specification. The rest of this section provides a high-level overview of these
inputs and outputs.

30 Up to this point, we have informally used the term system to refer to only the agent and its sensors. For the remainder of the book, we will also include the operating environment as part of the system.
1.4.1 System
A system (algorithm 1.1) consists of three main components: an environment,
an agent, and a sensor. The environment represents the world in which the agent
operates. We refer to an agent’s configuration within its environment as its state s.
The state space S represents the set of all possible states. An environment consists
of an initial state distribution and a transition model. When the agent takes an
action, the state evolves probabilistically according to the transition model. The
transition model T(s′ | s, a) denotes the probability of transitioning to state s′
from state s when the agent takes action a.
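To make these components concrete, the following is a minimal sketch of how the environment, sensor, and agent might be grouped into a system type. The field names are illustrative assumptions rather than the exact interface of algorithm 1.1.

struct Environment
    initial_state_distribution   # distribution over initial states
    transition                   # transition(s, a) returns a distribution over the next state s′
end

struct Sensor
    observation                  # observation(s) returns a distribution over observations o
end

struct Agent
    policy                       # policy(o) returns a distribution over actions a
end

struct System
    environment::Environment
    sensor::Sensor
    agent::Agent
end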
For physical systems, the state often represents an agent’s position and velocity
in the environment, and the transition model is typically governed by the agent’s
equations of motion. Figure 1.4 shows an example of a state for an inverted
pendulum system. The state and transition model may also contain information
about other agents in the environment. For example, the environment for an
aircraft collision avoidance system contains the other aircraft in the airspace that
the agent must avoid. The other agents may also be human agents such as other
drivers or pedestrians in the environment of an autonomous vehicle. The presence
of other agents in the environment often increases our uncertainty in the outcome
of a particular action.

Figure 1.4. The state s = [θ, ω] of an inverted pendulum system can be compactly represented as its current angle from the vertical θ and its angular velocity ω.

In many real-world systems, agents do not have access to their true state within
the environment and instead rely on observations from sensors. We define the
sensor component of a system as a mechanism for sensing information about the
environment. Many real-world systems rely on multiple sensors, so the sensor
component may contain multiple sensing modalities. For example, an autonomous
vehicle senses its position in the world using a combination of sensors such as
global positioning systems (GPS), cameras, and LiDAR. We model the sensor
component using an observation model O(o | s), which represents the probability
of producing observation o in state s. Observations come in multiple forms based
on the sensing modality. For example, GPS sensors output coordinates, while
camera sensors output image data. We call the set of all possible observations for
a system its observation space O .
An agent uses observations to select actions from a set of possible actions
known as the action space A. Agents may use a number of decision-making
algorithms or frameworks to select actions. While some agents select actions
based entirely on the observation, other agents use the observation to first estimate
the state and then select an action based on this estimate. Furthermore, some
(Figure 1.5 appears here: the sensor component maps the state s to an observation o through the observation model O.)
agents may keep track of previous actions and observations internally to improve
their state estimate. For example, an aircraft that only observes its altitude may
keep track of previous altitude measurements to estimate its climb or descent
rate. We abstract these behaviors of the agent using the notion of a policy π,
which is responsible for selecting an action given the current observation and
information the agent has stored previously. An agent’s policy can be stochastic
or deterministic. A stochastic policy samples actions according to a probability
distribution, while a deterministic policy will always produce the same action
given the same information.
The transition model T(s′ | s, a) satisfies the Markov assumption, which requires
that the next state depend only on the current state and action. The state space,
action space, observation space, observation model, and transition model are all elements
of a sequential decision-making framework known as a partially observable
Markov decision process (POMDP).31 Figure 1.5 demonstrates how these elements
fit into the components of a system. Appendix A provides implementations of
these components for the example systems discussed in this book.

31 M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022.
We analyze the behavior of a system over time by considering the sequence
of states, observations, and actions that the agent experiences. This sequence
is known as a trajectory. We generate trajectories by performing a rollout of the
system (algorithm 1.2). A rollout begins by sampling an initial state from the
initial state distribution associated with the environment. At each time step, the
sensor produces an observation based on the current state, the agent selects an
action based on the observation, and the environment transitions to a new state
based on the action. We repeat this process to a desired depth d to generate a
trajectory τ = (s1 , o1 , a1 , . . . , sd , od , ad ) where si+1 ∼ T (· | si , ai ), oi ∼ O(· | si ),
and ai ∼ π (· | oi ). Figure 1.6 shows an example trajectory for the inverted
pendulum system.
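As a rough sketch of this rollout procedure (assuming the illustrative System fields sketched in section 1.4.1 and Distributions.jl-style sampling with rand), a depth-d rollout might look like the following; the interface of the actual algorithm 1.2 may differ.

using Distributions

function rollout(sys, d)
    τ = []                                                  # trajectory of (s, o, a) tuples
    s = rand(sys.environment.initial_state_distribution)   # sample the initial state
    for i in 1:d
        o = rand(sys.sensor.observation(s))                 # oᵢ ∼ O(· | sᵢ)
        a = rand(sys.agent.policy(o))                       # aᵢ ∼ π(· | oᵢ)
        push!(τ, (s, o, a))
        s = rand(sys.environment.transition(s, a))          # sᵢ₊₁ ∼ T(· | sᵢ, aᵢ)
    end
    return τ
end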
1.4.2 Specification
A specification ψ is a formal expression of a requirement that the system must
satisfy when deployed in the real world. These requirements may be derived from
domain knowledge or other systems engineering principles. Some industries
have regulatory agencies that govern requirements. These agencies are especially
common in safety-critical industries. For example, the FAA and the FDA in the
United States provide regulations and requirements for aircraft and healthcare
systems, respectively.
We express specifications by translating operating requirements to logical
formulas that can be evaluated on trajectories.32 For example, the specification
for an aircraft collision avoidance system is that the agent should not collide with
other aircraft in the airspace. Given a trajectory, we want to check whether any of
the states in the trajectory represent a collision.

32 Chapter 3 discusses this process in detail.
Algorithm 1.3 defines a general framework for specifications that we will use
throughout this book. Evaluating a specification on a trajectory results in a Boolean
value that indicates whether the specification is satisfied. We consider a trajectory
to be a failure if the specification is not satisfied. Example 1.1 demonstrates this
idea on a simple grid world system. We can also derive higher-level metrics from
specifications such as the probability of failure or the expected cost of failure.
Example 1.1. Example trajectories evaluated against a specification for the grid world system.

In the grid world example shown on the right, the agent's goal is to navigate
to the green goal state while avoiding the red obstacle state. Therefore, given
a trajectory, the specification ψ will be satisfied if the trajectory contains the
goal state and does not contain the obstacle state. The green trajectory in the
figure satisfies the specification, while the red trajectory represents a failure.
Chapter 3 will discuss how to express this specification as a logical formula.
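A minimal sketch of how such a specification might be checked on a trajectory of (s, o, a) tuples is shown below; the names goal and obstacle stand in for the actual grid world states and are purely illustrative.

function satisfies_specification(τ, goal, obstacle)
    states = [step[1] for step in τ]                 # extract the visited states
    return goal in states && !(obstacle in states)   # reach the goal and avoid the obstacle
end

isfailure(τ, goal, obstacle) = !satisfies_specification(τ, goal, obstacle)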
Figure 1.7. Failure analysis outputs for a simple system where failures occur to the left of the dashed red line with likelihood represented by the height of the black curve. The plot on the left shows a set of failure samples that could be identified through falsification. The plot in the middle highlights the shape of the failure distribution, and the shaded region in the plot on the right corresponds to the probability of failure. (Panel titles: Falsification, Failure Distribution, Failure Probability.)
1.5 Challenges
• Cost and safety: Testing systems in the real world is expensive and can lead
to safety issues. For example, testing an aircraft collision avoidance system
involves operating aircraft in close proximity with one another for long periods
of time. For this reason, we often rely on simulation to test systems before
deploying them in the real world. We must be careful to ensure that the simu-
lated system accurately models the real-world system. However, capturing the
full complexity of the real world in simulation can result in simulators that are
computationally expensive to run.
catastrophic failures. Because these edge cases occur infrequently, they are
often difficult to identify.
1.6 Overview
This section outlines the remaining chapters of the book, which can be organized
into several categories:
2 System Modeling
2. Select the parameters for the model class. This process involves selecting the pa-
rameters that best represent the system based on available data or expert
knowledge.
3. Validate the model. Once selected, the model should be validated to ensure that
it accurately represents the system.
In this chapter, we will discuss the different model classes that can be used to
represent the system components and the methods for selecting the parameters
of these models.
There are a variety of challenges when building models. We want to select a
model class that is expressive enough to capture the true system, which requires
capturing all possible scenarios the system may encounter. For example, a model
of an aircraft collision avoidance system must account for all possible pilot and
intruder behaviors. However, complex models can be difficult to use for validation.
Therefore, we want to ensure that we select the simplest model class that can
accurately represent the behavior of the system.1 Additionally, building models
requires data and expert knowledge, which may require significant effort to produce.
A final challenge is selecting the objective and optimization technique used
to determine the best model parameters. Given these challenges, it is important
that we carefully validate the performance of the final model.

1 This idea is captured in a quote from British statistician George E. P. Box (1919–2013), which states that "all models are wrong, but some are useful." G. E. Box, "Science and Statistics," Journal of the American Statistical Association, vol. 71, no. 356, pp. 791–799, 1976.
2.2 Probability
Many systems have components with multiple possible outcomes and uncertainty
over which outcome will occur. To build mathematical models that account for
this uncertainty, we use the concept of probability.2 The probability of a particular
outcome is a number between 0 and 1 that quantifies the likelihood of that outcome
occurring, relative to all possible outcomes. If one outcome is more likely to
occur than another, it has a higher probability.

2 A detailed overview of probability theory is provided by E. T. Jaynes, Probability Theory: The Logic of Science. Cambridge University Press, 2003.
to 1 such that

    ∑ₓ P(x) = 1     (2.1)

where 0 ≤ P(x) ≤ 1 for all x. Figure 2.1 shows an example of a probability mass
function for a discrete distribution.

where p(x) ≥ 0 for all x. The support of a continuous distribution is the set of all
values x for which p(x) > 0.
Figure 2.2. A probability density function for a continuous distribution over a variable x. We can find the probability that x falls between two values a and b by integrating the probability density function over that interval.

Probability distributions are a common type of model class because they are
often represented using probability mass or density functions that are determined
by a set of parameters θ. For example, the probability density function of a
common distribution called the Gaussian distribution (also known as the normal
distribution) is parameterized by its mean µ and variance σ² such that θ = [µ, σ²]
(example 2.1). For a discrete distribution, the parameters θ typically correspond
to the probability mass associated with each possible outcome. In general, we
will use Pθ ( x ) and pθ ( x ) to denote the probability mass or probability density
function of a distribution with parameters θ.
We can form more complex distributions by mixing together simpler distri-
butions. Distributions formed in this way are known as mixture models. Many
common distributions such as the Gaussian distribution are unimodal, meaning
that they have a single peak. We can represent complex multimodal distributions
Example 2.1. The Gaussian distribution for modeling continuous variables. The mean of a Gaussian distribution controls the location of the center of the distribution, while the variance controls the spread of the distribution.

One common distribution used to describe continuous variables is the Gaussian
distribution (also called the normal distribution) N(µ, σ²). A Gaussian
distribution is parameterized by its mean µ and its variance σ². The probability
density function for a Gaussian distribution is given by

    N(x | µ, σ²) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))     (2.4)

given a mean µ and variance σ². The mean controls the location of the center
of the distribution, while the variance controls the spread of the distribution.
The plots below show examples of Gaussian distributions with different
means and variances (µ = 0, σ² = 1; µ = 1, σ² = 1; and µ = 0, σ² = 2).
Example 2.2. An example of a Gaussian mixture model.

A Gaussian mixture model is a mixture model that is simply a weighted
average of various Gaussian distributions. The parameters of a Gaussian mixture
model include the parameters of the Gaussian distribution components
µ1:n, σ²1:n, as well as their weights ρ1:n. The density is given by

    p(x | µ1:n, σ²1:n, ρ1:n) = ∑ᵢ₌₁ⁿ ρᵢ N(x | µᵢ, σᵢ²)     (2.5)

(The example plot shows the scaled components and the resulting mixture density p(x).)
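As an illustration (with arbitrary component parameters, not those used to produce the plot above), a Gaussian mixture can be constructed with the MixtureModel type from Distributions.jl:

using Distributions

components = [Normal(-4.0, 1.5), Normal(1.0, 1.0), Normal(5.0, 2.0)]   # N(µᵢ, σᵢ)
weights = [0.3, 0.5, 0.2]                                               # the ρᵢ must sum to 1
mixture = MixtureModel(components, weights)

pdf(mixture, 0.0)   # evaluate the mixture density p(x) at x = 0
rand(mixture, 10)   # draw 10 samples from the mixture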
    pₓ(x) = p_z(g(x)) |g′(x)|     (2.6)
where g is the inverse of f . Multiplying the original density by the absolute value
of the derivative of g corrects for the stretching or shrinking of the distribution
that occurs when transforming the variable. Figure 2.3 transforms a Gaussian
distribution into a multimodal distribution. Normalizing flows are a class of models
that use this idea to transform simple distributions into complex distributions by
applying a series of invertible transformations.3

3 A comprehensive introduction to normalizing flows is provided in I. Kobyzev, S. J. Prince, and M. A. Brubaker, "Normalizing Flows: An Introduction and Review of Current Methods," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 11, pp. 3964–3979, 2020.

For some problems, we may not be able to produce an analytical form for the
probability density function of the distribution. However, we can still generate
samples from the distribution by applying transformations to samples from a
pseudorandom number generator.4 We refer to models represented in this way as
generative models. Generative adversarial networks (GANs) are an example of a
generative model that learns to generate samples from complex distributions by
transforming samples from a simple distribution into samples that resemble the
complex distribution.5

4 Pseudorandom number sequences, such as those produced by a sequence of calls to rand, are deterministic given a particular seed but appear random.
5 I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Nets," Advances in Neural Information Processing Systems (NeurIPS), vol. 27, 2014.
over multiple variables is called a multivariate distribution. Joint distributions represent
the likelihood of multiple outcomes occurring simultaneously. For example,
the joint distribution over two discrete variables X and Y is represented by the
probability mass function P(x, y), which outputs the probability that both X = x
and Y = y.
We use different strategies to represent joint distributions depending on whether
the variables are discrete or continuous. For discrete variables, we can represent
the joint distribution as a table such as the one shown in table 2.1. The table assigns
a probability to each possible combination of outcomes. These probabilities
represent the parameters of the distribution.

Table 2.1. Example of a joint distribution involving binary variables X, Y, and Z. This distribution has 8 parameters θ1, ..., θ8 that represent the probabilities of each possible combination of outcomes.

X  Y  Z  P(x, y, z)
0  0  0  0.08
0  0  1  0.31
0  1  0  0.09
0  1  1  0.37
1  0  0  0.01
1  0  1  0.05
1  1  0  0.02
1  1  1  0.07
We often want to represent joint distributions over many variables with many ble combination of outcomes.
possible outcomes, which can require a large number of parameters. If we make
additional assumptions about the structure of the joint distribution such as inde-
pendence between variables, we can use other representations such as decision
trees or Bayesian networks to reduce the number of parameters required to rep-
resent the distribution.6 We can represent continuous joint distributions using
multivariable functions. For example, a common distribution used to model uncertainty
in multiple continuous variables is the multivariate Gaussian distribution
(example 2.3).

6 For more details on representing complex probability distributions, see chapter 2 of M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022. A comprehensive overview is provided by D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

2.2.3 Conditional Distributions

A conditional distribution is a distribution over a variable given the value of one or
more other variables. We write the conditional distribution of Y given X as

    P(y | x) = P(y, x) / P(x)     (2.8)
where P(y | x ) is read as ‘‘probability of y given x’’ and represents the probability
that the variable Y takes on the value y given that the variable X takes on the value
x. The agent, environment, and observation models introduced in section 1.4.1
are all conditional distributions. For example, the transition model T(s′ | s, a)
is a conditional distribution over the next state s′ given the current state s and
action a.
Example 2.3. The multivariate Gaussian distribution, a common multivariate distribution used to model uncertainty in multiple continuous variables.

The multivariate Gaussian distribution extends the Gaussian distribution
over n variables using the following probability density function:

    N(x | µ, Σ) = (1 / ((2π)^(n/2) |Σ|^(1/2))) exp(−(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ))     (2.7)
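For illustration, a multivariate Gaussian with assumed example parameters can be constructed and evaluated using Distributions.jl:

using Distributions

μ = [0.0, 0.0]
Σ = [1.0 0.5;
     0.5 2.0]
dist = MvNormal(μ, Σ)

pdf(dist, [0.5, -0.5])   # density at a point
rand(dist, 5)            # five samples, one per column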
Example 2.4. The conditional Gaussian distribution. The plot below shows the probability density for the conditional Gaussian model p(y | x) = N(x, 10²).

A common model class used to model the uncertainty in one continuous variable
conditioned on the value of another continuous variable is the conditional
Gaussian distribution. Specifically, we represent the conditional distribution
pθ(y | x) as a Gaussian distribution with a mean that depends on the value
of x:

    pθ(y | x) = N(y | fθ′(x), σ²)

where fθ′ is a function of x with parameters θ′, and the full set of parameters
for the model is θ = [θ′, σ²]. We often select fθ′ based on domain knowledge
of the physical laws that govern the system. For example, if we know that a
sensor produces noisy measurements of the true state, we may set fθ′(x) = x
so that the measurements will be centered around the true state. The figure in
the caption shows an example of a conditional Gaussian distribution where
the mean of the distribution is determined by the function fθ′(x) = x and
the variance is 10². Brighter colors indicate higher probability density.
Maximizing the log-likelihood forms the basis of many common objective func-
tions used in machine learning. For example, maximizing the log-likelihood of
the parameters of a conditional Gaussian distribution leads to the least-squares
objective function, which is commonly used in regression problems (example 2.5).
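As a brief sketch of why this is the case (the full derivation is given in example 2.5), consider a conditional Gaussian model pθ(y | x) = N(y | fθ′(x), σ²) and a dataset of m pairs (xᵢ, yᵢ). The log-likelihood is

    ℓ(θ) = ∑ᵢ₌₁ᵐ log N(yᵢ | fθ′(xᵢ), σ²) = −(m/2) log(2πσ²) − (1/(2σ²)) ∑ᵢ₌₁ᵐ (yᵢ − fθ′(xᵢ))²

so for any fixed σ², maximizing ℓ with respect to θ′ is equivalent to minimizing the sum of squared errors ∑ᵢ (yᵢ − fθ′(xᵢ))².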
Algorithm 2.1 provides a general algorithm for maximum likelihood param-
eter learning. We can apply several optimization algorithms to maximize the
log-likelihood (see section 4.6). Example 2.6 uses algorithm 2.1 to learn the pa-
rameters of a conditional Gaussian observation model for the inverted pendulum
The result minimizes the sum of the squared errors between the model
outputs and the true outputs and is often referred to as the least-squares
objective function. This result also extends to the multivariate case.
Example 2.6. Learning a conditional Gaussian observation model for the inverted pendulum system. The plot shows the learned observation model with brighter colors indicating higher probability density. The data points are plotted on top of the observation model in pink.

Suppose we have a dataset of states and observations for the inverted pendulum
system (shown in the caption). For simplicity, we will assume in this
example that the pendulum state only consists of its current angle. We can
model the observation model O(o | s) as a conditional Gaussian distribution
with a mean that depends on the state of the pendulum such that

    O(o | s) = N(o | fθ′(s), σ²)     (2.12)

We will further assume that the observation is a linear function of the state
such that fθ′(s) = θ1 s + θ2. We can learn the parameters θ = [θ1, θ2, σ²]
using algorithm 2.1 with the following code:

using Optim
likelihood(x, θ) = Normal(θ[1] * x + θ[2], exp(θ[3]))
optimizer(f) = minimizer(optimize(f, zeros(3), Optim.GradientDescent()))
alg = MaximumLikelihoodParameterEstimation(likelihood, optimizer)
θ = fit(alg, data)

The code uses the Optim.jl package to perform gradient descent optimization
to learn the parameters θ starting with an initial guess of θ = [0, 0, 0].
The optimized parameters result in the following model:
These results indicate that the observation model is centered around the true
state of the pendulum with a small amount of noise. As noted in example 2.5,
determining θ1 and θ2 in this way is equivalent to optimizing the least squares
objective. The figure in the caption shows the learned observation model
behind the samples. Brighter colors indicate higher probability density.
system. Depending on the model class and optimization algorithm, algorithm 2.1
may not find the global minimum. However, for many common model classes,
we can perform this optimization analytically instead. Example 2.7 derives an
analytical solution for the maximum likelihood estimate of the parameters of a
discrete distribution, while examples 2.8 and 2.9 derive analytical solutions for
the parameters of Gaussian and conditional Gaussian distributions.
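For readers who want a self-contained picture of what such a procedure involves, the following is a hedged sketch of maximum likelihood parameter learning in the spirit of algorithm 2.1 (whose exact implementation is not reproduced here). It assumes likelihood(x, θ) returns a distribution over the output given input x, and data is a collection of (x, y) pairs.

using Distributions, Optim

function max_likelihood_fit(likelihood, data, θ0)
    nll(θ) = -sum(logpdf(likelihood(x, θ), y) for (x, y) in data)   # negative log-likelihood
    result = optimize(nll, θ0, GradientDescent())                   # any Optim.jl method could be used here
    return Optim.minimizer(result)
end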
Example 2.7. Maximum likelihood parameter learning for a binary variable and a variable with k possible values.

Suppose we have a binary variable X that takes on the value 1 with probability
θ and the value 0 with probability 1 − θ. The probability of a sequence of m
samples with n occurrences of 1 is

    P(D | θ) = θⁿ (1 − θ)^(m − n)

The log-likelihood is ℓ(θ) = n log θ + (m − n) log(1 − θ). Setting its derivative
with respect to θ to zero gives

    ∂ℓ(θ)/∂θ = n/θ − (m − n)/(1 − θ) = 0

Solving for θ results in the maximum likelihood estimate θ̂ = n/m. Computing
the maximum likelihood estimate for a variable X that can assume k values
results in a similar formula. The maximum likelihood estimate for P(xi | n1:k)
is given by

    θ̂i = ni / ∑ⱼ₌₁ᵏ nⱼ

where n1:k are the observed counts for the k different values.
Algorithm 2.1 requires that we have all of the data required to learn the pa-
rameters. In practice, we may have missing data. For example, when we train a
Gaussian mixture model, we may not know which component of the mixture
generated each data point. Furthermore, when we learn the transition model and
observation models, we may only have access to the observations and actions
    ∂ℓ(µ, σ²)/∂µ = ∑ᵢ (oᵢ − µ̂) / σ̂² = 0     (2.14)

    ∂ℓ(µ, σ²)/∂σ = −m/σ̂ + ∑ᵢ (oᵢ − µ̂)² / σ̂³ = 0     (2.15)

After some algebraic manipulation, we get

    µ̂ = (1/m) ∑ᵢ oᵢ     σ̂² = (1/m) ∑ᵢ (oᵢ − µ̂)²     (2.16)
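These closed-form estimates are straightforward to check numerically; the snippet below uses synthetic observations (an assumed Normal(2, 3) source, purely for illustration) and compares against fit_mle from Distributions.jl.

using Distributions

o = rand(Normal(2.0, 3.0), 1000)     # illustrative observations
m = length(o)
μhat = sum(o) / m                    # maximum likelihood mean
σ2hat = sum((o .- μhat).^2) / m      # maximum likelihood variance (1/m, not 1/(m − 1))

fit_mle(Normal, o)                   # returns Normal(μhat, sqrt(σ2hat))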
Example 2.9. Analytical solution for finding the parameters of a linear Gaussian model using maximum likelihood parameter learning.

Consider a linear Gaussian model p(y | x) = N(y | Ax + b, Σ). Suppose we
want to find the maximum likelihood estimate of A and b given a dataset
of m observations. As shown in example 2.5, this process is equivalent to
minimizing the sum of the squared errors between the model outputs and
the true outputs:

    arg min_{A,b} ∑ᵢ₌₁ᵐ ‖Axᵢ + b − yᵢ‖²

Let the matrix X be defined such that each row is a data point xᵢ augmented
with a one in the final column, and let Y be a matrix such that each row is a
data point yᵢ. If we let θ = [A b]ᵀ, we can rewrite the optimization problem
as

    arg min_θ ‖Xθ − Y‖²

Setting the gradient of the objective function with respect to θ to zero and
solving for θ results in the following closed-form solution:

    θ = pinv(X) Y

where the pinv function computes the pseudoinverse of the matrix X. The
result is θ1 = 1.02 and θ2 = 0.00, which matches the result from example 2.6.
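A small numerical sketch of this closed-form solution, with synthetic scalar data standing in for the pendulum dataset, is shown below.

using LinearAlgebra

x = collect(range(-0.2, 0.2, length=100))
y = 1.02 .* x .+ 0.003 .* randn(length(x))   # synthetic observations for illustration

X = [x ones(length(x))]                      # each row is [xᵢ 1]
θ = pinv(X) * y                              # θ[1] ≈ slope, θ[2] ≈ intercept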
taken by the agent and not the true state of the environment. In these cases, we
can use the expectation-maximization (EM) algorithm to learn the parameters of
the model, which involves iterative improvement of the parameter estimate.12

12 An overview of the EM algorithm is provided in section 4.4 of M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022.

2.3.2 Bayesian Parameter Learning

In Bayesian parameter learning, we estimate a distribution over model parameters
given the data. We write this distribution as P(θ | D).13 This distribution can
help us quantify our uncertainty about the true value of θ. We can convert this
distribution into a point estimate by computing the expectation:

    θ̂ = E_{θ∼P(·|D)}[θ] = ∑_θ θ P(θ | D)     (2.17)

Another option is the maximum a posteriori (MAP) estimate:

    θ̂ = arg max_θ P(θ | D)     (2.18)

This estimate corresponds to a value of θ that is assigned the greatest density.
This is often referred to as the mode of the distribution. As shown in figure 2.5,
the mode may not be unique.

13 If θ is continuous, the distribution is represented by a probability density p(θ | D) instead of a probability mass. In this case, the summations in equations (2.17) and (2.19) change to integrals.

Figure 2.5. An example of a distribution where the expected value of θ is not a good estimate. The expected value of 0.5 has a lower density than occurs at the extreme values of 0 or 1.

We can derive an expression for P(θ | D) in terms of the likelihood model
introduced in section 2.3.1 using Bayes' rule:14

    P(θ | D) = P(D | θ) P(θ) / ∑_θ P(D | θ) P(θ)     (2.19)

14 Bayes' rule can be derived from the definition of conditional probability and is named for the English statistician and Presbyterian minister Thomas Bayes (c. 1701–1761) who provided a formulation of this theorem. A history is provided by S. B. McGrayne, The Theory That Would Not Die. Yale University Press, 2011.

In addition to the likelihood model P(D | θ), we need to specify a prior distribution
P(θ) over the parameters. The prior distribution encodes our beliefs about the
values of the parameters before observing the data. The output of equation (2.19)
is often referred to as the posterior distribution.
In general, computing the posterior distribution using equation (2.19) is challenging
because the denominator is often difficult or impossible to compute
analytically. The number of terms in the summation scales exponentially with
the number of parameters, and for continuous parameters, the integral is often
intractable. For some model classes and priors, however, an analytical solution is
possible. Example 2.10 shows an example of Bayesian parameter learning for a
simple model class using a conjugate prior. A conjugate prior is a prior distribution
2.3.3 Generalization
An important metric to consider when selecting model parameters is generaliza-
tion performance. The generalization performance of a model is a measure of its
performance over the distribution over its full input space, including points that
were not used to train the model. We measure generalization performance with
respect to a performance metric. A common performance metric is the average
log-likelihood that the model assigns to points in a dataset sampled from the
distribution over the input space. We want to select the parameters with the
best generalization performance. This section discusses techniques for estimating
generalization performance.
It may be tempting to estimate the generalization performance by computing
the performance metric on the training data. However, performing well on the
training data does not necessarily indicate good generalization performance.
Complex models may perform well on the training set, but they may not provide
good predictions at other points in the input space. This concept is often referred
Example 2.10. Bayesian parameter learning for a binary variable using the Beta distribution as a prior. The plot shows the Beta distribution with different datasets (the prior and the posteriors for m = 3, n = 2; m = 10, n = 7; m = 20, n = 15; and m = 40, n = 30).

Suppose we have a binary variable X that takes on the value 1 with probability
θ and the value 0 with probability 1 − θ. The likelihood of observing a
sequence of m samples with n occurrences of 1 is given by

    P(D | θ) = θⁿ (1 − θ)^(m − n)

If we use a Beta distribution with parameters α and β as a prior p(θ), the
posterior is also a Beta distribution:

    p(θ | D) = Beta(θ | α + n, β + m − n)
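The conjugate update above is easy to express with Distributions.jl; the prior parameters below (a uniform Beta(1, 1)) and the counts m = 10, n = 7 are taken as an illustrative case.

using Distributions

α, β = 1.0, 1.0                      # Beta prior parameters
m, n = 10, 7                         # number of samples and number of 1s
posterior = Beta(α + n, β + m - n)   # Beta(θ | α + n, β + m − n)

mean(posterior)                      # posterior mean of θ
mode(posterior)                      # posterior mode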
Example 2.11. Implementation of Bayesian parameter learning for the inverted pendulum observation model. The results are shown in figure 2.6. A detailed overview of the NUTS algorithm is provided by M. D. Hoffman, A. Gelman, et al., "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo," Journal of Machine Learning Research (JMLR), vol. 15, no. 1, pp. 1593–1623, 2014.

Consider the linear Gaussian observation model for the inverted pendulum
introduced in example 2.6, and suppose we want to use Bayesian parameter
learning to sample a distribution over the parameters θ = [θ1, θ2, σ²]. We
can use the following code to generate m = 1000 samples from the posterior
distribution over θ using algorithm 2.2:

likelihood(x, θ) = Normal(θ[1] * x + θ[2], exp(θ[3]))
prior = MvNormal(zeros(3), 4I)
alg = BayesianParameterEstimation(likelihood, prior, NUTS(), 1000)
θ = fit(alg, data)

The code uses NUTS, the No-U-Turn Sampler, from the Turing.jl package
to generate samples from the posterior distribution. Figure 2.6 shows the
results for different dataset sizes.
Figure 2.6. Learning the parameters of a linear Gaussian observation model for the inverted pendulum system given different amounts of data (20, 100, and 500 samples) from the dataset in example 2.6. (The panels show results over y, θ2, and the posterior density p(σ | D).)
Figure 2.7. Example where a complex model (black line) fits the training data (black) perfectly but does not generalize well to other data points (blue). A simpler linear model (blue line) provides the best fit when considering all data points.

A simple approach to estimating the generalization performance using an
unseen dataset is the holdout method, which partitions the available data into a
test set and a training set. We use the training set to learn the model parameters
and the test set for evaluation. Depending on the size and nature of the dataset,
we may use different ratios to split the training and test data ranging from 50 %
train and 50 % test to 99 % train and 1 % test. Using too few samples for training
can result in poor fits, whereas using too many will result in poor generalization
estimates.
Using a train-test partition can be wasteful because our model tuning can take
advantage only of a segment of our data. We can often obtain better results using
k-fold cross validation.17 To perform this technique, we randomly partition the data
into k segments of approximately equal size. We then train k models, one on each
subset of k − 1 sets, and we use the withheld set to estimate the generalization
performance. The cross-validation estimate of generalization performance is the
mean generalization performance over all folds.18

17 This method is also known as rotation estimation.
18 Another common approach related to cross-validation is the bootstrap method, which involves resampling the dataset with replacement to estimate the generalization performance. B. Efron, "Bootstrap Methods: Another Look at the Jackknife," in Breakthroughs in Statistics: Methodology and Distribution, Springer, 1992, pp. 569–593.
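A hedged sketch of k-fold cross validation is given below. The functions train and metric are placeholders for a problem-specific fitting routine and performance metric (for example, average log-likelihood on the withheld data).

using Random, Statistics

function kfold_estimate(data, k, train, metric)
    idx = shuffle(eachindex(data))               # random partition of the data indices
    folds = [idx[i:k:end] for i in 1:k]          # k segments of approximately equal size
    scores = Float64[]
    for fold in folds
        model = train(data[setdiff(idx, fold)])  # train on the other k − 1 segments
        push!(scores, metric(model, data[fold])) # evaluate on the withheld segment
    end
    return mean(scores)                          # mean performance over all folds
end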
2.4 Agent Models

For some systems, the environment may contain other agents that we need to
incorporate into our environment model. Depending on the available data and
the assumptions we make about the other agents, we can use different techniques
to model their behavior. This section discusses three categories of techniques for
modeling other agents.
can use the techniques from section 2.3.1 to learn the parameters of the policy.
Figure 2.8 shows an example of behavioral cloning for a grid world agent.
If the model class used for behavioral cloning is not expressive enough or
the dataset contains errors, the learned policy may not be optimal. This result
can lead to cascading errors, which occur when small errors compound during a
rollout and eventually lead to states that are poorly represented in the training
data. The policy of a cloned agent may not generalize well to these states, causing
inaccurate behavior. One way to address the problem of cascading errors is to
correct the learned policy with additional data. Sequential interactive demonstration
methods such as data set aggregation (DAgger)19 alternate between collecting new
data in states reached by the trained policy and using this data to improve the
policy.

Figure 2.9. Cascading errors in behavioral cloning for a grid world agent. The clone is trained on a trajectory that does not reach any states above the goal state. While the original agent (blue) is able to correctly turn back toward the goal in this region, the clone (purple) continues to move away from the goal.

19 S. Ross, G. J. Gordon, and J. A. Bagnell, "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning," in International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 15, 2011.

Another common technique for imitation learning is inverse reinforcement learning.
In inverse reinforcement learning, we assume that the expert agent is optimizing
an unknown reward function when selecting its actions, and our goal
is to determine this reward function given a dataset of trajectory rollouts from
the expert agent. Common techniques for inverse reinforcement learning select a
parametric function form for the reward function and learn the parameters of the
reward function according to a particular objective.
One common objective for learning reward function parameters involves maximizing
the margin between the reward of the expert agent and the reward of
other agents. This technique is known as maximum margin inverse reinforcement
learning.20 Another common objective is to maximize the entropy of the distribution
over trajectories produced by the learned policy, which is known as maximum
entropy inverse reinforcement learning.21 Once we have learned the reward function,
we can use it to model the policy of the agent.22

20 P. Abbeel and A. Y. Ng, "Apprenticeship Learning via Inverse Reinforcement Learning," in International Conference on Machine Learning (ICML), 2004.
21 B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, "Maximum Entropy Inverse Reinforcement Learning," in AAAI Conference on Artificial Intelligence (AAAI), 2008.
22 An overview of methods for learning policies from reward functions is provided by M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022.

2.4.2 Behavior Models

If we have a model of the utility of an agent, we can use a behavior model to
predict its actions. In particular, suppose we have a utility function U(s, a) that
assigns a utility to each state-action pair. We can model the behavior as a policy
that selects actions to maximize the utility of the agent. This policy is given by
We can model this interaction between agents using interaction models.26 For example, a hierarchical interaction model specifies the depth of rationality of an agent by a level of k ≥ 0. A level 0 agent selects its action without regard to the actions of other agents. A level 1 agent selects its action by assuming that all other agents are level 0 agents. In general, a level k agent selects its action by assuming that all other agents are level k − 1 agents. Figure 2.11 shows an example of this model for an aircraft collision avoidance scenario.

Another common behavior model is the hierarchical softmax model, which accounts for the fact that agents may have different levels of rationality.27 A level 0 agent selects actions uniformly at random. A level 1 agent selects actions according to the softmax response policy with a precision parameter λ that assumes that all other agents are level 0 agents. A level k agent selects actions according to a softmax model of the other players playing level k − 1. We can learn the k and λ parameters from data using maximum likelihood estimation.

26. The topic of interaction models is closely related to the field of game theory. Several introductory books include D. Fudenberg and J. Tirole, Game Theory. MIT Press, 1991, and Y. Shoham and K. Leyton-Brown, Multiagent Systems: Algorithmic, Game Theoretic, and Logical Foundations. Cambridge University Press, 2009.
27. This approach is sometimes called quantal-level-k or logit-level-k. D. O. Stahl and P. W. Wilson, "Experimental Evidence on Players' Models of Other Players," Journal of Economic Behavior & Organization, vol. 25, no. 3, pp. 309–327, 1994.
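To make the softmax response concrete, here is a minimal sketch (ours, not the book's implementation), assuming a hypothetical utility function U(s, a), a finite action set 𝒜, and a precision parameter λ:

# Softmax (quantal) response policy: returns a probability for each action in 𝒜.
# Larger λ concentrates probability on higher-utility actions.
function softmax_response(U, s, 𝒜, λ)
    u = [U(s, a) for a in 𝒜]
    w = exp.(λ .* (u .- maximum(u)))   # subtract the max for numerical stability
    return w ./ sum(w)
end

# Hypothetical toy usage: actions closer to the current state get higher utility
U(s, a) = -abs(s - a)
softmax_response(U, 0.0, [-1.0, 0.0, 1.0], 2.0)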
2.5 Model Validation

Since the validity of any downstream analysis of a system depends on the accuracy of the models we use, it is important to rigorously validate our models. We can use a variety of features to validate a model. Given a dataset, we can compare characteristics of the model distribution to the empirical distribution of the data. We can also compute features by comparing rollouts of the model to rollouts of the true system. For example, given a model of aircraft collision avoidance behavior, we can compare the average miss distance of the aircraft when using the model to the average miss distance of the true system trajectories. This section discusses common model validation techniques that compare features of the model to the true system.
Often, both the model and the true distribution are represented as a set of samples. For example, we may want to compare the distribution over airspeed from rollouts of an aircraft encounter model to the distribution over airspeed from trajectories of true aircraft encounters.28

28. M. J. Kochenderfer, M. W. M. Edwards, L. P. Espindle, J. K. Kuchar, and J. D. Griffith, "Airspace Encounter Models for Estimating Collision Risk," AIAA Journal on Guidance, Control, and Dynamics, vol. 33, no. 2, pp. 487–499, 2010.

One way to compare two feature distributions is to compare their probability density functions. We can plot the probability density function of the model on the same plot as the probability density function of the data. For distributions that are represented as a set of samples, we can plot an approximate probability density by creating a histogram of the samples. We may also compare the cumulative distribution functions of a variable X, P(X ≤ x), for the model and data. If we do not have an analytical model of the cumulative distribution function, we can plot the empirical cumulative distribution function of the samples. Figure 2.12 compares the empirical cumulative distribution functions of two sets of samples.

Figure 2.12. Comparison of the empirical cumulative distribution function for two sets of samples represented by the blue and purple dots. The function represents the fraction of samples below each value of x.

Another common visual diagnostic is the quantile-quantile plot (Q-Q plot). The α-quantile of a distribution is the value q for which

P(X ≤ q) = α   (2.22)

A Q-Q plot compares the quantiles of the model distribution to the quantiles of the data. The horizontal axis of a Q-Q plot represents the quantiles of the model distribution, while the vertical axis represents the quantiles of the data. If the model distribution matches the data, the points in the Q-Q plot will lie on the line that passes through the origin with a slope of 1.

We can also compare distributions using calibration plots. The horizontal axis of a calibration plot corresponds to values of α between 0 and 1, while the vertical axis corresponds to the fraction of data points that lie below the α-quantile of the model. Similar to the Q-Q plot, a well-calibrated model will produce a calibration plot that lies on the line that passes through the origin with a slope of 1. Figure 2.13 shows examples of probability density, cumulative distribution, Q-Q, and calibration plots for a set of samples and four different analytical models.
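The quantities behind these diagnostics are straightforward to compute from samples. The sketch below is our own illustration using only the Statistics standard library; the sample sets are synthetic stand-ins:

using Statistics

model_samples = randn(1000)           # samples from the model
data_samples  = randn(1000) .+ 0.1    # samples from the true system

# Empirical CDF: fraction of samples at or below x
ecdf(samples, x) = count(≤(x), samples) / length(samples)

# Q-Q points: α-quantiles of the model plotted against α-quantiles of the data
αs = 0.05:0.05:0.95
qq_points = [(quantile(model_samples, α), quantile(data_samples, α)) for α in αs]

# Calibration points: fraction of data samples below each model α-quantile
calibration = [(α, ecdf(data_samples, quantile(model_samples, α))) for α in αs]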
Figure 2.13. Probability density p(x), cumulative distribution P(X ≤ x), Q-Q (qmodel versus qtrue), and calibration (αmodel versus αtrue) plots for a set of samples and four different analytical models.
Comparing individual feature distributions can produce misleading results (figure 2.15). The model distribution does not match the data points in two-dimensional space, but the individual feature distributions match the data.

Figure 2.15. Example in which performing visual diagnostics on individual features can produce misleading results. The model distribution (blue contours indicating points of equal density) does not match the data points (gray) in two-dimensional space. However, the individual feature distributions of the model match the individual feature distributions of the data.

Several techniques allow us to check whether a model accurately captures the relationships between features of the true system. One technique involves creating a single feature that captures the relationships between the features of the true system. We can then use the techniques discussed in sections 2.5.1 and 2.5.2 to compare the single feature of the model to the single feature of the true system. Figure 2.16 shows an example of creating a single feature that models the relationship between the features in figure 2.15. One drawback of this approach is that it requires domain knowledge to create the single feature.

Another way to compare multiple features is to extend the metrics discussed in section 2.5.2 to multivariate distributions. Many of the comparison metrics for probability density functions have straightforward extensions. The K-L divergence, for example, can be extended to multivariate distributions by using the probability density of the joint distribution of the model and data for the variables of interest (figure 2.17). In contrast, the visual diagnostics and metrics that use the cumulative distribution or quantile functions are less straightforward to extend because the quantile function is not defined in multiple dimensions. Therefore, these metrics require extensions of the quantile function to higher dimensions.35

35. Multiple definitions of the quantile function in higher dimensions have been proposed. P. Chaudhuri, "On a Geometric Notion of Quantiles for Multivariate Data," Journal of the American Statistical Association, vol. 91, no. 434, pp. 862–872, 1996.

Figure 2.17. Comparison of the features of two possible models to a set of data sampled from the true system using K-L divergence. If we calculate the K-L divergence of each feature separately, the models appear to match the data equally well. However, if we calculate the K-L divergence of the joint distribution of the features, we see that the pink model matches the data better than the blue model.

We may also want to compare the distribution over a feature conditioned on the value of another feature. For example, we may want to compare the distribution over sensor observations conditioned on the true state. One way to make this comparison is to partition the conditioning variable into a set of bins and then compare the distribution over the feature in each bin using the metrics described in this section. Figure 2.18 shows an example of this technique. It is also possible to create a single calibration plot for the conditional distribution by checking, for each x-value, how often the corresponding y-value falls below the α-quantile of the model distribution.
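As an illustration of the joint versus per-feature comparison, the following sketch (ours) estimates K-L divergence between binned empirical distributions; the binning scheme and helper names are our own:

# K-L divergence between two (possibly multidimensional) arrays of bin counts.
function kl_divergence(p_counts, q_counts; ϵ=1e-10)
    p = p_counts ./ sum(p_counts)
    q = q_counts ./ sum(q_counts)
    return sum(pᵢ * log((pᵢ + ϵ) / (qᵢ + ϵ)) for (pᵢ, qᵢ) in zip(p, q))
end

# Counts of a single feature over fixed bin edges
histcounts(x, edges) = [count(v -> edges[i] ≤ v < edges[i+1], x) for i in 1:length(edges)-1]

# Joint counts of two features over the same edges
jointcounts(x, y, edges) =
    [count(k -> edges[i] ≤ x[k] < edges[i+1] && edges[j] ≤ y[k] < edges[j+1], eachindex(x))
     for i in 1:length(edges)-1, j in 1:length(edges)-1]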
2.5.4 Subjective Evaluation

We can also evaluate models based on expert knowledge. One evaluation metric is the ability of an expert to distinguish between samples produced by the model and samples produced by the true system.36 A model represents the true system well if an expert cannot distinguish between the generated samples and the true samples. This idea is similar to the Turing test, which was proposed as a way to test whether a machine has human intelligence.37

36. R. Bhattacharyya, S. Jung, L. Kruse, R. Senanayake, and M. J. Kochenderfer, "A Hybrid Rule-Based and Data-Driven Approach to Driver Modeling Through Particle Filtering," IEEE Transactions on Intelligent Transportation Systems, no. 2108.12820, 2021.
37. This test was first proposed by English mathematician and computer scientist Alan Turing (1912–1954) in a 1950 essay. He originally called the test the imitation game, in which a human judge interacts with a machine and a human and must determine which is which. A. M. Turing, "Computing Machinery and Intelligence," Mind, vol. 59, pp. 433–460, 1950.
Figure 2.19. Subjective evaluation test for a model of a grid world agent. Each pair of rollouts is marked in green if the expert correctly identified the true system and in red if they incorrectly identified the model. The top row shows an example where the model does not represent the true system well, and the expert is able to determine the true model in each pair (user accuracy: 100 %, test failed). The bottom row shows an example where the model represents the true system well, and the expert is unable to distinguish between the true system and the model (user accuracy: 50 %, test passed).
We can evaluate this metric by showing an expert pairs consisting of one set
of rollouts from the true system and one set of rollouts from the model. We can
then ask the expert to identify which set of rollouts was produced by the true
system. We can quantify the performance of the model by measuring the expert’s
accuracy in distinguishing between the two sets. If the expert’s accuracy is around
50 %, their performance is no better than random guessing, and we can conclude
that the model is a good representation of the true system. Figure 2.19 shows an
example of this test for a model of a grid world agent.
Finally, we can perform a sensitivity analysis of the learned model parameters to assess how much confidence we should have in proceeding with the learned parameters. Sensitivity will be discussed in more detail in section 11.3.1. A notional example is illustrated in figure 2.20.

Figure 2.20. Sensitivity analysis of a model parameter θ with respect to a downstream quantity of interest f(θ). If our learned parameter is θ1, we may be more confident in our downstream quantity compared to θ2, where there is much greater sensitivity. A small perturbation to θ1 is unlikely to change f(θ), but a small perturbation to θ2 might.
2.6 Summary

• To accurately model a system, we need to build models of the agent, environment, and sensor.

• The general process for creating a model involves selecting a model class, learning the parameters of the model, and validating the model.

• Probability distributions are a common type of model class that assigns probabilities to different outcomes.

• We can learn the parameters of a model from data using maximum likelihood estimation or Bayesian estimation.
3 Property Specification
Using the grid world specification, we could define a metric that measures the
distance between the agent and the goal or obstacle.
We use metrics or specifications to evaluate individual trajectories, sets of trajectories, or probability distributions over trajectories. The miss distance between two aircraft can be used to measure the performance of an aircraft collision avoidance system in a single encounter scenario (figure 3.1), and the net return can be used to measure the performance of one outcome of a financial trading strategy over time. We can also create metrics or specifications that operate over a set of trajectories. For example, we can compute the average miss distance or net gain over a set of possible trajectories or specify a threshold on the number of trajectories that result in a collision. The remainder of this chapter discusses techniques to formally express metrics and specifications.
The expected value of a metric f(τ) is E_{τ∼p}[f(τ)], where p(τ) is the probability distribution over trajectories. While it is not always possible to evaluate the expected value analytically, we can estimate it using a set of sample trajectories drawn from p(τ).
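For example, a simple Monte Carlo estimate might look like the following sketch (ours), where sample_trajectory stands in for a rollout of the system and f is the metric of interest:

using Statistics

# Monte Carlo estimate of E[f(τ)] from m trajectories sampled from p(τ).
estimate_expected_value(sample_trajectory, f; m=1000) =
    mean(f(sample_trajectory()) for _ in 1:m)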
One common metric associated with a specification is the probability that a randomly sampled trajectory will satisfy the specification. We could also derive a high-level specification from this probability by requiring that the probability of satisfying the specification is greater than a certain threshold.1

1. H. Hansson and B. Jonsson, "A Logic for Reasoning about Time and Reliability," Formal Aspects of Computing, vol. 6, pp. 512–535, 1994.

3.2.2 Variance

Another common summary metric is the variance, which measures the spread of the distribution. The variance of a metric f(τ) is defined as

Var[f(τ)] = E[(f(τ) − E[f(τ)])²]

Intuitively, the variance measures how much the metric f(τ) deviates from its expected value. A low variance indicates that the metric tends to be consistent across different trajectories, while a high variance indicates that the metric varies significantly. It is important to consider both the expected value and variance of a metric when evaluating system performance (figure 3.3).
Figure 3.4. Effect of α on VaR and CVaR, plotted with the expected value for α = 0.9, 0.7, 0.5, 0.3, and 0.1. Higher values for α correspond to more conservative risk estimates.
3.3 Composite Metrics

In many real-world settings, we must select one of several system designs or strategies for final deployment, and metrics allow us to make an informed decision. For example, we might compare the performance of two aircraft collision avoidance systems by computing the probability of collision over a set of aircraft encounters for each system. In these cases, we are often concerned with multiple metrics. For example, an aircraft collision avoidance system should minimize collisions while issuing a small number of alerts to pilots, and a financial trading strategy may aim to maximize return while minimizing risk.

Figure 3.5. Tradeoff between the alert rate and collision rate for an aircraft collision avoidance system. Each point represents a different system design.

It is often the case that multiple metrics describing system performance are at odds with one another, and some system designs may perform well on one metric but poorly on another. For instance, an aircraft collision avoidance system that
Example 3.1. VaR and CVaR for the loss of separation metric for an aircraft collision avoidance system. Suppose a desired separation for the aircraft in an aircraft collision avoidance environment is 2,000 m. We can define a risk metric f(τ) to summarize the loss of separation as 2,000 m minus the miss distance. A higher loss of separation indicates higher risk. The plots below show the expected value, VaR, and CVaR for the loss of separation metric for three different distributions over outcomes. Although all three distributions have the same expected value, the VaR and CVaR decrease as we move from left to right. The distribution with the lowest VaR and CVaR is the least risky because it has better worst-case outcomes.
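A common sample-based computation treats VaRα as the α-quantile of the loss samples and CVaRα as the mean loss at or beyond that quantile. The sketch below is ours; the loss samples are synthetic stand-ins for the loss-of-separation metric:

using Statistics

function var_cvar(losses, α)
    v = quantile(losses, α)               # value at risk: the α-quantile of the losses
    return v, mean(filter(≥(v), losses))  # conditional value at risk: mean of the tail
end

losses = 2000 .- 2500 .* rand(10_000)     # hypothetical loss-of-separation samples
var_cvar(losses, 0.9)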
minimizes the number of collisions may also increase the number of alerts issued to pilots, while one that minimizes alerts may increase the number of collisions (figure 3.5). In such cases, we can combine multiple metrics into a single composite metric that captures the trade-offs between different objectives.

We can compare systems with multiple metrics using the concept of Pareto optimality. A system design is Pareto optimal3 if we cannot improve one metric without worsening another. Given a set of system designs, the Pareto frontier consists of the subset of designs that are Pareto optimal. The Pareto frontier illustrates the trade-offs between metrics. Figure 3.6 shows the Pareto frontier for the aircraft collision avoidance systems shown in figure 3.5. Composite metrics allow system designers to select a single point on the Pareto frontier.

Figure 3.6. Pareto frontier for a set of aircraft collision avoidance system designs. The points that comprise the Pareto frontier are highlighted in blue.

3. Pareto optimality is a topic that was originally explored in the field of economics. It is named after Italian economist Vilfredo Pareto (1848–1923).
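The Pareto frontier of a finite set of designs can be computed by removing dominated points. The following sketch is ours, assuming lower values are better for every metric:

# A design dominates another if it is no worse in every metric and strictly better in one.
dominates(a, b) = all(a .≤ b) && any(a .< b)

# Keep only the designs that no other design dominates.
pareto_frontier(designs) = [d for d in designs if !any(dominates(o, d) for o in designs)]

designs = [[0.1, 0.9], [0.3, 0.4], [0.5, 0.35], [0.6, 0.6], [0.9, 0.1]]
pareto_frontier(designs)   # the dominated design [0.6, 0.6] is removed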
3.3.1 Weighted Metrics

Weighted metrics combine multiple metrics using a vector of weights that reflect the relative importance of each metric. Suppose we have a set of metrics f1(τ), f2(τ), . . ., fn(τ) that we wish to combine into a single metric. The most basic weighted metric is the weighted sum, which is defined as

f(τ) = ∑_{i=1}^{n} w_i f_i(τ) = w⊤f(τ)   (3.4)
Example 3.2. Using the weighted sum composite metric to select an aircraft collision avoidance system design along the Pareto frontier. Suppose we want to create a composite metric for an aircraft collision avoidance system that balances the alert rate and collision rate. Using the weighted sum method, we define the composite metric as the weighted sum of the alert rate and collision rate. Selecting a weight vector then allows us to choose a point on the Pareto frontier. The plots below show the Pareto frontier for two different weight vectors. The first weight vector (w1 = [0.8, 0.2]) gives more weight to minimizing the alert rate, while the second weight vector (w2 = [0.2, 0.8]) gives more weight to minimizing the collision rate. The points are colored according to the value of the composite metric. The weight vector will be perpendicular to the Pareto frontier at the best design point. The weight vector w1 is shown in blue for the first design point and w2 is shown in blue for the second design point. The best design points are highlighted in green.
The weighted exponential sum is a composite metric that combines the weighted sum and goal metrics as follows:

f(τ) = ∑_{i=1}^{n} w_i (f_i(τ) − f_goal)^p   (3.6)

One way to choose the weights is to query a domain expert or stakeholder with pairs of system designs and ask them to select the preferred design. By repeating this process for multiple different pairwise queries of system designs, we can infer the weights that the expert assigns to each metric. If the expert prefers a design with metric values f₁ over a design with metric values f₂, the weight vector must satisfy

w⊤f₁ < w⊤f₂   (3.8)

where we assume that lower values for the composite metric are preferable.7 In effect, the response to the query further constrains the space of possible weight vectors (example 3.3).

7. If higher values are preferable, the inequality in equation (3.8) should be reversed.
Example 3.3. The effect of a preference query on the space of possible weight vectors for the aircraft collision avoidance example. Suppose we want to infer the weights for a composite metric that combines the alert rate and collision rate for an aircraft collision avoidance system. When we query a domain expert or stakeholder with system designs f1 = [0.8, 0.4] and f2 = [0.4, 0.8], we find that the expert prefers f1 to f2. In other words, the expert prefers the system design with the higher alert rate and lower collision rate. Since the weight vector must be consistent with this preference (equation (3.8)), we can further constrain the space of possible weight vectors as shown in the figure below. The purple shaded region in the center plot shows the space of possible weight vectors consistent with the expert's preference. The plot on the right shows the space of possible weight vectors consistent with the expert's preference and the constraint that the weights must sum to 1. We can further refine the space of possible weight vectors by querying the expert with additional pairs of system designs.
By querying the expert with multiple pairs of system designs, we can iteratively refine the space of possible weight vectors (figure 3.9). To minimize the number of times we must query the expert, it is common to select pairs of system designs that maximally reduce the space of possible weights. For example, one method is to select the query that comes closest to bisecting the space of possible weights.8 After querying the expert a desired number of times, we can select a set of weights from the refined weight space to create a composite metric that reflects the expert's preferences. While we could select any value for w that is consistent with the expert's responses, it is common to select the weight vector that maximally separates the system designs that were presented to the expert.

8. This method is known as Q-Eval. V. S. Iyengar, J. Lee, and M. Campbell, "Q-EVAL: Evaluating Multiple Attribute Items Using Queries," in ACM Conference on Electronic Commerce, 2001.
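As a small illustration of this refinement (our own sketch, not the Q-Eval algorithm itself), we can filter a grid of candidate weight vectors down to those consistent with the responses gathered so far, assuming lower composite values are preferred:

# A preference (f₁, f₂) means the expert preferred the design with metrics f₁,
# so a consistent weight vector must satisfy w'f₁ < w'f₂.
consistent(w, prefs) = all(w' * f₁ < w' * f₂ for (f₁, f₂) in prefs)

candidates = [[α, 1 - α] for α in 0:0.01:1]   # weights that sum to 1
prefs = [([0.8, 0.4], [0.4, 0.8])]            # the query response from example 3.3
feasible = filter(w -> consistent(w, prefs), candidates)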
Example 3.4. Constructing a propositional logic formula from a statement. Suppose we wish to express the following statement using propositional logic: "If the agent is in a safe state, then the agent is not in a collision state." Let the variable S represent whether the agent is in a safe state and C represent whether the agent is in a collision state. The propositional logic statement is S → ¬C (read as "S implies not C"). In this statement, S and C are atomic propositions because they cannot be broken down further. The logical formula S → ¬C is itself a proposition that can be combined with other propositions to create more complex formulas.

The most basic logical operators are negation ("not") and conjunction ("and"). All other logical expressions, such as disjunction ("or"), implication ("if-then"), and biconditional ("if and only if"), can be constructed using negation and conjunction. Example 3.4 demonstrates the construction of a propositional logic formula from a statement.
Table 3.1 shows the propositional logic operators and their construction using negation and conjunction. We can describe propositional logic formulas using truth tables, which show the value of the formula as a function of its inputs. Figure 3.10 shows truth tables for each of the basic propositional logic operators. Logical operators can also be illustrated as logic gates (figure 3.11), which are fundamental building blocks for digital circuits.10 Example 3.5 implements the logical operators as functions in Julia.

10. R. Page and R. Gamboa, Essential Logic for Computer Science. MIT Press, 2019.
P      Q      P ∨ Q          P      Q      P ↔ Q
false  false  false          false  false  true
false  true   true           false  true   false
true   false  true           true   false  false
true   true   true           true   true   true

Figure 3.11. Logical operators (¬P, P ∧ Q, P ∨ Q, P → Q, P ↔ Q) represented using logic gates.
Example 3.5. Julia implementations of propositional logic operators. Consider two atomic propositions, P and Q. The basic operations of negation (!), conjunction (&&), and disjunction (||) are already implemented in most programming languages, including Julia. Implication P → Q can be defined as the operator ⟶ given the Boolean values of P and Q:

julia> ⟶(P,Q) = !P || Q # \longrightarrow<TAB>
⟶ (generic function with 1 method)
julia> P = true;
julia> Q = false;
julia> P ⟶ Q
false
Temporal logic extends first-order logic to specify properties over time. It is partic-
ularly useful for specifying properties of dynamical systems because it allows us
to describe how trajectories should evolve. This section outlines three common
types of temporal logic.
Example 3.6. Universal and existential quantifiers for an obstacle avoidance problem. The red region indicates an obstacle, while the green region indicates the goal. Let x be a variable that represents the state of the agent in the grid world problem where we must avoid an obstacle (red), and define the domain X as the set of states that comprise a particular trajectory. We define a predicate function O(x) that evaluates to true if x is an obstacle state and false otherwise. To define a specification ψ1 that states "for all states in the trajectory, the agent does not hit an obstacle," we can use the formula:

ψ1 = ∀x ¬O(x)

Suppose we also want the agent to reach a goal state while avoiding the obstacle. We can create an additional predicate G(x) that evaluates to true if x is a goal state and false otherwise. We then create ψ2 to represent the statement "for all states in the trajectory, the agent does not hit an obstacle and there exists a state in the trajectory in which the agent reaches the goal" using the following formula:

ψ2 = (∀x ¬O(x)) ∧ (∃x G(x))
Example 3.7. LTL formula for an obstacle avoidance problem where a blue checkpoint must be reached before the green goal while avoiding the red obstacle. For a navigation problem, let ψ be the LTL property specification that states "eventually reach the goal after passing through the checkpoint and always avoid the obstacle." First, we define the following predicate functions:

F(st): the state s at time t contains an obstacle
G(st): the state s at time t is the goal
C(st): the state s at time t is the checkpoint

ψ = ♦G(st) ∧ (¬G(st) U C(st)) ∧ □¬F(st)

This formula requires that the agent reaches the goal (♦G(st)) but that the goal is not reached until the checkpoint (¬G(st) U C(st)). Additionally, the agent must always avoid obstacles (□¬F(st)). The figure in the caption shows an example trajectory that satisfies this specification. The following code constructs the LTL specification:

F = @formula sₜ -> sₜ == [5, 5]
G = @formula sₜ -> sₜ == [7, 8]
C = @formula sₜ -> sₜ == [8, 3]
ψ = LTLSpecification(@formula ◊(G) ∧ 𝒰(¬G, C) ∧ □(¬F))
where µ(·) is a real-valued function that operates on the state (example 3.8).
Table 3.3 defines the specifications for the continuum world, inverted pendulum,
and collision avoidance example problems using STL. Algorithm 3.2 provides a
framework for evaluating STL specifications over a trajectory given a time interval.
Example 3.8. Julia implementation of an STL formula. Suppose we want to implement the following STL formula in code: "eventually the signal will be greater than 1." We can use the SignalTemporalLogic.jl package to define the predicate µ and the formula ψ as follows:

julia> using SignalTemporalLogic
julia> τ = [-1.0, -3.2, 2.0, 1.5, 3.0, 0.5, -0.5, -2.0, -4.0, -1.5];
julia> μ = @formula sₜ -> sₜ > 1.0;
julia> ψ = @formula ◊(μ);
julia> ψ(τ) # check if formula is satisfied
true

The formula is satisfied since the signal eventually becomes greater than 1.
Table 3.3. STL specifications for the example problems.

Continuum World: "Reach the goal without hitting the obstacle"
G(st): st is in the goal region
F(st): st is in the obstacle region
ψ = ♦G(st) ∧ □¬F(st)
G = @formula s->norm(s.-[6.5,7.5])≤0.5
F = @formula s->norm(s.-[4.5,4.5])≤0.5
ψ = @formula ◊(G) ∧ □(¬F)

Inverted Pendulum: "Keep the pendulum balanced"
B(st): |θt| ≤ π/4
ψ = □B(st)
B = @formula s->abs(s[1])≤π/4
ψ = @formula □(B)

Aircraft Collision Avoidance
S(st): |ht| ≥ 50
ψ = □[40,41] S(st)
S = @formula s->abs(s[1])≥50
ψ = @formula □(40:41, S)
Example 3.9. Robustness of the formulas ψ1 = ♦µ0 and ψ2 = □µ0 over a signal τ. Let µ0(st) be a predicate function that is true if st is greater than 0. The following code computes the robustness of the formulas ♦µ0 and □µ0 over a signal τ:

julia> using SignalTemporalLogic
julia> τ = [-1.0, -3.2, 2.0, 1.5, 3.0, 0.5, -0.5, -2.0, -4.0, -1.5];
julia> μ = @formula sₜ -> sₜ > 0.0;
julia> ψ₁ = @formula ◊(μ);
julia> ρ₁ = ρ(τ, ψ₁)
3.0
julia> ψ₂ = @formula □(μ);
julia> ρ₂ = ρ(τ, ψ₂)
-4.0

The robustness of the formula ♦µ0 is the maximum difference between the signal and the threshold. We would have to decrease all of our signal values by at least this value to make the formula false. The robustness of the formula □µ0 is the minimum difference between the signal and the threshold. We would have to increase all of our signal values by at least this value to make the formula true. The figure in the caption shows the signal values that determine the robustness for each formula.
operator because the signal must satisfy the property at all time steps in the
interval. Example 3.9 demonstrates this concept.
We can use the robustness metric to assess how close a given system trajectory is to a failure. Furthermore, if we are able to compute the gradient of the robustness metric with respect to certain inputs to the system, we can understand how these inputs affect the overall safety of the system. We will use this idea throughout the book to understand system behavior. For example, we can uncover the failure modes of a system by using the robustness metric to guide the simulator toward a failure trajectory (see chapter 4 for more details).
Taking the gradient of the robustness metric requires that the robustness formula is differentiable over the input space. However, the min and max functions that commonly occur in STL formulas are not differentiable everywhere. To address this challenge, we can use smooth approximations of the min and max functions, such as the softmin and softmax functions, respectively.17 These functions are defined as

softmin(s; w) = ∑_{i=1}^{d} s_i exp(−s_i / w) / ∑_{j=1}^{d} exp(−s_j / w)   (3.18)

softmax(s; w) = ∑_{i=1}^{d} s_i exp(s_i / w) / ∑_{j=1}^{d} exp(s_j / w)   (3.19)

where s is a signal of length d and w is a weight. As w approaches infinity, the softmin and softmax functions approach the mean function. As w approaches zero, the softmin and softmax functions approach the min and max functions (figure 3.13). We call the robustness metric that uses the softmin and softmax functions the smooth robustness metric. Figure 3.14 shows the gradient of the smooth robustness metric for different values of w.

17. K. Leung, N. Aréchiga, and M. Pavone, "Backpropagation Through Signal Temporal Logic Specifications: Infusing Logical Structure into Gradient-Based Methods," The International Journal of Robotics Research, vol. 42, no. 6, pp. 356–370, 2023.
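A direct implementation of equations (3.18) and (3.19) is short; the sketch below is ours:

# Smooth approximations of min and max over a signal s with weight w.
softmin(s; w) = sum(s .* exp.(-s ./ w)) / sum(exp.(-s ./ w))
softmax(s; w) = sum(s .* exp.(s ./ w)) / sum(exp.(s ./ w))

s = [-1.0, -3.2, 2.0, 1.5, 3.0]
softmax(s; w=0.1)    # ≈ maximum(s) as w approaches zero
softmax(s; w=100.0)  # ≈ the mean of s as w grows large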
Example 3.10. Reachability specification for the inverted pendulum system. Let S_T be the set of states for the inverted pendulum system where the pendulum has tipped over. In other words, S_T is the set of states where the angle θ is outside the range [−π/4, π/4]. Our goal is to avoid reaching this set of states, so we define the reachability specification as

ψ = ¬♦R(st)   (3.22)

where R(st) is the predicate function that checks if the state is in the target set.
The transition function δ maps a state and an instantiation of truth values for the atomic propositions to the next state. The accepting states are the states that are visited infinitely often when the automaton accepts an infinite sequence of states. Example 3.11 shows a simple Büchi automaton with two states and two propositions.
Example 3.11. Example of a Büchi automaton with two states and two atomic propositions. The figure below shows a simple Büchi automaton that accepts an infinite sequence of states if the sequence satisfies the LTL formula ♦(A ∧ B). The diagram has an edge from q1 to q2 labeled A ∧ B, a self-loop on q1 labeled ¬(A ∧ B), and a self-loop on q2 labeled ⊤.

The automaton has two states Q = {q1, q2}, where q1 is the initial state and q2 is the accepting state. The automaton has two atomic propositions A and B. The transition function is defined for all possible combinations of truth values for the atomic propositions:

δ(q1, A ∧ B) = q2
δ(q1, A ∧ ¬B) = q1
δ(q1, ¬A ∧ B) = q1
δ(q1, ¬A ∧ ¬B) = q1
δ(q2, −) = q2
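The transition function of this small automaton can be written directly as a lookup table. The sketch below is ours:

# Transition function from example 3.11, keyed by the automaton state and the
# truth values of the atomic propositions (A, B). q2 is accepting and absorbing.
δ = Dict(
    (:q1, (true,  true))  => :q2,
    (:q1, (true,  false)) => :q1,
    (:q1, (false, true))  => :q1,
    (:q1, (false, false)) => :q1,
)
step(q, A, B) = q == :q2 ? :q2 : δ[(q, (A, B))]
accepting(q) = q == :q2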
Example 3.12. Conversion of an LTL formula to a Büchi automaton. Suppose we have an LTL formula that specifies that we need to visit a checkpoint before reaching a goal, written as

♦G ∧ (¬G U C)

The resulting automaton is shown below. It has 4 states and the same atomic propositions as the LTL formula. The accepting state is q4, and the LTL formula is satisfied if the automaton visits q4 infinitely often, or in other words, if a trajectory reaches q4. The state q2 represents the state where the agent has reached the goal but has not reached the checkpoint. Once this state has been reached, the agent will remain in this state forever with no chance of reaching the accepting state and satisfying the LTL formula. This state is often omitted in practice to reduce the size of the automaton. In the diagram, q1 is the initial state; its outgoing edges are labeled ¬C ∧ ¬G (self-loop), ¬C ∧ G (to q2), C ∧ G (to q4), and C ∧ ¬G (to q3); q3 has a self-loop labeled ¬G and an edge labeled G to q4; and q2 and q4 have self-loops labeled ⊤.
The transition model for the new state space is defined by the transition model of the system T and the transition model of the Büchi automaton δ:

T((s′, q′) | (s, q), a) = T(s′ | s, a) if q′ = δ(q, L(s)), and 0 otherwise   (3.24)

where L(s) is a labeling function that maps a state s to values for the atomic propositions of the Büchi automaton. For example, a labeling function for the system in example 3.12 would map the state st to values that specify whether it is a goal state or checkpoint state.

We refer to the system with the augmented state space as the product system. The reachability specification for the product system is

ψ = ♦R((st, qt))   (3.25)
3.7 Summary
• Metrics and specifications allow us to quantify and express the desired behavior
of a system.
• For stochastic systems, we often compute metrics over the full distribution of
possible outcomes.
• Temporal logic extends first-order logic to express properties about how sys-
tems evolve over time.
• Linear temporal logic (LTL) and signal temporal logic (STL) are two common temporal logics used in control and verification.
4 Falsification through Optimization
The first set of validation algorithms we will explore relates to falsification. Falsification is the process of finding trajectories of a system that violate a given
specification. Such trajectories are sometimes referred to as counterexamples, failure
trajectories, or falsifying trajectories. We will refer to them in this textbook as failures
for simplicity. The beginning of the chapter introduces a naïve algorithm for find-
ing failures based on direct sampling, with the rest of the chapter focused on more
sophisticated algorithms that use optimization techniques to guide the search
for failures. Optimization-based falsification relies on the concept of disturbances,
which control the behavior of the system. We demonstrate how to frame the
falsification problem as an optimization over disturbance trajectories and outline
several techniques to perform the optimization.
With direct sampling from the nominal trajectory distribution, the probability that the first failure is observed on sample k is

P(k) = pfail (1 − pfail)^(k−1)   (4.1)

where k ∈ ℕ. Equation (4.1) corresponds to the probability mass function of a geometric distribution with parameter pfail. Figure 4.2 shows an example of a geometric distribution. The expected value of this distribution, 1/pfail, corresponds to the average number of samples required to find a failure. Example 4.1 illustrates this relationship for the aircraft collision avoidance problem. Systems with very low failure probabilities will require a large number of samples for direct falsification. For example, some aviation systems have failure probabilities on the order of 10⁻⁹. These systems require 1 billion samples on average to observe a single failure event. The remainder of the chapter discusses more efficient falsification techniques.

Figure 4.2. The probability mass function of a geometric distribution with parameter pfail = 0.2. The expected value of this distribution is 1/pfail = 5.
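A minimal version of direct falsification is a rejection loop over nominal rollouts. The sketch below is ours; sample_trajectory and isfailure stand in for a nominal rollout of the system and the specification check:

# Sample nominal trajectories until one violates the specification or the
# sampling budget is exhausted. Returns the failure and the number of samples.
function direct_falsification(sample_trajectory, isfailure; max_samples=1_000_000)
    for k in 1:max_samples
        τ = sample_trajectory()
        isfailure(τ) && return (τ, k)
    end
    return nothing
end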
4.2 Disturbances
We can systematically search for failures by taking control of the sources of ran-
domness in the system. We control these sources of randomness using disturbances.
To incorporate disturbances into a system, we rewrite its sensor, agent, and en-
vironment models by breaking up their stochastic and deterministic elements.
For example, the observation model o ∼ O(· | s) can be written as a deterministic
function of the current state s and a stochastic disturbance xo such that
o = O(s, xo ), xo ∼ Do (· | s) (4.2)
Example 4.1. Direct falsification applied to the aircraft collision avoidance problem with different levels of noise applied to the transitions. There are four state variables for the collision avoidance problem. These plots show how two of these state variables evolve for each trajectory. The horizontal axis is the time to collision tcol, and the vertical axis is the altitude relative to the intruder aircraft h. Suppose we want to find failures of an aircraft collision avoidance system using direct falsification. In this scenario, a failure is a collision between two aircraft, which occurs when the relative altitude to the intruder aircraft h is within ±50 m and the time to collision tcol is zero. The collision avoidance environment applies additive noise with standard deviation σ to the relative vertical rate of the intruder aircraft ḣ at each time step. This noise accounts for variation in pilot response to advisories and the intruder flight path. The plots below use different values of σ and show the trajectory samples produced before finding the first failure, with the first failure trajectory highlighted in red. As σ decreases, failures become less likely, and more trajectories are required to find a failure. In this example, the first failure is found after 41 samples with σ = 5 m, 84 samples with σ = 3 m, and 522 samples with σ = 2 m.
Example 4.2. Separating the stochastic and deterministic elements of a sensor with a Gaussian noise model. Suppose we model a sensor using a Gaussian noise model such that O(o | s) = N(o | s, Σ). We can rewrite this sensor model as

o = s + xo,   xo ∼ N(· | 0, Σ)

The agent's policy and the environment's transition model can also be decomposed:

a = π(o, xa),   xa ∼ Da(· | s)   (4.3)

s′ = T(s, a, xs),   xs ∼ Ds(· | s, a)   (4.4)
4.3 Fuzzing

Unlike direct sampling, which samples from the nominal distribution over system trajectories, we can find failures more efficiently by sampling from a trajectory distribution designed to stress the system. We refer to this process as fuzzing.2 Before we can perform fuzzing, we need to define the components of a trajectory distribution. There are two sources of randomness in a trajectory rollout: the initial state and the disturbances applied at each time step. Therefore, we can fully capture the distribution over trajectories by specifying an initial state distribution and a disturbance distribution for each time step (algorithm 4.3).

2. Fuzzing is a well-known concept in testing of traditional software. It refers to the generation of off-nominal inputs to a program to uncover potential bugs or failures and was first introduced in B. P. Miller, L. Fredriksen, and B. So, "An Empirical Study of the Reliability of UNIX Utilities," Communications of the ACM, vol. 33, no. 12, pp. 32–44, 1990.
The nominal trajectory distribution uses the default initial state and disturbance distributions for the components of the system. Nominal trajectory distributions are stationary, meaning that the disturbance distribution does not depend on time.

initial_state_distribution(p::NominalTrajectoryDistribution) = p.Ps  # default initial state distribution
disturbance_distribution(p::NominalTrajectoryDistribution, t) = p.D  # same disturbance distribution at every time step
depth(p::NominalTrajectoryDistribution) = p.d                        # rollout depth
Example 4.3. Fuzzing applied to the inverted pendulum system. The plots in the top row show the sampled disturbances for the sensor noise on each state variable. The initial state distribution is the same as the nominal initial state distribution (algorithm A.6). The plots on the bottom row show the corresponding trajectories for θ with failures highlighted in red. By slightly increasing the standard deviation of the simulated sensor noise, we are able to uncover two failures.

Suppose we want to find failures of the inverted pendulum system with an additive noise sensor with Do(o | s) = N(o | 0, Σ) and Σ = 0.01I. If we collect 100 samples with this nominal distribution, we do not find any failures. However, if we define a new distribution and increase the standard deviation of the sensor noise on each variable from 0.1 to 0.15 (referred to as fuzzing), we are able to find two failures of the system in the first 100 samples. The following code can be used to define the fuzzing distribution:

struct PendulumFuzzingDistribution <: TrajectoryDistribution
    Σₒ # sensor disturbance covariance
    d # depth
end
function initial_state_distribution(p::PendulumFuzzingDistribution)
    return Product([Uniform(-π / 16, π / 16), Uniform(-1., 1.)])
end
function disturbance_distribution(p::PendulumFuzzingDistribution, t)
    D = DisturbanceDistribution((o)->Deterministic(),
                                (s,a)->Deterministic(),
                                (s)->MvNormal(zeros(2), p.Σₒ))
    return D
end
depth(p::PendulumFuzzingDistribution) = p.d

The plots show the disturbances and trajectories for both distributions.
The falsification problem can be reformulated as a search over the space of initial states and disturbances. Algorithm 4.6 performs a trajectory rollout given an initial state and a sequence of disturbances. We refer to this sequence of disturbances as a disturbance trajectory x = (x1, . . ., xd). Unlike algorithm 4.5, algorithm 4.6 is deterministic. The initial state s and disturbance trajectory x fully determine the resulting trajectory τ. We can then pose falsification as the optimization problem

minimize over s, x:   f(τ)
subject to:           τ = Rollout(s, x)   (4.5)

The rest of this chapter discusses different objective functions and optimization techniques for solving the optimization problem in equation (4.5).
4.5 Objective Functions

Objective functions guide the search for failure trajectories. In general, a good objective function should output lower values for trajectories that are closer to a failure. The specific measure of closeness used is dependent on the application. For example, in the aircraft collision avoidance problem, we may use the vertical miss distance between the aircraft as the objective value.
Example 4.4. Extracting an initial state and disturbance trajectory from a vector of real values for the inverted pendulum system. Suppose we want to compute the robustness objective for the inverted pendulum system where the initial state is always s = [0, 0]. We write the extract function as follows:

function extract(env::InvertedPendulum, x)
    s = [0.0, 0.0]
    𝐱 = [Disturbance(0, 0, x[i:i+1]) for i in 1:2:length(x)]
    return s, 𝐱
end

The function extracts the sensor disturbances from the real-valued vector x to create a disturbance trajectory 𝐱. It then returns the fixed initial state s and the disturbance trajectory.
Searching for the most likely failure requires specifying the distribution over trajectories and using its probability density function to evaluate likelihoods. Assuming that the initial state and disturbances are sampled independently from one another, the probability density function of a trajectory distribution p is

p(τ) = p(s₁) ∏_{i=1}^{d} D(xᵢ | sᵢ, aᵢ, oᵢ)   (4.6)
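In practice it is numerically safer to work with the log of equation (4.6). The sketch below is ours and assumes a simplified representation in which Ps is the initial state distribution and Ds[t] is a Distributions.jl distribution over the disturbance applied at time t:

using Distributions

# Log-likelihood of a trajectory: log p(s₁) plus the log density of each disturbance.
function trajectory_logpdf(Ps, Ds, s₁, 𝐱)
    ℓ = logpdf(Ps, s₁)
    for (t, xₜ) in enumerate(𝐱)
        ℓ += logpdf(Ds[t], xₜ)
    end
    return ℓ
end

Ps = Normal(0.0, 0.1)                    # initial state distribution (scalar example)
Ds = [Normal(0.0, 0.05) for _ in 1:10]   # one disturbance distribution per time step
trajectory_logpdf(Ps, Ds, 0.02, 0.05 .* randn(10))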
If the input trajectory does not produce a failure, equation (4.7) uses the robust-
ness to guide the search toward any failure. If the input does produce a failure
trajectory, it uses the negative likelihood of the trajectory to guide the search
toward more likely failures. Figure 4.3 compares a search for failures with a
search for the most likely failure on the grid world problem. While the robustness
objective finds failures that move directly toward the obstacle, the most likely
failure objective finds a failure that stays close to the nominal path.
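A sketch of such an objective (ours, following this description; rollout, isfailure, the robustness ρ, and the trajectory density p are stand-ins for the corresponding pieces of the falsification setup):

# Combined objective: robustness for non-failures, negative likelihood for failures.
# Lower values are better; failures always score at or below any non-failure.
function most_likely_failure_objective(s, 𝐱; rollout, isfailure, ρ, p)
    τ = rollout(s, 𝐱)
    return isfailure(τ) ? -p(τ) : ρ(τ)
end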
The objective function in equation (4.7) leads to multiple practical challenges.
For example, to encourage the optimization algorithm to find failures, we must
ensure that failures never have a higher objective value than successes. Since
ρ(τ, ψ) ≥ 0 and − p(τ ) ≤ 0, equation (4.7) satisfies this condition. However, p(τ )
can be very small for long trajectories, which can lead to numerical stability issues.
Using log likelihood improves numerical stability but breaks the condition that
failures never have a higher objective value than successes.
This numerical instability as well as the discontinuity at the point of a failure cre-
ates challenges for first- and second-order optimization algorithms (section 4.6).
Furthermore, while the global minimum of the objective function in equation (4.7)
corresponds to the most likely failure of the system, many optimizers are only
guaranteed to find local minima. Due to this fact and the numerical stability
issues, other objective functions may lead to the discovery of more likely failures
in practice.
Another common objective for most likely failure analysis is
Local descent methods start from an initial design point and incrementally improve it until some convergence criterion is met. At each iteration, they use a local model of the objective function at the current design point to determine a direction of improvement. They then take a step in this direction to compute the next design point. Some methods use the gradient or Hessian of the objective function with respect to the current design point to create the local model. These methods are called first-order and second-order methods, respectively. Figure 4.4 shows the result of applying a first-order method called gradient descent to find failures for the inverted pendulum example.

While the gradient and Hessian provide a very powerful signal for optimization algorithms, they are not always available.6 Some simulators do not provide access to the internal model of the system, making exact computation of the gradient infeasible. We often refer to such simulators as black-box simulators. Another category of optimization algorithms called direct methods is better suited for systems with black-box simulators. They traverse the input space using only information from function evaluations, eliminating the need for access to the system's internal model.

Figure 4.4. First-order method applied to falsify the inverted pendulum example. The plot shows successive iterations of the algorithm, with darker trajectories indicating later iterations. Failures are highlighted in red. The algorithm gets closer to a failure with each iteration until it eventually begins to find failures.

6. Gradient information is a strong enough signal to effectively optimize machine learning models with billions of parameters.
Example 4.5. Applying a second-order method called L-BFGS to falsify the inverted pendulum example. We use the open-source implementation of L-BFGS in the Optim.jl package. The plot shows the trajectory of the pendulum for the initial point (green) and the failure trajectory discovered after one iteration (red). For more information on the L-BFGS algorithm, see J. Nocedal, "Updating Quasi-Newton Matrices with Limited Storage," Mathematics of Computation, vol. 35, no. 151, pp. 773–782, 1980.

The Optim.jl package provides implementations of several optimization algorithms. In this example, we show how to use the Optim.jl implementation of a second-order method called L-BFGS to falsify the inverted pendulum system. We define the optimizer function for algorithm 4.11 and run the algorithm using the robustness objective as follows:

using Optim
function lbfgs(f, sys, ψ)
    x₀ = zeros(42)
    alg = Optim.LBFGS()
    options = Optim.Options(store_trace=true, extended_trace=true)
    results = optimize(f, x₀, alg, options; autodiff=:forward)
    τs = [rollout(sys, extract(sys.env, iter.metadata["x"])...)
          for iter in results.trace]
    return filter(τ->isfailure(ψ, τ), τs)
end
objective(x, sys, ψ) = robustness_objective(x, sys, ψ, smoothness=1.0)
alg = OptimizationBasedFalsification(objective, lbfgs)
failures = falsify(alg, inverted_pendulum, ψ)

In this implementation, we are optimizing over a disturbance trajectory with depth d = 21. Since each sensor disturbance is two-dimensional, the length of each design point is 42. The lbfgs function starts with an initial design point of all zeros, specifies options to store the results of each iteration, and runs the algorithm using ForwardDiff.jl to compute gradients. It then extracts the initial state and disturbance trajectory from each iteration and performs a rollout of the system. Finally, it filters the resulting trajectories to return failure trajectories. It is important that we specify the objective as smoothed robustness so that the gradients are well-defined. The plot on the right shows the progression of the algorithm. L-BFGS converges to a failure trajectory after a single iteration.
Local descent methods often get stuck in local optima. Population methods at-
tempt to overcome this drawback by performing optimization using a collection of
design points. The points in a population are sometimes referred to as individuals.
Population methods begin with an initial population that is spread out over the
design space. At each iteration, they use the current function value of each indi-
vidual to move the population toward the optimum. Because population methods
spread samples over the entire design space rather than incrementally improving
a single point, they may find a more diverse set of failures. For example, the
population method in figure 4.5 is able to find failures for the pendulum in both
directions. High-dimensional problems with long time horizons may require a
large number of samples to cover the design space. However, population methods
are often easy to parallelize, which can improve efficiency.
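As a sketch of this idea, the following cross-entropy-style population method fits a Gaussian sampling distribution to the best-performing individuals at each iteration. The population size, elite count, and design-point length are hypothetical choices rather than values from the text.

using Statistics

function population_optimize(f; n=42, m=100, m_elite=10, k_max=20)
    μ, σ = zeros(n), ones(n)
    for _ in 1:k_max
        X = [μ .+ σ .* randn(n) for _ in 1:m]            # sample a population of design points
        elite = X[sortperm([f(x) for x in X])[1:m_elite]] # keep the best individuals
        E = hcat(elite...)                                # elite individuals as matrix columns
        μ = vec(mean(E, dims=2))                          # refit the sampling distribution
        σ = vec(std(E, dims=2)) .+ 1e-6
    end
    return μ                                              # mean of the final population
end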
4.7 Summary
5 Falsification through Planning
Shooting methods use optimization to find a feasible path between two points,1 and they can be used in the context of falsification to produce feasible failure trajectories. These methods break the trajectory optimization problem into a set of smaller problems by optimizing over a sequence of trajectory segments. A trajectory τ can be partitioned into n segments such that τ = (τ1, . . . , τn). Each trajectory segment τi is defined by an initial state si and a sequence of disturbances xi of length di. Given si and xi, we can compute the resulting trajectory τi by performing a rollout.

The defect between two trajectory segments is the distance between the final state of the first segment and the initial state of the second segment. A set of trajectory segments forms a feasible trajectory if the defect of all consecutive trajectory segments is 0. In other words, the final state of τi must match the initial state of τi+1 for all i ∈ {1, . . . , n − 1}. This requirement leads to the following optimization problem:

\[
\underset{s_1, x_1, \ldots, s_n, x_n}{\operatorname{minimize}} \quad f(\tau_1, \ldots, \tau_n)
\]

1. The term shooting method is based on the analogy of shooting at a target from a cannon. Shooting methods start at an initial point and "shoot" trajectories toward a target point until a feasible path between the initial point and target is found. Shooting methods originated from research on boundary value problems. A more detailed review with an implementation can be found in section 18.1 of the reference by W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.
where λ is a weighting parameter that controls how heavily the defect is penalized. Algorithm 5.1 implements this objective when f is the temporal logic robustness. We can apply any of the optimization algorithms discussed in section 4.6 to the optimization problem in equation (5.2). Compared to the optimization problems in the previous chapter, minimizing the defect between the trajectory segments adds complexity to the problem. This added complexity can make it more difficult to find a feasible failure trajectory. Figure 5.1 shows an example that uses a gradient-based optimization technique called L-BFGS2 to find a failure trajectory for the continuum world problem. For systems with black-box simulators, the direct methods described in section 4.6 may struggle to find feasible failure trajectories. Instead, we can use direct methods that were designed specifically for multiple shooting.3

Figure 5.1. Multiple shooting applied to the continuum world example to find a path from an initial point to the obstacle. We use four trajectory segments, and the colors denote which segment end points should connect. The plots show the trajectory segments at different iterations of the L-BFGS optimization algorithm.

2. J. Nocedal, "Updating Quasi-Newton Matrices with Limited Storage," Mathematics of Computation, vol. 35, no. 151, pp. 773–782, 1980.

3. For an example of a multiple shooting algorithm designed for systems with black-box simulators, see A. Zutshi, J. V. Deshmukh, S. Sankaranarayanan, and J. Kapinski, "Multiple Shooting, CEGAR-Based Falsification for Hybrid Systems," in International Conference on Embedded Software, 2014.
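As a rough sketch of the defect-penalized objective behind algorithm 5.1, the following function scores a set of segments, each given as a tuple of an initial state and a disturbance sequence. It assumes rollout can be called with an initial state and a disturbance sequence and uses the robustness function from chapter 3; the weight λ is a hypothetical choice, not a value from the text.

using LinearAlgebra

function shooting_objective(segments, sys, ψ; λ=10.0)
    τs = [rollout(sys, s, x) for (s, x) in segments]   # simulate each segment
    τ = vcat(τs...)                                    # concatenate into one trajectory
    defect = sum(norm(τs[i][end].s .- segments[i+1][1]) for i in 1:length(τs)-1)
    return robustness([step.s for step in τ], ψ.formula) + λ * defect
end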
5.2 Tree Search

Tree search algorithms iteratively construct a tree structure that represents the space of possible trajectories. Each node in the tree represents a state, and each edge represents a transition between states that is the result of applying a particular disturbance. Each path through the tree corresponds to a feasible trajectory for the system. Tree search algorithms start in an initial state and iteratively grow the tree in an attempt to find feasible failure trajectories.
the tree in an attempt to find feasible failure trajectories.
At each iteration, these algorithms perform the steps illustrated in figure 5.2.
They first select a node from the tree to extend. This selection is typically based
on a heuristic designed to grow the tree toward failures. Next, they extend the
selected node by choosing a disturbance and adding a new child node at the
resulting next state. We can terminate the algorithm after a fixed number of
iterations or when a failure trajectory is discovered.
Algorithm 5.2 implements the generic tree search algorithm. It runs for a fixed
number of iterations before returning all failures in the tree. Algorithm 5.3 extracts
failure trajectories from a tree by enumerating all paths in the tree and checking
for failures. Specific implementations of tree search algorithms differ in how they
implement the select and extend functions. We discuss two categories of tree
search algorithms in the next two sections.
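A minimal sketch of this loop is shown below. The node representation (a state and a parent index) and the select and extend signatures are assumptions; algorithm 5.2 in the text and the specific algorithms in the following sections fill in these pieces differently.

function tree_search(sys, s₁, select, extend; k_max=100)
    tree = [(state=s₁, parent=0)]             # root node at the initial state
    for _ in 1:k_max
        i = select(tree)                      # index of the node to grow from
        s′ = extend(sys, tree[i])             # simulate one step under a chosen disturbance
        push!(tree, (state=s′, parent=i))     # add the resulting child node
    end
    return tree                               # failure paths extracted as in algorithm 5.3
end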
5.3 Heuristic Search

Some tree search algorithms use heuristics to explore the space of possible trajectories. The rapidly exploring random trees (RRT) algorithm, for example, uses heuristics to iteratively extend the search tree toward randomly selected states in
the state space.4 In the context of falsification, we use RRT to efficiently explore the space of possible disturbance trajectories in search of a failure trajectory.

Algorithm 5.4 implements the select and extend steps for the RRT algorithm. In the select step, RRT randomly samples a goal state and computes an objective value for each node in the current tree based on the sampled goal state. This objective is typically related to the distance between each node and the goal state. The algorithm then selects the node with the lowest objective value to pass to the extend step. In the extend step, RRT selects a disturbance, simulates one step forward in time from the selected node, and adds the resulting edge and child node to the tree.

Several variants of RRT differ in how they sample goal states, compute objectives, and select disturbances. Algorithm 5.5 implements a version of the RRT algorithm that samples goal states uniformly from the state space. It then uses the Euclidean distance between each node and the goal state as the objective. In the extend step, the disturbance is randomly sampled from the nominal disturbance distribution for the system. Example 5.1 applies this algorithm to the continuum world problem.

4. RRT was designed to efficiently enumerate trajectories in high-dimensional spaces, particularly for systems with complex dynamics. The algorithm was originally proposed in the context of robotic path planning. For more information on path planning algorithms, see S. M. LaValle, Planning Algorithms. Cambridge University Press, 2006.
random_goal(tree, lo, hi) = rand.(Distributions.Uniform.(lo, hi))

function distance_objectives(tree, sgoal)
    return [norm(sgoal .- node.state) for node in tree]
end

function random_disturbance(sys, node)
    D = DisturbanceDistribution(sys)
    o, a, s′, x = step(sys, node.state, D)
    return x
end

Algorithm 5.5. Functions for the RRT algorithm. The first function samples a goal state uniformly from the state space. The lo and hi inputs specify the lower and upper bounds of the state variables. The second function computes the Euclidean distance between each node in the tree and the goal state. The third function samples a disturbance from the nominal disturbance distribution for the system.
Example 5.1. Basic RRT applied to the continuum world example. The plots show snapshots of the search tree after 5, 15, and 100 iterations. The stars show the next goal state and highlighted nodes show the node selected to extend next.

Suppose we want to apply RRT to search for failures for the continuum world system. We can use the following code to run the basic RRT algorithm for 100 iterations.

select_goal(tree) = random_goal(tree, [0.0, 0.0], [10.0, 10.0])
compute_objectives(tree, sgoal) = distance_objectives(tree, sgoal)
select_disturbance(tree, node) = random_disturbance(tree, node)
alg = RRT(select_goal, compute_objectives, select_disturbance, 100)
failures = falsify(alg, cw, ψ)

The plots below show two snapshots of the search tree after 5 and 15 iterations as well as the final tree after 100 iterations. After 100 iterations, RRT did not find any failure trajectories. Although goal states are sampled throughout the state space, the disturbances are sampled from the nominal disturbance distribution. Since the nominal disturbance distribution represents only small deviations from the nominal path, the tree closely follows the nominal path toward the goal. We can improve the performance of the tree search using the heuristics discussed in section 5.3.1.
using the goal state to select the disturbance. Specifically, we want to select the
disturbance that leads to the next state that is closest to the goal state.
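One hedged sketch of this idea samples several candidate disturbances and keeps the one whose next state is closest to the goal. Here successor is a hypothetical helper that returns the next state after applying a disturbance, and random_disturbance is from algorithm 5.5.

using LinearAlgebra

function goal_directed_disturbance(sys, node, sgoal; m=10)
    xs = [random_disturbance(sys, node) for _ in 1:m]               # candidate disturbances
    dists = [norm(successor(sys, node.state, x) .- sgoal) for x in xs]
    return xs[argmin(dists)]                                         # closest resulting next state
end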
Example 5.2. Example of sampling goal states from the failure region of the continuum world problem.

The failure region for the continuum world example is the set of states within the red obstacle, which is a circle centered at (4.5, 4.5) with radius 0.5. We can uniformly sample from this region using the following code:

function failure_goal(tree)
    r = 0.5 * sqrt(rand())   # square root makes the samples uniform over the disk area
    θ = rand(Uniform(0, 2π))
    return [4.5, 4.5] .+ [r*cos(θ), r*sin(θ)]
end

The code samples a radius between 0 and 0.5 (taking a square root so that the samples are uniform over the area of the disk) and an angle between 0 and 2π. It then converts these samples to a state in the failure region.
Since dispersion considers only the largest ball that can be placed in S, it tends to be a conservative measure of coverage. Furthermore, it is difficult to compute for high-dimensional spaces. An approximate metric called average dispersion overcomes these drawbacks.6 Average dispersion is computed on a grid of n points with spacing δ in each dimension. It is calculated as

\[
\text{average dispersion} = \frac{1}{n} \sum_{j=1}^{n} \frac{\min(d_j(V), \delta)}{\delta} \tag{5.4}
\]

where d_j(V) is the distance from the jth grid point to the nearest point in V.

6. J. M. Esposito, J. Kim, and V. Kumar, "Adaptive RRTs for Validating Hybrid Robotic Control Systems," in Algorithmic Foundations of Robotics, Springer, 2005, pp. 107–121.
function average_dispersion(points, lo, hi, lengths)
    points_norm = [(point .- lo) ./ (hi .- lo) for point in points]
    ranges = [range(0, 1, length) for length in lengths]
    δ = minimum(Float64(r.step) for r in ranges)
    grid_dispersions = []
    for grid_point in Iterators.product(ranges...)
        dmin = minimum(norm(grid_point .- p) for p in points_norm)
        push!(grid_dispersions, min(dmin, δ) / δ)
    end
    return mean(grid_dispersions)
end

Algorithm 5.7. Algorithm for computing average dispersion of a set of points on a space bounded by lo and hi. It uses a grid specified by lengths, which contains the number of grid points in each dimension. The algorithm first normalizes the points to lie in the unit hypercube. It then creates the grid over the unit hypercube and computes the average dispersion using equation (5.4).
Algorithm 5.7 computes average dispersion given a set of points and a bounded region. The term in the numerator of equation (5.4) is the radius of the largest ball centered at each grid point that does not contain any points in V or other grid points. Dividing by δ ensures that the values for average dispersion range between 0 and 1, and subtracting the average dispersion from 1 results in a coverage metric that ranges between 0 and 1.7 Figure 5.6 shows the difference between dispersion and average dispersion.

7. The average dispersion coverage metric will be 1 if V contains all of the grid points. A finer grid will result in better coverage estimates but at a greater computational cost.
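As a hypothetical usage of algorithm 5.7, the coverage of a few visited states in the continuum world state space on a 10 × 10 grid could be computed as follows (the specific points are made up for illustration):

points = [[2.0, 3.0], [7.5, 8.0], [5.0, 5.0]]                  # illustrative visited states
coverage = 1 - average_dispersion(points, [0, 0], [10, 10], [10, 10])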
Another common coverage metric is discrepancy. The key insight behind discrepancy is that if a set of points covers a space evenly, then a randomly chosen subset of the space should contain a fraction of samples proportional to the fraction of volume occupied by the subset. Discrepancy is defined in terms of the worst-case hyperrectangular subset:

\[
\text{discrepancy} = \sup_{\mathcal{H} \subseteq \mathcal{S}} \left| \frac{\#(V \cap \mathcal{H})}{\#(V)} - \frac{\operatorname{vol}(\mathcal{H})}{\operatorname{vol}(\mathcal{S})} \right| \tag{5.5}
\]
where H is a hyperrectangular subset of S and #(V ∩ H) and #(V) are the number of points in V that lie in H and the total number of points in V, respectively. We use vol(H) and vol(S) to denote the n-dimensional volume of H and S, respectively, which can be obtained by multiplying the side lengths.

The worst-case hyperrectangle that determines the discrepancy of a set of points is typically a small region containing many points or a large region with few points. Figure 5.7 visualizes the discrepancy metric. Discrepancy approaches 1 when all points overlap and approaches 0 when all possible hyperrectangular subsets have their proper share of points. In general, discrepancy is difficult to compute exactly, especially in high dimensions.

Star discrepancy is a special case of discrepancy that is easier to compute and is often used in practice. Instead of considering all possible hyperrectangular subsets, star discrepancy considers only hyperrectangular subsets of the unit hypercube that have a vertex at the origin. We can always normalize any hyperrectangular space S to the unit hypercube by dividing by the side length in each dimension. Given these constraints, it is possible to compute lower and upper bounds on star discrepancy.8 We first partition the unit hypercube B into a finite number of subrectangles.

Figure 5.7. Visualization of the discrepancy metric. The rectangles indicate two candidates for the worst-case rectangle used to define discrepancy. Discrepancy is determined by a rectangle with small area and many points (top) or a rectangle with large area and few points (bottom).

8. E. Thiémard, "An Algorithm to Compute Bounds for the Star Discrepancy," Journal of Complexity, vol. 17, no. 4, pp. 850–880, 2001. Examples of other approximations can be found in Y.-D. Zhou, K.-T. Fang, and J.-H. Ning, "Mixture Discrepancy for Quasi-Random Point Sets," Journal of Complexity, vol. 29, no. 3-4, pp. 283–301, 2013.
We can use average dispersion or star discrepancy as a metric to select the goal state in RRT. In particular, we want to select the goal state that would result in the greatest increase in coverage if added to the current tree. While it is difficult to determine the goal state exactly, we can approximate this process by drawing samples from the state space, computing the difference in coverage for each sample, and selecting the sample with the largest increase.9 The samples may be selected from a grid (figure 5.10) or drawn uniformly from the state space.

(Margin figure: star discrepancy bounds as a function of grid resolution.)
function star_discrepancy(points, lo, hi, lengths)
    n, dim = length(points), length(lo)
    𝒱 = [(point .- lo) ./ (hi .- lo) for point in points]
    ranges = [range(0, 1, length)[1:end-1] for length in lengths]
    steps = [Float64(r.step) for r in ranges]
    ℬ = Hyperrectangle(low=zeros(dim), high=ones(dim))
    lbs, ubs = [], []
    for grid_point in Iterators.product(ranges...)
        h⁻ = Hyperrectangle(low=zeros(dim), high=[grid_point...])
        h⁺ = Hyperrectangle(low=zeros(dim), high=grid_point .+ steps)
        𝒱h⁻ = length(filter(v -> v ∈ h⁻, 𝒱))
        𝒱h⁺ = length(filter(v -> v ∈ h⁺, 𝒱))
        push!(lbs, max(abs(𝒱h⁻ / n - volume(h⁻) / volume(ℬ)),
                       abs(𝒱h⁺ / n - volume(h⁺) / volume(ℬ))))
        push!(ubs, max(𝒱h⁺ / n - volume(h⁻) / volume(ℬ),
                       volume(h⁺) / volume(ℬ) - 𝒱h⁻ / n))
    end
    return maximum(lbs), maximum(ubs)
end

Algorithm 5.8. Algorithm for computing upper and lower bounds on the star discrepancy of a set of points on a space bounded by lo and hi. It uses a partition specified by lengths, which contains the number of subrectangles in each dimension. The algorithm first normalizes the points to lie in the unit hypercube. It then creates the partition over the unit hypercube and computes the upper and lower bounds on star discrepancy using equation (5.6). We use the LazySets.jl package to represent hyperrectangles.
Figure 5.10. Selecting the next goal state for RRT applied to the continuum world problem using average dispersion and star discrepancy coverage metrics. The plots show the grid points used as candidates for the next goal state. The color of each grid point indicates the increase in coverage that would result from adding that grid point to the tree, with darker colors indicating a greater increase. For star discrepancy, the colors represent the lower bound. The star indicates the goal state selected by RRT. Because star discrepancy only focuses on the worst-case hyperrectangle, it is not as smooth as average dispersion.
Example 5.3. RRT applied to the continuum world problem using coverage heuristics. The plots illustrate the effect of selecting the next goal state based on coverage rather than randomly selecting it.

Suppose we want to apply coverage heuristics when using RRT on the continuum world problem. The following code implements a version of the select_goal function that uses coverage based on average dispersion to guide the search.

function select_goal(tree; m=5)
    a, b, lengths = [0, 0], [10, 10], [10, 10]
    points = [node.state for node in tree]
    sgoals = [rand.(Distributions.Uniform.(a, b)) for _ in 1:m]
    dispersions = [average_dispersion([points..., sgoal], a, b, lengths)
                   for sgoal in sgoals]
    coverages = 1 .- dispersions
    return sgoals[argmax(coverages)]
end

We first collect the states visited so far from the nodes of the tree and sample m potential goal states uniformly from the state space. We then compute the new average dispersion if each goal state were added to the tree. The goal state that results in the greatest increase in coverage is selected. The plots show the resulting trees when using random goals and coverage-based goals. Using the coverage-based goal selection results in a wider tree that covers more of the state space.
We can terminate the tree search when the growth metric is sufficiently small. It is important to note that this growth metric does not provide any guarantees about the coverage of the search tree. Even when growth is small, the coverage may still remain at a number less than 1 (see figure 5.11). Some states are extremely unlikely to be reached under the nominal disturbance model.

Figure 5.11. The average dispersion coverage metric over iterations of RRT applied to the continuum world problem.

5.3.3 Alternative Objectives

As noted in the previous chapter, we may want to go beyond a simple search for failures and incorporate other objectives into the search process. For example, we may be interested in finding the shortest path to failure or the most likely failure. We can incorporate these objectives into RRT by modifying how we compute the objectives in the select step (algorithm 5.9).
First, we define a cost function c that maps a node to a cost of transitioning to the node from its parent. For example, the cost might be a measure of the distance between the node's state and its parent's state. To ensure that the tree search algorithm is still encouraged to reach the goal, all costs must be positive. The total cost of a path is the sum of the costs of all nodes in the path. Our goal is to find the path to the goal with the lowest total cost.

We compute an objective for each node consisting of two components: the total cost of the current path from the root to the node and an estimate of the remaining cost to get from the node to the goal state. The remaining cost estimate comes from a heuristic function h. One potential heuristic is the distance from the current node to the goal state. Algorithm 5.9 implements this process given a cost function and heuristic function.10 It provides default cost and heuristic functions that will guide the search toward the shortest path.

10. This algorithm is a simplified version of the RRT∗ algorithm. S. Karaman and E. Frazzoli, "Incremental Sampling-Based Algorithms for Optimal Motion Planning," Robotics Science and Systems VI, vol. 104, no. 2, pp. 267–274, 2010.

To search for the most likely failure, we can use a cost function related to the negative log likelihood of the disturbance for the current node. We add a constant factor of the maximum possible log likelihood according to the disturbance distribution to ensure that the cost is positive. For the heuristic function, we need to estimate the log likelihood of the remaining path required to reach the goal state. One option is to use the distance to the goal state as a proxy for this value since longer paths tend to result in lower log likelihoods. Adding a scaling
factor to the cost function to balance between the heuristic and cost may improve performance. Figure 5.12 shows the results from using RRT to find the shortest path to failure and most likely failure for the continuum world problem.

While algorithm 5.9 will often find a low-cost path to failure, it is not necessarily guaranteed to find the path with the lowest possible cost. Certain conditions on the nature of the problem and the heuristic function are required to guarantee optimality. Algorithm 5.9 will converge to the optimal path if the state space and disturbance space are discrete and the heuristic function is admissible.11 A heuristic is admissible if it is guaranteed to never overestimate the cost of reaching the goal state. In shortest path problems, the straight-line distance to the goal state is an admissible heuristic. Example 5.4 demonstrates this result on the grid world problem.

Figure 5.12. The nominal path for the continuum world problem compared to the shortest path to failure and the most likely failure path found by RRT. The most likely failure path stays closer to the nominal path before moving toward the obstacle.

11. When these conditions are met, the algorithm is the same as the A∗ search algorithm. P. E. Hart, N. J. Nilsson, and B. Raphael, "A Formal Basis for the Heuristic Determination of Minimum Cost Paths," IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968.
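A sketch of cost and heuristic functions for the two objectives above is shown below. The node fields and the zero-mean Gaussian disturbance distribution D are assumptions about how the tree and disturbance model might be represented, not definitions from the text.

using Distributions, LinearAlgebra

# Shortest-path objective: cost is distance traveled; heuristic is straight-line distance.
shortest_cost(node, sparent) = norm(node.state .- sparent)
shortest_heuristic(node, sgoal) = norm(sgoal .- node.state)

# Most-likely-failure objective, assuming node.x stores the disturbance applied to reach the node.
D = MvNormal(zeros(2), 0.1^2 * I)
likely_cost(node) = logpdf(D, mode(D)) - logpdf(D, node.x)           # shifted to be nonnegative
likely_heuristic(node, sgoal; λ=1.0) = λ * norm(sgoal .- node.state) # distance as a likelihood proxy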
5.4 Monte Carlo Tree Search

Monte Carlo tree search (MCTS) (algorithm 5.10) is a tree search algorithm that balances between exploration and exploitation.12 It explores by selecting nodes that have not been visited many times and exploits by biasing the search tree toward paths that seem most promising. MCTS determines which paths are most

12. For a survey, see C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, "A Survey of Monte Carlo Tree Search Methods," IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, 2012.
Example 5.4. Example of using RRT to find the shortest path to failure and most likely failure for the grid world problem. The plots show the search tree at different iterations of the algorithm. The most likely failure path stays closer to the nominal path (highlighted in gray) before moving toward the obstacle.

Since the state space and disturbance space for the grid world problem are discrete, we are guaranteed to find the shortest path to failure and the most likely failure path as long as we select an admissible heuristic function. For the shortest path to failure, an admissible heuristic is the Euclidean distance between the current state and the goal state. This distance will always be less than or equal to the actual cost of reaching the goal state since the shortest path between two points is a straight line. For the most likely failure path, we can use the likelihood of a straight line trajectory from the current state to the goal state assuming that it used the most likely disturbance at each step. The plots show the results. As in the continuum world problem (figure 5.12), the most likely failure path stays closer to the nominal path before moving toward the obstacle.

(Plot panels: Iteration 10, Iteration 25, Converged; one row for the shortest path and one for the most likely path.)
promising by maintaining a value function Q(s, x) for each node in the tree. Given a failure objective (section 4.5), Q(s, x) represents the expected future objective value when applying disturbance x from state s. MCTS searches for the path with the lowest objective value.13

In the select step, MCTS traverses the tree starting at the root node. At each node, we determine whether to select it for the extend step based on its current number of children and number of visits N(s). Specifically, we extend the node if the number of children is less than or equal to kN(s)^α, where k and α are algorithm hyperparameters. This process is referred to as progressive widening. If the number of children exceeds this value, we continue to traverse the tree using a heuristic that balances between exploration and exploitation.

13. When the objective function is the most likely failure objective, this technique is sometimes referred to as adaptive stress testing. R. Lee, O. J. Mengshoel, A. Saksena, R. W. Gardner, D. Genin, J. Silbermann, M. Owen, and M. J. Kochenderfer, "Adaptive Stress Testing: Finding Likely Failure Events with Reinforcement Learning," Journal of Artificial Intelligence Research, vol. 69, pp. 1165–1201, 2020.
Figure 5.13. MCTS applied to find a failure in the continuum world problem. Darker nodes and edges were visited more often. MCTS finds a failure (highlighted in red) after 258 iterations. (Panels: Iteration 100, Iteration 200, Failure Found.)
A common heuristic is the lower confidence bound (LCB) (algorithm 5.11), which is defined as

\[
Q(s, x) - c \sqrt{\frac{\log N(s)}{N(s, x)}} \tag{5.8}
\]

where N(s, x) is the number of times we took the path corresponding to disturbance x from the node corresponding to state s. The first term in equation (5.8) exploits our current estimate of how promising a particular path is based on the value function, and the second term is an exploration bonus. The exploration constant c controls the amount of exploration. Higher values will lead to more exploration. We move to the child node with the lowest LCB value and repeat the process until we reach a node that we can extend.
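The progressive widening test and the lower confidence bound in equation (5.8) can be sketched as follows; the dictionaries holding visit counts and value estimates are assumptions about how the tree statistics might be stored, not the representation used in algorithms 5.10 and 5.11.

extend_node(nchildren, nvisits; k=1.0, α=0.5) = nchildren ≤ k * nvisits^α   # progressive widening test

lcb(Q, N, Ns, s, x; c=1.0) = Q[(s, x)] - c * sqrt(log(Ns[s]) / N[(s, x)])   # equation (5.8)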
In the extend step, MCTS samples a disturbance and simulates the system one step forward in time from the current node. It then estimates the value at the new node and adds it to the tree. A common technique to estimate this value is to perform rollouts from the new node and evaluate their robustness. We can also estimate the value using a heuristic such as distance to failure. Finally, we propagate this information back up the tree to update the visit counts and mean value estimate for each node in the path. Figure 5.13 shows the result of using MCTS to find failures in the continuum world problem. The algorithm gradually expands the tree toward the obstacle and visits promising nodes more often.

The tree search algorithms we have presented so far assumed deterministic transitions between nodes. In other words, simulating disturbance x from state s will always lead to the same next state s′. However, we may not have control over all sources of randomness for some real-world simulators, resulting in stochastic transitions between nodes. One advantage of MCTS is that it can handle this stochasticity. A technique called double progressive widening can be used to extend the tree in these cases. Double progressive widening applies the progressive widening condition to both the disturbance and next state.14

14. A. Couëtoux, J.-B. Hoock, N. Sokolovska, O. Teytaud, and N. Bonnard, "Continuous Upper Confidence Trees," in Learning and Intelligent Optimization (LION), 2011.

5.5 Reinforcement Learning
with an environment.15 We can use reinforcement learning for falsification by 15
For an introduction to reinforce-
training an agent to cause a system to fail. To avoid confusing the reinforcement ment learning, see R. S. Sutton and
A. G. Barto, Reinforcement Learning:
learning agent with the agent in the system under test, we call the reinforcement An Introduction, Second Edition.
learning agent an adversary. MIT Press, 2018.
Figure 5.14 shows the overall setup. At each time step, the adversary interacts
with the system by selecting a disturbance x. The system then steps forward in
time and produces a reward r for the adversary related to the failure objective.
We refer to a series of these time steps as an episode. Reinforcement learning
© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-12-23 15:12:43-05:00, comments to [email protected]
5.6. simulator requirements 117
algorithms train the adversary to maximize reward using data gathered over a
System
series of episodes. Specifically, the adversary learns a policy πadv (s) that maps
states to disturbances. Once the adversary is trained, we can use it to search for
failures by performing rollouts of the system using disturbances selected by the x r
adversary’s policy.
Similar to MCTS, reinforcement learning algorithms balance between explo- Adversary
ration and exploitation. The adversary explores by trying different disturbances
in each state, and exploits by selecting disturbances that are likely to lead to a Figure 5.14. Reinforcement learn-
failure. Typically, the adversary will explore more at the beginning of training to ing for falsification. We train an ad-
versary to select disturbances that
gather data that it can later on exploit. Reinforcement learning algorithms balance will cause a system to fail. The ad-
between these two objectives to maximize sample efficiency. Sample efficient algo- versary receives feedback in the
form of a reward signal.
rithms require as few episodes as possible to learn an effective policy. A number
of sample efficient reinforcement learning algorithms have been developed, and
we can use off-the-shelf implementations of them to efficiently find failures of
complex systems.16 16
Off-the-shelf reinforcement
Another advantage of a reinforcement learning approach is its ability to gener- learning packages provide im-
plementations of a variety of
alize. The shooting methods and tree search algorithms discussed in this chapter reinforcement learning algorithms.
all required a specific initial state from which to find a failure path. Using rein- For example, see the Crux.jl
package in the Julia ecosystem.
forcement learning to find failures removes this necessity. Because the adversary
learns a policy over the entire state space, we can perform a rollout from any
initial state to search for a failure. Example 5.5 demonstrates this result on the
continuum world problem using an off-the-shelf reinforcement learning package.
Example 5.5. Example of using reinforcement learning to find failures in the continuum world problem. The plots show rollouts of the adversary policy starting from different initial states after different numbers of training episodes. Failure trajectories are highlighted in red. The adversary is able to find failures from most initial states after 50,000 training episodes. For more information on the solving code, see the Crux.jl documentation.

To apply the reinforcement learning algorithms implemented in the Crux.jl package to the continuum world problem, we need to define the following:

initial_state_dist = Product([Distributions.Uniform(0, 10),
                              Distributions.Uniform(0, 10)])

function interact(s, x, rng)
    _, _, s′ = step(cw, s, Disturbance(0, x, 0))
    r = Float32(robustness(s, ψ.formula) - robustness(s′, ψ.formula))
    if norm(s′ - [4.5, 4.5]) < 0.5
        r += 10f0   # bonus reward for reaching the failure region
    end
    return (sp=s′, r=r)
end

We first define an initial state distribution that covers the entire state space, allowing us to find failures starting from any state. The interact function defines how the adversary interacts with the system. Given a state s and a disturbance x, the function simulates the system one step forward in time and returns a tuple with the next state s′ and reward r. The random number generator rng is a required input for Crux.jl but is not used in this case since the function is deterministic.

The reward is based on the change in robustness for the current step. We also add a large reward for reaching a failure state. With these definitions, we can apply any of the reinforcement learning algorithms in the Crux.jl package to find failures. The plots show rollouts of the adversary policy starting from different initial states after different numbers of training episodes using an algorithm called Proximal Policy Optimization (PPO). The adversary is able to find failures from most initial states after 50,000 training episodes.
5.7 Summary
• Planning algorithms account for the temporal aspect of the falsification problem
and break it into a series of smaller problems.
• Tree search algorithms search the space of possible trajectories as a tree and
iteratively grow the tree in search of a failure trajectory.
6 Failure Distribution
While the falsification algorithms in the previous chapters search for single failure events, it is often desirable to understand the distribution over failures for a given system and specification. This distribution is difficult to quantify exactly for many real-world systems. Instead, we can approximate the failure distribution by drawing samples from it. This chapter discusses methods for sampling from the failure distribution. We present two categories of sampling methods. First, we discuss rejection sampling, which produces samples from a target distribution by accepting or rejecting samples from a different distribution. We then present Markov chain Monte Carlo (MCMC) methods. MCMC methods generate samples from a target distribution using a chain of correlated samples. We conclude with a discussion of probabilistic programming, which allows us to scale MCMC methods to complex, high-dimensional systems.

6.1 Distribution over Failures
The distribution over failures for a given system with specification ψ is represented by the conditional probability p(τ | τ ∉ ψ). We can write this probability as

\[
p(\tau \mid \tau \not\in \psi) = \frac{\mathbb{1}\{\tau \not\in \psi\}\, p(\tau)}{\int \mathbb{1}\{\tau \not\in \psi\}\, p(\tau)\, \mathrm{d}\tau} \tag{6.1}
\]

where 1{·} is the indicator function and p(τ) is the probability density of the nominal trajectory distribution for trajectory τ. Figure 6.1 shows the failure distribution for a simple system where trajectories consist of only a single state that is sampled from a normal distribution. For most systems, the failure distribution is difficult to compute exactly because doing so requires solving the integral in the denominator of equation (6.1) to compute the normalizing constant. The value of this integral corresponds to the probability of failure for the system. We discuss methods to estimate this quantity in chapter 7.

While we cannot compute the probability density of the failure distribution exactly, we can use its unnormalized probability density p̄(τ | τ ∉ ψ) to draw samples from it. The unnormalized probability density is given by

\[
\bar{p}(\tau \mid \tau \not\in \psi) = \mathbb{1}\{\tau \not\in \psi\}\, p(\tau) \tag{6.2}
\]

Computing this density for a given trajectory only requires determining whether it is a failure trajectory and evaluating its probability density under the nominal trajectory distribution. The rest of this chapter discusses several methods for sampling from this unnormalized distribution.1 With enough samples, we can implicitly represent the distribution over failures (see figure 6.2).

Figure 6.1. The distribution over failures for a simple system where trajectories consist of only a single state that is sampled from a normal distribution (black). A failure occurs when the sampled state is less than −1. The area of the shaded region corresponds to the integral in equation (6.1). The failure distribution (red) is the probability density function of the nominal distribution in the failure region scaled by this value.

Figure 6.2. Distribution over failures for the grid world problem represented implicitly through samples. The probability of slipping is set to 0.8.

1. For a detailed overview, see C. P. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 1999, vol. 2.

6.2 Rejection Sampling
Rejection sampling produces samples from a complex target distribution by accepting or rejecting samples from a different distribution that is easier to sample from. It is inspired by the idea of throwing darts uniformly at a rectangular dart board that encloses the graph of the density of the target distribution. If we keep only the darts that land inside the target density, we produce samples that are distributed according to the target distribution (see figure 6.3).

In the dart board example, we are using samples from a uniform distribution to produce samples from an arbitrary target density. The efficiency of this process depends on the area of the dart board that lies outside the target distribution. If there is a large area outside the target distribution, many of the darts will be rejected, and we will require more darts to accurately represent the target distribution. One way to improve efficiency is to use a different dart board that more closely matches the shape of the target distribution. In other words, we may want to draw samples from a different distribution that is still easy to sample from but more closely matches the target distribution. We call this distribution a proposal distribution.

Algorithm 6.1 implements the rejection sampling algorithm given a target distribution with density function p̄(τ) and a proposal distribution with density function q(τ). At each iteration, we draw a sample τ from the proposal distribution and accept it with probability p̄(τ)/(cq(τ)).2 To ensure that the proposal distribution fully encloses the target distribution, we require that q(τ) > 0 whenever p̄(τ) > 0 and that c is selected such that p̄(τ) ≤ cq(τ) for all τ. The density function of the target distribution does not need to be normalized.

Figure 6.3. Sampling from a truncated normal distribution by throwing darts uniformly at a rectangular dart board that encloses the graph of its density function. The samples on the bottom are obtained by moving all of the darts that land inside the target distribution to the bottom of the dart board. These samples are distributed according to the target distribution.

2. In the dart board analogy, we can think of this acceptance criterion as a two-step process. First, we sample the x-coordinate of the dart from the proposal distribution. Second, we select its y-coordinate randomly between the bottom of the board and cq(τ). If it falls under p(τ), it is accepted.
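A minimal sketch of rejection sampling along these lines is shown below. It assumes a proposal distribution we can sample with rand and evaluate with pdf, and it may differ in its details from algorithm 6.1 in the text.

function rejection_sample(p̄, q, c; m=1000)
    samples = []
    for _ in 1:m
        τ = rand(q)                          # draw a candidate from the proposal
        if rand() ≤ p̄(τ) / (c * pdf(q, τ))   # accept with probability p̄(τ)/(cq(τ))
            push!(samples, τ)
        end
    end
    return samples
end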
To sample from the failure distribution, we use the unnormalized density in equation (6.2) as the target density. A common choice for the proposal distribution is the nominal trajectory distribution. To use this proposal, we must select a value for c such that 1{τ ∉ ψ}p(τ) ≤ cp(τ). Selecting c = 1 satisfies this condition and causes the acceptance ratio to reduce to 1{τ ∉ ψ}. In other words, we will accept a sample if it is a failure trajectory and reject it otherwise. Figure 6.4 shows an example that uses the nominal trajectory distribution to sample from the failure distribution shown in figure 6.1.

If failures are unlikely under the nominal distribution, we will require many samples to produce a representative set of samples from the failure distribution. In this case, we may be able to improve efficiency by using domain knowledge to select a proposal distribution that more closely matches the shape of the failure distribution. For example, failures occur at negative values in the simple system shown in figure 6.1, so we may be able to improve efficiency by shifting the proposal distribution to the left.

When we select the proposal distribution for rejection sampling, we must also select a value for c to ensure that the proposal distribution fully encloses the target distribution for all τ. Figure 6.5 shows an example that uses a shifted proposal distribution to sample from the failure distribution shown in figure 6.1 for two different values of c.

Figure 6.4. Rejection sampling using the nominal trajectory distribution as the proposal distribution to sample from the failure distribution shown in figure 6.1. The plot on the top shows the target density (red) and the proposal density (gray). Accepted samples are highlighted in red. The plot on the bottom shows a histogram of the accepted samples compared to the density function of the failure distribution.
Example 6.1. Example of the challenges of using rejection sampling for high-dimensional systems with long time horizons. In this example, we compute the tightest value we can select for c based on domain knowledge for the inverted pendulum system and show that it is prohibitively large.

Suppose we want to use rejection sampling to sample failures from an inverted pendulum system where the standard deviation of the sensor noise for each state variable is 0.1. From example 4.3, we know that failures are rare under the nominal trajectory distribution, so rejection sampling using the nominal trajectory distribution as a proposal will be inefficient. We also saw in example 4.3 that when we instead sampled trajectories from a distribution where the standard deviation of the sensor noise was 0.15, we were able to find failures. Therefore, we may want to use this distribution as a proposal for rejection sampling.

We must then select a value for c such that

\[
\begin{aligned}
p(\tau) &\leq c\, q(\tau) \\
p(s_1) \prod_{t=1}^{d} \mathcal{N}\!\left(x_t \mid 0, (0.1)^2 I\right) &\leq c\, p(s_1) \prod_{t=1}^{d} \mathcal{N}\!\left(x_t \mid 0, (0.15)^2 I\right) \\
\prod_{t=1}^{d} \frac{\mathcal{N}(x_t \mid 0, 0.01 I)}{\mathcal{N}(x_t \mid 0, 0.0225 I)} &\leq c
\end{aligned}
\]

where we assume that the initial state distribution is the same for the proposal and target. The term in the product will be maximized when x_t = [0, 0] for all t. Plugging this result into the product and assuming a depth of 40, we find that

\[
\left(\frac{\mathcal{N}(0 \mid 0, 0.01 I)}{\mathcal{N}(0 \mid 0, 0.0225 I)}\right)^{40} \leq c
\qquad\Longrightarrow\qquad
1.2226 \times 10^{14} \leq c
\]

Therefore, the tightest value we can select for c is 1.2226 × 10^14. Using this value, our acceptance probabilities end up being very small (on the order of 10^−15), and rejection sampling is inefficient.
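The bound derived in this example can be checked numerically with the density ratio of the per-step disturbance distributions, which is maximized at xₜ = 0 (a small check, not code from the text):

using Distributions, LinearAlgebra

p_xt = MvNormal(zeros(2), 0.1^2 * I)    # per-step target disturbance distribution
q_xt = MvNormal(zeros(2), 0.15^2 * I)   # per-step proposal disturbance distribution
ratio = pdf(p_xt, zeros(2)) / pdf(q_xt, zeros(2))   # equals (0.15 / 0.1)^2 = 2.25
c = ratio^40                                        # ≈ 1.2226 × 10^14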
6.3 Markov Chain Monte Carlo

Markov chain Monte Carlo (MCMC) algorithms generate samples from a target distribution by sampling from a Markov chain.3 A Markov chain is a sequence of random variables where each variable depends only on the previous one. MCMC algorithms begin by initializing a Markov chain with an initial sample τ. At each iteration, they use the current sample τ to generate a new sample τ′ by sampling from a kernel distribution g(· | τ).

3. A detailed overview of MCMC techniques is provided in C. P. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 1999, vol. 2.
6.3.1 Metropolis-Hastings

One of the most common MCMC algorithms is the Metropolis-Hastings algorithm.5 The Metropolis-Hastings algorithm accepts a new sample τ′ given the current sample τ with probability

\[
\frac{\bar{p}(\tau')\, g(\tau \mid \tau')}{\bar{p}(\tau)\, g(\tau' \mid \tau)} \tag{6.3}
\]

where p̄ is the unnormalized target density. To sample from the failure distribution, we set p̄ = 1{τ ∉ ψ}p(τ). Since we are taking a ratio of the densities, the target density does not need to be normalized. The kernel g(· | τ) is often chosen to be a symmetric distribution, meaning that g(τ′ | τ) = g(τ | τ′).6 In this case, the acceptance criterion reduces to p̄(τ′)/p̄(τ). Intuitively, if τ′ is more likely than τ, it is always accepted. If τ′ is less likely than τ, it is accepted with probability equal to the ratio of the densities.

5. W. K. Hastings, "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, vol. 57, no. 1, pp. 97–109, 1970.

6. When the kernel is symmetric, the algorithm is called the Metropolis algorithm: N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, "Equation of State Calculations by Fast Computing Machines," Journal of Chemical Physics, vol. 21, no. 6, pp. 1087–1092, 1953. A common choice of a symmetric kernel is a Gaussian distribution centered at the previous sample.
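A minimal sketch of Metropolis-Hastings with a symmetric kernel is shown below. It assumes the unnormalized target density p̄ is positive at the initial sample and that g(τ) returns a distribution we can sample from, and it may differ in its details from algorithm 6.2 in the text.

function metropolis_hastings(p̄, g, τ; m=1000)
    samples = []
    for _ in 1:m
        τ′ = rand(g(τ))              # propose a new sample from the kernel
        if rand() < p̄(τ′) / p̄(τ)     # accept with probability min(1, ratio)
            τ = τ′                   # move the chain to the accepted sample
        end
        push!(samples, τ)
    end
    return samples
end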
6.3.2 Smoothing
When we use algorithm 6.2 to sample from the failure distribution, we will not
accept any samples that are not failures because p̄(τ) = 1{τ ∉ ψ}p(τ) will be 0
for those samples. While this behavior is necessary for the algorithm to converge
to the failure distribution in the limit of infinite samples, it can create challenges
in practice. For example, if we initialize the Markov chain to a safe trajectory, the
algorithm will reject all samples from g(· | τ ) until it samples a failure. Since
g(· | τ ) typically produces trajectories similar to τ, we may require many samples
Example 6.2. Example of a Gaussian kernel for the inverted pendulum system.

To define a Gaussian kernel for the inverted pendulum system, we must first define a trajectory distribution type (algorithm 4.3) for the pendulum. The
following code defines a trajectory distribution for the pendulum system
that uses a Gaussian distribution for the initial state and a vector of Gaussian
distributions for the observation disturbance distributions:
struct PendulumTrajectoryDistribution <: TrajectoryDistribution
μ₁ # mean of initial state distribution
Σ₁ # covariance of initial state distribution
μs # vector of means of length d
Σs # vector of covariances of length d
end
function initial_state_distribution(p::PendulumTrajectoryDistribution)
return MvNormal(p.μ₁, p.Σ₁)
end
function disturbance_distribution(p::PendulumTrajectoryDistribution, t)
D = DisturbanceDistribution((o)->Deterministic(),
(s,a)->Deterministic(),
(s)->MvNormal(p.μs[t], p.Σs[t]))
return D
end
depth(p::PendulumTrajectoryDistribution) = length(p.μs)
We can then define a kernel for the pendulum system that returns an instan-
tiation of this distribution as follows:
function inverted_pendulum_kernel(τ; Σ=0.01I)
μ₁ = τ[1].s
μs = [step.x.xo for step in τ]
return PendulumTrajectoryDistribution(μ₁, Σ, μs, [Σ for step in τ])
end
The new distribution is centered at the initial state and observation distur-
bances of the current sample. We can use this kernel with algorithm 6.2 to
sample from the failure distribution of the inverted pendulum system.
before we sample a failure to accept, especially if τ is far from the failure region.7 We see this behavior during the burn-in period in figure 6.6.

Another challenge arises when the failure distribution has multiple modes. To move between modes, the algorithm must sample a failure from one failure mode using a kernel conditioned on a trajectory from another. If the failure modes are spread out in the trajectory space, the algorithm may require a large number of samples before moving from one mode to another. Example 6.3 illustrates these challenges on a simple Gaussian system.

7. One way to avoid this behavior is to ensure that the initial trajectory is a failure. The algorithms in chapters 4 and 5 can be used to search for an initial failure trajectory.
Smoothing is a technique that addresses these challenges by modifying the target density to make it easier to sample from.8 It relies on a notion of the distance to failure, which we will write as ∆(τ) for a given trajectory τ. This distance is a nonnegative number that measures how close τ is to a failure. For failure trajectories, ∆(τ) should be 0. We can rewrite the target density in terms of this distance as

\[
\bar{p}(\tau \mid \tau \not\in \psi) = \mathbb{1}\{\Delta(\tau) \leq 0\}\, p(\tau) \tag{6.4}
\]

The indicator function causes sharp boundaries between safe and unsafe trajectories. To create a smooth version of this density, we replace the indicator function with a Gaussian distribution with mean 0 and a small standard deviation. The resulting smoothed density is

\[
\bar{p}(\tau \mid \tau \not\in \psi) \approx \mathcal{N}\!\left(\Delta(\tau) \mid 0, \epsilon^2\right) p(\tau) \tag{6.5}
\]

where ϵ is the standard deviation.

For systems with temporal logic specifications, we can specify the distance function using temporal logic robustness (section 3.5.2). Since robustness is positive when the formula is satisfied and negative when it is violated, we can write the distance function as

\[
\Delta(\tau) = \max(0, \rho(\tau)) \tag{6.6}
\]

where ρ(τ) is the robustness of the trajectory τ. Figure 6.7 shows the smoothed version of the failure distribution in figure 6.1 for different values of ϵ. As ϵ approaches 0, the smoothed density approaches the shape of the failure distribution. As ϵ approaches infinity, the smoothed density approaches the shape of the nominal distribution.

Figure 6.7. Smoothed versions of the failure distribution in figure 6.1 for different values of ϵ (ϵ = 0.8, 0.5, 0.2, and no smoothing). As ϵ decreases, the smoothed distribution approaches the failure distribution.

8. H. Delecki, A. Corso, and M. J. Kochenderfer, "Model-Based Validation as Probabilistic Inference," in Conference on Learning for Dynamics and Control (L4DC), 2023.
Example 6.3. Example of the challenges of using MCMC to sample from the failure distribution given a finite sample budget. The plot on the left demonstrates the challenges with initialization, and the plot on the right shows the challenges of sampling from failure distributions with multiple modes.

Suppose we want to sample from the failure distribution shown in the plot on the left and we initialize our Markov chain with τ = 1. We will not accept a new sample until we draw a sample with a value less than −1. If we use a Gaussian kernel with standard deviation 1, we have that

g(τ′ | τ) = N(τ′ | 1, 1²)

The probability of drawing a sample less than −1 from this distribution is 0.02275 (corresponding to the shaded region in the plot on the left). Therefore, we will require 44 samples on average before the algorithm accepts a sample. If we were to initialize the algorithm with a sample even further from the failure region, we would require even more samples, to the point where MCMC may not converge within a finite sample budget.
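These numbers can be checked directly with Distributions.jl (the kernel and acceptance threshold are as stated above):

using Distributions
p_accept = cdf(Normal(1, 1), -1)   # probability of proposing a value below −1, ≈ 0.02275
1 / p_accept                        # expected number of proposals before one is accepted, ≈ 44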
The plot on the right demonstrates the challenge of using MCMC to sample from a failure distribution with multiple modes. In this case, the current sample is in the mode on the left at −2.2. Using the same Gaussian kernel, we have

g(τ′ | τ) = N(τ′ | −2.2, 1²)

The probability of moving to the other mode from this point is 1.3346 × 10⁻⁵. Therefore, we will require a large number of samples before we switch modes.
algorithm toward different failure modes. Figure 6.9 compares the path taken by a Gaussian kernel with the path taken by MALA on a simple target density.
Example 6.4. Applying smoothing to sample from the failure distribution of the inverted pendulum system. Smoothing allows MCMC to sample from both failure modes given a finite sample budget. The plot below shows the result of running MCMC without smoothing.

The inverted pendulum system has two main failure modes: tipping over in the negative direction and tipping over in the positive direction. To observe the effect of smoothing on the performance of MCMC, we define the following two unnormalized target densities:

p = NominalTrajectoryDistribution(inverted_pendulum, 21) # depth = 21
p̄(τ) = isfailure(ψ, τ) * pdf(p, τ)
function p̄_smooth(τ; ϵ=0.15)
    Δ = max(robustness([step.s for step in τ], ψ.formula), 0)
    return pdf(Normal(0, ϵ), Δ) * pdf(p, τ)
end

The plot in the margin shows the results when we run algorithm 6.2 using p̄ as the target density. We will not accept any samples that are not failures, and we only observe failures from one failure mode. The plots below show the results when we use p̄_smooth combined with rejection sampling. Smoothing allows us to sample failures from both failure modes. However, we now draw some samples that are not failures during the MCMC (left), so we must reject them after the algorithm has terminated to recover the failure distribution (right).
The MALA kernel enables MCMC to move more efficiently toward regions of high likelihood.
Algorithm 6.3 writes the rollout function as a probabilistic program that can be used to sample from the smoothed failure distribution.16

Footnote 16: We use a probabilistic programming package written for the Julia language called Turing.jl.

Similar to algorithm 4.5, the probabilistic programming model samples an initial state from the initial state distribution and steps the system forward in time by sampling from the disturbance distribution at each time step. However, rather than explicitly drawing
the samples, the model only specifies the distributions from which the samples
are drawn. The probabilistic programming tool handles the sampling and keeps
track of the probability associated with each draw automatically.
To specify that we want to sample failure trajectories, we add a log probability term for the smoothed indicator function in equation (6.5). Probabilistic programming tools often perform operations in log space for numerical stability. Adding this term in log space is equivalent to multiplying the target density by the smoothed indicator function. Example 6.5 demonstrates how to use algorithm 6.3 to sample from the failure distribution of the inverted pendulum system. It runs the algorithm twice to produce two chains that capture two distinct failure modes. In addition to smoothing, running multiple MCMC chains from different starting points is another method to improve performance of MCMC for failure distributions with multiple modes.
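The following is a minimal sketch of what such a probabilistic program might look like in Turing.jl; it is not the book's algorithm 6.3, and it assumes hypothetical helpers: an initial state distribution ps, a differentiable state update step(sys, s, x), and a differentiable distance-to-failure function Δ:

using Turing, Distributions, LinearAlgebra

@model function failure_rollout(sys, ps, Δ, d; ϵ=0.15, Σx=0.1^2 * I(2))
    s ~ ps                                    # initial state drawn by the program
    x ~ filldist(MvNormal(zeros(2), Σx), d)   # one disturbance column per time step
    states = [s]
    for t in 1:d
        s = step(sys, s, x[:, t])
        push!(states, s)
    end
    Turing.@addlogprob! logpdf(Normal(0, ϵ), Δ(states))   # smoothed indicator, equation (6.5)
end

# Hypothetical usage mirroring the setup in example 6.5:
# chain = sample(failure_rollout(sys, ps, Δ, 20), Turing.NUTS(10, 0.65), 1000)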
Example 6.5. Sampling from the failure distribution of the inverted pendulum system using algorithm 6.3. The plot shows the result of running the algorithm twice to produce two MCMC chains. The initial samples that are not failures are discarded during the burn-in period.

To use probabilistic programming to sample from the failure distribution of the inverted pendulum system, we can use the following code to set up the MCMC algorithm and distance function:

mcmc_alg = Turing.NUTS(10, 0.65, max_depth=6)
Δ(𝐬) = max(robustness(𝐬, ψ.formula, w=1.0), 0)

The code sets up the No U-Turn Sampler (NUTS) MCMC algorithm. Since NUTS relies on the gradient of the target density, we use smoothed robustness in the distance function so that the gradient exists. The first two parameters in the NUTS constructor are the number of adaptation steps and the target acceptance rate. The plot shows the result of running algorithm 6.3 with the specified parameters. We run the algorithm twice to produce two MCMC chains. Running multiple chains from different starting points is another method to improve performance for failure distributions with multiple modes.
6.5 Summary
• Markov chain Monte Carlo (MCMC) algorithms sample from the target distribution by drawing samples from a Markov chain and scale well to high-dimensional systems.
7 Failure Probability Estimation
After searching for the potential failure modes of a system, we may also want to
estimate its probability of failure. This chapter presents several techniques for
estimating this quantity from samples. We begin by discussing a direct estimation
approach that uses samples from the nominal trajectory distribution to estimate
the probability of failure. If failures are rare, this approach may be inefficient and
require a large number of samples to produce a good estimate. The remainder of
the chapter discusses more efficient estimation techniques based on importance
sampling. Importance sampling techniques artificially increase the likelihood of
failure trajectories by sampling from a proposal distribution. We discuss several
variations of importance sampling and conclude by presenting a nonparametric
algorithm that estimates the probability of failure from a sequence of samples.
The probability of failure for a given system and specification is defined mathematically as

pfail = Eτ∼p(·)[1{τ ∉ ψ}] = ∫ 1{τ ∉ ψ} p(τ) dτ    (7.1)

where 1{·} is the indicator function. The expectation is taken over the nominal trajectory distribution for the system.1 Given a set of m trajectories from this distribution, we can produce an estimate p̂fail of the probability of failure by treating the problem as a parameter learning problem, where the parameter of interest is the parameter of a Bernoulli distribution. We can then apply the maximum likelihood or Bayesian methods from chapter 2 to calculate p̂fail.

Footnote 1: Note that the right-hand side of equation (7.1) is equivalent to the denominator in equation (6.1). In other words, the probability of failure is the normalizing constant for the failure distribution.
where n is the number of samples that resulted in a failure and m is the total
number of samples. Algorithm 7.1 uses direct sampling to implement this estima-
tor. It performs m rollouts and computes the probability of failure according to
equation (7.2).
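A minimal sketch of this direct estimator is shown below (the interface of algorithm 7.1 may differ), assuming rollout and isfailure behave as in the earlier chapters:

using Statistics

function direct_estimate(sys, ψ, m, d)
    τs = [rollout(sys; d) for _ in 1:m]          # m rollouts of the system
    return mean(isfailure(ψ, τ) for τ in τs)     # fraction of failing trajectories, equation (7.2)
end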
We can evaluate the accuracy of an estimator using metrics such as bias, consis-
tency, and variance (example 7.1). Equation (7.2) provides an empirical estimate
of the probability of failure by computing the sample mean of a set of samples
drawn from a Bernoulli distribution with parameter pfail . The sample mean is an
unbiased estimator of the true mean of a Bernoulli distribution, so the estimator is
unbiased. We can calculate the variance of this estimator by dividing the variance
of a Bernoulli distribution by the number of samples:
pfail (1 − pfail )
Var[ p̂fail ] = (7.3)
m
The square root of this quantity is known as the standard error of the estimator. A
lower variance means that the sample mean will be closer to the true mean on
average and therefore indicates a more accurate estimator. In the limit of infinite
samples, the variance approaches zero, so the estimator is consistent.
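As an illustrative calculation (with assumed values), a true failure probability of 0.01 estimated with m = 10,000 samples gives a standard error of roughly 0.001:

pfail, m = 0.01, 10_000
sqrt(pfail * (1 - pfail) / m)   # ≈ 0.000995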
Example 7.1. Common metrics used to evaluate estimators. The plots show predictions of three different estimators with shaded regions to represent the variance.

Bias, consistency, and variance are three common properties used to evaluate the quality of an estimator. An estimator that produces p̂fail is unbiased if it predicts the true value in expectation:

E[p̂fail] = pfail

For example, given a set of samples drawn independently from the same distribution, the sample mean is an unbiased and consistent estimator of the distribution's true mean. The variance of the estimator quantifies the spread of the estimates around the true value. For the sample mean example, the variance will decrease as the number of samples increases. The plots below illustrate these concepts. The shaded regions reflect the variance of the estimator.
Example 7.2. The empirical mean and variance of the direct estimator for the grid world problem computed over 10 trials of algorithm 7.1. The depth d is set to 50 and the probability of slipping is set to 0.8. The blue line shows the mean of p̂fail for all 10 trials, and the shaded region represents one standard deviation above and below the mean.

We demonstrate the effect of equation (7.3) empirically by running 10 trials of algorithm 7.1 on the grid world problem. We compute the empirical mean and variance of p̂fail across all 10 trials after each new sample. The plot below shows the results of this experiment.
In addition to the number of samples, the true probability of failure pfail also
has an impact on the relative accuracy of the estimator. As the true probability
of failure decreases, the number of samples required to achieve a given level of
accuracy increases (see exercise 7.1). For systems in which failure events are rare,
we may require a large number of samples to produce an accurate estimate for
the probability of failure using algorithm 7.1. Section 7.2 introduces importance
sampling, which can be used to improve the efficiency in these scenarios.
We are interested in the quantity p(pfail < δ), which is the probability that the true probability of failure is less than or equal to δ. This quantity is given by the cumulative distribution function of the posterior distribution over the probability of failure.2 The quantiles of the posterior distribution can be used to compute confidence intervals in a similar manner. Example 7.3 demonstrates this process.

Footnote 2: The cumulative distribution function of a Beta distribution is the regularized incomplete beta function. Software packages such as Distributions.jl provide implementations of both the cumulative distribution function and the quantile function for the Beta distribution.

7.2 Importance Sampling

Importance sampling algorithms increase the efficiency of sampling-based estimation techniques. Instead of sampling from the nominal trajectory distribution p, they sample from a proposal distribution q that assigns higher likelihood to areas of greater "importance."3

Footnote 3: This proposal distribution has similar properties to the proposal distribution introduced in section 6.2 for rejection sampling.

To estimate the probability of failure using these samples, we must transform the expectation in equation (7.1) to an expectation over q:
pfail = Eτ∼p(·)[1{τ ∉ ψ}]
      = ∫ p(τ) 1{τ ∉ ψ} dτ
      = ∫ (q(τ)/q(τ)) p(τ) 1{τ ∉ ψ} dτ
      = ∫ q(τ) (p(τ)/q(τ)) 1{τ ∉ ψ} dτ
      = Eτ∼q(·)[(p(τ)/q(τ)) 1{τ ∉ ψ}]    (7.7)
For equation (7.7) to be valid, we require that q(τ) > 0 wherever p(τ) 1{τ ∉ ψ} > 0. This condition is satisfied as long as the proposal distribution assigns a nonzero likelihood to all failure trajectories that are possible under p.

Given samples from q(·), we can estimate the probability of failure based on equation (7.7) as

p̂fail = (1/m) ∑_{i=1}^{m} (p(τi)/q(τi)) 1{τi ∉ ψ}    (7.8)
Algorithm 7.3 implements this estimator. Equation (7.8) is an unbiased estimator of the true probability of failure. It corresponds to a weighted average of samples from the proposal distribution:

p̂fail = (1/m) ∑_{i=1}^{m} wi 1{τi ∉ ψ}    (7.9)
Example 7.3. Quantifying uncertainty in the probability estimate produced by algorithm 7.2. The plots show the posterior distribution Beta(1, 101). The shaded region in the first plot represents the probability that the true probability of failure is less than or equal to 0.01 (approximately 0.64). The shaded region in the second plot shows the 95% confidence bound.

Suppose that we run algorithm 7.2 on the collision avoidance problem with m = 100 samples and observe no failures. Assuming we begin with a uniform prior, the posterior distribution over the probability of failure is Beta(1, 101). Suppose we are also given a safety requirement for the system stating that pfail must not exceed 0.01. We can compute p(pfail < 0.01) from the cumulative distribution function of the beta distribution using the following code:

using Distributions
posterior = Beta(1, 101)
confidence = cdf(posterior, 0.01)
where the weights are wi = p(τi )/q(τi ). These weights are sometimes referred to
as importance weights. Trajectories that are more likely under the nominal trajectory
distribution have higher importance weights.
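A minimal sketch of this estimator is shown below (the interface of algorithm 7.3 may differ); it assumes a rollout variant that draws disturbances from the proposal distribution q and that pdf evaluates trajectory densities as elsewhere in the book:

using Statistics

function importance_sampling_estimate(sys, ψ, p, q, m, d)
    τs = [rollout(sys, q; d) for _ in 1:m]                      # trajectories sampled under q
    ws = [pdf(p, τ) / pdf(q, τ) for τ in τs]                    # importance weights
    return mean(w * isfailure(ψ, τ) for (w, τ) in zip(ws, τs))  # weighted average, equation (7.9)
end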
The optimal proposal distribution is the failure distribution itself:

q*(τ) = p(τ) 1{τ ∉ ψ} / pfail    (7.11)
Example 7.4. Performance comparison of two hand-designed proposal distributions for the simple Gaussian problem where failures occur at values less than −2 (red region). The first plot shows the nominal distribution and two possible proposal distributions. The second plot shows the estimation error for direct estimation compared to the estimation error of importance sampling for the two distributions.

Consider the simple Gaussian problem shown below where failures occur at values less than −2 (red shaded region). The plot on the left shows two proposal distributions we could use for importance sampling. The first proposal distribution q1 shifts the nominal distribution toward the failure region and assigns high likelihood to likely failure trajectories. The second proposal distribution q2 is shifted toward the failure region, but it still does not assign high likelihood to likely failures. Therefore, we expect q1 to result in better estimates than q2.

The plot on the right shows the estimation error when performing importance sampling with each proposal distribution compared to direct estimation. The shaded region represents the 90% empirical confidence bounds on the error. As expected, q1 results in a lower estimation error and a lower variance than q2 and direct estimation. Performing importance sampling with q2 results in worse performance than direct estimation.
where the weight for each sample is computed using only the proposal that was
used to generate it. This weighting scheme, which we refer to as standard MIS (s-
MIS), is most similar to the importance sampling estimator for a single proposal
distribution (equation (7.8)).
Instead of considering each proposal individually, we can also view the sam-
ples as if they were drawn in a deterministic order from a mixture distribution
composed of all proposal distributions. This paradigm leads to the deterministic
mixture weighting scheme (DM-MIS):
wi = p(τi) / ((1/m) ∑_{j=1}^{m} qj(τi))    (7.14)
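The two weighting schemes can be written compactly as follows (a hedged sketch; τs[j] is assumed to be drawn from proposal qs[j], and p and qs support pdf as elsewhere in this chapter):

using Statistics

s_mis_weights(p, qs, τs) = [pdf(p, τ) / pdf(q, τ) for (q, τ) in zip(qs, τs)]       # weight uses only the generating proposal
dm_mis_weights(p, qs, τs) = [pdf(p, τ) / mean(pdf(q, τ) for q in qs) for τ in τs]  # mixture of all proposals, equation (7.14)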
Example 7.5. Performance comparison of importance sampling to multiple importance sampling for a two-dimensional Gaussian problem. The plot on the left shows a single proposal distribution that can be used for importance sampling on this problem, while the plot on the right shows a set of proposal distributions that can be used for MIS. The plot below shows the estimation error for IS compared to the estimation error of MIS with the two different weighting schemes.

Suppose we want to estimate the probability of failure for the two-dimensional Gaussian system shown below. The nominal distribution is a multivariate Gaussian distribution with a mean at the center of the figure, and the failure region is composed of the two shaded red regions. The plots below show the log density of both distributions.

Most of the probability mass for the failure distribution is concentrated in the central corners of the two modes, and a good proposal distribution should assign high likelihood in those areas. If we only use one multivariate Gaussian proposal distribution, we need to select a wide distribution to ensure that it covers both failure modes (left). We can improve performance by selecting multiple proposal distributions that together cover both failure modes (right). The plot in the caption compares the performance of importance sampling (IS) to multiple importance sampling with the two different weighting schemes. The shaded region represents the 90% empirical confidence bounds on the error. The DM-MIS weighting scheme results in better performance than the s-MIS weighting scheme.
density and failure density.6 In these cases, the weight for a given sample τ drawn from the distribution q is

w = 1{τ ∉ ψ} p(τ) / q(τ)    (7.15)

where p is the nominal trajectory distribution.

Footnote 6: Minimizing the cross entropy is equivalent to weighted maximum likelihood estimation for distributions in the natural exponential family. The natural exponential family includes many common distributions such as the Gaussian, geometric, exponential, categorical, and Beta distributions.

If failures are rare under the initial proposal distribution, it is possible that no samples will be failures, and the weights computed in equation (7.15) will all be zero. To address this challenge, the cross entropy algorithm iteratively solves a relaxed version of the problem that relies on an objective function f. Similar to the objective functions introduced in section 4.5, the objective function should assess how close a trajectory is to a failure.7 The objective value must be greater than zero for trajectories that are not failures and less than or equal to zero for failure trajectories. For systems with temporal logic specifications, we can use the robustness as the objective function.

Footnote 7: For example, an objective function for the aircraft collision avoidance problem might output the miss distance between the two aircraft.
We can rewrite the goal of the cross entropy method in terms of the objective function as finding the set of parameters that minimizes the cross entropy between the proposal distribution and p(τ | f(τ) ≤ 0). For systems with rare failure events, we gradually make progress toward this goal by solving a series of relaxed problems where we instead minimize the cross entropy between the proposal and p(τ | f(τ) ≤ γ) for a given threshold γ > 0. The weights used in maximum likelihood estimation for the relaxed problem are

w = 1{f(τ) ≤ γ} p(τ) / q(τ)    (7.16)

At each iteration, we select the threshold γ based on our current set of samples to ensure that a fraction of the weights will be nonzero (figure 7.4).

Figure 7.4. Threshold selection for a two-dimensional Gaussian problem with two failure modes. The red shaded region shows the failure region. None of the current samples overlap with the failure region, so we relax the problem by expanding to the blue region that contains a desired fraction of the samples. The blue samples are the top 10% of samples with the lowest objective values.

Algorithm 7.5 implements the cross entropy method. At each iteration, we draw samples from the current proposal distribution and compute their objective values. We then select the threshold γ as the highest objective value from a set of elite samples. The elite samples are the m_elite samples with the lowest objective values. Since our ultimate goal is to approach the failure distribution, we ensure that the threshold does not become negative by clipping it at zero. Given this threshold, we compute the weights using equation (7.16) and fit a new proposal distribution to the samples. After repeating this process for a fixed number of iterations, algorithm 7.5 performs importance sampling (algorithm 7.3) with the
using the samples produced across all iterations of the algorithm to estimate the probability of failure. They keep track of the proposal distribution used to generate the samples at each iteration and view the problem as an instance of MIS. They produce an estimate using the weighting schemes from section 7.2.3. It is important to note in this case, however, that the proposal for each iteration depends on the previous proposal. Since the proposals are not independent from one another, the DM-MIS weighting scheme will no longer be unbiased.
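The elite-threshold update described above can be sketched as follows (this is not the book's algorithm 7.5; rand is assumed to draw trajectories from the proposal, and fit_proposal is a hypothetical weighted maximum likelihood fit):

using Distributions

function cross_entropy_step(q, p, f, m, m_elite, fit_proposal)
    τs = [rand(q) for _ in 1:m]                                # sample from the current proposal
    γ = max(sort(f.(τs))[m_elite], 0.0)                        # elite threshold, clipped at zero
    ws = [(f(τ) ≤ γ) * pdf(p, τ) / pdf(q, τ) for τ in τs]      # weights from equation (7.16)
    return fit_proposal(τs, ws)                                # refit the proposal to the weighted samples
end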
The performance of the cross entropy algorithm is sensitive to the form of the proposal distribution. The algorithm may perform poorly if the proposal distribution is not expressive enough to adequately capture the shape of the failure distribution. This behavior is particularly apparent for complex systems with high-dimensional, multimodal failure distributions. For example, if we select a Gaussian proposal distribution for a system with two failure modes, the algorithm will struggle to find a proposal distribution that captures both failure modes (figure 7.5). One solution is to use a mixture of Gaussians for multimodal failure distributions (figure 7.7), but this approach requires knowing the number of failure modes in advance, which is often not possible in practice. In these cases, an adaptive MIS approach such as population Monte Carlo may perform better.

Figure 7.6. The cross entropy method for a one-dimensional Gaussian problem. The plot shows the Gaussian proposal distribution at each iteration of the algorithm with darker distributions representing later iterations. The distributions start at the nominal distribution and gradually move toward the failure distribution (red).

7.3.2 Population Monte Carlo
Population Monte Carlo (PMC) is an adaptive MIS algorithm that maintains a set, or population, of proposal distributions (algorithm 7.6).8 Figure 7.8 shows a single step of the algorithm. We begin with an initial population of m proposals that is spread across the space of proposal distributions. For example, we could use a set of multivariate Gaussian distributions with a fixed covariance and different means. It is important to ensure that the initial population is sufficiently diverse to capture all failure modes.

Figure 7.7. Example of a Gaussian mixture model proposal distribution for a one-dimensional problem with two failure modes. The proposal distribution is a mixture of two Gaussians (blue) that approximates the multimodal failure distribution (red).

Footnote 8: O. Cappé, A. Guillin, J.-M. Marin, and C. P. Robert, "Population Monte Carlo," Journal of Computational and Graphical Statistics, vol. 13, no. 4, pp. 907–929, 2004.

At each iteration, the algorithm draws a single sample from each proposal distribution in the population. It then computes a weight for each sample in the same way the weights are computed for the cross entropy method (equation (7.15)). Samples in regions of high likelihood under the failure distribution will receive higher weights. To adapt the proposal distributions, PMC uses the weights to
Figure 7.8. One iteration of the population Monte Carlo algorithm (panels: initial proposals, sampling, weighting, resampling). For the weighting step, gray samples have zero weight, and the size of the blue samples is proportional to their weight.
perform a resampling step. In this step, we redraw m samples from the population
of samples with probability proportional to their weights. We then reconstruct the
population of proposal distributions using the resulting samples. For example,
if we are using proposals in the form of multivariate Gaussian distributions, we
could create new proposals with the same fixed covariance and means centered
at each sample.
Over time, the population of proposal distributions should cover high likeli-
hood regions of the failure distribution. After a fixed number of iterations, we
perform MIS using the final population to estimate the probability of failure. We
can use either of the weighting schemes from section 7.2.3 to produce the estimate.
Similar to the cross entropy method, we could instead use the samples produced
during all iterations of the algorithm to estimate the probability of failure, noting
that the estimate in this case may no longer be unbiased.
Using multiple proposal distributions allows us to represent complex, multi-
modal failure distributions. However, the performance of PMC is still dependent
on the number of proposal distributions and their ability to cover the space of
possible proposals. If the number of proposal distributions is too small or the ini-
tial proposals are not sufficiently diverse, the algorithm may miss failure modes
and produce an inaccurate estimate. Furthermore, the stochastic nature of the
resampling procedure can lead to a loss of diversity in the proposal distributions
over time. For example, the proposals may collapse to a single failure mode or a
subset of the failure modes.
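One iteration of PMC for the simple case of Gaussian proposals with a shared, fixed covariance can be sketched as follows (a hedged illustration for low-dimensional problems, not the book's algorithm 7.6; here p̄fail denotes the unnormalized failure density 1{τ ∉ ψ} p(τ) from the numerator of equation (7.15)):

using Distributions, StatsBase

function pmc_step(proposals, p̄fail, Σ)
    τs = [rand(q) for q in proposals]                                 # one sample per proposal
    ws = [p̄fail(τ) / pdf(q, τ) for (q, τ) in zip(proposals, τs)]      # importance weights
    idx = sample(1:length(τs), Weights(ws), length(τs))               # resample proportional to weight
    return [MvNormal(τs[i], Σ) for i in idx]                          # recenter proposals at resampled points
end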
7.4 Sequential Monte Carlo

The sampling, weighting, and resampling components of algorithm 7.6 form the basis for a more general framework used in the field of Bayesian inference called sequential Monte Carlo (SMC).9 In SMC, we start with samples from the nominal trajectory distribution and gradually adapt these samples to move toward the failure distribution. We then use the path of each sample to estimate the probability of failure.

Footnote 9: SMC is also known as particle filtering in the context of state estimation. M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A Tutorial on Particle Filters for Online Nonlinear/non-Gaussian Bayesian Tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

One way to adapt the samples in SMC is to move them through a sequence of intermediate distributions that gradually transition from the nominal distribution to the failure distribution. Specifically, we create a sequence of distributions. Figure 7.9 shows two methods for creating these intermediate distributions.
The first method uses the smoothing technique introduced in section 6.3.2. We can move from the nominal distribution to the failure distribution by gradually decreasing the value of the standard deviation ϵ in the smoothed density.10 The second method uses the same thresholding technique used in the cross entropy method. The intermediate distributions take the form p(τ | f(τ) ≤ γ) where f(τ) is the objective function and γ is a threshold. We move from the nominal distribution to the failure distribution by gradually decreasing the value of γ.

Footnote 10: A similar technique is the exponential tilting barrier presented in A. Sinha, M. O'Kelly, R. Tedrake, and J. C. Duchi, "Neural Bridge Sampling for Evaluating Safety-Critical Autonomous Systems," Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 6402–6416, 2020.

At each iteration of SMC, our goal is to transition samples from the current distribution gℓ to the next distribution in the sequence gℓ+1. We typically only
The accuracy of the estimator in equation (7.18) depends on how well the samples at each iteration represent the corresponding intermediate distribution. If the samples are not representative, the weights will be small, and the estimator will be inaccurate. However, we may require a large number of MCMC steps to transition samples from one distribution to the next, especially for samples that are unlikely under the next distribution. One technique used to address this challenge is to resample the trajectories based on their importance weights.12 This step is similar to the resampling step in PMC and tends to result in better coverage of the intermediate distributions (see example 7.6). After resampling, we reset the weights to the mean of the weights before resampling to ensure that the estimator in equation (7.18) remains accurate.

Footnote 12: P. Del Moral, A. Doucet, and A. Jasra, "Sequential Monte Carlo Samplers," Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 68, no. 3, pp. 411–436, 2006.
Algorithm 7.7 implements SMC given a nominal trajectory distribution and
a set of intermediate distributions. At each iteration, it perturbs the current set
of samples to represent the next distribution in the sequence using MCMC. Ex-
ample 7.7 provides an implementation of this step for the inverted pendulum
problem. The algorithm then updates the importance weights and performs the
resampling step. Finally, it returns an estimate of the probability of failure based
on equation (7.18).
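For instance, a sequence of smoothed intermediate densities in the spirit of section 6.3.2 could be written as follows (a hedged sketch of input one might pass to an SMC implementation like algorithm 7.7; Δ is the distance-to-failure function and p the nominal trajectory distribution, as elsewhere in this chapter):

using Distributions

ϵs = [1.0, 0.5, 0.25, 0.1, 0.05]                                 # decreasing smoothing widths (illustrative)
ḡs = [τ -> pdf(Normal(0, ϵ), Δ(τ)) * pdf(p, τ) for ϵ in ϵs]      # smoothed unnormalized densities, equation (6.5)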
Example 7.6. The benefit of resampling in SMC. The first set of plots illustrates the resampling step, and the second set of plots shows the improvement in the samples at the next iteration after performing resampling.

Consider the scenario shown below in which we want to transition samples from the blue distribution to the purple distribution using 10 MCMC steps per sample. The plots below illustrate the weighting and resampling steps. The plot in the middle shows the weights of the samples, with darker points having higher weights. The plot on the right shows the samples after resampling according to these weights. After resampling, the samples are more representative of the purple distribution.

On the next iteration, we perform MCMC starting at these samples with the purple distribution as the target to complete the transition. The plots below show the result of this step with and without resampling. The results without resampling start the MCMC chains at the blue samples shown above. The resampling step results in a set of samples that better represents the target distribution.
Example 7.7. Application of SMC to the inverted pendulum problem. The plots show the samples from the intermediate smoothed failure distributions.

To estimate the probability of failure for the inverted pendulum system using SMC, we implement the following function that uses 10 MCMC steps to transition samples between intermediate distributions:

function perturb(samples, ḡ)
    function inverted_pendulum_kernel(τ; Σ=0.05^2 * I)
        μs, Σs = [step.x.xo for step in τ], [Σ for step in τ]
        return PendulumTrajectoryDistribution(τ[1].s, Σ, μs, Σs)
    end
    k_max, m_burnin, m_skip, new_samples = 10, 1, 1, []
    for sample in samples
        alg = MCMCSampling(ḡ, inverted_pendulum_kernel, sample,
                           k_max, m_burnin, m_skip)
        mcmc_samples = sample_failures(alg, inverted_pendulum, ψ)
        push!(new_samples, mcmc_samples[end])
    end
    return new_samples
end

Using 1,000 samples per iteration, SMC estimates the probability of failure to be approximately 0.0001. The direct estimate for the probability of failure based on one million simulations is approximately 0.0005.
z2 = ∫ ḡ2(τ) dτ, and our goal is to estimate the ratio of the normalizing constants z1/z2 using samples from g2.

First, we rewrite z1 in terms of an expectation over g2:

z1 = ∫ ḡ1(τ) dτ    (7.19)
   = ∫ (g2(τ)/g2(τ)) ḡ1(τ) dτ    (7.20)
   = ∫ g2(τ) (ḡ1(τ)/(ḡ2(τ)/z2)) dτ    (7.21)
   = z2 ∫ g2(τ) (ḡ1(τ)/ḡ2(τ)) dτ    (7.22)
   = z2 Eτ∼g2(·)[ḡ1(τ)/ḡ2(τ)]    (7.23)
Dividing both sides of equation (7.23) by z2 gives us the ratio of the normalizing constants, which we can approximate using m samples from g2:

z1/z2 = Eτ∼g2(·)[ḡ1(τ)/ḡ2(τ)] ≈ (1/m) ∑_{i=1}^{m} ḡ1(τi)/ḡ2(τi)    (7.24)
where τi ∼ g2(·) and g2(τ) > 0 whenever g1(τ) > 0. Note that the estimator in equation (7.24) only requires evaluating the unnormalized densities ḡ1(τ) and ḡ2(τ). Since pfail is the normalizing constant of the failure distribution, we can use equation (7.24) to estimate the probability of failure by setting ḡ1(τ) equal to the unnormalized failure density and ḡ2(τ) equal to any normalized proposal density q(τ). In fact, these choices of ḡ1(τ) and ḡ2(τ) cause equation (7.24) to reduce to the importance sampling estimator in equation (7.8) (see exercises 7.2 and 7.3).15

Footnote 15: We could also use equation (7.24) to estimate the reciprocal of the probability of failure using samples from the failure distribution by setting ḡ2(τ) equal to the unnormalized failure density and ḡ1(τ) equal to any normalized density whose support is contained within the support of the failure distribution. However, selecting ḡ1(τ) to satisfy this condition is often difficult in practice and can lead to estimators with infinite variance. This technique is called reciprocal importance sampling. In general, this estimator should not be used for failure probability estimation.

If g1 and g2 have little overlap in terms of probability mass, the estimator in equation (7.24) may perform poorly. One technique to improve performance is called umbrella sampling (also known as ratio importance sampling). Umbrella sampling introduces a third density, called an umbrella density, that has significant overlap with both g1 and g2. We use this density to estimate the ratio of normalizing constants by applying equation (7.24) twice:
normalizing constants by applying equation (7.24) twice:
h i
ḡ1 (τ ) m ḡ1 (τi )
z1 z1 /zu E τ ∼ gu (·) ḡu (τ )
1
m ∑i =1 ḡu (τi )
= = h i ≈ ḡ (τ )
(7.25)
z2 z2 /zu E
ḡ2 (τ ) 1
∑m 2 i
τ ∼ gu (·) ḡu (τ ) m i =1 ḡu (τi )
The optimal umbrella density is

ḡu*(τ) ∝ |ḡ1(τ) − (z1/z2) ḡ2(τ)|    (7.26)
Similar to the optimal proposal for importance sampling, the optimal umbrella
density is expressed in terms of the quantity we are trying to estimate, so we
cannot compute it exactly. In general, we want to select an umbrella density that
is as close as possible to this density.
Another technique to estimate the ratio of normalizing constants when g1 and
g2 have little overlap is called bridge sampling. Similar to umbrella sampling, bridge
sampling introduces a third density called a bridge density. However, instead of
using samples from this density to estimate the ratio of normalizing constants,
bridge sampling uses samples from both g1 and g2 . Assuming we produce m1
samples from g1 and m2 samples from g2 , we again apply equation (7.24) twice
to obtain the bridge sampling estimator:
z1/z2 = (zb/z2) / (zb/z1) = Eτ∼g2(·)[ḡb(τ)/ḡ2(τ)] / Eτ∼g1(·)[ḡb(τ)/ḡ1(τ)] ≈ ((1/m2) ∑_{j=1}^{m2} ḡb(τj)/ḡ2(τj)) / ((1/m1) ∑_{i=1}^{m1} ḡb(τi)/ḡ1(τi))    (7.27)

The optimal bridge density is

ḡb*(τ) ∝ ḡ1(τ) ḡ2(τ) / (m1 ḡ1(τ) + m2 (z1/z2) ḡ2(τ))    (7.28)
which is again written in terms of the quantity we are trying to estimate. Given
samples from both g1 and g2 , we can use a simple iterative procedure to estimate
the optimal bridge density (algorithm 7.8). At each iteration, we apply equa-
tion (7.27) using the current bridge density to estimate the ratio of normalizing
constants. We then plug this ratio into equation (7.28) to obtain a new bridge
density. We repeat this process for a fixed number of iterations.
While umbrella sampling and bridge sampling both introduce a third density to
improve efficiency, they have different properties. For example, umbrella sampling
only requires samples from one density, while bridge sampling requires samples
from two different densities. Furthermore, the optimal umbrella density and the
optimal bridge density are very different (see figure 7.11). The optimal umbrella
Algorithm 7.8. Algorithm for estimating the optimal bridge density ḡb using samples from ḡ₁ and ḡ₂. We iteratively apply equation (7.27) to estimate the ratio of normalizing constants and use this ratio to update the bridge density using equation (7.28).

function bridge_sampling_estimator(g₁τs, ḡ₁, g₂τs, ḡ₂, ḡb)
    ḡ₁s, ḡ₂s = ḡ₁.(g₁τs), ḡ₂.(g₂τs)
    ḡb₁s, ḡb₂s = ḡb.(g₁τs), ḡb.(g₂τs)
    return mean(ḡb₂s ./ ḡ₂s) / mean(ḡb₁s ./ ḡ₁s)
end

function optimal_bridge(g₁τs, ḡ₁, g₂τs, ḡ₂, k_max)
    ratio = 1.0
    m₁, m₂ = length(g₁τs), length(g₂τs)
    ḡb(τ) = (ḡ₁(τ) * ḡ₂(τ)) / (m₁ * ḡ₁(τ) + m₂ * ratio * ḡ₂(τ))
    for k in 1:k_max
        ratio = bridge_sampling_estimator(g₁τs, ḡ₁, g₂τs, ḡ₂, ḡb)
    end
    return ḡb
end
density covers regions of high likelihood for both distributions, while the optimal
bridge density bridges the gap between the two distributions.
where wi = p(τi)/q̄(τi) and τi ∼ q(·). This estimator (algorithm 7.9) is similar to the estimator in equation (7.9) for normalized proposal distributions with the extra step of dividing the unnormalized importance weights by their sum.

The optimal proposal for self-IS is different from the optimal proposal for importance sampling. Based on equation (7.26), the optimal proposal for self-IS is

q*(τ) ∝ p(τ) |1{τ ∉ ψ} − pfail|    (7.30)

Sampling from this density should result in half of the samples coming from the failure distribution and half coming from the success distribution. The optimal proposal for IS, on the other hand, is the failure distribution itself. Figure 7.12 shows the optimal proposal distribution for self-IS on a simple Gaussian system. In practice, we can plug a guess for pfail into equation (7.30) to obtain a proposal distribution that is close to the optimal proposal. However, drawing samples from this proposal is often difficult in practice, especially for systems with rare failure events and multiple failure modes (see example 7.8). Furthermore, the performance of the algorithm tends to be sensitive to incorrect guesses for pfail when creating the proposal distribution. Bridge sampling, which we discuss in the next section, is less sensitive to these choices.

Figure 7.12. The optimal proposal for self-normalized importance sampling for a simple Gaussian problem with a failure threshold of −1.
Example 7.8. The challenges associated with sampling from the optimal self-IS proposal density. The plots show the proposal distribution for three different failure probabilities along with histograms of samples drawn from these distributions using MCMC. As the probability of failure decreases, the distributions become more difficult to accurately sample from.

Suppose we want to use self-IS to estimate the probability of failure for the simple Gaussian system. We know that the optimal proposal is of the form

q*(τ) ∝ p(τ) |1{τ ∉ ψ} − α|

where α is our guess for the probability of failure. The plots below show the proposal distribution for three different values of α along with histograms of samples drawn from these distributions using MCMC. For each distribution, we use 5,100 MCMC steps with a burn-in of 100 steps, keeping every 10th sample.
where we draw m1 samples from gℓ+1(·) and m2 samples from gℓ(·). The intermediate distributions in the chain should be chosen such that the ratio of normalizing constants between two consecutive distributions is easy to estimate. In other words, consecutive intermediate distributions should have significant overlap with each other. For example, we can create the intermediate distributions using either of the two methods in figure 7.9.17

Footnote 17: If we use the thresholding technique, the algorithm reduces to the multilevel splitting algorithm presented in section 7.6.

Algorithm 7.10 implements bridge sampling estimation using a sequence of intermediate distributions. It begins by drawing samples from the nominal trajectory distribution. At each iteration, it perturbs the samples to match the next intermediate distribution and estimates the optimal bridge density using algorithm 7.8. It then applies equation (7.32) to compute the ratio of normalizing constants between the two distributions. Finally, the algorithm applies equation (7.31) to compute an estimate for the probability of failure.
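A minimal sketch of this procedure is shown below (not the book's algorithm 7.10); it assumes perturb behaves as in example 7.7, reuses optimal_bridge and bridge_sampling_estimator from algorithm 7.8, and takes ḡs with ḡs[1] the normalized nominal density and ḡs[end] the unnormalized failure density, so that the product of consecutive ratios of normalizing constants equals the probability of failure:

function bridge_sampling_sequence_estimate(ḡs, samples, k_max)
    p̂fail = 1.0
    for ℓ in 1:length(ḡs)-1
        next_samples = perturb(samples, ḡs[ℓ+1])                                       # move samples to the next density
        ḡb = optimal_bridge(next_samples, ḡs[ℓ+1], samples, ḡs[ℓ], k_max)              # estimate the bridge density
        p̂fail *= bridge_sampling_estimator(next_samples, ḡs[ℓ+1], samples, ḡs[ℓ], ḡb)  # ratio of normalizing constants
        samples = next_samples
    end
    return p̂fail
end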
Example 7.9. Proof that a bridge sampling estimator that sets ḡ1(τ) to the unnormalized failure density and ḡ2(τ) to the nominal trajectory distribution can perform no better than the direct estimator for estimating the probability of failure.

To estimate the probability of failure from equation (7.27), we set ḡ1(τ) equal to the unnormalized failure density and ḡ2(τ) equal to the density of the nominal trajectory distribution:

pfail ≈ ((1/m2) ∑_{j=1}^{m2} ḡb(τj)/p(τj)) / ((1/m1) ∑_{i=1}^{m1} ḡb(τi)/(1{τi ∉ ψ} p(τi)))

The optimal bridge density is zero for all samples that are not failures. Since all the samples from the failure density will be failure samples, we have

(1/m1) ∑_{i=1}^{m1} ḡb(τi)/(1{τi ∉ ψ} p(τi)) = (1/m1) ∑_{i=1}^{m1} p(τi)/p(τi) = 1
The perturb step in algorithm 7.10 produces samples from the next distribution
in the sequence and can be performed using the MCMC algorithms presented in
chapter 6. However, these distributions may be difficult to sample from, especially
as we get closer to the failure distribution. In practice, we can greatly increase
efficiency by using the samples from the previous distribution as a starting point
to produce samples from the next distribution. This process is similar to the
process used in SMC (algorithm 7.7), in which we weight and resample the
trajectories from the previous distribution before applying MCMC. The ability
to adapt samples from the previous distribution is another benefit of using a
sequence of intermediate distributions. Figure 7.13 shows the samples from the
intermediate distributions for the continuum world problem.
As long as the thresholds gradually decrease in a way that ensures that the condi-
tional probabilities remain large, we can efficiently estimate these intermediate
probabilities using direct estimation.
To ensure that the conditional probabilities remain large, it is common to select the thresholds adaptively.19 Algorithm 7.11 implements adaptive multilevel splitting. Adaptive multilevel splitting begins by drawing samples from the nominal trajectory distribution. At each iteration, it computes the objective value for each sample and selects a threshold γ such that a fixed number of samples have objective values less than γ. It then uses this threshold and the current samples to estimate p(f(τ) ≤ γℓ | f(τ) ≤ γℓ−1).

Footnote 19: F. Cérou and A. Guyader, "Adaptive Multilevel Splitting for Rare Event Analysis," Stochastic Analysis and Applications, vol. 25, no. 2, pp. 417–443, 2007.

The algorithm produces the next set of samples by perturbing the current samples to represent the distribution p(τ | f(τ) ≤ γℓ). As with SMC and bridge
sampling, this step can be performed using the MCMC algorithms presented in
chapter 6. To improve the efficiency of the MCMC, we first resample by drawing
m samples uniformly from the elite samples.
To accurately estimate the probability of failure, the last iteration of the algo-
rithm must use a threshold of zero. Algorithm 7.11 iterates until the threshold
reaches zero, at which point all elite samples are failures. If we reach the maxi-
mum number of iterations before this criterion is met, the algorithm will force
the final threshold to be zero. However, if there are no failure samples in the final
iteration, the final conditional probability will be zero, causing the algorithm
to return an estimate of zero. Therefore, it is important to ensure that we allow
enough iterations for the algorithm to reach the final threshold.
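A hedged sketch of adaptive multilevel splitting following this description is shown below (the book's algorithm 7.11 may differ); rollout and the objective f are assumed, and perturb_to(samples, γ) is a hypothetical helper that runs MCMC targeting p(τ | f(τ) ≤ γ) from the given samples:

using Statistics

function adaptive_multilevel_splitting(sys, f, perturb_to, m, m_elite, k_max, d)
    τs = [rollout(sys; d) for _ in 1:m]                         # samples from the nominal distribution
    p̂fail = 1.0
    for k in 1:k_max
        γ = k == k_max ? 0.0 : max(sort(f.(τs))[m_elite], 0.0)  # adaptive threshold, forced to zero at the end
        p̂fail *= mean(f(τ) ≤ γ for τ in τs)                     # conditional probability estimate
        γ ≤ 0.0 && break                                        # all elite samples are failures
        elite = [τ for τ in τs if f(τ) ≤ γ]
        τs = perturb_to(rand(elite, m), γ)                      # resample elites uniformly, then perturb with MCMC
    end
    return p̂fail
end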
Multilevel splitting is considered a nonparametric algorithm in that we estimate
the probability of failure without assuming a specific form for the conditional
distributions. This feature allows multilevel splitting to extend to systems with
complex, multimodal failure distributions. Furthermore, the adaptive nature
of algorithm 7.11 allows us to smoothly transition from the nominal trajectory
distribution to the failure distribution without specifying the intermediate dis-
tributions ahead of time. Figure 7.14 shows an example of adaptive multilevel
splitting applied to the inverted pendulum system, which has two modes in its
failure distribution.
7.7 Summary
• For systems with rare failure events, we can use importance sampling to estimate the probability of failure by sampling from a proposal distribution that assigns higher likelihood to failure trajectories.
7.8 Exercises
Exercise 7.1. The coefficient of variation of a random variable is defined as the ratio of
the standard deviation to the mean and is a measure of relative variability. Compute the
coefficient of variation for the estimator in equation (7.2). For a fixed sample size m, how
does the coefficient of variation change as pfail increases? For a fixed pfail , how does the
coefficient of variation change as the sample size m increases?
Solution: The estimator in equation (7.2) has mean pfail and variance pfail(1 − pfail)/m, so its coefficient of variation is √(pfail(1 − pfail)/m) / pfail = √((1 − pfail)/(m pfail)). For a fixed m, the coefficient of variation will decrease as the true probability of failure pfail increases. For a fixed pfail, the coefficient of variation will decrease as the sample size m increases. The plots here show an example of these relationships.
Exercise 7.2. Show that equation (7.24) reduces to equation (7.2) when q̄1(τ) = 1{τ ∉ ψ} p(τ) and q̄2(τ) = p(τ).
Solution: Since q̄1 is the unnormalized failure distribution, its normalizing constant is the probability of failure (z1 = pfail). Since q̄2 is the normalized nominal distribution, its normalizing constant is z2 = 1. Plugging these values into equation (7.24) gives

pfail/1 = Eτ∼p(·)[1{τ ∉ ψ} p(τ)/p(τ)]
pfail = Eτ∼p(·)[1{τ ∉ ψ}]

Given samples from the nominal distribution τi ∼ p(·), we can approximate the above equation as

p̂fail = (1/m) ∑_{i=1}^{m} 1{τi ∉ ψ}
Exercise 7.3. Show that equation (7.24) reduces to equation (7.8) when q̄1(τ) = 1{τ ∉ ψ} p(τ) and q̄2(τ) = q(τ).
Solution: Since q̄1 is the unnormalized failure distribution, its normalizing constant is the probability of failure (z1 = pfail). Since q̄2 is a normalized proposal distribution, its normalizing constant is z2 = 1. Plugging these values into equation (7.24) gives

pfail/1 = Eτ∼q(·)[1{τ ∉ ψ} p(τ)/q(τ)]
pfail = Eτ∼q(·)[1{τ ∉ ψ} p(τ)/q(τ)]
Given samples from the proposal distribution τi ∼ q(·), we can approximate the above equation as

p̂fail = (1/m) ∑_{i=1}^{m} 1{τi ∉ ψ} p(τi)/q(τi)
8 Reachability for Linear Systems
Forward reachability algorithms compute the set of states a system could reach over a given time horizon. To perform this analysis, we need to make some assumptions about the initial state and disturbances for the system. In the previous chapters, we sampled initial states and disturbances from probability distributions, often with support over the entire real line. However, to perform reachability computations, we need to restrict the initial states and disturbances to bounded sets.1

Footnote 1: One way to convert a probability distribution to a bounded set is to use the support of the distribution. If the support of the distribution spans the entire real line, we can select a region that contains most of the probability mass.

We assume that the initial state comes from a bounded set S and that the disturbances at each time step come from a bounded set X. The disturbance set
178 chapter 8. reachability for linear systems
X is defined as follows:
X = {[xa; xo; xs] | xa ∈ Xa, xo ∈ Xo, xs ∈ Xs}   (8.1)
where Xa, Xs, and Xo are the disturbance sets for the agent, environment, and sensor, respectively.
Given an initial state s and a disturbance trajectory x1:d = (x1, . . . , xd) with depth d, we can compute the state of the system at time step d by performing a rollout (algorithm 4.6) and taking the final state. We denote this operation as sd = Reach(s, x1:d). By performing this operation on various initial states and disturbances sampled from S and X, we find a set of points in the state space that the system could reach at time step d. Figure 8.1 demonstrates this process on the mass-spring-damper system.
Figure 8.1. Samples from R5 for the mass-spring-damper system with initial position between −0.2 and 0.2 and initial velocity set to zero. The disturbance sets for the observation noise are bounded between −1 and 1. The gray points represent the initial states, the gray lines show the trajectories, and the blue points represent the states after 5 time steps.
We define the reachable set at depth d as the set of all states that the system could reach at time step d given all possible initial states and disturbances. We write this set as
Rd = {sd | sd = Reach(s, x1:d), s ∈ S, xt ∈ Xt, t ∈ 1:d}   (8.2)
where Xt represents the set of possible disturbances at time step t. We are often interested in the full set of states that the system might reach in a given time horizon rather than at a specific depth d. We denote this set as R1:h and represent it as the union of the reachable sets at each depth up to the time horizon:
R1:h = ∪_{d=1}^{h} Rd   (8.3)
Figure 8.2 shows the reachable sets in R1:4 for the mass-spring-damper system.
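A minimal sketch of this sampling procedure is shown below. It is not the book's implementation: the helpers sample_initial_state, sample_disturbances, and reach (a rollout that returns the state at depth d) are hypothetical placeholders for the system interface.

# Estimate Rd by sampling initial states and disturbance trajectories (hypothetical helpers).
function sample_reachable_states(𝒮, 𝒳, d; m=1000)
    return [reach(sample_initial_state(𝒮), sample_disturbances(𝒳, d)) for i in 1:m]
end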
Computing reachable sets allows us to understand the behavior of a system over time. For example, we can use reachable sets to determine if a system remains within a safe region of the state space.² We call the set of states that make up the unsafe region of the state space the avoid set and use this set to define a specification for the system (algorithm 8.1). If the reachable set intersects with the avoid set, the system violates the specification.
² We could also determine if the system reaches a goal region in the state space. In this case, we would want to check if the reachable set is contained within the goal region.
The reachability algorithms we discuss in this chapter apply to linear systems. Linear systems are a class of systems for which the sensor, agent, and environment models are linear functions of their inputs.
Example 8.1. A common example of a linear system is the mass-spring-damper system, which can be used to model mechanical systems such as a car suspension or a bridge. The system consists of a mass m attached to a wall by a spring with spring constant k and a damper with damping coefficient c, controlled by a force β applied to the mass. The state of the system is the position (relative to the resting point) p and velocity v of the mass (s = [p, v]), the action is the force β applied to the mass, and the observation is a noisy measurement of the state. The equations of motion for a mass-spring-damper system are
p′ = p + v∆t
v′ = v + (−(k/m)p − (c/m)v + (1/m)β)∆t
where m is the mass, k is the spring constant, c is the damping coefficient, and ∆t is the discrete time step. Rewriting the dynamics in the form of equation (8.6), we have
T(s, a, xs) = [1 ∆t; −(k/m)∆t 1 − (c/m)∆t][p; v] + [0; (1/m)∆t]β + xs = Ts s + Ta a + xs
The accompanying plots show simulated trajectories of the system for different levels of observation noise; with enough noise, the system becomes unstable.
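The discrete-time dynamics above can be written down directly. The sketch below is not the book's MassSpringDamper implementation, and the parameter values are hypothetical.

m, k, c, Δt = 1.0, 1.0, 0.1, 0.05              # hypothetical parameter values
Ts = [1.0 Δt; -k/m*Δt 1-c/m*Δt]                # state matrix
Ta = [0.0, Δt/m]                               # action matrix
transition(s, β, xs) = Ts*s + Ta*β + xs        # next state for action β and disturbance xs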
8.2 Set Propagation Techniques
We first focus on set propagation for the one-step reachability problem. We assume
we are given a set of initial states S and a set of disturbances X . Our goal is to
compute the set of states S′ that the system could reach at the next time step.
Given a single initial state s and disturbance x, we can compute the next state s′ by applying equations (8.4) to (8.6) sequentially. Simplifying the resulting composition gives the one-step update
s′ = (Ts + Ta Πo Os)s + Ta Πo xo + Ta xa + xs   (8.9)
We can compute the reachable set at the next time step by applying equation (8.9)
to S and X . To perform this computation, we must define the operations in
equation (8.9) as set operations. In particular, we must be able to apply a linear
transformation, or matrix multiplication, to a set and add two sets together.
The multiplication of a set P by a matrix A is defined as
AP = {Ap | p ∈ P } (8.10)
where the result is the set of all points obtained by multiplying each point in P
by A. The addition of two sets P and Q is defined as
P ⊕ Q = {p + q | p ∈ P , q ∈ Q} (8.11)
where the result is the set of all points obtained by adding each point in P to each point in Q. This operation is referred to as the Minkowski sum of two sets and is often denoted using the ⊕ symbol.⁴ Figure 8.3 shows these operations in two-dimensional space. As we will discuss in the next section, we can efficiently compute linear transformations and Minkowski sums for many common set types such as hyperrectangles and polytopes.⁵
⁴ The Minkowski sum is named after German mathematician Hermann Minkowski (1864–1909).
⁵ The LazySets.jl package in Julia provides implementations of these operations for many common sets.
With these definitions in place, we can rewrite equation (8.9) using set operations as
S′ = (Ts + Ta Πo Os)S ⊕ Ta Πo Xo ⊕ Ta Xa ⊕ Xs   (8.12)
where S′ is the one-step reachable set. It is important that we simplify the system
dynamics into the form of equation (8.9) before applying set operations. If we
apply the equations without simplification, we may encounter a phenomenon
called the dependency effect, which occurs when a variable appears more than once
in a formula. Set operations fail to model this dependency, leading to conservative
reachable sets (see example 8.2). Algorithm 8.2 implements equation (8.12).
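A minimal sketch (not from the book) of these set operations using the LazySets.jl package mentioned in the margin; the matrix and sets are hypothetical.

using LazySets
P = Hyperrectangle(low=[-1.0, -1.0], high=[1.0, 1.0])
Q = Hyperrectangle(low=[-0.1, -0.1], high=[0.1, 0.1])
A = [0.9 0.1; -0.1 0.9]
S′ = minkowski_sum(linear_map(A, P), Q)   # AP ⊕ Q, as in equations (8.10) to (8.12)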
To compute reachable sets over a given time horizon using set propagation
techniques, we rely on the fact that the reachable set at time step d is a function of
the reachable set at time step d − 1. Specifically, we can compute the reachable set
at time step d by applying equation (8.12) to the reachable set at time step d − 1.
Algorithm 8.3 implements this recursive procedure for computing the reachable
set at each time step. The algorithm terminates when it reaches the desired time
horizon h and returns R1:h .
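A minimal sketch (not algorithm 8.3 itself) of this recursion is shown below, assuming the closed-loop matrix and the three disturbance terms of equation (8.12) have been lumped into a single matrix T̄ and a single set 𝒳; both are hypothetical placeholders.

using LazySets
function reachable_sets(T̄, 𝒮, 𝒳, h)
    sets = LazySet[𝒮]
    for d in 1:h
        push!(sets, minkowski_sum(linear_map(T̄, sets[end]), 𝒳))   # as in equation (8.12)
    end
    return sets[2:end]   # R1, …, Rh
end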
In addition to gaining insight into the behavior of a system, we can use reach-
able sets to verify that a system satisfies a given specification. For a given specification, we want to ensure that the reachable set does not intersect with its avoid set.
Example 8.2. Example of the dependency effect on a simple system. Consider a simple system with the following component models:
O(s, xo) = s
π(o, xa) = −Io
T(s, a, xs) = s + a
where the state, action, and observation are two-dimensional and I is the identity matrix. Suppose we want to compute the one-step reachable set S′ when the initial set is a square centered at the origin with side length 1. If we apply the sensor, agent, and environment models on the initial set without simplification, we get O = S, A = −IO = −IS, and S′ = S ⊕ A = S ⊕ −IS. The resulting set S ⊕ −IS is a square with side length 2 centered at the origin. However, if we first simplify before switching to set operations, we get that s′ = s − s = 0. Thus, the true reachable set contains only the origin.
This mismatch is due to an effect called the dependency effect, which leads to conservative reachable sets. Because applying the set operations in order does not account for the fact that the action depends on the state, it considers worst-case behavior. For this reason, it is important to simplify the system models into the form of equation (8.9) before applying set operations to avoid unnecessary conservativeness. While this simplification is always possible for linear systems, it is not always possible for the nonlinear systems we discuss in the next chapter.
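The following sketch (not from the book) reproduces the conservative result of example 8.2 with LazySets.jl.

using LazySets
S = Hyperrectangle(zeros(2), fill(0.5, 2))    # square with side length 1
A = linear_map([-1.0 0.0; 0.0 -1.0], S)       # −IS
naive = minkowski_sum(S, A)                   # S ⊕ −IS: square with side length 2
# Simplifying first gives s′ = s − s = 0, so the true reachable set contains only the origin.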
Example 8.3. Computing the reachable sets for the mass-spring-damper system over a time horizon of 20 steps. The reachable sets switch from light blue to dark blue over time in the accompanying plot. Suppose we want to compute R1:20 for the mass-spring-damper system with initial position between −0.2 and 0.2 and initial velocity set to zero. We assume the observation noise is bounded between −0.2 and 0.2. To perform reachability, we must implement the following functions for the system:
𝒮₁(env::MassSpringDamper) = Hyperrectangle(low=[-0.2, 0], high=[0.2, 0])
function disturbance_set(sys)
    Do = sys.sensor.Do
    low = [support(d).lb for d in Do.v]
    high = [support(d).ub for d in Do.v]
    return Hyperrectangle(low=low, high=high)   # assumed completion of the truncated listing
end
Figure 8.4. Reachable sets (bottom row) for the mass-spring-damper system with varying levels of observation noise (−0.1 ≤ xo ≤ 0.1, −1.0 ≤ xo ≤ 1.0, and −2.5 ≤ xo ≤ 2.5) compared to samples from a finite number of trajectories.
8.3 Set Representations
To ensure that algorithms 8.2 to 8.4 are tractable, we must select set representations
that are computationally efficient. Desirable properties include:
• Finite representations: We should be able to specify the points that are contained
in the set without needing to enumerate all of them.
• Closure under set operations: A set representation is closed under a particular set
operation if applying the operation results in a set of the same type.
In this chapter, we will focus on convex set representations, which tend to have these properties.⁷ A convex set is a set for which a line drawn between any two points in the set is contained entirely within the set. Mathematically, a set P is convex if we have
αp + (1 − α)q ∈ P   (8.15)
for all p, q ∈ P and α ∈ [0, 1]. Figure 8.6 illustrates this property. The rest of this section discusses a common convex set representation called polytopes.
⁷ Some nonconvex sets can also be efficiently represented and manipulated. A detailed overview is provided in M. Althoff, G. Frehse, and A. Girard, "Set Propagation Techniques for Reachability Analysis," Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, pp. 369–395, 2021.
8.3.1 Polytopes
A polytope is defined as the bounded intersection of a set of linear inequalities.⁸ A linear inequality has the form aᵀx ≤ b, where a is a vector of coefficients, x is a vector of variables, and b is a scalar. We refer to the set of points that satisfy a given linear inequality as a half space. A polyhedron is the intersection of a finite number of half spaces. If the polyhedron is bounded, we call it a polytope. Figure 8.7 illustrates these concepts in two dimensions.
⁸ We can also define convex sets such as ellipsoids using nonlinear inequalities. O. Maler, "Computing Reachable Sets: An Introduction," French National Center of Scientific Research, pp. 1–8, 2008.
such that ∑_{i=1}^{n} λi = 1 and λi ≥ 0 for all i. Intuitively, the convex hull of a set of points is the smallest convex set that contains all the points (figure 8.8).
It is always possible to convert between the two polytope representations; however, the calculation is nontrivial.⁹ Each representation has different advantages. For example, H-polytopes are more efficient for checking whether a point belongs to the set because we can simply check if it satisfies all the linear inequalities. In contrast, V-polytopes are more efficient for set operations such as linear transformations. To compute a linear transformation of a polytope represented as a V-polytope, we can apply the transformation to each vertex to obtain the vertices of the transformed polytope.
⁹ A detailed overview is provided in G. M. Ziegler, Lectures on Polytopes. Springer Science & Business Media, 2012, vol. 152. In Julia, LazySets.jl provides functionality to convert between the two representations.
The Minkowski sum of two V-polytopes is
P1 ⊕ P2 = conv({v1 + v2 | v1 ∈ V1, v2 ∈ V2})   (8.17)
where conv(·) denotes the convex hull and the candidate vertices are the sums of all pairs of vertices from the two polytopes. To determine which candidates are actual vertices, we must determine which candidate vertices lie on the boundary of the convex hull. Figure 8.9 illustrates this process.
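A minimal sketch (not from the book) of these V-polytope operations in LazySets.jl; the vertices and matrix are hypothetical.

using LazySets
P1 = VPolytope([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
P2 = VPolytope([[0.0, 0.0], [0.5, 0.5]])
AP1 = linear_map([0.0 -1.0; 1.0 0.0], P1)   # transform each vertex
Psum = minkowski_sum(P1, P2)                # convex hull of pairwise vertex sums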
We can use these results to reason about the complexity of algorithm 8.3 if we were to represent our sets as polytopes. We apply equation (8.12) at each iteration, which involves four linear transformations and three Minkowski sums. The number of candidate vertices resulting from computing the one-step reachable set using equation (8.12) is |S1||Xo||Xa||Xs|, where |P| represents the number of vertices in polytope P. The number of candidate vertices for the reachable set at depth d is then |S1|(|Xo||Xa||Xs|)^d. We can prune the candidate vertices that are not actual vertices by computing the convex hull of the candidate vertices, but this operation can be expensive.¹⁰ Therefore, the exponential growth in the number of candidate vertices creates tractability challenges for high-dimensional systems with long time horizons.¹¹
¹⁰ The most efficient algorithms for computing the vertices of the convex hull of a set of points have a complexity of O(mv), where m is the number of candidate vertices and v is the number of actual vertices. In general, the number of actual vertices grows superlinearly. For more details, see R. Seidel, "Convex Hull Computations," in Handbook of Discrete and Computational Geometry, Chapman and Hall, 2017, pp. 687–703.
¹¹ Other polytope representations such as the Z-representation and M-representation perform Minkowski sums more efficiently. More details are provided in S. Sigl and M. Althoff, "M-Representation of Polytopes," ArXiv: 2303.05173, 2023.
8.3.2 Zonotopes
A zonotope is a special type of polytope that avoids the exponential growth in candidate vertices for Minkowski sums. It is defined as the Minkowski sum of a set of line segments centered at a point c:
Z = {c + ∑_{i=1}^{m} αi gi | αi ∈ [−1, 1]}   (8.18)
where g1:m are referred to as the generators of the zonotope.¹² We represent zonotopes by a center point and a list of generators:
Z = (c, ⟨g1:m⟩)   (8.19)
¹² Zonotopes can also be viewed as linear transformations of the unit hypercube.
To compute the Minkowski sum of two zonotopes, we sum the centers and concatenate the generators:
Z ⊕ Z′ = (c + c′, ⟨g1:m, g′1:m′⟩)   (8.21)
Note that the number of generators in the resulting zonotope grows linearly with the number of generators in each zonotope. Therefore, if we represent our sets as zonotopes, the number of generators for the reachable set at depth d is |S1| + d(|Xo| + |Xa| + |Xs|). This linear growth represents a significant improvement over the exponential growth in candidate vertices for generic polytopes.
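A minimal sketch (not from the book) of equation (8.21) using LazySets.jl: the Minkowski sum of two zonotopes adds their centers and concatenates their generators.

using LazySets
Z1 = Zonotope([0.0, 0.0], [1.0 0.0; 0.0 1.0])
Z2 = Zonotope([1.0, 0.0], [0.5 0.2; 0.0 0.3])
Z = minkowski_sum(Z1, Z2)
genmat(Z)   # 2×4 generator matrix: the generators of Z1 followed by those of Z2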
8.3.3 Hyperrectangles
A hyperrectangle is a generalization of a rectangle to higher dimensions (figure 8.12). It is a special type of zonotope in which the generators are aligned with the axes. We may also work with linear transformations of hyperrectangles, which can always be transformed back to an axis-aligned representation. All hyperrectangles are zonotopes, and all zonotopes are polytopes; however, the reverse does not hold (figure 8.11). Hyperrectangles can be compactly represented as a center point and a vector of half-widths. They can also be represented as a set of intervals with one for each dimension. Unlike zonotopes, hyperrectangles are not closed under linear transformations and Minkowski sums.
Figure 8.11. Zonotopes are a subclass of polytopes, and hyperrectangles are a subclass of zonotopes.
8.4 Reducing Computational Cost
As noted in section 8.3.1, the number of candidate vertices for the reachable
sets in algorithm 8.3 grows exponentially with the time horizon and causes
computational challenges for high-dimensional systems. There are multiple ways
to reduce this computational burden. One way is to represent the initial state and
disturbance sets using zonotopes since the number of generators scales linearly
with the time horizon (see section 8.3.2). In this section, we will discuss another
technique to reduce the computational cost that relies on overapproximation.
The set P̃ represents an overapproximation of the set P if P ⊆ P̃ . Typically, we
select the overapproximated set P̃ such that it is easier to compute or represent.
For example, we can use overapproximation to reduce the computational cost of
algorithm 8.3 by overapproximating the reachable set at each iteration with a set
that has fewer vertices (figure 8.13). We can then use this overapproximated set
as the initial set for the next iteration.
As long as the overapproximated reachable set does not intersect with the avoid set, we can still use it to make claims about the safety of the system. However, if the overapproximated reachable set does intersect with the avoid set, the results are inconclusive. The violation could be due to unsafe behavior or the overapproximation itself. In this case, we could move to a tighter overapproximation or use a different method to verify safety (example 8.4).
Figure 8.13. Overapproximating the blue polytope with the purple polytope. The purple polytope has fewer vertices.
Algorithm 8.5 modifies algorithm 8.3 to include overapproximation. Depending on the complexity of the reachable sets, we may not need to overapproximate at every iteration, so we set a frequency parameter to control how often we overapproximate. Figure 8.14 demonstrates this idea on the mass-spring-damper system. A more frequent overapproximation will result in greater computational efficiency at the cost of extra overapproximation error in the reachable sets. We define overapproximation error as the difference in volume between the overapproximated reachable set and the true reachable set.
The overapproximation tolerance e places a bound on the Hausdorff distance between the overapproximated set and the original set.¹³
¹³ The Hausdorff distance is named after German mathematician Felix Hausdorff (1868–1942).
The Hausdorff distance
between two sets P and P̃ is the maximum distance from a point in P to the
nearest point in P̃ . A lower value for e results in a less conservative overap-
proximation but may require more computation and result in a more complex
representation. The rest of this section discusses a technique for computing this
overapproximation.
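A minimal sketch (not the book's algorithm 8.5) of overapproximating a set with a simpler one in LazySets.jl. In two dimensions, overapproximate also accepts a Hausdorff-distance tolerance and performs iterative refinement; the zonotope here is hypothetical.

using LazySets
Z = Zonotope([0.0, 0.0], [1.0 0.5 0.2; 0.0 0.8 -0.3])
box = overapproximate(Z, Hyperrectangle)   # axis-aligned bounding box
tight = overapproximate(Z, 0.01)           # polygon within Hausdorff distance 0.01 of Z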
Example 8.4. The effect of overapproximation on accuracy and computational cost for the mass-spring-damper system. If the tolerance is too high, the reachable set may overlap with the avoid set, and the analysis is inconclusive. Suppose we want to determine if the mass-spring-damper system could reach the avoid set within 40 time steps. To reduce computational cost, we use algorithm 8.5 with an overapproximation frequency of 5 time steps. The plots below show the reachable set R1:40 using three different overapproximation tolerances e (e = 0, e = 0.001, and e = 1). The plot on the right shows the number of vertices in Rd for each depth d.
The first tolerance of e = 0 results in no overapproximation, but the number of vertices grows quickly. The highest tolerance of e = 1 results in significantly fewer vertices, but it is so conservative that the reachable set overlaps with the avoid set. Therefore, the results of the analysis with e = 1 are inconclusive. The middle tolerance of e = 0.001 strikes a balance between the two extremes. With this tolerance, we are still able to verify safety while reducing the computational cost.
3. Compute the distance between each facet of the inner approximation and the
nearest vertex of the outer approximation.
4. Add the direction of the face that is furthest from the nearest vertex to D and
return to step 1.
The process is repeated until the maximum distance between the inner and outer
approximations is less than a specified tolerance e. Figure 8.18 shows the steps
involved in a single iteration of the algorithm, and figure 8.19 demonstrates the
process over multiple iterations.
8.5 Linear Programming
A linear program is an optimization problem in which the objective and constraints are all linear. The linear program for equation (8.26) is
minimize (over s1:d, x1:d)   dᵀ sd
subject to   s1 ∈ S
             xt ∈ Xt   for all t ∈ 1:d
             st+1 = Step(st, xt)   for all t ∈ 1:d − 1      (8.27)
where
Step(s, x) = (Ts + Ta Πo Os)s + Ta Πo xo + Ta xa + xs      (8.28)
The decision variables in equation (8.27) are the state and disturbances at each time step. The constraints enforce that the state and disturbances are within their respective sets and that the state evolves according to equation (8.9). The optimization problem in equation (8.27) can be solved efficiently using a variety of algorithms.¹⁷
¹⁷ Modern linear programming solvers can solve problems with thousands of variables and constraints. H. Karloff, Linear Programming. Springer, 2008.
For the optimization problem in equation (8.27) to be a linear program, the sets S and Xt must be polytopes. We can write them as a set of linear inequalities using their H-polytope representations. Algorithm 8.6 implements the linear program for computing the support function of a reachable set at a particular depth d. Given a desired time horizon h and a set of directions D, we can compute an overapproximation of R1:h by evaluating the support function at each direction for each depth. Algorithm 8.7 implements this process.
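A minimal sketch (not algorithm 8.6 itself) of building the constraints of equation (8.27) with JuMP is shown below. It assumes a closed-loop matrix T̄ and a matrix B that collapses the three disturbance terms of equation (8.28), along with H-polytope data (As, bs) for S and (Ax, bx) for each Xt; these names and the HiGHS solver are assumptions, not the book's interface. The objective for a particular direction can then be set and optimized, as in the ρ implementation shown later in this section.

using JuMP, HiGHS
function build_reach_lp(T̄, B, As, bs, Ax, bx, d)
    n, nx = size(T̄, 1), size(B, 2)
    model = Model(HiGHS.Optimizer)
    @variable(model, 𝐬[1:n, 1:d])                                        # states s1, …, sd
    @variable(model, 𝐱[1:nx, 1:d])                                       # disturbances x1, …, xd
    @constraint(model, As * 𝐬[:, 1] .≤ bs)                               # s1 ∈ S
    @constraint(model, [t in 1:d], Ax * 𝐱[:, t] .≤ bx)                   # xt ∈ Xt
    @constraint(model, [t in 1:d-1], 𝐬[:, t+1] .== T̄ * 𝐬[:, t] + B * 𝐱[:, t])
    return model
end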
Similar to the polytope overapproximation in section 8.4, the choice of the directions in D affects the tightness of the reachable set overapproximation. We could select the directions to align with the axes or use more sophisticated methods like the iterative refinement algorithm in section 8.4.2. Since linear program solvers are computationally efficient, another option is to simply evaluate the support function at many randomly sampled directions. We could also select the directions using trajectory samples. Given a set of samples from the reachable set, we can use principal component analysis (PCA)¹⁸ to determine the directions that best capture the shape of the set.¹⁹
¹⁸ H. Abdi and L. J. Williams, "Principal Component Analysis," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.
¹⁹ O. Stursberg and B. H. Krogh, "Efficient Representation and Computation of Reachable Sets for Hybrid Systems," in Hybrid Systems: Computation and Control, 2003.
The overapproximate reachable sets improve our understanding of the behavior of the system. However, if our ultimate goal is to check intersection with a convex avoid set U, we can solve the problem exactly without the need for overapproximation.
function ρ(model, 𝐝, d)
    # evaluate the support function of the depth-d reachable set in direction 𝐝
    # by maximizing 𝐝' * s_d subject to the constraints already in the model
    𝐬 = model.obj_dict[:𝐬]
    @objective(model, Max, 𝐝' * 𝐬[:, d])
    optimize!(model)
    return objective_value(model)
end
To check whether the system can reach a convex avoid set U at depth d, we can solve a convex program that computes the minimum distance between the reachable states and U:
minimize (over s1:d, x1:d, u)   ‖sd − u‖
subject to   u ∈ U
             s1 ∈ S
             xt ∈ Xt   for all t ∈ 1:d
             st+1 = Step(st, xt)   for all t ∈ 1:d − 1      (8.29)
If the optimal objective value is zero, the reachable set intersects the avoid set.
8.6 Summary
• We can compute reachable sets for linear systems by propagating sets through
the system dynamics.
Example 8.5. Checking whether the mass-spring-damper system can reach the avoid set using convex programming. The avoid set for the mass-spring-damper system can be written as the union of two convex sets. Specifically, we require that |p| < 0.3. The first set is therefore represented by the linear inequality [1, 0]ᵀs ≤ −0.3, and the second set is represented by the linear inequality [−1, 0]ᵀs ≤ −0.3. To check whether the system could reach the avoid set, we run algorithm 8.8 for each component of the avoid set. The system does not satisfy the specification if the algorithm returns false for either component.
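A minimal sketch (not algorithm 8.8) of the two avoid-set components in example 8.5 as half-spaces in LazySets.jl.

using LazySets
avoid = [HalfSpace([1.0, 0.0], -0.3),    # p ≤ −0.3
         HalfSpace([-1.0, 0.0], -0.3)]   # p ≥ 0.3
# The specification is violated if the reachable set intersects either half-space.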
• If the number of vertices in the reachable set grows too large, we can produce
overapproximate representations by evaluating the support function on a set
of directions.
9 Reachability for Nonlinear Systems
This chapter extends the set propagation and optimization techniques discussed in chapter 8 to perform reachability on nonlinear systems. A system is nonlinear if its agent, environment, or sensor model contains nonlinear functions. The reachable sets of nonlinear systems are often nonconvex and difficult to compute exactly. This chapter begins by discussing several set propagation techniques for nonlinear systems that overapproximate the reachable set.¹ We then discuss optimization-based nonlinear reachability methods. To minimize the overapproximation error introduced by these methods, we introduce a technique for overapproximation error reduction that involves partitioning the state space. We conclude by discussing reachability techniques for nonlinear systems represented by a neural network.
¹ For more details on set propagation through nonlinear systems, refer to M. Althoff, G. Frehse, and A. Girard, "Set Propagation Techniques for Reachability Analysis," Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, pp. 369–395, 2021.
For nonlinear systems, the reachability function r(s, x1:d) is a nonlinear function. In contrast with the linear systems in chapter 8, we cannot directly propagate arbitrary polytopes through nonlinear systems. We can, however, propagate hyperrectangular sets² using a technique called interval arithmetic.³ Interval arithmetic extends traditional arithmetic operations and other elementary functions to intervals. An interval is a set of real numbers written as
[x] = [x̲, x̄] = {x | x̲ ≤ x ≤ x̄}   (9.1)
where x̲ and x̄ are the lower and upper bounds of the interval, respectively. A hyperrectangle, also known as an interval box, is the Cartesian product of a set of n intervals:
[x] = [x1] × [x2] × · · · × [xn]   (9.2)
² We can also propagate sets that are linear transformations of hyperrectangles by reversing the linear transformation to obtain an axis-aligned hyperrectangle and performing the analysis in the transformed space.
³ L. Jaulin, M. Kieffer, O. Didrit, and É. Walter, Interval Analysis. Springer, 2001.
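A minimal sketch (not from the book) of interval arithmetic with the IntervalArithmetic.jl package used later in this chapter.

using IntervalArithmetic
x = interval(-1.0, 2.0)
y = interval(0.5, 1.0)
x + y    # [-0.5, 3.0]
x * y    # [-1.0, 2.0]
x^2      # [0.0, 4.0]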
The interval counterpart of an elementary binary operation ◦ is
[x] ◦ [y] = {x ◦ y | x ∈ [x], y ∈ [y]}   (9.3)
and the interval counterpart of a function f is
f([x]) = [{f(x) | x ∈ [x]}]   (9.8)
where the [·] operation takes the interval hull of the resulting set. The interval
hull of a set is the smallest interval that contains the set. Therefore, the interval
counterpart of a function returns the smallest interval that contains all possible
function evaluations of the points in the input interval.
We can define an interval counterpart for a variety of elementary functions.⁴ For monotonically increasing functions such as exp, log, and square root, the interval counterpart is
f([x]) = [f(x̲), f(x̄)]   (9.9)
The interval counterpart for monotonically decreasing functions is similarly defined. Nonmonotonic elementary functions such as sin, cos, and square require multiple cases to define their interval counterparts. For example, the interval counterpart for the square function is
[x]² = [min(x̲², x̄²), max(x̲², x̄²)]   if 0 ∉ [x]
[x]² = [0, max(x̲², x̄²)]             otherwise   (9.10)
Figure 9.2 shows example evaluations of the interval counterparts for the exp, square, and sin functions.
⁴ IntervalArithmetic.jl defines the interval counterpart of many elementary functions such as sin, cos, exp, and log in Julia.
For complex functions, it is not always possible to define a tight interval counter-
part. In these cases, we instead define an inclusion function. An inclusion function
[ f ]([ x ]) outputs an interval that is guaranteed to contain the interval from the
interval counterpart:
f ([ x ]) ⊆ [ f ]([ x ]) (9.11)
In other words, inclusion functions output overapproximate intervals. We can
also define an inclusion function for multivariate functions that map from Rk to
R where k ≥ 1.
For reachability analysis, our goal is to propagate intervals through the function
r (s, x1:d ), which maps its inputs to Rn where n is the dimension of the state space.
We can rewrite r (s, x1:d ) as a vector of functions that map to R as follows:
s′ = r(s, x1:d) = [r1(s, x1:d), . . . , rn(s, x1:d)]ᵀ   (9.12)
where ri (s, x1:d ) outputs the value of the ith component of s0 . We can then define
the inclusion function for each ri (s, x1:d ) as [ri ]([s], [x1:d ]). By evaluating each
inclusion function for the input intervals [s] and [x1:d ], we obtain an overapproxi-
mate hyperrectangular reachable set. The rest of this section discusses techniques
to create these inclusion functions.
When we replace each elementary operation in a function with its interval counterpart, the resulting inclusion function is known as a natural inclusion function. For example, the natural inclusion function for f(x) = x − sin(x) is [f]([x]) = [x] − sin([x]) (figure 9.3). By replacing the elementary nonlinear components of the agent, environment, and sensor models with their interval counterparts, we can create the natural inclusion function for ri(s, x1:d). We can then use interval arithmetic to propagate hyperrectangular sets through the natural inclusion function. This computation will result in overapproximate reachable sets for nonlinear systems. Algorithm 9.1 implements the natural inclusion reachability algorithm and computes overapproximate reachable sets up to a desired time horizon. Example 9.1 applies algorithm 9.1 to the inverted pendulum problem.
Figure 9.3. Example evaluation of the natural inclusion function for f(x) = x − sin(x). The inclusion function produces an overapproximate interval.
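A minimal sketch (not algorithm 9.1) of the natural inclusion function for f(x) = x − sin(x) using IntervalArithmetic.jl.

using IntervalArithmetic
f(x) = x - sin(x)
f(interval(-1.0, 1.0))   # ≈ [−1.84, 1.84], much wider than the true range ≈ [−0.16, 0.16]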
As shown in figure 9.3 and example 9.1, natural inclusion functions tend to be overly conservative. This property is due to the dependency effect, in which multiple occurrences of the same variable are treated independently (see example 8.2). In chapter 8, we were able to eliminate this effect by simplifying equations to algebraically combine all repeated instances of a variable. However, this simplification is not always possible for nonlinear functions such as the one shown in figure 9.3. We can instead mitigate the dependency effect by using more sophisticated techniques for generating inclusion functions, which we discuss in the remainder of this section.
9.2.2 Mean Value Inclusion Functions
The mean value theorem states that for a continuously differentiable function f, there exists a point x′ ∈ [x] such that f(x̄) − f(x̲) = f′(x′)(x̄ − x̲). In other words, there exists a point in [x] where the slope of the tangent line is equal to the slope of the secant line between the endpoints of the interval (figure 9.4).
Example 9.1. Computing reachable sets for the inverted pendulum system using its natural inclusion function. Suppose we want to compute reachable sets for the pendulum system with bounded sensor noise on the angle and angular velocity using algorithm 9.1. We define the intervals and extract functions as follows:
function intervals(sys, d)
    disturbance_mag = 0.01
    θmin, θmax = -π/16, π/16
    ωmin, ωmax = -1.0, 1.0
    𝐈 = [interval(θmin, θmax), interval(ωmin, ωmax)]
    for i in 1:2d
        push!(𝐈, interval(-disturbance_mag, disturbance_mag))
    end
    return 𝐈
end
function extract(env::InvertedPendulum, x)
    s = x[1:2]
    𝐱 = [Disturbance(0, 0, x[i:i+1]) for i in 3:2:length(x)]
    return s, 𝐱
end
The intervals function returns the initial state intervals followed by the disturbance intervals for each time step. The extract function extracts these intervals into the state and disturbance components. The plot in the caption shows the overapproximated reachable set after two time steps (R2), compared to a set of samples from R2.
The mean value theorem implies that for any subinterval of [x], there exists a point in [x] where the slope of the tangent line is equal to the slope of the secant line between the endpoints of the subinterval. Therefore, given the center c of the interval [x], there exists a point x′ ∈ [x] such that
(f(x) − f(c)) / (x − c) = f′(x′)   (9.14)
for any x ∈ [x] (figure 9.5). Rearranging equation (9.14) gives
f(x) = f(c) + f′(x′)(x − c)   (9.15)
Because we know that x′ ∈ [x], we can use equation (9.15) to create an inclusion function for f(x) as follows:
[f]([x]) = f(c) + [f′]([x])([x] − c)   (9.16)
where [f′]([x]) is the natural inclusion function for f′(x). For multivariate functions, equation (9.16) generalizes by applying an inclusion function to the gradient of each output component. When the input interval is small, the mean value inclusion function tends to produce tighter intervals than the natural inclusion function. However, as we expand the input interval to include nonlinear regions, the mean value inclusion function becomes more conservative (figure 9.7).
Figure 9.5. For a given subinterval [c, x], there exists a point in [x] where the slope of the tangent line is equal to the slope of the secant line between the endpoints of the subinterval.
Figure 9.7. Mean value inclusion function for f(x) = x − sin(x) over a wider interval.
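A minimal sketch (not from the book) of the mean value inclusion function in equation (9.16) for f(x) = x − sin(x); the derivative is written out by hand and evaluated with its natural inclusion function.

using IntervalArithmetic
f(x) = x - sin(x)
f′(x) = 1 - cos(x)
function mean_value_inclusion(f, f′, X)
    c = mid(X)
    return f(c) + f′(X) * (X - c)   # equation (9.16)
end
mean_value_inclusion(f, f′, interval(-1.0, 1.0))   # ≈ [−0.46, 0.46], tighter than the natural inclusion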
9.2.3 Taylor Inclusion Functions
Natural inclusion functions and mean value inclusion functions are special cases of a more general type of inclusion function known as a Taylor inclusion function. These inclusion functions use Taylor series expansions about the center of the input interval.
Figure 9.8. Evaluation of Taylor inclusion functions of different orders for f(x) = x − sin(x) over the interval [x] = [−1, 1] (top row) and [x] = [−1.5, 1.5] (bottom row).
Higher-order Taylor inclusion functions generally produce tighter approximations (figure 9.8). However, the benefit of using higher-order terms depends on the behavior of the function over the input interval. If the function is nearly linear over the input interval, moving beyond a first-order model may not be
worth the additional computational cost. In contrast, if the function is highly nonlinear over the input interval, a higher-order model may significantly decrease overapproximation error.
Algorithm 9.2 implements first- and second-order Taylor inclusion functions for reachability analysis. The algorithm computes overapproximate reachable sets up to a desired time horizon by evaluating the Taylor inclusion function for each subfunction ri(s, x1:d). Taylor inclusion functions can be used to create tighter overapproximations of the reachable set than natural inclusion functions, especially for short time horizons (figure 9.9).⁷ However, the nonlinearities compound for each time step, so Taylor models can be computationally expensive and result in significant overapproximation error for long time horizons (example 9.2).
⁷ Because Taylor inclusion functions can only be applied to functions that are continuous and differentiable, we use a modified version of the pendulum problem in this chapter that does not apply clamping in the environment model.
Figure 9.9. Comparison of the one-step overapproximated reachable sets for the inverted pendulum system using natural, first-order Taylor, and second-order Taylor inclusion functions.
Example 9.2. Overapproximate reachable sets for the inverted pendulum system using first-order Taylor inclusion functions at different depths. The plots below show the overapproximate reachable sets (R2 through R5) for the inverted pendulum system produced by a first-order Taylor inclusion function at different depths. As the depth increases, the overapproximation error increases. This result is due to the increasing presence of nonlinearities in the system dynamics as we increase the depth. For the one-step reachable set (R2), the only nonlinearity present is the sine function in the pendulum dynamics. As the depth increases, this nonlinearity will be repeated for each time step, leading to larger overapproximation error.
9.3 Taylor Models
While inclusion functions only operate over interval inputs and output reachable sets in the form of hyperrectangles, Taylor models operate over other types of input sets and are able to represent more expressive reachable sets.⁸ Similar to Taylor inclusion functions, Taylor models are based on Taylor series expansions. An nth-order Taylor model is a set represented as
T = {p(x) + α | x ∈ X, α ∈ [α]}   (9.20)
where p(x) is the Taylor polynomial
p(x) = f(c) + f′(c)(x − c) + (f″(c)/2!)(x − c)² + · · · + (f⁽ⁿ⁻¹⁾(c)/(n − 1)!)(x − c)ⁿ⁻¹   (9.21)
⁸ K. Makino and M. Berz, "Taylor Models and Other Validated Functional Inclusion Methods," International Journal of Pure and Applied Mathematics, vol. 4, no. 4, pp. 379–456, 2003.
where c is the center of the input interval. The interval remainder term, also known as the Lagrange remainder, bounds the sum of the rest of the terms in the Taylor expansion over the input interval [x] so that the Taylor model is guaranteed to contain the true output of the function. It is calculated as
[α] = ([f⁽ⁿ⁾]([x])/n!)([x] − c)ⁿ   (9.22)
and is equivalent to the last term in a Taylor inclusion function of order n. In fact, passing an interval through a Taylor model performs the same computation as a Taylor inclusion function of the same order.
As the order of a Taylor model increases, overapproximation error tends to decrease (figure 9.10). Producing a zero-order Taylor model is equivalent to evaluating the natural inclusion function, while producing a first-order Taylor model is equivalent to evaluating the mean value inclusion function. Taylor models begin to deviate from inclusion functions for orders of two or higher. Second-order Taylor models represent arbitrary polytopes, while second-order inclusion functions only produce hyperrectangles. Higher-order Taylor models correspond to nonconvex sets, which are more difficult to understand and manipulate.⁹ For this reason, we focus the remainder of this section on second-order Taylor models.
⁹ One way to handle this nonconvexity is to represent sets using an extension of zonotopes called polynomial zonotopes. More details can be found in M. Althoff, "Reachability Analysis of Nonlinear Systems Using Conservative Polynomialization and Non-Convex Sets," in International Conference on Hybrid Systems: Computation and Control, 2013. Another representation called star sets can also be used to represent nonconvex sets and has been used for reachability. H.-D. Tran, D. Manzanas Lopez, P. Musau, X. Yang, L. V. Nguyen, W. Xiang, and T. T. Johnson, "Star-Based Reachability Analysis of Deep Neural Networks," in International Symposium on Formal Methods, 2019.
Figure 9.10. Taylor models of different orders for f(x) = x − sin(x) over the interval [x] = [−1.5, 0.0]. The dashed purple lines show results from a Taylor inclusion function of the same order.
¹⁰ M. Althoff, O. Stursberg, and M. Buss, "Reachability Analysis of Nonlinear Systems with Uncertain Parameters Using Conservative Linearization," in IEEE Conference on Decision and Control (CDC), 2008.
Creating a second-order Taylor model for a function f(x) is a process known as conservative linearization.¹⁰ Given an input set X and a center point c, the second-order Taylor model is
T = {f(c) + J(x − c) + α | x ∈ X, α ∈ [α]}   (9.23)
where J is the Jacobian of f evaluated at c and [α] is the interval remainder term.
The Jacobian is a generalization of the gradient to functions with multidimensional
outputs and is computed as
J = [∇f1(c)ᵀ; · · · ; ∇fn(c)ᵀ]   (9.24)
where ∇ f i (c) is the gradient of the ith component of f evaluated at c. The interval
remainder term is calculated using interval arithmetic as
[α] = (1/2)([X] − c)ᵀ [∇²f]([X]) ([X] − c)   (9.25)
where [X] is the interval hull of X.¹¹
¹¹ If the input set X is represented as a zonotope, it is also possible to overapproximate the remainder term directly without taking the interval hull. This approach can reduce overapproximation error. M. Althoff, O. Stursberg, and M. Buss, "Reachability Analysis of Nonlinear Systems with Uncertain Parameters Using Conservative Linearization," in IEEE Conference on Decision and Control (CDC), 2008.
Equation (9.23) represents a linear approximation of the nonlinear function f with a remainder term that bounds the error of the approximation. Because all of the operations in equation (9.23) are linear, we can use it to propagate convex sets. In other words, if X is convex, we can rewrite the Taylor model in terms of linear transformations and Minkowski sums as
T = f(c) + J(X ⊕ −c) ⊕ [α]   (9.26)
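A minimal sketch (not algorithm 9.3) of conservative linearization for the scalar function f(x) = x − sin(x), following equations (9.23) to (9.25); the derivatives are written out by hand and the remainder is bounded with interval arithmetic.

using IntervalArithmetic
f(x)  = x - sin(x)
f′(x) = 1 - cos(x)    # first derivative, evaluated at the center
f″(X) = sin(X)        # second derivative, evaluated over the whole interval
function conservative_linearization(X)
    c = mid(X)
    α = f″(X) / 2 * (X - c)^2          # interval remainder term, as in equation (9.25)
    return f(c) + f′(c) * (X - c) + α  # linear part plus remainder, as in equation (9.23)
end
conservative_linearization(interval(-1.0, 1.0))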
Example 9.3. Computing the one-step reachable set for the inverted pendulum system using conservative linearization. Conservative linearization better approximates the reachable set than a second-order Taylor inclusion function. Suppose we want to compute reachable sets for the pendulum system with bounded sensor noise on the angle and angular velocity using algorithm 9.3. We define the sets function as follows:
function sets(sys, d)
    disturbance_mag = 0.01
    θmin, θmax = -π/16, π/16
    ωmin, ωmax = -1.0, 1.0
    low = [θmin, ωmin]
    high = [θmax, ωmax]
    for i in 1:d
        append!(low, [-disturbance_mag, -disturbance_mag])
        append!(high, [disturbance_mag, disturbance_mag])
    end
    return Hyperrectangle(low=low, high=high)
end
The sets function returns the initial state set followed by the disturbance sets
for each time step. The plots below compare the one-step reachable set pro-
duced by conservative linearization with the set produced by a second-order
Taylor inclusion function. While conservative linearization still produces an
overapproximation, it captures the shape of the true reachable set better than
a Taylor inclusion function.
Figure 9.11. Symbolic and concrete computation of the reachable sets R1, R2, and R3 for the inverted pendulum system. The symbolic reachability algorithm directly computes R3 without explicitly computing R2 by considering r(s1, x1:3) as a single function. The concrete reachability algorithm computes R2 and R3 separately by considering r(s1, x1:2) and r(s2, x2:3) as separate functions.
The input dimension of the reachability function r(s, x1:d) grows with the depth d because the disturbances at every time step are part of its input. This growing input dimension causes the size of the gradient and Hessian to increase, leading to more expensive computations. Furthermore, the nonlinearities in the agent, environment, and sensor models compound over time, causing the accuracy of a linearized model to degrade as the depth increases.
Concrete reachability algorithms address these issues by decomposing the reach-
ability function into a sequence of simpler functions. Instead of overapproximating
the reachable set over the entire depth at once, they compute the overapproximate
reachable set for each time step individually. At each iteration, they use the over-
approximate reachable set from the previous time step as the input set for the next
time step. We refer to this process as concrete reachability because we concretize
the reachable set at each time step by explicitly computing an overapproximate
representation. In contrast, the algorithms presented thus far maintain a sym-
bolic representation of the reachable set at each time step and only concretize the
reachable set at depth d. For this reason, we refer to these algorithms as symbolic
reachability algorithms. Figure 9.11 illustrates the difference between symbolic and
concrete reachability algorithms.
Algorithms 9.4 and 9.5 implement concrete versions of the symbolic reach-
ability algorithms presented in algorithms 9.2 and 9.3, respectively. For each
depth in the time horizon, they compute the overapproximate reachable set for
the next step using the overapproximate reachable set from the previous step.
Algorithm 9.4 concretizes the reachable set into a hyperrectangle at each time
step, while algorithm 9.5 concretizes the reachable set into a polytope at each
time step.
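A minimal sketch (not algorithms 9.4 or 9.5) of the concrete recursion: the overapproximate reachable set from one step becomes the input set for the next. The one-step overapproximation overapproximate_step is a hypothetical placeholder (for example, a natural inclusion or conservative linearization step).

function concrete_reach(𝒮, 𝒳, h)
    sets = Any[]
    R = 𝒮
    for d in 1:h
        R = overapproximate_step(R, 𝒳)   # concretize at every step (hypothetical helper)
        push!(sets, R)
    end
    return sets
end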
Concrete reachability algorithms are generally more computationally efficient
than symbolic reachability algorithms. However, it is not always clear whether
they will produce tighter overapproximations because there are multiple factors
that contribute to the overapproximation error. The only source of overapproxima-
tion error in symbolic reachability algorithms is the error introduced by linearizing
the reachability function and bounding the remainder term. We expect this lin-
earization error to be smaller for concrete reachability algorithms because they
linearize over a single time step rather than the entire time horizon.
While concrete reachability algorithms reduce overapproximation error due to
linearization, they introduce additional overapproximation error by concretizing
the reachable set at each time step into an overapproximate reachable set (fig-
ure 9.11). This error compounds over time, and the accumulation of this error is
often referred to as the wrapping effect.
The decrease in linearization error and introduction of the wrapping effect
for concrete reachability algorithms result in a tradeoff between concrete and
symbolic reachability (figures 9.12 and 9.13). The choice of which type of al-
gorithm to use depends on the specific system, the reachability algorithm, and
the desired tradeoff between computational efficiency and overapproximation
error. For example, if we are using linearized models for reachability and the
one-step reachability function is nearly linear, concrete reachability algorithms
may produce tighter overapproximations than symbolic reachability algorithms.
It is common to mix concrete and symbolic reachability algorithms to take advan-
tage of the strengths of each approach. For example, instead of concretizing the
reachable set at each time step, we can concretize the reachable set every k time
steps to reduce the wrapping effect.
Another benefit of using concrete reachability algorithms is that we can use
them to check for invariant sets. Similar to the check for invariance described for
the set propagation techniques in section 8.2, if we find that the reachable set at
a given time step is contained within the concrete reachable set at the previous
time step, we can conclude that the reachable set is invariant. For example, the
concrete versions of R6 in figures 9.12 and 9.13 are contained within the concrete
versions of R5 . Therefore, we can conclude that R6 is an invariant set in both
cases, meaning that the system will remain within the set for all future time steps.
9.5 Optimization-Based Nonlinear Reachability
Similar to the ideas in section 8.5, we can overapproximate the reachable set of nonlinear systems by sampling the support function. For symbolic reachability, we solve
minimize (over sd, x1:d)   dᵀ sd
subject to   s1 ∈ S
             xt ∈ Xt   for all t ∈ 1:d
             sd = r(s1, x1:d)      (9.27)
For concrete reachability, we replace the last constraint with a constraint for each
time step as follows:
minimize (over sd, x1:d, α)   dᵀ sd
subject to   s1 ∈ S
             xt ∈ Xt   for all t ∈ 1:d
             sd = r(sc, xc) + J[s1 − sc; x1:d − xc] + α
             α ∈ [α]      (9.29)
where sc and xc are the centers of the state and disturbance sets and J is the
Jacobian of the reachability function evaluated at sc and xc . We introduce another
decision variable α to represent the remainder term and constrain it to belong
to the Lagrange remainder interval [α] (equation (9.25)). The concrete version applies this linearization one time step at a time.
Encoding a piecewise linear reachability function as a set of mixed-integer constraints turns equation (9.27) into a MILP, which we can solve using a variety of algorithms.¹⁴
¹⁴ A detailed overview of integer programming can be found in L. A. Wolsey, Integer Programming. Wiley, 2020. Modern solvers, such as Gurobi and CPLEX, can routinely handle problems with millions of variables. There are packages for Julia that provide access to Gurobi, CPLEX, and a variety of other solvers.
While many real-world nonlinear systems do not have piecewise linear reachability functions, we can overapproximate them with piecewise linear bounds. First, we decompose the reachability function into a conjunction of elementary nonlinear functions (see example 9.6).
Example 9.4. Writing the ReLU function in terms of the max function. Consider the following piecewise linear function:
f(x) = 0 if x < 0, and f(x) = x otherwise
This function is often referred to as the rectified linear unit (ReLU) function and is commonly used in neural networks. We can rewrite this function in terms of the max function as follows:
f(x) = max(0, x)
If x < 0, the max function will return 0, and if x ≥ 0, the max function will return x.
For each nonlinear elementary function, we can derive piecewise linear lower and upper bounds over a given interval. We can then convert those bounds to mixed-integer constraints and solve the resulting MILP to overapproximate the reachable set.¹⁵
For more details on the process of
deriving the bounds and convert-
ing to constraints, see C. Sidrane, A.
9.6 Partitioning Maleki, A. Irfan, and M. J. Kochen-
derfer, “OVERT: An Algorithm for
Safety Verification of Neural Net-
The methods presented in this chapter tend to result in less overapproximation work Control Policies for Nonlin-
error when computing reachable sets over smaller regions of the input space. For ear Systems,” Journal of Machine
Learning Research, vol. 23, no. 117,
example, Taylor approximations are more accurate for points near the center of the
pp. 1–45, 2022.
region and become less accurate as we move away from the center (figure 9.15).
Therefore, we want to keep the input set for Taylor inclusion functions and Taylor models as small as possible to minimize overapproximation error. We can reduce this overapproximation error in our reachability algorithms by partitioning the input set into smaller regions and computing the reachable set for each region separately. Specifically, we divide the input set S into a set of smaller regions S^(1), S^(2), ..., S^(m) such that

$$\mathcal{S} = \bigcup_{i=1}^{m} \mathcal{S}^{(i)} \tag{9.30}$$

To compute the reachable set at depth d, we compute the reachable set R_d^(i) for each region S^(i) separately and then combine the results to form the full reachable set.

Figure 9.15. First-order Taylor approximation (dashed blue line) for the function f(x) = x − sin(x) (gray) centered at x = 0. The approximation is more accurate near the center.
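To make the effect of partitioning concrete, the following minimal Julia sketch compares a natural inclusion function for f(x) = x − sin(x) evaluated over a full input interval against the union of its evaluations over a uniform partition of that interval. It is an illustration only, not one of the book's algorithms; the interval is restricted to [−1.5, 1.5], where sin is monotone, so the interval extension of sin can be computed from the endpoints.

# Natural inclusion of f(x) = x - sin(x) on [lo, hi] ⊆ [-π/2, π/2],
# where sin is increasing, so sin([lo, hi]) = [sin(lo), sin(hi)].
function natural_inclusion(lo, hi)
    return (lo - sin(hi), hi - sin(lo))   # [x] - sin([x])
end

# Overapproximate the range of f over [lo, hi] using a union over m partitions.
function partitioned_inclusion(lo, hi, m)
    edges = range(lo, hi, length=m+1)
    bounds = [natural_inclusion(edges[i], edges[i+1]) for i in 1:m]
    return (minimum(first.(bounds)), maximum(last.(bounds)))
end

natural_inclusion(-1.5, 1.5)        # loose bound: approximately (-2.497, 2.497)
partitioned_inclusion(-1.5, 1.5, 6) # tighter bound: approximately (-0.659, 0.659)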
Example 9.5. Mixed-integer formulation of the ReLU function.

Suppose we want to solve an optimization problem with the following piecewise linear constraint:

$$y = \max(0, x)$$

We will also assume that we know that x lies in the interval $[\underline{x}, \bar{x}]$. We can encode this constraint using a set of mixed-integer constraints as follows:

$$
\begin{aligned}
y &\le x - \underline{x}(1 - a) \\
y &\ge x \\
y &\le \bar{x} a \\
y &\ge 0 \\
a &\in \{0, 1\}
\end{aligned}
$$

The plots below iteratively build up the constrained region for each possible value of a. (Plot panels for a = 0 and a = 1 omitted.)
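As a concrete illustration, the following Julia sketch encodes these constraints with the JuMP modeling package and the HiGHS solver and maximizes y over the input interval. The package choice and the bounds x̲ = −2, x̄ = 3 are assumptions made for this example, not part of the original text.

using JuMP, HiGHS

xlo, xhi = -2.0, 3.0                      # assumed bounds on x
model = Model(HiGHS.Optimizer)
@variable(model, xlo <= x <= xhi)
@variable(model, y >= 0)
@variable(model, a, Bin)
@constraint(model, y <= x - xlo*(1 - a))  # y ≤ x - x̲(1 - a)
@constraint(model, y >= x)                # y ≥ x
@constraint(model, y <= xhi*a)            # y ≤ x̄ a
@objective(model, Max, y)                 # e.g., support of the output in one direction
optimize!(model)
value(y)                                  # returns 3.0, the maximum of max(0, x) over [-2, 3]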
Example 9.6. Converting a nonlinear equality constraint into a set of mixed-integer constraints using piecewise linear bounds. We use the OVERT.jl package to compute the overapproximations.

Consider the nonlinear constraint y = x − sin(x) over the region −2 ≤ x ≤ 2. We can convert this constraint into a set of piecewise linear constraints by first decomposing the function into its elementary functions:

$$
\begin{aligned}
y &= x - z \\
z &= \sin(x) \\
-2 &\le x \le 2
\end{aligned}
$$

We then derive a piecewise linear lower bound $\underline{z}$ and upper bound $\bar{z}$ for sin(x) and rewrite the constraints as

$$
\begin{aligned}
y &= x - z \\
\underline{z} &\le z \le \bar{z} \\
\underline{z} &= \underline{z}(x) \\
\bar{z} &= \bar{z}(x) \\
-2 &\le x \le 2
\end{aligned}
$$

The plots below show the overapproximations of sin(x) using different numbers of linear segments. (Plot panels omitted.) The final step is to convert the piecewise linear functions $\underline{z}(x)$ and $\bar{z}(x)$ into their corresponding mixed-integer constraints. The overapproximations become tighter as the number of segments increases, but the computational cost and the number of mixed-integer constraints required to represent the piecewise linear bounds also increase.
Figure 9.16. Computing the one-step reachable set for the inverted pendulum system using partitioning. The input set S is partitioned into four regions, and the reachable set for each region is computed separately using a first-order Taylor inclusion function. The union of the resulting output sets forms the full reachable set. (Panels: Input Set, Input Partition, Output Partition, Output Set.)
9.7 Neural Networks

We can use the techniques discussed in the previous sections to verify properties of neural networks. Neural networks are a class of functions that are widely used in machine learning and could be used to represent the agent, environment, or sensor model. They are composed of a series of layers, each of which applies an affine transformation followed by a nonlinear activation function.17 Given a set of inputs to a neural network, we are often interested in understanding the possible outputs.18 For example, we may want to ensure that an aircraft collision avoidance system will always output an alert when other aircraft are nearby.

17. More details about the structure and training of neural networks are found in appendix C.

18. This process is sometimes referred to as neural network verification. A detailed overview of neural network verification can be found in C. Liu, T. Arnon, C. Lazarus, C. Strong, C. Barrett, and M. J. Kochenderfer, "Algorithms for Verifying Deep Neural Networks," Foundations and Trends in Optimization, vol. 4, no. 3–4, pp. 244–404, 2021.

Evaluating a neural network is similar to performing a rollout of a system. However, instead of computing s_{t+1} by passing s_t through the sensor, agent, and environment models, we compute it by passing s_t through the tth layer of the neural network. If s_t is the input to layer t, then the output s_{t+1} is computed as

$$s_{t+1} = \phi(W_t s_t + b_t) \tag{9.32}$$

where W_t is a matrix of weights, b_t is a bias vector, and φ(·) is a nonlinear activation function. Common activation functions include ReLU, sigmoid, and hyperbolic tangent. Figure 9.18 shows an example of a two-layer neural network. In this context, we can check properties of the neural network by computing the reachable set of the output layer given an input set.

For piecewise linear activation functions, we can compute the exact reachable set by partitioning the input space into different activation sets and computing
Figure 9.20. Computing the overapproximate reachable set of a two-layer neural network using natural inclusion functions. The true reachable set for each layer is shown in blue, and the interval overapproximation is shown in purple. (Panels: Input Set, Layer 1 Output, Output Set.)
the reachable set for each subset separately.19 For example, we can compute exact reachable sets for neural networks with ReLU activation functions (example 9.7). However, the number of subsets grows exponentially with the number of nodes in the network. Therefore, exact reachability analysis is often intractable for large neural networks, so it is common to instead use overapproximation techniques to bound the output set.

19. W. Xiang, H.-D. Tran, J. A. Rosenfeld, and T. T. Johnson, "Reachable Set Estimation and Safety Verification for Piecewise Linear Systems with Neural Network Controllers," in American Control Conference (ACC), 2018.

Similar to the nonlinear systems discussed earlier, we can use inclusion functions to overapproximate the output set of neural networks. By replacing each activation function with its interval counterpart, we obtain the natural inclusion function for a neural network. Figure 9.19 shows an example evaluation of the interval counterpart for the ReLU function. Evaluating the natural inclusion function for the network on a set of input intervals provides an overapproximation of the possible network outputs (figure 9.20).
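The following Julia sketch illustrates one way to evaluate a natural inclusion function for a small ReLU network by propagating elementwise lower and upper bounds through each layer. The two-layer weights, biases, and input box are made up for illustration; this is a minimal sketch rather than the book's implementation.

# Propagate elementwise bounds [lo, hi] through one affine layer followed by a
# ReLU activation. Splitting the weights by sign gives bounds that are valid for
# every input in the box.
function layer_bounds(W, b, lo, hi)
    Wp, Wm = max.(W, 0.0), min.(W, 0.0)
    zlo = Wp*lo .+ Wm*hi .+ b
    zhi = Wp*hi .+ Wm*lo .+ b
    return max.(zlo, 0.0), max.(zhi, 0.0)   # interval counterpart of ReLU
end

# Two-layer example network (weights and biases assumed for illustration).
W1, b1 = [1.0 -1.0; 0.5 2.0], [0.1, -0.2]
W2, b2 = [1.0 1.0; -1.0 0.5], [0.0, 0.0]

lo, hi = [-1.0, -1.0], [1.0, 1.0]           # input set: the box [-1, 1] × [-1, 1]
lo, hi = layer_bounds(W1, b1, lo, hi)
lo, hi = layer_bounds(W2, b2, lo, hi)       # elementwise bounds on the network output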
Example 9.7. Exact reachability for a two-layer neural network with ReLU activation functions.

Suppose we want to propagate the input set S_1 shown below through the first layer of the neural network in figure 9.18. We first apply the linear transformation to obtain the pre-activation region Z_2 = W_1 S_1 ⊕ b_1. (Intermediate plots omitted.) To compute the final output set of the neural network in figure 9.18, we would repeat this process for each of the subsets that comprise S_2. The final output set will therefore be the union of 16 subsets.
where f_n(s_1) is the neural network function and S is the input set. For ReLU networks, it is possible to write this optimization problem as a MILP by converting each ReLU activation function into its corresponding mixed-integer constraints (see example 9.5). To create the mixed-integer constraints, we need an upper and lower bound on the input to each ReLU. We can either select a sufficiently large bound for all nodes22 or compute specific bounds by evaluating the natural inclusion function.23 To compute an overapproximation of the output set, we evaluate the support function in multiple directions.

22. M. Akintunde, A. Lomuscio, L. Maganti, and E. Pirovano, "Reachability Analysis for Neural Agent-Environment Systems," in International Conference on Principles of Knowledge Representation and Reasoning, 2018.

23. V. Tjeng, K. Y. Xiao, and R. Tedrake, "Evaluating Robustness of Neural Networks with Mixed Integer Programming," in International Conference on Learning Representations (ICLR), 2018.

In addition to evaluating the support function, we can use the MILP formulation to check other properties of the neural network by changing the objective function or adding constraints.24 For example, we can check if the output set intersects with a given avoid set or find the maximum disturbance that causes the network to change its output. In general, neural network verification approaches can be combined with the techniques discussed in this chapter to verify closed-loop properties of systems that contain neural networks.25

24. C. A. Strong, H. Wu, A. Zeljic, K. D. Julian, G. Katz, C. Barrett, and M. J. Kochenderfer, "Global Optimization of Objective Functions Represented by ReLU Networks," Machine Learning, vol. 112, pp. 3685–3712, 2023.

25. M. Everett, G. Habibi, C. Sun, and J. P. How, "Reachability Analysis of Neural Feedback Loops," IEEE Access, vol. 9, pp. 163938–163953, 2021.

9.8 Summary

• Reachable sets for nonlinear systems are often nonconvex and difficult to compute exactly.

• We can apply a variety of techniques to overapproximate the reachable sets of nonlinear systems.
• We can use interval arithmetic to create inclusion functions that provide overapproximate output intervals for nonlinear functions.

• We can sample the support function of the reachable set for nonlinear systems by solving an overapproximate linear program or mixed-integer linear program.

• We can extend some of the techniques outlined in this chapter to analyze the output sets of neural networks.
10 Reachability for Discrete Systems
While the techniques in chapters 8 and 9 focus on reachability for systems with
continuous states, this chapter focuses on reachability for systems with discrete
states. We begin by representing the transitions of a discrete system as a directed
graph. This formulation allows us to use graph search algorithms to perform
reachability analysis. Next, we discuss techniques for probabilistic reachability
analysis, in which we calculate the probability of reaching a particular state or
set of states. We conclude by discussing a method to apply these techniques to
continuous systems by abstracting them into discrete systems.
We can represent a discrete system as a directed graph in which each node represents a state and each edge represents a transition between states (figure 10.1). We can also associate a probability with each edge to represent the likelihood of the transition occurring.

Figure 10.1. Graph representation of a discrete system with two states, s1 and s2. The graph has a node for each state and an edge originating from each state for each possible transition. Each edge is labeled with the probability of the transition. For example, when we are in s1, we have a 0.8 probability of transitioning from s1 to s2.

Algorithm 10.1 creates a directed graph from a discrete system. For each discrete state, it computes the set of possible next states and their corresponding probabilities. It then adds an edge to the graph for each possible transition. Figure 10.2 shows the graph representation of the grid world system. For systems with large state spaces, it may be inefficient to store the full graph in memory. In these cases, we can represent the graph implicitly using a function that takes in a state and returns its possible successors.
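As a rough illustration of this construction (not the book's algorithm 10.1), the following Julia sketch builds an explicit graph as a dictionary that maps each state to its successors and transition probabilities, given a list of states and a successors function. The two-state system used to exercise it, including the 0.2 self-loop probability, is an assumption made for this example.

# Build an explicit transition graph: a dictionary mapping each state to a vector
# of (successor, probability) pairs. `successors(s)` returns the possible next
# states of s along with their probabilities.
function build_graph(𝒮, successors)
    graph = Dict()
    for s in 𝒮
        s′s, ps = successors(s)
        graph[s] = collect(zip(s′s, ps))
    end
    return graph
end

# Example: the two-state system from figure 10.1 (self-loop probability assumed).
successors(s) = s == :s1 ? ([:s1, :s2], [0.2, 0.8]) : ([:s2], [1.0])
graph = build_graph([:s1, :s2], successors)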
10.2 Reachable Sets
To compute reachable sets, we ignore the probabilities associated with the edges of the graph and focus only on its connectivity. The reachable sets are represented as collections of discrete states. We focus on two types of reachability analysis: forward reachability and backward reachability. Forward reachability analysis determines the set of states that can be reached from a given set of initial states within a specified time horizon.1 Backward reachability analysis determines the set of states from which a given set of target states can be reached within a specified time horizon. Figure 10.3 demonstrates the difference between the two types of reachability analysis, and the rest of this section presents algorithms for each type.

1. This process is sometimes referred to as bounded model checking.
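A minimal Julia sketch of forward reachability on an explicit graph is shown below. It repeatedly adds the successors of states discovered so far, up to a horizon d, and stops early once the set no longer grows (the convergence check discussed later in this chapter). The adjacency dictionary format follows the sketch above and is an assumption of this example, not the book's algorithm.

# Forward reachable set over horizon d, given an adjacency dictionary that maps
# each state to a vector of (successor, probability) pairs. Probabilities are
# ignored; only connectivity matters for reachable sets.
function forward_reachable(graph, initial_states, d)
    reached = Set(initial_states)
    for _ in 1:d
        new_states = Set(s′ for s in reached for (s′, _) in graph[s])
        issubset(new_states, reached) && break   # converged: no new states
        union!(reached, new_states)
    end
    return reached
end

graph = Dict(:s1 => [(:s1, 0.2), (:s2, 0.8)], :s2 => [(:s2, 1.0)])
forward_reachable(graph, [:s1], 3)   # Set([:s1, :s2])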
The reachable set has converged once it no longer changes. If we find that R_{1:d} = R_{1:d−1}, the reachable set has converged, and R_{1:∞} = R_{1:d}.2 We can also check for invariant sets by relaxing this condition. Specifically, if R_{1:d} ⊆ R_{1:d−1}, we can conclude that R_{1:d} is an invariant set and that the system will remain within this set for all future time steps (R_{1:∞} ⊆ R_{1:d}). Performing this check on discrete sets is straightforward because we can directly compare the states contained in each set.

2. This condition allows us to perform unbounded model checking, in which the output holds over all possible trajectories.
10.3 Satisfiability
We can use the forward and backward reachable sets of discrete systems to
determine whether they satisfy a reachability specification (figure 10.6). For
forward reachability, we check whether the target set intersects with the forward
reachable set. For backward reachability, we check whether the initial set intersects
with the backward reachable set. In both cases, these checks require us to compute
the full forward or backward reachable set. This process can be computationally
expensive, especially for systems with large state spaces.
We can further increase efficiency by using heuristics to prioritize paths that are more likely to lead to a counterexample. In cases where the system satisfies the specification and no counterexample exists, these algorithms have the same computational complexity as breadth-first search. Figure 10.7 compares the performance of breadth-first search, depth-first search, and heuristic search for finding counterexamples in the grid world problem.
Example 10.1. Demonstration of difficulties that arise when applying graph search algorithms to the wildfire problem.

The wildfire problem is an example of a problem in which graph search is intractable. Consider a wildfire scenario modeled as an n × n grid where each cell is either burning or not burning. At each time step, a burning cell has a nonzero probability of spreading the fire to each of its neighboring cells. A burning cell will also remain burning at the next time step with some probability. This problem has $2^{n^2}$ possible states, and a state with b burning cells has as many as $2^{5b}$ possible successors. For a 5 × 5 grid, the state space has $2^{25} = 3.4 \times 10^7$ states. For a 10 × 10 grid, that number increases to $2^{100} = 1.27 \times 10^{30}$ possible states. The example below shows the successors for a state where only the cell in the center is burning. (Grid plots omitted.)

Even though only one cell is burning, there are still 32 successor states. This number only increases as we increase the number of burning cells. A state with 10 burning cells has as many as $2^{50} = 1.13 \times 10^{15}$ successors. For most grid sizes, even partially computing and storing the graph for the wildfire problem is intractable. For this reason, we cannot use graph search algorithms for this problem and must turn to other methods such as Boolean satisfiability.
Example 10.2. Encoding the initial state of the wildfire problem as a propositional logic formula using the Satisfiability.jl package.

Consider a wildfire problem with a 10 × 10 grid and a time horizon of h = 20. The state at a particular time step is represented as a set of Boolean variables that represent whether each grid cell is burning. The SAT problem will therefore have 100 × 20 = 2000 Boolean variables representing the states at each time step. We can represent the initial state as a propositional logic formula that evaluates to true when the bottom left cell is burning and all other cells are not burning. The following code implements this formula:

n = 10 # grid is n x n
h = 20 # time horizon
@satvariable(burning[1:n, 1:n, 1:h], Bool)
init = burning[1, 1, 1] # bottom left cell is burning
for i in 1:n, j in 1:n
    if i ≠ 1 || j ≠ 1 # all other cells are not burning
        init = init ∧ ¬burning[i, j, 1]
    end
end
Combining the initial state and transition formulas with the failure condition, we can create a single propositional logic formula that represents the reachability problem. A SAT solver will search the space of possible values for the Boolean variables s_{1:h} to find an assignment that satisfies the formula. A satisfying assignment corresponds to a feasible trajectory that satisfies the failure condition. Therefore, if the SAT solver determines that there are no satisfying assignments, we can conclude that the system satisfies the specification. Example 10.4 demonstrates how to use Boolean satisfiability to check reachability specifications for the wildfire problem.
Example 10.3. Encoding the transitions of the wildfire problem as a propositional logic formula using the Satisfiability.jl package.

The following code implements the propositional logic formula for the transitions of the wildfire problem:

transition = true
for i in 1:n, j in 1:n, t in 1:h-1
    transition = transition ∧ (
        burning[i, j, t+1] ⟹
            (burning[i, j, t] ∨
             burning[max(1, i-1), j, t] ∨
             burning[min(n, i+1), j, t] ∨
             burning[i, max(1, j-1), t] ∨
             burning[i, min(n, j+1), t])
    )
end

(Two example grid transitions are omitted: one for which the transition formula evaluates to true and one for which it evaluates to false.) In the first case, both cells burning at time t + 1 were either burning at time t or had a neighbor that was burning at time t. In the second case, the cell at (3, 4) was not burning at time t, and none of its neighbors were burning at time t.
Example 10.4. Checking reachability specifications for the wildfire problem using Boolean satisfiability.

Suppose there is a densely populated area in the top right cell of the wildfire grid, and we want to determine whether it might burn. We can encode the failure condition as a propositional logic formula that evaluates to true when the top right cell is burning at some time step. We can then combine this formula with the initial state and transition formulas from equation (10.1) and pass it to a SAT solver to determine whether the top right cell is reachable. The following code demonstrates this process:

ψ = ¬reduce(∨, [burning[n, n, t] for t in 1:h]) # top right cell never burns
reachable = sat!(init ∧ transition ∧ ¬ψ)
10.4. probabilistic reachability 243
Example 10.5. Comparison of reachable set analysis and probabilistic forward reachability analysis on the grid world problem.

Consider the grid world problem with a slip probability of 0.3. Running algorithm 10.2 with a time horizon h = 9 leads to the conclusion that the system is unsafe because the obstacle is included in the forward reachable set. However, the probability of reaching the obstacle after 9 steps when following the optimal policy is only 0.0004, and the system is more likely to be in a state near its nominal path to the goal. In this scenario, probabilistic reachability provides a more useful assessment of the actual safety of the system. The plots below show the reachable set (left) and the results of a probabilistic reachability analysis (right). (Plots omitted.)
$$P_{t+1}(s) = \sum_{s' \in \mathcal{S}} T(s', s)\, P_t(s') \tag{10.2}$$
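Equation (10.2) propagates the occupancy distribution over states forward one step at a time. In matrix form, each step is a single matrix-vector product, as in the following Julia sketch (an illustration with an assumed two-state transition matrix, not the book's algorithm):

# T[i, j] is the probability of transitioning from state i to state j, so the
# update P_{t+1}(s) = Σ_{s'} T(s', s) P_t(s') is P = T' * P.
T = [0.2 0.8;
     0.0 1.0]          # assumed two-state system (rows sum to one)
P = [1.0, 0.0]         # initial occupancy distribution P₁

horizon = 9
for _ in 2:horizon
    global P = T' * P  # occupancy distribution at the next time step
end
P                      # occupancy distribution at time step `horizon`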
Example 10.6. Determining occupancy probabilities for the grid world problem.

The plots below show the results from probabilistic occupancy analysis on the grid world problem with a slip probability of 0.3. They show the distribution over reachable states at different time steps, with reachable states appearing larger and darker states indicating a higher probability of reaching them. The nominal path is highlighted in gray. (Plots omitted.)

While the obstacle state is reachable in three of the plots, the probability of occupying the obstacle state is low and the probability is much higher for states near the nominal path. After 50 time steps, most of the probability mass is in the goal state with a small portion in the obstacle state and the other grid cells. At this point, the probability of being in the goal state is 0.981 and the probability of being in the obstacle state is 0.018. We can use these numbers to draw conclusions about the overall safety of the system.
In other words, for states in the target set, the probability of reaching the target set is 1. For all other states, the probability of reaching the target set within t + 1 time steps is the sum of the probability of transitioning to each of its successors times the probability that they reach the target set within t time steps. We initialize R_1 to be 1 for states in the target set and 0 otherwise.

We can use the results of this analysis to identify dangerous states for the system. Furthermore, if we know the initial state distribution P_1 for the system, we can determine the probability of reaching the target set within a given time horizon h by summing the probability of reaching the target set from each state weighted by the probability of occupying that state at time t = 1:

$$P_{\text{reach}} = \sum_{s \in \mathcal{S}} P_1(s)\, R_h(s)$$
In many scenarios, running the analysis for a sufficiently long time horizon is enough to draw conclusions about the overall safety of the system. However, it is also possible to compute the probability of reaching the target set in the limit as the time horizon approaches infinity. This probability is known as the infinite-horizon reachability probability, and we denote it as R_∞(s).

To compute this probability, we rewrite the recursive relationship in equation (10.3) as

$$R_{t+1}(s) = R_1(s) + \sum_{s' \in \mathcal{S}} T_R(s, s')\, R_t(s') \tag{10.5}$$

where

$$T_R(s, s') = \begin{cases} 0 & \text{if } s \in \mathcal{S}_T \\ T(s, s') & \text{otherwise} \end{cases} \tag{10.6}$$

Figure 10.9. Probability of reaching the goal state and the obstacle state in the grid world problem with a slip probability of 0.6 as a function of the time horizon. We assume the system is initialized in the bottom left corner. As the horizon increases, the probabilities begin to converge.
In matrix form, this update becomes

$$R_{t+1} = R_1 + T_R R_t \tag{10.7}$$

where R_t is a vector of length |S| such that the ith entry corresponds to R_t(s_i), and T_R is a matrix of size |S| × |S| such that the entry in the ith row and jth column corresponds to T_R(s_i, s_j).10

10. This formulation is equivalent to a Markov reward process with an immediate reward of 1 for all states in the target set and 0 otherwise. The states in the target set are terminal states.

For an infinite horizon, we have that

$$R_\infty = R_1 + T_R R_\infty \tag{10.8}$$

We can solve for R_∞ by rearranging the terms in equation (10.8) to get

$$
\begin{aligned}
R_\infty - T_R R_\infty &= R_1 & (10.9) \\
(I - T_R) R_\infty &= R_1 & (10.10) \\
R_\infty &= (I - T_R)^{-1} R_1 & (10.11)
\end{aligned}
$$
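A minimal Julia sketch of these computations is shown below. It builds T_R by zeroing the rows of the target states, iterates equation (10.7) for a finite horizon, and solves the linear system in equation (10.11) for the infinite-horizon probabilities. The three-state transition matrix is an assumption made for illustration.

using LinearAlgebra

T = [0.7 0.2 0.1;
     0.1 0.8 0.1;
     0.0 0.0 1.0]            # assumed transition matrix (state 3 is absorbing)
target = [3]                 # indices of target (e.g., failure) states

R1 = zeros(size(T, 1)); R1[target] .= 1.0
TR = copy(T); TR[target, :] .= 0.0   # target states are terminal (equation 10.6)

# Finite-horizon reachability probabilities, equation (10.7).
Rt = copy(R1)
for _ in 2:10
    global Rt = R1 + TR * Rt
end

# Infinite-horizon reachability probabilities, equation (10.11).
Rinf = (I - TR) \ R1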
10.5 Discrete State Abstractions

The methods discussed in this chapter apply only to discrete systems. However, we can use them to produce overapproximate reachability results for continuous systems by creating a discrete state abstraction (DSA). To create a discrete state
abstraction, we partition the continuous state space into a finite number of smaller regions. We then create a graph where the nodes correspond to the regions, and the edges correspond to transitions between regions. Figure 10.10 shows the process of creating a DSA for the inverted pendulum problem.

Example 10.7. Infinite-horizon probability of reaching the obstacle for different slip probabilities in the grid world problem.

Suppose we want to understand the probability of reaching the obstacle state for grid world problems with different slip probabilities. The plots below show the results of infinite-horizon reachability analysis with the obstacle as the target set for slip probabilities of 0.3, 0.5, and 0.7. For each slip probability, we compute P_fail assuming we start in the bottom left corner of the grid. (Plots omitted.)
10.6 Summary
• Backward reachability algorithms begin with a set of target states and calculate the set of states that can reach the target set in a given time horizon.
Example 10.8. Creating a DSA for the inverted pendulum system using algorithm 10.1. The plots show the process of determining the connectivity of the graph for a single region S^(i). The plot below shows the graph for the final DSA with a uniform partition of the state space into 64 regions.

We can create a DSA for the inverted pendulum system using algorithm 10.1 by defining the states function to partition the state space into a grid of regions and the successors function to determine the connectivity of the graph using a nonlinear forward reachability technique such as conservative linearization. Example implementations are as follows:

function states(env::InvertedPendulum; nθ=8, nω=8)
    θs, ωs = range(-1.2, 1.2, length=nθ+1), range(-1.2, 1.2, length=nω+1)
    𝒮 = [Hyperrectangle(low=[θlo, ωlo], high=[θhi, ωhi])
         for (θlo, θhi) in zip(θs[1:end-1], θs[2:end])
         for (ωlo, ωhi) in zip(ωs[1:end-1], ωs[2:end])]
    return 𝒮
end

function successors(sys, 𝒮⁽ⁱ⁾)
    _, 𝒳 = sets(sys, 2)
    ℛ⁽ⁱ⁾ = conservative_linearization(sys, 𝒮⁽ⁱ⁾ × 𝒳)
    ℛ⁽ⁱ⁾ = VPolytope([clamp.(v, -1.2, 1.2) for v in vertices_list(ℛ⁽ⁱ⁾)])
    𝒮⁽ʲ⁾s = filter(𝒮⁽ʲ⁾->!isempty(ℛ⁽ⁱ⁾ ∩ 𝒮⁽ʲ⁾), states(sys.env))
    return 𝒮⁽ʲ⁾s, ones(length(𝒮⁽ʲ⁾s))
end

The plots below demonstrate the successors function on an example state S^(i). The function first computes R^(i) using conservative linearization (left). It then determines the regions S^(j) that intersect with R^(i) (middle). Finally, the function returns these regions so that they can be connected in the graph (right). The edge weights can be ignored when computing reachable sets. (Plot panels omitted.)

Algorithm 10.1 calls the successors function for each region in the partition to determine the connectivity of the graph. The result is shown in the caption.
Example 10.9. Overapproximation of the transition probabilities for the DSA of the continuum world system.

Suppose we have a continuum world problem with Gaussian disturbances on its transitions. For example, if the agent takes the up action, its next position is sampled from a Gaussian distribution with a mean 1 unit above its current state and a standard deviation of 1 in each direction. In other words, T(s, s′) = N(s′ | s + d, I), where d is the direction vector corresponding to the action taken in the state s. Our goal is to determine the overapproximated transition probabilities T(S, S′) for a DSA of the continuous system.

To obtain the probability of transitioning from a specific state s to a region in the partition S′, we integrate the transition function such that $T(s, \mathcal{S}') = \int_{\mathcal{S}'} T(s, s')\, ds'$. To obtain an overapproximation of the transition probabilities, we select the state in the current region S that results in the highest probability of reaching the target region S′ such that $T(\mathcal{S}, \mathcal{S}') = \max_{s \in \mathcal{S}} T(s, \mathcal{S}')$. The plots below demonstrate this process for a single state s and the regions in the DSA. (Panels: T(s, s′), T(s, S′), T(S, S′).)

The maximization in the formula for T(S, S′) finds the state in S that puts the highest amount of probability mass in S′. The plots below demonstrate this maximization for three different next regions S′. This process produces an overapproximation of the transition probabilities since we assume all states in S transition to the worst-case next state.
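The following Julia sketch illustrates this overapproximation for the Gaussian transition model above. Because the covariance is the identity, the probability of landing in an axis-aligned rectangular region factors into one-dimensional Gaussian CDF differences, and the maximization over the current region is approximated by evaluating a grid of sample states. The specific regions and grid resolution are assumptions of this example.

using Distributions

# Probability that the successor of state s (mean s + d, identity covariance)
# lands in the axis-aligned region [lo, hi]: a product of 1-D CDF differences.
function transition_prob(s, d, lo, hi)
    μ = s .+ d
    return prod(cdf(Normal(μ[k], 1.0), hi[k]) - cdf(Normal(μ[k], 1.0), lo[k])
                for k in eachindex(μ))
end

# Overapproximate T(S, S′) by maximizing over a grid of states sampled from S.
function max_transition_prob(Slo, Shi, d, S′lo, S′hi; m=10)
    xs = range(Slo[1], Shi[1], length=m)
    ys = range(Slo[2], Shi[2], length=m)
    return maximum(transition_prob([x, y], d, S′lo, S′hi) for x in xs, y in ys)
end

d = [0.0, 1.0]   # direction vector for the up action
max_transition_prob([0.0, 0.0], [1.0, 1.0], d, [0.0, 1.0], [1.0, 2.0])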
11 Explainability
11.1 Explanations
We can create these plots by performing rollouts of the policy and plotting the state of the system at each time step. This visualization can help us understand how the policy behaves in different scenarios and identify potential failure modes. Figure 11.1 shows an example of this visualization technique for the aircraft collision avoidance and inverted pendulum policies.

If the policy is Markov and therefore depends only on the current state, we can also visualize it directly by plotting the action taken by the agent in each state. If the state space is two-dimensional as in the inverted pendulum example, we can plot the action taken by the agent as a two-dimensional heatmap (figure 11.2). For higher-dimensional state spaces, we often need to apply dimensionality reduction techniques to visualize the policy. One common technique is to fix all but two of the state variables, which become associated with the vertical and horizontal axes. We can indicate the action for every state with a color. Example 11.1 demonstrates this technique for the collision avoidance policy.

Figure 11.2. Visualization of the actions taken by the inverted pendulum policy. The colors represent the torque applied in each state.

Instead of fixing the hidden state variables, we could also use various techniques to aggregate over them (figure 11.3). One method involves partitioning the state space into a set of regions and keeping track of the actions taken in each region over a series of rollouts. We can then aggregate over these actions by plotting the mean or mode of the actions taken in each region. One benefit of this technique is that it relies only on rollouts of the policy and therefore extends to non-Markovian policies. Because all states may not be reachable in practice, some areas of the policy plot may have no data associated with them.
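A minimal Julia sketch of the heatmap visualization for a two-dimensional Markov policy is shown below, using the Plots.jl package. The simple threshold policy is an assumption made for illustration; it stands in for a learned policy that maps a state (θ, ω) to a torque.

using Plots

# Assumed stand-in policy: apply torque opposing the angle and angular velocity.
policy(θ, ω) = clamp(-2θ - ω, -1.0, 1.0)

θs = range(-1.2, 1.2, length=101)
ωs = range(-1.2, 1.2, length=101)
torques = [policy(θ, ω) for ω in ωs, θ in θs]   # rows: ω, columns: θ

heatmap(θs, ωs, torques, xlabel="θ (rad)", ylabel="ω (rad/s)",
        colorbar_title="torque")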
Example 11.1. Aircraft collision avoidance policy when the relative vertical rate is fixed at 0 m/s (left) and 4 m/s (right), and the previous action is fixed at no advisory. The colors represent the action taken by the agent in each state.

The figure in the caption shows policy plots for the aircraft collision avoidance policy when the relative vertical rate is fixed at 0 m/s and 4 m/s, and the previous action is fixed at no advisory. The red aircraft represents the relative location of the intruder aircraft. We can use these plots to explain the behavior of the policy in these scenarios. For example, we can see that when the relative vertical rate is fixed at zero, the policy advises our aircraft to climb when it is above the intruder and descend when it is below the intruder. This behavior is aligned with our objective of avoiding collisions.

The plot on the left also reveals some potentially unexpected behaviors. For example, when the time to collision is near zero and a collision is imminent, the policy results in no advisory. This behavior may prompt us to perform further analysis. For example, a counterfactual analysis (see section 11.5) reveals that a collision is inevitable in this scenario regardless of the action taken by the agent due to limits on the vertical rate of the aircraft.

(Policy plots omitted. Axes: h (m) versus t_col (s); actions: no advisory, descend, climb.)
11.3 Feature Importance

11.3.1 Sensitivity Analysis
Example 11.2. Motivation for considering the interactions between features when determining feature importance.

Consider a wildfire scenario modeled as a grid where each cell is either burning or not burning. At each time step, there is a 30% chance that a cell that was not burning at the previous time step will be burning if at least one of its neighbors was burning. The plots below show an example of a current state s_t and the probability that each cell is burning at the next time step p(s_{t+1}) (darker cells indicate higher probability). Suppose we are interested in understanding the features that are most important in determining the probability that the cell in the upper right corner will burn. (Plots omitted.)

For this example, we will focus specifically on the feature that indicates whether the cell directly to the left of the upper right cell is burning. We can test the first definition of feature importance by changing that cell to not burning while holding all other cells constant and observing the effect on the probability that the upper right cell will burn. In this case (leftmost plots), the probability that the upper right cell will be burning at the next time step does not change. Therefore, we will conclude that this cell has no contribution to the output. However, if we remove fire from both this cell and the cell below the upper right cell (rightmost plots), the upper right cell changes to zero probability of burning at the next time step. The second definition of feature importance considers the interaction between these two features and would conclude that the cell does contribute to the output.
Example 11.3. Sensitivity analysis at a single time step. Brighter pixels in the sensitivity map indicate pixels with higher sensitivity.

Suppose we have an agent that selects a steering angle for an aircraft based on runway images from a camera mounted on its wing. Given a particular input image, we can generate a sensitivity map to identify the pixels that are most important in determining the steering angle by perturbing each pixel of interest while holding all other pixels fixed and checking the effect on the steering angle output. The results are shown below, where the left image is the original image and the right image is the sensitivity map. (Images omitted.) This analysis indicates that the agent is focusing on the portion of the runway in front of it where the lines are most visible.
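A minimal Julia sketch of this perturbation-based sensitivity map is shown below. It perturbs one pixel at a time, holds the rest of the image fixed, and records the resulting change in the output. The tiny image and the stand-in steering function are assumptions made for illustration; it is not the book's algorithm 11.1.

# Perturbation-based sensitivity map: change each pixel by δ while holding all
# other pixels fixed and record how much the output changes.
function sensitivity_map(f, img; δ=0.1)
    S = zeros(size(img))
    y = f(img)
    for idx in CartesianIndices(img)
        perturbed = copy(img)
        perturbed[idx] += δ
        S[idx] = abs(f(perturbed) - y) / δ
    end
    return S
end

# Assumed stand-in for a steering-angle model: a weighted sum of pixel values.
W = [0.0 0.1 0.0; 0.2 0.5 0.2; 0.0 0.1 0.0]
steering(img) = sum(W .* img)

img = rand(3, 3)
sensitivity_map(steering, img)   # largest entry corresponds to the center pixel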
Example 11.4. Sensitivity analysis over a full trajectory. Brighter colors in the sensitivity map of the inverted pendulum trajectory indicate higher sensitivity. The black line shows the true angle of the pendulum at each time step, and the colored markers indicate the noisy observation of the current angle at each time step.

We can use sensitivity analysis to understand the effect of disturbances on the outcome of a trajectory. For example, consider an inverted pendulum system in which the agent's observation of its current angle is subject to a noise disturbance. We can estimate the sensitivity of the robustness of a trajectory with respect to its disturbances by perturbing the disturbances at each time step and observing the effect on the robustness of the trajectory. The results on a given failure trajectory are shown below. (Plot omitted.) This analysis indicates that small changes in the disturbances at the beginning of the trajectory have a large effect on the robustness of the trajectory. Furthermore, the disturbances applied towards the end of the failure trajectory have little to no effect because the controller is saturated and the system cannot recover.
A simple way to produce a saliency map given a set of inputs is to take the gradient of the output of interest with respect to the inputs.7 The saliency of a particular input is related to the magnitude of the gradient at that input. A high gradient magnitude indicates that small changes in the input will result in large changes in the output. In other words, inputs with high gradient values are more salient and indicate higher sensitivity. This method is often used to determine the components of an observation (such as the pixels of an image) that contribute most to an agent's decision.8 We can also use it to approximate sensitivity over a full trajectory by taking the gradient of a performance measure with respect to input features such as actions or disturbances. Algorithm 11.2 measures the sensitivity of the robustness of a trajectory with respect to its disturbances, and figure 11.5 shows an example on the inverted pendulum system.

7. D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, "How to Explain Individual Classification Decisions," Journal of Machine Learning Research, vol. 11, pp. 1803–1831, 2010.

8. K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps," in International Conference on Learning Representations (ICLR), 2014.

While algorithm 11.2 is more computationally efficient than algorithm 11.1, it is limited by its local nature. Important input features, for example, often saturate the output function of interest, causing the gradient to be small even when the feature is important.9 The integrated gradients10 algorithm addresses this limitation by averaging the gradient along the path between a baseline input and the input of interest (figure 11.6). The choice of baseline depends on the context. For images, a common choice is a black image (figure 11.7). For disturbances, we can set all disturbances to zero.

9. For image inputs in particular, it has also been shown that there are sometimes meaningless local variations in gradients that can lead to noisy sensitivity maps. D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, "Smoothgrad: Removing Noise by Adding Noise," in International Conference on Machine Learning (ICML), 2017.

10. M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic Attribution for Deep Networks," in International Conference on Machine Learning (ICML), 2017.

Algorithm 11.3 calculates the sensitivity of the robustness of a trajectory with respect to the disturbances at each time step using integrated gradients. It takes m steps along the path between the baseline and the current input and computes the gradient of the robustness at each step. The algorithm then returns the average gradient at each time step.
Figure 11.8. Comparison of the sensitivity descriptions using algorithms 11.1 to 11.3 for an aircraft taxi system that selects a steering angle from an image observation. The sensitivity map focuses on the portion where the edge and center lines are most apparent, while the gradient-based methods focus only on the edges of the runway. The integrated gradients method provides a smoother map than the single gradient approach. (Panels: Original Image, Sensitivity, Gradient, Integrated Gradients.)
As m approaches infinity, the average gradient approaches the integral of the gradient along the path. Figure 11.8 compares the sensitivity estimates from algorithms 11.1 to 11.3 for an aircraft taxi system. All three methods produce slightly different descriptions of the agent's behavior, and in general, the most appropriate sensitivity estimate is application dependent.
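A minimal Julia sketch of the integrated gradients computation is shown below, using the ForwardDiff package for automatic differentiation. The quadratic test function and the zero baseline are assumptions made for illustration; it is not the book's algorithm 11.3.

using ForwardDiff

# Integrated gradients: average the gradient of f along the straight-line path
# from a baseline input to the input of interest, then scale by the difference.
function integrated_gradients(f, x, baseline; m=50)
    avg_grad = sum(ForwardDiff.gradient(f, baseline .+ (k/m) .* (x .- baseline))
                   for k in 1:m) ./ m
    return (x .- baseline) .* avg_grad   # attribution for each input feature
end

# Assumed test function and inputs for illustration.
f(x) = x[1]^2 + 3x[2]
x, baseline = [1.0, 2.0], zeros(2)
integrated_gradients(f, x, baseline)   # approximately [1.0, 6.0]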
The Shapley value φ_i of feature i is then defined as the average marginal contribution of feature i to the expectation of the outcome over all possible subsets of features:

$$\phi_i(x) = \sum_{\mathcal{I}_s \subseteq \mathcal{I} \setminus \{i\}} \frac{|\mathcal{I}_s|!\,(n - |\mathcal{I}_s| - 1)!}{n!} \left( f_{\mathcal{I}_s \cup \{i\}}(x) - f_{\mathcal{I}_s}(x) \right) \tag{11.2}$$

Intuitively, computing the Shapley value of feature i involves looping over all possible subsets of features that do not include i and computing the difference in the expectation of the outcome when adding i to the subset. The constant factor in equation (11.2) ensures that subsets of different sizes are weighted equally. In general, Shapley values are expensive (often intractable) to compute due to the large number of possible subsets. For example, a function with 100 input features has $6.3 \times 10^{29}$ possible subsets.
where π(n) represents the set of all possible permutations of n elements, j is the index in the permutation P that corresponds to feature i, and P_{1:j} represents the first j elements of P. We can then approximate the Shapley value using sampling. For each sample, we randomly permute the features and compute the difference in the expectation of the outcome when adding feature i to the features before it in the permutation.
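The following Julia sketch illustrates this sampling approximation for a generic function of a feature vector. Features not yet added to the coalition are replaced by values drawn from a reference sample, which stands in for taking an expectation over the missing features. The test function and reference distribution are assumptions made for illustration; this is not the book's algorithm 11.4, which operates on trajectory disturbances.

using Random

# Monte Carlo estimate of the Shapley value of feature i for function f at x.
# Features outside the sampled prefix are replaced by a random reference sample.
function shapley(f, x, i, sample_reference; m=1000)
    n = length(x)
    total = 0.0
    for _ in 1:m
        w = sample_reference()                 # replacement values for absent features
        P = randperm(n)
        j = findfirst(==(i), P)
        with_i = copy(w); with_i[P[1:j]] = x[P[1:j]]
        without_i = copy(with_i); without_i[i] = w[i]
        total += f(with_i) - f(without_i)
    end
    return total / m
end

f(x) = x[1] * x[2] + x[3]                      # assumed test function
x = [1.0, 2.0, 3.0]
shapley(f, x, 3, () -> randn(3))               # estimate for feature 3; ≈ 3.0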
Algorithm 11.4 estimates the Shapley values for the disturbances in a trajectory to determine their contribution to the robustness of the trajectory. It takes in a current trajectory τ with disturbance trajectory x and a number of samples per time step m. For each time step in the trajectory, the algorithm randomly samples another disturbance trajectory w by performing a rollout using the nominal trajectory distribution.13 It then samples a random permutation P of the time steps and performs a rollout in which the disturbances are taken from x for the time steps in P_{1:j} and from w for all other time steps. It similarly performs a rollout in which the disturbances are taken from x for the time steps in P_{1:j−1} and from w for all other time steps. The algorithm then computes the difference in the robustness of the two rollouts and averages the differences over m sampled permutations to estimate the Shapley value of each disturbance.

13. This step requires that the disturbances sampled at each time step are independent of one another. This assumption may break if the disturbances depend on the states, actions, or observations.
Figure 11.9 shows the Shapley values for the disturbances of the inverted pendulum trajectory used in example 11.4 and figure 11.5. The Shapley values differ from the sensitivity estimates because they account for interactions between disturbances. Removing groups of disturbances with high Shapley values produces a large change in the outcome.
11.4 Policy Explanation Through Surrogate Models

For agents with complex policies, it may be difficult to understand the reasoning behind their decisions. In such cases, we can build surrogate models to approximate the policy with a model that is easier to interpret. A good surrogate model should have the following characteristics:
• High Fidelity: The surrogate model should accurately represent the policy. If the surrogate model does not adequately represent the policy, the explanations it provides may be misleading.

One common choice for a surrogate model is a linear model. Linear models have the form

$$f(x) = \sum_{i=1}^{n} w_i x_i + b \tag{11.4}$$

where x_i is a feature of the observation, w_i is a weight for feature i, and b is the bias term. If the action space is discrete, we may apply the logistic or softmax function to the output of the linear model to obtain probabilities for each action. Linear surrogate models can be used to determine feature importance. The magnitudes of the weights of the linear model indicate the contribution of each feature to the agent's decision. Figure 11.10 demonstrates how to use a linear surrogate model to describe the behavior of a collision avoidance policy in two different regions of the observation space. This technique is particularly useful for high-dimensional observations, where it may be difficult to visualize the policy directly.
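A minimal Julia sketch of fitting a local linear surrogate is shown below. It samples observations near a point of interest, queries the policy, and solves a least-squares problem for the weights and bias; the weight magnitudes then serve as feature importances. The stand-in policy and the sampling region are assumptions made for illustration.

# Fit a local linear surrogate f(x) ≈ w'x + b to a policy around a point x₀.
function linear_surrogate(policy, x₀; radius=1.0, m=500)
    n = length(x₀)
    X = [x₀[j] + radius*(2rand() - 1) for i in 1:m, j in 1:n]  # samples near x₀
    y = [policy(X[i, :]) for i in 1:m]
    A = hcat(X, ones(m))              # append a column of ones for the bias term
    coeffs = A \ y                    # least-squares solution
    return coeffs[1:n], coeffs[end]   # weights w and bias b
end

# Assumed stand-in policy mapping an observation to a scalar action value.
policy(x) = tanh(2x[1] - 0.5x[2])

w, b = linear_surrogate(policy, [0.0, 0.0])
abs.(w)   # feature importances; the first feature dominates near the origin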
Figure 11.10. Linear surrogate model fit to samples in two different local regions (highlighted circles) of the observation space for a collision avoidance policy. The left column shows the original policy. (Panels: Original Policy, Linear Approximation, Feature Weights; the feature-weights panels list polynomial features such as h, t_col², h·t_col, h², t_col³, h·t_col², h²·t_col, and h³.)
11.5 Counterfactual Explanations
2. Distance to original input: The counterfactual input should be close to the original input τ to ensure that the change is minimal, resulting in the following objective:

$$f_{\text{close}}(\tau') = -\lVert \tau' - \tau \rVert_p \tag{11.6}$$
Example 11.5. Simple decision tree for the collision avoidance policy. The policy represented by the decision tree is shown below. (Plot omitted; axes h (m) versus t_col (s), with actions no advisory, descend, and climb.)

Suppose we want to train a decision tree to approximate the slice of the collision avoidance policy shown in example 11.1. The following decision tree was trained on a dataset of 100,000 randomly sampled states from the policy slice. The decision tree has a maximum depth of 2 and uses the state variables to make decisions. Nodes that split using h are shown in black, nodes that split using t_col are shown in gray, and the colors of the square leaf nodes are the actions taken by the agent. (Tree diagram omitted: the root splits on h at 0; for h < 0, a second split at −101 selects between descend and no advisory, and for h ≥ 0, a split at 98 selects between no advisory and climb.)

With a maximum depth of 2, the decision tree only makes decisions based on h. If h is positive, the tree selects whether to climb or issue no advisory based on the magnitude of the relative altitude. Similarly, if h is negative, the tree selects whether to descend or issue no advisory based on the magnitude of the relative altitude. The policy represented by the decision tree is shown in the caption. This decision tree provides a simple, interpretable model of the agent's policy. However, the fidelity of the decision tree is limited by its depth, and it misses some key features of the policy that depend on the time to collision.
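A rough Julia sketch of training such a surrogate with the DecisionTree.jl package is shown below. The stand-in policy slice mimics the thresholds described above and is an assumption of this example; in practice, the labels would come from querying the actual collision avoidance policy.

using DecisionTree

# Assumed stand-in for the policy slice: labels as a function of (h, t_col).
# This simple rule ignores t_col, matching the shallow tree described above.
function policy_label(h, tcol)
    h ≥ 98 && return "climb"
    h ≤ -101 && return "descend"
    return "no advisory"
end

# Sample states from the slice and query the policy to build a training set.
m = 100_000
X = hcat(800 .* rand(m) .- 400, 40 .* rand(m))   # columns: h ∈ [-400, 400], t_col ∈ [0, 40]
y = [policy_label(X[i, 1], X[i, 2]) for i in 1:m]

model = DecisionTreeClassifier(max_depth=2)
fit!(model, X, y)
predict(model, [150.0 10.0])   # predicted action for h = 150 m, t_col = 10 s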
3. Sparsity of the change: The difference between the original input and the counterfactual input should be sparse. In other words, the counterfactual input should differ in only a few features. We can use the following objective

4. Plausibility: The new input should be a plausible input. We can check plausibility using the likelihood of the counterfactual trajectory as follows:
11.6 Failure Mode Characterization
Another way to explain the behavior of a system is to characterize its failure modes. We can use clustering algorithms to create groupings of failure trajectories that are similar to one another. Identifying the similarities and differences between failures helps us understand their underlying causes. One common clustering algorithm is k-means22 (algorithm 11.6), which groups data points into k clusters based on their similarity to one another.23

22. This algorithm is also referred to as Lloyd's algorithm, named after Stuart P. Lloyd (1923–2007). S. Lloyd, "Least Squares Quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.

23. A detailed overview of clustering algorithms is provided in D. Xu and Y. Tian, "A Comprehensive Survey of Clustering Algorithms," Annals of Data Science, vol. 2, pp. 165–193, 2015.

To apply k-means, we must first extract a set of real-valued features from each failure trajectory to use for clustering. Let x represent the set of features from trajectory τ and φ be a feature extraction function such that x = φ(τ). To represent the clusters C, k-means keeps track of k cluster centroids μ_{1:k} in feature space and assigns each trajectory to the cluster with the closest centroid to its features. We begin by initializing the centroids to the features of k random trajectories. At each iteration, k-means performs the following steps (a minimal sketch follows the list):

1. Assign each trajectory to the cluster with the closest centroid to its feature vector. In other words, τ_i is assigned to cluster C_j when $d(\phi(\tau_i), \mu_j) \le d(\phi(\tau_i), \mu_{j'})$ for all j'.

2. Update the centroids to the mean of the feature vectors of the trajectories in each cluster such that

$$\mu_j = \frac{1}{|\mathcal{C}_j|} \sum_{\tau \in \mathcal{C}_j} \phi(\tau) \tag{11.10}$$
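The following Julia sketch implements these two steps for feature vectors that have already been extracted from failure trajectories. It uses Euclidean distance and synthetic data; it is a minimal illustration, not the book's algorithm 11.6.

using LinearAlgebra, Random

# Basic k-means on a vector of feature vectors xs, returning cluster assignments
# and centroids after a fixed number of iterations.
function kmeans(xs, k; iterations=20)
    μs = xs[randperm(length(xs))[1:k]]                 # initialize from k random points
    assignments = zeros(Int, length(xs))
    for _ in 1:iterations
        # Step 1: assign each point to the closest centroid.
        for (i, x) in enumerate(xs)
            assignments[i] = argmin([norm(x - μ) for μ in μs])
        end
        # Step 2: move each centroid to the mean of its assigned points.
        for j in 1:k
            members = xs[assignments .== j]
            isempty(members) || (μs[j] = sum(members) / length(members))
        end
    end
    return assignments, μs
end

xs = [randn(2) .+ [rand((0.0, 6.0)), 0.0] for _ in 1:200]  # two synthetic groups
assignments, centroids = kmeans(xs, 2)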
The clustering results help us understand the failure modes of the system. One way to interpret the clusters is to create a prototypical example for each cluster. The prototypical example for a given cluster is the trajectory that is closest to its centroid in feature space. By examining the prototypical examples, we can understand the characteristics of each failure mode. Figure 11.16 shows the prototypical examples for the final clusters in figure 11.15. At runtime, we can assign new failure trajectories to the cluster with the closest centroid to their features and use the prototypical examples to explain the failure mode of the trajectory.

Figure 11.16. Prototypical examples of failure modes in the inverted pendulum system using the clusters in figure 11.15. The prototypes reveal that one failure mode involves the pendulum falling to the left, while the other involves the pendulum falling to the right.

Algorithm 11.6 requires us to select the number of clusters k, the distance function d, and the feature extraction function φ. The clustering results are highly dependent on these choices; however, selecting the number of clusters and the features is often a subjective process that requires domain knowledge. To select the number of clusters, we can try different values for k and select the one that results in the most interpretable clusters or that minimizes a clustering objective such as the sum of the squared distances between each point and its cluster centroid.
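The clustering objective mentioned above can be computed with a short helper like the one below; the feature vectors, centroids, and assignments are assumed to come from whatever clustering is being compared (for example, the kmeans_trajectories sketch above), and the final choice of k remains a judgment call.

using LinearAlgebra: norm

# Sum of squared distances between each feature vector and its cluster centroid.
# xs are feature vectors, μs are centroids, and assignments[i] is the cluster of xs[i].
clustering_objective(xs, μs, assignments) =
    sum(norm(xs[i] - μs[assignments[i]])^2 for i in eachindex(xs))

Evaluating this objective for several candidate values of k and looking for the point of diminishing returns is one common heuristic.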
We can also use domain knowledge to select features that are likely to capture the underlying causes of the failures. A simple way to select features is to create a feature vector by concatenating all of the states in the trajectory. We could create similar feature vectors for the actions, observations, and disturbances. However, these feature vectors will be high-dimensional and may not result in interpretable clusters (figure 11.17).

Figure 11.17. Clustering failure trajectories of the inverted pendulum system using features consisting of the states, actions, and disturbances of each trajectory, respectively (panels: State Features, Action Features, Disturbance Features).
Features based on parametric signal temporal logic (PSTL) offer a lower-dimensional alternative: for each trajectory, we find the template parameter value at which the trajectory marginally satisfies a template formula and use these values as features that better capture the failure modes. Figure 11.18 shows the clusters of failure trajectories of the inverted pendulum system using the PSTL template in example 11.6.

Clustering using PSTL features requires us to select a template formula. The template formula should capture the key aspects of the system that are relevant to the failure modes. For systems with complex failure modes, it may be difficult to hand-design a template formula that captures all the failure modes. In these cases, we can use more sophisticated techniques that build decision trees using a grammar based on temporal logic.²⁵
Example 11.6. Example of a PSTL template formula for the inverted pendulum system. The plots show the robustness of the formula for different values of φ. Our goal is to find the value of φ that causes a given trajectory to marginally satisfy the formula.

The following STL formula specifies that the angle of the pendulum should not exceed π/4 for the first 200 time steps:

ψ = □[0,200] (θ < π/4)

If we replace the time bound with a parameter φ, we obtain the following PSTL template formula:

ψφ = □[0,φ] (θ < π/4)

The plots below show the robustness of the formula for different values of φ. The plot on the left (φ = 2.45) shows a value for φ such that the trajectory satisfies the formula, the plot in the middle (φ = 3.85) shows a value for φ that marginally satisfies the formula, and the plot on the right (φ = 4.05) shows a value for φ such that the trajectory does not satisfy the formula.

We can find the value of φ that marginally satisfies ψφ by searching for the value that causes the robustness to be as close as possible to zero. For this simple formula, we can solve the optimization problem using a grid search over the values of φ. The value of φ that marginally satisfies the formula will be the time just before the magnitude of the angle of the pendulum exceeds π/4.
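A minimal sketch of this grid search is shown below. The trajectory is assumed to be a vector of (t, θ) samples, and the robustness of ψφ is hard-coded as the worst-case margin of θ below π/4 over the window [0, φ], matching the simple template above; a general PSTL implementation would compute robustness from the formula itself.

# Robustness of ψ_φ = □_[0,φ] (θ < π/4) for a trajectory of (t, θ) samples:
# the worst-case margin of θ below π/4 over the window [0, φ].
robustness(τ, φ) = minimum(π/4 - θ for (t, θ) in τ if 0 ≤ t ≤ φ)

# Grid search for the value of φ whose robustness is closest to zero, which is
# the value at which the trajectory marginally satisfies the template formula.
marginal_parameter(τ, φs) = argmin(φ -> abs(robustness(τ, φ)), φs)

τ = [(0.05i, 0.9sin(0.05i)) for i in 0:100]   # hypothetical sampled trajectory
φ_marginal = marginal_parameter(τ, 0.0:0.05:5.0)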
12 Runtime Monitoring
While the validation algorithms in the previous chapters are typically applied
offline prior to deployment, runtime monitoring techniques perform online assess-
ments that check the safety of the system during operation. The goal of a runtime
monitor is to flag situations that may be hazardous when they occur so that we
can trigger fallback mechanisms such as alerting a human operator, switching
to a safe mode or fallback policy, or updating the system’s model of the world.
Offline validation algorithms rely on a set of modeling assumptions about the
environment and disturbances that the system will encounter during operation.
If these models are incorrect or the environment changes, the validation results
may no longer be valid. This chapter begins by discussing techniques to identify
when we are operating outside the assumptions made during offline validation.
We then discuss techniques to monitor uncertainty in the behavior of the system.
Finally, we present techniques to monitor for potential failures in the system.
12.1 Operational Design Domain Monitoring

The operational design domain (ODD) of a system is the set of conditions under which it is designed to operate safely. For example, the operational design domain for an image-based aircraft taxi system may consist of the set of weather conditions, times of day, and taxiways for which the system was designed. A good system model should cover the ODD so that the validation results from the previous chapters are valid within the ODD. If the system is operating outside the ODD, the validation results may no longer be valid, and we cannot provide any guarantees on the system's safety. Therefore, it is important for us to monitor at runtime whether a system is operating within its ODD.
There are multiple ways to represent the ODD of a system. One option is to specify the ODD as a set of hand-designed conditions. For example, we could write down the exact weather conditions, times of day, and taxiways that the aircraft taxi system is designed to operate under (figure 12.1). We can also specify the ODD in terms of acceptable ranges for the model parameters. For example, we might expect the variance of our sensor measurements to stay within a particular bound. A drawback of this approach is that it can require specialized domain knowledge to properly specify these conditions.

Figure 12.1. An example of a hand-designed operational design domain for an aircraft taxi system (checklist: No Clouds, No Glare, Daytime, Taxiway A).

We could also represent the ODD using data-driven approaches. These approaches rely on a data set of trajectory features that can be monitored at runtime and adequately capture the characteristics of the ODD. These features are problem dependent. For example, if the characteristics of the ODD are well-described by the state, the data could be the set of states observed during offline validation or training (figure 12.2). Some problems may require additional features to adequately represent the ODD. For example, the aircraft taxi system may perform differently depending on the image observation it receives. In this case, the data could be the set of images observed during offline validation or training. We could also use trajectory segments or full trajectories as the data.

Figure 12.2. For the continuum world problem, we can use the states visited during rollouts used for offline validation to derive a representation of the operational design domain.

Data-driven approaches to ODD monitoring use the data to define a set representation of the ODD. When the system encounters new data points at runtime, it can check whether they belong to the set and flag potentially dangerous behavior if they do not. The remainder of this section discusses techniques to define this set given a representative data set.
12.1.1 Nearest Neighbors Representation
One way to define the ODD given a data set is to use a nearest neighbors representation. Specifically, we define the ODD as the set of points whose nearest neighbor in the data set is within a certain threshold distance γ according to a distance metric. A common distance metric is the Euclidean distance. The threshold γ controls the conservatism of the set representation. A smaller γ results in a more conservative representation in the sense of being less likely to include situations that should not be included in the ODD. However, a value for γ that is too small may be too conservative such that it misses out on situations that should be included in the ODD. Figure 12.3 shows the ODD for the data in figure 12.2 defined using different threshold values.
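A minimal sketch of a runtime monitor based on this representation is shown below; the representative data points and the threshold γ are assumptions that would come from offline validation.

using LinearAlgebra: norm

# Nearest neighbor ODD representation: a point is inside the ODD if its nearest
# neighbor in the data set is within distance γ (Euclidean distance by default).
struct NearestNeighborODD
    data::Vector{Vector{Float64}}   # representative points from offline validation
    γ::Float64                      # threshold distance
end

in_odd(odd::NearestNeighborODD, x) = minimum(norm(x - xi) for xi in odd.data) ≤ odd.γ

# Hypothetical usage with two-dimensional state data:
odd = NearestNeighborODD([randn(2) for _ in 1:1000], 0.5)
in_odd(odd, [0.1, -0.2])

This check requires storing the full data set and computing a distance to every point, which motivates the clustering and other representations discussed next.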
One way to improve the memory and computational efficiency of the nearest neighbors representation is to cluster the data into groups of similar points and compute the nearest neighbors to the center of each cluster (figure 12.5). A common clustering algorithm called k-means is described in section 11.6. We can increase the threshold γ to account for the distance between the input and the center of the cluster.
Another option is to represent the ODD as a convex polytope, such as the convex hull of the data. However, polytopes are not as expressive as the nearest neighbors representation and may produce representations that are insufficiently conservative when the ODD is nonconvex (figure 12.6).

Figure 12.6. The ODD (blue) defined using the convex hull of the data in figure 12.2. This ODD representation is underconservative because it includes a large area for which there is very little data.

We can represent the ODD using a more expressive set by defining it as the union of multiple polytopes. For example, we could cluster the data set into k clusters using a clustering algorithm and take the union of the convex hulls of the clusters. Algorithm 12.2 implements this monitoring technique given a data set and a clustering of the data. Figure 12.7 shows the ODD for the data in figure 12.2 using different numbers of clusters. This approach, however, may still produce an ODD that contains regions of low data density if outliers are present.
Another approach is to fit a probability distribution to the data and define the ODD as a superlevel set of its density, using a model class that is expressive enough to capture the characteristics of the ODD (see chapter 2). For example, figure 12.8 demonstrates a scenario in which fitting a mixture of Gaussians to the data results in a better representation of the ODD than fitting a single Gaussian. Because the distributions of many model classes can be fully specified using a small number of parameters, this approach is more memory efficient than the nearest neighbors representation. The method is also robust to outliers because the likelihood will be high where the data is dense.

Instead of using the superlevel set of a distribution fit to the data, we can also use the superlevel set of a function that outputs the likelihood of the input being in the ODD. For example, we can train a classifier to predict the likelihood of a point being in the ODD and define the ODD as the set of points for which the model outputs a likelihood greater than a threshold. Figure 12.9 shows an example of this approach. One drawback of this approach is that training the model typically requires data from inside and outside the ODD, which may be difficult to obtain.
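The sketch below illustrates the distribution-based version of this idea with a single multivariate Gaussian from Distributions.jl; the data, the use of a single Gaussian, and the choice of threshold are all assumptions, and a mixture model can be substituted when a single Gaussian is too restrictive (as in figure 12.8).

using Distributions
using Statistics: quantile

# Fit a Gaussian to the data (columns are data points) and define the ODD as the
# superlevel set of its density above a threshold.
data = randn(2, 1000)                     # hypothetical two-dimensional features
model = fit(MvNormal, data)               # maximum likelihood Gaussian fit
likelihoods = [pdf(model, data[:, i]) for i in 1:size(data, 2)]
threshold = quantile(likelihoods, 0.01)   # free parameter controlling conservatism

in_odd_density(x) = pdf(model, x) ≥ threshold

in_odd_density([0.0, 0.0])   # near the data: inside the ODD
in_odd_density([6.0, 6.0])   # far from the data: outside the ODD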
All of these data-driven representations suffer from the curse of dimensionality: as the dimensionality of the data increases, the volume of the space the data must cover increases exponentially. This increase in volume makes it difficult to adequately cover the space with a limited amount of data and can cause distance metrics to lose meaning.

One way to address the curse of dimensionality is to gather more data and use more expressive models. For example, we could represent high-dimensional distributions using an expressive model such as a normalizing flow. However, this approach may lead to overfitting or poor generalization.² Another approach is to assume that the data lies on a lower-dimensional manifold and to use dimensionality reduction techniques³ to find this manifold.

² For example, researchers have shown that normalizing flows trained on images often assign higher likelihoods to images outside the ODD. P. Kirichenko, P. Izmailov, and A. G. Wilson, "Why Normalizing Flows Fail to Detect Out-Of-Distribution Data," Advances in Neural Information Processing Systems, vol. 33, pp. 20578–20589, 2020.

³ Common approaches for dimensionality reduction include principal component analysis and autoencoders. A detailed overview is provided by B. Ghojogh, M. Crowley, F. Karray, and A. Ghodsi, Elements of Dimensionality Reduction and Manifold Learning. Springer, 2023.

Given a lower-dimensional projection of the data, we can use the methods described in this section to define the ODD. When we get new data at runtime, we can project it onto this lower-dimensional manifold and check whether it fits within the ODD. Figure 12.10 shows an example of a two-dimensional manifold for the aircraft taxi image observations. When creating a lower-dimensional representation of the data, it is important to ensure that the projection captures the relevant features of the data that define the ODD and that data outside the ODD in the original space is projected outside the ODD in the lower-dimensional space. A representation that is not expressive enough may result in feature collapse, where far points in the original space are projected to nearby points in the lower-dimensional space.⁴ Figure 12.11 shows an example of feature collapse in the two-dimensional manifold for the aircraft taxi example.

⁴ J. Postels, M. Segù, T. Sun, L. D. Sieber, L. Van Gool, F. Yu, and F. Tombari, "On the Practicality of Deterministic Epistemic Uncertainty," in International Conference on Learning Representations (ICLR), 2022.

Figure 12.9. The ODD (blue, right plot) defined using the superlevel set of a classifier (middle plot) trained on the data in figure 12.2 as well as additional data sampled uniformly outside the ODD (panels: Training Data, Classifier Probability, Superlevel Set). The classifier outputs the probability that a point is in the ODD. The superlevel set is defined as the set of points for which the classifier outputs a probability greater than 0.5, though this can be treated as a free parameter to control conservatism.
12.2 Uncertainty Quantification
Uncertainty in the behavior of a system can arise from several sources. For example, sensor noise results in different possible observations for the same state. In previous chapters, we modeled this type of uncertainty using disturbance distributions.
Example 12.1. Learning the parameters of a conditional Gaussian model to quantify outcome uncertainty.

Suppose we want to quantify the outcome uncertainty when predicting a continuous variable y given an input x from a data set of (x, y) pairs. We can learn the parameters of the following conditional Gaussian model:

pθ(y | x) = N(y | µθ(x), σθ(x)²)

We can fit θ by minimizing the negative log-likelihood of the data, which, up to an additive constant, is

∑_{(x,y)} [ (y − µθ(x))² / (2σθ(x)²) + log σθ(x) ]

Intuitively, the first term in the final objective encourages the model to predict a high variance when the squared error is high, and the second term penalizes high variances. This objective is commonly used in machine learning and is referred to as the Gaussian negative log-likelihood loss function. Figure 12.13 shows the result of fitting a model using this objective to the data set in figure 12.12.
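A small sketch of this loss is shown below; the model here is a hypothetical function that returns the predicted mean and standard deviation, standing in for a neural network with parameters θ.

# Gaussian negative log-likelihood loss (up to an additive constant) for a model
# that predicts a mean μ(x) and standard deviation σ(x) for each input.
gaussian_nll(μ, σ, y) = (y - μ)^2 / (2σ^2) + log(σ)

# Aggregate objective over a data set of (x, y) pairs.
total_nll(model, data) = sum(gaussian_nll(model(x)..., y) for (x, y) in data)

# Hypothetical model and data for illustration.
model(x) = (2x, 0.5 + 0.1abs(x))
data = [(x, 2x + 0.3randn()) for x in randn(100)]
total_nll(model, data)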
For discrete outputs, the model produces a value yᵢ for each possible output, and we convert these values to probabilities using the softmax function

pᵢ = exp(yᵢ) / ∑_{j=1}^{k} exp(yⱼ)   (12.1)

where k is the total number of possible outputs. We can learn the parameters θ of the model by maximizing the likelihood of the data given this model.

Given a distribution over predicted outputs, we can quantify our uncertainty using the entropy of the distribution. Higher entropy indicates higher uncertainty in the prediction, and we may want to flag situations that result in outputs with high entropy as potentially dangerous. The entropy of a conditional Gaussian distribution is a function of the variance of the distribution. A higher variance results in a higher entropy. For discrete distributions, the entropy is defined as

−∑_{i=1}^{k} pᵢ log pᵢ   (12.2)

where pᵢ is the probability of the ith output. A model that assigns equal probability to all outputs will have maximum entropy, indicating high uncertainty in the prediction (figure 12.14). We can use a threshold on entropy to create a runtime monitor that flags uncertain situations.

Figure 12.14. Entropy of a discrete distribution over four possible outputs. If the distribution assigns high probability to a single output, the entropy will be low. If the distribution assigns equal probability to all outputs, the entropy will be high.
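The sketch below combines equations (12.1) and (12.2) into a simple entropy-based monitor; the entropy threshold is an assumption that would be tuned for the application.

# Softmax probabilities from the model outputs y (equation 12.1); subtracting the
# maximum does not change the result but improves numerical stability.
softmax(y) = exp.(y .- maximum(y)) ./ sum(exp.(y .- maximum(y)))

# Entropy of a discrete distribution (equation 12.2).
entropy(p) = -sum(p_i * log(p_i) for p_i in p if p_i > 0)

# Flag predictions whose entropy exceeds a threshold (a free parameter).
flag_uncertain(y; threshold=1.0) = entropy(softmax(y)) > threshold

flag_uncertain([4.0, 0.1, 0.2, 0.3])   # confident prediction: not flagged
flag_uncertain([0.5, 0.4, 0.5, 0.6])   # near-uniform prediction: flagged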
We can use this data to assess baseline performance by comparing the model distribution to the distribution of the calibration data (figure 12.15). We can then apply calibration techniques to adjust the model distribution to better match the calibration data distribution.

Figure 12.15. Calibration plot of a neural network model trained to predict the discrete actions of the continuum world agent using the data in figure 12.2. A calibrated model should match the dashed line. The model is poorly calibrated and tends to be underconfident in its predictions.

A common technique to calibrate a model is to perform histogram binning on the desired uncertainty metric. For discrete outputs, a common uncertainty metric for calibration is the predicted probability of the correct output. For continuous outputs predicted using a Gaussian distribution, a common uncertainty metric for calibration is the predicted variance of the distribution. The first step in histogram binning is to divide the calibration data into bins using the predicted values of the uncertainty metric. We typically select the bin boundaries to create bins of equal width or equal number of samples.

After binning the data, we can calculate the actual value of the uncertainty metric for each bin. For example, for discrete outputs, we can calculate the average predicted probability of the correct output for each bin. We can then adjust the model predictions to match the actual values of the uncertainty metric in each bin. For example, if the model is underconfident in its predictions, we can increase the predicted probability of the model in the corresponding bins. When we obtain a new data point, we calculate its predicted uncertainty metric to determine which bin it belongs to. We can then use the actual value of the uncertainty metric in that bin to adjust the model prediction. Example 12.2 provides an example of this process.
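The following sketch implements histogram binning for a model that predicts the probability of a binary output, as in example 12.2; the number of bins, the equal-width bin boundaries, and the fallback value for empty bins are choices made for this sketch rather than prescriptions.

# Histogram binning calibration for a model that predicts the probability p that
# a binary output is 1, using equal-width bins over the predicted probability.
function fit_histogram_binning(p_model, y_true; nbins=10)
    edges = collect(range(0, 1, length=nbins + 1))
    values = zeros(nbins)
    for b in 1:nbins
        lo, hi = edges[b], edges[b+1]
        in_bin = [i for i in eachindex(p_model)
                  if lo ≤ p_model[i] < hi || (b == nbins && p_model[i] == hi)]
        # calibrated value: empirical frequency of the output being 1 in this bin
        # (fall back to the bin midpoint when the bin is empty, a design choice)
        values[b] = isempty(in_bin) ? (lo + hi) / 2 : sum(y_true[in_bin]) / length(in_bin)
    end
    return edges, values
end

# Adjust a new predicted probability to the calibrated value of its bin.
calibrate(p, edges, values) = values[clamp(searchsortedlast(edges, p), 1, length(values))]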
Histogram binning requires storing the bin edges and corresponding calibration values at runtime. Other techniques focus on instead fitting a single calibration parameter to the data. For example, a common calibration technique for models that output probabilities of discrete outputs using a softmax function is to introduce a precision parameter λ such that

pᵢ = exp(λ yᵢ) / ∑_{j=1}^{k} exp(λ yⱼ)   (12.3)

This model is similar to the softmax response model introduced in section 2.4.2. We can select the precision parameter λ that minimizes a proper scoring rule such as the negative log-likelihood of the calibration data.⁹ Figure 12.16 shows the result of applying this calibration technique to calibrate the model in figure 12.15.

⁹ Because we are fitting a single parameter, the risk of overfitting is low, so we do not necessarily need a separate set of calibration data. This method is also sometimes referred to as temperature scaling with temperature 1/λ. C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On Calibration of Modern Neural Networks," in International Conference on Machine Learning (ICML), 2017.
Example 12.2. Histogram binning calibration for a model that predicts a binary discrete output. The rightmost plot shows the calibrated runtime data after applying the calibration technique. A well-calibrated model should follow the dashed line.

Suppose we have a model that predicts binary outputs for a system by predicting a probability p that the output is 1. Given a data set of calibration data, we can see that the model tends to be underconfident in its predictions (left). For example, for data points where the model predicts that the probability of the output being 1 is between 0.5 and 0.6, the actual probability of the output being 1 according to the calibration data is around 0.73. If we were to deploy the model without adjustment, our monitoring decisions would rely on these underconfident probabilities.

We can use the bins of the model probability and their corresponding actual probabilities from the calibration data to adjust the model predictions. For example, suppose we get a new input at runtime that has a predicted probability according to the model of 0.52. This probability falls into the highlighted bin above, where the actual probability is 0.73. We should therefore adjust that probability such that p̂ = 0.73 and use this probability to make runtime monitoring decisions. The plot on the right shows the result of applying this calibration technique to the runtime data. The model is now well-calibrated and follows the dashed line.
For Gaussian models, we can apply a similar technique by introducing a single scaling parameter to the predicted variance of the model.
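A minimal sketch of fitting the precision parameter λ for the discrete case is shown below: it scores candidate values of λ by the negative log-likelihood of a calibration set of (logits, label) pairs and keeps the best one. The grid of candidate values and the calibration data are assumptions.

# Precision-scaled softmax (equation 12.3).
softmax_scaled(y, λ) = exp.(λ .* y) ./ sum(exp.(λ .* y))

# Negative log-likelihood of the calibration data, a proper scoring rule.
nll(λ, calibration) = -sum(log(softmax_scaled(y, λ)[label]) for (y, label) in calibration)

# Select the precision parameter with a simple grid search.
fit_precision(calibration; λs=0.1:0.1:10.0) = argmin(λ -> nll(λ, calibration), λs)

# Hypothetical calibration data: logits over four outputs and the correct label.
calibration = [(randn(4), rand(1:4)) for _ in 1:200]
λ_best = fit_precision(calibration)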
While fitting a single calibration parameter reduces the complexity of the calibration procedure, it is not as expressive as histogram binning. For example, there may not be a single precision parameter λ that adequately calibrates all bins of the model. Other calibration techniques use more complex models with multiple parameters to adjust the model predictions.¹⁰

¹⁰ These techniques include Platt scaling and isotonic regression. A. Niculescu-Mizil and R. Caruana, "Predicting Good Probabilities with Supervised Learning," in International Conference on Machine Learning (ICML), 2005.

12.2.3 Prediction Sets

Another approach to quantifying uncertainty is to predict a set of possible outcomes rather than a single outcome. To create a prediction set, we must choose a
desired level of coverage. The coverage of a prediction set is the probability that
the true output lies within the set. For example, a prediction set with coverage
of 0.95 should contain the true output 95 % of the time. A large prediction set
indicates high uncertainty, while a small prediction set indicates low uncertainty.
For example, a large prediction set for the location of an autonomous vehicle
indicates that we have high uncertainty in its true location.
Given a desired level of coverage c, we can derive a prediction set from a model trained using the methods in section 12.2.1 to predict a distribution over the output. For models that predict the parameters of a Gaussian distribution, it is common to create a prediction set centered around the predicted mean. We create this set by extending outward from the mean until the set includes c probability mass.¹¹ For models that predict the probabilities of discrete outputs, we can create a prediction set by adding outputs to the set in order of decreasing predicted probability until the sum of the probabilities of the outputs in the set exceeds c.

¹¹ Prediction sets do not need to be centered around the mean. Any prediction set that occupies c probability mass is a valid prediction set. However, the centered prediction set is the smallest possible set that contains c probability mass.

Figure 12.17 shows small and large prediction sets for both discrete and continuous models. Before generating prediction sets, it is important to ensure that the model is well-calibrated. If the model is not well-calibrated, the prediction sets may be too small or too large. For example, if the model is underconfident in its predictions, the prediction sets may be too small. We can calibrate the model using the techniques described in section 12.2.2.

We can also generate accurate prediction sets from an uncalibrated uncertainty measure using a technique known as conformal prediction.¹² Similar to the techniques described in section 12.2.2, conformal prediction uses a calibration set

¹² A detailed overview of conformal prediction is provided in A. N. Angelopoulos, S. Bates, et al., "Conformal Prediction: A Gentle Introduction," Foundations and Trends in Machine Learning, vol. 16, no. 4, pp. 494–591, 2023.
to adjust the uncertainty measure. The data points in the calibration set must be exchangeable, meaning that the joint distribution of the data points does not change based on the order they appear.¹³ However, conformal prediction does not require a model that predicts a distribution over the output. Instead, it uses the calibration set to adjust any heuristic uncertainty measure.

¹³ Exchangeability is a more relaxed condition than independence. Any set of variables that are independent and identically distributed are also exchangeable.

The first step of conformal prediction involves identifying a heuristic notion of uncertainty. For example, we can use the parameters of an uncalibrated output distribution from a model trained using the methods in section 12.2.1. Next, conformal prediction requires a score function s(x, y) that encodes how well the predicted uncertainty in the output conditioned on the input x matches the true output y. The score should be lower when there is good agreement between the prediction and the true output. Example 12.3 shows a score function for a Gaussian model, and example 12.4 shows a score function for a model that predicts discrete outputs.

The final step of conformal prediction involves computing the score for each point in the calibration set. We then find the score q that corresponds to the ⌈(n + 1)c⌉/n quantile of the calibration scores, where ⌈·⌉ is the ceiling function and n is the number of points in the calibration set. Given a new input at runtime, the prediction set that guarantees a coverage of at least c is

{y | s(x, y) ≤ q}   (12.4)

as long as the new input is exchangeable with the calibration set.¹⁴ Example 12.5 shows an example of a prediction set for a Gaussian model, and example 12.6 shows an example of a prediction set for a model that predicts discrete outputs.

¹⁴ A proof of this property is provided by A. N. Angelopoulos, S. Bates, et al., "Conformal Prediction: A Gentle Introduction," Foundations and Trends in Machine Learning, vol. 16, no. 4, pp. 494–591, 2023.

The practicality of conformal prediction is highly dependent on the heuristic notion of uncertainty and the score function. If these choices do not accurately reflect the true uncertainty in the model, the prediction sets may be too large to be useful. It is also important to note that conformal prediction provides guarantees on the marginal coverage of the prediction sets and not the conditional coverage. In other words, the prediction sets will have a coverage of at least c on average, but the coverage may vary for different inputs (figure 12.18). Example 12.7 highlights a limitation of conformal prediction caused by this property.
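The sketch below walks through these steps for the Gaussian score of example 12.3: compute the calibration scores, take the ⌈(n + 1)c⌉/n quantile, and form the interval in equation (12.4). The model and calibration data here are hypothetical, and the empirical quantile computation is simplified relative to a careful conformal implementation.

using Statistics: quantile

# Conformal quantile: the ⌈(n + 1)c⌉/n empirical quantile of the calibration scores.
function conformal_quantile(scores, c)
    n = length(scores)
    return quantile(scores, min(1.0, ceil((n + 1) * c) / n))
end

# Score function for a conditional Gaussian model (example 12.3); the model is
# assumed to return the predicted mean and standard deviation for an input.
score(model, x, y) = abs(y - model(x)[1]) / model(x)[2]

# Prediction set for a new input (equation 12.4), which for this score is the
# interval μ(x) ± q σ(x).
function prediction_interval(model, x, q)
    μ, σ = model(x)
    return (μ - q * σ, μ + q * σ)
end

# Hypothetical model and calibration data.
model(x) = (2x, 0.5)
calibration = [(x, 2x + 0.5randn()) for x in randn(500)]
q = conformal_quantile([score(model, x, y) for (x, y) in calibration], 0.95)
prediction_interval(model, 1.3, q)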
Example 12.3. Score function for conformal prediction using a conditional Gaussian model as a heuristic notion of uncertainty.

An example of a score function for the Gaussian model described in example 12.1 is

s(x, y) = |y − µθ(x)| / σθ(x)

where µθ(x) and σθ(x) are the predicted mean and standard deviation. The score will be low for inputs where the predicted mean is close to the true output. The score will also be low if the predicted output is far from the true output but the predicted standard deviation is large enough to account for this variation.

The plots below show the score function for different predicted distributions and corresponding true outputs. Although the true output is further from the predicted mean in the center plot than it is in the left plot, the two data points produce the same score because the predicted standard deviation is larger in the center plot. In the plot on the right, the predicted standard deviation does not account for the increased gap between the predicted mean and the true output, resulting in a higher score.
Example 12.4. Score function for conformal prediction using a model that predicts probabilities of discrete outputs.

Suppose we want to use the softmax probabilities of a model fθ(x) that predicts discrete outputs as a heuristic notion of uncertainty. We first define the function πⱼ(x) to return the index of the output with the jth highest predicted probability. The score function can then be defined as

s(x, y) = ∑_{j=1}^{k} fθ(x)_{πⱼ(x)}

where k is the index of the true output in this ordering, such that πₖ(x) = y.
Example 12.5. Prediction set from conformal prediction using a Gaussian model and the score function from example 12.3.

Plugging the score function from example 12.3 into equation (12.4) provides us with the following prediction set for a Gaussian model:

{y | s(x, y) ≤ q} = {y : |y − µθ(x)| / σθ(x) ≤ q}
                  = {y : |y − µθ(x)| ≤ q σθ(x)}

where q is the quantile of the calibration scores that corresponds to the desired coverage c. This prediction set is centered around the predicted mean and extends outward q standard deviations from the mean.

Intuitively, conformal prediction scales the standard deviation based on the results of the calibration data. For example, suppose we want to create a prediction set with 95 % coverage. If the model was perfectly calibrated, we would expect q ≈ 2 because 95 % of the data should lie within two standard deviations of the mean. However, if the model was underconfident, we would expect q > 2 to produce larger prediction sets, and if the model was overconfident, we would expect q < 2 to produce smaller prediction sets.
Example 12.6. Prediction set from conformal prediction using a model that predicts probabilities of discrete outputs and the score function from example 12.4.

Plugging the score function from example 12.4 into equation (12.4) provides us with the following prediction set for a model that predicts probabilities of discrete outputs:

{y | s(x, y) ≤ q} = {y : ∑_{j=1}^{k} fθ(x)_{πⱼ(x)} ≤ q}
Figure 12.18. Prediction sets plotted as y versus x. While 95 % of the points are inside the prediction set on average in both plots, the plot on the left only provides marginal coverage, while the plot on the right provides both marginal and conditional coverage.
Example 12.7. Example of the limitations of the coverage guarantees provided by conformal prediction.

Suppose we perform conformal prediction on a model that predicts the location of an aircraft from runway images. We use calibration data in which 95 % of the images are taken during the day, and the remaining 5 % are taken at night. If we use conformal prediction to produce prediction sets with 95 % coverage, the prediction sets will have a coverage of at least 95 % on average. However, it is possible that the prediction sets for nighttime images may have a coverage of 0 %, while the prediction sets for daytime images may have a coverage of 100 %. In this case, if the aircraft is operating at night, the prediction sets will be inaccurate, resulting in potentially dangerous behavior. Therefore, it is important to consider the conditional coverage of the prediction sets when using conformal prediction.
The techniques described so far quantify uncertainty in the outcomes of the system. We may also be uncertain about our model of the system, which may arise when the system is operating outside its ODD. In these scenarios, we do not have data on the system's behavior, and we therefore cannot expect data-driven output uncertainty estimates to be accurate (figure 12.19). For this reason, we need other techniques to estimate model uncertainty.

Figure 12.19. We cannot expect data-driven uncertainty estimates to be calibrated in regions of the input space where we do not have data. For example, it is possible that the data we were missing (purple) when we trained the model in figure 12.13 lies well outside the 2σ region of the model's predictions.

One approach to estimating model uncertainty is to use a Bayesian approach in which we maintain a distribution over possible models. The key insight behind this approach is that there are many possible models that could have generated the data (figure 12.20), and we should account for this uncertainty when making predictions. We represent the distribution over possible models as p(θ | D), where θ are the parameters of the model and D is the data.

Given a new input, we can compute a distribution over the output using a process known as Bayesian model averaging. Bayesian model averaging uses the following equation to make predictions:

p(y | x, D) = ∫ p(y | x, θ) p(θ | D) dθ   (12.5)

where p(y | x, θ) is the distribution over the prediction given the input and a specific instantiation of the parameters of the model. Intuitively, this equation computes the distribution over the output by averaging the predictions of all possible models weighted by the probability of each model given the data.

In general, equation (12.5) is intractable to compute because it requires integrating over the entire parameter space. However, we can use a variety of techniques to approximate this integral.¹⁵ One approach is to use MCMC to sample from
the posterior distribution over the parameters of the model p(θ | D) (see sections 2.3.2 and 6.3). We can then use these samples to approximate the integral in equation (12.5). One drawback of this approach is that it requires running MCMC to make predictions at runtime, which can be computationally expensive.

¹⁵ A. G. Wilson and P. Izmailov, "Bayesian Deep Learning and a Probabilistic Perspective of Generalization," Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 4697–4708, 2020.

Another approach to approximating the integral in equation (12.5) is to create an ensemble consisting of a set of models M that all have high likelihood according to p(θ | D).¹⁶ One way to create these models is to train the same model multiple times with different initializations. We then approximate the integral as an equally weighted mixture of the predictions of each model as follows:

p(y | x, D) ≈ (1/|M|) ∑_{θ ∈ M} p(y | x, θ)   (12.6)

¹⁶ B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles," Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017.

Ideally, the models in the ensemble correspond to different regions of high posterior density. By starting from different initializations, we encourage each model to find a different local minimum in the loss function (figure 12.22). However, this property is not guaranteed. It is possible that all models in the ensemble will still converge to the same local minimum, which results in overconfident uncertainty estimates (figure 12.21). Therefore, it may be necessary to incorporate other heuristics into the training to ensure that the models in the ensemble are diverse.¹⁷

¹⁷ V. Dwaracherla, Z. Wen, I. Osband, X. Lu, S. M. Asghari, and B. Van Roy, "Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping," Transactions on Machine Learning Research, 2022.

Figure 12.22. By training models with different initializations for θ, we hope to arrive at different local minima in the loss function. The colored points represent local minima for each mode in figure 12.20.
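The following sketch approximates equation (12.6) with an ensemble of models that each return a Gaussian predictive distribution for an input; the closures below are hypothetical stand-ins for networks trained from different random initializations.

using Distributions

# Equally weighted mixture of the predictive distributions of an ensemble (equation 12.6).
ensemble_predict(models, x) = MixtureModel([m(x) for m in models])

# Hypothetical ensemble members that agree near the data but disagree far from it.
models = [x -> Normal((2 + 0.1i) * x, 0.5 + 0.05i * abs(x)) for i in 1:5]

posterior = ensemble_predict(models, 3.0)
mean(posterior), std(posterior)   # ensemble mean and spread at this input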
12.3 Failure Monitoring

Even when a system is operating with low uncertainty within its ODD, it may still end up in situations that lead to failure. Therefore, it is important to monitor
systems for potentially dangerous situations within their ODD. A simple approach to failure monitoring is to create a set of heuristic rules or properties that describe dangerous scenarios using information that can be monitored at runtime. For example, in an aircraft collision avoidance scenario, we can monitor the distance between the two aircraft and issue a warning if the distance falls below a certain threshold (figure 12.24). We typically want to set this threshold to be conservative so that we have time to take corrective actions to mitigate the likelihood of a potential failure.

In addition to heuristic rules, we can make predictions about whether a particular situation is likely to lead to failure by performing some additional computation at runtime. Specifically, we can use a model of the system to run validation algorithms online during deployment. For example, we could use one of the reachability algorithms discussed in chapters 8 to 10 to determine whether failure states are reachable from the current state (first row of figure 12.25). If the specification for the system can be written as an LTL formula, we can use the techniques discussed in section 3.6 to convert the specification into a reachability specification using an automaton. We can then monitor the system at runtime by traversing the automaton.¹⁸

¹⁸ A. Bauer, M. Leucker, and C. Schallhart, "Runtime Verification for LTL and TLTL," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 20, no. 4, pp. 1–64, 2011.

Monitors that use reachability analysis may be overly conservative, and we may instead only want to flag situations that have a significant probability of leading to failure. In this case, we can use the techniques discussed in chapter 7 to compute the probability of reaching a failure state from the current state. We can then use this probability to determine whether to issue a warning. The second row of figure 12.25 shows an example of a failure monitoring system that uses a probabilistic model to predict the likelihood of failure at each time step.
[Figure 12.25: aircraft collision avoidance encounters plotted as relative altitude h (m) versus time to collision tcol (s) under different runtime monitoring strategies.]
In some cases, we can perform this computation offline for the relevant states and store the results in a lookup table. At runtime, we can then query the lookup table to determine the probability of failure for the current state. The third row of figure 12.25 shows an example of a failure monitoring system that uses a lookup table to determine the probability of failure at each time step. For systems that have memory limitations, we can use a variety of compression techniques to store the results in a more compact form. For example, we could train a neural network to approximate the failure probability from any state in the state space.
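The sketch below illustrates such a monitor for the collision avoidance system, indexing a precomputed table by discretized relative altitude and time to collision; the bin edges, the table values, and the warning threshold are placeholders.

# Runtime failure monitor backed by a precomputed table of failure probabilities.
struct FailureTable
    h_bins::Vector{Float64}      # bin edges for relative altitude h (m)
    t_bins::Vector{Float64}      # bin edges for time to collision tcol (s)
    p_fail::Matrix{Float64}      # probability of failure for each (h, tcol) bin
end

function failure_probability(table::FailureTable, h, t_col)
    i = clamp(searchsortedlast(table.h_bins, h), 1, size(table.p_fail, 1))
    j = clamp(searchsortedlast(table.t_bins, t_col), 1, size(table.p_fail, 2))
    return table.p_fail[i, j]
end

issue_warning(table, h, t_col; threshold=0.1) = failure_probability(table, h, t_col) > threshold

# Placeholder table with uniformly spaced bins and arbitrary probabilities.
table = FailureTable(collect(-400.0:50.0:400.0), collect(0.0:5.0:40.0), rand(17, 9) .* 0.05)
issue_warning(table, -120.0, 12.0)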
12.4 Summary
• We can represent the operational design domain using a variety of set rep-
resentations such as sets defined by nearest neighbors, convex hulls, or level
sets.
Appendices
A Systems
This appendix summarizes some of the systems used as examples in this book.
Each component of the system must take in both its typical inputs and a disturbance. For components that do not use the disturbance, the disturbance is ignored using the default implementation in algorithm A.1.
Each component must also have a disturbance distribution that takes in the
necessary inputs and returns a distribution over disturbances. Algorithm A.2
provides a default implementation for components that do not use the disturbance.
It returns a distribution object that specifies that the component is deterministic.
The environment component for each system must also have a function that
returns the default distribution over initial states.
The simple Gaussian system consists of an environment with a single state variable that is sampled from a Gaussian distribution with mean 0 and standard deviation 1. After sampling an initial state, the system will remain in that state for all time regardless of the action. In other words, the system has no agent, and the state is fully observable. Algorithm A.3 defines each component of the system. A typical specification (chapter 3) for the simple Gaussian system places a bound on the sampled state.
p′ = p + v Δt
v′ = v + (−(k/m) p − (c/m) v + (1/m) β) Δt

where m is the mass, k is the spring constant, c is the damping coefficient, Δt is the time step, and β is the disturbance force. The system oscillates back and forth before coming to rest. In general, we want to ensure that the system remains stable, meaning that the position does not exceed some magnitude. We can write this specification in STL as

ψ = □(|p| < γ)   (A.4)

where γ is the maximum position magnitude of the mass. If the noise becomes too large, the system may become unstable.

Figure A.4. Example trajectories of the mass-spring-damper system. The system oscillates back and forth before coming to rest.
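A sketch of one simulation step of these dynamics is shown below; the struct, the parameter values, and the treatment of β as an externally supplied disturbance force are assumptions for illustration rather than the book's algorithm.

# One discrete-time step of the mass-spring-damper dynamics, with β supplied as
# the disturbance force. Parameter values below are placeholders.
struct MassSpringDamper
    m::Float64    # mass
    k::Float64    # spring constant
    c::Float64    # damping coefficient
    dt::Float64   # time step Δt
end

function mass_spring_damper_step(env::MassSpringDamper, s, β)
    p, v = s[1], s[2]
    p′ = p + v * env.dt
    v′ = v + (-(env.k / env.m) * p - (env.c / env.m) * v + β / env.m) * env.dt
    return [p′, v′]
end

env = MassSpringDamper(1.0, 4.0, 0.5, 0.05)
mass_spring_damper_step(env, [0.3, 0.0], 0.1)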
where m is the mass of the pendulum, ℓ is the length of the pendulum, g is the acceleration due to gravity, and Δt is the discrete time step. Algorithm A.6 implements these dynamics. The magnitude of the angular velocity of the pendulum is limited to ωmax and the torque applied is limited to amax. The initial state for the pendulum is sampled uniformly from angles near upright with a small angular velocity.

Figure A.5. State, action, and observation for the inverted pendulum system. The goal is to apply a torque at each time step to balance the pendulum upright. The observation is a noisy measurement of the state.
function (env::InvertedPendulum)(s, a)
    θ, ω = s[1], s[2]
    dt, g, m, l = env.dt, env.g, env.m, env.l
    a = clamp(a, -env.a_max, env.a_max)      # limit the applied torque
    ω = ω + (3g / (2 * l) * sin(θ) + 3 * a / (m * l^2)) * dt
    θ = θ + ω * dt
    ω = clamp(ω, -env.ω_max, env.ω_max)      # limit the angular velocity
    return [θ, ω]
end
Ps(env::InvertedPendulum) = Product([Uniform(-π / 16, π / 16),   # initial angle near upright
                                     Uniform(-1.0, 1.0)])        # small initial angular velocity
Throughout the book, we use sensor noise as the main source of randomness in
the inverted pendulum system. Therefore, the inverted pendulum system uses the
The specification for the inverted pendulum system requires that the pendulum angle stay within π/4 of vertical:

ψ = □(|θ| < π/4)   (A.6)

Figure A.6 shows an example plot of both a success and a failure trajectory for the inverted pendulum system.
The specification for the grid world system requires eventually reaching a goal state while avoiding obstacle states:

ψ = ♦G(sₜ) ∧ □¬F(sₜ)   (A.7)

where G(sₜ) returns true if sₜ is a goal state and F(sₜ) returns true if sₜ is an obstacle state. Figure A.8 shows an example of the grid world system with an obstacle state in the center of the grid and a goal state near the upper right corner.
Like the grid world agent, the continuum world agent moves in one of the four cardinal directions; however, instead of slipping in one of the other cardinal directions, the agent slips in a random direction on the unit circle, with higher probability of slipping in directions close to the desired direction. Specifically, we model this process by first adding a random vector x sampled from a multivariate Gaussian distribution to the desired direction and then normalizing the result to have a magnitude of 1.

The agent for the continuum world problem maps continuous states to discrete actions by interpolating a policy defined on a grid of discrete points. Specifically, each state in the grid corresponds to a set of values that represent the expected future return when taking each action from the state.¹ Given a new state that is not part of the grid, the agent uses multilinear interpolation to estimate the expected future return for each action. It then takes the action with the highest expected return. Figure A.9 shows the resulting policy for an agent trained to reach the green goal while avoiding the red obstacle. Algorithm A.8 defines the agent and environment for the continuum world problem.

Figure A.9. Policy for the continuum world agent.

¹ These values make up the state-action value function. More details are provided by M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022.
The continuum world agent uses the same specification as the grid world
system with continuous states instead of discrete states. The obstacles and goal
for the continuum world system are represented as balls.
The aircraft collision avoidance system involves issuing climb or descend advisories to an aircraft to avoid an intruder aircraft.² There are three actions corresponding to no advisory, commanding a 5 m/s descend, and commanding a 5 m/s climb. The intruder is approaching us head on, with a constant horizontal closing speed. The state is specified by the altitude h of our aircraft measured relative to the intruder aircraft, our vertical rate ḣ measured relative to the intruder aircraft, the previous action aprev, and the time to potential collision tcol. Figure A.11 illustrates the problem scenario.

² This formulation is a highly simplified version of the problem described by M. J. Kochenderfer and J. P. Chryssanthacopoulos, "Robust Airborne Collision Avoidance Through Dynamic Programming," Massachusetts Institute of Technology, Lincoln Laboratory, Project Report ATC-371, 2011.

Figure A.10. Example of a success (green) and failure (red) trajectory for the continuum world system. The failure trajectory enters the circle with the obstacle.

Figure A.11. State variables for the aircraft collision avoidance system.

Given action a, the state variables are updated as follows:

h′ = h + ḣ Δt          (A.8)
ḣ′ = ḣ + (ḧ + xs)      (A.9)
a′prev = a             (A.10)
t′col = tcol − Δt      (A.11)

where Δt = 1 s and xs is noise added to the relative vertical rate to account for variations in intruder behavior. The value ḧ is given by

ḧ = 0                      if a = no advisory
ḧ = (a − ḣ)/Δt             if |a − ḣ|/Δt < ḧlimit        (A.12)
ḧ = sign(a − ḣ) ḧlimit     otherwise
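The sketch below implements one step of this update; the action encoding, the limit ḧlimit, and the function name are assumptions, and xs stands for a sampled value of the vertical rate noise.

# One step of the simplified collision avoidance dynamics (equations A.8 to A.12).
# The state is [h, hdot, a_prev, t_col]; a is the commanded relative vertical rate
# in m/s (0.0 for no advisory, -5.0 for descend, 5.0 for climb) and xs is the
# sampled noise on the relative vertical rate. Parameter values are placeholders.
function collision_avoidance_step(s, a, xs; dt=1.0, hddot_limit=1.0, no_advisory=0.0)
    h, hdot, a_prev, t_col = s
    if a == no_advisory
        hddot = 0.0
    elseif abs(a - hdot) / dt < hddot_limit
        hddot = (a - hdot) / dt              # reach the commanded rate within this step
    else
        hddot = sign(a - hdot) * hddot_limit # otherwise accelerate at the limit
    end
    return [h + hdot * dt, hdot + hddot + xs, a, t_col - dt]
end

collision_avoidance_step([150.0, -2.0, 0.0, 40.0], 5.0, 0.1)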
B Mathematical Concepts
B.1 Measure Spaces

Before introducing the definition of a measure space, we will first discuss the notion of a sigma-algebra over a set Ω. A sigma-algebra is a collection Σ of subsets of Ω such that

1. Ω ∈ Σ.
2. If E ∈ Σ, then its complement Ω \ E is also in Σ (closure under complementation).
3. If E1, E2, … ∈ Σ, then their union ⋃ᵢ Eᵢ is also in Σ (closure under countable unions).

A measure space (Ω, Σ, µ) consists of a set Ω, a sigma-algebra Σ over Ω, and a measure µ that assigns a real value to each element of Σ such that

1. If E ∈ Σ, then µ(E) ≥ 0 (nonnegativity).
2. µ(∅) = 0.
3. If E1, E2, … ∈ Σ are pairwise disjoint, then µ(⋃ᵢ Eᵢ) = ∑ᵢ µ(Eᵢ) (countable additivity).
B.2 Probability Spaces

A probability space is a measure space (Ω, Σ, µ) with the requirement that µ(Ω) = 1. In the context of probability spaces, Ω is called the sample space, Σ is called the event space, and µ (or, more commonly, P) is the probability measure. The probability axioms¹ refer to the nonnegativity and countable additivity properties of measure spaces, together with the requirement that the probability of the sample space is 1.

¹ These axioms are sometimes called the Kolmogorov axioms. A. Kolmogorov, Foundations of the Theory of Probability, 2nd ed. Chelsea, 1956.
B.3 Metric Spaces

A set with a metric is called a metric space. A metric d, sometimes called a distance metric, is a function that maps pairs of elements in X to nonnegative real numbers such that for all x, y, z ∈ X:²

1. d(x, y) = 0 if and only if x = y (identity of indiscernibles).
2. d(x, y) = d(y, x) (symmetry).
3. d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).

² In section 3.2, we use a less precise definition of metric to refer to any function that maps system behavior to a real value.
B.4 Normed Vector Spaces

A normed vector space consists of a vector space X and a norm ‖·‖ that maps elements of X to nonnegative real numbers such that for all scalars α and vectors x, y ∈ X:

1. ‖x‖ = 0 if and only if x = 0.
2. ‖αx‖ = |α| ‖x‖ (absolute homogeneity).
3. ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality).

A common family of norms is the Lp norms,

‖x‖p = lim_{ρ→p} (|x₁|^ρ + |x₂|^ρ + ⋯ + |xₙ|^ρ)^{1/ρ}

where the limit is necessary for defining the infinity norm, L∞. Several Lp norms are shown in figure B.1.

Norms can be used to induce distance metrics in vector spaces by defining the metric d(x, y) = ‖x − y‖. We can then, for example, use an Lp norm to define distances.
L2: ‖x‖₂ = √(x₁² + x₂² + ⋯ + xₙ²). This metric is often referred to as the Euclidean norm.

L∞: ‖x‖∞ = max(|x₁|, |x₂|, ⋯, |xₙ|). This metric is often referred to as the max norm, Chebyshev norm, or chessboard norm. The latter name comes from the minimum number of moves that a king needs to move between two squares in chess.
B.5 Positive Definiteness

A symmetric matrix A is positive definite if xᵀAx is positive for all points other than the origin. In other words, xᵀAx > 0 for all x ≠ 0. A symmetric matrix A is positive semidefinite if xᵀAx is always nonnegative. In other words, xᵀAx ≥ 0 for all x.
B.6 Information Content

If we have a discrete distribution that assigns probability P(x) to value x, the information content³ of observing x is given by

I(x) = −log P(x)   (B.2)

The unit of information content depends on the base of the logarithm. We generally assume natural logarithms (with base e), making the unit nat, which is short for natural. In information theoretic contexts, the base is often 2, making the unit bit. We can think of this quantity as the number of bits required to transmit the value x according to an optimal message encoding when the distribution over messages follows the specified distribution.

³ Sometimes information content is referred to as Shannon information, in honor of Claude Shannon, the founder of the field of information theory. C. E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, vol. 27, no. 4, pp. 623–656, 1948.
B.7 Entropy
B.8 Cross Entropy

The cross entropy of one distribution relative to another can be defined in terms of expected information content. If we have one discrete distribution with mass function P(x) and another with mass function Q(x), then the cross entropy of P relative to Q is given by

H(P, Q) = −∑ₓ P(x) log Q(x)
B.10 Taylor Expansion

The Taylor expansion,⁵ also called the Taylor series, of a function is important to many approximations used in this book. From the first fundamental theorem of calculus,⁶ we know that

f(x + h) = f(x) + ∫₀ʰ f′(x + a) da   (B.9)

⁵ Named for the English mathematician Brook Taylor (1685–1731), who introduced the concept.

⁶ The first fundamental theorem of calculus relates a function to the integral of its derivative: f(b) − f(a) = ∫ₐᵇ f′(x) dx
In the formulation given here, x is typically fixed and the function is evaluated
in terms of h. It is often more convenient to write the Taylor expansion of f ( x )
about a point a such that it remains a function of x:
f(x) = ∑_{n=0}^∞ f⁽ⁿ⁾(a)/n! (x − a)ⁿ     (B.17)

Truncating the series after the linear term gives the first-order Taylor approximation, or linearization:

f(x) ≈ f(a) + f′(a)(x − a)     (B.18)
In multiple dimensions, the Taylor expansion about a vector a generalizes to

f(x) = f(a) + ∇f(a)⊤(x − a) + ½ (x − a)⊤∇²f(a)(x − a) + · · ·     (B.20)
The first two terms form the tangent plane at a. The third term incorporates local
curvature. This book will use only the first three terms shown here.
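A brief sketch of ours comparing a function with its first- and second-order Taylor approximations about a point; the choice of f(x) = exp(x) and a = 0 is arbitrary.

f(x) = exp(x)                                   # example function with f′ = f″ = exp
a = 0.0
taylor1(x) = f(a) + exp(a)*(x - a)              # first-order approximation (tangent line)
taylor2(x) = taylor1(x) + 0.5*exp(a)*(x - a)^2  # adds the local curvature term

f(0.5), taylor1(0.5), taylor2(0.5)              # ≈ (1.649, 1.5, 1.625)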
C Neural Representations
The differentiability of a neural network with respect to its parameters θ allows us to use the gradient of the loss function with respect to the parameterization, ∇θℓ, to iteratively improve the parameterization. This process is often referred to as neural network training or parameter tuning. It is demonstrated in example C.1.
Neural networks are typically trained on a data set of input-output pairs D. In this case, we tune the parameters to minimize the aggregate loss over the data set:

arg min_θ ∑_{(x,y)∈D} ℓ(fθ(x), y)     (C.1)
Data sets for modern problems tend to be very large, making the gradient of
equation (C.1) expensive to evaluate. It is common to sample random subsets of
the training data in each iteration, using these batches to compute the loss gradient.
Example C.1. The fundamentals of neural networks and parameter tuning.

Consider a very simple neural network, fθ(x) = θ1 + θ2 x. We wish our neural network to take the square footage x of a home and predict its price ypred. We want to minimize the squared deviation between the predicted housing price and the true housing price with the loss function ℓ(ypred, ytrue) = (ypred − ytrue)². Given a training pair, we can compute the gradient:

∇θℓ = [∂ℓ/∂θ1, ∂ℓ/∂θ2] = 2(ypred − ytrue) [1, x]

If our initial parameterization were θ = [10,000, 123] and we had the input-output pair (x = 2,500, ytrue = 360,000), then the loss gradient would be ∇θℓ = [−85,000, −2.125 × 10⁸]. We would take a small step in the opposite direction to improve our function approximation.
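The calculation in example C.1 can be reproduced in a few lines of Julia; this is a sketch of ours, and the step size α is a hypothetical value.

f(θ, x) = θ[1] + θ[2]*x                       # the simple network from example C.1
loss(θ, x, y) = (f(θ, x) - y)^2               # squared-error loss
∇loss(θ, x, y) = 2*(f(θ, x) - y) .* [1.0, x]  # gradient from the chain rule

θ = [10_000.0, 123.0]
g = ∇loss(θ, 2_500.0, 360_000.0)              # [-85000.0, -2.125e8]
α = 1e-12                                     # hypothetical step size
θ = θ - α*g                                   # step opposite the gradient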
In addition to reducing computation, computing gradients with smaller batch sizes introduces some stochasticity to the gradient, which helps training to avoid getting stuck in local minima.
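A minimal sketch of such a minibatch gradient estimate, assuming D is a vector of (x, y) pairs and ∇loss is defined as in the previous sketch; the batch size m is an arbitrary choice.

function minibatch_gradient(∇loss, θ, D; m=32)
    batch = rand(D, m)                               # sample m pairs (with replacement)
    return sum(∇loss(θ, x, y) for (x, y) in batch) ./ m
end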
Neural networks are typically constructed to pass the input through a series of layers.³ Networks with many layers are often called deep. In feedforward networks, each layer applies an affine transform, followed by a nonlinear activation function applied elementwise:⁴

x′ = φ.(Wx + b)     (C.2)

where matrix W and vector b are parameters associated with the layer. A fully connected layer is shown in figure C.1. The dimension of the output layer is different from that of the input layer when W is nonsquare. Figure C.2 shows a more compact depiction of the same network.

³ A sufficiently large, single-layer neural network can, in theory, approximate any function. See A. Pinkus, "Approximation Theory of the MLP Model in Neural Networks," Acta Numerica, vol. 8, pp. 143–195, 1999.

⁴ The nonlinearity introduced by the activation function provides something analogous to the activation behavior of biological neurons, in which input buildup eventually causes a neuron to fire. A. L. Hodgkin and A. F. Huxley, "A Quantitative Description of Membrane Current and Its Application to Conduction and Excitation in Nerve," Journal of Physiology, vol. 117, no. 4, pp. 500–544, 1952.
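Equation (C.2) can be written directly in Julia. The following sketch of ours uses randomly initialized parameters and the relu activation defined below.

relu(z) = max(0, z)
layer(W, b, x, φ=relu) = φ.(W*x + b)  # affine transform followed by elementwise activation

W = randn(3, 2)                       # nonsquare W maps a 2-vector to a 3-vector
b = randn(3)
x = [0.5, -1.0]
x′ = layer(W, b, x)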
There are many types of activation functions that are commonly used. Similar
to their biological inspiration, they tend to be close to zero when their input is
low and large when their input is high. Some common activation functions are
shown in figure C.5.
Sometimes special layers are incorporated to achieve certain effects. For ex-
ample, in figure C.4, we used a softmax layer at the end to force the output to
represent a two-element categorical distribution. The softmax function applies
the exponential function to each element, which ensures that they are positive
and then normalizes the resulting values by their sum so that the output is a valid probability distribution:

softmax(x)ᵢ = exp(xᵢ) / ∑ⱼ exp(xⱼ)     (C.4)

Figure C.4 depicts a network ending in a fully connected + softmax layer that maps a hidden vector in R⁵ to ypred ∈ R², along with the complicated, nonlinear decision boundaries such a network can represent. Figure C.5 plots some common activation functions: relu, max(0, x); leaky relu, max(αx, x); and swish, x sigmoid(x).
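A one-line sketch of the softmax function in equation (C.4):

softmax(x) = exp.(x) ./ sum(exp.(x))
softmax([1.0, 2.0])  # ≈ [0.269, 0.731]; positive entries that sum to 1

In practice, implementations usually subtract maximum(x) from each element before exponentiating to improve numerical stability.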
Gradients of the loss with respect to the network parameters can be computed efficiently using reverse accumulation, which propagates derivatives backward through the computational graph. Example C.2 demonstrates this process. Many deep learning packages compute gradients using such automatic differentiation techniques.⁶ Users rarely have to provide their own gradients.

⁶ A. Griewank and A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd ed. SIAM, 2008.
Example C.2. How reverse accumulation is used to compute parameter gradients given training data.

Recall the neural network and loss function from example C.1. The computational graph for the loss calculation introduces the intermediate quantities

c1 = θ2 x,    ypred = c1 + θ1,    c2 = ypred − ytrue,    ℓ = c2²

Reverse accumulation begins with a forward pass, in which the computational graph is evaluated. We will again use θ = [10,000, 123] and the input-output pair (x = 2,500, ytrue = 360,000), giving

c1 = 307,500,    ypred = 317,500,    c2 = −42,500,    ℓ = 1.81 × 10⁹

The gradient is then computed by working back up the tree, using the local derivatives

∂ypred/∂c1 = 1,    ∂c1/∂θ2 = 2,500,    ∂c2/∂ypred = 1,    ∂ypred/∂θ1 = 1,    ∂ℓ/∂c2 = −85,000

Finally, we compute:

∂ℓ/∂θ1 = (∂ℓ/∂c2)(∂c2/∂ypred)(∂ypred/∂θ1) = −85,000 · 1 · 1 = −85,000

∂ℓ/∂θ2 = (∂ℓ/∂c2)(∂c2/∂ypred)(∂ypred/∂c1)(∂c1/∂θ2) = −85,000 · 1 · 1 · 2,500 = −2.125 × 10⁸
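These values can be checked with a reverse-mode automatic differentiation package. The sketch below assumes the Zygote.jl package has been added; it is an illustration of ours, not part of the book's code.

using Zygote

loss(θ) = ((θ[1] + θ[2]*2_500) - 360_000)^2
gradient(loss, [10_000.0, 123.0])  # ([-85000.0, -2.125e8],)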
D Julia
Julia is a free, open-source scientific programming language with syntax similar to languages such as Python, MATLAB, and R. It was selected for use in this book because it is sufficiently high level² so that the algorithms can be compactly expressed and readable while also being fast. This book is compatible with Julia version 1.11. This appendix introduces the concepts necessary for understanding the included code, omitting many of the advanced features of the language.

² In contrast with languages like C++, Julia does not require programmers to worry about memory management and other lower-level details, yet it allows low-level control when needed.
D.1 Types
Julia has a variety of basic types that can represent data given as truth values,
numbers, strings, arrays, tuples, and dictionaries. Users can also define their own
types. This section explains how to use some of the basic types and how to define
new types.
D.1.1 Booleans
The Boolean type in Julia, written as Bool, includes the values true and false. We
can assign these values to variables. Variable names can be any string of characters,
including Unicode, with a few restrictions.
α = true
done = false
The variable name appears on the left side of the equal sign; the value that variable
is to be assigned is on the right side.
We can make assignments in the Julia console. The console, or REPL (for read,
eval, print, loop), will return a response to the expression being evaluated. The #
symbol indicates that the rest of the line is a comment.
julia> x = true
true
julia> y = false; # semicolon suppresses the console output
julia> typeof(x)
Bool
julia> x == y # test for equality
false
D.1.2 Numbers
Julia supports integer and floating-point numbers, as shown here:
julia> typeof(42)
Int64
julia> typeof(42.0)
Float64
Here, Int64 denotes a 64-bit integer, and Float64 denotes a 64-bit floating-point value.³ We can perform the standard mathematical operations:

³ On 32-bit machines, an integer literal like 42 is interpreted as an Int32.

julia> x = 4
4
julia> y = 2
2
julia> x + y
6
julia> x - y
2
julia> x * y
8
julia> x / y
2.0
julia> x ^ y # exponentiation
16
julia> x % y # remainder from division
0
julia> div(x, y) # truncated division returns an integer
2
Note that the result of x / y is a Float64, even when x and y are integers. We
can also perform these operations at the same time as an assignment. For example,
x += 1 is shorthand for x = x + 1.
We can also make comparisons:
julia> 3 > 4
false
julia> 3 >= 4
false
julia> 3 ≥ 4 # unicode also works, use \ge[tab] in console
false
julia> 3 < 4
true
julia> 3 <= 4
true
julia> 3 ≤ 4 # unicode also works, use \le[tab] in console
true
julia> 3 == 4
false
julia> 3 < 4 < 5
true
D.1.3 Strings
A string is an array of characters. Strings are not used very much in this textbook
except to report certain errors. An object of type String can be constructed using
" characters. For example:
julia> x = "optimal"
"optimal"
julia> typeof(x)
String
D.1.4 Symbols
A symbol represents an identifier. It can be written using the : operator or con-
structed from strings:
julia> :A
:A
julia> :Battery
:Battery
julia> Symbol("Failure")
:Failure
D.1.5 Vectors
A vector is a one-dimensional array that stores a sequence of values. We can
construct a vector using square brackets, separating elements by commas:
julia> x = []; # empty vector
julia> x = trues(3); # Boolean vector containing three trues
julia> x = ones(3); # vector of three ones
julia> x = zeros(3); # vector of three zeros
julia> x = rand(3); # vector of three random numbers between 0 and 1
julia> x = [3, 1, 4]; # vector of integers
julia> x = [3.1415, 1.618, 2.7182]; # vector of floats
We can pull out a range of elements from an array. Ranges are specified using
a colon notation:
julia> x = [1, 2, 5, 3, 1]
5-element Vector{Int64}:
1
2
5
3
1
julia> x[1:3] # pull out the first three elements
3-element Vector{Int64}:
1
2
5
julia> x[1:2:end] # pull out every other element
3-element Vector{Int64}:
1
5
1
julia> x[end:-1:1] # pull out all the elements in reverse order
5-element Vector{Int64}:
1
3
5
2
1
julia> push!(x, -1) # add an element to the end
6-element Vector{Int64}:
1
2
5
3
1
-1
julia> pop!(x) # remove an element from the end
-1
julia> append!(x, [2, 3]) # append [2, 3] to the end of x
7-element Vector{Int64}:
1
2
5
3
1
2
3
julia> sort!(x) # sort the elements, altering the same vector
7-element Vector{Int64}:
1
1
2
2
3
3
5
julia> sort(x); # sort the elements as a new vector
julia> x[1] = 2; print(x) # change the first element to 2
[2, 1, 2, 2, 3, 3, 5]
julia> x = [1, 2];
julia> y = [3, 4];
julia> x + y # add vectors
2-element Vector{Int64}:
4
6
julia> 3x - [1, 2] # multiply by a scalar and subtract
2-element Vector{Int64}:
2
4
julia> using LinearAlgebra
D.1.6 Matrices
A matrix is a two-dimensional array. Like a vector, it is constructed using square
brackets. We use spaces to delimit elements in the same row and semicolons to
delimit rows. We can also index into the matrix and output submatrices using
ranges:
julia> X = [1 2 3; 4 5 6; 7 8 9; 10 11 12];
julia> typeof(X) # a 2-dimensional array of Int64s
Matrix{Int64} (alias for Array{Int64, 2})
julia> X[2] # second element using column-major ordering
4
julia> X[3,2] # element in third row and second column
8
We can also construct a variety of special matrices and use array comprehen-
sions:
julia> Matrix(1.0I, 3, 3) # 3x3 identity matrix
3×3 Matrix{Float64}:
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
julia> Matrix(Diagonal([3, 2, 1])) # 3x3 diagonal matrix with 3, 2, 1 on diagonal
3×3 Matrix{Int64}:
3 0 0
0 2 0
0 0 1
julia> zeros(3,2) # 3x2 matrix of zeros
3×2 Matrix{Float64}:
0.0 0.0
0.0 0.0
0.0 0.0
julia> rand(3,2) # 3x2 random matrix
3×2 Matrix{Float64}:
0.892384 0.649514
0.415807 0.196564
0.543603 0.382245
julia> [sin(x + y) for x in 1:3, y in 1:2] # array comprehension
3×2 Matrix{Float64}:
0.909297 0.14112
0.14112 -0.756802
-0.756802 -0.958924
julia> X = [1 3; 3 1];
julia> map(sin, X) # elementwise application of sin
2×2 Matrix{Float64}:
0.841471 0.14112
0.14112 0.841471
julia> vec(X) # reshape an array as a vector
4-element Vector{Int64}:
1
3
3
1
D.1.7 Tuples
A tuple is an ordered list of values, potentially of different types. They are con-
structed with parentheses. They are similar to vectors, but they cannot be mutated:
julia> x = () # the empty tuple
()
julia> isempty(x)
true
julia> x = (1,) # tuples of one element need the trailing comma
(1,)
julia> typeof(x)
Tuple{Int64}
julia> x = (1, 0, [1, 2], 2.5029, 4.6692) # third element is a vector
(1, 0, [1, 2], 2.5029, 4.6692)
julia> typeof(x)
Tuple{Int64, Int64, Vector{Int64}, Float64, Float64}
julia> x[2]
0
julia> x[end]
4.6692
julia> x[4:end]
(2.5029, 4.6692)
julia> length(x)
5
julia> x = (1, 2)
(1, 2)
julia> a, b = x;
julia> a
1
julia> b
2
D.1.9 Dictionaries
A dictionary is a collection of key-value pairs. Key-value pairs are indicated with
a double arrow operator =>. We can index into a dictionary using square brackets,
just as with arrays and tuples:
julia> x = Dict(); # empty dictionary
julia> x[3] = 4 # associate key 3 with value 4
4
julia> x = Dict(3=>4, 5=>1) # create a dictionary with two key-value pairs
Dict{Int64, Int64} with 2 entries:
5 => 1
3 => 4
julia> x[5] # return the value associated with key 5
1
julia> haskey(x, 3) # check whether dictionary has key 3
true
julia> haskey(x, 4) # check whether dictionary has key 4
false
The double-colon operator can be used to specify the type for any field:
struct A
a::Int64
b::Float64
end
These type annotations require that we pass in an Int64 for the first field and
a Float64 for the second field. For compactness, this book does not use type annotations, though doing so comes at the expense of performance. Type annotations allow Julia to improve runtime performance because the compiler can optimize the underlying code for specific types.

D.1.11 Abstract Types

So far we have discussed concrete types, which are types that we can construct. However, concrete types are only part of the type hierarchy. There are also abstract types, which are supertypes of concrete types and other abstract types.

Figure D.1. The type hierarchy for the Float64 type: Any ⊃ Number ⊃ Real ⊃ AbstractFloat ⊃ {Float64, Float32, Float16, BigFloat}.

We can explore the type hierarchy of the Float64 type shown in figure D.1 using the supertype and subtypes functions:
julia> supertype(Float64)
AbstractFloat
julia> supertype(AbstractFloat)
Real
julia> supertype(Real)
Number
julia> supertype(Number)
Any
julia> supertype(Any) # Any is at the top of the hierarchy
Any
julia> using InteractiveUtils # required for using subtypes in scripts
julia> subtypes(AbstractFloat) # different types of AbstractFloats
4-element Vector{Any}:
BigFloat
Float16
Float32
Float64
julia> subtypes(Float64) # Float64 does not have any subtypes
Type[]
For dictionaries, the first type parameter specifies the key type, and the second specifies the value type. The example has Int64 keys and Float64 values, making the dictionary of type Dict{Int64,Float64}. Julia was able to infer these types based on the input, but we could have specified them explicitly.
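A minimal illustration with placeholder key-value pairs of our own:

julia> x = Dict{Int64,Float64}(3=>1.4, 5=>2.9)
Dict{Int64, Float64} with 2 entries:
  5 => 2.9
  3 => 1.4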
D.2 Functions
Instances of a composite type can be made callable by defining methods on the type itself:

julia> (x::A)() = x.a + x.b # adding a zero-argument function to the type A defined earlier
julia> (x::A)(y) = y*x.a + x.b # adding a single-argument function
julia> x = A(22, 8);
julia> x()
30
julia> x(2)
52
julia> f(x, y=10; z=2) = (x + y)*z; # y is an optional argument and z is a keyword argument
julia> f(1)
22
julia> f(2, z = 3)
36
julia> f(2, 3)
10
julia> f(2, 3, z = 1)
5
D.2.6 Dispatch
The types of the arguments passed to a function can be specified using the double
colon operator. If multiple methods of the same function are provided, Julia will
execute the appropriate method. The mechanism for choosing which method to
execute is called dispatch:
julia> f(x::Int64) = x + 10;
julia> f(x::Float64) = x + 3.1415;
julia> f(1)
11
julia> f(1.0)
4.141500000000001
julia> f(1.3)
4.4415000000000004
The method with a type signature that best matches the types of the arguments
given will be used:
julia> f(x) = 5;
julia> f(x::Float64) = 3.1415;
julia> f([3, 2, 1])
5
julia> f(0.00787499699)
3.1415
D.2.7 Splatting
It is often useful to splat the elements of a vector or a tuple into the arguments to
a function using the ... operator:
julia> f(x,y,z) = x + y - z;
julia> a = [3, 1, 2];
julia> f(a...)
2
julia> b = (2, 2, 0);
julia> f(b...)
4
julia> c = ([0,0],[1,1]);
julia> f([2,2], c...)
2-element Vector{Int64}:
1
1
D.3 Control Flow

We can control the flow of our programs using conditional evaluation and loops. This section provides some of the syntax used in the book.
We can also use the ternary operator with its question mark and colon syntax.
It checks the Boolean expression before the question mark. If the expression
evaluates to true, then it returns what comes before the colon; otherwise, it
returns what comes after the colon:
julia> f(x) = x > 0 ? x : 0;
julia> f(-10)
0
julia> f(10)
10
D.3.2 Loops
A loop allows for repeated evaluation of expressions. One type of loop is the
while loop, which repeatedly evaluates a block of expressions until the specified
condition after the while keyword is met. The following example sums the values
in the array X:
X = [1, 2, 3, 4, 6, 8, 11, 13, 16, 18]
s = 0
while !isempty(X)
s += pop!(X)
end
Another type of loop is the for loop, which uses the for keyword. The following
example will also sum over the values in the array X but will not modify X:
X = [1, 2, 3, 4, 6, 8, 11, 13, 16, 18]
s = 0
for y in X
s += y
end
D.3.3 Iterators
We can iterate over collections in contexts such as for loops and array comprehen-
sions. To demonstrate various iterators, we will use the collect function, which
returns an array of all items generated by an iterator:
julia> X = ["feed", "sing", "ignore"];
julia> Y = [-5, -0.5, 0];
julia> collect(zip(X, Y)) # iterate over multiple iterators simultaneously
3-element Vector{Tuple{String, Float64}}:
("feed", -5.0)
("sing", -0.5)
("ignore", 0.0)
julia> import IterTools: subsets
julia> collect(subsets(X)) # iterate over all subsets
8-element Vector{Vector{String}}:
[]
["feed"]
["sing"]
["feed", "sing"]
["ignore"]
["feed", "ignore"]
["sing", "ignore"]
["feed", "sing", "ignore"]
julia> collect(eachindex(X)) # iterate over indices into a collection
3-element Vector{Int64}:
1
2
3
julia> Z = [1 2; 3 4; 5 6];
julia> import Base.Iterators: product
julia> collect(product(X,Y)) # iterate over Cartesian product of multiple iterators
3×3 Matrix{Tuple{String, Float64}}:
("feed", -5.0) ("feed", -0.5) ("feed", 0.0)
("sing", -5.0) ("sing", -0.5) ("sing", 0.0)
("ignore", -5.0) ("ignore", -0.5) ("ignore", 0.0)
D.4 Packages
A package is a collection of Julia code and possibly other external libraries that
can be imported to provide additional functionality. This section briefly reviews
a few of the key packages that we build upon in this book. To add a registered
package like Distributions.jl, we can run
using Pkg
Pkg.add("Distributions")
To update the installed packages, we can run

Pkg.update()
D.4.1 Distributions.jl
We use the Distributions.jl package (version 0.25) to represent, fit, and sample
from probability distributions:
julia> using Distributions
julia> dist = Categorical([0.3, 0.5, 0.2]) # create a categorical distribution
Distributions.Categorical{Float64, Vector{Float64}}(support=Base.OneTo(3), p=[0.3, 0.5, 0.2])
julia> data = rand(dist) # generate a sample
1
julia> data = rand(dist, 2) # generate two samples
2-element Vector{Int64}:
3
2
julia> μ, σ = 5.0, 2.5; # define parameters of a normal distribution
julia> dist = Normal(μ, σ) # create a normal distribution
Distributions.Normal{Float64}(μ=5.0, σ=2.5)
julia> rand(dist) # sample from the distribution
4.944128552366248
julia> data = rand(dist, 3) # generate three samples
3-element Vector{Float64}:
-1.3094336870488144
6.84427722292975
2.0861877312652815
julia> data = rand(dist, 1000); # generate many samples
julia> Distributions.fit(Normal, data) # fit a normal distribution to the samples
Distributions.Normal{Float64}(μ=4.92941470988734, σ=2.3738926505677984)
julia> μ = [1.0, 2.0];
julia> Σ = [1.0 0.5; 0.5 2.0];
julia> dist = MvNormal(μ, Σ) # create a multivariate normal distribution
FullNormal(
dim: 2
μ: [1.0, 2.0]
Σ: [1.0 0.5; 0.5 2.0]
)
julia> rand(dist, 3) # generate three samples
2×3 Matrix{Float64}:
1.02625 0.648017 1.25499
1.04747 3.78303 1.35686
D.4.2 JuMP.jl
We use the JuMP.jl package (version 1.23) to specify optimization problems that
we can then solve using a variety of solvers, such as those included in GLPK.jl
and Ipopt.jl:
julia> using JuMP
julia> using GLPK
julia> model = Model(GLPK.Optimizer) # create model and use GLPK as solver
A JuMP Model
├ solver: GLPK
├ objective_sense: FEASIBILITY_SENSE
├ num_variables: 0
├ num_constraints: 0
└ Names registered in the model: none
julia> @variable(model, x[1:3]) # define variables x[1], x[2], and x[3]
3-element Vector{JuMP.VariableRef}:
x[1]
x[2]
x[3]
julia> @objective(model, Max, sum(x) - x[2]) # define maximization objective
x[1] + 0 x[2] + x[3]
julia> @constraint(model, x[1] + x[2] ≤ 3) # add constraint
x[1] + x[2] ≤ 3
julia> @constraint(model, x[2] + x[3] ≤ 2) # add another constraint
x[2] + x[3] ≤ 2
julia> @constraint(model, x[2] ≥ 0) # add another constraint
x[2] ≥ 0
julia> optimize!(model) # solve
julia> value.(x) # extract optimal values for elements in x
3-element Vector{Float64}:
3.0
0.0
2.0
D.4.3 Optim.jl
We use the Optim.jl package (version 1.9) to solve unconstrained optimization
problems using a variety of techniques:
julia> using Optim
julia> f(x) = x[1]^4+x[1]^2-x[1]+x[2]^2-20*x[1]^2*x[2]^2; # function to minimize
julia> x₀ = [0.0, 0.0]; # initial guess
julia> result = optimize(f, x₀, LBFGS()) # minimize f starting at x₀ using LBFGS
* Status: success
* Candidate solution
Final objective value: -2.148047e-01
* Found with
Algorithm: L-BFGS
* Convergence measures
|x - x'| = 1.94e-04 ≰ 0.0e+00
|x - x'|/|x'| = 5.04e-04 ≰ 0.0e+00
|f(x) - f(x')| = 7.13e-08 ≰ 0.0e+00
|f(x) - f(x')|/|f(x')| = 3.32e-07 ≰ 0.0e+00
|g(x)| = 5.12e-09 ≤ 1.0e-08
* Work counters
Seconds run: 0 (vs limit Inf)
Iterations: 4
f(x) calls: 9
∇f(x) calls: 9
julia> result.minimizer # extract the optimal value for x
2-element Vector{Float64}:
0.3854584971606701
0.0
julia> result.minimum # extract the optimal value of the function
-0.21480474685286194
D.4.4 SimpleWeightedGraphs.jl
We extend the SimpleWeightedGraphs.jl package (version 1.4) to represent
graphs with weighted edges for discrete reachability analysis in chapter 10. Specif-
ically, we extend the package to create a WeightedGraph type that can represent a
graph with weighted edges between states. The code for this extension is provided
in the ancillaries for this book, and the following code demonstrates its usage:
julia> add_edge!(g, :s1, :s2, 1.0); # add an edge from s1 to s2 with weight 1.0
julia> get_weight(g, :s1, :s2) # get the weight of the edge from s1 to s2
1.0
julia> inneighbors(g, :s2) # get the nodes with an edge pointing to s2
1-element Vector{Symbol}:
:s1
julia> outneighbors(g, :s1) # get the nodes with an edge starting at s1
1-element Vector{Symbol}:
:s2
D.5 Convenience Functions

The SetCategorical type represents a categorical distribution over a discrete set of elements. Its fields and inner constructors are shown below; the single-argument constructor assigns uniform weights, and the two-argument constructor normalizes the provided weights, falling back to uniform weights when they are degenerate.

struct SetCategorical{S}
    elements::Vector{S}  # set elements (could be repeated)
    distr::Categorical   # categorical distribution over the elements

    function SetCategorical(elements::AbstractVector{S}) where S
        weights = ones(length(elements))
        return new{S}(elements, Categorical(normalize(weights, 1)))
    end

    function SetCategorical(
        elements::AbstractVector{S},
        weights::AbstractVector{Float64}
    ) where S
        ℓ₁ = norm(weights, 1)
        if ℓ₁ < 1e-6 || isinf(ℓ₁)
            return SetCategorical(elements)
        end
        distr = Categorical(normalize(weights, 1))
        return new{S}(elements, distr)
    end
end

Distributions.rand(D::SetCategorical) = D.elements[rand(D.distr)]
Distributions.rand(D::SetCategorical, n::Int) = D.elements[rand(D.distr, n)]
function Distributions.pdf(D::SetCategorical, x)
    sum(e == x ? w : 0.0 for (e, w) in zip(D.elements, D.distr.p))
end
References
11. T. L. Arel, Safety Management System Manual, Air Traffic Organization, Federal Avia-
tion Administration, 2022 (cit. on p. 13).
12. D. Ariely, Predictably Irrational: The Hidden Forces That Shape Our Decisions. Harper,
2008 (cit. on p. 41).
13. M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A Tutorial on Particle
Filters for Online Nonlinear/non-Gaussian Bayesian Tracking,” IEEE Transactions
on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002 (cit. on p. 157).
14. T. W. Athan and P. Y. Papalambros, “A Note on Weighted Criteria Methods for
Compromise Solutions in Multi-Objective Optimization,” Engineering Optimization,
vol. 27, no. 2, pp. 155–176, 1996 (cit. on p. 60).
15. D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller,
“How to Explain Individual Classification Decisions,” Journal of Machine Learning
Research, vol. 11, pp. 1803–1831, 2010 (cit. on p. 264).
16. C. Baier and J.-P. Katoen, “Principles of Model Checking,” in MIT Press, 2008, ch. 1
(cit. on p. 2).
17. C. Baier and J.-P. Katoen, “Principles of Model Checking,” in MIT Press, 2008, ch. 6
(cit. on p. 67).
18. C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, 2008 (cit. on p. 75).
19. G. Barthe, J.-P. Katoen, and A. Silva, Foundations of Probabilistic Programming. Cam-
bridge University Press, 2020 (cit. on p. 134).
20. A. Bauer, M. Leucker, and C. Schallhart, “Runtime Verification for LTL and TLTL,”
ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 20, no. 4,
pp. 1–64, 2011 (cit. on p. 311).
21. R. Bhattacharyya, S. Jung, L. Kruse, R. Senanayake, and M. J. Kochenderfer, “A Hy-
brid Rule-Based and Data-Driven Approach to Driver Modeling Through Particle
Filtering,” IEEE Transactions on Intelligent Transportation Systems, no. 2108.12820,
2021 (cit. on p. 48).
22. A. F. Bielajew, “History of Monte Carlo,” in Monte Carlo Techniques in Radiation
Therapy, CRC Press, 2021, pp. 3–15 (cit. on p. 4).
23. A. Biere, Handbook of Satisfiability. IOS Press, 2009, vol. 185 (cit. on pp. 238, 240).
24. C. M. Bishop and H. Bishop, Deep Learning: Foundations and Concepts. Springer
Nature, 2023 (cit. on p. 28).
25. A. T. Borchers, F. Hagie, C. L. Keen, and M. E. Gershwin, “The History and Contem-
porary Challenges of the US Food and Drug Administration,” Clinical Therapeutics,
vol. 29, no. 1, pp. 1–16, 2007 (cit. on p. 5).
26. G. E. Box, “Science and Statistics,” Journal of the American Statistical Association,
vol. 71, no. 356, pp. 791–799, 1976 (cit. on p. 20).
41. T. Dang and T. Nahhal, “Coverage-Guided Test Generation for Continuous and
Hybrid Systems,” Formal Methods in System Design, vol. 34, pp. 183–213, 2009 (cit.
on p. 108).
42. P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A Tutorial on the
Cross-Entropy Method,” Annals of Operations Research, vol. 134, pp. 19–67, 2005 (cit.
on p. 151).
43. P. Del Moral, A. Doucet, and A. Jasra, “Sequential Monte Carlo Samplers,” Journal of
the Royal Statistical Society Series B: Statistical Methodology, vol. 68, no. 3, pp. 411–436,
2006 (cit. on p. 159).
44. H. Delecki, A. Corso, and M. J. Kochenderfer, “Model-Based Validation as Proba-
bilistic Inference,” in Conference on Learning for Dynamics and Control (L4DC), 2023
(cit. on p. 129).
45. W. M. Dickie, “A Comparison of the Scientific Method and Achievement of Aristotle
and Bacon,” The Philosophical Review, vol. 31, no. 5, pp. 471–494, 1922 (cit. on p. 3).
46. M. Dowson, “The Ariane 5 Software Failure,” Software Engineering Notes, vol. 22,
no. 2, p. 84, 1997 (cit. on p. 7).
47. S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid Monte Carlo,”
Physics Letters B, vol. 195, no. 2, pp. 216–222, 1987 (cit. on p. 134).
48. A. Duret-Lutz, “Manipulating LTL Formulas Using Spot 1.0,” in Automated Technol-
ogy for Verification and Analysis, 2013 (cit. on p. 75).
49. V. Dwaracherla, Z. Wen, I. Osband, X. Lu, S. M. Asghari, and B. Van Roy, “Ensem-
bles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping,”
Transactions on Machine Learning Research, 2022 (cit. on p. 309).
50. EASA AI Task Force, “Concepts of Design Assurance for Neural Networks,” EASA,
2020 (cit. on p. 5).
51. B. Efron, “Bootstrap Methods: Another Look at the Jackknife,” in Breakthroughs in
Statistics: Methodology and Distribution, Springer, 1992, pp. 569–593 (cit. on p. 39).
52. V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo, “Generalized Multiple Im-
portance Sampling,” Statistical Science, vol. 34, no. 1, pp. 129–155, 2019 (cit. on
p. 148).
53. A. Engel, Verification, Validation, and Testing of Engineered Systems. John Wiley & Sons,
2010, vol. 73 (cit. on p. 1).
54. J. M. Esposito, J. Kim, and V. Kumar, “Adaptive RRTs for Validating Hybrid Robotic
Control Systems,” in Algorithmic Foundations of Robotics, Springer, 2005, pp. 107–121
(cit. on p. 106).
55. M. Everett, G. Habibi, C. Sun, and J. P. How, “Reachability Analysis of Neural
Feedback Loops,” IEEE Access, vol. 9, pp. 163 938–163 953, 2021 (cit. on p. 226).
86. E. T. Jaynes, Probability Theory: The Logic of Science. Cambridge University Press,
2003 (cit. on p. 20).
87. P. Jorion, “Risk Management Lessons from Long-Term Capital Management,”
European Financial Management, vol. 6, no. 3, pp. 277–300, 2000 (cit. on p. 7).
88. H. Kahn and T. E. Harris, “Estimation of Particle Transmission by Random Sam-
pling,” National Bureau of Standards Applied Mathematics Series, vol. 12, pp. 27–30,
1951 (cit. on p. 170).
89. G. K. Kamenev, “An Algorithm for Approximating Polyhedra,” Computational Math-
ematics and Mathematical Physics, vol. 4, no. 36, pp. 533–544, 1996 (cit. on p. 194).
90. S. Karaman and E. Frazzoli, “Incremental Sampling-Based Algorithms for Optimal
Motion Planning,” Robotics Science and Systems VI, vol. 104, no. 2, pp. 267–274, 2010
(cit. on p. 111).
91. H. Karloff, Linear Programming. Springer, 2008 (cit. on p. 196).
92. S. M. Katz, K. D. Julian, C. A. Strong, and M. J. Kochenderfer, “Generating Probabilis-
tic Safety Guarantees for Neural Network Controllers,” Machine Learning, vol. 112,
pp. 2903–2931, 2023 (cit. on p. 250).
93. S. M. Katz, A.-C. LeBihan, and M. J. Kochenderfer, “Learning an Urban Air Mobility
Encounter Model from Expert Preferences,” in Digital Avionics Systems Conference
(DASC), 2019 (cit. on p. 28).
94. P. Kirichenko, P. Izmailov, and A. G. Wilson, “Why Normalizing Flows Fail to Detect
Out-Of-Distribution Data,” Advances in Neural Information Processing Systems, vol. 33,
pp. 20 578–20 589, 2020 (cit. on p. 291).
95. J. Kleinberg, S. Mullainathan, and M. Raghavan, “Inherent Trade-Offs in the Fair
Determination of Risk Scores,” in Innovations in Theoretical Computer Science (ITCS)
Conference, 2017 (cit. on p. 7).
96. I. Kobyzev, S. J. Prince, and M. A. Brubaker, “Normalizing Flows: An Introduction
and Review of Current Methods,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 43, no. 11, pp. 3964–3979, 2020 (cit. on p. 24).
97. M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019
(cit. on pp. 92, 261, 276).
98. M. J. Kochenderfer and J. P. Chryssanthacopoulos, “Robust Airborne Collision
Avoidance Through Dynamic Programming,” Massachusetts Institute of Technol-
ogy, Lincoln Laboratory, Project Report ATC-371, 2011 (cit. on p. 325).
99. M. J. Kochenderfer, M. W. M. Edwards, L. P. Espindle, J. K. Kuchar, and J. D. Grif-
fith, “Airspace Encounter Models for Estimating Collision Risk,” AIAA Journal on
Guidance, Control, and Dynamics, vol. 33, no. 2, pp. 487–499, 2010 (cit. on p. 44).
100. M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making.
MIT Press, 2022 (cit. on pp. 2, 10, 25, 27, 34, 41, 324).
117. F. Liese and I. Vajda, “On Divergences and Informations in Statistics and Information
Theory,” IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394–4412,
2006 (cit. on p. 46).
118. C. Liu, T. Arnon, C. Lazarus, C. Strong, C. Barrett, and M. J. Kochenderfer, “Algo-
rithms for Verifying Deep Neural Networks,” Foundations and Trends in Optimization,
vol. 4, no. 3–4, pp. 244–404, 2021 (cit. on p. 226).
119. F. Llorente, L. Martino, D. Delgado, and J. Lopez-Santiago, “Marginal Likelihood
Computation for Model Selection and Hypothesis Testing: an Extensive Review,”
SIAM Review, vol. 65, no. 1, pp. 3–58, 2023 (cit. on pp. 158, 161).
120. S. Lloyd, “Least Squares Quantization in PCM,” IEEE Transactions on Information
Theory, vol. 28, no. 2, pp. 129–137, 1982 (cit. on p. 279).
121. K. Makino and M. Berz, “Taylor Models and Other Validated Functional Inclusion
Methods,” International Journal of Pure and Applied Mathematics, vol. 4, no. 4, pp. 379–
456, 2003 (cit. on p. 212).
122. O. Maler, “Computing Reachable Sets: An Introduction,” French National Center of
Scientific Research, pp. 1–8, 2008 (cit. on p. 186).
123. O. Maler and D. Nickovic, “Monitoring Temporal Properties of Continuous Signals,”
in International Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems,
2004 (cit. on p. 69).
124. R. L. McCarthy, “Autonomous Vehicle Accident Data Analysis: California OL 316
Reports: 2015–2020,” ASCE-ASME Journal of Risk and Uncertainty in Engineering
Systems, Part B: Mechanical Engineering, vol. 8, no. 3, p. 034 502, 2022 (cit. on p. 6).
125. S. B. McGrayne, The Theory That Would Not Die. Yale University Press, 2011 (cit. on
p. 34).
126. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller,
“Equation of State Calculations by Fast Computing Machines,” Journal of Chemical
Physics, vol. 21, no. 6, pp. 1087–1092, 1953 (cit. on p. 126).
127. B. P. Miller, L. Fredriksen, and B. So, “An Empirical Study of the Reliability of UNIX
Utilities,” Communications of the ACM, vol. 33, no. 12, pp. 32–44, 1990 (cit. on p. 85).
128. B. Müller, J. Reinhardt, and M. T. Strickland, Neural Networks. Springer, 1995 (cit. on
p. 335).
129. C. N. Murphy and J. Yates, The International Organization for Standardization (ISO):
Global Governance Through Voluntary Consensus. Routledge, 2009 (cit. on p. 5).
130. K. P. Murphy, Probabilistic Machine Learning: An Introduction. MIT Press, 2022 (cit.
on p. 28).
131. R. Neidinger, “Directions for Computing Truncated Multivariate Taylor Series,”
Mathematics of Computation, vol. 74, no. 249, pp. 321–340, 2005 (cit. on p. 209).
177. A. M. Turing, “Computing Machinery and Intelligence,” Mind, vol. 59, pp. 433–460,
1950 (cit. on p. 48).
178. M. Vazquez-Chanlatte, J. V. Deshmukh, X. Jin, and S. A. Seshia, “Logical Clustering
and Learning for Time-Series Data,” in International Conference on Computer Aided
Verification, 2017 (cit. on p. 282).
179. A. G. Wilson and P. Izmailov, “Bayesian Deep Learning and a Probabilistic Perspec-
tive of Generalization,” Advances in Neural Information Processing Systems (NeurIPS),
vol. 33, pp. 4697–4708, 2020 (cit. on p. 308).
180. L. A. Wolsey, Integer Programming. Wiley, 2020 (cit. on p. 221).
181. B. Wong, “Points of View: Color Blindness,” Nature Methods, vol. 8, no. 6, pp. 441–
442, 2011 (cit. on p. xi).
182. W. Xiang, H.-D. Tran, and T. T. Johnson, “Output Reachable Set Estimation and
Verification for Multilayer Neural Networks,” IEEE Transactions on Neural Networks
and Learning Systems, vol. 29, no. 11, pp. 5777–5783, 2018 (cit. on p. 229).
183. W. Xiang, H.-D. Tran, J. A. Rosenfeld, and T. T. Johnson, “Reachable Set Estima-
tion and Safety Verification for Piecewise Linear Systems with Neural Network
Controllers,” in American Control Conference (ACC), 2018 (cit. on p. 227).
184. D. Xu and Y. Tian, “A Comprehensive Survey of Clustering Algorithms,” Annals of
Data Science, vol. 2, pp. 165–193, 2015 (cit. on p. 279).
185. H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, and L. Daniel, “Efficient Neural
Network Robustness Certification with General Activation Functions,” Advances in
Neural Information Processing Systems (NeurIPS), vol. 31, 2018 (cit. on p. 229).
186. Y.-D. Zhou, K.-T. Fang, and J.-H. Ning, “Mixture Discrepancy for Quasi-Random
Point Sets,” Journal of Complexity, vol. 29, no. 3-4, pp. 283–301, 2013 (cit. on p. 107).
187. B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum Entropy Inverse
Reinforcement Learning,” in AAAI Conference on Artificial Intelligence (AAAI), 2008
(cit. on p. 41).
188. G. M. Ziegler, Lectures on Polytopes. Springer Science & Business Media, 2012, vol. 152
(cit. on p. 187).
189. A. Zutshi, J. V. Deshmukh, S. Sankaranarayanan, and J. Kapinski, “Multiple Shoot-
ing, CEGAR-Based Falsification for Hybrid Systems,” in International Conference on
Embedded Software, 2014 (cit. on p. 99).