Algorithms for Validation


Mykel J. Kochenderfer
Sydney M. Katz
Anthony L. Corso
Robert J. Moss

Stanford, California
© 2024 Kochenderfer, Katz, Corso, and Moss

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means
(including photocopying, recording or information storage and retrieval) without permission in writing from
the publisher.

This book was set in TeX Gyre Pagella by the authors in LaTeX.
Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data is available.

ISBN:

10 9 8 7 6 5 4 3 2 1
To our families.
Contents

Acknowledgments xi
Preface xiii
1 Introduction 1
1.1 Validation 1
1.2 History 3
1.3 Societal Consequences 6
1.4 Validation Algorithms 8
1.5 Challenges 14
1.6 Overview 16
2 System Modeling 19
2.1 Model Building 19
2.2 Probability 20
2.3 Parameter Learning 27
2.4 Agent Models 39
2.5 Model Validation 42
2.6 Summary 51
3 Property Specification 53
3.1 Properties of Systems 53
3.2 Metrics for Stochastic Systems 54
3.3 Composite Metrics 56
3.4 Logical Specifications 62
3.5 Temporal Logic 65
3.6 Reachability Specifications 73
3.7 Summary 77

4 Falsification through Optimization 81


4.1 Direct Sampling 81
4.2 Disturbances 82
4.3 Fuzzing 85
4.4 Falsification through Optimization 88
4.5 Objective Functions 89
4.6 Optimization Algorithms 92
4.7 Summary 95
5 Falsification through Planning 97
5.1 Shooting Methods 97
5.2 Tree Search 99
5.3 Heuristic Search 100
5.4 Monte Carlo Tree Search 112
5.5 Reinforcement Learning 116
5.6 Simulator Requirements 117
5.7 Summary 120
6 Failure Distribution 121
6.1 Distribution over Failures 121
6.2 Rejection Sampling 122
6.3 Markov Chain Monte Carlo 126
6.4 Probabilistic Programming 134
6.5 Summary 136
7 Failure Probability Estimation 139
7.1 Direct Estimation 139
7.2 Importance Sampling 145
7.3 Adaptive Importance Sampling 151
7.4 Sequential Monte Carlo 157
7.5 Ratio of Normalizing Constants 161
7.6 Multilevel Splitting 170
7.7 Summary 173
7.8 Exercises 174


8 Reachability for Linear Systems 177


8.1 Forward Reachability 177
8.2 Set Propagation Techniques 179
8.3 Set Representations 186
8.4 Reducing Computational Cost 190
8.5 Linear Programming 195
8.6 Summary 199
9 Reachability for Nonlinear Systems 203
9.1 Interval Arithmetic 203
9.2 Inclusion Functions 205
9.3 Taylor Models 212
9.4 Concrete Reachability 214
9.5 Optimization-Based Nonlinear Reachability 219
9.6 Partitioning 222
9.7 Neural Networks 226
9.8 Summary 229
10 Reachability for Discrete Systems 231
10.1 Graph Formulation 231
10.2 Reachable Sets 233
10.3 Satisfiability 235
10.4 Probabilistic Reachability 242
10.5 Discrete State Abstractions 248
10.6 Summary 251
11 Explainability 255
11.1 Explanations 255
11.2 Policy Visualization 256
11.3 Feature Importance 257
11.4 Policy Explanation through Surrogate Models 267
11.5 Counterfactual Explanations 273
11.6 Failure Mode Characterization 279
11.7 Summary 283
12 Runtime Monitoring 285
12.1 Operational Design Domain Monitoring 285
12.2 Uncertainty Quantification 292
12.3 Failure Monitoring 309
12.4 Summary 313

Appendices
A Systems 317
A.1 Default Implementations 317
A.2 Simple Gaussian System 318
A.3 Multivariate Gaussian System 318
A.4 Mass-Spring-Damper System 319
A.5 Inverted Pendulum System 321
A.6 Grid World System 322
A.7 Continuum World System 322
A.8 Aircraft Collision Avoidance System 325
B Mathematical Concepts 327
B.1 Measure Spaces 327
B.2 Probability Spaces 328
B.3 Metric Spaces 328
B.4 Normed Vector Spaces 328
B.5 Positive Definiteness 330
B.6 Information Content 330
B.7 Entropy 330
B.8 Cross Entropy 331
B.9 Relative Entropy 331
B.10 Taylor Expansion 331
C Neural Representations 335
D Julia 341
D.1 Types 341
D.2 Functions 354
D.3 Control Flow 357
D.4 Packages 359
D.5 Convenience Functions 363
References 365
Index 377

Acknowledgments

We wish to thank the many individuals who have provided valuable feedback
on early drafts of our manuscript, including Matthias Althoff, Stephen Boyd,
Emmanuel Candès, Francois Chaubard, Harrison Delecki, Hanna Krasowski,
Liam Kruse, Alexandros Tzikas, Romeo Valentin, and Jun Wang. Many of the
algorithms discussed in this book were explored during the development of
the ACAS X aircraft collision avoidance systems with the generous support and
leadership of Neal Suchy of the Federal Aviation Administration. The participants
of Dagstuhl Seminar 24361 provided valuable input to the topics included in the
book. It has been a pleasure working with Elizabeth Swayze and the editing team
from the MIT Press in preparing this manuscript for publication.
The style of this book was inspired by Edward Tufte. Among other stylistic
elements, we adopted his wide margins and use of small multiples. The typesetting
of this book is based on the Tufte-LaTeX package by Kevin Godby, Bil Kleb, and
Bill Wood. The book’s color scheme was adapted from the Monokai theme by Jon
Skinner of Sublime Text (sublimetext.com) and a palette that better accommodates
individuals with color blindness.¹ For plots, we use the viridis color map defined
by Stéfan van der Walt and Nathaniel Smith.

¹ B. Wong, “Points of View: Color Blindness,” Nature Methods, vol. 8, no. 6, pp. 441–442, 2011.

We have also benefited from the various open-source packages on which this
textbook depends (see appendix D). The authors thank Tor Fjelde for his help
with Turing.jl. The typesetting of the code was done with the help of pythontex,
which is maintained by Geoffrey Poore. The typeface used for the algorithms is
JuliaMono (github.com/cormullion/juliamono). The plotting was handled by
pgfplots, which is maintained by Christian Feuersänger.
Preface

This book provides a broad introduction to algorithms for validating safety-critical


systems. We cover a wide variety of topics related to validation, introducing the
underlying mathematical problem formulations and the algorithms for solving
them. Figures, examples, and exercises are provided to convey the intuition behind
the various approaches.
This book is intended for advanced undergraduates and graduate students, as
well as professionals. It requires some mathematical maturity and assumes prior
exposure to multivariable calculus, linear algebra, and probability concepts. Some
review material is provided in the appendices. Disciplines where the book would
be especially useful include mathematics, statistics, computer science, aerospace,
electrical engineering, and operations research.
Fundamental to this textbook are the algorithms, which are all implemented
in the Julia programming language. We have found this language to be ideal for
specifying algorithms in human-readable form. The priority in the design of the
algorithmic implementations was interpretability rather than efficiency. Indus-
trial applications, for example, may benefit from alternative implementations.
Permission is granted, free of charge, to use the code snippets associated with
this book, subject to the condition that the source of the code is acknowledged.

Mykel J. Kochenderfer
Sydney M. Katz
Anthony L. Corso
Robert J. Moss
Stanford, California
December 23, 2024
1 Introduction

Before deploying decision-making systems in high-stakes settings, it is important


to ensure that they will operate as intended. We refer to the process of analyzing
the behavior of these systems as validation. Validation is a critical component of
the development process for decision-making systems in a variety of domains
including autonomous vehicles, robotics, and healthcare. As these systems and
their operating environments increase in complexity, understanding the full
spectrum of possible behaviors becomes more challenging and requires a rigorous
validation process. This book discusses these challenges and presents a variety of
computational methods for validating autonomous systems. This chapter begins
with a broad overview of validation. We motivate the need for validation from a
historical perspective and outline the societal consequences of validation failures.
We then introduce the validation framework that we will use throughout the
book. We discuss the challenges associated with validation and conclude with an
overview of the remaining chapters in the book.

1.1 Validation

The concept of validation is defined differently by different communities, and the


word itself is often used in conjunction with other terms such as verification and
testing.¹ In this book, we define validation as the broad process of establishing
confidence that a system will behave as desired when deployed in the real world.
We define verification as a special type of validation that provides guarantees
about the correctness of a system with respect to a specification. We define testing
as a technique used for validation that involves evaluating the system on a discrete
set of test cases.

¹ For a discussion on these definitions, see section 1.2.3 of A. Engel, Verification, Validation, and Testing of Engineered Systems. John Wiley & Sons, 2010, vol. 73.

From a systems engineering perspective, validation is viewed as a phase of the
development cycle for autonomous systems (figure 1.1). A typical development
cycle begins by defining a set of operational requirements for the system. For
example, the developers of an aircraft collision avoidance system may identify
a requirement on the probability of collision when deployed in the airspace.
Designers then use these requirements to produce an initial version of the system.
In the aircraft collision avoidance example, the system may consist of a decision-making
agent that selects actions to avoid collisions based on sensor information.
A common technique for designing a system to match a set of desired requirements
is to optimize the system with respect to an objective function or reward
model that captures the requirements. However, the models used to perform
the optimization may be imperfect, the optimization objective may not perfectly
capture the requirements, and the optimization process itself is often approximate.
This misalignment can result in a mismatch between the desired behavior of the
system and its actual behavior when deployed in the real world. We refer to this
phenomenon as the alignment problem.²

Figure 1.1. A typical development cycle for an autonomous system: define requirements, design, and validate. This book focuses on the validation phase of development.

² A detailed discussion of the alignment problem is provided in B. Christian, The Alignment Problem: Machine Learning and Human Values. W. W. Norton & Company, 2020.

The alignment problem motivates the need for the validation phase of the
development cycle. Given the requirements and design, validation algorithms
analyze whether the system will behave as intended when deployed in its operating
environment. Based on the results of the validation process, developers may
need to revise the design or requirements. This process is often repeated multiple
times before the system is ready for deployment. It is important to perform
validation early in the development cycle to detect bugs and misalignments before
they become more costly to fix. For example, repairing a software bug during
maintenance is often orders of magnitude more expensive than fixing the bug
early in the development cycle.³

³ C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, 2008, ch. 1.

This book focuses entirely on the validation phase of the development cycle.
We assume that we are given a system that has been designed to meet a set
of established requirements, and we discuss methods to translate the system
and its requirements to computational models and formal specifications that
allow us to apply a variety of validation algorithms. In other words, this book
is not about systems engineering or the development of systems.⁴ Instead, we
focus on algorithms that validate the behavior of these systems in their operating
environments.

⁴ More information about the systems engineering process can be found in A. Kossiakoff, S. M. Biemer, S. J. Seymour, and D. A. Flanigan, Systems Engineering Principles and Practice. John Wiley & Sons, 2020. A variety of algorithms for designing decision-making systems are provided in M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022.

Validation techniques have been developed for a wide variety of systems
ranging from aircraft parts to medical devices to customer service chatbots. For


example, aircraft designers validate the structural integrity of the wings through
extensive stress testing, and medical device manufacturers validate the safety
of their devices through clinical trials. In this book, we present an algorithmic
perspective on validation and focus specifically on the validation of decision-making
agents.
Decision-making agents interact with the environment and make decisions
based on the information they receive. These agents range from fully automated
systems that operate independently within their environment to decision-support
systems that inform human decision-makers.⁵ Examples include aircraft collision
avoidance systems, adaptive cruise control systems, hiring assistants, disaster
response systems, and other cyberphysical systems.⁶ While the algorithms presented
in this book can be applied to many different types of decision-making
agents, we place a particular emphasis on sequential decision-making agents,
which make a series of decisions over time. For example, an autonomous vehicle
must make a sequence of decisions to navigate from one location to another.

⁵ Autonomy and automation have different definitions in different communities. Autonomy is often defined as the automation of high-level tasks such as driving. The algorithms in this book can be applied to decision-making systems with any level of automation or autonomy.

⁶ Cyberphysical systems are computational systems that interact with the physical world.

1.2 History

The history of validation is deeply intertwined with the evolution of complex
systems across many domains. Early forms of validation can be traced back to the
ideas of ancient Greek philosophers such as Aristotle (384–322 BC).⁷ Aristotle
advocated for a continuous cycle of observation and experimentation to validate
hypotheses. The scientific method introduced during the scientific revolution of
the 16th and 17th centuries formalized this notion. During this time, Francis Bacon
(1561–1626) proposed a method for validating scientific hypotheses through
empirical observation and experimentation.

⁷ W. M. Dickie, “A Comparison of the Scientific Method and Achievement of Aristotle and Bacon,” The Philosophical Review, vol. 31, no. 5, pp. 471–494, 1922.

The technological changes brought on by the industrial revolutions accelerated
progress in validation. During the First Industrial Revolution of the late 18th and
early 19th centuries, the complexity of systems increased dramatically, and the
field of validation shifted from validating ideas and hypotheses to validating
machines and production processes. The increase of mass production in factories
during the Second Industrial Revolution (1870–1914) further motivated the need
for validation. Supervisors began to perform quality control checks on products to
ensure that they met the desired specifications.⁸ As production volume increased
in the following years, supervisors could no longer inspect every product, and
factories began to hire designated inspectors for quality control.

⁸ K. Ishikawa and J. H. Loftus, Introduction to Quality Control. Springer, 1990, vol. 98, ch. 1.


Figure 1.2. Comparison of the waterfall and V models of the software development lifecycle.

During World War II, production volume increased to the point where it was no
longer possible to inspect every product. This increase in production output led
to the adoption of statistical quality control methods, which relied on sampling
to speed up inspection. These ideas were developed by W. Edwards Deming⁹
(1900–1993) and Joseph M. Juran¹⁰ (1904–2008) and marked the beginning of
the field of statistical process control. Deming and Juran introduced these ideas
to Japanese manufacturers after World War II, which played a key role in the
post-war economic recovery of Japan.

⁹ W. M. Tsutsui, “W. Edwards Deming and the Origins of Quality Control in Japan,” Journal of Japanese Studies, vol. 22, no. 2, pp. 295–325, 1996.

¹⁰ D. Phillips-Donaldson, “100 Years of Juran,” Quality Progress, vol. 37, no. 5, pp. 25–31, 2004.

The advancements in computing technology in the latter half of the 20th century
increased our ability to use statistical methods to validate complex systems.
In the late 1940s, scientists at Los Alamos National Laboratory developed the
Monte Carlo method, which uses random sampling to solve complex mathematical
problems.¹¹ These methods were later used to validate complex systems in
a variety of domains such as aviation and finance. Progress in computing technology
also led to new challenges in validation. The development of software
systems required new validation techniques and best practices to ensure that the
software operated correctly.

¹¹ A. F. Bielajew, “History of Monte Carlo,” in Monte Carlo Techniques in Radiation Therapy, CRC Press, 2021, pp. 3–15.

In the 1970s, software engineers began formalizing the software development
life cycle into phases that supported rigorous testing and validation. The waterfall
model of software development, introduced in 1970, divided the software
development process into distinct phases including requirements, design, implementation,
testing, and maintenance.¹² In the 1990s, the waterfall model was
refined into the V model, which emphasizes the importance of testing and validation
throughout the software development process.¹³ The V model aligns testing
and validation activities with the corresponding development activities, ensuring
that the system is validated at each stage of development. Figure 1.2 compares
the waterfall and V models of the software development life cycle.

¹² W. W. Royce, “Managing the Development of Large Software Systems: Concepts and Techniques,” IEEE WESCON, 1970.

¹³ K. Forsberg and H. Mooz, “The Relationship of System Engineering to the Project Cycle,” Center for Systems Management, vol. 5333, 1991.

The 20th century also saw the emergence of regulatory bodies to guide the safe

development of new technologies. The Food and Drug Administration (FDA)


was established in the United States in 1906 after a series of food and drug safety
incidents.¹⁴ In 1947, the International Organization for Standardization (ISO)
was founded to develop international standards for products and services.¹⁵ After
a series of midair collisions between aircraft, the Federal Aviation Administration
(FAA) was formed in 1958 to regulate civil aviation in the United States.¹⁶

¹⁴ A. T. Borchers, F. Hagie, C. L. Keen, and M. E. Gershwin, “The History and Contemporary Challenges of the US Food and Drug Administration,” Clinical Therapeutics, vol. 29, no. 1, pp. 1–16, 2007.

¹⁵ C. N. Murphy and J. Yates, The International Organization for Standardization (ISO): Global Governance Through Voluntary Consensus. Routledge, 2009.

¹⁶ J. W. Gelder, “Air Law: The Federal Aviation Act of 1958,” Michigan Law Review, vol. 57, no. 8, pp. 1214–1227, 1959.

As technology matured in the late 20th and early 21st centuries, these regulatory
bodies introduced new standards and requirements. For example, the
Radio Technical Commission for Aeronautics (RTCA) introduced the DO-178
standard in 1982 to provide guidelines for the development of safety-critical
software in aviation. DO-178 has been updated multiple times in the following
years to account for new technological advancements in the field and has been
used frequently by the FAA to certify the safety of aircraft software.¹⁷ In 2011,
ISO 26262 was introduced as an international standard relating to the functional
safety of automotive systems. While ISO 26262 was developed specifically for
electronic/electric systems in road vehicles, many researchers have used it as a
guideline for the development of both hardware and software for autonomous
vehicles.¹⁸

¹⁷ More information on the history of software standards in aviation can be found in L. Rierson, Developing Safety-Critical Software: A Practical Guide for Aviation Software and DO-178C Compliance. CRC Press, 2017.

¹⁸ M. A. Gosavi, B. B. Rhoades, and J. M. Conrad, “Application of Functional Safety in Autonomous Vehicles Using ISO 26262 Standard: A Survey,” in SoutheastCon, 2018.

Starting in the 2010s, artificial intelligence (AI) and machine learning systems
became increasingly prevalent in a variety of applications. For example, AI
systems were introduced into autonomous vehicles, aircraft, medical diagnosis,
and financial trading. The increased capabilities and applications of AI led to
new validation challenges and techniques. Not only are the systems themselves
complex, but they also operate in complex environments, making validation of
these systems particularly challenging. In 2020, the European Union Aviation
Safety Agency (EASA) published initial guidelines related to the design assurance
of neural networks, which are a key component of many machine learning
systems.¹⁹ In that document, they outline a modification of the traditional V
model to account for validation of the learning process. In general, the validation
of AI systems is still an active area of research.

¹⁹ EASA AI Task Force, “Concepts of Design Assurance for Neural Networks,” EASA, 2020.

1.3 Societal Consequences

The validation of decision-making agents is critical in ensuring that these systems


are properly integrated into society. Failures in validation can have severe societal
consequences. This section discusses the impacts of validation on various aspects
of society.

1.3.1 Safety
Validation is necessary for ensuring the safety of systems that interact with the
physical world. Failures of safety-critical systems can result in catastrophic
accidents that cause injury or loss of life. For example, unintended behavior of the
safety-critical software used by the Therac-25 radiation therapy machine caused
radiation overdoses that resulted in death or serious injury to six patients.²⁰ Safety
is also important for transportation systems such as aircraft and cars. In 2002,
a mid-air collision over Überlingen, Germany resulted in 71 fatalities when the
traffic alert and collision avoidance system (TCAS) and air traffic control (ATC)
systems issued conflicting instructions to the pilots.²¹ Furthermore, it is important
to ensure that autonomous vehicles make safe decisions in a wide range of
scenarios to prevent potential accidents. Since their introduction, autonomous vehicles
have been involved in accidents that have resulted in injuries or fatalities.²²

²⁰ N. G. Leveson and C. S. Turner, “An Investigation of the Therac-25 Accidents,” Computer, vol. 26, no. 7, pp. 18–41, 1993.

²¹ J. Kuchar and A. C. Drumm, “The Traffic Alert and Collision Avoidance System,” Lincoln Laboratory Journal, vol. 16, no. 2, p. 277, 2007.

²² R. L. McCarthy, “Autonomous Vehicle Accident Data Analysis: California OL 316 Reports: 2015–2020,” ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part B: Mechanical Engineering, vol. 8, no. 3, p. 034502, 2022.

1.3.2 Fairness

When agents make decisions that affect the lives of large groups of people, we must
ensure that their decisions are fair and unbiased. Validation helps researchers
and organizations identify and correct biases in decision-making systems before
deployment. If these biases are not addressed, they can have serious consequences
for individuals and society as a whole. For example, an automated hiring system
developed by Amazon was ultimately discontinued after it was found to be
biased against women due to biases in the historical data it was trained on.²³ In
another case, a software system designed to predict recidivism rates in criminal
defendants called COMPAS was found to be biased toward certain demographics
based on empirical data.²⁴ Using the outputs of these systems to make decisions
can result in the unfair treatment of individuals. Validating these systems before
deployment can help prevent this type of failure.

²³ A. L. Hunkenschroer and A. Kriebitz, “Is AI Recruiting (Un)ethical? A Human Rights Perspective on the Use of AI for Hiring,” AI and Ethics, vol. 3, no. 1, pp. 199–213, 2023.

²⁴ Other research has argued that the system is fair under a different definition of fairness. A detailed discussion is provided in J. Kleinberg, S. Mullainathan, and M. Raghavan, “Inherent Trade-Offs in the Fair Determination of Risk Scores,” in Innovations in Theoretical Computer Science (ITCS) Conference, 2017.

1.3.3 Public Trust

Public trust in autonomous systems is critical for their widespread adoption, and
validation plays a key role in developing this trust. For example, trust has been

identified as a key factor in the eventual adoption of autonomous vehicles into
society.²⁵ For this reason, autonomous vehicle designers and manufacturers have
invested heavily in validation to ensure that their vehicles are safe and reliable.
The aviation industry is another example of an industry that relies on public
trust. The industry has maintained public trust by upholding a rigorous safety
process that has resulted in a strong safety record. However, failures in validation
can erode public trust. For instance, when the Boeing 737 MAX 8 aircraft was
grounded worldwide after two fatal crashes, public trust in the aviation industry
was significantly impacted.²⁶ Validation also allows us to anticipate possible
ethical dilemmas before deployment.²⁷ Addressing these dilemmas is crucial to
maintaining trust.

²⁵ J. K. Choi and Y. G. Ji, “Investigating the Importance of Trust on Adopting an Autonomous Vehicle,” International Journal of Human-Computer Interaction, vol. 31, no. 10, pp. 692–702, 2015.

²⁶ J. Herkert, J. Borenstein, and K. Miller, “The Boeing 737 MAX: Lessons for Engineering Ethics,” Science and Engineering Ethics, vol. 26, pp. 2957–2974, 2020.

²⁷ An example of an ethical analysis for autonomous vehicles can be found in J. Siegel and G. Pappas, “Morals, Ethics, and the Technology Capabilities and Limitations of Automated and Self-Driving Vehicles,” AI & Society, vol. 38, no. 1, pp. 213–226, 2023.

1.3.4 Economics

Systems that operate expensive equipment or control finances require validation
to decrease the risk of significant economic loss. In 1996, the maiden voyage of the
Ariane 5 rocket ended in an explosion that could ultimately be traced back to a
software bug caused by overflow when converting from a 64-bit to 16-bit value.²⁸
The failure resulted in loss of the rocket and the research satellites it was carrying
to space for a total of $370 million in damages. Furthermore, failures in financial
decision-making systems can affect entire economic systems. For example, the
failure of the Long-Term Capital Management (LTCM) hedge fund in 1998 nearly
caused a global financial crisis and required a $3.6 billion bailout. The fund used
a trading strategy that failed to account for extreme events.²⁹ When these events
occurred, LTCM suffered massive losses.

²⁸ M. Dowson, “The Ariane 5 Software Failure,” Software Engineering Notes, vol. 22, no. 2, p. 84, 1997.

²⁹ P. Jorion, “Risk Management Lessons from Long-Term Capital Management,” European Financial Management, vol. 6, no. 3, pp. 277–300, 2000.


Figure 1.3. Validation algorithms check whether a given system satisfies a specification. The system consists of an agent operating in an environment, which it perceives using a sensor or set of sensors.

1.4 Validation Algorithms

Validation algorithms require two inputs, as shown in figure 1.3. The first input is
the system under test, which we will refer to as the system. The system represents
a decision-making agent operating in an environment. The agent makes decisions
based on information from the environment that it receives from sensors.³⁰ The
second input is a specification, which expresses an operating requirement for the
system. Specifications often pertain to safety, but they may also address other key
design objectives. Given these inputs, validation algorithms output metrics to
help us understand the scenarios in which the system does or does not satisfy
the specification. The rest of this section provides a high-level overview of these
inputs and outputs.

³⁰ Up to this point, we have informally used the term system to refer to only the agent and its sensors. For the remainder of the book, we will also include the operating environment as part of the system.

1.4.1 System
A system (algorithm 1.1) consists of three main components: an environment,
an agent, and a sensor. The environment represents the world in which the agent
operates. We refer to an agent’s configuration within its environment as its state s.
The state space S represents the set of all possible states. An environment consists
of an initial state distribution and a transition model. When the agent takes an
action, the state evolves probabilistically according to the transition model. The
transition model T(s′ | s, a) denotes the probability of transitioning to state s′
from state s when the agent takes action a.


Algorithm 1.1. A system consists of an agent, its operating environment, and the sensor or set of sensors that it uses to perceive its environment.

abstract type Agent end
abstract type Environment end
abstract type Sensor end

struct System
    agent::Agent
    env::Environment
    sensor::Sensor
end

For physical systems, the state often represents an agent’s position and velocity
in the environment, and the transition model is typically governed by the agent’s
equations of motion. Figure 1.4 shows an example of a state for an inverted
pendulum system. The state and transition model may also contain information
about other agents in the environment. For example, the environment for an
aircraft collision avoidance system contains the other aircraft in the airspace that
the agent must avoid. The other agents may also be human agents such as other
drivers or pedestrians in the environment of an autonomous vehicle. The presence
of other agents in the environment often increases our uncertainty in the outcome
of a particular action.

Figure 1.4. The state s of an inverted pendulum system can be compactly represented as its current angle from the vertical θ and its angular velocity ω, so that s = [θ, ω].

In many real-world systems, agents do not have access to their true state within
the environment and instead rely on observations from sensors. We define the
sensor component of a system as a mechanism for sensing information about the
environment. Many real-world systems rely on multiple sensors, so the sensor
component may contain multiple sensing modalities. For example, an autonomous
vehicle senses its position in the world using a combination of sensors such as
global positioning systems (GPS), cameras, and LiDAR. We model the sensor
component using an observation model O(o | s), which represents the probability
of producing observation o in state s. Observations come in multiple forms based
on the sensing modality. For example, GPS sensors output coordinates, while
camera sensors output image data. We call the set of all possible observations for
a system its observation space O .
An agent uses observations to select actions from a set of possible actions
known as the action space A. Agents may use a number of decision-making
algorithms or frameworks to select actions. While some agents select actions
based entirely on the observation, other agents use the observation to first estimate
the state and then select an action based on this estimate. Furthermore, some


Figure 1.5. A system consists of an agent with policy π, an environment governed by transition model T, and a sensor with observation model O. The agent's action a, the environment's state s, and the sensor's observation o flow between these components.

agents may keep track of previous actions and observations internally to improve
their state estimate. For example, an aircraft that only observes its altitude may
keep track of previous altitude measurements to estimate its climb or descent
rate. We abstract these behaviors of the agent using the notion of a policy π,
which is responsible for selecting an action given the current observation and
information the agent has stored previously. An agent’s policy can be stochastic
or deterministic. A stochastic policy samples actions according to a probability
distribution, while a deterministic policy will always produce the same action
given the same information.
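
To make the distinction concrete, the following sketch contrasts the two kinds of policies for the inverted pendulum, where the observation is o = [θ, ω] and the action is a torque. The controller gains and noise level are invented for illustration and are not taken from the book's implementations.

using Distributions

# A deterministic policy: the same observation always yields the same torque.
# Here, a simple proportional controller on the observed angle and angular rate.
deterministic_policy(o; kθ=-10.0, kω=-1.0) = kθ*o[1] + kω*o[2]

# A stochastic policy: the torque is sampled from a Gaussian centered on the
# deterministic controller's output, so repeated calls can return different actions.
stochastic_policy(o; σ=0.5) = rand(Normal(deterministic_policy(o), σ))

o = [0.2, 0.0]
deterministic_policy(o)                      # identical on every call
stochastic_policy(o), stochastic_policy(o)   # two samples, generally different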
The transition model T(s′ | s, a) satisfies the Markov assumption, which requires
that the next state depend only on the current state and action. The state space,
action space, observation space, observation model, and transition model are all
elements of a sequential decision-making framework known as a partially observable
Markov decision process (POMDP).³¹ Figure 1.5 demonstrates how these elements
fit into the components of a system. Appendix A provides implementations of
these components for the example systems discussed in this book.

³¹ M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022.
We analyze the behavior of a system over time by considering the sequence
of states, observations, and actions that the agent experiences. This sequence
is known as a trajectory. We generate trajectories by performing a rollout of the
system (algorithm 1.2). A rollout begins by sampling an initial state from the
initial state distribution associated with the environment. At each time step, the
sensor produces an observation based on the current state, the agent selects an


Algorithm 1.2. A function that performs a rollout of a system sys to a depth d and returns the resulting trajectory τ. It samples an initial state from the initial state distribution associated with the environment. It then repeatedly calls the step function, which steps the system forward in time. The step function takes in the current state s, produces an observation o from the sensor, gets the action a from the agent based on this observation, and determines the next state s′ from the environment.

function step(sys::System, s)
    o = sys.sensor(s)
    a = sys.agent(o)
    s′ = sys.env(s, a)
    return (; o, a, s′)
end

function rollout(sys::System; d)
    s = rand(Ps(sys.env))
    τ = []
    for t in 1:d
        o, a, s′ = step(sys, s)
        push!(τ, (; s, o, a))
        s = s′
    end
    return τ
end

Figure 1.6. Example trajectory of depth d = 4 for the inverted pendulum system. At each time step, the sensor produces a noisy observation of the true state, and the agent tries to keep the pendulum upright by selecting a torque to apply at the base of the pendulum. The four steps shown have states s = [0.2, 0.0], [0.2, −0.2], [0.2, −0.3], [0.2, 0.0]; observations o = [0.3, 0.0], [0.3, −0.1], [0.1, −0.3], [0.2, −0.1]; and actions a = −4.5, −4.1, 1.4, −1.7.

action based on the observation, and the environment transitions to a new state
based on the action. We repeat this process to a desired depth d to generate a
trajectory τ = (s1, o1, a1, . . . , sd, od, ad) where si+1 ∼ T(· | si, ai), oi ∼ O(· | si),
and ai ∼ π(· | oi). Figure 1.6 shows an example trajectory for the inverted
pendulum system.
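
To show how these pieces fit together, the sketch below wires up a toy one-dimensional system and rolls it out using the interfaces from algorithms 1.1 and 1.2 (System, the abstract component types, rollout, and the Ps function that returns the environment's initial state distribution are assumed to be in scope). The dynamics, noise levels, and gain are invented for illustration; appendix A provides the book's actual example systems.

using Distributions

# A toy one-dimensional environment: the state drifts by the action plus noise.
struct DriftEnv <: Environment end
Ps(env::DriftEnv) = Normal(0.0, 1.0)                   # initial state distribution
(env::DriftEnv)(s, a) = s + a + rand(Normal(0.0, 0.1))

# A sensor that observes the state with additive Gaussian noise.
struct NoisySensor <: Sensor end
(sensor::NoisySensor)(s) = s + rand(Normal(0.0, 0.2))

# An agent that pushes the observed state back toward zero.
struct ProportionalAgent <: Agent
    k::Float64
end
(agent::ProportionalAgent)(o) = -agent.k * o

sys = System(ProportionalAgent(0.5), DriftEnv(), NoisySensor())
τ = rollout(sys; d=10)    # a vector of (s, o, a) named tuples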

1.4.2 Specification
A specification ψ is a formal expression of a requirement that the system must
satisfy when deployed in the real world. These requirements may be derived from
domain knowledge or other systems engineering principles. Some industries


have regulatory agencies that govern requirements. These agencies are especially
common in safety-critical industries. For example, the FAA and the FDA in the
United States provide regulations and requirements for aircraft and healthcare
systems, respectively.
We express specifications by translating operating requirements to logical
formulas that can be evaluated on trajectories.³² For example, the specification
for an aircraft collision avoidance system is that the agent should not collide with
other aircraft in the airspace. Given a trajectory, we want to check whether any of
the states in the trajectory represent a collision.

³² Chapter 3 discusses this process in detail.
Algorithm 1.3 defines a general framework for specifications that we will use
throughout this book. Evaluating a specification on a trajectory results in a Boolean
value that indicates whether the specification is satisfied. We consider a trajectory
to be a failure if the specification is not satisfied. Example 1.1 demonstrates this
idea on a simple grid world system. We can also derive higher-level metrics from
specifications such as the probability of failure or the expected cost of failure.

Algorithm 1.3. Definition of a specification. We evaluate specifications on trajectories. We consider a trajectory to be a failure if the specification is not satisfied.

abstract type Specification end

function evaluate(ψ::Specification, τ) end
isfailure(ψ::Specification, τ) = !evaluate(ψ, τ)

Example 1.1. Example trajectories evaluated against a specification for the grid world system. In the grid world example, the agent's goal is to navigate to the green goal state while avoiding the red obstacle state. Therefore, given a trajectory, the specification ψ will be satisfied if the trajectory contains the goal state and does not contain the obstacle state. The green trajectory in the figure satisfies the specification, while the red trajectory represents a failure. Chapter 3 will discuss how to express this specification as a logical formula.
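
As an illustration of this interface, the following sketch defines a hypothetical reach-avoid specification of the kind described in example 1.1: a trajectory satisfies it only if it visits the goal state and never visits the obstacle state. The Specification type and isfailure are assumed to be in scope from algorithm 1.3, and the state representation and cell coordinates are invented for illustration; chapter 3 develops the specification machinery the book actually uses.

struct ReachAvoid <: Specification
    goal       # the goal state (here, a grid cell)
    obstacle   # the obstacle state to avoid
end

# Satisfied only if the trajectory visits the goal and never visits the obstacle.
function evaluate(ψ::ReachAvoid, τ)
    states = [t.s for t in τ]
    return ψ.goal in states && !(ψ.obstacle in states)
end

ψ = ReachAvoid([8, 8], [4, 4])   # hypothetical goal and obstacle cells
τ = [(s=[1, 1], o=[1, 1], a=:right), (s=[4, 4], o=[4, 4], a=:up)]
isfailure(ψ, τ)                  # true: the trajectory hits the obstacle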

1.4.3 Algorithm Outputs


Validation algorithms provide a variety of outputs that help us understand the
behavior of a system. These outputs can be used to make decisions about the
system’s design, requirements, and deployment. Different validation algorithms
are designed to output different metrics. The algorithms presented in this book
support the following categories of analysis:


Figure 1.7. Failure analysis outputs (falsification, failure distribution, and failure probability) for a simple system where failures occur to the left of the dashed red line with likelihood represented by the height of the black curve. The plot on the left shows a set of failure samples that could be identified through falsification. The plot in the middle highlights the shape of the failure distribution, and the shaded region in the plot on the right corresponds to the probability of failure.

• Failure analysis: Common types of failure analysis include falsification, failure


distribution estimation, and failure probability estimation (figure 1.7). Falsifi-
cation involves searching for possible scenarios that result in a failure. Some
falsification algorithms also use a probabilistic model of the system to search
for the most likely failure scenario. Other algorithms use this model to draw
samples from the full distribution over failures or to estimate the probability
of failure. We can use the results of failure analysis to inform future design
decisions. Depending on the type and severity of the failure modes, system
designers may enhance the system’s sensors, change the agent’s policy, revise
the system’s requirements, adapt the training of human operators, or bring
  in other mitigations. Designers may also simply recognize the failure modes
  as limitations and move on or use them as grounds to abandon the project
  altogether. Furthermore, an estimate of the probability of failure can be used to
  make decisions about the system's deployment. For example, the FAA places
  requirements on the probability of failure for aircraft systems before they can
  be deployed in the airspace.³³

³³ These requirements are based on the type and severity of the failure. More information can be found in T. L. Arel, Safety Management System Manual, Air Traffic Organization, Federal Aviation Administration, 2022.

• Formal guarantees: Some algorithms output formal guarantees, or proofs, that


a system satisfies a specification. One common type of formal guarantee is a
reachability guarantee, in which we determine the set of states that a system
could reach over time. The result can be used to prove that a system will never
enter a dangerous state. For example, we could prove that an aircraft collision
avoidance system will never reach a collision state. Formal guarantees are
always based on a set of assumptions such as the set of possible initial states.
If the assumptions are violated, the guarantees may no longer hold.


• Explanations: The ability to explain the behavior of a system helps us build


confidence that it is operating as intended. Explanations can take many forms.
We may want to explain why an agent made a decision at a particular instance
in time or identify the root cause of a failure trajectory we found through falsi-
fication. We can use explanations during design to debug the system, identify
potential failure modes, and suggest possible improvements. Explanations can
also be used to build trust with stakeholders and regulatory bodies.

• Runtime assurances: The validation metrics we compute before deploying a


system are typically based on a set of assumptions about its operating environ-
ment. If the operating environment changes during deployment, these metrics
may no longer be valid. Runtime monitoring algorithms check whether these
assumptions are being violated during operation and provide assurances that
the system is operating as intended. We can use runtime monitoring to detect
when the system deviates from its intended behavior and provide alerts to
operators.

In most real-world settings, we cannot guarantee that a system will behave


as intended using a single validation algorithm or metric. Instead, we use a
combination of these techniques to build a safety case. This idea is inspired by the
Swiss cheese model of accident causation (figure 1.8).³⁴ This model views validation
algorithms as slices of Swiss cheese³⁵ with holes, or limitations, that may cause
us to miss potential failure modes. If we stack enough slices of Swiss cheese
together, the holes in one slice will be covered by the cheese in another slice. By
using a combination of validation algorithms, we increase our chances of catching
potential failure modes before they could occur during operation.

³⁴ J. Reason, “Human Error: Models and Management,” British Medical Journal, vol. 320, no. 7237, pp. 768–770, 2000.

³⁵ Swiss cheese is a type of cheese that is known for having holes in its slices.

1.5 Challenges

Validating that a decision-making agent will behave as intended when deployed in


the real world is a challenging problem. Several factors contribute to this difficulty:

• Complexity of the agent: It can be difficult to predict how a decision-making


agent will behave in all possible scenarios. For example, the autonomy stack of
a self-driving car contains multiple components that interact with one another
in complex ways. This complexity makes it challenging to understand how the
system will react to different inputs such as sensor data, maps, and traffic laws.


Figure 1.8. The Swiss cheese model for safety validation, with layers for failure analysis, formal guarantees, explanations, and runtime assurances. Each layer represents a different validation algorithm. The holes in each layer represent the limitations of the validation algorithm. By stacking the layers together, we prevent potential failures from getting through to deployment.

Furthermore, it is especially difficult to predict the behavior of decision-making


agents that use machine learning models such as neural networks. These
models are often difficult to interpret and can exhibit unexpected behaviors.

• Complexity of the environment: As the capabilities of autonomous agents increase,


they are deployed in increasingly complex environments. For example, self-
driving cars must navigate through environments with pedestrians, traffic
signs, construction, and other vehicles. To validate these agents, we must be
able to properly model this complexity. Another challenge arises when agents
use complex sensors to perceive their environment. For example, for systems
that use camera sensors, we need to understand the set of images the camera
could produce from the environment.

• Cost and safety: Testing systems in the real world is expensive and can lead
to safety issues. For example, testing an aircraft collision avoidance system
involves operating aircraft in close proximity with one another for long periods
of time. For this reason, we often rely on simulation to test systems before
deploying them in the real world. We must be careful to ensure that the simu-
lated system accurately models the real-world system. However, capturing the
full complexity of the real world in simulation can result in simulators that are
computationally expensive to run.

• Edge cases: Systems designed for safety-critical applications tend to behave


safely in the vast majority of scenarios. However, rare edge cases can lead to


catastrophic failures. Because these edge cases occur infrequently, they are
often difficult to identify.

1.6 Overview

This section outlines the remaining chapters of the book, which can be organized
into several categories:

• Problem formulation: Chapters 2 and 3 discuss techniques to formulate validation


problems. Specifically, chapter 2 relates to the system, which is the first input
to validation algorithms. We discuss how to build computational models of
each system component using data and domain knowledge. The accuracy of
the validation process depends on the accuracy of these models. Therefore, we
also discuss techniques to validate the accuracy of these models. Chapter 3
addresses the specification, which is the second input to validation algorithms.
In this chapter, we discuss techniques to translate operating requirements for
systems to formal specifications on their behavior.

• Sampling-based methods: Chapters 4 to 7 discuss methods that use trajectory


samples from a system to analyze its behavior. Since it is often impossible to
sample all possible behaviors of a system, these techniques typically focus on
failure analysis rather than formal guarantees. Chapters 4 and 5 discuss effi-
cient techniques to search for possible failures of a system using optimization
and planning algorithms respectively. Chapter 6 outlines a set of techniques to
draw samples from the full distribution over failures for a system, and chap-
ter 7 discusses efficient techniques to estimate the probability of failure from
samples.

• Formal methods: Chapters 8 to 10 discuss formal methods that provide guaran-


tees on the behavior of a system. These methods can be used to systematically
search for failures of a system or to prove the absence of failures if there are
none. Chapter 8 discusses reachability techniques that compute the set of states
that a system could reach over time. We can use the results of this analysis
to determine whether the system reaches any states that violate the specifica-
tion. Chapter 9 extends these techniques to systems with nonlinear models. In
chapter 10, we perform reachability analysis on discrete systems.


• Runtime monitoring and explainability: Chapters 11 and 12 discuss techniques to


explain a system’s decisions and monitor its behavior. Chapter 11 outlines a set
of methods that can be used to explain the behavior of a system to its operators
and other stakeholders. In chapter 12, we discuss a form of online validation
called runtime monitoring, which checks whether a system is operating as
intended during deployment.

2 System Modeling

Applying validation algorithms directly to real-world systems is often cost
prohibitive, unsafe, and impractical due to the constraints of the physical world.
To address the limitations of real-world testing, we build models of the system
components and perform validation on these models. This chapter begins by
discussing probability distributions, which are useful in modeling system compo-
nents that involve uncertain outcomes. We introduce a number of model classes
and discuss how they can be used to represent system components. We then
discuss methods for selecting the parameters of these model classes based on
observed data or domain knowledge. We also discuss techniques to construct
models of other agents in the environment that may be interacting with our sys-
tem. Because the validity of any analysis that uses these models depends on their
accuracy, it is important to assess whether the model adequately captures the
behavior of the real world system. We conclude by discussing techniques for
validating the performance of a model.

2.1 Model Building

As outlined in section 1.4.1, a system can be described by its environment model


T(s′ | s, a), agent model π(a | o), and observation model O(o | s). Building these
models requires the following three steps:

1. Select a model class. A model class is a set of mathematical models defined by a


set of parameters.

2. Select the parameters for the model class. This process involves selecting the pa-
rameters that best represent the system based on available data or expert
knowledge.

3. Validate the model. Once selected, the model should be validated to ensure that
it accurately represents the system.

In this chapter, we will discuss the different model classes that can be used to
represent the system components and the methods for selecting the parameters
of these models.
There are a variety of challenges when building models. We want to select a
model class that is expressive enough to capture the true system, which requires
capturing all possible scenarios the system may encounter. For example, a model
of an aircraft collision avoidance system must account for all possible pilot and
intruder behaviors. However, complex models can be difficult to use for validation.
Therefore, we want to ensure that we select the simplest model class that can
accurately represent the behavior of the system.¹ Additionally, building models
requires data and expert knowledge, which may require significant effort to
produce. A final challenge is selecting the objective and optimization technique used
to determine the best model parameters. Given these challenges, it is important
that we carefully validate the performance of the final model.

¹ This idea is captured in a quote from British statistician George E. P. Box (1919–2013), which states that “all models are wrong, but some are useful.” G. E. Box, “Science and Statistics,” Journal of the American Statistical Association, vol. 71, no. 356, pp. 791–799, 1976.

2.2 Probability

Many systems have components with multiple possible outcomes and uncertainty
over which outcome will occur. To build mathematical models that account for
this uncertainty, we use the concept of probability.² The probability of a particular
outcome is a number between 0 and 1 that quantifies the likelihood of that outcome
occurring, relative to all possible outcomes. If one outcome is more likely to
occur than another, it has a higher probability.

² A detailed overview of probability theory is provided by E. T. Jaynes, Probability Theory: The Logic of Science. Cambridge University Press, 2003.

2.2.1 Probability Distributions


A probability distribution is a function that assigns probabilities to different out-
comes. Probability distributions are represented differently depending on whether
the outcomes are discrete or continuous. Distributions over discrete outcomes
are represented by probability mass functions. The probability mass function P( x )
for a discrete variable X assigns a probability to each possible value of X. To be a
valid probability mass function, the probabilities across all outcomes must sum

to 1 such that

∑_x P(x) = 1     (2.1)

where 0 ≤ P(x) ≤ 1 for all x. Figure 2.1 shows an example of a probability mass
function for a discrete distribution.

Figure 2.1. A probability mass function P(x) for a distribution over a variable X that can take on a value between 1 and 6.

Many distributions over continuous outcomes are naturally represented using
probability density functions. For many continuous distributions, the probability
that a variable takes on a particular value is infinitesimally small. Therefore,
unlike probability mass functions, probability density functions do not assign
probabilities to individual outcomes. Instead, they assign probabilities to intervals
of possible outcomes. The probability that the value of a continuous variable X
falls between the values a and b is given by the integral of the probability density
function p(x) over that interval:

P(a ≤ x ≤ b) = ∫_a^b p(x) dx     (2.2)

Figure 2.2 shows an example of this process. To be a valid probability density
function, the integral of the probability density function over all possible outcomes
must integrate to 1 such that

∫_{−∞}^{∞} p(x) dx = 1     (2.3)

where p(x) ≥ 0 for all x. The support of a continuous distribution is the set of all
values x for which p(x) > 0.

Figure 2.2. A probability density function for a continuous distribution over a variable x. We can find the probability that x falls between two values a and b by integrating the probability density function over that interval.
Probability distributions are a common type of model class because they are
often represented using probability mass or density functions that are determined Figure 2.2. A probability density
function for a continuous distri-
by a set of parameters θ. For example, the probability density function of a bution over a variable x. We can
common distribution called the Gaussian distribution (also known as the normal find the probability that x falls be-
tween two values a and b by in-
distribution) is parameterized by its mean µ and variance σ2 such that θ = [µ, σ2 ] tegrating the probability density
(example 2.1). For a discrete distribution, the parameters θ typically correspond function over that interval.
to the probability mass associated with each possible outcome. In general, we
will use Pθ ( x ) and pθ ( x ) to denote the probability mass or probability density
function of a distribution with parameters θ.
We can form more complex distributions by mixing together simpler distri-
butions. Distributions formed in this way are known as mixture models. Many
common distributions such as the Gaussian distribution are unimodal, meaning
that they have a single peak. We can represent complex multimodal distributions


Example 2.1. The Gaussian distribution for modeling continuous variables.

One common distribution used to describe continuous variables is the Gaussian distribution (also called the normal distribution) N(µ, σ²). A Gaussian distribution is parameterized by its mean µ and its variance σ². The probability density function for a Gaussian distribution is given by

N(x | µ, σ²) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))    (2.4)

where N(x | µ, σ²) represents the probability density function evaluated at x given a mean µ and variance σ². The mean controls the location of the center of the distribution, while the variance controls the spread of the distribution. The plots below show examples of Gaussian distributions with different means and variances (µ = 0, σ² = 1; µ = 1, σ² = 1; µ = 0, σ² = 2).
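Such parametric distributions are convenient to work with in code. The following is a minimal sketch (with arbitrary parameter values) using the Distributions.jl package, which the code examples in this chapter rely on; it evaluates the density in equation (2.4) and the interval probability in equation (2.2) for a Gaussian distribution.

using Distributions

d = Normal(1.0, sqrt(2.0))    # Gaussian with mean 1 and variance 2
pdf(d, 0.5)                   # density at x = 0.5, equation (2.4)
cdf(d, 2.0) - cdf(d, 0.0)     # P(0 ≤ x ≤ 2), the integral in equation (2.2)
rand(d, 5)                    # five samples from the distribution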


We can form more complex distributions by mixing together simpler distributions. Distributions formed in this way are known as mixture models. Many common distributions such as the Gaussian distribution are unimodal, meaning that they have a single peak. We can represent complex multimodal distributions as mixtures of unimodal distributions. For example, a Gaussian mixture model is a mixture model that represents a distribution as a combination of multiple Gaussian distributions (example 2.2).

Example 2.2. An example of a Gaussian mixture model.

A Gaussian mixture model is a mixture model that is simply a weighted average of various Gaussian distributions. The parameters of a Gaussian mixture model include the parameters of the Gaussian distribution components µ1:n, σ²1:n, as well as their weights ρ1:n. The density is given by

p(x | µ1:n, σ²1:n, ρ1:n) = ∑_{i=1}^n ρi N(x | µi, σi²)    (2.5)

where the weights must sum to 1.

We can create a Gaussian mixture model with components µ1 = 5, σ1 = 2 and µ2 = −5, σ2 = 4, weighted according to ρ1 = 0.6 and ρ2 = 0.4. The plot below shows the density of the two components scaled by their weights along with the resulting mixture density.
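One way to build this mixture in code is with the MixtureModel type from Distributions.jl. This is a small sketch, not code used elsewhere in the chapter; it reproduces the two-component mixture above.

using Distributions

gmm = MixtureModel([Normal(5.0, 2.0), Normal(-5.0, 4.0)], [0.6, 0.4])
pdf(gmm, 0.0)   # mixture density from equation (2.5) evaluated at x = 0
rand(gmm, 3)    # samples drawn by picking a component, then sampling from it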

We can also represent complex distributions as transformations of simpler distributions. Suppose we have a variable Z that is distributed according to a simple, unimodal distribution pz. We can transform Z into a more complex distribution X by applying a transformation f such that X = f(Z). If f is invertible and differentiable, the distribution over X is

px(x) = pz(g(x)) |g′(x)|    (2.6)

where g is the inverse of f. Multiplying the original density by the absolute value of the derivative of g corrects for the stretching or shrinking of the distribution that occurs when transforming the variable. Figure 2.3 transforms a Gaussian distribution into a multimodal distribution.
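As a concrete sketch of equation (2.6), consider the transformation used in figure 2.3, x = f(z) = z^(1/3), applied to a standard Gaussian. Its inverse is g(x) = x³ with derivative g′(x) = 3x², so the transformed density can be evaluated directly (the specific numbers here are only for illustration):

using Distributions

pz = Normal(0.0, 1.0)
px(x) = pdf(pz, x^3) * abs(3x^2)   # equation (2.6) with g(x) = x³
px(0.8)                            # density of the transformed variable at x = 0.8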


Figure 2.3. Transforming a Gaussian distribution pz(z) into a multimodal distribution px(x) by applying an invertible and differentiable transformation (here x = z^(1/3)).

Figure 2.4. Examples of a generative model that transforms samples from calls to a pseudorandom number generator that produces samples uniformly between 0 and 1 to samples from a complex distribution (here x = exp(z) sin(8z)).

Normalizing flows are a class of models that use this idea to transform simple distributions into complex distributions by applying a series of invertible transformations.3

For some problems, we may not be able to produce an analytical form for the probability density function of the distribution. However, we can still generate samples from the distribution by applying transformations to samples from a pseudorandom number generator.4 We refer to models represented in this way as generative models. Generative adversarial networks (GANs) are an example of a generative model that learns to generate samples from complex distributions by transforming samples from a simple distribution into samples that resemble the complex distribution.5

3 A comprehensive introduction to normalizing flows is provided in I. Kobyzev, S. J. Prince, and M. A. Brubaker, “Normalizing Flows: An Introduction and Review of Current Methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 11, pp. 3964–3979, 2020.

4 Pseudorandom number sequences, such as those produced by a sequence of calls to rand, are deterministic given a particular seed but appear random.

5 I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” Advances in Neural Information Processing Systems (NeurIPS), vol. 27, 2014.


2.2.2 Joint Distributions

A joint distribution is a probability distribution over multiple variables. A distribution over a single variable is called a univariate distribution, and a joint distribution over multiple variables is called a multivariate distribution. Joint distributions represent the likelihood of multiple outcomes occurring simultaneously. For example, the joint distribution over two discrete variables X and Y is represented by the probability mass function P(x, y), which outputs the probability that both X = x and Y = y.

We use different strategies to represent joint distributions depending on whether the variables are discrete or continuous. For discrete variables, we can represent the joint distribution as a table such as the one shown in table 2.1. The table assigns a probability to each possible combination of outcomes. These probabilities represent the parameters of the distribution.

Table 2.1. Example of a joint distribution involving binary variables X, Y, and Z. This distribution has 8 parameters θ1, . . . , θ8 that represent the probabilities of each possible combination of outcomes.

X Y Z P(X, Y, Z)
0 0 0 0.08
0 0 1 0.31
0 1 0 0.09
0 1 1 0.37
1 0 0 0.01
1 0 1 0.05
1 1 0 0.02
1 1 1 0.07

We often want to represent joint distributions over many variables with many possible outcomes, which can require a large number of parameters. If we make additional assumptions about the structure of the joint distribution such as independence between variables, we can use other representations such as decision trees or Bayesian networks to reduce the number of parameters required to represent the distribution.6 We can represent continuous joint distributions using multivariable functions. For example, a common distribution used to model uncertainty in multiple continuous variables is the multivariate Gaussian distribution (example 2.3).

6 For more details on representing complex probability distributions, see chapter 2 of M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022. A comprehensive overview is provided by D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

2.2.3 Conditional Distributions

A conditional distribution is a distribution over a variable given the value of one or more other variables. The definition of conditional probability states that

P(y | x) = P(y, x) / P(x)    (2.8)

where P(y | x) is read as ‘‘probability of y given x’’ and represents the probability that the variable Y takes on the value y given that the variable X takes on the value x. The agent, environment, and observation models introduced in section 1.4.1 are all conditional distributions. For example, the transition model T(s′ | s, a) is a conditional distribution over the next state s′ given the current state s and action a.

Example 2.3. The multivariate Gaussian distribution, a common multivariate distribution used to model uncertainty in multiple continuous variables.

The multivariate Gaussian distribution extends the Gaussian distribution over n variables using the following probability density function:

N(x | µ, Σ) = (1 / ((2π)^(n/2) |Σ|^(1/2))) exp(−(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ))    (2.7)

where x is a vector in Rⁿ, µ is the mean vector, and Σ is the covariance matrix. The mean vector µ controls the location of the center of the distribution, while the covariance matrix Σ controls the spread of the distribution. The off-diagonal elements of the covariance matrix control the correlation between the values of each variable. The entries of the mean vector and covariance matrix are parameters that fully describe a multivariate Gaussian distribution (with some conditions on the parameters of the covariance matrix). The plots below show examples of the probability density functions of multivariate Gaussian distributions with different mean vectors and covariance matrices (µ = [0, 0], Σ = [1 0; 0 1]; µ = [0, 5], Σ = [3 0; 0 3]; µ = [3, 3], Σ = [4 2; 2 4]). Brighter contours indicate higher probability density.
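A minimal sketch of this model class in code uses the MvNormal type from Distributions.jl (the parameter values below come from the rightmost panel of the example):

using Distributions

μ = [3.0, 3.0]
Σ = [4.0 2.0; 2.0 4.0]
d = MvNormal(μ, Σ)
pdf(d, [3.0, 3.0])   # density at the mean, equation (2.7)
rand(d, 2)           # two samples, returned as the columns of a 2×2 matrix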


Example 2.4. The conditional Gaussian distribution. The plot below shows the probability density for the conditional Gaussian model p(y | x) = N(x, 10²).

A common model class used to model the uncertainty in one continuous variable conditioned on the value of another continuous variable is the conditional Gaussian distribution. Specifically, we represent the conditional distribution pθ(y | x) as a Gaussian distribution with a mean that depends on the value of x:

pθ(y | x) = N(y | fθ′(x), σ²)

where fθ′ is a function of x with parameters θ′, and the full set of parameters for the model is θ = [θ′, σ²]. We often select fθ′ based on domain knowledge of the physical laws that govern the system. For example, if we know that a sensor produces noisy measurements of the true state, we may set fθ′(x) = x so that the measurements will be centered around the true state. The figure in the caption shows an example of a conditional Gaussian distribution where the mean of the distribution is determined by the function fθ′(x) = x and the variance is 10². Brighter colors indicate higher probability density.

Conditional distributions can be represented using probability mass or density functions. For discrete variables, we can represent a conditional distribution as a table similar to a joint distribution. Table 2.2 provides an example. For continuous variables, we can represent a conditional distribution by defining a probability density function that depends on the conditioning variables. For example, we could represent the conditional distribution p(y | x) as a Gaussian distribution with a mean that depends on the value of x (example 2.4). We can also represent conditional distributions in which some variables are discrete and others are continuous. Sigmoid models, for example, are a common class of models that represent the conditional probability of a binary variable given a continuous variable.7

Table 2.2. An example of a conditional distribution involving the binary variables X, Y, and Z.

X Y Z P(X | Y, Z)
0 0 0 0.08
0 0 1 0.15
0 1 0 0.05
0 1 1 0.10
1 0 0 0.92
1 0 1 0.85
1 1 0 0.95
1 1 1 0.90

7 For more details on representing conditional distributions, see section 2.4 of M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022.

2.3 Parameter Learning

Once we have selected a model class, we need to determine the parameters of the model that best represent the system. We refer to the process of selecting these parameters as parameter learning.8 We can learn parameters using data, expert knowledge, or a combination of both.

8 In the field of machine learning, this process is often referred to as training the model.

This section will focus on two methods to learn parameters from data.9 We will assume that we have a dataset D of m observations o1:m. Each observation is a pair of input and output values such that oi = (xi, yi). Our goal is to learn the parameters θ given the dataset D.10

9 Techniques for learning parameters from expert knowledge are related to the preference elicitation techniques discussed in section 3.3.3. An example is provided by S. M. Katz, A.-C. LeBihan, and M. J. Kochenderfer, “Learning an Urban Air Mobility Encounter Model from Expert Preferences,” in Digital Avionics Systems Conference (DASC), 2019.

10 This section focuses on learning model parameters from data, which is an important component of the field of machine learning. There are other methods not covered in this section such as adversarial training. A broad introduction to the field is provided by several textbooks. C. M. Bishop and H. Bishop, Deep Learning: Foundations and Concepts. Springer Nature, 2023. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer Series in Statistics, 2001. K. P. Murphy, Probabilistic Machine Learning: An Introduction. MIT Press, 2022.

2.3.1 Maximum Likelihood Parameter Learning

In maximum likelihood parameter learning, we search for the parameters of a distribution that maximize the likelihood of observing the data. The maximum likelihood estimate is

θ̂ = arg max_θ P(D | θ)    (2.9)

where P(D | θ) is the likelihood of the data given the parameters θ. There are two challenges associated with maximum likelihood parameter learning. One challenge is to choose the appropriate probability model for P(D | θ). We often assume that the samples in our data D are independently and identically distributed, which means that our samples D = o1:m are drawn from a distribution oi ∼ P(· | θ) with

P(D | θ) = ∏_{i=1}^m P(oi | θ)    (2.10)

where P(oi | θ) = Pθ(yi | xi).

The other challenge is performing the maximization in equation (2.9). A common approach is to maximize the log-likelihood, often denoted as ℓ(θ). Since the log-transformation is monotonically increasing, maximizing the log-likelihood produces an equivalent solution to maximizing the likelihood:11

θ̂ = arg max_θ ∑_i log P(oi | θ)    (2.11)

11 Although it does not matter whether we maximize the natural logarithm (base e) or the common logarithm (base 10) in this equation, throughout this book we will use log(x) to mean the logarithm of x with base e.

Computing the sum of log-likelihoods tends to be more numerically stable compared to computing the product of many small probability masses or densities. Maximizing the log-likelihood forms the basis of many common objective functions used in machine learning. For example, maximizing the log-likelihood of the parameters of a conditional Gaussian distribution leads to the least-squares objective function, which is commonly used in regression problems (example 2.5).

Algorithm 2.1 provides a general algorithm for maximum likelihood parameter learning. We can apply several optimization algorithms to maximize the log-likelihood (see section 4.6). Example 2.6 uses algorithm 2.1 to learn the parameters of a conditional Gaussian observation model for the inverted pendulum

Example 2.5. Derivation of the least-squares objective function by maximizing the log-likelihood of a conditional Gaussian distribution.

Suppose we want to find the parameters θ′ of the conditional Gaussian distribution introduced in example 2.4 by maximizing the log-likelihood of a dataset of m observations. The optimal parameters correspond to the solution of the following optimization problem:

θ̂′ = arg max_{θ′} ∑_{i=1}^m log pθ′(yi | xi)
    = arg max_{θ′} ∑_{i=1}^m log N(yi | fθ′(xi), σ²)
    = arg max_{θ′} ∑_{i=1}^m log [ (1 / √(2πσ²)) exp(−(yi − fθ′(xi))² / (2σ²)) ]
    = arg max_{θ′} ∑_{i=1}^m [ log(1) − log(√(2πσ²)) − (yi − fθ′(xi))² / (2σ²) ]
    = arg max_{θ′} ∑_{i=1}^m −(yi − fθ′(xi))²
    = arg min_{θ′} ∑_{i=1}^m (yi − fθ′(xi))²

The result minimizes the sum of the squared errors between the model outputs and the true outputs and is often referred to as the least-squares objective function. This result also extends to the multivariate case.

Algorithm 2.1. Maximum likelihood parameter estimation algorithm. The algorithm takes a likelihood function, which returns a distribution over the output given the input and parameters. It also takes in an optimization algorithm, which takes in a function and returns a minimum. Given a dataset, the fit function returns the maximum likelihood estimate of the parameters.

struct MaximumLikelihoodParameterEstimation
    likelihood # p(y) = likelihood(x, θ)
    optimizer  # optimization algorithm: θ = optimizer(f)
end

function fit(alg::MaximumLikelihoodParameterEstimation, data)
    f(θ) = sum(-logpdf(alg.likelihood(x, θ), y) for (x,y) in data)
    return alg.optimizer(f)
end


Example 2.6. Learning a conditional Gaussian observation model for the inverted pendulum system. The plot shows the learned observation model with brighter colors indicating higher probability density. The data points are plotted on top of the observation model in pink.

Suppose we have a dataset of states and observations for the inverted pendulum system (shown in the caption). For simplicity, we will assume in this example that the pendulum state only consists of its current angle. We can model the observation model O(o | s) as a conditional Gaussian distribution with a mean that depends on the state of the pendulum such that

O(o | s) = N(o | fθ′(s), σ²)    (2.12)

We will further assume that the observation is a linear function of the state such that fθ′(s) = θ1 s + θ2. We can learn the parameters θ = [θ1, θ2, σ²] using algorithm 2.1 with the following code:

using Optim
likelihood(x, θ) = Normal(θ[1] * x + θ[2], exp(θ[3]))
optimizer(f) = minimizer(optimize(f, zeros(3), Optim.GradientDescent()))
alg = MaximumLikelihoodParameterEstimation(likelihood, optimizer)
θ = fit(alg, data)

The code uses the Optim.jl package to perform gradient descent optimization to learn the parameters θ starting with an initial guess of θ = [0, 0, 0]. The optimized parameters result in the following model:

O(o | s) = N(o | 1.02s + 0.00, 0.05²)

These results indicate that the observation model is centered around the true state of the pendulum with a small amount of noise. As noted in example 2.5, determining θ1 and θ2 in this way is equivalent to optimizing the least-squares objective. The figure in the caption shows the learned observation model behind the samples. Brighter colors indicate higher probability density.


system. Depending on the model class and optimization algorithm, algorithm 2.1
may not find the global minimum. However, for many common model classes,
we can perform this optimization analytically instead. Example 2.7 derives an
analytical solution for the maximum likelihood estimate of the parameters of a
discrete distribution, while examples 2.8 and 2.9 derive analytical solutions for
the parameters of Gaussian and conditional Gaussian distributions.

Example 2.7. Maximum likelihood parameter learning for a binary variable and a variable with k possible values.

Suppose we have a binary variable X that takes on the value 1 with probability θ and the value 0 with probability 1 − θ. The probability of a sequence of m samples with n occurrences of 1 is

P(D | θ) = θⁿ(1 − θ)^(m−n)

The log-likelihood of the parameter θ is

ℓ(θ) = log(θⁿ(1 − θ)^(m−n))
     = n log θ + (m − n) log(1 − θ)

To find the maximum likelihood estimate of θ, we set the derivative of the log-likelihood with respect to θ to zero:

∂ℓ(θ)/∂θ = n/θ − (m − n)/(1 − θ) = 0

Solving for θ results in the maximum likelihood estimate θ̂ = n/m. Computing the maximum likelihood estimate for a variable X that can assume k values results in a similar formula. The maximum likelihood estimate for P(xi | n1:k) is given by

θ̂i = ni / ∑_{j=1}^k nj

where n1:k are the observed counts for the k different values.
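As a small illustration of this count-based estimate (the counts below are made up), the maximum likelihood parameters of a discrete distribution are just the normalized counts:

counts = [8, 11, 9, 12, 10, 10]    # observed counts for a six-valued variable
θ_hat = counts ./ sum(counts)      # θ̂i = ni / ∑j nj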

Algorithm 2.1 requires that we have all of the data required to learn the pa-
rameters. In practice, we may have missing data. For example, when we train a
Gaussian mixture model, we may not know which component of the mixture
generated each data point. Furthermore, when we learn the transition model and
observation models, we may only have access to the observations and actions


Example 2.8. Analytical solution for finding the parameters of a Gaussian distribution using maximum likelihood parameter learning.

In a Gaussian distribution, the log-likelihood of the mean µ and variance σ² with m samples is given by

ℓ(µ, σ²) = −m log(√(2π)) − m log σ − ∑i (oi − µ)² / (2σ²)    (2.13)

We can use the standard technique for finding the maximum of a function by setting the partial derivative of ℓ with respect to each parameter to 0 and solving for the parameter:

∂ℓ(µ, σ²)/∂µ = ∑i (oi − µ̂) / σ̂² = 0    (2.14)
∂ℓ(µ, σ²)/∂σ = −m/σ̂ + ∑i (oi − µ̂)² / σ̂³ = 0    (2.15)

After some algebraic manipulation, we get

µ̂ = (1/m) ∑i oi        σ̂² = (1/m) ∑i (oi − µ̂)²    (2.16)

where µ̂ is the sample mean.
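Equation (2.16) is straightforward to evaluate directly. The snippet below is a minimal sketch on synthetic data (any dataset of samples would do):

using Distributions, Statistics

o = rand(Normal(3.0, 2.0), 1000)            # synthetic samples for illustration
μ_mle = mean(o)                             # sample mean, equation (2.16)
var_mle = sum((o .- μ_mle).^2) / length(o)  # maximum likelihood variance estimate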


Example 2.9. Analytical solution for finding the parameters of a linear Gaussian model using maximum likelihood parameter learning.

Consider a linear Gaussian model p(y | x) = N(y | Ax + b, Σ). Suppose we want to find the maximum likelihood estimate of A and b given a dataset of m observations. As shown in example 2.5, this process is equivalent to minimizing the sum of the squared errors between the model outputs and the true outputs:

arg min_{A,b} ∑_{i=1}^m ‖Axi + b − yi‖²

Let the matrix X be defined such that each row is a data point xi augmented with a one in the final column, and let Y be a matrix such that each row is a data point yi. If we let θ = [A b]ᵀ, we can rewrite the optimization problem as

arg min_θ ‖Xθ − Y‖²

Setting the gradient of the objective function with respect to θ to zero and solving for θ results in the following closed-form solution:

θ̂ = (XᵀX)⁻¹XᵀY

where (XᵀX)⁻¹Xᵀ is often referred to as the pseudoinverse of X.

We can use the following code to analytically solve for θ1 and θ2 in the linear Gaussian model in example 2.6 (the pinv function requires the LinearAlgebra standard library):

using LinearAlgebra
X = hcat(s, ones(length(s)))
θ₁, θ₂ = pinv(X) * o

where the pinv function computes the pseudoinverse of the matrix X. The result is θ1 = 1.02 and θ2 = 0.00, which matches the result from example 2.6.


taken by the agent and not the true state of the environment. In these cases, we can use the expectation-maximization (EM) algorithm to learn the parameters of the model, which involves iterative improvement of the parameter estimate.12

12 An overview of the EM algorithm is provided in section 4.4 of M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022.

2.3.2 Bayesian Parameter Learning

In Bayesian parameter learning, we estimate a distribution over model parameters given the data. We write this distribution as P(θ | D).13 This distribution can help us quantify our uncertainty about the true value of θ. We can convert this distribution into a point estimate by computing the expectation:

θ̂ = E_{θ∼P(·|D)}[θ] = ∑_θ θ P(θ | D)    (2.17)

In some cases, however, the expectation may not be an acceptable estimate, as illustrated in figure 2.5. An alternative is to use the maximum a posteriori estimate:

θ̂ = arg max_θ P(θ | D)    (2.18)

This estimate corresponds to a value of θ that is assigned the greatest density. This is often referred to as the mode of the distribution. As shown in figure 2.5, the mode may not be unique.

Figure 2.5. An example of a distribution where the expected value of θ is not a good estimate. The expected value of 0.5 has a lower density than occurs at the extreme values of 0 or 1.

We can derive an expression for P(θ | D) in terms of the likelihood model introduced in section 2.3.1 using Bayes’ rule:14

P(θ | D) = P(D | θ) P(θ) / ∑_θ P(D | θ) P(θ)    (2.19)

In addition to the likelihood model P(D | θ), we need to specify a prior distribution P(θ) over the parameters. The prior distribution encodes our beliefs about the values of the parameters before observing the data. The output of equation (2.19) is often referred to as the posterior distribution.

13 If θ is continuous, the distribution is represented by a probability density p(θ | D) instead of a probability mass. In this case, the summations in equations (2.17) and (2.19) change to integrals.

14 Bayes’ rule can be derived from the definition of conditional probability and is named for the English statistician and Presbyterian minister Thomas Bayes (c. 1701–1761) who provided a formulation of this theorem. A history is provided by S. B. McGrayne, The Theory That Would Not Die. Yale University Press, 2011.

In general, computing the posterior distribution using equation (2.19) is challenging because the denominator is often difficult or impossible to compute analytically. The number of terms in the summation scales exponentially with the number of parameters, and for continuous parameters, the integral is often intractable. For some model classes and priors, however, an analytical solution is possible. Example 2.10 shows an example of Bayesian parameter learning for a simple model class using a conjugate prior. A conjugate prior is a prior distribution


that, when combined with a likelihood model, results in a posterior distribution that is in the same class as the prior distribution.15

15 Distributions in the natural exponential family have conjugate priors. The natural exponential family includes many common distributions such as the Gaussian, Bernoulli, and Poisson distributions.

If we cannot compute the posterior distribution analytically, we can approximate it with a set of samples using probabilistic programming. Probabilistic programming languages allow us to specify the prior and likelihood models such that we can automatically generate samples from the posterior distribution.16 We will discuss probabilistic programming techniques in more detail in chapter 6. Algorithm 2.2 provides a probabilistic programming implementation of Bayesian parameter learning. It takes in the prior and likelihood models and uses a sampling algorithm from a probabilistic programming package to generate m samples from the posterior distribution.

16 The Turing.jl package, for example, provides a common probabilistic programming interface. H. Ge, K. Xu, and Z. Ghahramani, “Turing: a Language for Flexible Probabilistic Inference,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

Example 2.11 provides an implementation of Bayesian parameter learning for the inverted pendulum system, and figure 2.6 shows the results for different dataset sizes. In general, we decrease our uncertainty about the parameters as we observe more data, so the posterior distribution becomes more concentrated with more data. Bayesian parameter learning also provides a principled way to incorporate prior knowledge into the learning process. The prior distribution can encode expert knowledge about the parameters, which can be particularly useful when we have limited data.

2.3.3 Generalization
An important metric to consider when selecting model parameters is generaliza-
tion performance. The generalization performance of a model is a measure of its
performance over the distribution over its full input space, including points that
were not used to train the model. We measure generalization performance with
respect to a performance metric. A common performance metric is the average
log-likelihood that the model assigns to points in a dataset sampled from the
distribution over the input space. We want to select the parameters with the
best generalization performance. This section discusses techniques for estimating
generalization performance.
It may be tempting to estimate the generalization performance by computing
the performance metric on the training data. However, performing well on the
training data does not necessarily indicate good generalization performance.
Complex models may perform well on the training set, but they may not provide
good predictions at other points in the input space. This concept is often referred


Example 2.10. Bayesian parameter learning for a binary variable using the Beta distribution as a prior. The plot shows the Beta distribution with different datasets.

Suppose we have a binary variable X that takes on the value 1 with probability θ and the value 0 with probability 1 − θ. The likelihood of observing a sequence of m samples with n occurrences of 1 is given by

P(D | θ) = θⁿ(1 − θ)^(m−n)

which corresponds to a Binomial distribution. The Beta distribution is a conjugate prior for the Binomial distribution. Given a prior distribution of p(θ) = Beta(θ | α, β), the posterior distribution is

p(θ | D) = Beta(θ | α + n, β + m − n)

The distribution Beta(1, 1) assigns uniform probability to all possible values of θ between 0 and 1. The plot below shows examples of the Beta distribution with different datasets (the prior and the posteriors for m = 3, n = 2; m = 10, n = 7; m = 20, n = 15; and m = 40, n = 30). As more data is observed, the posterior distribution becomes more concentrated, indicating less uncertainty in the parameter value.
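Because the update in example 2.10 is in closed form, it is easy to carry out in code. The following is a minimal sketch using the Beta type from Distributions.jl with the m = 10, n = 7 dataset from the example:

using Distributions

prior = Beta(1, 1)                              # uniform prior over θ
m, n = 10, 7
posterior = Beta(prior.α + n, prior.β + m - n)  # conjugate update
mean(posterior)                                 # posterior mean of θ, 8/12 ≈ 0.67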


Algorithm 2.2. Bayesian parameter estimation algorithm using the Turing.jl probabilistic programming package. The algorithm takes a likelihood function similar to the one used in algorithm 2.1, a prior distribution over the model parameters, a probabilistic programming sampler from Turing.jl, and the number of samples to generate from the posterior distribution. The fit function creates a probabilistic program that specifies that the parameters are drawn from the prior and the likelihood model generates each data point. The function returns m samples from the posterior distribution.

struct BayesianParameterEstimation
    likelihood # p(y) = likelihood(x, θ)
    prior      # prior distribution
    sampler    # Turing.jl sampler
    m          # number of samples from posterior
end

function fit(alg::BayesianParameterEstimation, data)
    x, y = first.(data), last.(data)
    @model function posterior(x, y)
        θ ~ alg.prior
        for i in eachindex(x)
            y[i] ~ alg.likelihood(x[i], θ)
        end
    end
    return Turing.sample(posterior(x, y), alg.sampler, alg.m)
end
posterior distribution.

Example 2.11. Implementation of Bayesian parameter learning for the inverted pendulum observation model. The results are shown in figure 2.6. A detailed overview of the NUTS algorithm is provided by M. D. Hoffman, A. Gelman, et al., “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research (JMLR), vol. 15, no. 1, pp. 1593–1623, 2014.

Consider the linear Gaussian observation model for the inverted pendulum introduced in example 2.6, and suppose we want to use Bayesian parameter learning to sample a distribution over the parameters θ = [θ1, θ2, σ²]. We can use the following code to generate m = 1000 samples from the posterior distribution over θ using algorithm 2.2:

using Turing, LinearAlgebra
likelihood(x, θ) = Normal(θ[1] * x + θ[2], exp(θ[3]))
prior = MvNormal(zeros(3), 4I)
alg = BayesianParameterEstimation(likelihood, prior, NUTS(), 1000)
θ = fit(alg, data)

The code uses the NUTS, or No U-Turn Sampler, from the Turing.jl package to generate samples from the posterior distribution. Figure 2.6 shows the results for different dataset sizes.


Figure 2.6. Learning the parameters of a linear Gaussian observation model for the inverted pendulum system given different amounts of data from the dataset in example 2.6. The top row shows the data points used for each column (20, 100, and 500 samples), and the remaining three rows show the posterior distribution over the parameters θ1, θ2, and σ for different amounts of data. As more data is observed, the posterior distribution becomes more concentrated around the true parameter values.


to as overfitting (figure 2.7). Therefore, instead of selecting the model parameters that best fit the training data, we should select the model parameters that result in the best performance on a separate dataset that the model has not seen before.

Figure 2.7. Example where a complex model (black line) fits the training data (black) perfectly but does not generalize well to other data points (blue). A simpler linear model (blue line) provides the best fit when considering all data points.

A simple approach to estimating the generalization performance using an unseen dataset is the holdout method, which partitions the available data into a test set and a training set. We use the training set to learn the model parameters and the test set for evaluation. Depending on the size and nature of the dataset, we may use different ratios to split the training and test data ranging from 50 % train and 50 % test to 99 % train and 1 % test. Using too few samples for training can result in poor fits, whereas using too many will result in poor generalization estimates.

Using a train-test partition can be wasteful because our model tuning can take advantage only of a segment of our data. We can often obtain better results using k-fold cross validation.17 To perform this technique, we randomly partition the data into k segments of approximately equal size. We then train k models, one on each subset of k − 1 sets, and we use the withheld set to estimate the generalization performance. The cross-validation estimate of generalization performance is the mean generalization performance over all folds.18

17 This method is also known as rotation estimation.

18 Another common approach related to cross-validation is the bootstrap method, which involves resampling the dataset with replacement to estimate the generalization performance. B. Efron, “Bootstrap Methods: Another Look at the Jackknife,” in Breakthroughs in Statistics: Methodology and Distribution, Springer, 1992, pp. 569–593.
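The sketch below illustrates k-fold cross validation of average held-out log-likelihood. It is not code used elsewhere in this book; it assumes a fitting routine that returns parameters θ and a likelihood function of the same form as in algorithm 2.1, and in practice the data should be shuffled before partitioning.

using Distributions

function cross_validate(fitmodel, likelihood, data, k)
    folds = [data[i:k:end] for i in 1:k]   # simple interleaved partition into k folds
    scores = Float64[]
    for i in 1:k
        test = folds[i]
        train = reduce(vcat, [folds[j] for j in 1:k if j != i])
        θ = fitmodel(train)
        push!(scores, sum(logpdf(likelihood(x, θ), y) for (x, y) in test) / length(test))
    end
    return sum(scores) / k                 # mean generalization estimate over folds
end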
2.4 Agent Models

For some systems, the environment may contain other agents that we need to incorporate into our environment model. Depending on the available data and the assumptions we make about the other agents, we can use different techniques to model their behavior. This section discusses three categories of techniques for modeling other agents.

2.4.1 Imitation Learning


Imitation learning is a technique for learning a policy by observing the behavior of
another agent. A common technique for imitation learning is behavioral cloning.
Given a dataset of state-action pairs from the expert agent, we can use behavioral
cloning to learn a policy that maps states to actions. Behavioral cloning methods
learn a policy πθ ( a | s) by finding a θ that maximizes the likelihood of the actions
taken in the dataset. Once we select a model class to represent the policy, we


can use the techniques from section 2.3.1 to learn the parameters of the policy. Figure 2.8 shows an example of behavioral cloning for a grid world agent.

Figure 2.8. Behavioral cloning of a grid world agent. The behavioral clone is trained on the set of trajectories shown in the left plot (Original Agent). The right plot (Cloned Agent) shows rollouts of the learned policy, which appear similar to the original trajectories.
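Because behavioral cloning is just maximum likelihood estimation of a policy, it can reuse algorithm 2.1 directly. The sketch below assumes a made-up model class (a discrete policy over four actions whose preferences are linear in a two-dimensional state) and a dataset expert_data of state-action pairs; none of these names come from this book.

using Distributions, Optim

softmax(z) = exp.(z) ./ sum(exp.(z))
policy(s, θ) = Categorical(softmax(reshape(θ, 4, 2) * s))   # πθ(a | s)
optimizer(f) = Optim.minimizer(optimize(f, zeros(8), NelderMead()))
alg = MaximumLikelihoodParameterEstimation(policy, optimizer)
θ = fit(alg, expert_data)   # expert_data is a vector of (state, action) pairs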
If the model class used for behavioral cloning is not expressive enough or the dataset contains errors, the learned policy may not be optimal. This result can lead to cascading errors, which occur when small errors compound during a rollout and eventually lead to states that are poorly represented in the training data. The policy of a cloned agent may not generalize well to these states, causing inaccurate behavior. One way to address the problem of cascading errors is to correct the learned policy with additional data. Sequential interactive demonstration methods such as data set aggregation (DAgger)19 alternate between collecting new data in states reached by the trained policy and using this data to improve the policy.

Figure 2.9. Cascading errors in behavioral cloning for a grid world agent. The clone is trained on a trajectory that does not reach any states above the goal state. While the original agent (blue) is able to correctly turn back toward the goal in this region, the clone (purple) continues to move away from the goal.

19 S. Ross, G. J. Gordon, and J. A. Bagnell, “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning,” in International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 15, 2011.

Another common technique for imitation learning is inverse reinforcement learning. In inverse reinforcement learning, we assume that the expert agent is optimizing an unknown reward function when selecting its actions, and our goal is to determine this reward function given a dataset of trajectory rollouts from the expert agent. Common techniques for inverse reinforcement learning select a parametric function form for the reward function and learn the parameters of the reward function according to a particular objective.

One common objective for learning reward function parameters involves maximizing the margin between the reward of the expert agent and the reward of other agents. This technique is known as maximum margin inverse reinforcement learning.20 Another common objective is to maximize the entropy of the distribution over trajectories produced by the learned policy, which is known as maximum entropy inverse reinforcement learning.21 Once we have learned the reward function, we can use it to model the policy of the agent.22

20 P. Abbeel and A. Y. Ng, “Apprenticeship Learning via Inverse Reinforcement Learning,” in International Conference on Machine Learning (ICML), 2004.

21 B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum Entropy Inverse Reinforcement Learning,” in AAAI Conference on Artificial Intelligence (AAAI), 2008.

22 An overview of methods for learning policies from reward functions is provided by M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022.


2.4.2 Behavior Models

If we have a model of the utility of an agent, we can use a behavior model to predict its actions. In particular, suppose we have a utility function U(s, a) that assigns a utility to each state-action pair. We can model the behavior as a policy that selects actions to maximize the utility of the agent. This policy is given by

π(a | s) = arg max_a U(s, a)    (2.20)

The policy in equation (2.20) is known as the best response policy.

Some agents are not perfectly rational optimizers of their utility functions. Humans, for example, do not always select actions that maximize their utility.23 We can model this behavior using a softmax response policy.24 The principle underlying the softmax response model is that agents are more likely to make errors in their optimization that are less costly. Given a precision parameter λ ≥ 0, the softmax response policy is given by

π(a | s) = exp(λU(s, a)) / ∑_{a′} exp(λU(s, a′))    (2.21)

As λ approaches 0, the policy selects actions uniformly at random, and as λ approaches infinity, the policy approaches the best response policy. We can treat λ as a parameter that can be learned from data using, for example, maximum likelihood estimation.

23 Several recent books discuss apparent human irrationality. D. Ariely, Predictably Irrational: The Hidden Forces That Shape Our Decisions. Harper, 2008. J. Lehrer, How We Decide. Houghton Mifflin, 2009.

24 This response is sometimes referred to as a logit response or quantal response.
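The softmax response policy in equation (2.21) is only a few lines of code for a discrete action set. The sketch below assumes a utility function U(s, a) supplied by the modeler; the names are illustrative.

function softmax_response(U, s, actions, λ)
    w = [exp(λ * U(s, a)) for a in actions]
    return w ./ sum(w)   # probability assigned to each action, equation (2.21)
end

As λ → 0 the returned probabilities become uniform, and for large λ they concentrate on the best response action.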

2.4.3 Interaction Models

Agents that operate in the presence of other agents may base their decisions on their belief over the behavior of the other agents.25 For example, consider an aircraft collision avoidance scenario in which an intruder aircraft is approaching at the same altitude as our aircraft. In this scenario, we want our aircraft to take the opposite action of the intruder aircraft (figure 2.10). If the intruder chooses to climb, our best action is to descend, while if the intruder chooses to descend, our best action is to climb.

Figure 2.10. If the intruder aircraft (gray) chooses to descend, the best action for our aircraft (purple) is to climb. In general, the best action for our aircraft is the opposite of the action chosen by the intruder.

25 An overview of many behavioral models is provided in C. F. Camerer, Behavioral Game Theory: Experiments in Strategic Interaction. Princeton University Press, 2003.


We can model this interaction between agents using interaction models.26 For example, a hierarchical interaction model specifies the depth of rationality of an agent by a level of k ≥ 0. A level 0 agent selects its action without regard to the actions of other agents. A level 1 agent selects its action by assuming that all other agents are level 0 agents. In general, a level k agent selects its action by assuming that all other agents are level k − 1 agents. Figure 2.11 shows an example of this model for an aircraft collision avoidance scenario.

Another common behavior model is the hierarchical softmax model, which accounts for the fact that agents may have different levels of rationality.27 A level 0 agent selects actions uniformly at random. A level 1 agent selects actions according to the softmax response policy with a precision parameter λ that assumes that all other agents are level 0 agents. A level k agent selects actions according to a softmax model of the other players playing level k − 1. We can learn the k and λ parameters from data using maximum likelihood estimation.

26 The topic of interaction models is closely related to the field of game theory. Several introductory books include D. Fudenberg and J. Tirole, Game Theory. MIT Press, 1991. Y. Shoham and K. Leyton-Brown, Multiagent Systems: Algorithmic, Game Theoretic, and Logical Foundations. Cambridge University Press, 2009.

27 This approach is sometimes called quantal-level-k or logit-level-k. D. O. Stahl and P. W. Wilson, “Experimental Evidence on Players’ Models of Other Players,” Journal of Economic Behavior & Organization, vol. 25, no. 3, pp. 309–327, 1994.

2.5 Model Validation

Since the validity of any downstream analysis of a system depends on the accuracy
of the models we use, it is important to rigorously validate our models. We can
use a variety of features to validate a model. Given a dataset, we can compare
characteristics of the model distribution to the empirical distribution of the data.
We can also compute features by comparing rollouts of the model to rollouts of the
true system. For example, given a model of aircraft collision avoidance behavior,
we can compare the average miss distance of the aircraft when using the model
to the average miss distance of the true system trajectories. This section discusses
common model validation techniques that compare features of the model to the
true system.

2.5.1 Visual Diagnostics


One important part of the model validation process involves ensuring that the
distribution over model features matches that of the true system. For example, we
may want to confirm that our analytical model of sensor observations matches
a dataset of real sensor measurements. In this case, the distribution from our
model is represented analytically, while the distribution from the true system
is represented as a set of samples. In other cases, both the model distribution


Figure 2.11. A hierarchical interaction model for an aircraft collision avoidance scenario in which the intruder aircraft (gray) is approaching at the same altitude as our aircraft (purple). A level 0 agent selects the best action according to its policy, which is to climb. A level 1 agent assumes that the intruder has the policy of a level 0 agent and selects the best action given this assumption, which is to descend. A level 2 agent assumes that the intruder has the policy of a level 1 agent and chooses to climb. (The three panels show k = 0, k = 1, and k = 2.)


and true distribution are represented as a set of samples. For example, we may want to compare the distribution over airspeed from rollouts of an aircraft encounter model to the distribution over airspeed from trajectories of true aircraft encounters.28

28 M. J. Kochenderfer, M. W. M. Edwards, L. P. Espindle, J. K. Kuchar, and J. D. Griffith, “Airspace Encounter Models for Estimating Collision Risk,” AIAA Journal on Guidance, Control, and Dynamics, vol. 33, no. 2, pp. 487–499, 2010.

One way to compare two feature distributions is to compare their probability density functions. We can plot the probability density function of the model on the same plot as the probability density function of the data. For distributions that are represented as a set of samples, we can plot an approximate probability density by creating a histogram of the samples. We may also compare the cumulative distribution functions of a variable X (P(X ≤ x)) for the model and data. If we do not have an analytical model of the cumulative distribution function, we can plot the empirical cumulative distribution function of the samples. Figure 2.12 compares the empirical cumulative distribution function of two sets of samples.

Figure 2.12. Comparison of the empirical cumulative distribution function for two sets of samples (model and data) represented by the blue and purple dots. The function represents the fraction of samples below each value of x.

Another common visual diagnostic is the quantile-quantile plot (Q-Q plot). The α-quantile of a distribution is the value q for which

P(X ≤ q) = α    (2.22)

A Q-Q plot compares the quantiles of the model distribution to the quantiles of the data. The horizontal axis of a Q-Q plot represents the quantiles of the model distribution, while the vertical axis represents the quantiles of the data. If the model distribution matches the data, the points in the Q-Q plot will lie on the line that passes through the origin with a slope of 1.

We can also compare distributions using calibration plots. The horizontal axis of a calibration plot corresponds to values of α between 0 and 1, while the vertical axis corresponds to the fraction of data points that lie below the α-quantile of the model. Similar to the Q-Q plot, a well-calibrated model will produce a calibration plot that lies on the line that passes through the origin with a slope of 1. Figure 2.13 shows examples of probability density, cumulative distribution, Q-Q, and calibration plots for a set of samples and four different analytical models.
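A calibration plot is simple to compute when the model has an analytical quantile function. The sketch below (with synthetic data and an assumed Normal model) returns, for each value of α, the fraction of data points that fall below the model's α-quantile; a well-calibrated model gives values close to α.

using Distributions

function calibration_curve(model, data, αs=0.1:0.1:0.9)
    return [count(x -> x <= quantile(model, α), data) / length(data) for α in αs]
end

data = randn(1000)                     # stand-in for samples from the true system
calibration_curve(Normal(0, 1), data)  # should be close to 0.1, 0.2, ..., 0.9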

2.5.2 Summary Metrics


It is often desirable to summarize the differences between the modeled and true
distribution using a single quantity. For example, we may want to compare the
probability density function of the model to the data using a single number. One
common quantity used to compare two densities is the Kullback-Leibler divergence


Figure 2.13. Visual diagnostics that compare a set of samples (gray) to four possible models (blue). Each row shows a different model, and the columns show the probability density function (PDF), cumulative distribution function (CDF), Q-Q, and calibration plots. As shown in the plots, the model in the top row fits the data better than the models in the remaining rows.


(K-L divergence).29 The K-L divergence between two densities p(x) and q(x) is defined as

DKL(p ‖ q) = ∫ p(x) log(p(x) / q(x)) dx    (2.23)

The K-L divergence is 0 if p(x) = q(x) for all x and greater than 0 otherwise. If the model and data distributions are represented as samples, we can estimate the K-L divergence using densities estimated from their histograms.30

The K-L divergence is part of a broad class of divergences used in information theory and statistics called the F-divergences.31 Another common divergence measure in this class is the Jensen-Shannon divergence,32 which is a symmetric version of the K-L divergence. The F-divergences do not necessarily have all of the properties of a distance metric. For example, the K-L divergence is not symmetric, meaning that DKL(p ‖ q) ≠ DKL(q ‖ p). Furthermore, the K-L divergence is not defined if the support of p(x) is not a subset of the support of q(x). We can use the Wasserstein distance to compare two densities with different supports.33

We can summarize the cumulative distribution plot by calculating the maximum distance between the cumulative distribution functions of the two distributions. This distance is called the Kolmogorov-Smirnov statistic (K-S statistic) and is defined as

DKS = max_x |P(X ≤ x) − Q(X ≤ x)|    (2.24)

where P(X ≤ x) and Q(X ≤ x) are the cumulative distributions of the model and data.34 We can compute a similar metric from the calibration plot by computing the maximum distance between the points on the calibration plot and the line passing through the origin with a slope of 1. This quantity is referred to as the maximum calibration error (MCE). It is also common to compute the expected calibration error (ECE) by averaging the distances. Figure 2.14 illustrates these metrics for the two sets of samples shown in figure 2.12.

29 The K-L divergence is named after American mathematicians Solomon Kullback (1907–1994) and Richard Leibler (1914–2003), who introduced the concept. S. Kullback and R. A. Leibler, “On Information and Sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.

30 Another approach is to fit a kernel density estimate to the samples and then compute the K-L divergence between the two kernel density estimates.

31 A detailed overview is provided by F. Liese and I. Vajda, “On Divergences and Informations in Statistics and Information Theory,” IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394–4412, 2006.

32 Named for Danish mathematician Johan Jensen (1859–1925) and American mathematician Claude Shannon (1916–2001).

33 The Wasserstein distance is named after Russian-American mathematician Leonid Vaserstein (1944–pres.). It is also known as the earth mover’s distance because it can be interpreted as the amount of work required to transform a pile of earth representing one distribution to a pile of earth representing the other.

34 This statistic is often used in a K-S test to check whether two sets of samples are drawn from the same distribution. It is named after Soviet mathematicians Andrey Kolmogorov (1903–1987) and Nikolai Smirnov (1900–1966).
distribution. It is named after So-
viet mathematicians Andrey Kol-
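As a concrete illustration of these metrics, the following is a minimal sketch (not from the text) of how one might estimate the K-L divergence from histogram density estimates and the K-S statistic from empirical cumulative distributions, given samples from the data and from a candidate model. The bin edges and sample arrays are assumptions for illustration.

using Statistics

# Histogram-based density estimate over shared bin edges
function histogram_density(samples, edges)
    counts = zeros(length(edges) - 1)
    for x in samples
        i = searchsortedlast(edges, x)
        if 1 <= i <= length(counts)
            counts[i] += 1
        end
    end
    widths = diff(edges)
    return counts ./ (sum(counts) .* widths)   # approximate density in each bin
end

# K-L divergence D(p ‖ q) estimated from binned densities
kl_divergence(p, q, widths; ϵ=1e-10) = sum(p .* log.((p .+ ϵ) ./ (q .+ ϵ)) .* widths)

# K-S statistic: maximum distance between the two empirical CDFs
function ks_statistic(x, y)
    cdf(t, s) = count(v -> v ≤ t, s) / length(s)
    return maximum(abs(cdf(t, x) - cdf(t, y)) for t in sort(vcat(x, y)))
end

data  = randn(1000)           # samples from the true system (assumed)
model = 0.5 .+ randn(1000)    # samples from a candidate model (assumed)
edges = range(-5, 5; length=41)
p = histogram_density(data, edges)
q = histogram_density(model, edges)
println("KL ≈ ", kl_divergence(p, q, diff(collect(edges))))
println("KS ≈ ", ks_statistic(data, model))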
2.5.3 Comparing Multiple Features
We often want to compare multiple features of a model to the true system. One
option is to compare one feature at a time using the techniques in sections 2.5.1
and 2.5.2. However, it is also important to ensure that the model accurately
captures the relationships between features of the true system, and checking one
feature at a time may cause us to miss these relationships. Figure 2.15 shows
an example in which performing visual diagnostics on the individual features


Figure 2.14. Illustration of the K-S statistic, MCE, and ECE for two sets of samples (empirical CDF and calibration plots for the model and data). The black arrows indicate the quantity of interest for each plot. The ECE is computed by averaging the lengths of the arrows on the plot.

produces misleading results. The model distribution does not match the data points in two-dimensional space, but the individual feature distributions match the data.

Figure 2.15. Example in which performing visual diagnostics on individual features can produce misleading results. The model distribution (blue contours indicating points of equal density) does not match the data points (gray) in two-dimensional space. However, the individual feature distributions of the model match the individual feature distributions of the data.

Several techniques allow us to check whether a model accurately captures the relationships between features of the true system. One technique involves creating a single feature that captures the relationships between the features of the true system. We can then use techniques discussed in sections 2.5.1 and 2.5.2 to compare the single feature of the model to the single feature of the true system. Figure 2.16 shows an example of creating a single feature that models the relationship between the features in figure 2.15. One drawback of this approach is that it requires domain knowledge to create the single feature.

Figure 2.16. Creating a single feature that captures the relationships between the features in figure 2.15 by projecting the model and data onto the line y = x (panels: original data and projected distribution). This feature allows us to identify a mismatch between the model (blue) and the data (gray), as shown in the rightmost plot. The pink contours show an alternative model that better matches the data.

Another way to compare multiple features is to extend the metrics discussed in section 2.5.2 to multivariate distributions. Many of the comparison metrics for probability density functions have straightforward extensions. The K-L divergence, for example, can be extended to multivariate distributions by using the probability density of the joint distribution of the model and data for the variables of interest (figure 2.17). In contrast, the visual diagnostics and metrics that use the cumulative distribution or quantile functions are less straightforward to extend because the quantile function is not defined in multiple dimensions. Therefore, these metrics require extensions of the quantile function to higher dimensions.35

35. Multiple definitions of the quantile function in higher dimensions have been proposed. P. Chaudhuri, "On a Geometric Notion of Quantiles for Multivariate Data," Journal of the American Statistical Association, vol. 91, no. 434, pp. 862–872, 1996.

Figure 2.17. Comparison of the features of two possible models to a set of data sampled from the true system using K-L divergence. If we calculate the K-L divergence of each feature separately, the models appear to match the data equally well. However, if we calculate the K-L divergence of the joint distribution of the features, we see that the pink model matches the data better than the blue model.

We may also want to compare the distribution over a feature conditioned on the value of another feature. For example, we may want to compare the distribution over sensor observations conditioned on the true state. One way to make this
comparison is to partition the conditioning variable into a set of bins and then compare the distribution over the feature in each bin using the metrics described in this section. Figure 2.18 shows an example of this technique. It is also possible to create a single calibration plot for the conditional distribution by checking, for each x-value, how often the corresponding y-value is below the α-quantile of the model distribution.
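As a rough illustration of this binning approach (a sketch with hypothetical conditional models, not code from the text), the snippet below partitions paired (x, y) samples from the data and from a model into bins over x and compares the y values in each bin with a K-S statistic. The samplers and bin edges are assumptions.

using Statistics

# Empirical K-S statistic between two sample sets
function ks_statistic(x, y)
    cdf(t, s) = count(v -> v ≤ t, s) / length(s)
    return maximum(abs(cdf(t, x) - cdf(t, y)) for t in sort(vcat(x, y)))
end

# Hypothetical conditional samplers: y given x for the true system and a model
sample_data(x)  = 2x + 0.5 * randn()
sample_model(x) = 2x + 0.5 * (1 + x) * randn()   # variance mismatch grows with x

xs_data  = 2 .* rand(2000);  ys_data  = sample_data.(xs_data)
xs_model = 2 .* rand(2000);  ys_model = sample_model.(xs_model)

edges = 0.0:0.5:2.0
for i in 1:length(edges)-1
    lo, hi = edges[i], edges[i+1]
    yd = ys_data[(xs_data .>= lo) .& (xs_data .< hi)]
    ym = ys_model[(xs_model .>= lo) .& (xs_model .< hi)]
    println("x ∈ [$lo, $hi): KS = ", round(ks_statistic(yd, ym), digits=3))
end

Consistent with figure 2.18, a per-bin statistic like this grows as the conditional mismatch increases.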
2.5.4 Subjective Evaluation

We can also evaluate models based on expert knowledge. One evaluation metric is the ability of an expert to distinguish between samples produced by the model and samples produced by the true system.36 A model represents the true system well if an expert cannot distinguish between the generated samples and the true samples. This idea is similar to the Turing test, which was proposed as a way to test whether a machine has human intelligence.37

36. R. Bhattacharyya, S. Jung, L. Kruse, R. Senanayake, and M. J. Kochenderfer, "A Hybrid Rule-Based and Data-Driven Approach to Driver Modeling Through Particle Filtering," IEEE Transactions on Intelligent Transportation Systems, no. 2108.12820, 2021.

37. This test was first proposed by English mathematician and computer scientist Alan Turing (1912–1954) in a 1950 essay. He originally called the test the imitation game, in which a human judge interacts with a machine and a human and must determine which is which. A. M. Turing, "Computing Machinery and Intelligence," Mind, vol. 59, pp. 433–460, 1950.


Figure 2.18. Analysis of samples from the conditional distribution p(y | x) for the model (blue) and true system (gray). The top plot shows samples from each distribution for different values of x. The background colors each represent a bin containing a range of x values. We can analyze the performance of the conditional model in each bin by comparing the distributions of y values using the visual diagnostics in section 2.5.1. The plots in the remaining rows show the visual diagnostics for each bin. As we can see, the model mismatch increases for larger values of x.

Figure 2.19. Validating a model of a grid world agent using expert knowledge. Each column represents a pair of rollouts from the true system and the model in a random order. The expert's selection is highlighted in green if they correctly identified the true system and in red if they incorrectly identified the model. The top row shows an example where the model does not represent the true system well (poor model representation, expert accuracy 100 %, test failed), and the expert is able to determine the true model in each pair. The bottom row shows an example where the model represents the true system well (good model representation, expert accuracy 50 %, test passed), and the expert is unable to distinguish between the true system and the model.


We can evaluate this metric by showing an expert pairs consisting of one set
of rollouts from the true system and one set of rollouts from the model. We can
then ask the expert to identify which set of rollouts was produced by the true
system. We can quantify the performance of the model by measuring the expert’s
accuracy in distinguishing between the two sets. If the expert’s accuracy is around
50 %, their performance is no better than random guessing, and we can conclude
that the model is a good representation of the true system. Figure 2.19 shows an
example of this test for a model of a grid world agent.
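The pass/fail decision in figure 2.19 is easy to script. The following sketch (an illustration with hypothetical query data, not the authors' implementation) scores the expert's responses over a set of pairs and flags the model as suspect only when the accuracy is well above the 50 % expected from random guessing; the 0.7 threshold is an arbitrary choice for illustration.

# Each query shows the expert a pair (A, B); `truth` records which element was the
# true system and `answer` records the expert's guess (hypothetical data).
truth  = [:A, :B, :B, :A, :B, :A, :A, :B]
answer = [:A, :A, :B, :B, :A, :B, :B, :A]

accuracy = count(truth .== answer) / length(truth)

# Near 50 % accuracy means the expert cannot tell model rollouts from true rollouts.
threshold = 0.7   # assumed cutoff, not a value from the text
println("expert accuracy = $(round(100 * accuracy, digits=1)) %")
println(accuracy ≥ threshold ? "test failed: model distinguishable" :
                               "test passed: model indistinguishable")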

2.5.5 Sensitivity Analysis


No matter what parameter learning scheme we use, the resulting model is unlikely to be a perfect representation of the behavior of our system. Because there will always be inherent uncertainty in the parameters that we select, we often want to understand how sensitive our downstream analysis might be to the particular choice of model parameters. We can make small perturbations to the parameters and check how much the resulting analysis changes. For example, we might have a collision avoidance system whose safety is influenced by the pilot response time to its resolution advisories.38 If the probability of a mid-air collision is highly sensitive to particular parameter settings for the pilot response model, then it would suggest that additional analysis and data collection may be merited. If the probability of a mid-air collision is not highly sensitive, we can be more justified in proceeding with the learned parameters. Sensitivity will be discussed in more detail in section 11.3.1. A notional example is illustrated in figure 2.20.

38. J. P. Chryssanthacopoulos and M. J. Kochenderfer, "Collision Avoidance System Optimization with Probabilistic Pilot Response Models," in American Control Conference (ACC), 2011.

Figure 2.20. Sensitivity analysis of a model parameter θ with respect to a downstream quantity of interest f(θ). If our learned parameter is θ1, we may be more confident in our downstream quantity compared to θ2, where there is much greater sensitivity. A small perturbation to θ1 is unlikely to change f(θ), but a small perturbation to θ2 might.
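One simple way to carry out this kind of perturbation study is sketched below; the downstream quantity f(θ) here is a made-up stand-in for something like a collision probability as a function of a pilot response parameter, and the perturbation size is an assumption.

# Hypothetical downstream analysis as a function of a model parameter θ
f(θ) = 1 / (1 + exp(-5 * (θ - 2)))

# Estimate local sensitivity with a central finite difference.
sensitivity(f, θ; δ=1e-3) = (f(θ + δ) - f(θ - δ)) / (2δ)

θ₁, θ₂ = 0.5, 2.0
println("sensitivity at θ₁ = ", sensitivity(f, θ₁))  # small: analysis is robust to θ₁
println("sensitivity at θ₂ = ", sensitivity(f, θ₂))  # large: merits more analysis and data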
2.6 Summary

• To accurately model a system, we need to build models of the agent, environment, and sensor.

• The general process for creating a model involves selecting a model class, learning the parameters of the model, and validating the model.

• Probability distributions are a common type of model class that assigns probabilities to different outcomes.

• We can learn the parameters of a model from data using maximum likelihood estimation or Bayesian estimation.


• For systems that operate in environments with other agents, it is important to incorporate models of these agents into the environment model.

• We can validate models using test data or expert knowledge.

3 Property Specification

In the previous chapter, we focused on creating an accurate model of the sys-


tem. The final step in defining a validation problem is to formalize the operating
requirements of the system as a specification, which is a precise mathematical
expression that defines the objectives of a system. Specifications are often de-
rived from metrics, which map the performance of a system to a real number.
We begin by discussing common metrics used to measure the performance of
stochastic systems. We also discuss how to create composite metrics that capture
trade-offs between different performance objectives. We then show how to write
specifications as logical formulas using propositional logic, first-order logic, and
temporal logic. Finally, we discuss a special case of a temporal specification called
a reachability specification and show how to convert temporal logic specifications
into reachability specifications.

3.1 Properties of Systems

We describe the behavior of a system using metrics and specifications. A metric is a


function that maps system behavior to a real number. For example, a common
metric used to evaluate aircraft collision avoidance systems is the miss distance
between two aircraft. A specification is a function that maps system behavior
to a Boolean value. Therefore, specifications are always either true or false. For
example, a specification for the grid world system might be to reach the goal
without hitting an obstacle.
Sometimes specifications can be derived from metrics. For example, given a
metric that measures the probability of collision for an aircraft collision avoidance
system, we can create a specification that requires the probability of collision to
be less than a certain threshold. We can also derive metrics from specifications.

Using the grid world specification, we could define a metric that measures the
distance between the agent and the goal or obstacle.
We use metrics or specifications to evaluate individual trajectories, sets of trajectories, or probability distributions over trajectories. The miss distance between two aircraft can be used to measure the performance of an aircraft collision avoidance system in a single encounter scenario (figure 3.1), and the net return can be used to measure the performance of one outcome of a financial trading strategy over time. We can also create metrics or specifications that operate over a set of trajectories. For example, we can compute the average miss distance or net gain over a set of possible trajectories or specify a threshold on the number of trajectories that result in a collision. The remainder of this chapter discusses techniques to formally express metrics and specifications.

Figure 3.1. Example of a metric for an aircraft collision avoidance system over an individual trajectory (top, miss distance) and over a set of trajectories (bottom, average miss distance).

3.2 Metrics for Stochastic Systems

For stochastic systems, we often compute metrics over the full distribution of trajectories. Given a function f(τ) that maps an individual trajectory τ to a real-valued metric, we are interested in summarizing the distribution over the output of f(τ) (figure 3.2). The remainder of this section outlines several metrics used to summarize distributions.
3.2.1 Expected Value
A common metric used to summarize a distribution is its expected value. The expected value represents the average output of a function given a distribution over its inputs. It is defined as

Eτ∼p(·)[f(τ)] = ∫ f(τ) p(τ) dτ     (3.1)

where p(τ) is the probability distribution over trajectories. While it is not always possible to evaluate the expected value analytically, we can estimate it using a variety of techniques such as the ones discussed in chapter 7.

Figure 3.2. Distribution over the miss distance metric for an aircraft collision avoidance system. We can summarize this distribution with another metric such as the expected value of the miss distance.

Figure 3.3. Three distributions over the miss distance metric for an aircraft collision avoidance system. While all distributions have the same mean, they have different variances (decreasing from left to right). The distribution with the lowest variance is most likely to operate safely.

The expected value of a binary metric represents a probability. For instance, consider a binary metric f(τ) that evaluates to 1 if the agent hits an obstacle and 0 otherwise. The expected value of this metric is the probability that the agent hits an obstacle. In general, the expected value of a binary metric derived from a

specification is the probability that a randomly sampled trajectory will satisfy the
specification. We could also derive a high-level specification from this probability
by requiring that the probability of satisfying the specification is greater than a
certain threshold.1

1. H. Hansson and B. Jonsson, "A Logic for Reasoning about Time and Reliability," Formal Aspects of Computing, vol. 6, pp. 512–535, 1994.
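As a concrete illustration of estimating such an expected value by direct sampling (a sketch with a toy trajectory model, not the estimation algorithms of chapter 7), we can roll out trajectories, apply a binary metric, and average.

using Statistics, Random
Random.seed!(0)

# Toy stochastic trajectory model: a 1D random walk (assumed for illustration)
rollout(T=20) = cumsum(randn(T))

# Binary metric: 1 if the trajectory ever exceeds a threshold ("hits an obstacle")
f(τ) = any(τ .> 4.0) ? 1.0 : 0.0

# Monte Carlo estimate of E[f(τ)], i.e., the probability of hitting the obstacle
m = 10_000
p_hat = mean(f(rollout()) for _ in 1:m)
se = sqrt(p_hat * (1 - p_hat) / m)   # standard error of the estimate
println("P(failure) ≈ $p_hat ± $(round(se, sigdigits=2))")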
3.2.2 Variance
Another common summary metric is the variance, which measures the spread of
the distribution. The variance of a metric f (τ ) is defined as

Varτ ∼ p(·) [ f (τ )] = Eτ ∼ p(·) [( f (τ ) − Eτ ∼ p(·) [ f (τ )])2 ] (3.2)

Intuitively, the variance measures how much the metric f (τ ) deviates from its
expected value. A low variance indicates that the metric tends to be consistent
across different trajectories, while a high variance indicates that the metric varies
significantly. It is important to consider both the expected value and variance of a
metric when evaluating system performance (figure 3.3).

3.2.3 Value at Risk


When we are concerned with safety, we may want to use more conservative
metrics that focus on worst-case outcomes. One such metric is the value at risk
(VaR). Suppose we have a metric f (τ ) for individual trajectories in which higher
values indicate worse outcomes. This type of metric is often referred to as a risk
metric. The VaR is the highest risk value that f (τ ) is guaranteed not to exceed
with probability α, which corresponds to the α-quantile of the distribution. For a
particular value of α, a higher VaR indicates a more risky system.


Figure 3.4. Effect of α on VaR and CVaR (shown for α = 0.9, 0.7, 0.5, 0.3, and 0.1, with the expected value, VaR, and CVaR marked on each distribution). Higher values for α correspond to more conservative risk estimates.

3.2.4 Conditional Value at Risk


Another common metric derived from VaR is the conditional value at risk (CVaR),2 which is the expected value of the metric f(τ) given that it exceeds the VaR:

CVaRα[f(τ)] = Eτ∼p(·)[f(τ) | f(τ) ≥ VaRα[f(τ)]]     (3.3)

In other words, CVaR is the expected value of the (1 − α)-fraction of worst-case outcomes. A higher CVaR indicates that the system is more likely to perform poorly in the worst-case scenarios. Example 3.1 shows the VaR and CVaR of a risk metric for an aircraft collision avoidance system. Higher values of α push the VaR closer to the worst-case outcome and correspond to more conservative risk estimates. As α approaches 1, the CVaR approaches the risk of the worst-case outcome. As α approaches 0, the CVaR approaches the expected value of the risk metric (figure 3.4).

2. The conditional value at risk is also known as the mean excess loss, mean shortfall, and tail value at risk. R. T. Rockafellar and S. Uryasev, "Optimization of Conditional Value-at-Risk," Journal of Risk, vol. 2, pp. 21–42, 2000. It is also a kind of coherent risk measure, which means that it satisfies some additional mathematical properties. Another coherent risk measure, not discussed here, is the entropic value at risk. A. Ahmadi-Javid, "Entropic Value-At-Risk: A New Coherent Risk Measure," Journal of Optimization Theory and Applications, vol. 155, no. 3, pp. 1105–1123, 2011.
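VaR and CVaR are straightforward to estimate from samples of a risk metric. The sketch below (with assumed sample data, not taken from the text) uses the empirical α-quantile for VaR and averages the samples at or above it for CVaR.

using Statistics

# Empirical VaR: the α-quantile of the risk samples
VaR(samples, α) = quantile(samples, α)

# Empirical CVaR: mean of the samples at or above the VaR
function CVaR(samples, α)
    v = VaR(samples, α)
    return mean(samples[samples .>= v])
end

# Hypothetical loss-of-separation samples (higher is worse); toy distribution
risk = 2000 .- abs.(500 .* randn(10_000))
for α in (0.5, 0.9, 0.99)
    println("α = $α: VaR = $(round(VaR(risk, α), digits=1)), ",
            "CVaR = $(round(CVaR(risk, α), digits=1))")
end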

3.3 Composite Metrics

In many real-world settings, we must select one of several system designs or strategies for final deployment, and metrics allow us to make an informed decision. For example, we might compare the performance of two aircraft collision avoidance systems by computing the probability of collision over a set of aircraft encounters for each system. In these cases, we are often concerned with multiple metrics. For example, an aircraft collision avoidance system should minimize collisions while issuing a small number of alerts to pilots, and a financial trading strategy may aim to maximize return while minimizing risk.

Figure 3.5. Tradeoff between the alert rate and collision rate for an aircraft collision avoidance system. Each point represents a different system design.

It is often the case that multiple metrics describing system performance are at odds with one another, and some system designs may perform well on one metric but poorly on another. For instance, an aircraft collision avoidance system that

Example 3.1. VaR and CVaR for the loss of separation metric for an aircraft collision avoidance system. Suppose a desired separation for the aircraft in an aircraft collision avoidance environment is 2,000 m. We can define a risk metric f(τ) to summarize the loss of separation as 2,000 m minus the miss distance. A higher loss of separation indicates higher risk. The plots below show the expected value, VaR, and CVaR for the loss of separation metric for three different distributions over outcomes.

Although all three distributions have the same expected value, the VaR and
CVaR decrease as we move from left to right. The distribution with the lowest
VaR and CVaR is the least risky because it has better worst-case outcomes.


minimizes the number of collisions may also increase the number of alerts issued to pilots, while one that minimizes alerts may increase the number of collisions (figure 3.5). In such cases, we can combine multiple metrics into a single composite metric that captures the trade-offs between different objectives.

We can compare systems with multiple metrics using the concept of Pareto optimality. A system design is Pareto optimal3 if we cannot improve one metric without worsening another. Given a set of system designs, the Pareto frontier consists of the subset of designs that are Pareto optimal. The Pareto frontier illustrates the trade-offs between metrics. Figure 3.6 shows the Pareto frontier for the aircraft collision avoidance systems shown in figure 3.5. Composite metrics allow system designers to select a single point on the Pareto frontier.

3. Pareto optimality is a topic that was originally explored in the field of economics. It is named after Italian economist Vilfredo Pareto (1848–1923).

Figure 3.6. Pareto frontier for a set of aircraft collision avoidance system designs. The points that comprise the Pareto frontier are highlighted in blue.

3.3.1 Weighted Metrics

Weighted metrics combine multiple metrics using a vector of weights that reflect the relative importance of each metric. Suppose we have a set of metrics f1(τ), f2(τ), . . . , fn(τ) that we wish to combine into a single metric. The most basic weighted metric is the weighted sum, which is defined as

f(τ) = ∑ᵢ₌₁ⁿ wᵢ fᵢ(τ) = w⊤f(τ)     (3.4)

where w = [w1, . . . , wn] is a vector of weights and f(τ) = [f1(τ), . . . , fn(τ)] is a vector of metrics. The weighted sum allows us to balance the trade-offs between different metrics by adjusting the weights, and each set of weights will correspond to a point or set of points on the Pareto frontier.

3.3.2 Goal Distance Metrics

Another way to combine metrics is to compute the Lp norm4 between f(τ) and a goal point:

f(τ) = ‖f(τ) − f_goal‖p     (3.5)

where f_goal is typically selected to be the utopia point. The utopia point is the point in metric space that represents the best possible outcome for each metric. While the utopia point is often unattainable, it provides a reference point for comparing different system designs. Figure 3.7 shows an example of the goal metric for the aircraft collision avoidance problem.

4. An overview of the Lp norm operator is provided in appendix B.

Figure 3.7. Composite metric for an aircraft collision avoidance system using the L2 norm between the point and the goal point (blue star). The goal point is the utopia point of no alerts and no collisions. The color of each point represents the value of the composite metric with the selected point highlighted in green.


Example 3.2. Using the weighted sum composite metric to select an aircraft collision avoidance system design along the Pareto frontier. Suppose we want to create a composite metric for an aircraft collision avoidance system that balances the alert rate and collision rate. Using the weighted sum method, we define the composite metric as the weighted sum of the alert rate and collision rate. Selecting a weight vector then allows us to choose a point on the Pareto frontier. The plots below show the Pareto frontier for two different weight vectors. The first weight vector (w1 = [0.8, 0.2]) gives more weight to minimizing the alert rate, while the second weight vector (w2 = [0.2, 0.8]) gives more weight to minimizing the collision rate. The points are colored according to the value of the composite metric.

The weight vector will be perpendicular to the Pareto frontier at the best
design point. The weight vector w1 is shown in blue for the first design point
and w2 is shown in blue for the second design point. The best design points
are highlighted in green.


The weighted exponential sum is a composite metric that combines the weighted sum and goal metrics as follows:

f(τ) = ∑ᵢ₌₁ⁿ wᵢ (fᵢ(τ) − f_goal)ᵖ     (3.6)

where p ≥ 1 is an exponent similar to that used in Lp norms. The weights wi must be positive and sum to 1. The weighted exponential sum allows us to balance the trade-offs between different metrics while also considering the distance to the utopia point. Other more sophisticated weighting methods such as the weighted min-max metric and the exponential weighted metric build on these ideas.5

5. For more information on composite metrics, see T. W. Athan and P. Y. Papalambros, "A Note on Weighted Criteria Methods for Compromise Solutions in Multi-Objective Optimization," Engineering Optimization, vol. 27, no. 2, pp. 155–176, 1996.
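The following sketch pulls equations (3.4) to (3.6) together for the two-metric aircraft example; the design points, weights, and goal point are hypothetical values chosen for illustration.

using LinearAlgebra

designs = [[0.1, 0.7], [0.3, 0.35], [0.6, 0.15], [0.9, 0.05]]  # [alert rate, collision rate]
f_goal  = [0.0, 0.0]                                           # utopia point
w       = [0.2, 0.8]                                           # assumed weights

weighted_sum(f, w)                  = dot(w, f)                       # equation (3.4)
goal_distance(f, f_goal; p=2)       = norm(f - f_goal, p)             # equation (3.5)
weighted_exp_sum(f, w, f_goal; p=2) = sum(w .* (f .- f_goal).^p)      # equation (3.6)

for f in designs
    println(f, " → weighted sum = ", round(weighted_sum(f, w), digits=3),
               ", goal distance = ", round(goal_distance(f, f_goal), digits=3),
               ", weighted exp sum = ", round(weighted_exp_sum(f, w, f_goal), digits=3))
end

# Design selected by the weighted sum composite metric (lower is better)
best = designs[argmin([weighted_sum(f, w) for f in designs])]
println("selected design: ", best)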
3.3.3 Preference Elicitation

Creating a composite metric using weights requires us to specify the relative importance of each metric. However, even domain experts may have difficulty translating their preferences to a set of precise numerical weights. Preference elicitation allows us to infer a set of weights based on expert responses to a set of preference queries. For example, we might present a domain expert with a pairwise query containing the metrics of two possible system designs and ask them to select the preferred design. By repeating this process for multiple different pairwise queries of system designs, we can infer the weight that the expert assigns to each metric.


In this section, we focus on inferring the weights of a weighted sum composite metric using pairwise queries. There are other schemes for eliciting preferences, such as ranking multiple system designs, but pairwise queries have been shown to pose minimal cognitive burden on the expert.6 We will also restrict ourselves to weight vectors with positive entries that sum to a value less than or equal to 1. Figure 3.8 shows the space of possible weights for the aircraft collision avoidance example.

6. V. Conitzer, "Eliciting Single-Peaked Preferences Using Comparison Queries," Journal of Artificial Intelligence Research, vol. 35, pp. 161–191, 2009.

Figure 3.8. The space of possible weights for the aircraft collision avoidance weighted sum metric.

Suppose we query the expert with a pair of metric vectors f1 and f2 and find that the expert prefers f1 to f2. For the weighted sum metric to be consistent with the preference, we must select a weight vector w such that

w⊤f1 < w⊤f2     (3.7)
w⊤(f1 − f2) < 0     (3.8)


where we assume that lower values for the composite metric are preferable.7 In effect, the response to the query further constrains the space of possible weight vectors (example 3.3).

7. If higher values are preferable, the inequality in equation (3.8) should be reversed.

Example 3.3. The effect of a preference query on the space of possible weight vectors for the aircraft collision avoidance example. Suppose we want to infer the weights for a composite metric that combines the alert rate and collision rate for an aircraft collision avoidance system. When we query a domain expert or stakeholder with system designs f1 = [0.8, 0.4] and f2 = [0.4, 0.8], we find that the expert prefers f1 to f2. In other words, the expert prefers the system design with the higher alert rate and lower collision rate. Since the weight vector must be consistent with this preference (equation (3.8)), we can further constrain the space of possible weight vectors as shown in the figure below.

The purple shaded region in the center plot shows the space of possible
weight vectors consistent with the expert’s preference. The plot on the right
shows the space of possible weight vectors consistent with the expert’s prefer-
ence and the constraint that the weights must sum to 1. We can further refine
the space of possible weight vectors by querying the expert with additional
pairs of system designs.

By querying the expert with multiple pairs of system designs, we can iteratively refine the space of possible weight vectors (figure 3.9). To minimize the number of times we must query the expert, it is common to select pairs of system designs that maximally reduce the space of possible weights. For example, one method is to select the query that comes closest to bisecting the space of possible weights.8 After querying the expert a desired number of times, we can select a

8. This method is known as Q-Eval. V. S. Iyengar, J. Lee, and M. Campbell, "Q-EVAL: Evaluating Multiple Attribute Items Using Queries," in ACM Conference on Electronic Commerce, 2001.

Figure 3.9. The effect of multiple preference queries (query 1, query 2, query 3, and the final weight space) on the space of possible weight vectors for the aircraft collision avoidance example. The blue shaded regions show the space of possible weight vectors before obtaining the expert's preference, and the purple shaded regions show the weight vectors consistent with the expert's preference. The space of possible weight vectors before the next query is the intersection of these regions.

set of weights from the refined weight space to create a composite metric that
reflects the expert’s preferences. While we could select any value for w that is
consistent with the expert’s responses, it is common to select the weight vector
that maximally separates the system designs that were presented to the expert.
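A simple way to realize this weight-space refinement in code (a sketch with hypothetical expert responses, not the Q-Eval algorithm) is to sample candidate weight vectors and keep only those consistent with every recorded preference according to equation (3.8).

using LinearAlgebra, Random
Random.seed!(1)

# Each preference records (preferred design f1, unpreferred design f2); hypothetical responses
preferences = [([0.8, 0.4], [0.4, 0.8]),
               ([0.6, 0.3], [0.2, 0.6])]

# A weight vector is consistent if w⊤(f1 − f2) < 0 for every preference (lower is better)
consistent(w) = all(dot(w, f1 - f2) < 0 for (f1, f2) in preferences)

# Sample candidate weights with positive entries that sum to at most 1
candidates = [rand() * normalize!(rand(2), 1) for _ in 1:10_000]
feasible = filter(consistent, candidates)
println(length(feasible), " of ", length(candidates), " sampled weight vectors are consistent")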

3.4 Logical Specifications

A logical specification ψ formally defines an operating requirement for a system


using a logical formula. A logical formula is a precise expression that evaluates to
either true or false. Logical specifications can be used to describe requirements
for both individual trajectories and trajectory distributions. For example, a logical
specification on an individual trajectory for an aircraft collision avoidance system
might check whether the aircraft collide at any point in the trajectory. A logical
specification over the entire distribution of aircraft collision avoidance trajectories
might require that the probability of collision is less than a certain threshold. We
can express logical formulas using several different types of logic. This section
introduces two common types of logic.

3.4.1 Propositional Logic


Propositional logic constructs logical formulas by connecting propositions using logical operators.9 A proposition is a statement that is either true or false. The basic building block of propositional logic is an atomic proposition, which is a proposition that cannot be further decomposed. The two most basic logical expressions are

9. A detailed overview of propositional logic is provided by M. Huth and M. Ryan, Logic in Computer Science: Modelling and Reasoning about Systems. Cambridge University Press, 2004.

Example 3.4. Constructing a propositional logic formula from a statement. Suppose we wish to express the following statement using propositional logic: ''If the agent is in a safe state, then the agent is not in a collision state.'' Let the variable S represent whether the agent is in a safe state and C represent whether the agent is in a collision state. The propositional logic statement is S → ¬C (read as ''S implies not C''). In this statement, S and C are atomic propositions because they cannot be broken down further. The logical formula S → ¬C is itself a proposition that can be combined with other propositions to create more complex formulas.

Table 3.1. Propositional logic operators and their equivalent construction using negation and conjunction. The constructions build from previous expressions for convenience (e.g., the use of ∨ in implication).

Expression | Explanation | Construction
¬P | Negation (Not): Inverts a Boolean value. | —
P ∧ Q | Conjunction (And): Evaluates to true if both P and Q are true. | —
P ∨ Q | Disjunction (Or): Evaluates to true if either P or Q are true. | ¬(¬P ∧ ¬Q)
P → Q | Implication: Evaluates to true unless P is true and Q is false. | ¬P ∨ Q
P ↔ Q | Biconditional: Evaluates to true when both P and Q are equivalent. | (P ∧ Q) ∨ (¬P ∧ ¬Q)

negation (‘‘not’’) and conjunction (‘‘and’’). All other logical expressions such as
disjunction (‘‘or’’), implication (‘‘if-then’’), and biconditional (‘‘if and only if’’) can
be constructed using negation and conjunction. Example 3.4 demonstrates the
construction of a propositional logic formula from a statement.
Table 3.1 shows the propositional logic operators and their construction using
negation and conjunction. We can describe propositional logic formulas using
truth tables, which show the value of the formula as a function of its inputs.
Figure 3.10 shows truth tables for each of the basic propositional logic operators.
Logical operators can also be illustrated as logic gates (figure 3.11), which are
fundamental building blocks for digital circuits.10 Example 3.5 implements the logical operators as functions in Julia.

10. R. Page and R. Gamboa, Essential Logic for Computer Science. MIT Press, 2019.
3.4.2 First-Order Logic


First-order logic extends propositional logic by introducing the notion of predicates and quantifiers.11 It uses variables to represent objects in a domain and predicate functions to evaluate propositions over these objects. For example, we could create a variable x to represent the state of an agent and a predicate function P(x) that returns true if the agent is in a safe state and false otherwise. We combine

11. First-order logic is also known as predicate logic. M. Huth and M. Ryan, Logic in Computer Science: Modelling and Reasoning about Systems. Cambridge University Press, 2004.


Figure 3.10. Truth tables for the propositional logic operators using atomic propositions P and Q. The truth tables show the outputs of each logical operator for all possible combinations of Boolean values for P and Q.

P | ¬P
false | true
true | false

P | Q | P ∧ Q | P ∨ Q | P → Q | P ↔ Q
false | false | false | false | true | true
false | true | false | true | true | false
true | false | false | true | false | false
true | true | true | true | true | true

Figure 3.11. Logical operators represented using logic gates: AND gate, OR gate, NOT gate, IMPLICATION gates, and BICONDITIONAL gates.


Example 3.5. Julia implementations of propositional logic operators. Consider two atomic propositions, P and Q. The basic operations of negation (!), conjunction (&&), and disjunction (||) are already implemented in most programming languages including Julia. Implication P → Q can be defined as the operator ⟶ given the Boolean values of P and Q:
julia> ⟶(P,Q) = !P || Q # \longrightarrow<TAB>
⟶ (generic function with 1 method)
julia> P = true;
julia> Q = false;
julia> P ⟶ Q
false

For the biconditional P ↔ Q, we can use the == sign:


julia> P = false;
julia> Q = false;
julia> P == Q
true

predicates to create propositions using logical operators. For instance, if we have


a predicate function Q( x ) that returns true if the agent is in a collision state, we
can create the proposition P( x ) → ¬ Q( x ) to express that the agent is not in a
collision state when it is in a safe state.
Quantifiers allow us to evaluate propositions over a collection of variables. The
universal quantifier ∀ (‘‘for all’’) returns true if all variables in the domain satisfy
the proposition. The existential quantifier ∃ (‘‘there exists’’) returns true if at least
one variable in the domain satisfies the proposition. These quantifiers allow us to
create specifications over full system trajectories by setting the domain to be the
set of all states in the trajectory. Example 3.6 demonstrates the use of quantifiers
to define an obstacle avoidance specification over a trajectory.
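In code, the universal and existential quantifiers over a trajectory correspond directly to all and any. The sketch below (grid world coordinates and obstacle/goal locations are assumptions for illustration) evaluates the two specifications from example 3.6 on a candidate trajectory.

# Hypothetical grid world trajectory as a sequence of (x, y) states
τ = [[1, 1], [2, 1], [3, 1], [3, 2], [4, 2], [5, 2], [5, 3]]

O(x) = x == [4, 2]   # predicate: state is the obstacle (assumed location)
G(x) = x == [5, 3]   # predicate: state is the goal (assumed location)

ψ₁(τ) = all(!O(x) for x in τ)                          # ∀x ¬O(x)
ψ₂(τ) = all(!O(x) for x in τ) && any(G(x) for x in τ)  # (∀x ¬O(x)) ∧ (∃x G(x))

println("ψ₁ = ", ψ₁(τ))   # false: this trajectory passes through the obstacle
println("ψ₂ = ", ψ₂(τ))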

3.5 Temporal Logic

Temporal logic extends first-order logic to specify properties over time. It is partic-
ularly useful for specifying properties of dynamical systems because it allows us
to describe how trajectories should evolve. This section outlines three common
types of temporal logic.


Example 3.6. Universal and existential quantifiers for an obstacle avoidance problem. The red region indicates an obstacle while the green region indicates the goal. Let x be a variable that represents the state of the agent in the grid world problem where we must avoid an obstacle (red), and define the domain X as the set of states that comprise a particular trajectory. We define a predicate function O(x) that evaluates to true if x is an obstacle state and false otherwise.
To define a specification ψ1 that states ‘‘for all states in the trajectory, the
agent does not hit an obstacle,’’ we can use the formula:

ψ1 = ∀ x ¬O( x )

The examples below show evaluations of ψ1 for two different trajectories.

ψ1 = true    ψ1 = false

Suppose we also want the agent to reach a goal state while avoiding the
obstacle. We can create an additional predicate G ( x ) that evaluates to true
if x is a goal state and false otherwise. We then create ψ2 to represent the
statement ‘‘for all states in the trajectory, the agent does not hit an obstacle
and there exists a state in the trajectory in which the agent reaches the goal’’
using the following formula:

ψ2 = (∀ x ¬O( x )) ∧ (∃ x G ( x ))

The examples below show evaluations of ψ2 for two different trajectories.

ψ2 = true    ψ2 = false


Figure 3.12. Examples of the binary temporal operator until (P U Q) and the unary temporal operators eventually (♦P) and always (□P). The temporal operator is defined from time t to the end of the sequence. Note that always holds as long as each subsequent ψ is true.

3.5.1 Linear Temporal Logic


Linear temporal logic (LTL) is a type of temporal logic that assumes a linear sequence of states.12 It introduces three main temporal operators.13 Given a proposition P, the always (□P) operator specifies that P must be true at all time steps in the future. The eventually (♦P) operator requires that P be true at some point in the future. Given another proposition Q, the until (P U Q) operator specifies that P must be true at least until Q becomes true.

Table 3.2 outlines the three LTL operators and their construction, and algorithm 3.1 evaluates LTL specifications over the sequence of states in a trajectory. The until operator can be written using first-order logic quantifiers, and the other two operators build on the until operator. Figure 3.12 shows the values of these operators over a trajectory, and example 3.7 shows how to construct an LTL specification for the grid world problem.

12. A. Pnueli, "The Temporal Logic of Programs," in Symposium on Foundations of Computer Science (SFCS), 1977. Computation tree logic (CTL) is another common temporal logic that operates over multiple future paths. A detailed overview is provided in C. Baier and J.-P. Katoen, "Principles of Model Checking," in MIT Press, 2008, ch. 6.

13. Other common operators include next and weak until.

Table 3.2. LTL operators. The propositions Pt and Qt represent whether P and Q are true at time t, and the ⊤ symbol indicates static truth.

Expression | Explanation | Construction
P U Q | Until: P is true at least until Q becomes true. | ∃t (Qt ∧ ∀t′ ((0 ≤ t′ < t) → Pt′ ∧ ¬Qt′))
♦P | Eventually: P will be true at some time in the future. | ⊤ U P
□P | Always: P is true at every time in the future. | ¬♦(¬P)

Algorithm 3.1. Definition of an LTL specification. The formula is evaluated over the sequence of states in the trajectory starting at the first time step.

struct LTLSpecification <: Specification
    formula # formula specified using SignalTemporalLogic.jl
end
evaluate(ψ::LTLSpecification, τ) = ψ.formula([step.s for step in τ])


Example 3.7. LTL formula for an obstacle avoidance problem where a blue checkpoint must be reached before the green goal while avoiding the red obstacle. For a navigation problem, let ψ be the LTL property specification that states ''eventually reach the goal after passing through the checkpoint and always avoid the obstacle.'' First, we define the following predicate functions:

F(st) : the state s at time t contains an obstacle
G(st) : the state s at time t is the goal
C(st) : the state s at time t is the checkpoint

The specification can be defined using LTL as follows:

ψ = ♦G(st) ∧ (¬G(st) U C(st)) ∧ □(¬F(st))

This formula requires that the agent reaches the goal (♦G(st)) but that the goal is not reached until the checkpoint (¬G(st) U C(st)). Additionally, the agent must always avoid obstacles (□(¬F(st))). The figure in the caption
shows an example trajectory that satisfies this specification. The following
code constructs the LTL specification:
F = @formula sₜ -> sₜ == [5, 5]
G = @formula sₜ -> sₜ == [7, 8]
C = @formula sₜ -> sₜ == [8, 3]
ψ = LTLSpecification(@formula ◊(G) ∧ 𝒰(¬G, C) ∧ □(¬F))


3.5.2 Signal Temporal Logic


Signal temporal logic (STL) extends LTL to specify properties over signals.14 A signal is a real-valued sequence of points in discrete time that represent the state of a system over time.15 STL introduces two key extensions to LTL to handle real-valued signals. The first extension is the ability to specify properties over a time interval [a, b]. For example, we can write ♦[a,b]P to specify that P will eventually be true within the time interval [a, b].16

The second extension is the introduction of predicates that map real-valued signals to truth values. Specifically, it introduces the predicate µc(st) that returns true if

µ(st) > c     (3.9)

where µ(·) is a real-valued function that operates on the state (example 3.8). Table 3.3 defines the specifications for the continuum world, inverted pendulum, and collision avoidance example problems using STL. Algorithm 3.2 provides a framework for evaluating STL specifications over a trajectory given a time interval.

14. STL was first introduced in O. Maler and D. Nickovic, "Monitoring Temporal Properties of Continuous Signals," in International Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems, 2004.

15. These points may be sampled at regular or irregular intervals from a continuous-time function.

16. When a time range is omitted, we assume the positive time path of [0, ∞).

Example 3.8. Julia implementation of an STL formula. Suppose we want to implement the following STL formula in code: ''eventually the signal will be greater than 1.'' We can use the SignalTemporalLogic.jl package to define the predicate µ and the formula ψ as follows:
julia> using SignalTemporalLogic
julia> τ = [-1.0, -3.2, 2.0, 1.5, 3.0, 0.5, -0.5, -2.0, -4.0, -1.5];
julia> μ = @formula sₜ -> sₜ > 1.0;
julia> ψ = @formula ◊(μ);
julia> ψ(τ) # check if formula is satisfied
true

The formula is satisfied since the signal eventually becomes greater than 1.

Algorithm 3.2. Definition of an STL specification for an interval.

struct STLSpecification <: Specification
    formula # formula specified using SignalTemporalLogic.jl
    I       # time interval (e.g. 3:10)
end
evaluate(ψ::STLSpecification, τ) = ψ.formula([step.s for step in τ[ψ.I]])


Table 3.3. Signal temporal logic formulas for three of the example problems used throughout the book.

System: Continuum World
Property: ''Reach the goal without hitting the obstacle''
  G(st): st is in the goal region
  F(st): st is in the obstacle region
  ψ = ♦G(st) ∧ □(¬F(st))
Implementation:
  G = @formula s->norm(s.-[6.5,7.5])≤0.5
  F = @formula s->norm(s.-[4.5,4.5])≤0.5
  ψ = @formula ◊(G) ∧ □(¬F)

System: Inverted Pendulum
Property: ''Keep the pendulum balanced''
  B(st): |θt| ≤ π/4
  ψ = □B(st)
Implementation:
  B = @formula s->abs(s[1])≤π/4
  ψ = @formula □(B)

System: Aircraft Collision Avoidance
Property: ''Ensure at least 50 meters relative altitude between 40 and 41 seconds''
  S(st): |ht| ≥ 50
  ψ = □[40,41] S(st)
Implementation:
  S = @formula s->abs(s[1])≥50
  ψ = @formula □(40:41, S)


One benefit of expressing properties using STL is the ability to calculate a robustness metric using the specification. Robustness measures how ''close'' a signal is to satisfying a specification. For example, the robustness of the predicate µc(st) is defined as

ρ(st, µc) = µ(st) − c     (3.10)

If the predicate is false for st, the robustness will be negative, and if it is true, the robustness will be positive. The signal becomes closer to not satisfying the specification as the robustness approaches zero and further from not satisfying the specification as the robustness increases from zero.

Given propositions P and Q that correspond to predicates µc(st) and µd(st), we can also define robustness formulas for the propositional logic operators ¬P, P ∧ Q, P ∨ Q, and P → Q as follows:

ρ(st, ¬P) = −ρ(st, P)     (3.11)
ρ(st, P ∧ Q) = min(ρ(st, P), ρ(st, Q))     (3.12)
ρ(st, P ∨ Q) = max(ρ(st, P), ρ(st, Q))     (3.13)
ρ(st, P → Q) = max(−ρ(st, P), ρ(st, Q))     (3.14)

Intuitively, the robustness of a conjunction is the minimum of the robustness of its components since both components must hold, and the robustness of a disjunction is the maximum of the robustness of its components since only one component must hold.

We can also define robustness over the temporal operators:

ρ(st, ♦[a,b]P) = max_{t′∈[t+a, t+b]} ρ(st′, P)     (3.15)
ρ(st, □[a,b]P) = min_{t′∈[t+a, t+b]} ρ(st′, P)     (3.16)
ρ(st, P U[a,b] Q) = max_{t′∈[t+a, t+b]} min(ρ(st′, Q), min_{t″∈[t, t′]} ρ(st″, P))     (3.17)

In general, the robustness of a temporal operator is the maximum or minimum


of the robustness of its components over the specified time interval. We take
the maximum over all time steps for the eventually operator to get the best-case
signal because the signal must satisfy the property at only one time step in the
interval. Conversely, we take the minimum over all time steps for the always


Example 3.9. Robustness of the formulas ψ1 = ♦µ0 and ψ2 = □µ0 over a signal τ. Let µ0(st) be a predicate function that is true if st is greater than 0. The following code computes the robustness of the formulas ♦µ0 and □µ0 over a signal τ:

julia> using SignalTemporalLogic
julia> τ = [-1.0, -3.2, 2.0, 1.5, 3.0, 0.5, -0.5, -2.0, -4.0, -1.5];
julia> μ = @formula sₜ -> sₜ > 0.0;
julia> ψ₁ = @formula ◊(μ);
julia> ρ₁ = ρ(τ, ψ₁)
3.0
julia> ψ₂ = @formula □(μ);
julia> ρ₂ = ρ(τ, ψ₂)
-4.0
The robustness of the formula ♦µc is the maximum difference between the signal and the threshold. We would have to decrease all of our signal values by at least this value to make the formula false. The robustness of the formula □µc is the minimum difference between the signal and the threshold. We would have to increase all of our signal values by at least this value to make the formula true. The figure in the caption shows signal values that determine the robustness for each formula.


operator because the signal must satisfy the property at all time steps in the interval. Example 3.9 demonstrates this concept.

We can use the robustness metric to assess how close a given system trajectory is to a failure. Furthermore, if we are able to compute the gradient of the robustness metric with respect to certain inputs to the system, we can understand how these inputs affect the overall safety of the system. We will use this idea throughout the book to understand system behavior. For example, we can uncover the failure modes of a system by using the robustness metric to guide the simulator towards a failure trajectory (see chapter 4 for more details).

Taking the gradient of the robustness metric requires that the robustness formula is differentiable over the input space. However, the min and max functions that commonly occur in STL formulas are not differentiable everywhere. To address this challenge, we can use smooth approximations of the min and max functions, such as the softmin and softmax functions, respectively.17 These functions are defined as

softmin(s; w) = ∑ᵢ₌₁ᵈ sᵢ exp(−sᵢ/w) / ∑ⱼ₌₁ᵈ exp(−sⱼ/w)     (3.18)
softmax(s; w) = ∑ᵢ₌₁ᵈ sᵢ exp(sᵢ/w) / ∑ⱼ₌₁ᵈ exp(sⱼ/w)     (3.19)

where s is a signal of length d and w is a weight. As w approaches infinity, the softmin and softmax functions approach the mean function. As w approaches zero, the softmin and softmax functions approach the min and max functions (figure 3.13). We call the robustness metric that uses the softmin and softmax functions the smooth robustness metric. Figure 3.14 shows the gradient of the smooth robustness metric for different values of w.

17. K. Leung, N. Aréchiga, and M. Pavone, "Backpropagation Through Signal Temporal Logic Specifications: Infusing Logical Structure into Gradient-Based Methods," The International Journal of Robotics Research, vol. 42, no. 6, pp. 356–370, 2023.

Figure 3.13. Smooth robustness metric ρ̃ for the formula in example 3.9. The robustness metric ρ1 is shown as a blue dashed line, and the mean of the points in the trajectory is shown in gray. When w = 0, the smooth robustness metric ρ̃1 is equal to the robustness metric ρ1. When w is large, the smooth robustness metric approaches the mean of the trajectory.
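A minimal implementation of equations (3.18) and (3.19), together with smooth versions of the ♦ and □ robustness from example 3.9, is sketched below in plain Julia (it does not use SignalTemporalLogic.jl). The signal is the one from example 3.9; the weights and the special-casing of w = 0 are illustrative choices.

# Smooth approximations of min and max over a signal s with weight w (equations 3.18–3.19)
softmin(s, w) = w == 0 ? minimum(s) : sum(s .* exp.(-s ./ w)) / sum(exp.(-s ./ w))
softmax(s, w) = w == 0 ? maximum(s) : sum(s .* exp.(s ./ w)) / sum(exp.(s ./ w))

# Smooth robustness of ◊(sₜ > c) and □(sₜ > c): replace max/min in equations (3.15)–(3.16)
smooth_eventually(τ, c, w) = softmax(τ .- c, w)
smooth_always(τ, c, w)     = softmin(τ .- c, w)

τ = [-1.0, -3.2, 2.0, 1.5, 3.0, 0.5, -0.5, -2.0, -4.0, -1.5]   # signal from example 3.9
for w in (0, 1, 5, 20)
    println("w = $w: ρ̃(◊) = ", round(smooth_eventually(τ, 0.0, w), digits=3),
            ", ρ̃(□) = ", round(smooth_always(τ, 0.0, w), digits=3))
end

As w grows, both smooth values drift toward the mean of the signal, which is the behavior shown in figure 3.13.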

3.6 Reachability Specifications

A reachability specification is a special type of temporal logic specification that describes a state or set of states that a system should or should not reach during its execution. Let S T ⊆ S represent the target set of states and define the predicate function R(st) to be true if st ∈ S T and false otherwise. If our goal is to reach the target set, the reachability specification has the following form:

ψ = ♦R(st)     (3.20)


Figure 3.14. The gradient of the smooth robustness function for the formula in example 3.9 with respect to the signal values for different values of w (w = 0, 1, 2, 5, and 20) used in the smooth robustness metric ρ̃. When w = 0, the gradient is only nonzero at the point corresponding to the maximum robustness. As w increases, the gradient becomes nonzero at all points in the trajectory. Since the smooth robustness approaches the mean of the trajectory as w increases, the gradient becomes more uniform.

If our goal is to avoid the target set, we write the reachability specification as

ψ = ¬♦ R(sₜ) = □ ¬R(sₜ)    (3.21)

Writing specifications in this form is useful because many algorithms related


to formal methods and model checking are centered around reachability speci-
fications. For example, the algorithms in chapters 8 to 10 determine whether a
system could reach a target set. For some systems, such as the inverted pendulum system in example 3.10, the reachability specification is the most natural way to express the desired behavior. However, it is possible to convert other types of
specifications into reachability specifications using various techniques. In fact, we
can convert any LTL specification into a reachability specification by augmenting
the state space of the system.

Example 3.10. Reachability specification for the inverted pendulum system.

Let S_T be the set of states for the inverted pendulum system where the pendulum has tipped over. In other words, S_T is the set of states where the angle θ is outside the range [−π/4, π/4]. Our goal is to avoid reaching this set of states, so we define the reachability specification as

ψ = ¬♦ R(sₜ)    (3.22)

where R(st ) is the predicate function that checks if the state is in the target
set.
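As a minimal sketch, the predicate used in equation (3.22) might be written in Julia as follows, assuming the pendulum state is a vector s = [θ, ω]; this implementation is illustrative and not taken from the text.

# Predicate for the target set S_T: true when the pendulum has tipped over.
R(s) = abs(s[1]) > π / 4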

The first step in converting an LTL specification into a reachability specification is to represent the LTL formula as a Büchi automaton.¹⁸ A Büchi automaton consists of a set of states Q, an initial state q₁ ∈ Q, a set of atomic propositions Π, a transition function δ, and a set of accepting states. The transition function δ maps a state and an instantiation of truth values for the atomic propositions to the next state. The accepting states are the states that must be visited infinitely often for the automaton to accept an infinite sequence of states. Example 3.11 shows a simple Büchi automaton with two states and two propositions.

¹⁸ Büchi automata are named after Swiss mathematician Julius Richard Büchi (1924–1984).

Example 3.11. Example of a Büchi automaton with two states and two atomic propositions.

The figure below shows a simple Büchi automaton that accepts an infinite sequence of states if the sequence satisfies the LTL formula A ∧ B.

[Diagram: the automaton starts in q₁, loops on ¬(A ∧ B), moves to q₂ on A ∧ B, and loops on ⊤ in q₂; q₂ is the accepting state.]

The automaton has two states Q = {q₁, q₂}, where q₁ is the initial state and q₂ is the accepting state. The automaton has two atomic propositions A and B. The transition function is defined for all possible combinations of truth values for the atomic propositions:

δ(q₁, A ∧ B) = q₂
δ(q₁, A ∧ ¬B) = q₁
δ(q₁, ¬A ∧ B) = q₁
δ(q₁, ¬A ∧ ¬B) = q₁
δ(q₂, −) = q₂

The diagram above compactly summarizes the transition function as δ(q₁, A ∧ B) = q₂ and δ(q₁, ¬(A ∧ B)) = q₁. The accepting state is denoted using the double circle.
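As an illustration only (the symbol-based state encoding and named-tuple truth assignment below are assumptions, not from the text), this transition function could be written in Julia as:

# Transition function δ for the two-state automaton in example 3.11.
function δ(q, v)
    if q == :q1
        return (v.A && v.B) ? :q2 : :q1   # move to the accepting state on A ∧ B
    else
        return :q2                        # q2 is absorbing
    end
end

δ(:q1, (A=true, B=true))    # returns :q2
δ(:q1, (A=true, B=false))   # returns :q1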
It is possible to represent any LTL formula as a Büchi automaton (example 3.12).¹⁹ The accepting trajectories of the Büchi automaton satisfy the corresponding LTL formula. To obtain a reachability specification from the Büchi automaton, we must augment the state space of the system of interest. The new state space is the product of the states of the system and the states of the Büchi automaton:

(s, q) ∈ S × Q    (3.23)

¹⁹ More details are provided in C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, 2008. Open source software packages such as Spot can be used to do the conversion automatically. A. Duret-Lutz, “Manipulating LTL Formulas Using Spot 1.0,” in Automated Technology for Verification and Analysis, 2013. The Spot.jl package provides an interface to the Spot library.


Example 3.12. Conversion of an LTL formula to a Büchi automaton.

Suppose we have an LTL formula that specifies that we need to visit a checkpoint before reaching a goal, written as

♦G ∧ ¬G U C

where G is an atomic proposition that represents whether we reach the goal and C is an atomic proposition that represents whether we reach the checkpoint. We can convert this formula into the Büchi automaton using Spot.jl as follows:
using Spot
a = translate(LTLTranslator(), ltl"◊(G) ∧ ¬G𝒰C")

The resulting automaton is shown below. It has 4 states and the same atomic propositions as the LTL formula. The accepting state is q₄, and the LTL formula is satisfied if the automaton visits q₄ infinitely often, or in other words, if a trajectory reaches q₄. The state q₂ represents the state where the agent has reached the goal but has not reached the checkpoint. Once this state has been reached, the agent will remain in this state forever with no chance of reaching the accepting state and satisfying the LTL formula. This state is often omitted in practice to reduce the size of the automaton.

[Diagram: Büchi automaton with states q₁ (start), q₂, q₃, and q₄ (accepting, drawn with a double circle); transitions are labeled ¬C ∧ ¬G, ¬C ∧ G, C ∧ G, C ∧ ¬G, G, ¬G, and ⊤.]


The transition model for the new state space is defined by the transition model of the system T and the transition model of the Büchi automaton δ:

T((s′, q′) | (s, q), a) = T(s′ | s, a) if q′ = δ(q, L(s)), and 0 otherwise    (3.24)

where L(s) is a labeling function that maps a state s to values for the atomic propositions of the Büchi automaton. For example, a labeling function for the system in example 3.12 would map the state sₜ to the values of the atomic propositions that specify whether it is a goal state or checkpoint state.
We refer to the system with the augmented state space as the product system.
The reachability specification for the product system is

ψ = ♦ R((st , qt )) (3.25)

where R((sₜ, qₜ)) is a predicate function that returns true if qₜ is an accepting state of the Büchi automaton and false otherwise. Checking whether the product system satisfies the reachability specification is equivalent to checking whether the original system satisfies the LTL formula. Figure 3.15 shows the product system with the grid world as the original system and the LTL specification in example 3.12.
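The following hedged sketch illustrates one way the product construction could be coded in Julia. The helper names step_system, label, and the accepting set are illustrative assumptions rather than functions defined in the text.

# One step of the product system from equation (3.24). The original system
# samples s′ ∼ T(· | s, a), and the automaton advances deterministically
# using the labeling function L(s).
function step_product(s, q, a)
    s′ = step_system(s, a)     # hypothetical sampler for the original system
    q′ = δ(q, label(s))        # automaton transition on the labels of s
    return (s′, q′)
end

# Reachability predicate from equation (3.25): true in accepting automaton states.
R_product((s, q)) = q in Set([:q4])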

3.7 Summary

• Metrics and specifications allow us to quantify and express the desired behavior
of a system.

• For stochastic systems, we often compute metrics over the full distribution of
possible outcomes.

• In situations where we are interested in multiple metrics, we can create a composite metric that accounts for the relative importance of each metric.

• Logical specifications allow us to formally express requirements for a system using logical formulas.

• Propositional logic and first-order logic allow us to express properties over a set of propositions.


Figure 3.15. Converting an LTL specification (see example 3.12) for the grid world problem to a reachability specification by creating a product system with an augmented state space. The original system is shown on the left, the Büchi automaton is shown in the middle, and the product system is shown on the bottom. We start in the gray grid world until we either reach the checkpoint (blue) or goal (green). If we reach the goal in the gray grid world, we transition to the red grid world and remain there forever. If we reach the checkpoint in the gray grid world, we transition to the blue grid world and remain there until we reach the goal. If we reach the goal in the blue grid world, we transition to the green grid world, which represents an accepting state for the Büchi automaton. The dashed green line in the Büchi automaton does not appear in the product system because we cannot reach the checkpoint and goal at the same time in the grid world system. The set of target states for the reachability problem is the set of states in the green grid world.

• Temporal logic extends first-order logic to express properties about how sys-
tems evolve over time.

• Linear temporal logic (LTL) and signal temporal logic (STL) are two common temporal logics used in control and verification.

• Reachability specifications are a special type of temporal logic specification that describe a state or set of states that a system should or should not reach during its execution.

4 Falsification through Optimization

The first set of validation algorithms we will explore relate to falsification. Fal-
sification is the process of finding trajectories of a system that violate a given
specification. Such trajectories are sometimes referred to as counterexamples, failure
trajectories, or falsifying trajectories. We will refer to them in this textbook as failures
for simplicity. The beginning of the chapter introduces a naïve algorithm for find-
ing failures based on direct sampling, with the rest of the chapter focused on more
sophisticated algorithms that use optimization techniques to guide the search
for failures. Optimization-based falsification relies on the concept of disturbances,
which control the behavior of the system. We demonstrate how to frame the
falsification problem as an optimization over disturbance trajectories and outline
several techniques to perform the optimization.

4.1 Direct Sampling

When performing falsification, we want to find any trajectory τ that violates a given specification ψ, written as τ ∉ ψ. Algorithm 4.1 uses direct sampling to search for such trajectories.¹ It performs m rollouts and returns all failure trajectories. Figure 4.1 shows an example of direct falsification applied to the grid world problem.

Figure 4.1. Monte Carlo falsification applied to the grid world problem with m = 100 and d = 50. The probability of slipping is set to 0.8. The algorithm samples 96 trajectories before finding a failure. The failure trajectory is shown in red.

¹ This type of sampling is often referred to as Monte Carlo sampling, named after the Monte Carlo casino in Monaco. Similar to gambling, the algorithm depends on random chance.

Algorithm 4.1 may struggle for systems with rare failure events. For a system with probability of failure pfail, we will require 1/pfail samples on average to observe a single failure. In fact, we can infer a distribution over the number of samples required to find a failure. The probability of finding the first failure on the kth sample is equivalent to the probability of sampling k − 1 successes with probability 1 − pfail and one failure with probability pfail. We therefore write the

Algorithm 4.1. The direct falsification algorithm for finding failures. The algorithm performs rollouts to a depth d to generate m samples of the system sys. It then filters these samples and returns the ones that violate the specification ψ. If no failures are found, the algorithm returns an empty vector.

struct DirectFalsification
    d # depth
    m # number of samples
end

function falsify(alg::DirectFalsification, sys, ψ)
    d, m = alg.d, alg.m
    τs = [rollout(sys, d=d) for i in 1:m]
    return filter(τ->isfailure(ψ, τ), τs)
end

probability mass function of the distribution as

P(k) = (1 − pfail)^(k−1) pfail    (4.1)

where k ∈ ℕ.

Equation (4.1) corresponds to the probability mass function of a geometric distribution with parameter pfail. Figure 4.2 shows an example of a geometric distribution. The expected value of this distribution, 1/pfail, corresponds to the average number of samples required to find a failure. Example 4.1 illustrates this relationship for the aircraft collision avoidance problem. Systems with very low failure probabilities will require a large number of samples for direct falsification. For example, some aviation systems have failure probabilities on the order of 10⁻⁹. These systems require 1 billion samples on average to observe a single failure event. The remainder of the chapter discusses more efficient falsification techniques.

Figure 4.2. The probability mass function of a geometric distribution with parameter pfail = 0.2. The expected value of this distribution is 1/pfail = 5.
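The following snippet sketches this calculation with Distributions.jl; note that its Geometric(p) counts the number of non-failure rollouts before the first failure, so we add one to get the total number of rollouts. The value of pfail below is illustrative.

using Distributions

pfail = 1e-3
dist = Geometric(pfail)   # rollouts without a failure before the first failure
mean(dist) + 1            # expected rollouts until the first failure ≈ 1/pfail
cdf(dist, 10_000 - 1)     # probability of seeing a failure within 10,000 rollouts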

4.2 Disturbances

We can systematically search for failures by taking control of the sources of ran-
domness in the system. We control these sources of randomness using disturbances.
To incorporate disturbances into a system, we rewrite its sensor, agent, and en-
vironment models by breaking up their stochastic and deterministic elements.
For example, the observation model o ∼ O(· | s) can be written as a deterministic
function of the current state s and a stochastic disturbance xo such that

o = O(s, xo ), xo ∼ Do (· | s) (4.2)


Example 4.1. Direct falsification applied to the aircraft collision avoidance problem with different levels of noise applied to the transitions. There are four state variables for the collision avoidance problem. These plots show how two of these state variables evolve for each trajectory. The horizontal axis is the time to collision tcol, and the vertical axis is the altitude relative to the intruder aircraft h.

Suppose we want to find failures of an aircraft collision avoidance system using direct falsification. In this scenario, a failure is a collision between two aircraft, which occurs when the relative altitude to the intruder aircraft h is within ±50 m and the time to collision tcol is zero. The collision avoidance environment applies additive noise with standard deviation σ to the relative vertical rate of the intruder aircraft ḣ at each time step. This noise accounts for variation in pilot response to advisories and the intruder flight path. The plots below use different values of σ and show the trajectory samples produced before finding the first failure with the first failure trajectory highlighted in red.

[Plots: trajectories of h (m) versus tcol (s) for σ = 5 m, σ = 3 m, and σ = 2 m.]

As σ decreases, failures become less likely, and more trajectories are required
to find a failure. In this example, the first failure is found after 41 samples
with σ = 5 m, 84 samples with σ = 3 m, and 522 samples with σ = 2 m.


where O(s, xo ) is a deterministic function and Do (· | s) is a disturbance distribution.


For example, a disturbance applied to an additive noise sensor controls the amount
of sensor noise added to the true state to produce an observation. Example 4.2
demonstrates this concept for a Gaussian noise sensor model.

Example 4.2. Separating the stochastic and deterministic elements of a sensor with a Gaussian noise model.

Suppose we model a sensor using a Gaussian noise model such that O(o | s) = N(o | s, Σ). We can rewrite this sensor model as

o = s + xo,    xo ∼ N(· | 0, Σ)

We can then define this sensor using the following code:


struct GaussianNoiseSensor <: Sensor
Do # distribution = Do(s)
end
(sensor::GaussianNoiseSensor)(s) = s + rand(sensor.Do(s))
(sensor::GaussianNoiseSensor)(s, xo) = s + xo

In this code, Do represents the nominal disturbance distribution Do(xo | s) = N(xo | 0, Σ). Since it is a conditional distribution, we represent it as a
function that takes in a state s and outputs a distribution. The disturbance in
this sensor model does not depend on the state, so the function returns the
same distribution regardless of its input. The first function represents the
original sensor model and adds noise sampled from Do to the true state s to
produce an observation. The second function allows us to deterministically
produce an observation for state s given a disturbance xo.

The agent’s policy and the environment’s transition model can also be decom-
posed:
a = π(o, xa),    xa ∼ Da(· | s)    (4.3)
s′ = T(s, a, xs),    xs ∼ Ds(· | s, a)    (4.4)

where π (o, x a ) and T (s, a, xs ) are deterministic functions and Da (· | s) and


Ds (· | s, a) are disturbance distributions. In this textbook, we will wrap these three
components into a single disturbance x and disturbance distribution D. Given
a current state and disturbance distribution, we can sample a disturbance and
produce an observation, action, and next state (algorithm 4.2). For system com-
ponents that are modeled using deterministic functions, applying a disturbance
has no effect.


Algorithm 4.2. Implementation of a disturbance and disturbance distribution. The individual disturbance components are used to control the agent, environment, and sensor respectively. Since the components of the disturbance distribution are conditional distributions, we assume they are functions that take in the evidence variables and output a sampleable distribution. Given a current state s and disturbance distribution D, the step function samples a disturbance and uses it to produce an observation, action, and next state.

struct Disturbance
    xa # agent disturbance
    xs # environment disturbance
    xo # sensor disturbance
end

struct DisturbanceDistribution
    Da # agent disturbance distribution
    Ds # environment disturbance distribution
    Do # sensor disturbance distribution
end

function step(sys::System, s, D::DisturbanceDistribution)
    xo = rand(D.Do(s))
    o = sys.sensor(s, xo)
    xa = rand(D.Da(o))
    a = sys.agent(o, xa)
    xs = rand(D.Ds(s, a))
    s′ = sys.env(s, a, xs)
    x = Disturbance(xa, xs, xo)
    return (; o, a, s′, x)
end
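As a usage sketch (the specific distributions below are illustrative assumptions, not part of the text), a disturbance distribution that perturbs only the sensor could be constructed and stepped as follows. The Deterministic placeholder mirrors its use in example 4.3.

using Distributions, LinearAlgebra

# Gaussian sensor disturbance; no agent or environment disturbance.
D = DisturbanceDistribution((o) -> Deterministic(),
                            (s, a) -> Deterministic(),
                            (s) -> MvNormal(zeros(2), 0.01 * Matrix(1.0I, 2, 2)))

# o, a, s′, x = step(sys, s, D)   # one simulated step from state s of system sys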

4.3 Fuzzing

Unlike direct sampling, which samples from the nominal distribution over system trajectories, we can find failures more efficiently by sampling from a trajectory distribution designed to stress the system. We refer to this process as fuzzing.² Before we can perform fuzzing, we need to define the components of a trajectory distribution. There are two sources of randomness in a trajectory rollout: the initial state and the disturbances applied at each time step. Therefore, we can fully capture the distribution over trajectories by specifying an initial state distribution and a disturbance distribution for each time step (algorithm 4.3).

² Fuzzing is a well-known concept in testing of traditional software. It refers to the generation of off-nominal inputs to a program to uncover potential bugs or failures and was first introduced in B. P. Miller, L. Fredriksen, and B. So, “An Empirical Study of the Reliability of UNIX Utilities,” Communications of the ACM, vol. 33, no. 12, pp. 32–44, 1990.

Algorithm 4.3. Definition of a trajectory distribution. The initial_state_distribution function returns the distribution over initial states. The disturbance_distribution function returns the disturbance distribution at time t. The depth function returns the number of time steps in the trajectories sampled from the distribution.

abstract type TrajectoryDistribution end

function initial_state_distribution(p::TrajectoryDistribution) end
function disturbance_distribution(p::TrajectoryDistribution, t) end
function depth(p::TrajectoryDistribution) end

In the algorithms presented so far, we have been implicitly sampling from the nominal trajectory distribution for a system. We can explicitly construct this distribution for a given system using algorithm 4.4. The nominal trajectory distribution


uses the default initial state and disturbance distributions for the components
of the system. Nominal trajectory distributions are stationary, meaning that the
disturbance distribution does not depend on time.

Algorithm 4.4. The nominal trajectory distribution for a system. We can construct this distribution for a given system sys and depth d using the default initial state and disturbance distributions specified by the components of the system. Nominal trajectory distributions are stationary, so the disturbance_distribution function returns the same value for any time input t.

struct NominalTrajectoryDistribution <: TrajectoryDistribution
    Ps # initial state distribution
    D # disturbance distribution
    d # depth
end

function NominalTrajectoryDistribution(sys::System, d)
    D = DisturbanceDistribution((o) -> Da(sys.agent, o),
                                (s, a) -> Ds(sys.env, s, a),
                                (s) -> Do(sys.sensor, s))
    return NominalTrajectoryDistribution(Ps(sys.env), D, d)
end

initial_state_distribution(p::NominalTrajectoryDistribution) = p.Ps
disturbance_distribution(p::NominalTrajectoryDistribution, t) = p.D
depth(p::NominalTrajectoryDistribution) = p.d

We sample trajectories from a trajectory distribution by performing rollouts.


Algorithm 4.5 implements a trajectory rollout given a trajectory distribution. It
returns a trajectory τ = (s1 , o1 , a1 , x1 , . . . , sd , od , ad , xd ). If the initial state distribu-
tion and disturbance distributions correspond to the nominal distributions for
the system, algorithm 4.5 performs the same function as algorithm 1.2. However,
algorithm 4.5 also allows us to sample from a different trajectory distribution.
We can use it to perform fuzzing by specifying a trajectory distribution that is
designed to increase the likelihood of sampling failure trajectories. Example 4.3
demonstrates this technique on the inverted pendulum system.

Algorithm 4.5. A function that performs a rollout of a system sys to a depth d given an initial state s and trajectory distribution p. It repeatedly calls the step function, which steps the system forward in time.

function rollout(sys::System, p::TrajectoryDistribution; d=depth(p))
    s = rand(initial_state_distribution(p))
    τ = []
    for t = 1:d
        o, a, s′, x = step(sys, s, disturbance_distribution(p, t))
        push!(τ, (; s, o, a, x))
        s = s′
    end
    return τ
end


Example 4.3. Fuzzing applied to the inverted pendulum system. The plots in the top row show the sampled disturbances for the sensor noise on each state variable. The initial state distribution is the same as the nominal initial state distribution (algorithm A.6). The plots on the bottom row show the corresponding trajectories for θ with failures highlighted in red. By slightly increasing the standard deviation of the simulated sensor noise, we are able to uncover two failures.

Suppose we want to find failures of the inverted pendulum system with an additive noise sensor with Do(o | s) = N(o | 0, Σ) and Σ = 0.01I. If we collect 100 samples with this nominal distribution, we do not find any failures. However, if we define a new distribution and increase the standard deviation of the sensor noise on each variable from 0.1 to 0.15 (referred to as fuzzing), we are able to find two failures of the system in the first 100 samples. The following code can be used to define the fuzzing distribution:

struct PendulumFuzzingDistribution <: TrajectoryDistribution
    Σₒ # sensor disturbance covariance
    d # depth
end
function initial_state_distribution(p::PendulumFuzzingDistribution)
    return Product([Uniform(-π / 16, π / 16), Uniform(-1., 1.)])
end
function disturbance_distribution(p::PendulumFuzzingDistribution, t)
D = DisturbanceDistribution((o)->Deterministic(),
(s,a)->Deterministic(),
(s)->MvNormal(zeros(2), p.Σₒ))
return D
end
depth(p::PendulumFuzzingDistribution) = p.d

The plots show the disturbances and trajectories for both distributions.

[Plots: sampled sensor disturbances xo,θ versus xo,ω (top row) and the corresponding trajectories of θ (rad) over time (bottom row) under the nominal and fuzzing distributions.]


4.4 Falsification through Optimization

The falsification problem can be reformulated as a search over the space of initial
states and disturbances. Algorithm 4.6 performs a trajectory rollout given an initial
state and a sequence of disturbances. We refer to this sequence of disturbances as
a disturbance trajectory x = ( x1 , . . . , xd ). Unlike algorithm 4.5, algorithm 4.6 is
deterministic. The initial state s and disturbance trajectory x fully determine the
resulting trajectory τ.

Algorithm 4.6. A function that performs a rollout of a system sys to a depth d given an initial state s and disturbance trajectory 𝐱. It repeatedly calls the step function, which steps the system forward in time. The step function takes in the current state s and disturbance x and deterministically produces an observation o from the sensor, gets the action a from the agent based on this observation, and determines the next state s′ from the environment.

function step(sys::System, s, x)
    o = sys.sensor(s, x.xo)
    a = sys.agent(o, x.xa)
    s′ = sys.env(s, a, x.xs)
    return (; o, a, s′)
end

function rollout(sys::System, s, 𝐱; d=length(𝐱))
    τ = []
    for t in 1:d
        x = 𝐱[t]
        o, a, s′ = step(sys, s, x)
        push!(τ, (; s, o, a, x))
        s = s′
    end
    return τ
end

To perform falsification in this context, we want to find an initial state s and disturbance trajectory x that produce a trajectory τ such that τ ∉ ψ. Optimization-based falsification techniques use an objective function to guide this search. An objective function f(τ) maps a trajectory τ to a value related to its level of safety with respect to ψ. We can then search for failures by minimizing this objective over the space of initial states and disturbances as follows

minimize_{s, x}   f(τ)
subject to   τ = Rollout(s, x)    (4.5)

The rest of this chapter discusses different objective functions and optimization
techniques for solving the optimization problem in equation (4.5).


4.5 Objective Functions

Objective functions guide the search for failure trajectories. In general, a good
objective function should output lower values for trajectories that are closer to a
failure. The specific measure of closeness used is dependent on the application.
For example, in the aircraft collision avoidance problem, we may use the vertical
miss distance between the aircraft as the objective value.

4.5.1 Temporal Logic Robustness


If ψ is specified using a temporal logic formula, we can use its robustness measure
(see section 3.5.2) as an objective function such that f (τ ) = ρ(τ, ψ). Note that
τ itself is a function of the initial state and disturbance trajectory, and we can
also write this objective function as f (s, x) = ρ(Rollout(s, x), ψ). Algorithm 4.7
implements this objective function given a system and a specification.

Algorithm 4.7. Temporal logic robustness objective. The function takes in a vector of real values x, a system sys, and a specification ψ. It returns the smoothed robustness of the resulting trajectory. If smoothness is set to 0, it returns the robustness. The vector x contains information about the initial state and disturbances. The extract function extracts an initial state and disturbance trajectory from x and is system specific.

function robustness_objective(x, sys, ψ; smoothness=0.0)
    s, 𝐱 = extract(sys.env, x)
    τ = rollout(sys, s, 𝐱)
    𝐬 = [step.s for step in τ]
    return robustness(𝐬, ψ.formula, w=smoothness)
end

Since most optimization algorithms operate on a vector of real values, algorithm 4.7 takes in a vector of real values containing information about the initial state and disturbances. The first step inside the objective function is to extract the initial state and disturbance trajectories in a way that is system specific. Exam-
ple 4.4 demonstrates this process for the inverted pendulum system. Given these
extracted values, we can perform a rollout of the system using algorithm 4.6 and
compute the corresponding robustness. For optimization algorithms that require
gradients of the objective function, we use the smoothed robustness instead.

4.5.2 Most Likely Failure


The use of an objective function in optimization-based falsification algorithms allows us to move beyond a simple search for failures and incorporate other objectives into the search. For example, instead of finding any failure, we may want to find the most likely failure of a system.³ Determining the most likely

³ Another common objective is to find the most severe failure according to a severity metric. There may also be domain-specific objectives such as obeying traffic laws in a driving scenario.


Example 4.4. Extracting an initial state and disturbance trajectory from a vector of real values for the inverted pendulum system.

Suppose we want to compute the robustness objective for the inverted pendulum system where the initial state is always s = [0, 0]. We write the extract function as follows:
function extract(env::InvertedPendulum, x)
s = [0.0, 0.0]
𝐱 = [Disturbance(0, 0, x[i:i+1]) for i in 1:2:length(x)]
return s, 𝐱
end

The function extracts the sensor disturbances from the real-valued vector x
to create a disturbance trajectory 𝐱. It then returns the fixed initial state s and
the disturbance trajectory.

failure requires specifying the distribution over trajectories and using its prob-
ability density function to evaluate likelihoods. Assuming that the initial state
and disturbances are sampled independently from one another, the probability
density function of a trajectory distribution p is

p(τ) = p(s₁) ∏ᵢ₌₁ᵈ D(xᵢ | sᵢ, aᵢ, oᵢ)    (4.6)

where D(x | s, a, o) = Da(xa | o) Ds(xs | s, a) Do(xo | s). Algorithm 4.8 imple-


ments equation (4.6).

Algorithm 4.8. Probability density function of a trajectory distribution p. We perform computations in log space for numerical stability. We first compute the log likelihood of the initial state according to the initial state distribution. We then add the log likelihood of each disturbance in the trajectory. The first function evaluates the log likelihood of a disturbance given a disturbance distribution D and the evidence variables.

function Distributions.logpdf(D::DisturbanceDistribution, s, o, a, x)
    logp_xa = logpdf(D.Da(o), x.xa)
    logp_xs = logpdf(D.Ds(s, a), x.xs)
    logp_xo = logpdf(D.Do(s), x.xo)
    return logp_xa + logp_xs + logp_xo
end

function Distributions.pdf(p::TrajectoryDistribution, τ)
    logprob = logpdf(initial_state_distribution(p), τ[1].s)
    for (t, step) in enumerate(τ)
        s, o, a, x = step
        logprob += logpdf(disturbance_distribution(p, t), s, o, a, x)
    end
    return exp(logprob)
end


Algorithm 4.9. Objective function for finding the most likely failure. The function takes in a vector of real values x, a system sys, and a specification ψ. If the resulting trajectory is a failure, it returns the negative likelihood of the trajectory under the nominal trajectory distribution p. Otherwise, it returns the smoothed robustness of the trajectory (or the robustness if smoothness is set to 0).

function likelihood_objective(x, sys, ψ; smoothness=0.0)
    s, 𝐱 = extract(sys.env, x)
    τ = rollout(sys, s, 𝐱)
    if isfailure(ψ, τ)
        p = NominalTrajectoryDistribution(sys, length(𝐱))
        return -pdf(p, τ)
    else
        𝐬 = [step.s for step in τ]
        return robustness(𝐬, ψ.formula, w=smoothness)
    end
end

Given a trajectory distribution p, we define the most likely failure objective


(algorithm 4.9) as follows

f(τ) = ρ(τ, ψ) if τ ∈ ψ, and f(τ) = −p(τ) otherwise    (4.7)

If the input trajectory does not produce a failure, equation (4.7) uses the robust-
ness to guide the search toward any failure. If the input does produce a failure
trajectory, it uses the negative likelihood of the trajectory to guide the search
toward more likely failures. Figure 4.3 compares a search for failures with a
search for the most likely failure on the grid world problem. While the robustness
objective finds failures that move directly toward the obstacle, the most likely
failure objective finds a failure that stays close to the nominal path.
The objective function in equation (4.7) leads to multiple practical challenges.
For example, to encourage the optimization algorithm to find failures, we must
ensure that failures never have a higher objective value than successes. Since
ρ(τ, ψ) ≥ 0 and − p(τ ) ≤ 0, equation (4.7) satisfies this condition. However, p(τ )
can be very small for long trajectories, which can lead to numerical stability issues.
Using log likelihood improves numerical stability but breaks the condition that
failures never have a higher objective value than successes.
This numerical instability as well as the discontinuity at the point of a failure cre-
ates challenges for first- and second-order optimization algorithms (section 4.6).
Furthermore, while the global minimum of the objective function in equation (4.7)
corresponds to the most likely failure of the system, many optimizers are only
guaranteed to find local minima. Due to this fact and the numerical stability


Figure 4.3. A comparison of optimization-based falsification on the grid world using the robustness objective and the likelihood objective (rows: robustness and likelihood; columns: iterations 1, 5, 8, and converged). The plots show the progression of a population-based optimization algorithm, which will be discussed in the next section. The shaded gray path on the plots in the final column represents the most likely path for the system. The robustness objective finds failures that quickly move towards the obstacle, while the most likely failure objective finds a failure that stays close to the nominal path, only veering toward the obstacle at the end.

issues, other objective functions may lead to the discovery of more likely failures
in practice.
Another common objective for most likely failure analysis is

f (τ ) = ρ(τ, ψ) − λ log( p(τ )) (4.8)

where λ is a weighting parameter selected by the user (algorithm 4.10). This


objective is smooth and encourages the optimization algorithm to search simulta-
neously for trajectories that are both likely and close to failure.

Algorithm 4.10. Objective function that weights the tradeoff between robustness and likelihood. The function takes in a vector of real values x, a system sys, and a specification ψ. It returns a weighted combination of the smoothed robustness (or the robustness if smoothness is set to 0) and the negative log likelihood under the nominal trajectory distribution p.

function weighted_likelihood_objective(x, sys, ψ; smoothness=0.0, λ=1.0)
    s, 𝐱 = extract(sys.env, x)
    τ = rollout(sys, s, 𝐱)
    𝐬 = [step.s for step in τ]
    p = NominalTrajectoryDistribution(sys, length(𝐱))
    return robustness(𝐬, ψ.formula, w=smoothness) - λ * log(pdf(p, τ))
end
4.6 Optimization Algorithms

We can search for failures by applying a variety of optimization algorithms to the optimization problem in equation (4.5).⁴ Algorithm 4.11 implements

⁴ M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019.


Algorithm 4.11. The optimization-based falsification algorithm for finding failures. The algorithm first computes a system-specific objective function f from a generic objective function objective (as specified in section 4.5). It then runs the optimizer and returns the results.

struct OptimizationBasedFalsification
    objective # objective function
    optimizer # optimization algorithm
end

function falsify(alg::OptimizationBasedFalsification, sys, ψ)
    f(x) = alg.objective(x, sys, ψ)
    return alg.optimizer(f, sys, ψ)
end

optimization-based falsification given an objective and optimization algorithm. It computes the system-specific objective function f, runs the optimizer, and returns its output. Example 4.5 applies algorithm 4.11 to find failures in the inverted pendulum problem using an off-the-shelf optimization package.⁵ The choice of optimization algorithm depends on the complexity of the system under test and the level of access to the system's internal model. The rest of this section outlines several categories of optimization algorithms and compares their advantages and disadvantages in the context of falsification.

⁵ Off-the-shelf optimization packages provide implementations of a variety of optimization algorithms. One such package in the Julia ecosystem is Optim.jl.

One category of optimization techniques is local descent methods. Local descent methods start from an initial design point and incrementally improve it until some convergence criterion is met. At each iteration, they use a local model of the objective function at the current design point to determine a direction of improvement. They then take a step in this direction to compute the next design point. Some methods use the gradient or Hessian of the objective function with respect to the current design point to create the local model. These methods are called first-order and second-order methods, respectively. Figure 4.4 shows the result of applying a first-order method called gradient descent to find failures for the inverted pendulum example.

Figure 4.4. First-order method applied to falsify the inverted pendulum example. The plot shows successive iterations of the algorithm, with darker trajectories indicating later iterations. Failures are highlighted in red. The algorithm gets closer to a failure with each iteration until it eventually begins to find failures.

While the gradient and Hessian provide a very powerful signal for optimization algorithms, they are not always available.⁶ Some simulators do not provide access to the internal model of the system, making exact computation of the gradient infeasible. We often refer to such simulators as black-box simulators. Another category of optimization algorithms called direct methods is better suited for systems with black-box simulators. They traverse the input space using only information from function evaluations, eliminating the need for access to the system's internal model.

⁶ Gradient information is a strong enough signal to effectively optimize machine learning models with billions of parameters.
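As a concrete illustration of a direct method, the following sketch performs a simple random local search using only objective evaluations, so it can be applied to black-box simulators. It follows the same optimizer interface as the lbfgs function in example 4.5; the design-point dimension, step size, and iteration budget are illustrative assumptions rather than values from the text.

# Simple random local search: propose a perturbation and keep it if it
# lowers the objective. Returns any failure trajectories encountered.
function random_search(f, sys, ψ; n=42, k_max=1000, step_size=0.1)
    x = zeros(n)
    fx = f(x)
    τs = []
    for k in 1:k_max
        x′ = x .+ step_size .* randn(n)
        fx′ = f(x′)
        if fx′ < fx
            x, fx = x′, fx′
        end
        push!(τs, rollout(sys, extract(sys.env, x)...))
    end
    return filter(τ -> isfailure(ψ, τ), τs)
end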


Example 4.5. Applying a second-order method called L-BFGS to falsify the inverted pendulum example. We use the open-source implementation of L-BFGS in the Optim.jl package. The plot shows the trajectory of the pendulum for the initial point (green) and the failure trajectory discovered after one iteration (red). For more information on the L-BFGS algorithm, see J. Nocedal, “Updating Quasi-Newton Matrices with Limited Storage,” Mathematics of Computation, vol. 35, no. 151, pp. 773–782, 1980.

The Optim.jl package provides implementations of several optimization algorithms. In this example, we show how to use the Optim.jl implementation of a second-order method called L-BFGS to falsify the inverted pendulum system. We define the optimizer function for algorithm 4.11 and run the algorithm using the robustness objective as follows:

using Optim
function lbfgs(f, sys, ψ)
    x₀ = zeros(42)
    alg = Optim.LBFGS()
    options = Optim.Options(store_trace=true, extended_trace=true)
    results = optimize(f, x₀, alg, options; autodiff=:forward)
    τs = [rollout(sys, extract(sys.env, iter.metadata["x"])...)
          for iter in results.trace]
    return filter(τ->isfailure(ψ, τ), τs)
end
objective(x, sys, ψ) = robustness_objective(x, sys, ψ, smoothness=1.0)
alg = OptimizationBasedFalsification(objective, lbfgs)
failures = falsify(alg, inverted_pendulum, ψ)

In this implementation, we are optimizing over a disturbance trajectory with depth d = 21. Since each sensor disturbance is two-dimensional, the length of each design point is 42. The lbfgs function starts with an initial design point of all zeros, specifies options to store the results of each iteration, and runs the algorithm using ForwardDiff.jl to compute gradients. It then extracts the initial state and disturbance trajectory from each iteration and performs a rollout of the system. Finally, it filters the resulting trajectories to return failure trajectories. It is important that we specify the objective as smoothed robustness so that the gradients are well-defined. The plot on the right shows the progression of the algorithm. L-BFGS converges to a failure trajectory after a single iteration.


Local descent methods often get stuck in local optima. Population methods at-
tempt to overcome this drawback by performing optimization using a collection of
design points. The points in a population are sometimes referred to as individuals.
Population methods begin with an initial population that is spread out over the
design space. At each iteration, they use the current function value of each indi-
vidual to move the population toward the optimum. Because population methods
spread samples over the entire design space rather than incrementally improving
a single point, they may find a more diverse set of failures. For example, the
population method in figure 4.5 is able to find failures for the pendulum in both
directions. High-dimensional problems with long time horizons may require a
large number of samples to cover the design space. However, population methods
are often easy to parallelize, which can improve efficiency.
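As a sketch of a population method (a cross-entropy-style search) that fits the optimizer interface of algorithm 4.11, the following could be used; the design-point dimension, population sizes, and the diagonal Gaussian search distribution are illustrative assumptions rather than values from the text.

using Statistics

# Cross-entropy-style population search: sample a population, keep the
# lowest-objective individuals, and refit a diagonal Gaussian to them.
function cross_entropy(f, sys, ψ; n=42, k_max=20, m=100, m_elite=10)
    μ, σ = zeros(n), ones(n)
    τs = []
    for k in 1:k_max
        X = [μ .+ σ .* randn(n) for _ in 1:m]
        elite = X[sortperm([f(x) for x in X])[1:m_elite]]
        E = reduce(hcat, elite)
        μ, σ = vec(mean(E, dims=2)), vec(std(E, dims=2)) .+ 1e-3
        append!(τs, [rollout(sys, extract(sys.env, x)...) for x in elite])
    end
    return filter(τ -> isfailure(ψ, τ), τs)
end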

Figure 4.5. Population method applied to falsify the inverted pendulum example (panels: iterations 1, 5, and 10, showing θ (rad) versus time). The plots show the trajectories for the individuals in the population at three iterations. Failures are highlighted in red. The individuals in the population get closer to a failure with each iteration, and the algorithm finds trajectories that fail in both the positive and negative direction.

4.7 Summary

• Monte Carlo falsification requires 1/pfail samples on average to find a failure, which can be computationally expensive for systems with rare failure events.

• Optimization-based falsification algorithms use optimization techniques to find failures more efficiently.

• Disturbances are a useful concept for optimization-based falsification algorithms, and we can reformulate the distribution over trajectories for a system as a distribution over initial states and disturbances.


• We can formulate the falsification problem as an optimization problem by defining an objective function and optimizing over initial states and disturbances.

• We can apply a variety of optimization algorithms to search for failures of a system, and the choice of algorithm depends on the problem complexity and the availability of the system's internal model.

5 Falsification through Planning

The methods in the previous chapter find counterexamples by performing op-


timization over full trajectories. In many cases, we can increase efficiency by
considering a sequence of partial trajectories. In particular, this chapter discusses
methods that use planning algorithms to account for the temporal aspect of the
problem. Planning techniques break the falsification problem into a sequence of
smaller problems. We discuss several categories of planning algorithms that rely
on optimization, search, and reinforcement learning.

5.1 Shooting Methods

Shooting methods use optimization to find a feasible path between two points,¹ and they can be used in the context of falsification to produce feasible failure trajectories. These methods break the trajectory optimization problem into a set of smaller problems by optimizing over a sequence of trajectory segments. A trajectory τ can be partitioned into n segments such that τ = (τ₁, . . . , τₙ). Each trajectory segment τᵢ is defined by an initial state sᵢ and a sequence of disturbances xᵢ of length dᵢ. Given sᵢ and xᵢ, we can compute the resulting trajectory τᵢ by performing a rollout.

¹ The term shooting method is based on the analogy of shooting at a target from a cannon. Shooting methods start at an initial point and "shoot" trajectories toward a target point until a feasible path between the initial point and target is found. Shooting methods originated from research on boundary value problems. A more detailed review with an implementation can be found in section 18.1 of the reference by W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.

The defect between two trajectory segments is the distance between the final state of the first segment and the initial state of the second segment. A set of trajectory segments forms a feasible trajectory if the defect of all consecutive trajectory segments is 0. In other words, the final state of τᵢ must match the initial state of τᵢ₊₁ for all i ∈ {1, . . . , n − 1}. This requirement leads to the following

optimization problem
minimize_{s₁, x₁, . . . , sₙ, xₙ}   f(τ₁, . . . , τₙ)
subject to   τᵢ = Rollout(sᵢ, xᵢ) for all i ∈ {1, . . . , n}    (5.1)
             Defect(τᵢ, τᵢ₊₁) = 0 for all i ∈ {1, . . . , n − 1}


where f is the falsification objective (see section 4.5). If n = 1, the optimization
problem is equivalent to optimizing over the entire trajectory, and equation (5.1)
reduces to equation (4.5). This process is referred to as single shooting. For n > 1,
this process is referred to as multiple shooting.
Multiple shooting seemingly increases the complexity of the optimization
problem by adding more variables and constraints, but it can actually improve
efficiency, especially for systems in which small changes in the inputs applied at
the beginning of a trajectory have a significant effect on the end of the trajectory.
For example, consider the problem of finding a path through a maze where one
wrong turn at the beginning of the trajectory could ultimately lead to a dead end.
If we use single shooting, we must optimize over the entire path at once. If we use
multiple shooting, we can break the path into segments that focus on different
regions of the maze.
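To make the segment parameterization concrete, the following sketch shows one way the system-specific extract function used by algorithm 5.1 might unpack segments for the inverted pendulum. The packing (two state values followed by d two-dimensional sensor disturbances per segment), the function name, and the default sizes are illustrative assumptions, not code from the text.

# Unpack n shooting segments from a flat design vector x. Each segment
# provides its own initial state and a disturbance trajectory of length d.
function extract_segments(env::InvertedPendulum, x; n=4, d=10)
    len = 2 + 2d                       # values per segment
    segments = []
    for i in 1:n
        seg = x[(i-1)*len+1 : i*len]
        s = seg[1:2]                   # segment initial state [θ, ω]
        𝐱 = [Disturbance(0, 0, seg[j:j+1]) for j in 3:2:len]
        push!(segments, (; s, 𝐱))
    end
    return segments
end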

Algorithm 5.1. Temporal logic robustness objective for multiple shooting. The function takes in a vector of real values x, a system sys, and a specification ψ and returns the objective in equation (5.2). The smoothness parameter controls the smoothness of the robustness function, and λ controls the weighting of the defect penalty. The defect function computes the defect between two trajectory segments. The extract function extracts the trajectory segments from the vector x and is system specific.

defect(τᵢ, τᵢ₊₁) = norm(τᵢ₊₁[1].s - τᵢ[end].s)

function shooting_robustness(x, sys, ψ; smoothness=0.0, λ=1.0)
    segments = extract(sys.env, x)
    n = length(segments)
    τ_segments = [rollout(sys, seg.s, seg.𝐱) for seg in segments]
    τ = vcat(τ_segments...)
    𝐬 = [step.s for step in τ]
    ρ = smooth_robustness(𝐬, ψ.formula, w=smoothness)
    defects = [defect(τ_segments[i], τ_segments[i+1]) for i in 1:n-1]
    return ρ + λ*sum(defects)
end
poses challenges for many optimization algorithms. In practice, we instead incor-
porate it as a soft constraint by adding it as a penalty in the objective:
n −1
minimize
s1 ,x1 ,...,sn ,xn
f (τ1 , . . . , τn ) + λ ∑ Defect(τi , τi+1 ) (5.2)
i =1
subject to τi = Rollout(si , xi ) for all i ∈ {1, . . . , n}


Figure 5.1. Multiple shooting applied to the continuum world example to find a path from an initial point to the obstacle (panels: iterations 2, 4, 6, 8, and converged). We use four trajectory segments, and the colors denote which segment end points should connect. The plots show the trajectory segments at different iterations of the L-BFGS optimization algorithm.

where λ is a weighting parameter that controls how heavily the defect is penalized.
Algorithm 5.1 implements this objective when f is the temporal logic robustness.
We can apply any of the optimization algorithms discussed in section 4.6 to the
optimization problem in equation (5.2). Compared to the optimization problems
in the previous chapter, minimizing the defect between the trajectory segments
adds complexity to the problem. This added complexity can make it more diffi-
cult to find a feasible failure trajectory. Figure 5.1 shows an example that uses a
gradient-based optimization technique called L-BFGS² to find a failure trajectory for the continuum world problem. For systems with black-box simulators, the direct methods described in section 4.6 may struggle to find feasible failure trajectories. Instead, we can use direct methods that were designed specifically for multiple shooting.³

² J. Nocedal, “Updating Quasi-Newton Matrices with Limited Storage,” Mathematics of Computation, vol. 35, no. 151, pp. 773–782, 1980.

³ For an example of a multiple shooting algorithm designed for systems with black-box simulators, see A. Zutshi, J. V. Deshmukh, S. Sankaranarayanan, and J. Kapinski, “Multiple Shooting, CEGAR-Based Falsification for Hybrid Systems,” in International Conference on Embedded Software, 2014.

5.2 Tree Search

Tree search algorithms iteratively construct a tree structure that represents the space of possible trajectories. Each node in the tree represents a state, and each
edge represents a transition between states that is the result of applying a partic-
ular disturbance. Each path through the tree corresponds to a feasible trajectory
for the system. Tree search algorithms start in an initial state and iteratively grow
the tree in an attempt to find feasible failure trajectories.
At each iteration, these algorithms perform the steps illustrated in figure 5.2.
They first select a node from the tree to extend. This selection is typically based
on a heuristic designed to grow the tree toward failures. Next, they extend the
selected node by choosing a disturbance and adding a new child node at the
resulting next state. We can terminate the algorithm after a fixed number of
iterations or when a failure trajectory is discovered.


Algorithm 5.2 implements the generic tree search algorithm. It runs for a fixed
number of iterations before returning all failures in the tree. Algorithm 5.3 extracts
failure trajectories from a tree by enumerating all paths in the tree and checking
for failures. Specific implementations of tree search algorithms differ in how they
implement the select and extend functions. We discuss two categories of tree
search algorithms in the next two sections.

Figure 5.2. One iteration of tree search (panels: current tree, select, and extend). The nodes of the tree represent states s. Given a disturbance x, we produce an observation o and an action a that lead us to the next node. The edges represent these transitions. The algorithm first selects a node from the tree to extend. It then chooses a disturbance to extend the selected node and adds the resulting next state as a new child.

Algorithm 5.2. Generic tree search algorithm for finding failure trajectories. The algorithm starts by initializing a tree with a single node. It then iteratively selects a node from the tree using the select function and adds to its children using the extend! function. After k_max iterations, the algorithm returns the set of failure trajectories in the tree.

abstract type TreeSearch end

function falsify(alg::TreeSearch, sys, ψ)
    tree = initialize_tree(alg, sys)
    for i in 1:alg.k_max
        node = select(alg, sys, ψ, tree)
        extend!(alg, sys, ψ, tree, node)
    end
    return failures(tree, sys, ψ)
end

5.3 Heuristic Search

Some tree search algorithms use heuristics to explore the space of possible tra-
jectories. The rapidly exploring random trees (RRT) algorithm, for example, uses
heuristics to iteratively extend the search tree toward randomly selected states in


Algorithm 5.3. Functions for extracting failure trajectories from a tree. The failures function first finds the leaves of the tree and extracts the corresponding trajectory from each leaf using the trajectory function. The trajectory function starts at a leaf node and propagates backward through the tree to construct the full trajectory. The failures function then filters these trajectories for failures and returns the result.

function trajectory(node)
    τ = []
    while !isnothing(node.parent)
        pushfirst!(τ, (s=node.parent.state, node.edge...))
        node = node.parent
    end
    return τ
end

function failures(tree, sys, ψ)
    leaves = filter(node -> isempty(node.children), tree)
    τs = [trajectory(node) for node in leaves]
    return filter(τ -> isfailure(ψ, τ), τs)
end

the state space.⁴ In the context of falsification, we use RRT to efficiently explore the space of possible disturbance trajectories in search of a failure trajectory.

⁴ RRT was designed to efficiently enumerate trajectories in high-dimensional spaces, particularly for systems with complex dynamics. The algorithm was originally proposed in the context of robotic path planning. For more information on path planning algorithms, see S. LaValle, “Planning Algorithms,” Cambridge University Press, vol. 2, pp. 3671–3678, 2006.

Algorithm 5.4 implements the select and extend steps for the RRT algorithm. In the select step, RRT randomly samples a goal state and computes an objective value for each node in the current tree based on the sampled goal state. This objective is typically related to the distance between each node and the goal state. The algorithm then selects the node with the lowest objective value to pass to the extend step. In the extend step, RRT selects a disturbance, simulates one step
forward in time from the selected node, and adds the resulting edge and child
node to the tree.
Several variants of RRT differ in how they sample goal states, compute objec-
tives, and select disturbances. Algorithm 5.5 implements a version of the RRT
algorithm that samples goal states uniformly from the state space. It then uses
the Euclidean distance between each node and the goal state as the objective.
In the extend step, the disturbance is randomly sampled from the nominal dis-
turbance distribution for the system. Example 5.1 applies this algorithm to the
continuum world problem.

5.3.1 Goal Heuristics


Algorithm 5.5 uses the sampled goal state to select which node in the tree to
extend, but it does not use the goal state when selecting the disturbance used to
extend the selected node. We can improve the performance of the algorithm by


Algorithm 5.4. The rapidly exploring random trees algorithm. The algorithm is a type of tree search algorithm and implements both the select and extend! functions. The select function samples a goal state according to the sample_goal function and computes an objective value for each node in the tree using the compute_objectives function. It then selects the node with the lowest objective value, sets its goal state, and returns it. The extend! function selects a disturbance according to the select_disturbance function, simulates one step forward in time, and adds the results to the tree in the form of a new child node.

struct RRT <: TreeSearch
    sample_goal          # sgoal = sample_goal(tree)
    compute_objectives   # objectives = compute_objectives(tree, sgoal)
    select_disturbance   # x = select_disturbance(sys, node)
    k_max                # number of iterations
end

mutable struct RRTNode
    state       # node state
    parent      # parent node
    edge        # (o, a, x)
    children    # vector of child nodes
    goal_state  # current goal state
end

function initialize_tree(alg::RRT, sys)
    return [RRTNode(rand(Ps(sys.env)), nothing, nothing, [], nothing)]
end

function select(alg::RRT, sys, ψ, tree)
    sgoal = alg.sample_goal(tree)
    objectives = alg.compute_objectives(tree, sgoal)
    node = tree[argmin(objectives)]
    node.goal_state = sgoal
    return node
end

function extend!(alg::RRT, sys, ψ, tree, node)
    x = alg.select_disturbance(sys, node)
    o, a, s′ = step(sys, node.state, x)
    snew = RRTNode(s′, node, (; o, a, x), [], nothing)
    push!(node.children, snew)
    push!(tree, snew)
end

Algorithm 5.5. Functions for the RRT algorithm. The first function samples a goal state uniformly from the state space. The lo and hi inputs specify the lower and upper bounds of the state variables. The second function computes the Euclidean distance between each node in the tree and the goal state. The third function samples a disturbance from the nominal disturbance distribution for the system.

random_goal(tree, lo, hi) = rand.(Distributions.Uniform.(lo, hi))

function distance_objectives(tree, sgoal)
    return [norm(sgoal .- node.state) for node in tree]
end

function random_disturbance(sys, node)
    D = DisturbanceDistribution(sys)
    o, a, s′, x = step(sys, node.state, D)
    return x
end


Example 5.1. Basic RRT applied to the continuum world example. The plots show snapshots of the search tree after 5, 15, and 100 iterations. The stars show the next goal state and highlighted nodes show the node selected to extend next.
Suppose we want to apply RRT to search for failures for the continuum world system. We can use the following code to run the basic RRT algorithm for 100 iterations.

select_goal(tree) = random_goal(tree, [0.0, 0.0], [10.0, 10.0])
compute_objectives(tree, sgoal) = distance_objectives(tree, sgoal)
select_disturbance(sys, node) = random_disturbance(sys, node)
alg = RRT(select_goal, compute_objectives, select_disturbance, 100)
failures = falsify(alg, cw, ψ)

The plots below show two snapshots of the search tree after 5 and 15 iterations as well as the final tree after 100 iterations. After 100 iterations, RRT did not find any failure trajectories. Although goal states are sampled throughout the state space, the disturbances are sampled from the nominal disturbance distribution. Since the nominal disturbance distribution represents only small deviations from the nominal path, the tree closely follows the nominal path toward the goal. We can improve the performance of the tree search using the heuristics discussed in section 5.3.1.
(Plots: Iteration 5, Iteration 15, Iteration 100.)


using the goal state to select the disturbance. Specifically, we want to select the
disturbance that leads to the next state that is closest to the goal state.

Algorithm 5.6. Function for selecting a disturbance that leads to the next state that is closest to the goal state. The algorithm takes m steps using the nominal disturbance distribution. It then computes the distances between the next state from each step and the goal state. It returns the disturbance that resulted in the lowest distance.

function goal_disturbance(sys, node; m=10)
    D = DisturbanceDistribution(sys)
    steps = [step(sys, node.state, D) for i in 1:m]
    distances = [norm(node.goal_state - step.s′) for step in steps]
    return steps[argmin(distances)].x
end

Algorithm 5.6 uses sampling to search for a disturbance that results in a next state that is close to the goal state. It draws m samples from the nominal disturbance distribution and simulates one step forward in time from the current node using each sample. It then returns the disturbance that results in the next state that is closest to the goal state. As m increases, the performance of the algorithm improves but at a greater computational cost.5 Figure 5.3 demonstrates this process on one step of RRT for the continuum world problem.
5. To improve performance and efficiency, more sophisticated optimization algorithms can also be used (see section 4.6).

Figure 5.3. One iteration of RRT applied to the continuum world example using algorithm 5.6 with m = 10 to select the disturbance in the extend step. The algorithm selects the node that is closest to the goal state to extend and samples 10 disturbances to add to the tree. It then selects the disturbance that results in the next state that is closest to the goal state and adds the resulting edge and child node to the tree.

In addition to improving the extend step of algorithm 5.5, we can improve the select step by modifying how we sample the goal state. Instead of sampling
the goal state uniformly from the state space, we can use a heuristic to sample
goal states that are more likely to grow the tree toward failures. One technique
is to identify a failure region in the state space and sample goal states from this
region. A failure region is a region such that any trajectory that passes through
this region is a failure trajectory. For example, the failure region in the continuum
world problem is the set of states within the red obstacle. This technique is limited
to systems with specifications that depend only on the state. For specifications of


Example 5.2. Example of sampling goal states from the failure region of the continuum world problem.
The failure region for the continuum world example is the set of states within the red obstacle, which is a circle centered at (4.5, 4.5) with radius 0.5. We can sample from this region using the following code:

function failure_goal(tree)
    r = rand(Uniform(0, 0.5))
    θ = rand(Uniform(0, 2π))
    return [4.5, 4.5] .+ [r*cos(θ), r*sin(θ)]
end

The code uniformly samples a radius between 0 and 0.5 and an angle between 0 and 2π. It then converts these samples to a state in the failure region.

temporal properties, identifying a failure region is not possible without augmenting the state space. Figure 5.4 shows the result of using this heuristic along with algorithm 5.6 to apply RRT to the continuum world problem.

5.3.2 Coverage Heuristics


To uncover a variety of ways in which a system might fail, it is important that we explore a diverse set of trajectories. We incorporate this idea into RRT using heuristics that are designed to maximize coverage of the state space. We assess coverage using coverage metrics, which measure how well a set of samples fills a given space. In the context of tree search, we are interested in how well the states represented by the nodes in the tree fill the state space. We can then use these coverage metrics in the select step to select the next goal state.
Figure 5.4. RRT applied to the continuum world problem using algorithm 5.6 with m = 10 to select the disturbance in the extend step and goal states sampled from the failure region. The algorithm was run for 100 iterations and discovered the failure trajectories highlighted in red.
One common coverage metric is related to the concept of dispersion. The dispersion of a set of points V in the bounded region S is the radius of the largest ball that can be placed in S such that no point in V lies within the ball, written as

$$ \text{dispersion} = \sup_{s \in \mathcal{S}} \left( \min_{s_i \in V} \lVert s - s_i \rVert \right) \tag{5.3} $$

where the outer optimization represents the supremum. A supremum is a generalization of a maximum that allows solutions to exist when the largest ball merely approaches a particular size before containing one of the points in V. The norm in equation (5.3) can be any norm. A common choice is the ℓ2-norm. Coverage is inversely related to dispersion. In other words, a set of points with high dispersion will have low coverage (see figure 5.5).
Figure 5.5. Visualization of dispersion for two different sets of 10 points. The blue set does not fill the space as well as the green set. We can find a larger ball that does not contain any points in the blue set than we can for the green set. Therefore, the blue set has higher dispersion and lower coverage.
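To make equation (5.3) concrete, the following sketch (not from the text) approximates the dispersion of a set of points by replacing the supremum over S with a maximum over uniformly sampled candidate centers. The bounds lo and hi and the sample count m are placeholders.

using Distributions, LinearAlgebra

# Approximate the dispersion in equation (5.3): for each of m sampled
# candidate centers, compute the distance to the nearest point in the set,
# and return the largest such distance. This underestimates the true supremum.
function approx_dispersion(points, lo, hi; m=10_000)
    candidates = [rand.(Distributions.Uniform.(lo, hi)) for _ in 1:m]
    return maximum(minimum(norm(s .- p) for p in points) for s in candidates)
end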


Since dispersion considers only the largest ball that can be placed in S, it tends to be a conservative measure of coverage. Furthermore, it is difficult to compute for high-dimensional spaces. An approximate metric called average dispersion overcomes these drawbacks.6 Average dispersion is computed on a grid of n points with spacing δ in each dimension. It is calculated as

$$ \text{average dispersion} = \frac{1}{n} \sum_{j=1}^{n} \frac{\min(d_j(V), \delta)}{\delta} \tag{5.4} $$

where d_j(V) is the distance from the jth grid point to the nearest point in V.
6. J. M. Esposito, J. Kim, and V. Kumar, “Adaptive RRTs for Validating Hybrid Robotic Control Systems,” in Algorithmic Foundations of Robotics, Springer, 2005, pp. 107–121.

Algorithm 5.7. Algorithm for computing average dispersion of a set of points on a space bounded by lo and hi. It uses a grid specified by lengths, which contains the number of grid points in each dimension. The algorithm first normalizes the points to lie in the unit hypercube. It then creates the grid over the unit hypercube and computes the average dispersion using equation (5.4).

function average_dispersion(points, lo, hi, lengths)
    points_norm = [(point .- lo) ./ (hi .- lo) for point in points]
    ranges = [range(0, 1, length) for length in lengths]
    δ = minimum(Float64(r.step) for r in ranges)
    grid_dispersions = []
    for grid_point in Iterators.product(ranges...)
        dmin = minimum(norm(grid_point .- p) for p in points_norm)
        push!(grid_dispersions, min(dmin, δ) / δ)
    end
    return mean(grid_dispersions)
end

Algorithm 5.7 computes average dispersion given a set of points and a bounded region. The term in the numerator of equation (5.4) is the radius of the largest ball centered at each grid point that does not contain any points in V or other grid points. Dividing by δ ensures that the values for average dispersion range between 0 and 1, and subtracting the average dispersion from 1 results in a coverage metric that ranges between 0 and 1.7 Figure 5.6 shows the difference between dispersion and average dispersion.
7. The average dispersion coverage metric will be 1 if V contains all of the grid points. A finer grid will result in better coverage estimates but at a greater computational cost.
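As a hypothetical usage of algorithm 5.7 (the point sets, bounds, and grid size below are arbitrary choices, not from the text), we can compare a clustered and a spread-out set of points in the unit square:

coverage(V) = 1 - average_dispersion(V, [0.0, 0.0], [1.0, 1.0], [10, 10])

points_clustered = [0.1 .+ 0.05*rand(2) for _ in 1:10]   # points bunched near a corner
points_spread = [rand(2) for _ in 1:10]                  # points scattered over the square

coverage(points_clustered)   # typically low coverage (high dispersion)
coverage(points_spread)      # typically higher coverage (lower dispersion)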
Another common coverage metric is discrepancy. The key insight behind discrepancy is that if a set of points covers a space evenly, then a randomly chosen subset of the space should contain a fraction of samples proportional to the fraction of volume occupied by the subset. Discrepancy is defined in terms of the worst-case hyperrectangular subset:

$$ \text{discrepancy} = \sup_{\mathcal{H} \subseteq \mathcal{S}} \left| \frac{\#(V \cap \mathcal{H})}{\#(V)} - \frac{\operatorname{vol}(\mathcal{H})}{\operatorname{vol}(\mathcal{S})} \right| \tag{5.5} $$


Figure 5.6. Visualization of the difference between dispersion and average dispersion on a set of points. While dispersion finds the largest ball that does not contain any points in the set, average dispersion operates on a grid. The gray dots indicate grid points, and the circles show the largest ball that does not contain any grid points or points in the set. The average dispersion is the average of the radii of these circles normalized by the grid spacing.

where H is a hyperrectangular subset of S and #(V ∩ H) and #(V) are the number of points in V that lie in H and the total number of points in V respectively. We use vol(H) and vol(S) to denote the n-dimensional volume of H and S respectively, which can be obtained by multiplying the side lengths.
The worst-case hyperrectangle that determines the discrepancy of a set of points is typically a small region containing many points or a large region with few points. Figure 5.7 visualizes the discrepancy metric. Discrepancy approaches 1 when all points overlap and approaches 0 when all possible hyperrectangular subsets have their proper share of points. In general, discrepancy is difficult to compute exactly, especially in high dimensions.
Figure 5.7. Visualization of the discrepancy metric. The rectangles indicate two candidates for the worst case rectangle used to define discrepancy. Discrepancy is determined by a rectangle with small area and many points (top) or a rectangle with large area and few points (bottom).
Star discrepancy is a special case of discrepancy that is easier to compute and is often used in practice. Instead of considering all possible hyperrectangular subsets, star discrepancy considers only hyperrectangular subsets of the unit hypercube that have a vertex at the origin. We can always normalize any hyperrectangular space S to the unit hypercube by dividing by the side length in each dimension. Given these constraints, it is possible to compute lower and upper bounds on star discrepancy.8
8. E. Thiémard, “An Algorithm to Compute Bounds for the Star Discrepancy,” Journal of Complexity, vol. 17, no. 4, pp. 850–880, 2001. Examples of other approximations can be found in Y.-D. Zhou, K.-T. Fang, and J.-H. Ning, “Mixture Discrepancy for Quasi-Random Point Sets,” Journal of Complexity, vol. 29, no. 3-4, pp. 283–301, 2013.
We first partition the unit hypercube B into a finite number of


subrectangles h ∈ Π. We then compute the bounds as

$$ \text{upper} = \max_{h \in \Pi} \max\!\left( \frac{\#(V \cap h^+)}{\#(V)} - \frac{\operatorname{vol}(h^-)}{\operatorname{vol}(\mathcal{B})},\; \frac{\operatorname{vol}(h^+)}{\operatorname{vol}(\mathcal{B})} - \frac{\#(V \cap h^-)}{\#(V)} \right) $$
$$ \text{lower} = \max_{h \in \Pi} \max\!\left( \left| \frac{\#(V \cap h^-)}{\#(V)} - \frac{\operatorname{vol}(h^-)}{\operatorname{vol}(\mathcal{B})} \right|,\; \left| \frac{\#(V \cap h^+)}{\#(V)} - \frac{\operatorname{vol}(h^+)}{\operatorname{vol}(\mathcal{B})} \right| \right) \tag{5.6} $$

where h+ and h− are hyperrectangular subsets derived from subrectangle h as shown in figure 5.8.
Figure 5.8. Visualization of the hyperrectangular subsets for subrectangle h used to compute upper and lower bounds on star discrepancy.
The tightness of the upper and lower bounds in equation (5.6) depends on the resolution of the partition. Finer partitions will lead to tighter bounds at a greater computational cost. Algorithm 5.8 computes upper and lower bounds on star discrepancy using equation (5.6) given a set of points and a bounded region. We can subtract the value of star discrepancy from 1 to provide a coverage metric that ranges between 0 and 1. Figure 5.9 shows the upper and lower bounds on star discrepancy for the sets of points in figure 5.5 as the resolution of the partition is increased.
Figure 5.9. Upper and lower bounds on star discrepancy for the sets of points in figure 5.5. The grid resolution is the number of grid points in each dimension. The upper and lower bounds approach each other as the grid resolution increases. The green set of points is more evenly distributed than the blue set, so it has lower star discrepancy.
We can use average dispersion or star discrepancy as a metric to select the goal state in RRT. In particular, we want to select the goal state that would result in the greatest increase in coverage if added to the current tree. While it is difficult to determine the goal state exactly, we can approximate this process by drawing samples from the state space, computing the difference in coverage for each sample, and selecting the sample with the largest increase.9 The samples may be selected from a grid (figure 5.10) or drawn uniformly from the state space (example 5.3).
9. High-dimensional problems may require more sophisticated techniques. T. Dang and T. Nahhal, “Coverage-Guided Test Generation for Continuous and Hybrid Systems,” Formal Methods in System Design, vol. 34, pp. 183–213, 2009.
Coverage metrics can also be used as termination conditions. It is not always clear when to terminate tree search algorithms, especially if no failures are found. One option is to terminate the search when state space coverage is sufficient. Since not all states are necessarily reachable, coverage will not necessarily approach 1 as the number of tree search iterations increases. We therefore cannot use the magnitude of coverage as a termination condition by itself. Instead, we compute a growth metric such as

$$ \text{growth} = \frac{\text{Coverage}(V') - \text{Coverage}(V)}{\#(V') - \#(V)} \tag{5.7} $$

where V and V′ are the sets of points in the tree at the beginning and end of the current iteration.

Algorithm 5.8. Algorithm for computing upper and lower bounds on the star discrepancy of a set of points on a space bounded by lo and hi. It uses a partition specified by lengths, which contains the number of subrectangles in each dimension. The algorithm first normalizes the points to lie in the unit hypercube. It then creates the partition over the unit hypercube and computes the upper and lower bounds on star discrepancy using equation (5.6). We use the LazySets.jl package to represent hyperrectangles.

function star_discrepancy(points, lo, hi, lengths)
    n, dim = length(points), length(lo)
    𝒱 = [(point .- lo) ./ (hi .- lo) for point in points]
    ranges = [range(0, 1, length)[1:end-1] for length in lengths]
    steps = [Float64(r.step) for r in ranges]
    ℬ = Hyperrectangle(low=zeros(dim), high=ones(dim))
    lbs, ubs = [], []
    for grid_point in Iterators.product(ranges...)
        h⁻ = Hyperrectangle(low=zeros(dim), high=[grid_point...])
        h⁺ = Hyperrectangle(low=zeros(dim), high=grid_point .+ steps)
        𝒱h⁻ = length(filter(v -> v ∈ h⁻, 𝒱))
        𝒱h⁺ = length(filter(v -> v ∈ h⁺, 𝒱))
        push!(lbs, max(abs(𝒱h⁻ / n - volume(h⁻) / volume(ℬ)),
                       abs(𝒱h⁺ / n - volume(h⁺) / volume(ℬ))))
        push!(ubs, max(𝒱h⁺ / n - volume(h⁻) / volume(ℬ),
                       volume(h⁺) / volume(ℬ) - 𝒱h⁻ / n))
    end
    return maximum(lbs), maximum(ubs)
end
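Analogous to the average dispersion version shown later in example 5.3, the following sketch (not from the text; the bounds, grid, and number of candidates are arbitrary placeholders) selects the candidate goal state whose addition yields the lowest star discrepancy lower bound, and therefore the highest estimated coverage.

function star_discrepancy_goal(tree; m=5, lo=[0.0, 0.0], hi=[10.0, 10.0], lengths=[10, 10])
    points = [node.state for node in tree]
    sgoals = [rand.(Distributions.Uniform.(lo, hi)) for _ in 1:m]
    # lower bound on star discrepancy if each candidate goal were added
    lbs = [star_discrepancy([points..., sgoal], lo, hi, lengths)[1] for sgoal in sgoals]
    return sgoals[argmin(lbs)]   # lowest discrepancy corresponds to highest coverage
end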

Figure 5.10. Selecting the next goal state for RRT applied to the continuum world problem using average dispersion and star discrepancy coverage metrics. The plots show the grid points used as candidates for the next goal state. The color of each grid point indicates the increase in coverage that would result from adding that grid point to the tree, with darker colors indicating a greater increase. For star discrepancy, the colors represent the lower bound. The star indicates the goal state selected by RRT. Because star discrepancy only focuses on the worst-case hyperrectangle, it is not as smooth as average dispersion.


Example 5.3. RRT applied to the continuum world problem using coverage heuristics. The plots illustrate the effect of selecting the next goal state based on coverage rather than randomly selecting it.
Suppose we want to apply coverage heuristics when using RRT on the continuum world problem. The following code implements a version of the select_goal function that uses coverage based on average dispersion to guide the search.

function select_goal(tree; m=5)
    a, b, lengths = [0, 0], [10, 10], [10, 10]
    points = [node.state for node in tree]
    sgoals = [rand.(Distributions.Uniform.(a, b)) for _ in 1:m]
    dispersions = [average_dispersion([points..., sgoal], a, b, lengths)
                   for sgoal in sgoals]
    coverages = 1 .- dispersions
    return sgoals[argmax(coverages)]
end

We first collect the states visited so far from the nodes of the tree and sample m potential goal states uniformly from the state space. We then compute the new average dispersion if each goal state were added to the tree. The goal state that results in the greatest increase in coverage is selected. The plots show the resulting trees when using random goals and coverage-based goals. Using the coverage-based goal selection results in a wider tree that covers more of the state space.
(Plots: Random Goal, Coverage Goal.)


We can terminate the tree search when the growth metric is sufficiently small. It is important to note that this growth metric does not provide any guarantees about the coverage of the search tree. Even when growth is small, there may still be unexplored regions of the state space that are reachable under rare circumstances. For example, every state in the continuum world problem is reachable through a sequence of disturbances, but the average dispersion coverage metric plateaus at a number less than 1 (see figure 5.11). Some states are extremely unlikely to be reached under the nominal disturbance model.
Figure 5.11. The average dispersion coverage metric over iterations of RRT applied to the continuum world problem.
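The following sketch (not from the text) computes the growth metric in equation (5.7) using the average dispersion coverage from algorithm 5.7; the bounds and grid resolution passed in are placeholders chosen by the user.

coverage(V, lo, hi, lengths) = 1 - average_dispersion(V, lo, hi, lengths)

# Growth in coverage per node added between two snapshots of the tree,
# where V and V′ are the node states before and after the current iteration.
function coverage_growth(V, V′, lo, hi, lengths)
    return (coverage(V′, lo, hi, lengths) - coverage(V, lo, hi, lengths)) /
           (length(V′) - length(V))
end

One could, for example, stop the search once this quantity stays below a small threshold for several consecutive iterations.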
5.3.3 Alternative Objectives

As noted in the previous chapter, we may want to go beyond a simple search for failures and incorporate other objectives into the search process. For example, we may be interested in finding the shortest path to failure or the most likely failure. We can incorporate these objectives into RRT by modifying how we compute the objectives in the select step (algorithm 5.9).
First, we define a cost function c that maps a node to a cost of transitioning to the node from its parent. For example, the cost might be a measure of the distance between the node’s state and its parent’s state. To ensure that the tree search algorithm is still encouraged to reach the goal, all costs must be positive. The total cost of a path is the sum of the costs of all nodes in the path. Our goal is to find the path to the goal with the lowest total cost.
We compute an objective for each node consisting of two components: the total cost of the current path from the root to the node and an estimate of the remaining cost to get from the node to the goal state. The remaining cost estimate comes from a heuristic function h. One potential heuristic is the distance from the current node to the goal state. Algorithm 5.9 implements this process given a cost function and heuristic function.10 It provides default cost and heuristic functions that will guide the search toward the shortest path.
10. This algorithm is a simplified version of the RRT∗ algorithm. S. Karaman and E. Frazzoli, “Incremental Sampling-Based Algorithms for Optimal Motion Planning,” Robotics Science and Systems VI, vol. 104, no. 2, pp. 267–274, 2010.
To search for the most likely failure, we can use a cost function related to the negative log likelihood of the disturbance for the current node. We add a constant factor of the maximum possible log likelihood according to the disturbance distribution to ensure that the cost is positive. For the heuristic function, we need to estimate the log likelihood of the remaining path required to reach the goal state. One option is to use the distance to the goal state as a proxy for this value, since longer remaining paths tend to result in lower log likelihoods. Adding a scaling


factor to the cost function to balance between the heuristic and cost may improve performance. Figure 5.12 shows the results from using RRT to find the shortest path to failure and most likely failure for the continuum world problem.
Figure 5.12. The nominal path for the continuum world problem compared to the shortest path to failure and the most likely failure path found by RRT. The most likely failure path stays closer to the nominal path before moving toward the obstacle.

Algorithm 5.9. Algorithm for computing objectives based on a cost function c and heuristic function h. The algorithm first traverses the tree and accumulates the total cost of each node. It then computes the heuristic for each node and adds it to the total cost to get the objective values. We supply default implementations of the cost and heuristic functions that will encourage RRT to search for the shortest path.

distance_c(node) = norm(node.parent.state .- node.state)
distance_h(node, sgoal) = norm(sgoal .- node.state)

function cost_objectives(tree, sgoal; c=distance_c, h=distance_h)
    costs = Dict()
    queue = [tree[1]]
    while !isempty(queue)
        node = popfirst!(queue)
        if isnothing(node.parent)
            costs[node] = 0.0
        else
            costs[node] = c(node) + costs[node.parent]
        end
        for child in node.children
            push!(queue, child)
        end
    end
    heuristics = [h(node, sgoal) for node in tree]
    objectives = [costs[node] for node in tree] .+ heuristics
    return objectives
end

While algorithm 5.9 will often find a low cost path to failure, it is not necessarily guaranteed to find the path with the lowest possible cost. Certain conditions on the nature of the problem and the heuristic function are required to guarantee optimality. Algorithm 5.9 will converge to the optimal path if the state space and disturbance space are discrete and the heuristic function is admissible.11 A heuristic is admissible if it is guaranteed to never overestimate the cost of reaching the goal state. In shortest path problems, the straight-line distance to the goal state is an admissible heuristic. Example 5.4 demonstrates this result on the grid world problem.
11. When these conditions are met, the algorithm is the same as the A∗ search algorithm. P. E. Hart, N. J. Nilsson, and B. Raphael, “A Formal Basis for the Heuristic Determination of Minimum Cost Paths,” IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968.

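Returning to the most likely failure objective of section 5.3.3, the following sketch (not from the text) shows one way to plug a likelihood-based cost and heuristic into algorithm 5.9. The two-dimensional zero-mean Gaussian disturbance model, the offset ℓmax, and the scaling factor λ are all hypothetical assumptions.

using Distributions, LinearAlgebra

# Hypothetical disturbance model; in practice this would come from the system.
const Dx = MvNormal(zeros(2), 0.1^2 * Matrix(1.0I, 2, 2))
const ℓmax = logpdf(Dx, zeros(2))   # maximum possible log likelihood

likelihood_c(node) = ℓmax - logpdf(Dx, node.edge.x)           # positive cost
likelihood_h(node, sgoal; λ=1.0) = λ * norm(sgoal .- node.state)

likelihood_objectives(tree, sgoal) =
    cost_objectives(tree, sgoal; c=likelihood_c, h=likelihood_h)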
5.4 Monte Carlo Tree Search

Monte Carlo tree search (MCTS) (algorithm 5.10) is a tree search algorithm that balances between exploration and exploitation.12 It explores by selecting nodes that have not been visited many times and exploits by biasing the search tree toward paths that seem most promising. MCTS determines which paths are most
12. For a survey, see C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A Survey of Monte Carlo Tree Search Methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, 2012.


Example 5.4. Example of using RRT to find the shortest path to failure and most likely failure for the grid world problem. The plots show the search tree at different iterations of the algorithm. The most likely failure path stays closer to the nominal path (highlighted in gray) before moving toward the obstacle.
Since the state space and disturbance space for the grid world problem are discrete, we are guaranteed to find the shortest path to failure and the most likely failure path as long as we select an admissible heuristic function. For the shortest path to failure, an admissible heuristic is the Euclidean distance between the current state and the goal state. This distance will always be less than or equal to the actual cost of reaching the goal state since the shortest path between two points is a straight line. For the most likely failure path, we can use the likelihood of a straight line trajectory from the current state to the goal state assuming that it used the most likely disturbance at each step. The plots show the results. As in the continuum world problem (figure 5.12), the most likely failure path stays closer to the nominal path before moving toward the obstacle.
(Plots: shortest path and most likely path at iteration 10, iteration 25, and convergence.)


promising by maintaining a value function Q(s, x) for each node in the tree. Given a failure objective (section 4.5), Q(s, x) represents the expected future objective value when applying disturbance x from state s. MCTS searches for the path with the lowest objective value.13
13. When the objective function is the most likely failure objective, this technique is sometimes referred to as adaptive stress testing. R. Lee, O. J. Mengshoel, A. Saksena, R. W. Gardner, D. Genin, J. Silbermann, M. Owen, and M. J. Kochenderfer, “Adaptive Stress Testing: Finding Likely Failure Events with Reinforcement Learning,” Journal of Artificial Intelligence Research, vol. 69, pp. 1165–1201, 2020.
In the select step, MCTS traverses the tree starting at the root node. At each node, we determine whether to select it for the extend step based on its current number of children and number of visits N(s). Specifically, we extend the node if the number of children is less than or equal to kN(s)^α, where k and α are algorithm hyperparameters. This process is referred to as progressive widening. If the number of children exceeds this value, we continue to traverse the tree using a heuristic that balances between exploration and exploitation.

Figure 5.13. MCTS applied to find a failure in the continuum world problem. Darker nodes and edges were visited more often. MCTS finds a failure (highlighted in red) after 258 iterations. (Panels: iteration 100, iteration 200, failure found.)

A common heuristic is the lower confidence bound (LCB) (algorithm 5.11), which is defined as

$$ Q(s, x) - c \sqrt{\frac{\log N(s)}{N(s, x)}} \tag{5.8} $$

where N(s, x) is the number of times we took the path corresponding to disturbance x from the node corresponding to state s. The first term in equation (5.8) exploits our current estimate of how promising a particular path is based on the value function, and the second term is an exploration bonus. The exploration constant c controls the amount of exploration. Higher values will lead to more exploration. We move to the child node with the lowest LCB value and repeat the process until we reach a node that we can extend.


Algorithm 5.10. The Monte Carlo tree search algorithm. The algorithm is a type of tree search algorithm and implements both the select and extend! functions. The select function traverses the tree using the lcb function as a guide until it reaches a node that can be extended based on its number of children. The extend! function samples a disturbance according to the select_disturbance function and simulates the system one step forward in time from the current node. It then estimates the value at the new node using the estimate_value function and adds it to the tree. Finally, it propagates this information back up the tree to update the visit counts and mean value estimate for each node in the path.

struct MCTS <: TreeSearch
    estimate_value       # v = estimate_value(sys, ψ, node)
    c                    # exploration constant
    k                    # progressive widening constant
    α                    # progressive widening exponent
    select_disturbance   # x = select_disturbance(sys, node)
    k_max                # number of iterations
end

mutable struct MCTSNode
    state     # node state
    parent    # parent node
    edge      # (o, a, x)
    children  # vector of child nodes
    N         # visit count
    Q         # value estimate
end

function initialize_tree(alg::MCTS, sys)
    return [MCTSNode(rand(Ps(sys.env)), nothing, nothing, [], 1, 0)]
end

function select(alg::MCTS, sys, ψ, tree)
    c, k, α, node = alg.c, alg.k, alg.α, tree[1]
    while length(node.children) > k * node.N^α
        node = lcb(node, c)
    end
    return node
end

function extend!(alg::MCTS, sys, ψ, tree, node)
    x = alg.select_disturbance(sys, node)
    o, a, s′ = step(sys, node.state, x)
    Q = alg.estimate_value(sys, ψ, s′)
    snew = MCTSNode(s′, node, (; o, a, x), [], 1, Q)
    push!(node.children, snew)
    push!(tree, snew)
    while !isnothing(node)
        node.N += 1
        node.Q += (Q - node.Q) / node.N
        Q, node = node.Q, node.parent
    end
end


Algorithm 5.11. The lower confidence bound algorithm. The algorithm computes the LCB for each child node according to equation (5.8) and returns the child node with the lowest LCB.

function lcb(node::MCTSNode, c)
    Qs = [node.Q for node in node.children]
    Ns = [node.N for node in node.children]
    lcbs = [Q - c*sqrt(log(node.N)/N) for (Q, N) in zip(Qs, Ns)]
    return node.children[argmin(lcbs)]
end

In the extend step, MCTS samples a disturbance and simulates the system one
step forward in time from the current node. It then estimates the value at the
new node and adds it to the tree. A common technique to estimate this value is
to perform rollouts from the new node and evaluate their robustness. We can
also estimate the value using a heuristic such as distance to failure. Finally, we
propagate this information back up the tree to update the visit counts and mean
value estimate for each node in the path. Figure 5.13 shows the result of using
MCTS to find failures in the continuum world problem. The algorithm gradually
expands the tree toward the obstacle and visits promising nodes more often.
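The following is a sketch (not from the book) of the rollout-based value estimate described above; the rollout and robustness helpers, their signatures, and the constants passed to MCTS are all assumptions.

# Estimate the value of state s by averaging the robustness of n rollouts of
# depth d that use the nominal disturbance distribution; lower values mean
# the state is closer to failure.
function rollout_value(sys, ψ, s; n=10, d=20)
    D = DisturbanceDistribution(sys)
    return mean(robustness(rollout(sys, s, D, d), ψ.formula) for i in 1:n)
end

# Hypothetical MCTS configuration using this estimate (constants are arbitrary).
alg = MCTS(rollout_value, 0.3, 1.0, 0.5, random_disturbance, 1_000)
failures = falsify(alg, sys, ψ)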
The tree search algorithms we have presented so far assumed deterministic
transitions between nodes. In other words, simulating disturbance x from state s
will always lead to the same next state s′. However, we may not have control over
all sources of randomness for some real-world simulators, resulting in stochastic
transitions between nodes. One advantage of MCTS is that it can handle this
stochasticity. A technique called double progressive widening can be used to extend
the tree in these cases. Double progressive widening applies the progressive
widening condition to both the disturbance and next state.14
14. A. Couëtoux, J.-B. Hoock, N. Sokolovska, O. Teytaud, and N. Bonnard, “Continuous Upper Confidence Trees,” in Learning and Intelligent Optimization (LION), 2011.

5.5 Reinforcement Learning

Reinforcement learning algorithms train agents to perform a task while they interact with an environment.15 We can use reinforcement learning for falsification by training an agent to cause a system to fail. To avoid confusing the reinforcement learning agent with the agent in the system under test, we call the reinforcement learning agent an adversary.
15. For an introduction to reinforcement learning, see R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Second Edition. MIT Press, 2018.
Figure 5.14 shows the overall setup. At each time step, the adversary interacts
with the system by selecting a disturbance x. The system then steps forward in
time and produces a reward r for the adversary related to the failure objective.
We refer to a series of these time steps as an episode. Reinforcement learning


algorithms train the adversary to maximize reward using data gathered over a series of episodes. Specifically, the adversary learns a policy πadv(s) that maps states to disturbances. Once the adversary is trained, we can use it to search for failures by performing rollouts of the system using disturbances selected by the adversary’s policy.
Figure 5.14. Reinforcement learning for falsification. We train an adversary to select disturbances that will cause a system to fail. The adversary receives feedback in the form of a reward signal.
Similar to MCTS, reinforcement learning algorithms balance between exploration and exploitation. The adversary explores by trying different disturbances in each state, and exploits by selecting disturbances that are likely to lead to a failure. Typically, the adversary will explore more at the beginning of training to gather data that it can later on exploit. Reinforcement learning algorithms balance between these two objectives to maximize sample efficiency. Sample efficient algorithms require as few episodes as possible to learn an effective policy. A number of sample efficient reinforcement learning algorithms have been developed, and we can use off-the-shelf implementations of them to efficiently find failures of complex systems.16
16. Off-the-shelf reinforcement learning packages provide implementations of a variety of reinforcement learning algorithms. For example, see the Crux.jl package in the Julia ecosystem.
Another advantage of a reinforcement learning approach is its ability to generalize. The shooting methods and tree search algorithms discussed in this chapter all required a specific initial state from which to find a failure path. Using reinforcement learning to find failures removes this necessity. Because the adversary learns a policy over the entire state space, we can perform a rollout from any initial state to search for a failure. Example 5.5 demonstrates this result on the continuum world problem using an off-the-shelf reinforcement learning package.

5.6 Simulator Requirements

Selecting an appropriate falsification algorithm for a given system is often dependent on the capabilities of the system simulator. Some commercial simulators, for
example, do not provide access to their internal models. Simulators also differ in
the aspects of the simulation that the user can control. The falsification algorithms
we discussed in this chapter and the previous chapter impose different require-
ments on the system simulator. Figure 5.15 summarizes these requirements.
To apply any of the algorithms from chapter 4, the simulator must be capable of
performing a rollout. For a black-box rollout, the simulator takes as input an initial
state s and a vector of disturbances x from the user and outputs the objective value
f (τ ). For these simulators, we can perform falsification using direct sampling or
fuzzing. We can also use optimization-based falsification algorithms that only rely


Example 5.5. Example of using reinforcement learning to find failures in the continuum world problem. The plots show rollouts of the adversary policy starting from different initial states after different numbers of training episodes. Failure trajectories are highlighted in red. The adversary is able to find failures from most initial states after 50,000 training episodes. For more information on the solving code, see the Crux.jl documentation.
To apply the reinforcement learning algorithms implemented in the Crux.jl package to the continuum world problem, we need to define the following:

initial_state_dist = Product([Distributions.Uniform(0, 10),
                              Distributions.Uniform(0, 10)])
function interact(s, x, rng)
    _, _, s′ = step(cw, s, Disturbance(0, x, 0))
    r = Float32(robustness(s, ψ.formula) - robustness(s′, ψ.formula))
    norm(s′ - [4.5, 4.5]) < 0.5 ? r += 10.0 : nothing
    return (sp=s′, r=r)
end

We first define an initial state distribution that covers the entire state space, allowing us to find failures starting from any state. The interact function defines how the adversary interacts with the system. Given a state s and a disturbance x, the function simulates the system one step forward in time and returns a tuple with the next state s′ and reward r. The random number generator rng is a required input for Crux.jl but is not used in this case since the function is deterministic.
The reward is based on the change in robustness for the current step. We also add a large reward for reaching a failure state. With these definitions, we can apply any of the reinforcement learning algorithms in the Crux.jl package to find failures. The plots show rollouts of the adversary policy starting from different initial states after different numbers of training episodes using an algorithm called Proximal Policy Optimization (PPO). The adversary is able to find failures from most initial states after 50,000 training episodes.
(Plots: 5,000 episodes, 30,000 episodes, 50,000 episodes.)


Figure 5.15. Overview of simulator requirements for the various categories of falsification algorithms. The first two rows are related to the optimization-based falsification algorithms discussed in chapter 4, and the second two rows relate to the planning algorithms discussed in this chapter. Variables shown in blue are variables that the user of the simulator has control over. We cannot observe any aspects of the simulator shown in gray. The figure pairs algorithm categories with simulator requirements: direct sampling, fuzzing, direct methods, and population methods require a black-box rollout; first-order and second-order methods require a white-box rollout; reinforcement learning requires episodes; tree search and multiple shooting require single steps.

on evaluations of the objective function such as direct methods and population methods. A white-box rollout has the same inputs and outputs as a black-box rollout, but it also allows us to observe the internal model of the system. We can compute gradients and Hessians of the objective function for white-box rollouts, allowing us to apply first- and second-order optimization methods.
The planning algorithms discussed in this chapter require the simulator to be able to perform single steps. Reinforcement learning algorithms operate using episodes, which consist of a series of steps starting from a user-specified initial state. At each step, the reinforcement learning agent observes the next state and reward from the previous step and selects a disturbance. Tree search algorithms and multiple shooting methods require the simulator to be able to take steps from arbitrary states. Given a state s and a disturbance x, the simulator must be able to simulate the system one step forward in time and return the next state s′. For tree search algorithms that use cost functions, the simulator must also return the cost c of taking the step.


5.7 Summary

• Planning algorithms account for the temporal aspect of the falsification problem
and break it into a series of smaller problems.

• Shooting methods perform optimization-based falsification by optimizing over


a series of trajectory segments, which may increase efficiency for systems where
small changes in the disturbances at the beginning of a trajectory can have a
significant effect later.

• Tree search algorithms search the space of possible trajectories as a tree and
iteratively grow the tree in search of a failure trajectory.

• Heuristic search algorithms use heuristics such as distance to failure, coverage,


and robustness to guide the search.

• Monte Carlo tree search balances between exploration and exploitation to


efficiently search the space of possible trajectories.

• Reinforcement learning algorithms can be used to train an adversary to produce


failures in a sample efficient manner.

• The capabilities of a system’s simulator determine which falsification algo-


rithms can be applied.

6 Failure Distribution

While the falsification algorithms in the previous chapters search for single failure events, it is often desirable to understand the distribution over failures for a given system and specification. This distribution is difficult to quantify exactly for many real-world systems. Instead, we can approximate the failure distribution by drawing samples from it. This chapter discusses methods for sampling from the failure distribution. We present two categories of sampling methods. First, we discuss rejection sampling, which produces samples from a target distribution by accepting or rejecting samples from a different distribution. We then present Markov chain Monte Carlo (MCMC) methods. MCMC methods generate samples from a target distribution using a chain of correlated samples. We conclude with a discussion of probabilistic programming, which allows us to scale MCMC methods to complex, high-dimensional systems.

6.1 Distribution over Failures

Figure 6.1. The distribution over failures for a simple system where trajectories consist of only a single state that is sampled from a normal distribution (black). A failure occurs when the sampled state is less than −1. The area of the shaded region corresponds to the integral in equation (6.1). The failure distribution (red) is the probability density function of the nominal distribution in the failure region scaled by this value.
The distribution over failures for a given system with specification ψ is represented by the conditional probability p(τ | τ ∉ ψ). We can write this probability as

$$ p(\tau \mid \tau \notin \psi) = \frac{\mathbb{1}\{\tau \notin \psi\}\, p(\tau)}{\int \mathbb{1}\{\tau \notin \psi\}\, p(\tau)\, \mathrm{d}\tau} \tag{6.1} $$

where 1{·} is the indicator function and p(τ) is the probability density of the nominal trajectory distribution for trajectory τ. Figure 6.1 shows the failure distribution for a simple system where trajectories consist of only a single state that is sampled from a normal distribution. For most systems, the failure distribution is difficult to compute exactly because doing so requires solving the integral in the denominator of equation (6.1) to compute the normalizing constant. The value of

this integral corresponds to the probability of failure for the system. We discuss methods to estimate this quantity in chapter 7.
1. For a detailed overview, see C. P. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 1999, vol. 2.
While we cannot compute the probability density of the failure distribution exactly, we can use its unnormalized probability density p̄(τ | τ ∉ ψ) to draw samples from it. The unnormalized probability density is given by

$$ \bar{p}(\tau \mid \tau \notin \psi) = \mathbb{1}\{\tau \notin \psi\}\, p(\tau) \tag{6.2} $$

Computing this density for a given trajectory only requires determining whether
it is a failure trajectory and evaluating its probability density under the nominal
trajectory distribution. The rest of this chapter discusses several methods for
sampling from this unnormalized distribution.1 With enough samples, we can
implicitly represent the distribution over failures (see figure 6.2).
Figure 6.2. Distribution over failures for the grid world problem represented implicitly through samples. The probability of slipping is set to 0.8.

6.2 Rejection Sampling
Rejection sampling produces samples from a complex target distribution by accepting or rejecting samples from a different distribution that is easier to sample from. It is inspired by the idea of throwing darts uniformly at a rectangular dart board that encloses the graph of the density of the target distribution. If we keep only the darts that land inside the target density, we produce samples that are distributed according to the target distribution (see figure 6.3).
Figure 6.3. Sampling from a truncated normal distribution by throwing darts uniformly at a rectangular dart board that encloses the graph of its density function. The samples on the bottom are obtained by moving all of the darts that land inside the target distribution to the bottom of the dart board. These samples are distributed according to the target distribution.
In the dart board example, we are using samples from a uniform distribution to produce samples from an arbitrary target density. The efficiency of this process depends on the area of the dart board that lies outside the target distribution. If there is a large area outside the target distribution, many of the darts will be rejected, and we will require more darts to accurately represent the target distribution. One way to improve efficiency is to use a different dart board that more closely matches the shape of the target distribution. In other words, we may want to draw samples from a different distribution that is still easy to sample from but more closely matches the target distribution. We call this distribution a proposal distribution.
Algorithm 6.1 implements the rejection sampling algorithm given a target distribution with density function p̄(τ) and a proposal distribution with density function q(τ). At each iteration, we draw a sample τ from the proposal distribution and accept it with probability p̄(τ)/(cq(τ)).2
2. In the dart board analogy, we can think of this acceptance criterion as a two step process. First, we sample the x-coordinate of the dart from the proposal distribution. Second, we select its y-coordinate randomly between the bottom of the board and cq(τ). If it falls under p(τ), it is accepted.
To ensure that the proposal distribution fully encloses the target distribution, we require that


q(τ ) > 0 whenever p̄(τ ) > 0 and that c is selected such that p̄(τ ) ≤ cq(τ ) for all
τ. The density function of the target distribution does not need to be normalized.

Algorithm 6.1. The rejection sampling algorithm for sampling from a target distribution. At each iteration, the algorithm performs a rollout using the proposal trajectory distribution, computes the acceptance ratio, and accepts the sample with probability equal to the acceptance ratio.

struct RejectionSampling
    p̄      # target density
    q      # proposal trajectory distribution
    c      # constant such that p(τ) ≤ cq(τ)
    k_max  # max iterations
end

function sample_failures(alg::RejectionSampling, sys, ψ)
    p̄, q, c, k_max = alg.p̄, alg.q, alg.c, alg.k_max
    τs = []
    for k in 1:k_max
        τ = rollout(sys, q)
        if rand() < p̄(τ) / (c * pdf(q, τ))
            push!(τs, τ)
        end
    end
    return τs
end
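As a hypothetical invocation of algorithm 6.1 (not from the text; the NominalTrajectoryDistribution constructor, the depth d, and the iteration count are assumptions), using the nominal trajectory distribution as the proposal with c = 1, as discussed next, reduces the acceptance test to keeping failure trajectories:

d = 41                                       # hypothetical trajectory depth
q = NominalTrajectoryDistribution(sys, d)    # assumed constructor from earlier chapters
p̄(τ) = isfailure(ψ, τ) ? pdf(q, τ) : 0.0    # unnormalized target from equation (6.2)
alg = RejectionSampling(p̄, q, 1.0, 10_000)
τs = sample_failures(alg, sys, ψ)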
To sample from the failure distribution, we use the unnormalized density in equation (6.2) as the target density. A common choice for the proposal distribution is the nominal trajectory distribution. To use this proposal, we must select a value for c such that 1{τ ∉ ψ} p(τ) ≤ cp(τ). Selecting c = 1 satisfies this condition and causes the acceptance ratio to reduce to 1{τ ∉ ψ}. In other words, we will accept a sample if it is a failure trajectory and reject it otherwise. Figure 6.4 shows an example that uses the nominal trajectory distribution to sample from the failure distribution shown in figure 6.1.
Figure 6.4. Rejection sampling using the nominal trajectory distribution as the proposal distribution to sample from the failure distribution shown in figure 6.1. The plot on the top shows the target density (red) and the proposal density (gray). Accepted samples are highlighted in red. The plot on the bottom shows a histogram of the accepted samples compared to the density function of the failure distribution.
If failures are unlikely under the nominal distribution, we will require many samples to produce a representative set of samples from the failure distribution. In this case, we may be able to improve efficiency by using domain knowledge to select a proposal distribution that more closely matches the shape of the failure distribution. For example, failures occur at negative values in the simple system shown in figure 6.1, so we may be able to improve efficiency by shifting the proposal distribution to the left.
When we select the proposal distribution for rejection sampling, we must also select a value for c to ensure that the proposal distribution fully encloses the target distribution for all τ. Figure 6.5 shows an example that uses a shifted proposal distribution to sample from the failure distribution shown in figure 6.1 for two
distribution to sample from the failure distribution shown in figure 6.1 for two


different values of c. We want to select c to be as tight as possible to achieve the highest efficiency. In general, selecting a good proposal distribution and value for c requires domain knowledge and can be challenging for high-dimensional systems with long time horizons. If c is too loose, rejection sampling may be too inefficient to be useful (see example 6.1). The next section discusses techniques that tend to perform better in these cases.

Figure 6.5. Using a hand-designed proposal distribution to apply rejection sampling to the simple system in figure 6.1 for two different values of c. The proposal distribution is a normal distribution shifted to the left (q(τ) = N(τ | −1, 1²)). The top row shows the results for c = 1, which is a loose bound. The bottom row shows the results for c = 0.6065, which is the tightest possible value for c. The columns show the results after 50, 200, and 1,000 samples. More samples are accepted using the tighter value for c, resulting in greater efficiency.
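To show how a tight value of c can be found numerically, the following sketch (added here, not from the text) maximizes the density ratio over the failure region for the shifted proposal in figure 6.5. The standard normal nominal density and the failure threshold of −1 are assumptions made only for this illustration.

using Distributions
p = Normal(0, 1)               # assumed nominal density for the simple system
q = Normal(-1, 1)              # shifted proposal from figure 6.5
isfail(τ) = τ ≤ -1             # assumed failure region for illustration
τs = range(-10, 10, length=100_001)
c = maximum(isfail(τ) ? pdf(p, τ) / pdf(q, τ) : 0.0 for τ in τs)   # ≈ 0.6065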


Example 6.1. Example of the challenges of using rejection sampling for high-dimensional systems with long time horizons. In this example, we compute the tightest value we can select for c based on domain knowledge for the inverted pendulum system and show that it is prohibitively large.

Suppose we want to use rejection sampling to sample failures from an inverted pendulum system where the standard deviation of the sensor noise for each state variable is 0.1. From example 4.3, we know that failures are rare under the nominal trajectory distribution, so rejection sampling using the nominal trajectory distribution as a proposal will be inefficient. We also saw in example 4.3 that when we instead sampled trajectories from a distribution where the standard deviation of the sensor noise was 0.15, we were able to find failures. Therefore, we may want to use this distribution as a proposal for rejection sampling.

We must then select a value for c such that

    p(τ) ≤ cq(τ)
    p(s1) ∏_{t=1}^d N(xt | 0, (0.1)²I) ≤ c p(s1) ∏_{t=1}^d N(xt | 0, (0.15)²I)
    ∏_{t=1}^d N(xt | 0, 0.01I) / N(xt | 0, 0.0225I) ≤ c

where we assume that the initial state distribution is the same for the proposal and target. The term in the product will be maximized when xt = [0, 0] for all t. Plugging this result into the product and assuming a depth of 40, we find that

    (N(0 | 0, 0.01I) / N(0 | 0, 0.0225I))⁴⁰ ≤ c
    1.2226 × 10¹⁴ ≤ c

Therefore, the tightest value we can select for c is 1.2226 × 10¹⁴. Using this value, our acceptance probabilities end up being very small (on the order of 10⁻¹⁵), and rejection sampling is inefficient.
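As a quick numerical check of this bound (a sketch added here, not from the text), the per-step density ratio can be evaluated directly, assuming a two-dimensional observation disturbance as in the pendulum example:

using Distributions, LinearAlgebra
ratio = pdf(MvNormal(zeros(2), 0.1^2 * I(2)), zeros(2)) /
        pdf(MvNormal(zeros(2), 0.15^2 * I(2)), zeros(2))   # = (0.15 / 0.1)^2 = 2.25
c = ratio^40                                                # ≈ 1.2226 × 10^14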


6.3 Markov Chain Monte Carlo

Markov chain Monte Carlo (MCMC) algorithms generate samples from a target distribution by sampling from a Markov chain.³ A Markov chain is a sequence of random variables where each variable depends only on the previous one. MCMC algorithms begin by initializing a Markov chain with an initial sample τ. At each iteration, they use the current sample τ to generate a new sample τ′ by sampling from a conditional distribution g(· | τ). This distribution is sometimes referred to as a kernel.⁴ We accept or reject the new sample based on an acceptance criterion. If the new sample is accepted, we set τ = τ′ and continue to the next iteration. If the new sample is rejected, we keep the previous sample.

³ A detailed overview of MCMC techniques is provided in C. P. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 1999, vol. 2. ⁴ This distribution is also sometimes referred to as a proposal distribution. It differs from the proposal distribution used in rejection sampling in that it is conditioned on the previous sample.

Given certain properties of the kernel and acceptance criterion, MCMC algorithms are guaranteed to converge to the target distribution in the limit of infinite samples. However, the initial samples may not be representative of the target distribution. For this reason, we often specify a burn-in period in which the initial samples are discarded. Furthermore, unlike rejection sampling, the samples produced by MCMC algorithms are not independent from one another. Each sample in the chain depends on the previous one. Therefore, it is also common to thin the samples by only keeping every hth sample. Several variations of MCMC differ in how they implement the acceptance criterion and the kernel.

6.3.1 Metropolis-Hastings

One of the most common MCMC algorithms is the Metropolis-Hastings algorithm.⁵ The Metropolis-Hastings algorithm accepts a new sample τ′ given the current sample τ with probability

    p̄(τ′) g(τ | τ′) / (p̄(τ) g(τ′ | τ))    (6.3)

where p̄ is the unnormalized target density. To sample from the failure distribution, we set p̄(τ) = 1{τ ∉ ψ} p(τ). Since we are taking a ratio of the densities, the target density does not need to be normalized. The kernel g(· | τ) is often chosen to be a symmetric distribution, meaning that g(τ′ | τ) = g(τ | τ′).⁶ In this case, the acceptance criterion reduces to p̄(τ′)/p̄(τ). Intuitively, if τ′ is more likely than τ, it is always accepted. If τ′ is less likely than τ, it is accepted with probability proportional to the ratio of the densities.

⁵ W. K. Hastings, “Monte Carlo Sampling Methods Using Markov Chains and Their Applications,” Biometrika, vol. 57, no. 1, pp. 97–109, 1970. ⁶ When the kernel is symmetric, the algorithm is called the Metropolis algorithm: N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, “Equation of State Calculations by Fast Computing Machines,” Journal of Chemical Physics, vol. 21, no. 6, pp. 1087–1092, 1953. A common choice of a symmetric kernel is a Gaussian distribution centered at the previous sample.


Algorithm 6.2 implements the Metropolis-Hastings algorithm given a target density, a kernel, and an initial trajectory to begin the Markov chain. The kernel is a conditional distribution that takes in a trajectory and produces a trajectory distribution. Example 6.2 shows an example of a kernel for the inverted pendulum system. The next sample is generated by performing a rollout using this distribution. We then accept or reject the new sample based on the acceptance ratio in equation (6.3). Figure 6.6 shows the result of using the Metropolis-Hastings algorithm to sample from the failure distribution shown in figure 6.1.

Algorithm 6.2. The Metropolis-Hastings algorithm for sampling from a target distribution. The kernel function g must take in a trajectory and return a trajectory distribution. At each iteration, the algorithm generates a new sample by performing a rollout using this distribution. It then accepts or rejects the new sample based on the acceptance ratio in equation (6.3). The algorithm discards the first m_burnin samples and thins the remaining samples according to m_skip.

struct MCMCSampling
    p̄        # target density
    g        # kernel: τ′ = rollout(sys, g(τ))
    τ        # initial trajectory
    k_max    # max iterations
    m_burnin # number of samples to discard from burn-in
    m_skip   # number of samples to skip for thinning
end

function sample_failures(alg::MCMCSampling, sys, ψ)
    p̄, g, τ = alg.p̄, alg.g, alg.τ
    k_max, m_burnin, m_skip = alg.k_max, alg.m_burnin, alg.m_skip
    τs = []
    for k in 1:k_max
        τ′ = rollout(sys, g(τ))
        if rand() < (p̄(τ′) * pdf(g(τ′), τ)) / (p̄(τ) * pdf(g(τ), τ′))
            τ = τ′
        end
        push!(τs, τ)
    end
    return τs[m_burnin:m_skip:end]
end

6.3.2 Smoothing
When we use algorithm 6.2 to sample from the failure distribution, we will not
accept any samples that are not failures because p̄(τ) = 1{τ ∉ ψ} p(τ) will be 0
for those samples. While this behavior is necessary for the algorithm to converge
to the failure distribution in the limit of infinite samples, it can create challenges
in practice. For example, if we initialize the Markov chain to a safe trajectory, the
algorithm will reject all samples from g(· | τ ) until it samples a failure. Since
g(· | τ ) typically produces trajectories similar to τ, we may require many samples


Example 6.2. Example of a Gaussian kernel for the inverted pendulum system.

To define a Gaussian kernel for the inverted pendulum system, we must first define a trajectory distribution type (algorithm 4.3) for the pendulum. The following code defines a trajectory distribution for the pendulum system that uses a Gaussian distribution for the initial state and a vector of Gaussian distributions for the observation disturbance distributions:

struct PendulumTrajectoryDistribution <: TrajectoryDistribution
    μ₁ # mean of initial state distribution
    Σ₁ # covariance of initial state distribution
    μs # vector of means of length d
    Σs # vector of covariances of length d
end
function initial_state_distribution(p::PendulumTrajectoryDistribution)
    return MvNormal(p.μ₁, p.Σ₁)
end
function disturbance_distribution(p::PendulumTrajectoryDistribution, t)
    D = DisturbanceDistribution((o)->Deterministic(),
                                (s,a)->Deterministic(),
                                (s)->MvNormal(p.μs[t], p.Σs[t]))
    return D
end
depth(p::PendulumTrajectoryDistribution) = length(p.μs)

We can then define a kernel for the pendulum system that returns an instantiation of this distribution as follows:

function inverted_pendulum_kernel(τ; Σ=0.01I)
    μ₁ = τ[1].s
    μs = [step.x.xo for step in τ]
    return PendulumTrajectoryDistribution(μ₁, Σ, μs, [Σ for step in τ])
end

The new distribution is centered at the initial state and observation disturbances of the current sample. We can use this kernel with algorithm 6.2 to sample from the failure distribution of the inverted pendulum system.
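To make the full loop concrete, the following sketch (added here, not from the book's code) shows one way algorithm 6.2 could be instantiated with this kernel. The system inverted_pendulum, the specification ψ, and an initial failure trajectory τ_init (found, for example, with the methods of chapters 4 and 5) are assumed to be available, and the iteration counts are arbitrary example values.

p = NominalTrajectoryDistribution(inverted_pendulum, 21)   # depth of 21, as in example 6.4
p̄(τ) = isfailure(ψ, τ) * pdf(p, τ)                          # unnormalized failure density
alg = MCMCSampling(p̄, inverted_pendulum_kernel, τ_init,
                   10_000,  # k_max iterations
                   1_000,   # m_burnin samples discarded
                   10)      # keep every 10th sample (thinning)
τs = sample_failures(alg, inverted_pendulum, ψ)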

Figure 6.6. Metropolis-Hastings applied to sample from the failure distribution shown in figure 6.1. We use a Gaussian kernel with a standard deviation of 1. The plot on the left shows the samples over time, with the burn-in period and the failure region marked. The plot on the right shows a histogram of the resulting samples compared to the true probability density function of the failure distribution.


before we sample a failure to accept, especially if τ is far from the failure region.⁷ We see this behavior during the burn-in period in figure 6.6.

⁷ One way to avoid this behavior is to ensure that the initial trajectory is a failure. The algorithms in chapters 4 and 5 can be used to search for an initial failure trajectory.

Another challenge arises when the failure distribution has multiple modes. To move between modes, the algorithm must sample a failure from one failure mode using a kernel conditioned on a trajectory from another. If the failure modes are spread out in the trajectory space, the algorithm may require a large number of samples before moving from one mode to another. Example 6.3 illustrates these challenges on a simple Gaussian system.

Smoothing is a technique that addresses these challenges by modifying the target density to make it easier to sample from.⁸ It relies on a notion of the distance to failure, which we will write as ∆(τ) for a given trajectory τ. This distance is a nonnegative number that measures how close τ is to a failure. For failure trajectories, ∆(τ) should be 0. We can rewrite the target density in terms of this distance as

    p̄(τ | τ ∉ ψ) = 1{∆(τ) ≤ 0} p(τ)    (6.4)

The indicator function causes sharp boundaries between safe and unsafe trajectories. To create a smooth version of this density, we replace the indicator function with a Gaussian distribution with mean 0 and a small standard deviation. The resulting smoothed density is

    p̄(τ | τ ∉ ψ) ≈ N(∆(τ) | 0, ϵ²) p(τ)    (6.5)

where ϵ is the standard deviation.

⁸ H. Delecki, A. Corso, and M. J. Kochenderfer, “Model-Based Validation as Probabilistic Inference,” in Conference on Learning for Dynamics and Control (L4DC), 2023.

For systems with temporal logic specifications, we can specify the distance function using temporal logic robustness (section 3.5.2). Since robustness is positive when the formula is satisfied and negative when it is violated, we can write the distance function as

    ∆(τ) = max(0, ρ(τ))    (6.6)

where ρ(τ) is the robustness of the trajectory τ. Figure 6.7 shows the smoothed version of the failure distribution in figure 6.1 for different values of ϵ. As ϵ approaches 0, the smoothed density approaches the shape of the failure distribution. As ϵ approaches infinity, the smoothed density approaches the shape of the nominal distribution.

Figure 6.7. Smoothed versions of the failure distribution in figure 6.1 for ϵ = 0.2, 0.5, and 0.8, along with the unsmoothed density. As ϵ decreases, the smoothed distribution approaches the failure distribution.


Example 6.3. Example of the challenges of using MCMC to sample from the failure distribution given a finite sample budget. The plot on the left demonstrates the challenges with initialization, and the plot on the right shows the challenges of sampling from failure distributions with multiple modes.

Suppose we want to sample from the failure distribution shown in the plot on the left and we initialize our Markov chain with τ = 1. We will not accept a new sample until we draw a sample with a value less than −1. If we use a Gaussian kernel with standard deviation 1, we have that

    g(τ′ | τ) = N(τ′ | 1, 1²)

The probability of drawing a sample less than −1 from this distribution is 0.02275 (corresponding to the shaded region in the plot on the left). Therefore, we will require 44 samples on average before the algorithm accepts a sample. If we were to initialize the algorithm with a sample even further from the failure region, we would require even more samples, to the point where MCMC may not converge within a finite sample budget.

The plot on the right demonstrates the challenge of using MCMC to sample from a failure distribution with multiple modes. In this case, the current sample is in the mode on the left at −2.2. Using the same Gaussian kernel, we have

    g(τ′ | τ) = N(τ′ | −2.2, 1²)

The probability of moving to the other mode from this point is 1.3346 × 10⁻⁵. Therefore, we will require a large number of samples before we switch modes.


The smoothed failure distribution assigns a nonzero probability to all trajectories, and it assigns higher probabilities to trajectories that are close to failure. This design allows the MCMC algorithm to more easily move between failure modes. However, because the smoothed distribution will assign a nonzero probability to safe trajectories, the algorithm will accept some samples that are not failures. We can still recover the failure distribution by rejecting these samples after MCMC has terminated. In fact, this process is equivalent to performing rejection sampling with the smoothed density as the proposal distribution. Figure 6.8 and example 6.4 show the benefit of applying MCMC with a smoothed density to sample from failure distributions with multiple modes.
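Concretely, this post-hoc rejection step is just a filter over the chain. The one-line sketch below (added here, not from the text) assumes τs holds the samples returned by algorithm 6.2 when run with a smoothed target density.

# Discard the non-failure samples accepted under the smoothed density to
# recover samples from the (unsmoothed) failure distribution.
τs_failures = filter(τ -> isfailure(ψ, τ), τs)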

6.3.3 Metropolis-Adjusted Langevin Algorithm

The performance of MCMC is sensitive to the choice of kernel. While a Gaussian kernel is simple to implement, it does not scale well to high-dimensional systems with complex failure distributions because it randomly explores the target density without taking into account its underlying structure. We can improve performance by selecting a kernel that takes into account this structure. For example, we can use knowledge of the gradient of the target density to guide exploration.

The Metropolis-Adjusted Langevin Algorithm (MALA) uses a gradient-based kernel that approximates a process known as Langevin diffusion.⁹ The kernel is defined as

    g(τ′ | τ) = N(τ′ | τ + α∇ log p̄(τ), 2αI)    (6.7)

where α is a hyperparameter of the algorithm.¹⁰ The MALA kernel is not symmetric in general. Intuitively, the kernel takes a step in the direction of the greatest increase in log likelihood and samples from a Gaussian distribution centered at the new location. The algorithm then accepts or rejects the new sample based on the Metropolis-Hastings acceptance ratio in equation (6.3). We can run MALA using algorithm 6.2 by implementing a kernel that follows equation (6.7).

⁹ Langevin dynamics is an idea from physics that was developed to model molecular systems by physicist Paul Langevin (1872–1946). MALA is also referred to as Langevin Monte Carlo. U. Grenander and M. I. Miller, “Representations of Knowledge in Complex Systems,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 56, no. 4, pp. 549–581, 1994. ¹⁰ This kernel represents a discrete approximation of the Langevin diffusion process. It approaches the continuous-time Langevin diffusion process as α approaches 0.

Using the gradient to guide the sampling allows the algorithm to explore the target density more efficiently than a random walk. Furthermore, when combined with the smoothing technique in section 6.3.2, the gradient helps to guide the algorithm toward different failure modes. Figure 6.9 compares the path taken by a Gaussian kernel with the path taken by MALA on a simple target density. The


Figure 6.8. The effect of smoothing on the MCMC algorithm for a failure distribution with multiple modes. The first row shows the target density (gray) compared to the density of the true failure distribution (red). The second row shows the MCMC samples over time with the failure regions shaded in red. The third row shows the accepted (red) and rejected (gray) samples. The fourth row shows a histogram of the accepted samples compared to the true probability density function of the failure distribution. The first column shows the results without smoothing. The second and third columns show the results with ϵ = 0.3 and ϵ = 0.5, respectively. Without smoothing, MCMC stays in the same failure mode for all 2,000 iterations and misses the other mode. Applying smoothing allows the algorithm to more easily move between failure modes and results in better estimates of the failure distribution given the sample budget.


Example 6.4. Applying smoothing to sample from the failure distribution of the inverted pendulum system. Smoothing allows MCMC to sample from both failure modes given a finite sample budget. The plot in the margin shows the result of running MCMC without smoothing.

The inverted pendulum system has two main failure modes: tipping over in the negative direction and tipping over in the positive direction. To observe the effect of smoothing on the performance of MCMC, we define the following two unnormalized target densities:

p = NominalTrajectoryDistribution(inverted_pendulum, 21) # depth = 21
p̄(τ) = isfailure(ψ, τ) * pdf(p, τ)
function p̄_smooth(τ; ϵ=0.15)
    Δ = max(robustness([step.s for step in τ], ψ.formula), 0)
    return pdf(Normal(0, ϵ), Δ) * pdf(p, τ)
end

The plot in the margin shows the results when we run algorithm 6.2 using p̄ as the target density. We will not accept any samples that are not failures, and we only observe failures from one failure mode. The plots below show the results when we use p̄_smooth combined with rejection sampling. Smoothing allows us to sample failures from both failure modes. However, we now draw some samples that are not failures during the MCMC (left), so we must reject them after the algorithm has terminated to recover the failure distribution (right). (Plots: MCMC Output and Failure Distribution, θ in radians versus time in seconds.)


Figure 6.9. Comparison of the paths taken by algorithm 6.2 using the Gaussian kernel and the MALA kernel for a two-dimensional smoothed target density with a failure region shown in red. Brighter contours indicate higher density. The MALA kernel uses the gradient of the log likelihood to guide its steps and requires fewer samples than the Gaussian kernel to move to the failure region. The MALA kernel also has a higher acceptance rate.

MALA kernel enables MCMC to move more efficiently toward regions of high
likelihood.
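A kernel of this form could look like the following sketch, which is an illustration added here rather than the book's implementation. It is written for a scalar quantity τ, and ∇logp̄ stands for a hypothetical function that returns the gradient of the log target density (from autodifferentiation or a finite-difference approximation).

using Distributions
# Sketch of a MALA kernel following equation (6.7) for a scalar τ.
function mala_kernel(τ; α=0.01)
    μ = τ + α * ∇logp̄(τ)        # drift toward higher log density
    return Normal(μ, sqrt(2α))   # Gaussian centered at the drifted point, variance 2α
end

For trajectory-valued samples, the same drift would instead be applied to the means of a trajectory distribution such as the one defined in example 6.2.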

6.3.4 Metropolis-Hastings Variations

MALA is one of several variations of the Metropolis-Hastings algorithm. Other gradient-based variations include Hamiltonian Monte Carlo¹¹ (HMC) and the No U-Turn Sampler¹² (NUTS). HMC uses a simulation of Hamiltonian dynamics based on the gradient of the log likelihood to guide exploration. NUTS is an extension of HMC that is less sensitive to hyperparameters. Another variation of the Metropolis-Hastings algorithm is Gibbs sampling, which updates each variable in the target density one at a time conditioned on the values of the other variables.¹³ Gibbs sampling is particularly beneficial when sampling from high-dimensional target densities where the conditional distributions are easier to sample from than the joint distribution.

¹¹ Hamiltonian Monte Carlo is also referred to as Hybrid Monte Carlo. S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid Monte Carlo,” Physics Letters B, vol. 195, no. 2, pp. 216–222, 1987. ¹² M. D. Hoffman, A. Gelman, et al., “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research (JMLR), vol. 15, no. 1, pp. 1593–1623, 2014. ¹³ G. Casella and E. I. George, “Explaining the Gibbs Sampler,” The American Statistician, vol. 46, no. 3, pp. 167–174, 1992.
6.4 Probabilistic Programming

Probabilistic programming is a technique for specifying probabilistic models as computer programs in a way that allows inference to be performed automatically.¹⁴ By specifying the model of a given system as a probabilistic program, we can apply a variety of MCMC algorithms to sample from the failure distribution.

¹⁴ An in-depth overview of probabilistic programming is provided in G. Barthe, J.-P. Katoen, and A. Silva, Foundations of Probabilistic Programming. Cambridge University Press, 2020.


Furthermore, probabilistic programming tools are often combined with autodifferentiation tools, which allow us to automatically compute the gradient of the target density for use in gradient-based MCMC algorithms.¹⁵ These features allow us to sample from the failure distribution of complex systems without the need for significant manual overhead.

¹⁵ A. Griewank and A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd ed. SIAM, 2008.

Algorithm 6.3. Probabilistic programming algorithm for sampling from the failure distribution. The algorithm uses the Turing.jl package to specify the rollout function as a probabilistic model. It adds a log probability term equivalent to the smoothed indicator function in equation (6.5) to specify that we want to sample failure trajectories. It then generates samples using the specified MCMC algorithm. Turing.jl supports a variety of MCMC algorithms, including Metropolis-Hastings, HMC, NUTS, and Gibbs sampling.

struct ProbabilisticProgramming
    Δ        # distance function: Δ(𝐬)
    mcmc_alg # e.g. Turing.NUTS()
    k_max    # number of samples
    d        # trajectory depth
    ϵ        # smoothing parameter
end

function sample_failures(alg::ProbabilisticProgramming, sys, ψ)
    Δ, mcmc_alg = alg.Δ, alg.mcmc_alg
    k_max, d, ϵ = alg.k_max, alg.d, alg.ϵ

    @model function rollout(sys, d; xo=fill(missing, d),
                                    xa=fill(missing, d),
                                    xs=fill(missing, d))
        p = NominalTrajectoryDistribution(sys, d)
        s ~ initial_state_distribution(p)
        𝐬 = [s, [zeros(length(s)) for i in 1:d]...]
        for t in 1:d
            D = disturbance_distribution(p, t)
            s = 𝐬[t]
            xo[t] ~ D.Do(s)
            o = sys.sensor(s, xo[t])
            xa[t] ~ D.Da(o)
            a = sys.agent(o, xa[t])
            xs[t] ~ D.Ds(s, a)
            𝐬[t+1] = sys.env(s, a, xs[t])
        end
        Turing.@addlogprob! logpdf(Normal(0.0, ϵ), Δ(𝐬))
    end

    return Turing.sample(rollout(sys, d), mcmc_alg, k_max)
end

Algorithm 6.3 writes the rollout function as a probabilistic program that can be used to sample from the smoothed failure distribution.¹⁶ Similar to algorithm 4.5, the probabilistic programming model samples an initial state from the initial state distribution and steps the system forward in time by sampling from the disturbance distribution at each time step. However, rather than explicitly drawing the samples, the model only specifies the distributions from which the samples are drawn. The probabilistic programming tool handles the sampling and keeps track of the probability associated with each draw automatically.

¹⁶ We use a probabilistic programming package written for the Julia language called Turing.jl.

To specify that we want to sample failure trajectories, we add a log probability term for the smoothed indicator function in equation (6.5). Probabilistic programming tools often perform operations in log space for numerical stability. Adding this term in log space is equivalent to multiplying the target density by the smoothed indicator function. Example 6.5 demonstrates how to use algorithm 6.3 to sample from the failure distribution of the inverted pendulum system. It runs the algorithm twice to produce two chains that capture two distinct failure modes. In addition to smoothing, running multiple MCMC chains from different starting points is another method to improve the performance of MCMC for failure distributions with multiple modes.

Example 6.5. Sampling from the failure distribution of the inverted pendulum system using algorithm 6.3. The plot shows the result of running the algorithm twice to produce two MCMC chains (θ in radians versus time in seconds). The initial samples that are not failures are discarded during the burn-in period.

To use probabilistic programming to sample from the failure distribution of the inverted pendulum system, we can use the following code to set up the MCMC algorithm and distance function:

mcmc_alg = Turing.NUTS(10, 0.65, max_depth=6)
Δ(𝐬) = max(robustness(𝐬, ψ.formula, w=1.0), 0)

The code sets up the No U-Turn Sampler (NUTS) MCMC algorithm. Since NUTS relies on the gradient of the target density, we use smoothed robustness in the distance function so that the gradient exists. The first two parameters in the NUTS constructor are the number of adaptation steps and the target acceptance rate. The plot shows the result of running algorithm 6.3 with the specified parameters. We run the algorithm twice to produce two MCMC chains. Running multiple chains from different starting points is another method to improve performance for failure distributions with multiple modes.

6.5 Summary

• In general, it is difficult to compute the distribution over failures exactly, but


we can compute its unnormalized density.


• Using an unnormalized density over failures, we can apply a variety of algo-


rithms to draw samples.

• Rejection sampling works by drawing independent samples from a proposal


distribution and accepting them with probability proportional to the ratio of
the target density to the proposal density.

• The performance of rejection sampling depends on the choice of proposal


distribution, and it can be difficult to select a good proposal distribution for
high-dimensional systems.

• Markov chain Monte Carlo (MCMC) algorithms sample from the target dis-
tribution by drawing samples from a Markov chain and scale well to high-
dimensional systems.

• MCMC is only guaranteed to converge to the target distribution in the limit


of infinite samples, but we cannot generate an infinite number of samples in
practice.

• We can use heuristics such as smoothing and gradient-based kernels to improve


the performance of MCMC with a finite sample budget.

• Probabilistic programming is a tool that allows us to specify probabilistic


models as computer programs and can be used to sample from the failure
distribution of complex systems.

7 Failure Probability Estimation

After searching for the potential failure modes of a system, we may also want to
estimate its probability of failure. This chapter presents several techniques for
estimating this quantity from samples. We begin by discussing a direct estimation
approach that uses samples from the nominal trajectory distribution to estimate
the probability of failure. If failures are rare, this approach may be inefficient and
require a large number of samples to produce a good estimate. The remainder of
the chapter discusses more efficient estimation techniques based on importance
sampling. Importance sampling techniques artificially increase the likelihood of
failure trajectories by sampling from a proposal distribution. We discuss several
variations of importance sampling and conclude by presenting a nonparametric
algorithm that estimates the probability of failure from a sequence of samples.

7.1 Direct Estimation

The probability of failure for a given system and specification is defined mathematically as

    pfail = E_{τ∼p(·)}[1{τ ∉ ψ}] = ∫ 1{τ ∉ ψ} p(τ) dτ    (7.1)

where 1{·} is the indicator function. The expectation is taken over the nominal trajectory distribution for the system.¹ Given a set of m trajectories from this distribution, we can produce an estimate p̂fail of the probability of failure by treating the problem as a parameter learning problem, where the parameter of interest is the parameter of a Bernoulli distribution. We can then apply the maximum likelihood or Bayesian methods from chapter 2 to calculate p̂fail.

¹ Note that the right-hand side of equation (7.1) is equivalent to the denominator in equation (6.1). In other words, the probability of failure is the normalizing constant for the failure distribution.

7.1.1 Maximum Likelihood Estimate


The maximum likelihood estimate of the probability of failure is

    p̂fail = (1/m) ∑_{i=1}^m 1{τi ∉ ψ} = n/m    (7.2)

where n is the number of samples that resulted in a failure and m is the total
number of samples. Algorithm 7.1 uses direct sampling to implement this estima-
tor. It performs m rollouts and computes the probability of failure according to
equation (7.2).

Algorithm 7.1. The direct estimation algorithm for estimating the probability of failure. The algorithm performs rollouts to a depth d to generate m samples from the nominal trajectory distribution. It then applies equation (7.2) to compute p̂fail and returns the result.

struct DirectEstimation
    d # depth
    m # number of samples
end

function estimate(alg::DirectEstimation, sys, ψ)
    d, m = alg.d, alg.m
    τs = [rollout(sys, d=d) for i in 1:m]
    return mean(isfailure(ψ, τ) for τ in τs)
end

We can evaluate the accuracy of an estimator using metrics such as bias, consis-
tency, and variance (example 7.1). Equation (7.2) provides an empirical estimate
of the probability of failure by computing the sample mean of a set of samples
drawn from a Bernoulli distribution with parameter pfail . The sample mean is an
unbiased estimator of the true mean of a Bernoulli distribution, so the estimator is
unbiased. We can calculate the variance of this estimator by dividing the variance
of a Bernoulli distribution by the number of samples:

    Var[p̂fail] = pfail(1 − pfail) / m    (7.3)
The square root of this quantity is known as the standard error of the estimator. A
lower variance means that the sample mean will be closer to the true mean on
average and therefore indicates a more accurate estimator. In the limit of infinite
samples, the variance approaches zero, so the estimator is consistent.
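One common practical workaround, offered here as an aside rather than something from the text, is to substitute p̂fail for the unknown pfail in equation (7.3) to obtain a plug-in standard error. The sketch below assumes sys, ψ, and a trajectory depth d are defined as elsewhere in the chapter.

m  = 1000
τs = [rollout(sys, d=d) for i in 1:m]
p̂  = mean(isfailure(ψ, τ) for τ in τs)
se = sqrt(p̂ * (1 - p̂) / m)   # plug-in standard error from equation (7.3)

Note that this plug-in value is zero whenever no failures are observed, which is part of the motivation for the Bayesian treatment in section 7.1.2.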


Example 7.1. Common metrics used to evaluate estimators. The plots show predictions of three different estimators, with shaded regions representing the variance.

Bias, consistency, and variance are three common properties used to evaluate the quality of an estimator. An estimator that produces p̂fail is unbiased if it predicts the true value in expectation:

    E[p̂fail] = pfail

An estimator is consistent if it converges to the true value in the limit of infinite samples:

    lim_{m→∞} p̂fail = pfail

For example, given a set of samples drawn independently from the same distribution, the sample mean is an unbiased and consistent estimator of the distribution's true mean. The variance of the estimator quantifies the spread of the estimates around the true value. For the sample mean example, the variance will decrease as the number of samples increases. The plots below illustrate these concepts: the first estimator is consistent but biased, the second is unbiased but not consistent, and the third is both unbiased and consistent. The shaded regions reflect the variance of the estimator.

In general, we want to use an estimator that is unbiased, consistent, and has low variance. However, we are sometimes forced to trade off between these metrics to achieve the best efficiency for complex problems.


Equation (7.3) provides insight into the accuracy of the estimator in equation (7.2). However, it is expressed in terms of the true probability of failure pfail, which is the quantity we want to estimate. Therefore, we cannot apply this equation to directly assess the accuracy of the output of algorithm 7.1. We can instead use it to reason about qualitative trends. For example, equation (7.3) indicates that we decrease the variance of our estimator by collecting more samples. Example 7.2 illustrates this trend on the grid world problem.

Example 7.2. The empirical mean and variance of the direct estimator for the grid world problem computed over 10 trials of algorithm 7.1. The depth d is set to 50 and the probability of slipping is set to 0.8. The blue line shows the mean of p̂fail for all 10 trials, and the shaded region represents one standard deviation above and below the mean.

We demonstrate the effect of equation (7.3) empirically by running 10 trials of algorithm 7.1 on the grid world problem. We compute the empirical mean and variance of p̂fail across all 10 trials after each new sample. The plot (p̂fail versus m, up to 5,000 samples) shows the results of this experiment. As predicted by equation (7.3), the variance decreases as the number of samples m increases.

In addition to the number of samples, the true probability of failure pfail also
has an impact on the relative accuracy of the estimator. As the true probability
of failure decreases, the number of samples required to achieve a given level of
accuracy increases (see exercise 7.1). For systems in which failure events are rare,
we may require a large number of samples to produce an accurate estimate for
the probability of failure using algorithm 7.1. Section 7.2 introduces importance
sampling, which can be used to improve the efficiency in these scenarios.


7.1.2 Bayesian Estimate


Bayesian failure probability estimation may improve accuracy in scenarios with
limited data or rare failure events. For example, suppose we want to estimate the
probability of an aircraft collision from an aviation safety database that contains
flight records from the past week. If there are no recorded midair collisions for
the past week, the maximum likelihood estimate for the probability of a midair
collision would be zero. Believing that there is zero chance of a midair collision is
not a reasonable conclusion unless our prior hypothesis was, for example, that all
flights were perfectly safe.
Bayesian estimation techniques incorporate a prior belief about the safety of the
system and maintain a full distribution over the probability of failure. Since p̂fail is
the parameter of a Bernoulli distribution, the distribution over the probability of
failure is a beta distribution. The posterior distribution after observing n failures
in m samples is
pfail ∼ Beta(α + n, β + m − n) (7.4)
if we start with a prior of Beta(α, β).
Algorithm 7.2 implements this estimator given a prior distribution. It performs
m rollouts and computes the posterior distribution over the probability of failure
according to equation (7.4). The prior distribution should be selected to reflect
our prior beliefs about the probability of failure based on domain knowledge. If
we do not have any reason to believe that one value of pfail is more probable than
another value in the absence of data, we can use a uniform prior of Beta(1, 1).
Figure 7.1 shows an example of how the posterior distribution changes as more
samples are collected. We can convert the distribution over the probability of
failure into a point estimate by computing its mean or mode. The mean of the
distribution Beta(α, β) is

    α / (α + β)    (7.5)

and the mode is

    (α − 1) / (α + β − 2)    (7.6)

assuming α and β are greater than 1.
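For instance, the short sketch below (with hypothetical counts, added here rather than taken from the text) computes the posterior and both point estimates using Distributions.jl:

using Distributions
prior = Beta(1, 1)                               # uniform prior
n, m = 2, 100                                    # hypothetical: 2 failures in 100 rollouts
posterior = Beta(prior.α + n, prior.β + m - n)   # equation (7.4)
p̂_mean = mean(posterior)                         # equation (7.5): ≈ 0.0294
p̂_mode = mode(posterior)                         # equation (7.6): = 0.02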
Maintaining a distribution over the probability of failure allows us to explicitly
quantify the uncertainty in our estimate. For example, suppose we have a target
level of safety corresponding to a probability of failure less than or equal to δ.


Algorithm 7.2. The Bayesian estimation algorithm for estimating a distribution over the probability of failure. The algorithm performs rollouts to a depth d to generate m samples from the nominal trajectory distribution. Using the prior, it then applies equation (7.4) to compute the posterior distribution over the probability of failure and returns the result.

struct BayesianEstimation
    prior::Beta # from Distributions.jl
    d # depth
    m # number of samples
end

function estimate(alg::BayesianEstimation, sys, ψ)
    prior, d, m = alg.prior, alg.d, alg.m
    τs = [rollout(sys, d=d) for i in 1:m]
    n, m = sum(isfailure(ψ, τ) for τ in τs), length(τs)
    return Beta(prior.α + n, prior.β + m - n)
end

Figure 7.1. Bayesian estimation applied to the grid world problem with a probability of slipping set to 0.5. We begin with a uniform prior of Beta(1, 1) and determine a distribution over the probability of failure by applying algorithm 7.2 with 1, 5, 10, and 100 samples (0, 1, 1, and 15 failures, respectively). As we observe more samples, the distribution over the probability of failure becomes more concentrated around a small range of probabilities.


We are interested in the quantity p(pfail < δ), which is the probability that the true probability of failure is less than or equal to δ. This quantity is given by the cumulative distribution function of the posterior distribution over the probability of failure.² The quantiles of the posterior distribution can be used to compute confidence intervals in a similar manner. Example 7.3 demonstrates this process.

² The cumulative distribution function of a Beta distribution is the regularized incomplete beta function. Software packages such as Distributions.jl provide implementations of both the cumulative distribution function and the quantile function for the Beta distribution.

7.2 Importance Sampling

Importance sampling algorithms increase the efficiency of sampling-based estimation techniques. Instead of sampling from the nominal trajectory distribution p, they sample from a proposal distribution q that assigns higher likelihood to areas of greater “importance.”³ To estimate the probability of failure using these samples, we must transform the expectation in equation (7.1) to an expectation over q:

    pfail = E_{τ∼p(·)}[1{τ ∉ ψ}]
          = ∫ p(τ) 1{τ ∉ ψ} dτ
          = ∫ p(τ) (q(τ)/q(τ)) 1{τ ∉ ψ} dτ
          = ∫ q(τ) (p(τ)/q(τ)) 1{τ ∉ ψ} dτ
          = E_{τ∼q(·)}[(p(τ)/q(τ)) 1{τ ∉ ψ}]    (7.7)

³ This proposal distribution has similar properties to the proposal distribution introduced in section 6.2 for rejection sampling.

For equation (7.7) to be valid, we require that q(τ) > 0 wherever p(τ)1{τ ∉ ψ} > 0. This condition is satisfied as long as the proposal distribution assigns a nonzero likelihood to all failure trajectories that are possible under p.

Given samples from q(·), we can estimate the probability of failure based on equation (7.7) as

    p̂fail = (1/m) ∑_{i=1}^m (p(τi)/q(τi)) 1{τi ∉ ψ}    (7.8)

Algorithm 7.3 implements this estimator. Equation (7.8) is an unbiased estimator of the true probability of failure. It corresponds to a weighted average of samples from the proposal distribution:

    p̂fail = (1/m) ∑_{i=1}^m wi 1{τi ∉ ψ}    (7.9)


Example 7.3. Quantifying uncertainty in the probability estimate produced by algorithm 7.2. The plots show the posterior distribution Beta(1, 101). The shaded region in the first plot represents the probability that the true probability of failure is less than or equal to 0.01. The shaded region in the second plot shows the 95 % confidence bound.

Suppose that we run algorithm 7.2 on the collision avoidance problem with m = 100 samples and observe no failures. Assuming we begin with a uniform prior, the posterior distribution over the probability of failure is Beta(1, 101). Suppose we are also given a safety requirement for the system stating that pfail must not exceed 0.01. We can compute p(pfail < 0.01) from the cumulative distribution function of the beta distribution using the following code:

using Distributions
posterior = Beta(1, 101)
confidence = cdf(posterior, 0.01)

The confidence variable is equal to 0.6376, indicating that we are 63.76 % confident that the true probability of failure is less than 0.01.

Suppose we instead want to determine a 95 % confidence bound on the probability of failure. We can compute this bound using the quantile function of the beta distribution as follows:

bound = quantile(posterior, 0.95)

The bound variable is equal to 0.0292, so we can be 95 % confident that the true probability of failure is less than 0.0292. The plots below show these results.


where the weights are wi = p(τi )/q(τi ). These weights are sometimes referred to
as importance weights. Trajectories that are more likely under the nominal trajectory
distribution have higher importance weights.

Algorithm 7.3. The importance sampling estimation algorithm for estimating the probability of failure. The algorithm generates m samples from the proposal distribution q. It then computes the importance weights for the samples and applies equation (7.8) to compute p̂fail.

struct ImportanceSamplingEstimation
    p # nominal distribution
    q # proposal distribution
    m # number of samples
end

function estimate(alg::ImportanceSamplingEstimation, sys, ψ)
    p, q, m = alg.p, alg.q, alg.m
    τs = [rollout(sys, q) for i in 1:m]
    ps = [pdf(p, τ) for τ in τs]
    qs = [pdf(q, τ) for τ in τs]
    ws = ps ./ qs
    return mean(w * isfailure(ψ, τ) for (w, τ) in zip(ws, τs))
end
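A minimal usage sketch (added here, not part of the book's code) might estimate the failure probability of the inverted pendulum with a proposal that inflates the sensor noise, as discussed in example 6.1; q is assumed to be such a proposal trajectory distribution, and the depth and sample count are arbitrary example values.

p = NominalTrajectoryDistribution(inverted_pendulum, 21)   # nominal distribution
# q: hypothetical proposal trajectory distribution with larger sensor noise (e.g., σ = 0.15)
alg = ImportanceSamplingEstimation(p, q, 1_000)
p̂fail = estimate(alg, inverted_pendulum, ψ)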

7.2.1 Optimal Proposal Distribution


The accuracy and efficiency of importance sampling approaches are highly dependent on the proposal distribution. The variance of the estimator in equation (7.8) is

    Var[p̂fail] = (1/m) E_{τ∼q(·)}[ (p(τ)1{τ ∉ ψ} − q(τ)pfail)² / q(τ)² ]    (7.10)

In general, we want to select a proposal distribution that makes this variance low, and the optimal proposal distribution is the one that minimizes this variance. It is evident from equation (7.10) that we can achieve a variance of zero when

    q*(τ) = p(τ)1{τ ∉ ψ} / pfail    (7.11)

This distribution corresponds to the failure distribution p(τ | τ ∉ ψ). As noted in chapter 6, computing this distribution is not possible in practice since we often do not know the full set of failure trajectories and the normalizing constant pfail is the quantity we are trying to estimate. Our goal is therefore to select a proposal distribution that is as close as possible to the failure distribution.


7.2.2 Proposal Distribution Selection

One way to select a proposal distribution for importance sampling is based on domain knowledge. For example, if we know that collisions between aircraft tend to occur more often when they have high vertical rates, we may select a proposal distribution that assigns higher likelihood to high vertical rates. It is important to ensure that the proposal distribution has adequate overlap with the failure distribution. In other words, it should assign high likelihood to likely failure trajectories. A poorly selected proposal distribution can lead to high variance and result in poor performance (see example 7.4).

We can also select a proposal distribution based on samples from the failure distribution. Specifically, we can approximate the failure distribution by fitting a distribution to samples obtained using the algorithms in chapter 6. The resulting distribution will approximate the optimal proposal distribution and may improve the performance over a hand-designed proposal distribution (see figure 7.2). The efficacy of this approach, however, is dependent on our ability to produce a good fit to the failure distribution, which may be difficult in high-dimensional spaces with multiple failure modes.
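For a simple one-dimensional case like figure 7.2, this fitting step is a one-liner. The sketch below (added here, not from the text) assumes failure_τs is a vector of scalar failure samples produced by one of the chapter 6 algorithms.

using Distributions
# Fit a Gaussian proposal to failure samples by maximum likelihood (as in figure 7.2).
q_fit = fit_mle(Normal, failure_τs)
# For vector-valued samples arranged as the columns of a matrix X, a multivariate
# Gaussian could be fit instead: q_fit = fit_mle(MvNormal, X)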

Figure 7.2. Fitting a proposal distribution to samples from the failure distribution. The plot shows a histogram of samples from the failure distribution produced using rejection sampling. We fit a Gaussian distribution to these samples (red) using maximum likelihood estimation to use as a proposal distribution. The nominal distribution p(τ) is shown in black.

7.2.3 Multiple Importance Sampling

We can also draw samples from multiple proposal distributions and combine them to form a more robust estimate. This approach is known as multiple importance sampling (MIS). Suppose we draw m samples such that

    τi ∼ qi(·) for all i ∈ {1, . . . , m}    (7.12)
where qi(·) is the proposal distribution used to generate the ith sample τi. We can still use equation (7.9) to estimate the probability of failure for MIS, but we must modify the importance weights to account for multiple proposal distributions. Several weighting schemes will result in an unbiased estimate. Algorithm 7.4 implements multiple importance sampling with two different weighting schemes.⁴ The first weighting scheme is

    wi = p(τi) / qi(τi)    (7.13)

⁴ For a detailed discussion, see V. Elvira, L. Martino, D. Luengo, and M. F. Bugallo, “Generalized Multiple Importance Sampling,” Statistical Science, vol. 34, no. 1, pp. 129–155, 2019.


Example 7.4. Performance comparison of two hand-designed proposal distributions for the simple Gaussian problem where failures occur at values less than −2 (red region). The first plot shows the nominal distribution and two possible proposal distributions. The second plot shows the estimation error for direct estimation compared to the estimation error of importance sampling for the two distributions.

Consider the simple Gaussian problem shown below where failures occur at values less than −2 (red shaded region). The plot on the left shows two proposal distributions we could use for importance sampling. The first proposal distribution q1 shifts the nominal distribution toward the failure region and assigns high likelihood to likely failure trajectories. The second proposal distribution q2 is shifted toward the failure region, but it still does not assign high likelihood to likely failures. Therefore, we expect q1 to result in better estimates than q2.

The plot on the right shows the estimation error when performing importance sampling with each proposal distribution compared to direct estimation. The shaded region represents the 90 % empirical confidence bounds on the error. As expected, q1 results in a lower estimation error and a lower variance than q2 and direct estimation. Performing importance sampling with q2 results in worse performance than direct estimation. (Plots: densities p(τ), q1(τ), and q2(τ) versus τ; absolute estimation error versus number of samples.)


Algorithm 7.4. The multiple importance sampling algorithm for estimating the probability of failure. The algorithm generates a sample for each proposal distribution in qs. It then computes the importance weights using the weighting function provided and applies equation (7.9) to compute p̂fail. The smis and dmmis functions implement the s-MIS and DM-MIS weighting schemes, respectively.

struct MultipleImportanceSamplingEstimation
    p         # nominal distribution
    qs        # proposal distributions
    weighting # weighting scheme: ws = weighting(p, qs, τs)
end

smis(p, qs, τs) = [pdf(p, τ) / pdf(q, τ) for (q, τ) in zip(qs, τs)]
dmmis(p, qs, τs) = [pdf(p, τ) / mean(pdf(q, τ) for q in qs) for τ in τs]

function estimate(alg::MultipleImportanceSamplingEstimation, sys, ψ)
    p, qs, weighting = alg.p, alg.qs, alg.weighting
    τs = [rollout(sys, q) for q in qs]
    ws = weighting(p, qs, τs)
    return mean(w * isfailure(ψ, τ) for (w, τ) in zip(ws, τs))
end

where the weight for each sample is computed using only the proposal that was
used to generate it. This weighting scheme, which we refer to as standard MIS (s-
MIS), is most similar to the importance sampling estimator for a single proposal
distribution (equation (7.8)).
Instead of considering each proposal individually, we can also view the sam-
ples as if they were drawn in a deterministic order from a mixture distribution
composed of all proposal distributions. This paradigm leads to the deterministic
mixture weighting scheme (DM-MIS):

    wi = p(τi) / ((1/m) ∑_{j=1}^m qj(τi))    (7.14)
The denominator of equation (7.14) corresponds to the probability density of the


mixture distribution evaluated at τi .
Figure 7.3 visualizes the weighting schemes, and example 7.5 compares their
performance on a two-dimensional Gaussian problem. While both schemes are
unbiased, DM-MIS has been shown to have lower variance than s-MIS. However,
DM-MIS requires computing the likelihood of each sample under all proposal
distributions, which may be computationally expensive if the number of proposal
distributions is large.


Figure 7.3. Visualization of the two most common multiple importance sampling (MIS) weighting schemes. In this case, we have two proposal distributions, q1 and q2, and we want to determine the importance weight for τ2, which was sampled from q2. In s-MIS, we consider only q2, while in DM-MIS, we consider the mixture distribution qmix of q1 and q2. (Panels: proposal distributions, s-MIS, and DM-MIS.)

7.3 Adaptive Importance Sampling

Adaptive importance sampling algorithms automatically tune a proposal or set of


proposals to help alleviate the challenge of designing an effective set of proposals
by hand. These algorithms use samples to iteratively adapt the proposal distri-
butions to move toward the failure distribution. In this section, we present two
common adaptive importance sampling algorithms: the cross entropy method
and population Monte Carlo. The cross entropy method adapts a single proposal
distribution, while population Monte Carlo adapts a set of proposal distributions.

7.3.1 Cross Entropy Method


The cross entropy method iteratively fits a proposal distribution using samples.⁵ The algorithm requires selecting a form for the proposal distribution that is described by a set of parameters. A common choice is a multivariate Gaussian distribution, which is parameterized by a mean vector and a covariance matrix. The goal of the cross entropy method is to find the set of parameters that minimizes the cross entropy between the proposal distribution and the failure distribution. Cross entropy is an idea used in information theory that provides a measure of distance between two probability distributions.

⁵ For a detailed overview, see P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A Tutorial on the Cross-Entropy Method,” Annals of Operations Research, vol. 134, pp. 19–67, 2005.

We can determine an approximate solution to this minimization problem using samples drawn from an initial proposal distribution q. For many common distribution types, minimizing the cross entropy is equivalent to computing the weighted maximum likelihood estimate with weights based on the proposal


Example 7.5. Performance comparison of importance sampling to multiple importance sampling for a two-dimensional Gaussian problem. The plot on the left shows a single proposal distribution that can be used for importance sampling on this problem, while the plot on the right shows a set of proposal distributions that can be used for MIS. The plot below shows the estimation error for IS compared to the estimation error of MIS with the two different weighting schemes.

Suppose we want to estimate the probability of failure for the two-dimensional Gaussian system shown below. The nominal distribution is a multivariate Gaussian distribution with a mean at the center of the figure, and the failure region is composed of the two shaded red regions. The plots below show the log density of both distributions.

[Plots: Nominal Distribution and Failure Distribution over (τ1, τ2).]

Most of the probability mass for the failure distribution is concentrated in the central corners of the two modes, and a good proposal distribution should assign high likelihood in those areas. If we only use one multivariate Gaussian proposal distribution, we need to select a wide distribution to ensure that it covers both failure modes (left). We can improve performance by selecting multiple proposal distributions that together cover both failure modes (right). The plot in the caption compares the performance of importance sampling (IS) to multiple importance sampling with the two different weighting schemes. The shaded region represents the 90% empirical confidence bounds on the error. The DM-MIS weighting scheme results in better performance than the s-MIS weighting scheme.

[Plots: IS Proposal and MIS Proposals over (τ1, τ2); margin plot of estimation error (×10⁻⁶) versus number of samples for IS, s-MIS, and DM-MIS.]

density and failure density.⁶ In these cases, the weight for a given sample τ drawn from the distribution q is

    w = 1{τ ∉ ψ} p(τ) / q(τ)    (7.15)

where p is the nominal trajectory distribution.

⁶ Minimizing the cross entropy is equivalent to weighted maximum likelihood estimation for distributions in the natural exponential family. The natural exponential family includes many common distributions such as the Gaussian, geometric, exponential, categorical, and Beta distributions.

If failures are rare under the initial proposal distribution, it is possible that no samples will be failures, and the weights computed in equation (7.15) will all be zero. To address this challenge, the cross entropy algorithm iteratively solves a relaxed version of the problem that relies on an objective function f. Similar to the objective functions introduced in section 4.5, the objective function should assess how close a trajectory is to a failure.⁷ The objective value must be greater than zero for trajectories that are not failures and less than or equal to zero for failure trajectories. For systems with temporal logic specifications, we can use the robustness as the objective function.

⁷ For example, an objective function for the aircraft collision avoidance problem might output the miss distance between the two aircraft.

We can rewrite the goal of the cross entropy method in terms of the objective function as finding the set of parameters that minimizes the cross entropy between the proposal distribution and p(τ | f(τ) ≤ 0). For systems with rare failure events, we gradually make progress toward this goal by solving a series of relaxed problems where we instead minimize the cross entropy between the proposal and p(τ | f(τ) ≤ γ) for a given threshold γ > 0. The weights used in maximum likelihood estimation for the relaxed problem are

    w = 1{f(τ) ≤ γ} p(τ) / q(τ)    (7.16)

At each iteration, we select the threshold γ based on our current set of samples to ensure that a fraction of the weights will be nonzero (figure 7.4).

Figure 7.4. Threshold selection for a two-dimensional Gaussian problem with two failure modes. The red shaded region shows the failure region. None of the current samples overlap with the failure region, so we relax the problem by expanding to the blue region that contains a desired fraction of the samples. The blue samples are the top 10% of samples with the lowest objective values.

Algorithm 7.5 implements the cross entropy method. At each iteration, we draw samples from the current proposal distribution and compute their objective values. We then select the threshold γ as the highest objective value from a set of elite samples. The elite samples are the m_elite samples with the lowest objective values. Since our ultimate goal is to approach the failure distribution, we ensure that the threshold does not become negative by clipping it at zero. Given this threshold, we compute the weights using equation (7.16) and fit a new proposal distribution to the samples. After repeating this process for a fixed number of iterations, algorithm 7.5 performs importance sampling (algorithm 7.3) with the


Algorithm 7.5. The cross entropy method for estimating the probability of failure. At each iteration, the algorithm draws trajectory samples from the current proposal distribution and computes their objective values. It then sorts the samples by objective value and uses the m_elite samples with the lowest objective values to compute a threshold value. Using the threshold, the algorithm computes the weights and fits a new proposal distribution to the samples. The fit function is specific to the type of proposal distribution used and should perform weighted maximum likelihood estimation. After k_max iterations, the algorithm calls algorithm 7.3 to produce an estimate of the probability of failure using the final proposal distribution.

struct CrossEntropyEstimation
    p       # nominal trajectory distribution
    q₀      # initial proposal distribution
    f       # objective function f(τ, ψ)
    k_max   # number of iterations
    m       # number of samples per iteration
    m_elite # number of elite samples
end

function estimate(alg::CrossEntropyEstimation, sys, ψ)
    k_max, m, m_elite = alg.k_max, alg.m, alg.m_elite
    p, q, f = alg.p, alg.q₀, alg.f
    for k in 1:k_max
        τs = [rollout(sys, q) for i in 1:m]
        Y = [f(τ, ψ) for τ in τs]
        order = sortperm(Y)
        γ = max(0, Y[order[m_elite]])
        ps = [pdf(p, τ) for τ in τs]
        qs = [pdf(q, τ) for τ in τs]
        ws = ps ./ qs
        ws[Y .> γ] .= 0
        q = fit(typeof(q), τs, ws=ws)
    end
    return estimate(ImportanceSamplingEstimation(p, q, m), sys, ψ)
end
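The fit call is the only distribution-specific step in algorithm 7.5. As an illustrative sketch (not the book's implementation), weighted maximum likelihood estimation for a multivariate Gaussian reduces to a weighted mean and covariance; here each sample is assumed to be a plain vector, whereas a trajectory distribution would fit its parameters analogously.

# Hypothetical sketch of weighted maximum likelihood estimation for a
# multivariate Gaussian, playing the role of fit in algorithm 7.5. Assumes the
# Distributions package; xs is a vector of sample vectors, ws are nonnegative weights.
function fit_weighted_gaussian(xs, ws)
    w = ws ./ sum(ws)                                          # normalize the weights
    μ = sum(wᵢ * x for (wᵢ, x) in zip(w, xs))                   # weighted mean
    Σ = sum(wᵢ * (x - μ) * (x - μ)' for (wᵢ, x) in zip(w, xs))  # weighted covariance
    return MvNormal(μ, (Σ + Σ') / 2)                           # symmetrize for numerical safety
end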

[Figure 7.5 panels: Iteration 1, Iteration 2, Iteration 5, Iteration 20.] Figure 7.5. Visualization of the cross entropy method on a problem with a single failure mode (top row) and two failure modes (bottom row). The proposal distribution takes the form of a multivariate Gaussian distribution, and the blue samples are the elite samples for each iteration. For a single failure mode, the threshold reaches the failure region and is able to approximate the failure distribution. For two failure modes, the proposal distribution must become wide to capture both failure modes, and the algorithm does not perform as well.


final proposal distribution to produce an estimate of the probability of failure. Figure 7.6 demonstrates the progression of the algorithm on a simple problem.

Figure 7.6. The cross entropy method for a one-dimensional Gaussian problem. The plot shows the Gaussian proposal distribution at each iteration of the algorithm with darker distributions representing later iterations. The distributions start at the nominal distribution and gradually move toward the failure distribution (red).

Some implementations of the cross entropy method increase efficiency by using the samples produced across all iterations of the algorithm to estimate the probability of failure. They keep track of the proposal distribution used to generate the samples at each iteration and view the problem as an instance of MIS. They produce an estimate using the weighting schemes from section 7.2.3. It is important to note in this case, however, that the proposal for each iteration depends on the previous proposal. Since the proposals are not independent from one another, the DM-MIS weighting scheme will no longer be unbiased.

The performance of the cross entropy algorithm is sensitive to the form of the proposal distribution. The algorithm may perform poorly if the proposal distribution is not expressive enough to adequately capture the shape of the failure distribution. This behavior is particularly apparent for complex systems with high-dimensional, multimodal failure distributions. For example, if we select a Gaussian proposal distribution for a system with two failure modes, the algorithm will struggle to find a proposal distribution that captures both failure modes (figure 7.5). One solution is to use a mixture of Gaussians for multimodal failure distributions (figure 7.7), but this approach requires knowing the number of failure modes in advance, which is often not possible in practice. In these cases, an adaptive MIS approach such as population Monte Carlo may perform better.

Figure 7.7. Example of a Gaussian mixture model proposal distribution for a one-dimensional problem with two failure modes. The proposal distribution is a mixture of two Gaussians (blue) that approximates the multimodal failure distribution (red).
7.3.2 Population Monte Carlo

Population Monte Carlo (PMC) is an adaptive MIS algorithm that maintains a set, or population, of proposal distributions (algorithm 7.6).⁸ Figure 7.8 shows a single step of the algorithm. We begin with an initial population of m proposals that is spread across the space of proposal distributions. For example, we could use a set of multivariate Gaussian distributions with a fixed covariance and different means. It is important to ensure that the initial population is sufficiently diverse to capture all failure modes.

⁸ O. Cappé, A. Guillin, J.-M. Marin, and C. P. Robert, "Population Monte Carlo," Journal of Computational and Graphical Statistics, vol. 13, no. 4, pp. 907–929, 2004.

At each iteration, the algorithm draws a single sample from each proposal distribution in the population. It then computes a weight for each sample in the same way the weights are computed for the cross entropy method (equation (7.15)). Samples in regions of high likelihood under the failure distribution will receive higher weights. To adapt the proposal distributions, PMC uses the weights to


Algorithm 7.6. The population Monte Carlo algorithm for estimating the probability of failure. At each iteration, the algorithm draws trajectory samples from the proposal distributions in the population, computes their weights, and resamples to produce new proposal distributions. The proposal function is specific to the trajectory distribution and creates a proposal distribution from a sample. After k_max iterations, the algorithm calls algorithm 7.4 using the specified weighting scheme to produce an estimate of the probability of failure using the final set of proposal distributions.

struct PopulationMonteCarloEstimation
    p         # nominal trajectory distribution
    qs        # vector of initial proposal distributions
    weighting # weighting scheme: ws = weighting(p, qs, τs)
    k_max     # number of iterations
end

function estimate(alg::PopulationMonteCarloEstimation, sys, ψ)
    p, qs, weighting = alg.p, alg.qs, alg.weighting
    k_max, m = alg.k_max, length(qs)
    for k in 1:k_max
        τs = [rollout(sys, q) for q in qs]
        ws = [pdf(p, τ) * isfailure(ψ, τ) / pdf(q, τ)
              for (q, τ) in zip(qs, τs)]
        resampler = Categorical(ws ./ sum(ws))
        qs = [proposal(qs[i], τs[i]) for i in rand(resampler, m)]
    end
    mis = MultipleImportanceSamplingEstimation(p, qs, weighting)
    return estimate(mis, sys, ψ)
end

[Figure 7.8 panels: Initial Proposals, Sampling, Weighting, Resampling.] Figure 7.8. One iteration of the population Monte Carlo algorithm. For the weighting step, gray samples have zero weight, and the size of the blue samples is proportional to their weight. The resampling step produces new proposal distributions that are centered in high likelihood regions of the failure distribution.


perform a resampling step. In this step, we redraw m samples from the population
of samples with probability proportional to their weights. We then reconstruct the
population of proposal distributions using the resulting samples. For example,
if we are using proposals in the form of multivariate Gaussian distributions, we
could create new proposals with the same fixed covariance and means centered
at each sample.
Over time, the population of proposal distributions should cover high likeli-
hood regions of the failure distribution. After a fixed number of iterations, we
perform MIS using the final population to estimate the probability of failure. We
can use either of the weighting schemes from section 7.2.3 to produce the estimate.
Similar to the cross entropy method, we could instead use the samples produced
during all iterations of the algorithm to estimate the probability of failure, noting
that the estimate in this case may no longer be unbiased.
Using multiple proposal distributions allows us to represent complex, multi-
modal failure distributions. However, the performance of PMC is still dependent
on the number of proposal distributions and their ability to cover the space of
possible proposals. If the number of proposal distributions is too small or the ini-
tial proposals are not sufficiently diverse, the algorithm may miss failure modes
and produce an inaccurate estimate. Furthermore, the stochastic nature of the
resampling procedure can lead to a loss of diversity in the proposal distributions
over time. For example, the proposals may collapse to a single failure mode or a
subset of the failure modes.

7.4 Sequential Monte Carlo

The sampling, weighting, and resampling components of algorithm 7.6 form the basis for a more general framework used in the field of Bayesian inference called sequential Monte Carlo (SMC).⁹ In SMC, we start with samples from the nominal trajectory distribution and gradually adapt these samples to move toward the failure distribution. We then use the path of each sample to estimate the probability of failure.

⁹ SMC is also known as particle filtering in the context of state estimation. M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A Tutorial on Particle Filters for Online Nonlinear/non-Gaussian Bayesian Tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

One way to adapt the samples in SMC is to move them through a sequence of intermediate distributions that gradually transition from the nominal distribution to the failure distribution. Specifically, we create a sequence of distributions g1, g2, . . . , gn where g1 is the nominal trajectory distribution and gn is the failure distribution. Figure 7.9 illustrates two methods for selecting the intermediate


[Figure 7.9 panels: Smoothing, Thresholding.] Figure 7.9. Two methods for selecting intermediate distributions for SMC. The distributions gradually transition from the nominal trajectory distribution p(τ) (blue) to the failure distribution p(τ | τ ∉ ψ) (red). The first method uses the smoothing technique introduced in section 6.3.2. The second method uses the thresholding technique introduced in section 7.3.1 with objective function f(τ) = τ + 2.

distributions. The first method uses the smoothing technique introduced in section 6.3.2. We can move from the nominal distribution to the failure distribution by gradually decreasing the value of the standard deviation ε in the smoothed density.¹⁰ The second method uses the same thresholding technique used in the cross entropy method. The intermediate distributions take the form p(τ | f(τ) ≤ γ) where f(τ) is the objective function and γ is a threshold. We move from the nominal distribution to the failure distribution by gradually decreasing the value of γ.

¹⁰ A similar technique is the exponential tilting barrier presented in A. Sinha, M. O'Kelly, R. Tedrake, and J. C. Duchi, "Neural Bridge Sampling for Evaluating Safety-Critical Autonomous Systems," Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 6402–6416, 2020.

At each iteration of SMC, our goal is to transition samples from the current distribution gℓ to the next distribution in the sequence g_{ℓ+1}. We typically only have access to the unnormalized densities of the intermediate distributions in practice, so MCMC is commonly used to perform this transition. Specifically, we initialize an MCMC chain at each sample with g_{ℓ+1} as the target distribution and run the chain for a fixed number of iterations. We take the final sample in each chain to form the next set of samples. Figure 7.10 demonstrates this process.

To produce an estimate of the probability of failure from the MCMC samples, we derive a set of importance weights using the joint probability distribution over the path of each sample as the proposal distribution.¹¹ The importance weight of the ith trajectory after sampling from the distribution gℓ is given by

    w_i^{(ℓ)} = w_i^{(ℓ−1)} ḡ_{ℓ+1}(τ_i^{(ℓ)}) / ḡ_ℓ(τ_i^{(ℓ)})    (7.17)

where ḡℓ(τ) is equal to p(τ) when ℓ = 1 and the unnormalized density of the ℓth intermediate distribution otherwise.

¹¹ A full derivation can be found in F. Llorente, L. Martino, D. Delgado, and J. Lopez-Santiago, "Marginal Likelihood Computation for Model Selection and Hypothesis Testing: an Extensive Review," SIAM Review, vol. 65, no. 1, pp. 3–58, 2023.

We can obtain an estimate for the probability


Figure 7.10. Adaptation steps in SMC. The plot on the left shows the MCMC paths of a single sample as it transitions from the nominal distribution (blue) to the failure distribution through a set of smoothed intermediate distributions. The plot on the right shows this process on a set of samples initially drawn from the nominal distribution.

of failure using the mean of the weights at the final iteration:

    p̂_fail = (1/m) ∑_{i=1}^m w_i^{(n−1)}    (7.18)

The accuracy of the estimator in equation (7.18) depends on how well the samples at each iteration represent the corresponding intermediate distribution. If the samples are not representative, the weights will be small, and the estimator will be inaccurate. However, we may require a large number of MCMC steps to transition samples from one distribution to the next, especially for samples that are unlikely under the next distribution. One technique used to address this challenge is to resample the trajectories based on their importance weights.¹² This step is similar to the resampling step in PMC and tends to result in better coverage of the intermediate distributions (see example 7.6). After resampling, we reset the weights to the mean of the weights before resampling to ensure that the estimator in equation (7.18) remains accurate.

¹² P. Del Moral, A. Doucet, and A. Jasra, "Sequential Monte Carlo Samplers," Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 68, no. 3, pp. 411–436, 2006.

Algorithm 7.7 implements SMC given a nominal trajectory distribution and a set of intermediate distributions. At each iteration, it perturbs the current set of samples to represent the next distribution in the sequence using MCMC. Example 7.7 provides an implementation of this step for the inverted pendulum problem. The algorithm then updates the importance weights and performs the resampling step. Finally, it returns an estimate of the probability of failure based on equation (7.18).
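As an illustrative sketch of how the intermediate distributions might be constructed before calling algorithm 7.7, the function below builds unnormalized densities using the thresholding technique from figure 7.9; the objective function f, specification ψ, and threshold schedule are placeholders, and setting the first threshold to Inf recovers the nominal trajectory density.

# Sketch of unnormalized intermediate densities ḡ(τ) ∝ p(τ)1{f(τ, ψ) ≤ γ} built
# with the thresholding construction. The thresholds should decrease toward
# zero; algorithm 7.7 appends the failure density itself.
function threshold_densities(p, f, ψ, γs)
    return [τ -> pdf(p, τ) * (f(τ, ψ) ≤ γ) for γ in γs]
end

# For example (hypothetical thresholds): ḡs = threshold_densities(p, f, ψ, [Inf, 2.0, 1.0, 0.5, 0.1])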


Example 7.6. The benefit of resampling in SMC. The first set of plots illustrates the resampling step, and the second set of plots shows the improvement in the samples at the next iteration after performing resampling.

Consider the scenario shown below in which we want to transition samples from the blue distribution to the purple distribution using 10 MCMC steps per sample. The plots below illustrate the weighting and resampling steps. The plot in the middle shows the weights of the samples, with darker points having higher weights. The plot on the right shows the samples after resampling according to these weights. After resampling, the samples are more representative of the purple distribution.

[Plots: Sample, Weight, Resample over (τ1, τ2).]

On the next iteration, we perform MCMC starting at these samples with the purple distribution as the target to complete the transition. The plots below show the result of this step with and without resampling. The results without resampling start the MCMC chains at the blue samples shown above. The resampling step results in a set of samples that better represents the target distribution.

[Plots: Without Resampling and With Resampling.]


Algorithm 7.7. The sequential Monte Carlo algorithm for estimating the probability of failure. The algorithm iterates through intermediate distributions and perturbs the samples to represent the current distribution at each iteration. It then computes the weights according to the next distribution in the sequence, performs the resampling step, and resets the weights. The algorithm uses a system-specific perturb function to transition samples from one distribution to the next. It returns an estimate of the probability of failure based on equation (7.18).

struct SequentialMonteCarloEstimation
    p       # nominal trajectory distribution
    ḡs      # intermediate distributions
    perturb # τs′ = perturb(τs, ḡ)
    m       # number of samples
end

function estimate(alg::SequentialMonteCarloEstimation, sys, ψ)
    p, ḡs, perturb, m = alg.p, alg.ḡs, alg.perturb, alg.m
    p̄failure(τ) = isfailure(ψ, τ) * pdf(p, τ)
    τs = [rollout(sys, p) for i in 1:m]
    ws = [ḡs[1](τ) / pdf(p, τ) for τ in τs]
    for (ḡ, ḡ′) in zip(ḡs, [ḡs[2:end]...; p̄failure])
        τs′ = perturb(τs, ḡ)
        ws .*= [ḡ′(τ) / ḡ(τ) for τ in τs′]
        τs = τs′[rand(Categorical(ws ./ sum(ws)), m)]
        ws .= mean(ws)
    end
    return mean(ws)
end

Unlike the algorithms presented in section 7.3, algorithm 7.7 is nonparametric. We do not need to specify a parametric form for the intermediate distributions. Instead, we represent them using samples. This flexibility allows us to estimate the probability of failure for complex, multimodal failure distributions. However, SMC can run into the same potential problems as PMC. For example, if the MCMC is not run for long enough on each iteration, the samples may be inaccurate and miss potential failure modes. The resampling step can also cause a loss of diversity that may result in a collapse of the samples to a single mode. Other weighting and resampling schemes may be used to maintain diversity.¹³

¹³ One common resampling scheme that ensures that the samples remain diverse is called low variance resampling. More details can be found in Section 4.2.4 of S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. MIT Press, 2006.

7.5 Ratio of Normalizing Constants

Importance sampling is a special case of the more general problem of estimating the ratio of the normalizing constants of two distributions.¹⁴ By focusing on the more general problem, we can derive multiple extensions to importance sampling that allow us to use unnormalized proposal densities. Consider two probability distributions g1 and g2 with normalizing constants z1 and z2 such that g1(τ) = ḡ1(τ)/z1 and g2(τ) = ḡ2(τ)/z2. We have that z1 = ∫ ḡ1(τ) dτ and


Example 7.7. Application of SMC to the inverted pendulum problem. The plots show the samples from the intermediate smoothed failure distributions.

To estimate the probability of failure for the inverted pendulum system using SMC, we implement the following function that uses 10 MCMC steps to transition samples between intermediate distributions:

function perturb(samples, ḡ)
    function inverted_pendulum_kernel(τ; Σ=0.05^2 * I)
        μs, Σs = [step.x.xo for step in τ], [Σ for step in τ]
        return PendulumTrajectoryDistribution(τ[1].s, Σ, μs, Σs)
    end
    k_max, m_burnin, m_skip, new_samples = 10, 1, 1, []
    for sample in samples
        alg = MCMCSampling(ḡ, inverted_pendulum_kernel, sample,
                           k_max, m_burnin, m_skip)
        mcmc_samples = sample_failures(alg, inverted_pendulum, ψ)
        push!(new_samples, mcmc_samples[end])
    end
    return new_samples
end

We can use a set of smoothed failure distributions as the intermediate distributions. The plot below shows some of the samples from these intermediate distributions.

[Plots of θ (rad) versus time (s) for e = 0.5, e = 0.2, e = 0.1, and e = 0.01.]

Using 1,000 samples per iteration, SMC estimates the probability of failure to be approximately 0.0001. The direct estimate for the probability of failure based on one million simulations is approximately 0.0005.


z2 = ∫ ḡ2(τ) dτ, and our goal is to estimate the ratio of the normalizing constants z1/z2 using samples from g2.

¹⁴ A detailed survey of techniques for estimating the ratio of normalizing constants is provided in F. Llorente, L. Martino, D. Delgado, and J. Lopez-Santiago, "Marginal Likelihood Computation for Model Selection and Hypothesis Testing: an Extensive Review," SIAM Review, vol. 65, no. 1, pp. 3–58, 2023.

First, we rewrite z1 in terms of an expectation over g2:

    z1 = ∫ ḡ1(τ) dτ                                (7.19)
       = ∫ (g2(τ)/g2(τ)) ḡ1(τ) dτ                  (7.20)
       = ∫ g2(τ) ḡ1(τ)/(ḡ2(τ)/z2) dτ               (7.21)
       = z2 ∫ g2(τ) ḡ1(τ)/ḡ2(τ) dτ                 (7.22)
       = z2 E_{τ∼g2(·)}[ḡ1(τ)/ḡ2(τ)]               (7.23)
Dividing both sides of equation (7.23) by z2 gives us the ratio of the normalizing constants, which we can approximate using m samples from g2:

    z1/z2 = E_{τ∼g2(·)}[ḡ1(τ)/ḡ2(τ)] ≈ (1/m) ∑_{i=1}^m ḡ1(τi)/ḡ2(τi)    (7.24)

where τi ∼ g2(·) and g2(τ) > 0 whenever g1(τ) > 0. Note that the estimator in equation (7.24) only requires evaluating the unnormalized densities ḡ1(τ) and ḡ2(τ). Since pfail is the normalizing constant of the failure distribution, we can use equation (7.24) to estimate the probability of failure by setting ḡ1(τ) equal to the unnormalized failure density and ḡ2(τ) equal to any normalized proposal density q(τ). In fact, these choices of ḡ1(τ) and ḡ2(τ) cause equation (7.24) to reduce to the importance sampling estimator in equation (7.8) (see exercises 7.2 and 7.3).¹⁵

¹⁵ We could also use equation (7.24) to estimate the reciprocal of the probability of failure using samples from the failure distribution by setting ḡ2(τ) equal to the unnormalized failure density and ḡ1(τ) equal to any normalized density whose support is contained within the support of the failure distribution. However, selecting ḡ1(τ) to satisfy this condition is often difficult in practice and can lead to estimators with infinite variance. This technique is called reciprocal importance sampling. In general, this estimator should not be used for failure probability estimation.

If g1 and g2 have little overlap in terms of probability mass, the estimator in equation (7.24) may perform poorly. One technique to improve performance is called umbrella sampling (also known as ratio importance sampling). Umbrella sampling introduces a third density, called an umbrella density, that has significant overlap with both g1 and g2. We use this density to estimate the ratio of normalizing constants by applying equation (7.24) twice:

    z1/z2 = (z1/zu)/(z2/zu) = E_{τ∼gu(·)}[ḡ1(τ)/ḡu(τ)] / E_{τ∼gu(·)}[ḡ2(τ)/ḡu(τ)]
          ≈ ( (1/m) ∑_{i=1}^m ḡ1(τi)/ḡu(τi) ) / ( (1/m) ∑_{i=1}^m ḡ2(τi)/ḡu(τi) )    (7.25)


where ḡu is the unnormalized umbrella density, zu is its normalizing constant, and the m samples are drawn from gu(·). The optimal umbrella density is

    ḡu*(τ) ∝ |ḡ1(τ) − (z1/z2) ḡ2(τ)|    (7.26)
Similar to the optimal proposal for importance sampling, the optimal umbrella
density is expressed in terms of the quantity we are trying to estimate, so we
cannot compute it exactly. In general, we want to select an umbrella density that
is as close as possible to this density.
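A minimal sketch of the umbrella sampling estimator in equation (7.25) is shown below; it assumes that samples guτs have already been drawn from the normalized umbrella distribution (for example, with the MCMC methods of chapter 6) and that ḡ₁, ḡ₂, and ḡu are callable unnormalized densities.

# Sketch of the umbrella sampling estimator in equation (7.25). Assumes the
# Statistics package for mean; guτs are samples from the umbrella distribution.
function umbrella_sampling_estimator(guτs, ḡ₁, ḡ₂, ḡu)
    num = mean(ḡ₁(τ) / ḡu(τ) for τ in guτs)  # ≈ z₁/zu
    den = mean(ḡ₂(τ) / ḡu(τ) for τ in guτs)  # ≈ z₂/zu
    return num / den                         # ≈ z₁/z₂
end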
Another technique to estimate the ratio of normalizing constants when g1 and
g2 have little overlap is called bridge sampling. Similar to umbrella sampling, bridge
sampling introduces a third density called a bridge density. However, instead of
using samples from this density to estimate the ratio of normalizing constants,
bridge sampling uses samples from both g1 and g2 . Assuming we produce m1
samples from g1 and m2 samples from g2 , we again apply equation (7.24) twice
to obtain the bridge sampling estimator:

    z1/z2 = (zb/z2)/(zb/z1) = E_{τ∼g2(·)}[ḡb(τ)/ḡ2(τ)] / E_{τ∼g1(·)}[ḡb(τ)/ḡ1(τ)]
          ≈ ( (1/m2) ∑_{j=1}^{m2} ḡb(τj)/ḡ2(τj) ) / ( (1/m1) ∑_{i=1}^{m1} ḡb(τi)/ḡ1(τi) )    (7.27)

where τi ∼ g1 (·), τj ∼ g2 (·), and ḡb is the bridge density.


The optimal bridge density is

    ḡb*(τ) ∝ ḡ1(τ) ḡ2(τ) / ( m1 ḡ1(τ) + m2 (z1/z2) ḡ2(τ) )    (7.28)

which is again written in terms of the quantity we are trying to estimate. Given
samples from both g1 and g2 , we can use a simple iterative procedure to estimate
the optimal bridge density (algorithm 7.8). At each iteration, we apply equa-
tion (7.27) using the current bridge density to estimate the ratio of normalizing
constants. We then plug this ratio into equation (7.28) to obtain a new bridge
density. We repeat this process for a fixed number of iterations.
While umbrella sampling and bridge sampling both introduce a third density to
improve efficiency, they have different properties. For example, umbrella sampling
only requires samples from one density, while bridge sampling requires samples
from two different densities. Furthermore, the optimal umbrella density and the
optimal bridge density are very different (see figure 7.11). The optimal umbrella


Algorithm 7.8. Algorithm for estimating the optimal bridge density ḡb using samples from ḡ₁ and ḡ₂. We iteratively apply equation (7.27) to estimate the ratio of normalizing constants and use this ratio to update the bridge density using equation (7.28).

function bridge_sampling_estimator(g₁τs, ḡ₁, g₂τs, ḡ₂, ḡb)
    ḡ₁s, ḡ₂s = ḡ₁.(g₁τs), ḡ₂.(g₂τs)
    ḡb₁s, ḡb₂s = ḡb.(g₁τs), ḡb.(g₂τs)
    return mean(ḡb₂s ./ ḡ₂s) / mean(ḡb₁s ./ ḡ₁s)
end

function optimal_bridge(g₁τs, ḡ₁, g₂τs, ḡ₂, k_max)
    ratio = 1.0
    m₁, m₂ = length(g₁τs), length(g₂τs)
    ḡb(τ) = (ḡ₁(τ) * ḡ₂(τ)) / (m₁ * ḡ₁(τ) + m₂ * ratio * ḡ₂(τ))
    for k in 1:k_max
        ratio = bridge_sampling_estimator(g₁τs, ḡ₁, g₂τs, ḡ₂, ḡb)
    end
    return ḡb
end

[Figure 7.11 panels: Optimal Umbrella, Optimal Bridge.] Figure 7.11. Comparison of the optimal umbrella density and optimal bridge density for estimating the ratio of normalizing constants between two example distributions.

density covers regions of high likelihood for both distributions, while the optimal
bridge density bridges the gap between the two distributions.

7.5.1 Self-Normalized Importance Sampling


Self-normalized importance sampling (self-IS) is a special case of umbrella sampling
that can be used to estimate the probability of failure given samples from an unnor-
malized density. Specifically, we set ḡ1 (τ ) to be the unnormalized failure density,
ḡ2 (τ ) to be the nominal trajectory distribution, and ḡu (τ ) to be an unnormalized


proposal density q̄(τ ). These choices lead to the following estimator:


    pfail/1 ≈ ( (1/m) ∑_{i=1}^m 1{τi ∉ ψ} p(τi)/q̄(τi) ) / ( (1/m) ∑_{i=1}^m p(τi)/q̄(τi) )
            = ( (1/m) ∑_{i=1}^m wi 1{τi ∉ ψ} ) / ( (1/m) ∑_{i=1}^m wi )    (7.29)

where wi = p(τi )/q̄(τi ) and τi ∼ q(·). This estimator (algorithm 7.9) is similar to
the estimator in equation (7.9) for normalized proposal distributions with the
extra step of dividing the unnormalized importance weights by their sum.

Algorithm 7.9. The self-normalized importance sampling estimation algorithm for estimating the probability of failure. The algorithm takes as input an unnormalized proposal density q̄ along with samples drawn from it. It computes the importance weights for the samples and applies equation (7.29) to compute p̂fail.

struct SelfImportanceSamplingEstimation
    p    # nominal distribution
    q̄    # unnormalized proposal density
    q̄_τs # samples from q̄
end

function estimate(alg::SelfImportanceSamplingEstimation, sys, ψ)
    p, q̄, q̄_τs = alg.p, alg.q̄, alg.q̄_τs
    ws = [pdf(p, τ) / q̄(τ) for τ in q̄_τs]
    ws ./= sum(ws)
    return sum(w * isfailure(ψ, τ) for (w, τ) in zip(ws, q̄_τs))
end

The optimal proposal for self-IS is different from the optimal proposal for importance sampling. Based on equation (7.26), the optimal proposal for self-IS is

    q*(τ) ∝ p(τ) |1{τ ∉ ψ} − pfail|    (7.30)

Sampling from this density should result in half of the samples coming from the failure distribution and half coming from the success distribution. The optimal proposal for IS, on the other hand, is the failure distribution itself. Figure 7.12 shows the optimal proposal distribution for self-IS on a simple Gaussian system. In practice, we can plug a guess for pfail into equation (7.30) to obtain a proposal distribution that is close to the optimal proposal. However, drawing samples from this proposal is often difficult in practice, especially for systems with rare failure events and multiple failure modes (see example 7.8). Furthermore, the performance of the algorithm tends to be sensitive to incorrect guesses for pfail when creating the proposal distribution. Bridge sampling, which we discuss in the next section, is less sensitive to these choices.

Figure 7.12. The optimal proposal for self-normalized importance sampling for a simple Gaussian problem with a failure threshold of −1.
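To make this concrete, a near-optimal unnormalized proposal based on equation (7.30) can be written directly once we supply a guess α for the failure probability, as in example 7.8; the sketch below assumes that p supports pdf evaluation and that isfailure checks the specification, consistent with the rest of this chapter, and the resulting density would then be sampled with MCMC and passed to algorithm 7.9.

# Sketch of an unnormalized self-IS proposal based on equation (7.30), using a
# guess α for the failure probability. Samples from q̄ (e.g., drawn with MCMC)
# can be passed to algorithm 7.9 together with q̄ itself.
self_is_proposal(p, ψ, α) = τ -> pdf(p, τ) * abs(isfailure(ψ, τ) - α)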


Example 7.8. The challenges associated with sampling from the optimal self-IS proposal density. The plots show the proposal distribution for three different failure probabilities along with histograms of samples drawn from these distributions using MCMC. As the probability of failure decreases, the distributions become more difficult to accurately sample from.

Suppose we want to use self-IS to estimate the probability of failure for the simple Gaussian system. We know that the optimal proposal is of the form

    q*(τ) ∝ p(τ) |1{τ ∉ ψ} − α|

where α is our guess for the probability of failure. The plots below show the proposal distribution for three different values of α along with histograms of samples drawn from these distributions using MCMC. For each distribution, we use 5,100 MCMC steps with a burn-in of 100 steps, keeping every 10th sample.

[Plots over τ for α = 10⁻², α = 10⁻⁷, and α = 10⁻¹².]

As α decreases, the modes of the proposal distribution grow further apart, and the proposal distribution becomes more difficult to accurately sample from. For α = 10⁻¹², the MCMC misses the failure region entirely.


7.5.2 Bridge Sampling


We can use the bridge sampling estimator in equation (7.27) to estimate the
probability of failure, but it will be inefficient for systems with low failure prob-
abilities. In fact, if we follow the same steps we followed to derive the self-IS
estimator, we will arrive at an estimator that can perform no better than the direct
estimator in section 7.1. This property of the bridge sampling estimator is a result
of the optimal bridge density being zero for all samples that are not failures (see
example 7.9).
To improve performance, we can build upon ideas from SMC (section 7.4) by performing bridge sampling on a sequence of distributions that gradually transition from the nominal distribution to the failure distribution.¹⁶ Specifically, we create a sequence of n distributions and represent the ℓth distribution as gℓ(τ) = ḡℓ(τ)/zℓ. We set ḡ1(τ) equal to the density of the nominal trajectory distribution and ḡn(τ) equal to the unnormalized failure density.

¹⁶ A. Sinha, M. O'Kelly, R. Tedrake, and J. C. Duchi, "Neural Bridge Sampling for Evaluating Safety-Critical Autonomous Systems," Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 6402–6416, 2020.
constants:      L −1
pfail zn z2 z3 zL z
= = ··· = ∏ `+1 (7.31)
1 z1 z1 z2 z L −1 `=1
z`
These ratios can be estimated using the bridge sampling identity in equation (7.27):
1 m ḡb (τj )
n −1 m2 ∑ j=21 ḡ` (τj )
pfail ≈ ∏ 1 m ḡb (τi )
(7.32)
`=1 m
1
∑i=11 ḡ`+1 (τi )

where we draw m1 samples from g_{ℓ+1}(·) and m2 samples from gℓ(·). The intermediate distributions in the chain should be chosen such that the ratio of normalizing constants between two consecutive distributions is easy to estimate. In other words, consecutive intermediate distributions should have significant overlap with each other. For example, we can create the intermediate distributions using either of the two methods in figure 7.9.¹⁷

¹⁷ If we use the thresholding technique, the algorithm reduces to the multilevel splitting algorithm presented in section 7.6.

Algorithm 7.10 implements bridge sampling estimation using a sequence of intermediate distributions. It begins by drawing samples from the nominal trajectory distribution. At each iteration, it perturbs the samples to match the next intermediate distribution and estimates the optimal bridge density using algorithm 7.8. It then applies equation (7.32) to compute the ratio of normalizing constants between the two distributions. Finally, the algorithm applies equation (7.31) to compute an estimate for the probability of failure.


Example 7.9. Proof that a bridge sampling estimator that sets ḡ1(τ) to the unnormalized failure density and ḡ2(τ) to the nominal trajectory distribution can perform no better than the direct estimator for estimating the probability of failure.

To estimate the probability of failure from equation (7.27), we set ḡ1(τ) equal to the unnormalized failure density and ḡ2(τ) equal to the density of the nominal trajectory distribution:

    pfail/1 ≈ ( (1/m2) ∑_{j=1}^{m2} ḡb(τj)/p(τj) ) / ( (1/m1) ∑_{i=1}^{m1} ḡb(τi)/(1{τi ∉ ψ} p(τi)) )

Plugging in to equation (7.28) to get the optimal bridge density gives:

    gb*(τ) ∝ 1{τ ∉ ψ} p(τ)² / ( m1 1{τ ∉ ψ} p(τ) + m2 p(τ) ) ∝ { p(τ) if τ ∉ ψ; 0 otherwise }

The optimal bridge density is zero for all samples that are not failures. Since all the samples from the failure density will be failure samples, we have

    (1/m1) ∑_{i=1}^{m1} ḡb(τi)/(1{τi ∉ ψ} p(τi)) = (1/m1) ∑_{i=1}^{m1} p(τi)/p(τi) = 1

We also have that

    (1/m2) ∑_{j=1}^{m2} 1{τj ∉ ψ} p(τj)/p(τj) = n2/m2

where n2 is the number of samples from the nominal trajectory distribution that were failures. In this case, the bridge sampling estimator reduces to the direct estimator in equation (7.2). Since we produced this result using the optimal bridge density, we can conclude that this estimator will not perform any better than the direct estimator for estimating the probability of failure.


Algorithm 7.10. The bridge sampling algorithm for estimating the probability of failure. The algorithm takes as input a nominal trajectory distribution p and a sequence of intermediate densities ḡs. At each iteration, the algorithm performs resampling and uses the system-specific perturb function to produce samples from the next distribution. It then estimates the optimal bridge density and applies equation (7.27) from algorithm 7.8 to compute the ratio of normalizing constants. The final iteration draws samples from the failure distribution and produces an estimate of the failure probability.

struct BridgeSamplingEstimation
    p       # nominal trajectory distribution
    ḡs      # intermediate distributions
    perturb # samples′ = perturb(samples, ḡ′)
    m       # number of samples from each intermediate distribution
    kb      # number of iterations for estimating optimal bridge
end

function estimate(alg::BridgeSamplingEstimation, sys, ψ)
    p, ḡs, perturb, m, kb = alg.p, alg.ḡs, alg.perturb, alg.m, alg.kb
    p̄failure(τ) = isfailure(ψ, τ) * pdf(p, τ)
    p̄nominal(τ) = pdf(p, τ)
    τs = [rollout(sys, p) for i in 1:m]
    p̂fail = 1.0
    for (ḡ, ḡ′) in zip([p̄nominal; ḡs...], [ḡs...; p̄failure])
        ws = [ḡ′(τ) / ḡ(τ) for τ in τs]
        τs′ = τs[rand(Categorical(ws ./ sum(ws)), m)]
        τs′ = perturb(τs′, ḡ′)
        ḡb = optimal_bridge(τs′, ḡ′, τs, ḡ, kb)
        ratio = bridge_sampling_estimator(τs′, ḡ′, τs, ḡ, ḡb)
        p̂fail *= ratio
        τs = τs′
    end
    return p̂fail
end

The perturb step in algorithm 7.10 produces samples from the next distribution
in the sequence and can be performed using the MCMC algorithms presented in
chapter 6. However, these distributions may be difficult to sample from, especially
as we get closer to the failure distribution. In practice, we can greatly increase
efficiency by using the samples from the previous distribution as a starting point
to produce samples from the next distribution. This process is similar to the
process used in SMC (algorithm 7.7), in which we weight and resample the
trajectories from the previous distribution before applying MCMC. The ability
to adapt samples from the previous distribution is another benefit of using a
sequence of intermediate distributions. Figure 7.13 shows the samples from the
intermediate distributions for the continuum world problem.

7.6 Multilevel Splitting

Multilevel splitting algorithms estimate the probability of failure using a series of conditional distributions.¹⁸ Similar to the cross entropy method (section 7.3.1), multilevel splitting relies on an objective function f with properties such that the

¹⁸ H. Kahn and T. E. Harris, "Estimation of Particle Transmission by Random Sampling," National Bureau of Standards Applied Mathematics Series, vol. 12, pp. 27–30, 1951.

[Figure 7.13 panels: p(τ), e = 0.5, e = 0.2, e = 0.1, e = 0.01, p(τ | τ ∉ ψ).] Figure 7.13. Samples from the smoothed intermediate distributions when applying bridge sampling to estimate the probability of failure for the continuum world problem. These samples result in a failure probability estimate of 4.93 × 10⁻⁴. The direct estimate from one million samples is 2.47 × 10⁻⁴.

probability of failure can be written as p(f(τ) ≤ 0). Given a series of thresholds γ1 > γ2 > · · · > γn where γ1 = ∞ and γn = 0, we can write the probability of failure as

    pfail = p(f(τ) ≤ γn) = ∏_{ℓ=2}^{n} p(f(τ) ≤ γℓ | f(τ) ≤ γ_{ℓ−1})    (7.33)

As long as the thresholds gradually decrease in a way that ensures that the condi-
tional probabilities remain large, we can efficiently estimate these intermediate
probabilities using direct estimation.
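As a hypothetical illustration of the potential savings, suppose the thresholds are chosen so that each conditional probability in equation (7.33) is approximately 0.1 and n = 5. Then

    pfail = ∏_{ℓ=2}^{5} p(f(τ) ≤ γℓ | f(τ) ≤ γ_{ℓ−1}) ≈ 0.1⁴ = 10⁻⁴

and each factor can be estimated directly from a few hundred samples, whereas estimating pfail directly would require on the order of 10⁴ samples just to observe a single failure on average, and far more for an accurate estimate (see exercise 7.1).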
To ensure that the conditional probabilities remain large, it is common to select the thresholds adaptively.¹⁹ Algorithm 7.11 implements adaptive multilevel splitting. Adaptive multilevel splitting begins by drawing samples from the nominal trajectory distribution. At each iteration, it computes the objective value for each sample and selects a threshold γ such that a fixed number of samples have objective values less than γ. It then uses this threshold and the current samples to estimate p(f(τ) ≤ γℓ | f(τ) ≤ γ_{ℓ−1}).

¹⁹ F. Cérou and A. Guyader, "Adaptive Multilevel Splitting for Rare Event Analysis," Stochastic Analysis and Applications, vol. 25, no. 2, pp. 417–443, 2007.

The algorithm produces the next set of samples by perturbing the current samples to represent the distribution p(τ | f(τ) ≤ γℓ). As with SMC and bridge


Algorithm 7.11. The adaptive multilevel splitting algorithm for estimating the probability of failure. At each iteration, the algorithm computes the objective value for each sample and selects the next threshold. It then estimates the current conditional probability and perturbs the samples to represent the next distribution. The perturb function is system specific. The algorithm iterates until the threshold reaches zero. If we reach the maximum number of iterations before this criterion is met, the algorithm will force the final threshold to be zero.

struct AdaptiveMultilevelSplitting
    p       # nominal trajectory distribution
    m       # number of samples
    m_elite # number of elite samples
    k_max   # maximum number of iterations
    f       # objective function f(τ, ψ)
    perturb # τs′ = perturb(τs, p̄γ)
end

function estimate(alg::AdaptiveMultilevelSplitting, sys, ψ)
    p, m, m_elite, k_max = alg.p, alg.m, alg.m_elite, alg.k_max
    f, perturb = alg.f, alg.perturb
    τs = [rollout(sys, p) for i in 1:m]
    p̂fail = 1.0
    for i in 1:k_max
        Y = [f(τ, ψ) for τ in τs]
        order = sortperm(Y)
        γ = i == k_max ? 0 : max(0, Y[order[m_elite]])
        p̂fail *= mean(Y .≤ γ)
        γ == 0 && break
        τs = rand(τs[order[1:m_elite]], m)
        p̄γ(τ) = pdf(p, τ) * (f(τ, ψ) ≤ γ)
        τs = perturb(τs, p̄γ)
    end
    return p̂fail
end


[Figure 7.14 panels: Iteration 1, Iteration 4, Iteration 8, Iteration 12; θ (rad) versus time (s).] Figure 7.14. Adaptive multilevel splitting applied to the inverted pendulum system. The shaded blue region represents the region where the objective function is less than the current threshold. The samples gradually transition from the nominal trajectory distribution to the failure distribution as the algorithm progresses. The algorithm terminates on iteration 12 when all elite samples are failures.

sampling, this step can be performed using the MCMC algorithms presented in
chapter 6. To improve the efficiency of the MCMC, we first resample by drawing
m samples uniformly from the elite samples.
To accurately estimate the probability of failure, the last iteration of the algo-
rithm must use a threshold of zero. Algorithm 7.11 iterates until the threshold
reaches zero, at which point all elite samples are failures. If we reach the maxi-
mum number of iterations before this criterion is met, the algorithm will force
the final threshold to be zero. However, if there are no failure samples in the final
iteration, the final conditional probability will be zero, causing the algorithm
to return an estimate of zero. Therefore, it is important to ensure that we allow
enough iterations for the algorithm to reach the final threshold.
Multilevel splitting is considered a nonparametric algorithm in that we estimate
the probability of failure without assuming a specific form for the conditional
distributions. This feature allows multilevel splitting to extend to systems with
complex, multimodal failure distributions. Furthermore, the adaptive nature
of algorithm 7.11 allows us to smoothly transition from the nominal trajectory
distribution to the failure distribution without specifying the intermediate dis-
tributions ahead of time. Figure 7.14 shows an example of adaptive multilevel
splitting applied to the inverted pendulum system, which has two modes in its
failure distribution.

7.7 Summary

• We can view the problem of estimating the probability of failure as a parameter estimation problem and apply maximum likelihood or Bayesian methods.


• For systems with rare failure events, we can use importance sampling to estimate the probability of failure by sampling from a proposal distribution that assigns higher likelihood to failure trajectories.

• The performance of importance sampling algorithms depends on the choice of proposal distribution, and we want to select a proposal distribution that is as close as possible to the optimal proposal distribution, which is the failure distribution.

• We use adaptive importance sampling algorithms to automatically tune a proposal or set of proposals based on samples.

• Sequential Monte Carlo is a nonparametric algorithm that can be applied to estimate the probability of failure for complex, multimodal failure distributions.

• By viewing the failure probability estimation problem as a special case of a more general problem of estimating ratios of normalizing constants, we can derive estimators that allow us to use complex proposal distributions for which we only know the unnormalized density.

• Umbrella sampling and bridge sampling increase efficiency by introducing a third density into the ratio of normalizing constants.

• Multilevel splitting is a nonparametric algorithm that estimates the probability of failure using a series of conditional distributions.

7.8 Exercises
Exercise 7.1. The coefficient of variation of a random variable is defined as the ratio of
the standard deviation to the mean and is a measure of relative variability. Compute the
coefficient of variation for the estimator in equation (7.2). For a fixed sample size m, how
does the coefficient of variation change as pfail increases? For a fixed pfail , how does the
coefficient of variation change as the sample size m increases?

Solution: The coefficient of variation for the estimator in equation (7.2) is


    standard deviation / mean = √(pfail(1 − pfail)/m) / pfail = √((1 − pfail)/(m pfail))


For a fixed m, the coefficient of variation will decrease as the true probability of failure
pfail increases. For a fixed pfail , the coefficient of variation will decrease as the sample size
m increases. The plots here show an example of these relationships.

[Plots: coefficient of variation versus pfail for m = 100, and coefficient of variation versus m for pfail = 0.01.]

Exercise 7.2. Show that equation (7.24) reduces to equation (7.2) when q̄1 (τ ) = 1{τ 6∈
ψ} p(τ ) and q̄2 (τ ) = p(τ ).

Solution: Since q̄1 is the unnormalized failure distribution, its normalizing constant is the
probability of failure (z1 = pfail ). Since q̄2 is the normalized nominal distribution, its
normalizing constant is z2 = 1. Plugging these values into equation (7.24) gives

    pfail/1 = E_{τ∼p(·)}[1{τ ∉ ψ} p(τ)/p(τ)]
    pfail = E_{τ∼p(·)}[1{τ ∉ ψ}]

Given samples from the nominal distribution τi ∼ p(·), we can approximate the above
equation as
    p̂fail = (1/m) ∑_{i=1}^m 1{τi ∉ ψ}

Exercise 7.3. Show that equation (7.24) reduces to equation (7.8) when q̄1 (τ ) = 1{τ 6∈
ψ} p(τ ) and q̄2 (τ ) = q(τ ).

Solution: Since q̄1 is the unnormalized failure distribution, its normalizing constant is
the probability of failure (z1 = pfail ). Since q̄2 is a normalized proposal distribution, its
normalizing constant is z2 = 1. Plugging these values into equation (7.24) gives

    pfail/1 = E_{τ∼q(·)}[1{τ ∉ ψ} p(τ)/q(τ)]
    pfail = E_{τ∼q(·)}[1{τ ∉ ψ} p(τ)/q(τ)]


Given samples from the proposal distribution τi ∼ q(·), we can approximate the above
equation as
    p̂fail = (1/m) ∑_{i=1}^m 1{τi ∉ ψ} p(τi)/q(τi)

8 Reachability for Linear Systems

In chapters 4 to 7, we covered a variety of sampling-based validation algorithms.


We now transition to formal methods, which can provide mathematical guaran-
tees that a system satisfies a given specification. In contrast with sampling-based
algorithms, which evaluate properties based on a finite sampling of trajectories,
formal methods consider the full set of possible trajectories. We first focus on the
task of reachability. In this chapter, we discuss algorithms for forward reachability
of linear systems. Forward reachability algorithms start with a set of initial states
and compute the set of states that the system reaches as it progresses forward
in time. This chapter begins by defining linear systems and the corresponding
reachability problem. We then discuss set propagation techniques for computing
reachable sets. Set propagation techniques can be computationally expensive for
high-dimensional systems with long time horizons, so we also discuss overap-
proximation techniques that allow us to reduce the computational complexity.
Finally, we outline an optimization-based approach to reachability analysis.

8.1 Forward Reachability

Forward reachability algorithms compute the set of states a system could reach over a given time horizon. To perform this analysis, we need to make some assumptions about the initial state and disturbances for the system. In the previous chapters, we sampled initial states and disturbances from probability distributions, often with support over the entire real line. However, to perform reachability computations, we need to restrict the initial states and disturbances to bounded sets.¹

¹ One way to convert a probability distribution to a bounded set is to use the support of the distribution. If the support of the distribution spans the entire real line, we can select a region that contains most of the probability mass.

We assume that the initial state comes from a bounded set S and that the disturbances at each time step come from a bounded set X. The disturbance set

The disturbance set X is defined as follows:

X = { [xa; xo; xs] | xa ∈ Xa, xo ∈ Xo, xs ∈ Xs }   (8.1)

where Xa, Xs, and Xo are the disturbance sets for the agent, environment, and sensor, respectively.
Given an initial state s and a disturbance trajectory x1:d = (x1, . . . , xd) with depth d, we can compute the state of the system at time step d by performing a rollout (algorithm 4.6) and taking the final state. We denote this operation as sd = Reach(s, x1:d). By performing this operation on various initial states and disturbances sampled from S and X, we find a set of points in the state space that the system could reach at time step d. Figure 8.1 demonstrates this process on the mass-spring-damper system.

Figure 8.1. Samples from R5 for the mass-spring-damper system with initial position between −0.2 and 0.2 and initial velocity set to zero. The disturbance sets for the observation noise are bounded between −1 and 1. The gray points represent the initial states, the gray lines show the trajectories, and the blue points represent the states after 5 time steps.
We define the reachable set at depth d as the set of all states that the system could reach at time step d given all possible initial states and disturbances. We write this set as

Rd = {sd | sd = Reach(s, x1:d), s ∈ S, xt ∈ Xt, t ∈ 1 : d}   (8.2)

where Xt represents the set of possible disturbances at time step t. We are often interested in the full set of states that the system might reach in a given time horizon rather than at a specific depth d. We denote this set as R1:h and represent it as the union of the reachable sets at each depth up to the time horizon:

R1:h = ⋃_{d=1}^{h} Rd   (8.3)

Figure 8.2 shows the reachable sets in R1:4 for the mass-spring-damper system.
Computing reachable sets allows us to understand the behavior of a system over time. For example, we can use reachable sets to determine if a system remains within a safe region of the state space.2 We call the set of states that make up the unsafe region of the state space the avoid set and use this set to define a specification for the system (algorithm 8.1). If the reachable set intersects with the avoid set, the system violates the specification.

2. We could also determine if the system reaches a goal region in the state space. In this case, we would want to check if the reachable set is contained within the goal region.
Figure 8.2. The reachable sets that make up R1:4 for the mass-spring-damper system (one panel per depth, with position p on the horizontal axis and velocity v on the vertical axis). The blue points are samples from the reachable set generated using the Reach operator. The shaded blue regions show the true reachable sets, and the red regions make up the avoid set. Since the reachable sets do not intersect with the avoid set, the system satisfies the specification.

The reachability algorithms we discuss in this chapter apply to linear systems. Linear systems are a class of systems for which the sensor, agent, and environment models are linear functions of the continuous state s, action a, observation o, and disturbance x. We define these models mathematically as follows:

O(s, xo) = Os s + xo   (8.4)
π(o, xa) = Πo o + xa   (8.5)
T(s, a, xs) = Ts s + Ta a + xs   (8.6)

where Os, Πo, Ts, and Ta are matrices of the appropriate dimensions.3 Example 8.1 outlines a common linear system that we will refer to throughout this chapter.

3. We could also multiply the disturbances by matrices, but we omit this step for simplicity.

struct AvoidSetSpecification <: Specification
    set # avoid set
end
evaluate(ψ::AvoidSetSpecification, τ) = all(step.s ∉ ψ.set for step in τ)

Algorithm 8.1. A specification that checks if a trajectory avoids a given set. The set can be any type that supports the ∉ operator. A common package for defining sets in Julia is LazySets.jl. M. Forets and C. Schilling, “LazySets.jl: Scalable Symbolic-Numeric Set Computations,” Proceedings of the JuliaCon Conferences, vol. 1, no. 1, pp. 1–11, 2021.
A naïve approach to computing reachable sets would involve applying the
Reach operator to all possible initial states and disturbances. However, this ap-
proach is not feasible for systems with continuous states and disturbances since
there are an infinite number of possibilities. Instead, we rely on other mathemati-
cal analysis techniques to reason about the reachable set. The remainder of the
chapter discusses set propagation and optimization techniques that can be used
to compute reachable sets for linear systems.

8.2 Set Propagation Techniques

Set propagation techniques perform reachability by converting the operations in


equations (8.4) to (8.6) to set operations. To introduce these techniques, we will


Example 8.1. A common example of a linear system is the mass-spring-damper system, which can be used to model mechanical systems such as a car suspension or a bridge. The system consists of a mass m attached to a wall by a spring with spring constant k and a damper with damping coefficient c, and it is controlled by a force β applied to the mass. The state of the system is the position (relative to the resting point) p and velocity v of the mass (s = [p, v]), the action is the force β applied to the mass, and the observation is a noisy measurement of the state. The equations of motion for a mass-spring-damper system are

p′ = p + v ∆t
v′ = v + (−(k/m) p − (c/m) v + (1/m) β) ∆t

where m is the mass, k is the spring constant, c is the damping coefficient, and ∆t is the discrete time step. Rewriting the dynamics in the form of equation (8.6), we have

T(s, a, xs) = [1  ∆t; −(k/m)∆t  1 − (c/m)∆t] [p; v] + [0; (1/m)∆t] β + xs = Ts s + Ta a + xs

We control the mass-spring-damper using a proportional controller such that Πo in equation (8.5) is the gain matrix. Similarly, we model the sensor as an additive noise sensor such that Os in equation (8.4) is the identity matrix and xo is the additive noise distributed uniformly within specified bounds. Typically, trajectories for this system will oscillate back and forth before coming to rest. In general, we want to ensure that the system remains stable, meaning that the position does not exceed some magnitude. The accompanying plots show simulated trajectories of the system (position p in meters against time in seconds) for different levels of observation noise: −0.1 ≤ xo ≤ 0.1, −1 ≤ xo ≤ 1, and −10 ≤ xo ≤ 10. With enough noise, the system becomes unstable.


Figure 8.3. Visualization of linear set operations (a linear transformation of a set and the Minkowski sum of two sets). Applying a linear transformation has the effect of rotating, stretching, and translating the set. The Minkowski sum of two sets is the set of all points obtained by adding each point in the first set to each point in the second set.

first focus on set propagation for the one step reachability problem. We assume
we are given a set of initial states S and a set of disturbances X . Our goal is to
compute the set of states S 0 that the system could reach at the next time step.
Given a single initial state s and disturbance x, we can compute the next state s′ by applying equations (8.4) to (8.6) sequentially:

s′ = T(s, π(O(s, xo), xa), xs)   (8.7)
   = Ts s + Ta (Πo (Os s + xo) + xa) + xs   (8.8)
   = (Ts + Ta Πo Os) s + Ta Πo xo + Ta xa + xs   (8.9)

We can compute the reachable set at the next time step by applying equation (8.9)
to S and X . To perform this computation, we must define the operations in
equation (8.9) as set operations. In particular, we must be able to apply a linear
transformation, or matrix multiplication, to a set and add two sets together.
The multiplication of a set P by a matrix A is defined as

AP = {Ap | p ∈ P } (8.10)

where the result is the set of all points obtained by multiplying each point in P
by A. The addition of two sets P and Q is defined as

P ⊕ Q = {p + q | p ∈ P , q ∈ Q} (8.11)


where the result is the set of all points obtained by adding each point in P to each point in Q. This operation is referred to as the Minkowski sum of two sets and is often denoted using the ⊕ symbol.4 Figure 8.3 shows these operations in two-dimensional space. As we will discuss in the next section, we can efficiently compute linear transformations and Minkowski sums for many common set types such as hyperrectangles and polytopes.5

4. The Minkowski sum is named after German mathematician Hermann Minkowski (1864–1909).
5. The LazySets.jl package in Julia provides implementations of these operations for many common sets.

With these definitions in place, we can rewrite equation (8.9) using set operations as

S′ = (Ts + Ta Πo Os) S ⊕ Ta Πo Xo ⊕ Ta Xa ⊕ Xs   (8.12)

where S′ is the one step reachable set. It is important that we simplify the system dynamics into the form of equation (8.9) before applying set operations. If we apply the equations without simplification, we may encounter a phenomenon called the dependency effect, which occurs when a variable appears more than once in a formula. Set operations fail to model this dependency, leading to conservative reachable sets (see example 8.2). Algorithm 8.2 implements equation (8.12).
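To make the two set operations concrete, here is a minimal sketch using LazySets.jl; the particular sets and matrix are made up for illustration, and minkowski_sum is the concrete counterpart of the lazy ⊕ operator used in the algorithms below.

using LazySets

P = VPolytope([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])                       # a triangle
Q = VPolytope([[0.1, 0.1], [-0.1, 0.1], [0.1, -0.1], [-0.1, -0.1]])       # a small square
A = [1.0 2.0; 0.5 -1.0]

AP = linear_map(A, P)          # linear transformation of a set, equation (8.10)
S2 = minkowski_sum(AP, Q)      # Minkowski sum of two sets, equation (8.11)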

function get_matrices(sys)
    return Ts(sys.env), Ta(sys.env), Πo(sys.agent), Os(sys.sensor)
end

function linear_set_propagation(sys, 𝒮, 𝒳)
    Ts, Ta, Πo, Os = get_matrices(sys)
    return (Ts + Ta * Πo * Os) * 𝒮 ⊕ Ta * Πo * 𝒳.xo ⊕ Ta * 𝒳.xa ⊕ 𝒳.xs
end

Algorithm 8.2. Algorithm for computing the one step reachable set for a linear system with initial states from S and disturbances from X. We use the LazySets.jl package to perform the set operations in equation (8.12).

To compute reachable sets over a given time horizon using set propagation
techniques, we rely on the fact that the reachable set at time step d is a function of
the reachable set at time step d − 1. Specifically, we can compute the reachable set
at time step d by applying equation (8.12) to the reachable set at time step d − 1:

Rd = (Ts + Ta Πo Os )Rd−1 ⊕ Ta Πo Xo ⊕ Ta X a ⊕ Xs (8.13)

Algorithm 8.3 implements this recursive algorithm for computing the reachable
set at each time step. The algorithm terminates when it reaches the desired time
horizon h and returns R1:h .
In addition to gaining insight into the behavior of a system, we can use reach-
able sets to verify that a system satisfies a given specification. For a given spec-
ification, we want to ensure that the reachable set does not intersect with its


Example 8.2. An example of the dependency effect on a simple system. Consider a simple system with the following component models:

O(s, xo) = s
π(o, xa) = −I o
T(s, a, xs) = s + a

where the state, action, and observation are two-dimensional and I is the identity matrix. Suppose we want to compute the one-step reachable set S′ when the initial set is a square centered at the origin with side length 1. If we apply the sensor, agent, and environment models on the initial set without simplification, we get O = S, A = −I O = −I S, and S′ = S ⊕ A = S ⊕ −I S. The resulting set S ⊕ −I S is a square with side length 2 centered at the origin. However, if we first simplify before switching to set operations, we get that s′ = s − s = 0. Thus, the true reachable set contains only the origin. (The accompanying plots show the initial set S, the set S′ computed without simplification, and the true S′.)

This mismatch is due to an effect called the dependency effect, which leads
to conservative reachable sets. Because applying the set operations in order
does not account for the fact that the action depends on the state, it considers
worst-case behavior. For this reason, it is important to simplify the system
models into the form of equation (8.9) before applying set operations to avoid
unnecessary conservativeness. While this simplification is always possible
for linear systems, it is not always possible for the nonlinear systems we
discuss in the next chapter.


abstract type ReachabilityAlgorithm end

struct SetPropagation <: ReachabilityAlgorithm
    h # time horizon
end

function reachable(alg::SetPropagation, sys)
    h = alg.h
    𝒮, 𝒳 = 𝒮₁(sys.env), disturbance_set(sys)
    ℛ = 𝒮
    for t in 1:h
        𝒮 = linear_set_propagation(sys, 𝒮, 𝒳)
        ℛ = ℛ ∪ 𝒮
    end
    return ℛ
end

Algorithm 8.3. Linear forward reachability using set propagation. The 𝒮₁ and disturbance_set functions are system-specific functions that return the initial state set and disturbance set, respectively. We assume the disturbance set is the same at each time step. At each iteration, the algorithm computes the reachable set at the next time step by calling algorithm 8.2. It then adds this set to the union of reachable sets. The algorithm terminates when it reaches the desired time horizon h and returns R1:h.

¬(ψ::AvoidSetSpecification) = ψ.set

function satisfies(alg::SetPropagation, sys, ψ)
    ℛ = reachable(alg, sys)
    return isempty(ℛ ∩ ¬ψ)   # satisfied when the reachable set does not intersect the avoid set
end

Algorithm 8.4. Checking whether a system satisfies a given specification using set propagation. The algorithm computes the reachable set R1:h using algorithm 8.3 and checks whether its intersection with the avoid set ¬ψ is empty.

Example 8.3. Computing the reachable sets for the mass-spring-damper system over a time horizon of 20 steps. Suppose we want to compute R1:20 for the mass-spring-damper system with initial position between −0.2 and 0.2 and initial velocity set to zero. We assume the observation noise is bounded between −0.2 and 0.2. To perform reachability, we must implement the following functions for the system:

𝒮₁(env::MassSpringDamper) = Hyperrectangle(low=[-0.2,0], high=[0.2,0])
function disturbance_set(sys)
    Do = sys.sensor.Do
    low = [support(d).lb for d in Do.v]
    high = [support(d).ub for d in Do.v]
    return Disturbance(ZeroSet(1), ZeroSet(2), Hyperrectangle(;low,high))
end

The disturbance_set function uses the support of the disturbance distribution from the sensor to define the observation disturbance set. We use ZeroSet from LazySets.jl for the agent and environment disturbances since they are deterministic. We can then compute the reachable set using algorithm 8.3 and visualize the reachable sets in R1:20, which switch from light blue to dark blue over time in the accompanying plot of velocity against position.


Figure 8.4. Reachable sets (bottom row) for the mass-spring-damper system with varying levels of observation noise (−0.1 ≤ xo ≤ 0.1, −1.0 ≤ xo ≤ 1.0, and −2.5 ≤ xo ≤ 2.5) compared to samples from a finite number of trajectory rollouts (top row). As the noise bounds increase, the reachable sets move closer to the avoid set. For the largest noise bound, the reachable set intersects with the avoid set, indicating that the system violates the specification. However, the finite number of trajectory samples do not capture this behavior. Formal methods such as reachability are able to identify this violation by considering the entire reachable set.

complement such that

R1:h ∩ ¬ψ = ∅   (8.14)

For a specification that is defined as an avoid set, we can check if the system satisfies the specification by verifying that the reachable set does not intersect with the avoid set (algorithm 8.4). Figure 8.4 shows the reachable sets for the mass-spring-damper system with varying levels of observation noise compared to a finite sampling of reachable points. When the noise becomes large enough, the reachable sets intersect with the avoid set, indicating that the system violates the specification.
In general, the safety guarantee derived from equation (8.14) only holds up to the horizon h. In other words, there is no guarantee that the system will not enter the avoid set after the time horizon. However, if we observe certain convergence properties of the reachable set, we can extend the safety guarantee to infinite time. Specifically, if at any point in algorithm 8.3 we find that Rd ⊆ Rd−1 (figure 8.5), we can conclude that Rd is an invariant set, meaning that the system will stay within this set indefinitely.6

Figure 8.5. Example of an invariant set Rd such that Rd ⊆ Rd−1.

6. We can use LazySets.jl to check this property for convex sets. It is also the case that if Rd ⊆ R1:d−1, then R1:d is an invariant set. However, this property is generally more difficult to check since it requires checking whether a set is a subset of a nonconvex set.
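One possible way to add this infinite-time check is sketched below, reusing the 𝒮₁, disturbance_set, and linear_set_propagation functions from algorithms 8.2 and 8.3; this variant only tracks the per-step reachable set and is not the book's implementation.

function reachable_with_invariant_check(sys, h)
    𝒮, 𝒳 = 𝒮₁(sys.env), disturbance_set(sys)
    ℛprev = 𝒮
    for t in 1:h
        ℛnext = linear_set_propagation(sys, ℛprev, 𝒳)
        if ℛnext ⊆ ℛprev
            return ℛprev, true   # invariant set found; the safety guarantee extends to infinite time
        end
        ℛprev = ℛnext
    end
    return ℛprev, false          # no invariant set found within the horizon
end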

Figure 8.6. An example of a convex set and a nonconvex set. For the nonconvex set, it is possible to draw a line segment connecting two points in the set that is not entirely contained within the set.

8.3 Set Representations

To ensure that algorithms 8.2 to 8.4 are tractable, we must select set representations
that are computationally efficient. Desirable properties include:

• Finite representations: We should be able to specify the points that are contained
in the set without needing to enumerate all of them.

• Efficient set operations: We should be able to perform set operations such as


linear transformations, Minkowski sums, and intersection efficiently.

• Closure under set operations: A set representation is closed under a particular set
operation if applying the operation results in a set of the same type.

In this chapter, we will focus on convex set representations, which tend to have these properties.7 A convex set is a set for which a line segment drawn between any two points in the set is contained entirely within the set. Mathematically, a set P is convex if we have

αp + (1 − α)q ∈ P   (8.15)

for all p, q ∈ P and α ∈ [0, 1]. Figure 8.6 illustrates this property. The rest of this section discusses a common convex set representation called the polytope.

7. Some nonconvex sets can also be efficiently represented and manipulated. A detailed overview is provided in M. Althoff, G. Frehse, and A. Girard, “Set Propagation Techniques for Reachability Analysis,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, pp. 369–395, 2021.

8.3.1 Polytopes
A polytope is defined as the bounded intersection of a set of linear inequalities.8 A linear inequality has the form a⊤x ≤ b where a is a vector of coefficients, x is a vector of variables, and b is a scalar. We refer to the set of points that satisfy a given linear inequality as a half space. A polyhedron is the intersection of a finite number of half spaces. If the polyhedron is bounded, we call it a polytope. Figure 8.7 illustrates these concepts in two dimensions.

8. We can also define convex sets such as ellipsoids using nonlinear inequalities. O. Maler, “Computing Reachable Sets: An Introduction,” French National Center of Scientific Research, pp. 1–8, 2008.
illustrates these concepts in two dimensions.


Figure 8.7. Example of a half space, a polyhedron, and a polytope in two dimensions. A half space is defined by a single linear inequality, a polyhedron is the intersection of multiple half spaces, and a polytope is a bounded polyhedron.

We can represent polytopes in different ways. An H-polytope is a polytope represented as a set of half spaces. It is written in the form Ax ≤ b where A and b are formed by stacking the linear inequalities from the half spaces. A V-polytope is a polytope represented as the convex hull of a set of vertices V, written as conv(V). The convex hull of a set of points V is the set of all possible convex combinations of the points. A convex combination of a set of points {v1, . . . , vn} is a linear combination of the form

λ1 v1 + · · · + λn vn   (8.16)

such that ∑_{i=1}^n λi = 1 and λi ≥ 0 for all i. Intuitively, the convex hull of a set of points is the smallest convex set that contains all the points (figure 8.8).

Figure 8.8. The convex hull of a set of points.
It is always possible to convert between the two polytope representations; however, the calculation is nontrivial.9 Each representation has different advantages. For example, H-polytopes are more efficient for checking whether a point belongs to the set because we can simply check if it satisfies all the linear inequalities. In contrast, V-polytopes are more efficient for set operations such as linear transformations. To compute a linear transformation of a polytope represented as a V-polytope, we can apply the transformation to each vertex to obtain the vertices of the transformed polytope.

9. A detailed overview is provided in G. M. Ziegler, Lectures on Polytopes. Springer Science & Business Media, 2012, vol. 152. In Julia, LazySets.jl provides functionality to convert between the two representations.
of the transformed polytope.
The Minkowski sum of two V -polytopes is

P1 ⊕ P2 = conv({v1 + v2 | v1 ∈ V1 , v2 ∈ V2 }) (8.17)

where V1 and V2 are the vertices of P1 and P2 , respectively. In other words, we


can obtain all candidates for the vertices of the Minkowski sum by taking the sum


Figure 8.9. Computing the Minkowski sum P1 ⊕ P2 of two V-polytopes P1 and P2.
of all pairs of vertices from the two polytopes. To determine which candidates are
actual vertices, we must determine which candidate vertices are on the boundary
of the convex hull. Figure 8.9 illustrates this process.
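As a small numerical illustration of equation (8.17), here is a sketch using LazySets.jl; the two triangles are made up. The candidate vertices are the pairwise sums of vertices, and the actual vertices are recovered with a convex hull.

using LazySets

P1 = VPolytope([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
P2 = VPolytope([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]])

cands = [v1 + v2 for v1 in vertices_list(P1), v2 in vertices_list(P2)]   # candidate vertices
Psum  = VPolytope(convex_hull(vec(cands)))                                # equation (8.17)
# LazySets also provides this operation directly: minkowski_sum(P1, P2)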
We can use these results to reason about the complexity of algorithm 8.3 if we were to represent our sets as polytopes. We apply equation (8.12) at each iteration, which involves four linear transformations and three Minkowski sums. The number of candidate vertices resulting from computing the one step reachable set using equation (8.12) is |S1||Xo||Xa||Xs| where |P| represents the number of vertices in polytope P. The number of candidate vertices for the reachable set at depth d is then |S1|(|Xo||Xa||Xs|)^d. We can prune the candidate vertices that are not actual vertices by computing the convex hull of the candidate vertices, but this operation can be expensive.10 Therefore, the exponential growth in the number of candidate vertices creates tractability challenges for high-dimensional systems with long time horizons.11

10. The most efficient algorithms for computing the vertices of the convex hull of a set of points have a complexity of O(mv) where m is the number of candidate vertices and v is the number of actual vertices. In general, the number of actual vertices grows superlinearly. For more details, see R. Seidel, “Convex Hull Computations,” in Handbook of Discrete and Computational Geometry, Chapman and Hall, 2017, pp. 687–703.
11. Other polytope representations such as the Z-representation and M-representation perform Minkowski sums more efficiently. More details are provided in S. Sigl and M. Althoff, “M-Representation of Polytopes,” ArXiv:2303.05173, 2023.

8.3.2 Zonotopes

A zonotope is a special type of polytope that avoids the exponential growth in candidate vertices for Minkowski sums. It is defined as the Minkowski sum of a set of line segments centered at a point c:

Z = {c + ∑_{i=1}^m αi gi | αi ∈ [−1, 1]}   (8.18)

where g1:m are referred to as the generators of the zonotope.12 We represent zonotopes by a center point and list of generators:

Z = (c, ⟨g1:m⟩)   (8.19)

12. Zonotopes can also be viewed as linear transformations of the unit hypercube.


Figure 8.10. Iterative construction of a zonotope centered at a point c by taking the Minkowski sum of its generators.

Figure 8.10 shows the construction of a zonotope from its generators.


To apply a linear transformation to a zonotope, we apply the transformation to
the center and each generator:

AZ = (Ac, hAg1:m i) (8.20)

To compute the Minkowski sum of two zonotopes, we sum the centers and con-
catenate the generators:

Z ⊕ Z′ = (c + c′, ⟨g1:m, g′1:m′⟩)   (8.21)

Note that the number of generators in the resulting zonotope grows linearly with the number of generators in each zonotope. Therefore, if we represent our sets as zonotopes, the number of generators for the reachable set at depth d is |S1| + d(|Xo| + |Xa| + |Xs|), where |Z| here denotes the number of generators of zonotope Z. This linear growth represents a significant improvement over the exponential growth in candidate vertices for generic polytopes.
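The zonotope operations above are cheap to carry out in code. The following is a minimal sketch using LazySets.jl with made-up centers and generators (generators are stored as matrix columns):

using LazySets

Z1 = Zonotope([0.0, 0.0], [1.0 0.0; 0.0 1.0])
Z2 = Zonotope([1.0, 0.0], [0.5 0.2; 0.0 0.3])
A  = [1.0 0.5; 0.0 1.0]

AZ = linear_map(A, Z1)          # equation (8.20): transform the center and each generator
Zs = minkowski_sum(AZ, Z2)      # equation (8.21): add centers, concatenate generators
ngens(Zs)                        # generator count grows additively (here, 4)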

8.3.3 Hyperrectangles
A hyperrectangle is a generalization of a rectangle to higher dimensions (figure 8.12). It is a special type of zonotope in which the generators are aligned with the axes. We may also work with linear transformations of hyperrectangles, which can always be transformed back to an axis-aligned representation. All hyperrectangles are zonotopes, and all zonotopes are polytopes; however, the reverse does not hold (figure 8.11). Hyperrectangles can be compactly represented as a center point and a vector of half-widths. They can also be represented as a set of intervals with one for each dimension. Unlike zonotopes, hyperrectangles are not closed under linear transformations and Minkowski sums.

Figure 8.11. Zonotopes are a subclass of polytopes, and hyperrectangles are a subclass of zonotopes.
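For instance, a hyperrectangle can be built from its center and half-widths or from per-dimension bounds (a sketch using LazySets.jl with arbitrary numbers):

using LazySets

H1 = Hyperrectangle([0.0, 0.0], [1.0, 0.5])             # center and half-widths
H2 = Hyperrectangle(low=[-1.0, -0.5], high=[1.0, 0.5])  # the same set, built from bounds
low(H1), high(H1)                                        # per-dimension interval bounds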


Figure 8.12. Example of a hyperrectangle in one, two, and three dimensions. In one dimension, a hyperrectangle is equivalent to an interval.

8.4 Reducing Computational Cost

As noted in section 8.3.1, the number of candidate vertices for the reachable
sets in algorithm 8.3 grows exponentially with the time horizon and causes
computational challenges for high-dimensional systems. There are multiple ways
to reduce this computational burden. One way is to represent the initial state and
disturbance sets using zonotopes since the number of generators scales linearly
with the time horizon (see section 8.3.2). In this section, we will discuss another
technique to reduce the computational cost that relies on overapproximation.
The set P̃ represents an overapproximation of the set P if P ⊆ P̃ . Typically, we
select the overapproximated set P̃ such that it is easier to compute or represent.
For example, we can use overapproximation to reduce the computational cost of
algorithm 8.3 by overapproximating the reachable set at each iteration with a set
that has fewer vertices (figure 8.13). We can then use this overapproximated set
as the initial set for the next iteration.
As long as the overapproximated reachable set does not intersect with the avoid set, we can still use it to make claims about the safety of the system. However, if the overapproximated reachable set does intersect with the avoid set, the results are inconclusive. The violation could be due to unsafe behavior or the overapproximation itself. In this case, we could move to a tighter overapproximation or use a different method to verify safety (example 8.4).

Figure 8.13. Overapproximating the blue polytope with the purple polytope. The purple polytope has fewer vertices.
Algorithm 8.5 modifies algorithm 8.3 to include overapproximation. Depend-
ing on the complexity of the reachable sets, we may not need to overapproximate
at every iteration, so we set a frequency parameter to control how often we overap-
proximate. Figure 8.14 demonstrates this idea on the mass-spring-damper system.
A more frequent overapproximation will result in greater computational efficiency
at the cost of extra overapproximation error in the reachable sets. We define over-
approximation error as the difference in volume between the overapproximated
reachable set and the true reachable set.
The overapproximation tolerance ϵ places a bound on the Hausdorff distance between the overapproximated set and the original set.13

13. The Hausdorff distance is named after German mathematician Felix Hausdorff (1868–1942).


struct OverapproximateSetPropagation <: ReachabilityAlgorithm
    h    # time horizon
    freq # overapproximation frequency
    ϵ    # overapproximation tolerance
end

function reachable(alg::OverapproximateSetPropagation, sys)
    h, freq, ϵ = alg.h, alg.freq, alg.ϵ
    𝒮, 𝒳 = 𝒮₁(sys.env), disturbance_set(sys)
    ℛ = 𝒮
    for t in 1:h
        𝒮 = linear_set_propagation(sys, 𝒮, 𝒳)
        ℛ = ℛ ∪ 𝒮
        𝒮 = t % freq == 0 ? overapproximate(𝒮, ϵ) : 𝒮
    end
    return ℛ
end

Algorithm 8.5. Overapproximate linear forward reachability using set propagation. At each iteration, the algorithm calls algorithm 8.2 to compute the reachable set at the next time step. If the current time step matches up with the overapproximation frequency, the algorithm calls the overapproximate function from LazySets.jl to compute an ϵ-close overapproximation of the reachable set for use at the next time step. Section 8.4.2 describes how the overapproximation function works.

Figure 8.14. An overapproximation of R1:4 for the mass-spring-damper system. We reduce the number of vertices in R3 by overapproximating it with the purple polytope. This overapproximation results in fewer vertices for R4 but causes it to produce a conservative estimate of the reachable set.

The Hausdorff distance between two sets P and P̃ is the maximum distance from a point in P to the nearest point in P̃. A lower value for ϵ results in a less conservative overapproximation but may require more computation and result in a more complex representation. The rest of this section discusses a technique for computing this overapproximation.

8.4.1 Support Functions


We can overapproximate convex sets by sampling their support function. The support function ρ of a set P ⊂ Rⁿ is defined as

ρ(d) = max_{p∈P} d⊤p   (8.22)

where d is a direction vector.


Example 8.4. The effect of overapproximation on accuracy and computational cost for the mass-spring-damper system. Suppose we want to determine if the mass-spring-damper system could reach the avoid set within 40 time steps. To reduce computational cost, we use algorithm 8.5 with an overapproximation frequency of 5 time steps. The accompanying plots show the reachable set R1:40 using three different overapproximation tolerances (ϵ = 0, ϵ = 0.001, and ϵ = 1), along with the number of vertices in Rd at each depth d. If the tolerance is too high, the reachable set may overlap with the avoid set, and the analysis is inconclusive.
The first tolerance of ϵ = 0 results in no overapproximation, but the number of vertices grows quickly. The highest tolerance of ϵ = 1 results in significantly fewer vertices, but it is too conservative to the point where the reachable set overlaps with the avoid set. Therefore, the results of the analysis with ϵ = 1 are inconclusive. The middle tolerance of ϵ = 0.001 strikes a balance between the two extremes. With this tolerance, we are still able to verify safety while reducing the computational cost.


Figure 8.15. Overapproximating a polytope by evaluating its support function in various directions (bounding half space, bounding polyhedron, and bounding polytope). When we only use one direction (left), the overapproximation is a half space. By using multiple directions (center), we can construct a polyhedral overapproximation. However, the two vectors only positively span the shaded cone, resulting in an unbounded overapproximation. By adding a third direction (right), we can construct a set of directions that positively span R² and produce a bounded overapproximate set.
The maximizer of the support function is called the support vector:

σ(d) = arg max_{p∈P} d⊤p   (8.23)

For polytopes, there will always be a support vector in a given direction d that corresponds to one of its vertices (figure 8.16). In fact, evaluating the support function of a V-polytope at a direction d involves computing d⊤v for each vertex v ∈ V and taking the maximum. Evaluating the support function of an H-polytope requires solving a linear program. The support function of a zonotope can be computed in closed form as a function of its generators.14
The support function of a set P can be used to define a half space that contains the set:

{p | d⊤p ≤ ρ(d)}   (8.24)

By evaluating the support function on a set of directions D = d1:m and taking the intersection of the resulting half spaces, we obtain a polyhedral overapproximation of the set P:

P̃ = ⋂_{d∈D} {p | d⊤p ≤ ρ(d)}   (8.25)

For the overapproximation to be a polytope, the set D must be a positive spanning set. The set of directions D represents a positive spanning set if we can construct any point in Rⁿ as a nonnegative linear combination of the directions in D.15 Figure 8.15 demonstrates this concept in R².

Figure 8.16. Support vector of a polytope in a given direction d. The support vector is the vertex of the polytope that maximizes the support function.

14. M. Althoff and G. Frehse, “Combining Zonotopes and Support Functions for Efficient Reachability Analysis of Linear Systems,” in IEEE Conference on Decision and Control (CDC), 2016.
15. R. G. Regis, “On the Properties of Positive Spanning Sets and Positive Bases,” Optimization and Engineering, vol. 17, no. 1, pp. 229–262, 2016.
ing, vol. 17, no. 1, pp. 229–262, 2016.


The choice of directions in D affects the tightness of the overapproximation. In general, adding more direction vectors to D will decrease overapproximation error. As we approach all possible direction vectors, the overapproximation converges to the set itself. However, more direction vectors will require more computation to create the overapproximated set and will result in a more complex overapproximate representation.
We want to select the directions in D to balance between overapproximation error and computational cost. If we have no prior information about the shape of the set, a common choice is to add a direction in the positive and negative direction of each axis. This choice will result in a hyperrectangular overapproximation. However, other choices of directions may result in a tighter overapproximation (figure 8.17). Section 8.4.2 discusses an iterative algorithm for intelligently selecting these directions.

Figure 8.17. Two different overapproximations of the blue polytope that each use four evaluations of the support function. The first overapproximation evaluates the support function in the positive and negative directions of the axes. The second overapproximation uses the directions of the diagonals of the unit square. The choice of directions affects the tightness of the overapproximation.

8.4.2 Iterative Refinement


One way to select the directions in D is to use a process called iterative refinement.16

16. This method is implemented in the LazySets.jl package as the overapproximate function. For more details, see G. K. Kamenev, “An Algorithm for Approximating Polyhedra,” Computational Mathematics and Mathematical Physics, vol. 4, no. 36, pp. 533–544, 1996.

The algorithm proceeds as follows:

1. Begin with a positive spanning set of template directions D and compute the corresponding overapproximate polytope by evaluating the support function in each direction. A common choice is the positive and negative directions of the axes.

2. Compute an inner approximation by taking the convex hull of the correspond-


ing support vectors.

3. Compute the distance between each facet of the inner approximation and the
nearest vertex of the outer approximation.

4. Add the direction of the face that is furthest from the nearest vertex to D and
return to step 1.

The process is repeated until the maximum distance between the inner and outer approximations is less than a specified tolerance ϵ. Figure 8.18 shows the steps
involved in a single iteration of the algorithm, and figure 8.19 demonstrates the
process over multiple iterations.
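In LazySets.jl, this refinement is available through the overapproximate function for two-dimensional sets, where the tolerance bounds the Hausdorff distance of the resulting polygon. A minimal sketch with an arbitrary set and tolerance (the exact method signature may vary across package versions):

using LazySets

Z = Zonotope([0.0, 0.0], [1.0 0.5 0.2; 0.0 1.0 0.4])
P = overapproximate(Z, 0.01)   # ϵ-close polygonal overapproximation of Z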


Figure 8.18. Illustration of the steps involved in a single iteration of the iterative refinement algorithm. In this example, the initial overapproximation is a hyperrectangle. The distance between the inner and outer approximations is computed for each face of the inner approximation. The direction of the face that is furthest from the nearest vertex (purple arrow) is added to the template directions.

Figure 8.19. The resulting overapproximated polytope for various iterations of the iterative refinement algorithm. The Hausdorff distance between the overapproximated set and the true set decreases with each iteration until it is within the specified tolerance ϵ = 0.7.

8.5 Linear Programming

Another technique for computing overapproximate reachable sets of linear sys-


tems is to directly evaluate the support function of the reachable set at a desired
depth d:
ρd(d) = max_{s∈Rd} d⊤s   (8.26)

Similar to the support function of a polytope, the support function of a reachable


set can be used to construct an overapproximation of the reachable set. We form
the overapproximation by evaluating the support function in a set of directions
D and taking the intersection of the resulting half spaces.
We can solve the optimization problem in equation (8.26) using a linear program
solver. A linear program is an optimization problem where the objective function


and constraints are all linear. The linear program for equation (8.26) is

maximize_{s1:d, x1:d}   d⊤sd
subject to   s1 ∈ S
             xt ∈ Xt for all t ∈ 1 : d
             st+1 = Step(st, xt) for all t ∈ 1 : d − 1   (8.27)

where

Step(s, x) = (Ts + Ta Πo Os) s + Ta Πo xo + Ta xa + xs   (8.28)

The decision variables in equation (8.27) are the state and disturbances at each time step. The constraints enforce that the state and disturbances are within their respective sets and that the state evolves according to equation (8.9). The optimization problem in equation (8.27) can be solved efficiently using a variety of algorithms.17

17. Modern linear programming solvers can solve problems with thousands of variables and constraints. H. Karloff, Linear Programming. Springer, 2008.

For the optimization problem in equation (8.27) to be a linear program, the sets S and Xt must be polytopes. We can write them as a set of linear inequalities using their H-polytope representations. Algorithm 8.6 implements the linear
program for computing the support function of a reachable set at a particular
depth d. Given a desired time horizon h and a set of directions D , we can compute
an overapproximation of R1:h by evaluating the support function at each direction
for each depth. Algorithm 8.7 implements this process.
Similar to the polytope overapproximation in section 8.4, the choice of the directions in D affects the tightness of the reachable set overapproximation. We could select the directions to align with the axes or use more sophisticated methods like the iterative refinement algorithm in section 8.4.2. Since linear program solvers are computationally efficient, another option is to simply evaluate the support function at many randomly sampled directions. We could also select the directions using trajectory samples. Given a set of samples from the reachable set, we can use principal component analysis (PCA)18 to determine the directions that best capture the shape of the set.19

18. H. Abdi and L. J. Williams, “Principal Component Analysis,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.
19. O. Stursberg and B. H. Krogh, “Efficient Representation and Computation of Reachable Sets for Hybrid Systems,” in Hybrid Systems: Computation and Control, 2003.

The overapproximate reachable sets improve our understanding of the behavior of the system. However, if our ultimate goal is to check intersection with a convex avoid set U, we can solve the problem exactly without the need for


Ab(𝒫) = tosimplehrep(constraints_list(𝒫))

function constrained_model(sys, d, 𝒮, 𝒳)
    model = Model(SCS.Optimizer)
    @variable(model, 𝐬[1:dim(𝒮),1:d])
    @variable(model, 𝐱o[1:dim(𝒳.xo),1:d])
    @variable(model, 𝐱s[1:dim(𝒳.xs),1:d])
    @variable(model, 𝐱a[1:dim(𝒳.xa),1:d])

    As, bs = Ab(𝒮)
    (Axo, bxo), (Axs, bxs), (Axa, bxa) = Ab(𝒳.xo), Ab(𝒳.xs), Ab(𝒳.xa)
    @constraint(model, As * 𝐬[:, 1] .≤ bs)
    for i in 1:d
        @constraint(model, Axo * 𝐱o[:, i] .≤ bxo)
        @constraint(model, Axs * 𝐱s[:, i] .≤ bxs)
        @constraint(model, Axa * 𝐱a[:, i] .≤ bxa)
    end

    Ts, Ta, Πo, Os = get_matrices(sys)
    for i in 1:d-1
        @constraint(model, (Ts + Ta*Πo*Os) * 𝐬[:, i] + Ta*Πo * 𝐱o[:, i]
                            + Ta * 𝐱a[:, i] + 𝐱s[:, i] .== 𝐬[:, i+1])
    end
    return model
end

function ρ(model, 𝐝, d)
    𝐬 = model.obj_dict[:𝐬]
    @objective(model, Max, 𝐝' * 𝐬[:, d])
    optimize!(model)
    return objective_value(model)
end

Algorithm 8.6. Computing the support function of a reachable set at a desired depth d. The constrained_model function constructs an optimization model with the constraints in equation (8.27) that is compatible with the JuMP.jl package. It uses the Ab function to convert a polytope to a set of linear inequalities. Given this model and a direction vector 𝐝, the ρ function solves the linear program and returns the value of the support function.


struct LinearProgramming <: ReachabilityAlgorithm
    h   # time horizon
    𝒟   # set of directions to evaluate support function
    tol # tolerance for checking satisfaction
end

function reachable(alg::LinearProgramming, sys)
    h, 𝒟 = alg.h, alg.𝒟
    𝒮, 𝒳 = 𝒮₁(sys.env), disturbance_set(sys)
    ℛ = 𝒮
    for d in 2:h
        model = constrained_model(sys, d, 𝒮, 𝒳)
        ρs = [ρ(model, 𝐝, d) for 𝐝 in 𝒟]
        ℛ = ℛ ∪ HPolytope([HalfSpace(𝐝, ρ) for (𝐝, ρ) in zip(𝒟, ρs)])
    end
    return ℛ
end

Algorithm 8.7. Linear forward reachability using linear programming. The system-specific 𝒮₁ and disturbance_set functions return the initial state set and disturbance set, respectively. For each depth, the algorithm creates the constrained model and evaluates the support function at each direction in 𝒟. It then constructs a polytope from the results and takes its union with the current reachable set. The algorithm returns R1:h. The tol input is a tolerance used by algorithm 8.8.

overapproximation. Specifically, we solve the following optimization problem:

minimize_{s1:d, x1:d, u}   ‖sd − u‖
subject to   u ∈ U
             s1 ∈ S
             xt ∈ Xt for all t ∈ {1, . . . , d}
             st+1 = Step(st, xt) for all t ∈ {1, . . . , d − 1}   (8.29)
The solution to the optimization problem in equation (8.29) is the minimum
distance between any point in the reachable set and the avoid set. If this distance
is greater than zero, we can conclude that the reachable set does not intersect
the avoid set at depth d. If the distance is equal to zero, we can conclude that the
reachable set intersects the avoid set.
The norm in the objective function of equation (8.29) means that it is no longer
a linear program. It is, however, a convex program as long as the avoid set is
convex. If the avoid set is a union of convex sets, we can check intersection with
each component separately (see example 8.5). Convex programs can be solved
efficiently using a variety of algorithms.20 Algorithm 8.8 implements this check for a given time horizon. For each depth, the algorithm solves the optimization problem in equation (8.29) and checks if the objective value is within some tolerance of zero. If the objective value is zero at any depth, the system does not satisfy the specification.

20. S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.


Figure 8.20. Overapproximate reachable sets R2 through R5 for the mass-spring-damper system using linear programming. On each plot, the x-axis is position, and the y-axis is velocity. Each row uses a different strategy for selecting D: the first row uses directions aligned with the axes, the second row uses 10 randomly sampled directions, the third row uses 50 randomly sampled directions, and the fourth row uses directions selected based on the principal components of the trajectory samples. When we randomly sample only 10 directions, the overapproximation is too conservative to verify safety.

8.6 Summary

• While the sampling-based methods in the previous chapters draw conclusions


based on a finite sampling of trajectories, formal methods such as reachability
analysis consider the entire set of possible trajectories.

• We can compute reachable sets for linear systems by propagating sets through
the system dynamics.

• We can efficiently propagate convex sets such as polytopes, zonotopes, and


hyperrectangles through linear equations.


Example 8.5. Checking whether the mass-spring-damper system can reach the avoid set using convex programming. The avoid set for the mass-spring-damper system can be written as the union of two convex sets. Specifically, we require that |p| < 0.3. The first set is therefore represented by the linear inequality [1, 0]⊤s ≤ −0.3, and the second set is represented by the linear inequality [−1, 0]⊤s ≤ −0.3. To check whether the system could reach the avoid set, we run algorithm 8.8 for each component of the avoid set. The system does not satisfy the specification if the algorithm returns false for either component.

function satisfies(alg::LinearProgramming, sys, ψ)
    𝒮, 𝒳 = 𝒮₁(sys.env), disturbance_set(sys)
    for d in 1:alg.h
        model = constrained_model(sys, d, 𝒮, 𝒳)
        @variable(model, u[1:dim(𝒮)])
        Au, bu = Ab(¬ψ)
        @constraint(model, Au * u .≤ bu)
        𝐬 = model.obj_dict[:𝐬]
        @objective(model, Min, sum((𝐬[i, d] - u[i])^2 for i in 1:dim(𝒮)))
        optimize!(model)
        if isapprox(objective_value(model), 0.0, atol=alg.tol)
            return false
        end
    end
    return true
end

Algorithm 8.8. Checking whether a system could reach a convex avoid set using convex programming. For each depth, the algorithm constructs a constrained model that considers the initial state, disturbances, and system dynamics. It then adds a variable for the avoid set and minimizes the squared distance between the reachable set and the avoid set (equivalent to minimizing the norm). If the distance is zero (within the numerical tolerance) at any depth, the system does not satisfy the specification.


• If the number of vertices in the reachable set grows too large, we can produce
overapproximate representations by evaluating the support function on a set
of directions.

• We can overapproximate the reachable set directly by solving a linear program


to evaluate the support function.

9 Reachability for Nonlinear Systems

This chapter extends the set propagation and optimization techniques discussed in chapter 8 to perform reachability on nonlinear systems. A system is nonlinear if its agent, environment, or sensor model contains nonlinear functions. The reachable sets of nonlinear systems are often nonconvex and difficult to compute exactly. This chapter begins by discussing several set propagation techniques for nonlinear systems that overapproximate the reachable set.1 We then discuss optimization-based nonlinear reachability methods. To minimize the overapproximation error introduced by these methods, we introduce a technique for overapproximation error reduction that involves partitioning the state space. We conclude by discussing reachability techniques for nonlinear systems represented by a neural network.

1. For more details on set propagation through nonlinear systems, refer to M. Althoff, G. Frehse, and A. Girard, “Set Propagation Techniques for Reachability Analysis,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, pp. 369–395, 2021.

9.1 Interval Arithmetic

For nonlinear systems, the reachability function r(s, x1:d) is a nonlinear function. In contrast with the linear systems in chapter 8, we cannot directly propagate arbitrary polytopes through nonlinear systems. We can, however, propagate hyperrectangular sets2 using a technique called interval arithmetic.3 Interval arithmetic extends traditional arithmetic operations and other elementary functions to intervals. An interval is a set of real numbers written as

[x] = [x̲, x̄] = {x | x̲ ≤ x ≤ x̄}   (9.1)

where x̲ and x̄ are the lower and upper bounds of the interval, respectively. A hyperrectangle, also known as an interval box, is the Cartesian product of a set of n intervals:

[x] = [x1] × [x2] × · · · × [xn]   (9.2)

2. We can also propagate sets that are linear transformations of hyperrectangles by reversing the linear transformation to obtain an axis-aligned hyperrectangle and performing the analysis in the transformed space.
3. L. Jaulin, M. Kieffer, O. Didrit, and É. Walter, Interval Analysis. Springer, 2001.

where [xi] = [x̲i, x̄i] for i ∈ 1 : n (figure 9.1).
Given two intervals [x] and [y], we define the interval counterpart of elementary arithmetic functions as

[x] ◦ [y] = {x ◦ y | x ∈ [x], y ∈ [y]}   (9.3)

where ◦ represents the addition, subtraction, multiplication, and division operations. We evaluate the interval counterparts of these functions as follows:

[x] + [y] = [x̲ + y̲, x̄ + ȳ]   (9.4)
[x] − [y] = [x̲ − ȳ, x̄ − y̲]   (9.5)
[x] × [y] = [min(x̲y̲, x̲ȳ, x̄y̲, x̄ȳ), max(x̲y̲, x̲ȳ, x̄y̲, x̄ȳ)]   (9.6)
[x] / [y] = [min(x̲/y̲, x̲/ȳ, x̄/y̲, x̄/ȳ), max(x̲/y̲, x̲/ȳ, x̄/y̲, x̄/ȳ)]   (9.7)

where the division operation is only defined when 0 ∉ [y].

Figure 9.1. The Cartesian product of two intervals [x1] and [x2] forms a hyperrectangle [x] = [x1] × [x2] in R².

In general, we define the interval counterpart of a given function f ( x ) as

f ([ x ]) = [{ f ( x ) | x ∈ [ x ]}] (9.8)

where the [·] operation takes the interval hull of the resulting set. The interval
hull of a set is the smallest interval that contains the set. Therefore, the interval
counterpart of a function returns the smallest interval that contains all possible
function evaluations of the points in the input interval.
We can define an interval counterpart for a variety of elementary functions.4 For monotonically increasing functions such as exp, log, and square root, the interval counterpart is

f([x]) = [f(x̲), f(x̄)]   (9.9)

The interval counterpart for monotonically decreasing functions is similarly defined. Nonmonotonic elementary functions such as sin, cos, and square require multiple cases to define their interval counterparts. For example, the interval counterpart for the square function is

[x]² = [min(x̲², x̄²), max(x̲², x̄²)] if 0 ∉ [x], and [x]² = [0, max(x̲², x̄²)] otherwise   (9.10)

4. IntervalArithmetic.jl defines the interval counterpart of many elementary functions such as sin, cos, exp, and log in Julia.

Figure 9.2 shows example evaluations of the interval counterparts for the exp,
square, and sin functions.
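These operations are available off the shelf in Julia. The following is a minimal sketch using the IntervalArithmetic.jl package; the particular bounds are arbitrary, and the constructor name may differ slightly across package versions.

using IntervalArithmetic

x = interval(-1.0, 2.0)
y = interval(0.5, 1.5)

x + y     # [-0.5, 3.5]
x * y     # [-1.5, 3.0]
x^2       # [0, 4], handling the case 0 ∈ [x]
sin(x)    # interval counterpart of sin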


Figure 9.2. Example of the interval counterparts for the exp, square, and sin functions.

9.2 Inclusion Functions

For complex functions, it is not always possible to define a tight interval counterpart. In these cases, we instead define an inclusion function. An inclusion function [f]([x]) outputs an interval that is guaranteed to contain the interval from the interval counterpart:

    f([x]) ⊆ [f]([x])    (9.11)

In other words, inclusion functions output overapproximate intervals. We can also define an inclusion function for multivariate functions that map from Rᵏ to R where k ≥ 1.

For reachability analysis, our goal is to propagate intervals through the function r(s, x1:d), which maps its inputs to Rⁿ where n is the dimension of the state space. We can rewrite r(s, x1:d) as a vector of functions that map to R as follows:

    s′ = r(s, x1:d) = [r₁(s, x1:d), …, rₙ(s, x1:d)]⊤    (9.12)

where rᵢ(s, x1:d) outputs the value of the ith component of s′. We can then define the inclusion function for each rᵢ(s, x1:d) as [rᵢ]([s], [x1:d]). By evaluating each inclusion function for the input intervals [s] and [x1:d], we obtain an overapproximate hyperrectangular reachable set. The rest of this section discusses techniques to create these inclusion functions.

9.2.1 Natural Inclusion Functions


One simple way to create an inclusion function from a complex function is to replace each elementary function with its interval counterpart. This type of inclusion function is known as a natural inclusion function. For example, the natural inclusion function for f(x) = x − sin(x) is [f]([x]) = [x] − sin([x]) (figure 9.3). By replacing the elementary nonlinear components of the agent, environment, and sensor models with their interval counterparts, we can create the natural inclusion function for rᵢ(s, x1:d). We can then use interval arithmetic to propagate hyperrectangular sets through the natural inclusion function. This computation will result in overapproximate reachable sets for nonlinear systems. Algorithm 9.1 implements the natural inclusion reachability algorithm and computes overapproximate reachable sets up to a desired time horizon. Example 9.1 applies algorithm 9.1 to the inverted pendulum problem.

Figure 9.3. Example evaluation of the natural inclusion function for f(x) = x − sin(x). The inclusion function produces an overapproximate interval.

As shown in figure 9.3 and example 9.1, natural inclusion functions tend to be overly conservative. This property is due to the dependency effect, in which multiple occurrences of the same variable are treated independently (see example 8.2). In chapter 8, we were able to eliminate this effect by simplifying equations to algebraically combine all repeated instances of a variable. However, this simplification is not always possible for nonlinear functions such as the one shown in figure 9.3. We can instead mitigate the dependency effect by using more sophisticated techniques for generating inclusion functions, which we discuss in the remainder of this section.
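To make the dependency effect concrete, the following sketch (my own illustration using IntervalArithmetic.jl, not code from this chapter's algorithms) evaluates the natural inclusion function for f(x) = x − sin(x) on an interval:

using IntervalArithmetic

f(x) = x - sin(x)

X = interval(-2.0, 2.0)
f(X)   # natural inclusion: [−3, 3], since [x] − sin([x]) = [−2, 2] − [−1, 1]
       # the true range of f over [−2, 2] is only about [−1.09, 1.09], so the
       # natural inclusion is wide because the two occurrences of x are treated
       # independently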

9.2.2 Mean Value Inclusion Functions


For functions that are continuous and differentiable, we can use the mean value theorem to create an inclusion function. The mean value theorem states that for a function f(x) that is continuous and differentiable on the interval [x], there exists a point x′ ∈ [x] such that

    (f(\overline{x}) − f(\underline{x})) / (\overline{x} − \underline{x}) = f′(x′)    (9.13)

In other words, there exists a point in [x] where the slope of the tangent line is equal to the slope of the secant line between the endpoints of the interval (figure 9.4).

Figure 9.4. Illustration of the mean value theorem on the function f(x) = x² over the interval [x] = [1, 4].


Algorithm 9.1. Nonlinear forward reachability using natural inclusion functions. For each depth, the algorithm gets the input intervals using the system-specific intervals function and computes the output intervals using the natural inclusion function of r(s, x1:d). The IntervalArithmetic.jl package replaces functions with their interval counterparts so that we can propagate the intervals directly through the rollout function. The algorithm returns R1:h as the union of the output intervals.

struct NaturalInclusion <: ReachabilityAlgorithm
    h # time horizon
end

function r(sys, x)
    s, 𝐱 = extract(sys.env, x)
    τ = rollout(sys, s, 𝐱)
    return τ[end].s
end

to_hyperrectangle(𝐈) = Hyperrectangle(low=[i.lo for i in 𝐈],
                                      high=[i.hi for i in 𝐈])

function reachable(alg::NaturalInclusion, sys)
    𝐈′s = []
    for d in 1:alg.h
        𝐈 = intervals(sys, d)
        push!(𝐈′s, r(sys, 𝐈))
    end
    return UnionSetArray([to_hyperrectangle(𝐈′) for 𝐈′ in 𝐈′s])
end

Example 9.1. Computing reachable sets for the inverted pendulum system using its natural inclusion function. The plot shows the overapproximated reachable set R2 computed using algorithm 9.1 compared to a set of samples from R2.

Suppose we want to compute reachable sets for the pendulum system with bounded sensor noise on the angle and angular velocity using algorithm 9.1. We define the intervals and extract functions as follows:

function intervals(sys, d)
    disturbance_mag = 0.01
    θmin, θmax = -π/16, π/16
    ωmin, ωmax = -1.0, 1.0
    𝐈 = [interval(θmin, θmax), interval(ωmin, ωmax)]
    for i in 1:2d
        push!(𝐈, interval(-disturbance_mag, disturbance_mag))
    end
    return 𝐈
end
function extract(env::InvertedPendulum, x)
    s = x[1:2]
    𝐱 = [Disturbance(0, 0, x[i:i+1]) for i in 3:2:length(x)]
    return s, 𝐱
end

The intervals function returns the initial state intervals followed by the disturbance intervals for each time step. The extract function extracts these intervals into the state and disturbance components. The plot in the caption shows the overapproximated reachable set after two time steps.


The mean value theorem implies that for any subinterval of [x], there exists a point in [x] where the slope of the tangent line is equal to the slope of the secant line between the endpoints of the subinterval. Therefore, given the center c of the interval [x], there exists a point x′ ∈ [x] such that

    (f(x) − f(c)) / (x − c) = f′(x′)    (9.14)

for any x ∈ [x] (figure 9.5). Rearranging equation (9.14) gives

    f(x) = f(c) + f′(x′)(x − c)    (9.15)

Because we know that x′ ∈ [x], we can use equation (9.15) to create an inclusion function for f(x) as follows:

    [f]([x]) = f(c) + [f′]([x])([x] − c)    (9.16)

where [f′]([x]) is an inclusion function for f′(x). It is common to define [f′]([x]) as the natural inclusion function for f′(x). For multivariate functions, equation (9.16) generalizes to

    [f]([x]) = f(c) + [∇f]([x])⊤([x] − c)    (9.17)

where c is the center of the interval [x] and [∇f]([x]) is an inclusion function for the gradient of f(x).

Figure 9.5. For a given subinterval [c, x], there exists a point in [x] where the slope of the tangent line is equal to the slope of the secant line between the endpoints of the subinterval.

Figure 9.6. Mean value inclusion function for f(x) = x − sin(x) over the same interval as figure 9.3.

Figure 9.7. Mean value inclusion function for f(x) = x − sin(x) over a wider interval.

Equation (9.17) is a linearization of the nonlinear function f(x). Therefore, mean value inclusion functions tend to perform well when the input interval covers a region of the input space for which the function is nearly linear. Figure 9.6 shows an evaluation of the mean value inclusion function for the function in figure 9.3. Because the function is roughly linear over the input interval, the mean value inclusion function provides a tighter overapproximation. However, if we expand the input interval to include nonlinear regions, the mean value inclusion function produces more conservative results (figure 9.7).
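A minimal sketch of equation (9.16), assuming the IntervalArithmetic.jl and ForwardDiff.jl packages used elsewhere in this chapter; the helper name mean_value_inclusion is hypothetical:

using IntervalArithmetic, ForwardDiff

f(x) = x - sin(x)
f′(x) = ForwardDiff.derivative(f, x)

function mean_value_inclusion(f, f′, X)
    c = mid(X)                     # center of the input interval
    return f(c) + f′(X) * (X - c)  # f(c) + [f′]([x])([x] − c), equation (9.16)
end

X = interval(-1.0, 1.0)
mean_value_inclusion(f, f′, X)     # tighter than the natural inclusion f(X)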
9.2.3 Taylor Inclusion Functions

Natural inclusion functions and mean value inclusion functions are special cases of a more general type of inclusion function known as a Taylor inclusion function. These inclusion functions use Taylor series expansions about the center of the input interval.

Figure 9.8. Evaluation of Taylor inclusion functions of different orders (natural inclusion and first, second, and third order) for f(x) = x − sin(x) over the interval [x] = [−1, 1] (top row) and [x] = [−1.5, 1.5] (bottom row). The natural inclusion function is equivalent to the zeroth-order Taylor inclusion function. As the order increases, overapproximation error decreases, especially over the wider input interval.

In one dimension, a Taylor inclusion function of order n for a function f(x) is defined as

    [f]([x]) = f(c) + f′(c)([x] − c) + (f″(c)/2!)([x] − c)² + ⋯ + ([f⁽ⁿ⁾]([x])/n!)([x] − c)ⁿ    (9.18)

where c is the center of the interval [x] and [f⁽ⁿ⁾]([x]) is an inclusion function for the nth-order derivative of f(x).⁵

    ⁵ It is possible to create a Taylor inclusion function centered around any point in the interval. However, choosing the center of the interval minimizes overapproximation error.

Taylor inclusion functions can be similarly defined for multivariate functions. The second-order Taylor inclusion function for a multivariate function f(x) is

    [f]([x]) = f(c) + ∇f(c)⊤([x] − c) + ½([x] − c)⊤[∇²f]([x])([x] − c)    (9.19)

where c is the center of the interval [x] and [∇²f]([x]) is an inclusion function for the Hessian of f(x).⁶ A zeroth-order Taylor inclusion function is equivalent to the natural inclusion function, and a first-order Taylor inclusion function is equivalent to the mean value inclusion function.

    ⁶ For higher-order models, see R. Neidinger, "Directions for Computing Truncated Multivariate Taylor Series," Mathematics of Computation, vol. 74, no. 249, pp. 321–340, 2005.

In general, higher-order Taylor inclusion functions provide tighter overapproximations (figure 9.8). However, the benefit of using higher-order terms depends on the behavior of the function over the input interval. If the function is nearly linear over the input interval, moving beyond a first-order model may not be worth the additional computational cost. In contrast, if the function is highly nonlinear over the input interval, a higher-order model may significantly decrease overapproximation error.

Algorithm 9.2 implements first- and second-order Taylor inclusion functions for reachability analysis. The algorithm computes overapproximate reachable sets up to a desired time horizon by evaluating the Taylor inclusion function for each subfunction rᵢ(s, x1:d). Taylor inclusion functions can be used to create tighter overapproximations of the reachable set than natural inclusion functions, especially for short time horizons (figure 9.9).⁷ However, the nonlinearities compound for each time step, so Taylor models can be computationally expensive and result in significant overapproximation error for long time horizons (example 9.2).

    ⁷ Because Taylor inclusion functions can only be applied to functions that are continuous and differentiable, we use a modified version of the pendulum problem in this chapter that does not apply clamping in the environment model.
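As a one-dimensional illustration of equation (9.18) truncated at second order, the following sketch evaluates the Taylor inclusion function directly with interval arithmetic; the function, interval, and helper name are chosen only for illustration:

using IntervalArithmetic, ForwardDiff

f(x) = x - sin(x)
f′(x) = ForwardDiff.derivative(f, x)
f′′(x) = ForwardDiff.derivative(f′, x)

function taylor_inclusion_2(f, f′, f′′, X)
    c = mid(X)
    # f(c) + f′(c)([x] − c) + [f″]([x])/2! ([x] − c)²
    return f(c) + f′(c) * (X - c) + f′′(X) / 2 * (X - c)^2
end

taylor_inclusion_2(f, f′, f′′, interval(-1.5, 1.5))  # tighter than the natural inclusion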

Algorithm 9.2. Nonlinear forward reachability using first- or second-order Taylor inclusion functions. For each depth, the algorithm gets the input intervals using the system-specific intervals function and applies either equation (9.17) or equation (9.19) to each subfunction rᵢ(s, x1:d) of r(s, x1:d). The IntervalArithmetic.jl and ForwardDiff.jl packages are compatible, which allows us to evaluate gradients and Hessians over intervals. The algorithm returns R1:h as the union of the output intervals.

struct TaylorInclusion <: ReachabilityAlgorithm
    h     # time horizon
    order # order of Taylor inclusion function (supports 1 or 2)
end

function taylor_inclusion(sys, 𝐈, order)
    c = mid.(𝐈)
    fc = r(sys, c)
    if order == 1
        𝐈′ = [fc[i] + gradient(x->r(sys, x)[i], 𝐈)' * (𝐈 - c)
              for i in eachindex(fc)]
    else
        𝐈′ = [fc[i] + gradient(x->r(sys, x)[i], c)' * (𝐈 - c) +
              (𝐈 - c)' * hessian(x->r(sys, x)[i], 𝐈) * (𝐈 - c)
              for i in eachindex(fc)]
    end
    return 𝐈′
end

function reachable(alg::TaylorInclusion, sys)
    𝐈′s = []
    for d in 1:alg.h
        𝐈 = intervals(sys, d)
        𝐈′ = taylor_inclusion(sys, 𝐈, alg.order)
        push!(𝐈′s, 𝐈′)
    end
    return UnionSetArray([to_hyperrectangle(𝐈′) for 𝐈′ in 𝐈′s])
end


Figure 9.9. Comparison of the one-step overapproximated reachable sets for the inverted pendulum system using natural, first-order Taylor, and second-order Taylor inclusion functions. In this particular example, a first-order Taylor inclusion function provides a significantly tighter overapproximation than the natural inclusion function. The second-order Taylor inclusion function does not provide a significant benefit over the first-order Taylor inclusion function, indicating that the dynamics are roughly linear over the input space.

Example 9.2. Overapproximate reachable sets for the inverted pendulum system using first-order Taylor inclusion functions at different depths. As the depth increases, the overapproximation error increases.

The plots (R2 through R5) show the overapproximate reachable sets for the inverted pendulum system produced by a first-order Taylor inclusion function at different depths. As the depth increases, the overapproximation error increases. This result is due to the increasing presence of nonlinearities in the system dynamics as we increase the depth. For the one-step reachable set (R2), the only nonlinearity present is the sine function in the pendulum dynamics. As the depth increases, this nonlinearity will be repeated for each time step, leading to larger overapproximation error.


9.3 Taylor Models

While inclusion functions only operate over interval inputs and output reachable sets in the form of hyperrectangles, Taylor models operate over other types of input sets and are able to represent more expressive reachable sets.⁸ Similar to Taylor inclusion functions, Taylor models are based on Taylor series expansions. An nth-order Taylor model is a set represented as

    T = {p(x) + α | x ∈ X, α ∈ [α]}    (9.20)

where X is the input set, p(x) is a polynomial of degree n − 1, and [α] is an interval remainder term.

    ⁸ K. Makino and M. Berz, "Taylor Models and Other Validated Functional Inclusion Methods," International Journal of Pure and Applied Mathematics, vol. 4, no. 4, pp. 379–456, 2003.

In one dimension, the polynomial of an nth-order Taylor model for the function f(x) over an input interval [x] is defined as

    p(x) = f(c) + f′(c)(x − c) + (f″(c)/2!)(x − c)² + ⋯ + (f⁽ⁿ⁻¹⁾(c)/(n−1)!)(x − c)ⁿ⁻¹    (9.21)

where c is the center of the input interval. The interval remainder term, also known as the Lagrange remainder, bounds the sum of the rest of the terms in the Taylor expansion over the input interval [x] so that the Taylor model is guaranteed to contain the true output of the function. It is calculated as

    [α] = ([f⁽ⁿ⁾]([x])/n!)([x] − c)ⁿ    (9.22)

and is equivalent to the last term in a Taylor inclusion function of order n. In fact, passing an interval through a Taylor model performs the same computation as a Taylor inclusion function of the same order.

As the order of a Taylor model increases, overapproximation error tends to decrease (figure 9.10). Producing a zero-order Taylor model is equivalent to evaluating the natural inclusion function, while producing a first-order Taylor model is equivalent to evaluating the mean value inclusion function. Taylor models begin to deviate from inclusion functions for orders of two or higher. Second-order Taylor models represent arbitrary polytopes, while second-order inclusion functions only produce hyperrectangles. Higher-order Taylor models correspond to nonconvex sets, which are more difficult to understand and manipulate.⁹ For this reason, we focus the remainder of this section on second-order Taylor models.

    ⁹ One way to handle this nonconvexity is to represent sets using an extension of zonotopes called polynomial zonotopes. More details can be found in M. Althoff, "Reachability Analysis of Nonlinear Systems Using Conservative Polynomialization and Non-Convex Sets," in International Conference on Hybrid Systems: Computation and Control, 2013. Another representation called star sets can also be used to represent nonconvex sets and has been used for reachability: H.-D. Tran, D. Manzanas Lopez, P. Musau, X. Yang, L. V. Nguyen, W. Xiang, and T. T. Johnson, "Star-Based Reachability Analysis of Deep Neural Networks," in International Symposium on Formal Methods, 2019.


Figure 9.10. Taylor models of different orders (zero through third) for f(x) = x − sin(x) over the interval [x] = [−1.5, 0.0]. The dashed purple lines show results from a Taylor inclusion function of the same order.

Creating a second-order Taylor model for a function f(x) is a process known as conservative linearization.¹⁰ Given an input set X and a center point c, the second-order Taylor model is

    T = {f(c) + J(x − c) + α | x ∈ X, α ∈ [α]}    (9.23)

where J is the Jacobian of f evaluated at c and [α] is the interval remainder term. The Jacobian is a generalization of the gradient to functions with multidimensional outputs and is computed as

    J = [∇f₁(c) ⋯ ∇fₙ(c)]⊤    (9.24)

so that the ith row of J is ∇fᵢ(c)⊤, where ∇fᵢ(c) is the gradient of the ith component of f evaluated at c. The interval remainder term is calculated using interval arithmetic as

    [α] = ½([X] − c)⊤[∇²f]([X])([X] − c)    (9.25)

where [X] is the interval hull of X.¹¹

    ¹⁰ M. Althoff, O. Stursberg, and M. Buss, "Reachability Analysis of Nonlinear Systems with Uncertain Parameters Using Conservative Linearization," in IEEE Conference on Decision and Control (CDC), 2008.
    ¹¹ If the input set X is represented as a zonotope, it is also possible to overapproximate the remainder term directly without taking the interval hull. This approach can reduce overapproximation error. M. Althoff, O. Stursberg, and M. Buss, "Reachability Analysis of Nonlinear Systems with Uncertain Parameters Using Conservative Linearization," in IEEE Conference on Decision and Control (CDC), 2008.

Equation (9.23) represents a linear approximation of the nonlinear function f with a remainder term that bounds the error of the approximation. Because all of the operations in equation (9.23) are linear, we can use it to propagate convex sets. In other words, if X is convex, we can rewrite the Taylor model in terms of linear transformations and Minkowski sums as

    T = f(c) + J(X ⊕ −c) ⊕ [α]    (9.26)


Algorithm 9.3 computes overapproximate reachable sets using conservative


linearization. Since Taylor models can be applied to functions with multidimen-
sional outputs, we can apply conservative linearization directly to the reachability
function r (s, x1:d ) without the need to break it into subfunctions. Example 9.3
demonstrates algorithm 9.3 on the inverted pendulum system. Conservative lin-
earization using Taylor models performs better than second-order Taylor inclusion
functions because it is able to output more expressive reachable sets. However,
for higher orders, Taylor models output nonconvex sets that are difficult to ma-
nipulate. In contrast, Taylor inclusion functions always output hyperrectangles
and do not suffer from this added complexity.

Algorithm 9.3. Nonlinear forward reachability using conservative linearization. At each depth, the algorithm gets the input sets for the initial states and disturbances using the system-specific sets function and applies equation (9.23) to r(s, x1:d). It uses the interval hull of the input set to calculate the interval remainder term. The algorithm returns R1:h as the union of the output sets.

struct ConservativeLinearization <: ReachabilityAlgorithm
    h # time horizon
end

to_intervals(𝒫) = [interval(lo, hi) for (lo, hi) in zip(low(𝒫), high(𝒫))]

function conservative_linearization(sys, 𝒫)
    𝐈 = to_intervals(interval_hull(𝒫))
    c = mid.(𝐈)
    fc = r(sys, c)
    J = ForwardDiff.jacobian(x->r(sys, x), c)
    α = to_hyperrectangle([(𝐈 - c)'*hessian(x->r(sys, x)[i], 𝐈)*(𝐈 - c)
                           for i in eachindex(fc)])
    return fc + J * (𝒫 ⊕ -c) ⊕ α
end

function reachable(alg::ConservativeLinearization, sys)
    ℛs = []
    for d in 1:alg.h
        𝒮, 𝒳 = sets(sys, d)
        𝒮′ = conservative_linearization(sys, 𝒮 × 𝒳)
        push!(ℛs, 𝒮′)
    end
    return UnionSetArray([ℛs...])
end

9.4 Concrete Reachability

Algorithms 9.2 and 9.3 tend to be computationally expensive when computing


reachable sets over long time horizons. As the depth d increases, the input di-
mension for the reachability function r (s, x1:d ) also increases. This increase in


Example 9.3. Computing the one-step reachable set for the inverted pendulum system using conservative linearization. Conservative linearization better approximates the reachable set than a second-order Taylor inclusion function.

Suppose we want to compute reachable sets for the pendulum system with bounded sensor noise on the angle and angular velocity using algorithm 9.3. We define the sets function as follows:

function sets(sys, d)
    disturbance_mag = 0.01
    θmin, θmax = -π/16, π/16
    ωmin, ωmax = -1.0, 1.0
    low = [θmin, ωmin]
    high = [θmax, ωmax]
    for i in 1:d
        append!(low, [-disturbance_mag, -disturbance_mag])
        append!(high, [disturbance_mag, disturbance_mag])
    end
    return Hyperrectangle(low=low, high=high)
end

The sets function returns the initial state set followed by the disturbance sets for each time step. The plots compare the one-step reachable set produced by conservative linearization with the set produced by a second-order Taylor inclusion function. While conservative linearization still produces an overapproximation, it captures the shape of the true reachable set better than a Taylor inclusion function.


Figure 9.11. Comparison of symbolic and concrete reachability algorithms when computing R3 for the inverted pendulum system. The symbolic reachability algorithm directly computes R3 without explicitly computing R2 by considering r(s1, x1:3) as a single function. The concrete reachability algorithm computes R2 and R3 separately by considering r(s1, x1:2) and r(s2, x2:3) as separate functions. It uses R2 as the input set for computing R3.

input dimension causes the size of the gradient and Hessian to increase, leading
to more expensive computations. Furthermore, the nonlinearities in the agent,
environment, and sensor models compound over time, causing the accuracy of a
linearized model to degrade as the depth increases.
Concrete reachability algorithms address these issues by decomposing the reach-
ability function into a sequence of simpler functions. Instead of overapproximating
the reachable set over the entire depth at once, they compute the overapproximate
reachable set for each time step individually. At each iteration, they use the over-
approximate reachable set from the previous time step as the input set for the next
time step. We refer to this process as concrete reachability because we concretize
the reachable set at each time step by explicitly computing an overapproximate
representation. In contrast, the algorithms presented thus far maintain a sym-
bolic representation of the reachable set at each time step and only concretize the
reachable set at depth d. For this reason, we refer to these algorithms as symbolic
reachability algorithms. Figure 9.11 illustrates the difference between symbolic and
concrete reachability algorithms.
Algorithms 9.4 and 9.5 implement concrete versions of the symbolic reach-
ability algorithms presented in algorithms 9.2 and 9.3, respectively. For each
depth in the time horizon, they compute the overapproximate reachable set for
the next step using the overapproximate reachable set from the previous step.
Algorithm 9.4 concretizes the reachable set into a hyperrectangle at each time


Algorithm 9.4. Nonlinear forward reachability using Taylor inclusion functions, concretizing the reachable set at each time step. The algorithm first gets the intervals for a depth of 2, which correspond to the intervals for a one-step reachability computation. At each depth, it computes the intervals for the next time step and creates the input for the next time step by extracting the new state. The algorithm returns R1:h as the union of the output sets.

struct ConcreteTaylorInclusion <: ReachabilityAlgorithm
    h     # time horizon
    order # order of Taylor inclusion function (supports 1 or 2)
end

function reachable(alg::ConcreteTaylorInclusion, sys)
    𝐈 = intervals(sys, 2)
    s, _ = extract(sys.env, 𝐈)
    𝐈′s = [s]
    for d in 2:alg.h
        𝐈′ = taylor_inclusion(sys, 𝐈, alg.order)
        push!(𝐈′s, 𝐈′)
        s, _ = extract(sys.env, 𝐈′)  # extract the new state intervals
        𝐈[1:length(s)] = s           # use them as the input for the next step
    end
    return UnionSetArray([to_hyperrectangle(𝐈′) for 𝐈′ in 𝐈′s])
end

Algorithm 9.5. Nonlinear forward reachability using conservative linearization, concretizing the reachable set at each time step. The algorithm first gets the state and disturbance sets for a depth of 2, which correspond to the sets required for a one-step reachability computation. At each depth, it computes the state set for the next time step and uses it to compute the next reachable set. The algorithm returns R1:h as the union of the output sets.

struct ConcreteConservativeLinearization <: ReachabilityAlgorithm
    h # time horizon
end

function reachable(alg::ConcreteConservativeLinearization, sys)
    𝒮, 𝒳 = sets(sys, 2)
    ℛs = []
    push!(ℛs, 𝒮)
    for d in 2:alg.h
        𝒮 = conservative_linearization(sys, 𝒮 × 𝒳)
        push!(ℛs, 𝒮)
    end
    return UnionSetArray([ℛs...])
end


step, while algorithm 9.5 concretizes the reachable set into a polytope at each
time step.
Concrete reachability algorithms are generally more computationally efficient
than symbolic reachability algorithms. However, it is not always clear whether
they will produce tighter overapproximations because there are multiple factors
that contribute to the overapproximation error. The only source of overapproxima-
tion error in symbolic reachability algorithms is the error introduced by linearizing
the reachability function and bounding the remainder term. We expect this lin-
earization error to be smaller for concrete reachability algorithms because they
linearize over a single time step rather than the entire time horizon.
While concrete reachability algorithms reduce overapproximation error due to
linearization, they introduce additional overapproximation error by concretizing
the reachable set at each time step into an overapproximate reachable set (fig-
ure 9.11). This error compounds over time, and the accumulation of this error is
often referred to as the wrapping effect.
The decrease in linearization error and introduction of the wrapping effect
for concrete reachability algorithms result in a tradeoff between concrete and
symbolic reachability (figures 9.12 and 9.13). The choice of which type of al-
gorithm to use depends on the specific system, the reachability algorithm, and
the desired tradeoff between computational efficiency and overapproximation
error. For example, if we are using linearized models for reachability and the
one-step reachability function is nearly linear, concrete reachability algorithms
may produce tighter overapproximations than symbolic reachability algorithms.
It is common to mix concrete and symbolic reachability algorithms to take advan-
tage of the strengths of each approach. For example, instead of concretizing the
reachable set at each time step, we can concretize the reachable set every k time
steps to reduce the wrapping effect.
Another benefit of using concrete reachability algorithms is that we can use
them to check for invariant sets. Similar to the check for invariance described for
the set propagation techniques in section 8.2, if we find that the reachable set at
a given time step is contained within the concrete reachable set at the previous
time step, we can conclude that the reachable set is invariant. For example, the
concrete versions of R6 in figures 9.12 and 9.13 are contained within the concrete
versions of R5 . Therefore, we can conclude that R6 is an invariant set in both
cases, meaning that the system will remain within the set for all future time steps.


Figure 9.12. Comparison of symbolic and concrete Taylor inclusion algorithms when computing R1:6 for the inverted pendulum system. Up to a depth of 5, the concretization error dominates, so the symbolic algorithm produces tighter overapproximations. However, at a depth of 6, the linearization error dominates, and the concrete algorithm produces a tighter overapproximation.

Figure 9.13. Comparison of symbolic and concrete conservative linearization algorithms when computing R1:6 for the inverted pendulum system. Because concretization using polytopes does not introduce as much error as the hyperrectangles used by Taylor inclusion functions, the linearization error dominates, and the concrete algorithm produces tighter overapproximations.

We cannot draw this conclusion using symbolic reachability algorithms because


the property requires consecutive concrete steps.
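A minimal sketch of this containment check, assuming the concrete per-step sets have been collected in a vector ℛs and using the set inclusion operator from LazySets.jl; the helper name is hypothetical:

using LazySets

function first_invariant_depth(ℛs)
    for d in 2:length(ℛs)
        if ℛs[d] ⊆ ℛs[d-1]
            return d   # ℛs[d] is invariant: all later reachable sets stay inside it
        end
    end
    return nothing     # no invariance detected within the horizon
end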

9.5 Optimization-Based Nonlinear Reachability

Similar to the ideas in section 8.5, we can overapproximate the reachable set of
nonlinear systems by sampling the support function. For symbolic reachability,


we rewrite the optimization problem in equation (8.27) as

    minimize over s_d, x_{1:d}:   d⊤ s_d
    subject to:                   s_1 ∈ S
                                  x_t ∈ X_t  for all t ∈ 1:d
                                  s_d = r(s_1, x_{1:d})        (9.27)

For concrete reachability, we replace the last constraint with a constraint for each time step as follows:

    s_{t+1} = r(s_t, x_{t:t+1})  for all t ∈ 1:d−1    (9.28)

The optimization problem in equation (9.27) is a nonlinear program because


r (s1 , x1:d ) is a nonlinear function of the state and disturbance. However, to ensure
that the overapproximation of the reachable set holds, we must solve this opti-
mization problem exactly. In general, we cannot find exact solutions for nonlinear
programs, so we must introduce further overapproximations. The rest of this
section discusses these methods.

9.5.1 Linear Programming through Conservative Linearization


We can transform the nonlinear program in equation (9.27) into a linear pro-
gram using the conservative linearization technique introduced in section 9.3.
Specifically, we create the following linear program:

minimize d> sd
sd ,x1:d ,α

subject to s1 ∈ S
xt ∈ Xt for all t ∈ 1 : d
" # (9.29)
s1 − sc
sd = r (sc , xc ) + J +α
x1:d − xc
α ∈ [α]

where sc and xc are the centers of the state and disturbance sets and J is the
Jacobian of the reachability function evaluated at sc and xc . We introduce another
decision variable α to represent the remainder term and constrain it to belong
to the Lagrange remainder interval [α] (equation (9.25)). The concrete version

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-12-23 15:12:43-05:00, comments to [email protected]
9.5. optimization-based nonlinear reachability 221

10 Figure 9.14. Example of a piece-



x<2 wise linear function.
4

8 4x − 4 2≤x<3



f ( x ) = −3x + 17 x≤x<5

6 2x − 8
 5≤x<6
f (x)


4 x≥6

0 2 4 6 8 10 12 14
x

of this linear program is similarly defined by replacing equation (9.28) with a


conservative linearization for each time step.
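The following is a rough sketch of how equation (9.29) could be posed with the JuMP modeling package for a single direction d, assuming for simplicity that the input sets are hyperrectangles described by elementwise bounds; the helper name, solver choice, and argument layout are my own and not from the text:

using JuMP, HiGHS, LinearAlgebra

function support_value(d, c, fc, J, xlo, xhi, αlo, αhi)
    model = Model(HiGHS.Optimizer)
    n, m = length(fc), length(c)
    @variable(model, xlo[i] <= x[i=1:m] <= xhi[i])   # stacked s1 and x1:d
    @variable(model, αlo[i] <= α[i=1:n] <= αhi[i])   # Lagrange remainder bounds
    @variable(model, s[1:n])                         # s_d
    @constraint(model, s .== fc + J * (x - c) + α)   # linearized reachability constraint
    @objective(model, Min, dot(d, s))
    optimize!(model)
    return objective_value(model)
end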

9.5.2 Piecewise Linear Models


If the reachability function r(s, x1:d) is piecewise linear, we can formulate the optimization problem as a mixed-integer linear program (MILP). A piecewise linear function is a function that comprises multiple linear functions that are activated based on the region of the input space (figure 9.14). A mixed-integer linear program is a linear program that includes some design variables that are constrained to a set of integers. We can convert piecewise linear functions to mixed-integer constraints by introducing binary variables that activate the appropriate linear function based on the input.

Figure 9.14. Example of a piecewise linear function: f(x) = 4 for x < 2, 4x − 4 for 2 ≤ x < 3, −3x + 17 for 3 ≤ x < 5, 2x − 8 for 5 ≤ x < 6, and 4 for x ≥ 6.

The process of encoding a piecewise linear function as a set of constraints begins by writing the function in terms of max and min functions (example 9.4). It is possible to write any piecewise linear function in this form.¹² We can then convert the max and min functions to mixed-integer constraints.¹³ Example 9.5 shows the conversion of the ReLU function. Encoding the piecewise linear reachability function as a set of mixed-integer constraints turns equation (9.27) into a MILP, which we can solve using a variety of algorithms.¹⁴

    ¹² C. Sidrane, A. Maleki, A. Irfan, and M. J. Kochenderfer, "OVERT: An Algorithm for Safety Verification of Neural Network Control Policies for Nonlinear Systems," Journal of Machine Learning Research, vol. 23, no. 117, pp. 1–45, 2022.
    ¹³ More details can be found in V. Tjeng, K. Y. Xiao, and R. Tedrake, "Evaluating Robustness of Neural Networks with Mixed Integer Programming," in International Conference on Learning Representations (ICLR), 2018.
    ¹⁴ A detailed overview of integer programming can be found in L. A. Wolsey, Integer Programming. Wiley, 2020. Modern solvers, such as Gurobi and CPLEX, can routinely handle problems with millions of variables. There are packages for Julia that provide access to Gurobi, CPLEX, and a variety of other solvers.

While many real-world nonlinear systems do not have piecewise linear reachability functions, we can overapproximate them with piecewise linear bounds. First, we decompose the reachability function into a conjunction of elementary nonlinear functions (see example 9.6). For each nonlinear elementary function,


Example 9.4. Writing the ReLU function in terms of the max function.

Consider the following piecewise linear function (shown in the caption):

    f(x) = 0 if x < 0, and f(x) = x otherwise

This function is often referred to as the rectified linear unit (ReLU) function and is commonly used in neural networks. We can rewrite this function in terms of the max function as follows:

    f(x) = max(0, x)

If x < 0, the max function will return 0, and if x ≥ 0, the max function will return x.

we can derive piecewise linear lower and upper bounds over a given interval. We can then convert those bounds to mixed-integer constraints and solve the resulting MILP to overapproximate the reachable set.¹⁵

    ¹⁵ For more details on the process of deriving the bounds and converting to constraints, see C. Sidrane, A. Maleki, A. Irfan, and M. J. Kochenderfer, "OVERT: An Algorithm for Safety Verification of Neural Network Control Policies for Nonlinear Systems," Journal of Machine Learning Research, vol. 23, no. 117, pp. 1–45, 2022.

9.6 Partitioning

The methods presented in this chapter tend to result in less overapproximation error when computing reachable sets over smaller regions of the input space. For example, Taylor approximations are more accurate for points near the center of the region and become less accurate as we move away from the center (figure 9.15). Therefore, we want to keep the input set for Taylor inclusion functions and Taylor models as small as possible to minimize overapproximation error.

Figure 9.15. First-order Taylor approximation (dashed blue line) for the function f(x) = x − sin(x) (gray) centered at x = 0. The approximation is more accurate near the center.

Based on this property, we can improve the performance of reachability algorithms by partitioning the input set into smaller regions and computing the reachable set for each region separately. Specifically, we divide the input set S into a set of smaller regions S^(1), S^(2), …, S^(m) such that

    S = ⋃_{i=1}^{m} S^(i)    (9.30)

To compute the reachable set at depth d, we compute the reachable set R_d^(i) for each region S^(i) separately and then combine the results to form the reachable

Example 9.5. Mixed-integer formulation of the ReLU function.

Suppose we want to solve an optimization problem with the following piecewise linear constraint:

    y = max(0, x)

We will also assume that we know that x lies in the interval [\underline{x}, \overline{x}]. We can encode this constraint using a set of mixed-integer constraints as follows:

    y ≤ x − \underline{x}(1 − a)
    y ≥ x
    y ≤ \overline{x} a
    y ≥ 0
    a ∈ {0, 1}

The plots iteratively build up the constrained region for each possible value of a. When a = 0, y must be 0 and x must be between \underline{x} and 0. When a = 1, y must be equal to x and x must be between 0 and \overline{x}.
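A minimal sketch of this encoding using the JuMP modeling package and the HiGHS solver (both are illustrative choices, not prescribed by the text), with bounds −1 ≤ x ≤ 2 chosen arbitrarily:

using JuMP, HiGHS

xlo, xhi = -1.0, 2.0
model = Model(HiGHS.Optimizer)
@variable(model, xlo <= x <= xhi)
@variable(model, y)
@variable(model, a, Bin)                    # binary variable activating one linear piece
@constraint(model, y <= x - xlo * (1 - a))
@constraint(model, y >= x)
@constraint(model, y <= xhi * a)
@constraint(model, y >= 0)
@objective(model, Max, y - 2x)              # any linear objective over x and y
optimize!(model)
value(x), value(y)                          # the optimum satisfies y = max(0, x)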


Example 9.6. Converting a nonlinear equality constraint into a set of mixed-integer constraints using piecewise linear bounds. We use the OVERT.jl package to compute the overapproximations.

Consider the nonlinear constraint y = x − sin(x) over the region −2 ≤ x ≤ 2. We can convert this constraint into a set of piecewise linear constraints by first decomposing the function into its elementary functions:

    y = x − z
    z = sin(x)
    −2 ≤ x ≤ 2

We then derive a piecewise linear lower bound \underline{z}(x) and upper bound \overline{z}(x) for sin(x) and rewrite the constraints as

    y = x − z
    \underline{z} ≤ z ≤ \overline{z}
    \underline{z} = \underline{z}(x)
    \overline{z} = \overline{z}(x)
    −2 ≤ x ≤ 2

The plots show the overapproximations of sin(x) using different numbers of linear segments. The final step is to convert the piecewise linear functions \underline{z}(x) and \overline{z}(x) into their corresponding mixed-integer constraints. The overapproximations become tighter as the number of segments increases, but the computational cost and the number of mixed-integer constraints required to represent the piecewise linear bounds also increases.


Figure 9.16. Computing the one-step reachable set for the inverted pendulum system using partitioning. The input set S is partitioned into four regions, and the reachable set for each region is computed separately using a first-order Taylor inclusion function. The union of the resulting output sets forms the full reachable set.

Figure 9.17. The effect of the number of subsets m in the partition (m = 1, 4, 16, and 100) on the overapproximation error in R6 for the inverted pendulum system. The reachability algorithm applied to each subset uses a first-order Taylor inclusion function. As m increases, the overapproximation error decreases.

set for the entire input set:

    R_d = ⋃_{i=1}^{m} R_d^(i)    (9.31)
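A minimal sketch of this partition-and-union computation for a two-dimensional input set, where uniform_partition is a hypothetical helper and reach_one_step stands in for any of the overapproximate one-step methods in this chapter:

using LazySets

function uniform_partition(𝒮::Hyperrectangle, k)
    los, his = low(𝒮), high(𝒮)
    Δ = (his - los) / k
    # k² equally sized subsets that together cover 𝒮, as in equation (9.30)
    return [Hyperrectangle(low=los + Δ .* [i, j], high=los + Δ .* [i + 1, j + 1])
            for i in 0:k-1 for j in 0:k-1]
end

function partitioned_reachable(reach_one_step, 𝒮, k)
    # union of the per-subset reachable sets, as in equation (9.31)
    return UnionSetArray([reach_one_step(𝒮ᵢ) for 𝒮ᵢ in uniform_partition(𝒮, k)])
end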
Figure 9.16 demonstrates this process. The union of the reachable sets for each
region is often nonconvex, which results in a benefit when performing reachability
analysis for nonlinear systems with nonconvex reachable sets.
As shown in figure 9.17, partitioning the input set can significantly reduce
overapproximation error. Finer partitions tend to result in more accurate reach-
able sets. In general, the performance is highly dependent on the partitioning
strategy. Figure 9.17 uses a uniform partitioning strategy, in which the input set
is divided into m equal-sized regions. However, this strategy may be intractable
for systems with high-dimensional input spaces because the number of subsets


in a uniform partition grows exponentially with the dimension. In such cases, we can use more sophisticated partitioning strategies, such as adaptive partitioning based on samples, to improve the accuracy of the reachable set while keeping the computational cost manageable.¹⁶

    ¹⁶ M. Everett, G. Habibi, C. Sun, and J. P. How, "Reachability Analysis of Neural Feedback Loops," IEEE Access, vol. 9, pp. 163 938–163 953, 2021.

Figure 9.18. Example of a two-layer neural network with two neurons in each layer, where s₂ = φ(W₁s₁ + b₁) and s₃ = φ(W₂s₂ + b₂).

9.7 Neural Networks

We can use the techniques discussed in the previous sections to verify properties of neural networks. Neural networks are a class of functions that are widely used in machine learning and could be used to represent the agent, environment, or sensor model. They are composed of a series of layers, each of which applies an affine transformation followed by a nonlinear activation function.¹⁷ Given a set of inputs to a neural network, we are often interested in understanding the possible outputs.¹⁸ For example, we may want to ensure that an aircraft collision avoidance system will always output an alert when other aircraft are nearby.

    ¹⁷ More details about the structure and training of neural networks are found in appendix C.
    ¹⁸ This process is sometimes referred to as neural network verification. A detailed overview of neural network verification can be found in C. Liu, T. Arnon, C. Lazarus, C. Strong, C. Barrett, and M. J. Kochenderfer, "Algorithms for Verifying Deep Neural Networks," Foundations and Trends in Optimization, vol. 4, no. 3–4, pp. 244–404, 2021.

Evaluating a neural network is similar to performing a rollout of a system. However, instead of computing s_{t+1} by passing s_t through the sensor, agent, and environment models, we compute it by passing s_t through the tth layer of the neural network. If s_t is the input to layer t, then the output s_{t+1} is computed as

    s_{t+1} = φ(W_t s_t + b_t)    (9.32)

where W_t is a matrix of weights, b_t is a bias vector, and φ(·) is a nonlinear activation function. Common activation functions include ReLU, sigmoid, and hyperbolic tangent. Figure 9.18 shows an example of a two-layer neural network. In this context, we can check properties of the neural network by computing the reachable set of the output layer given an input set.

For piecewise linear activation functions, we can compute the exact reachable set by partitioning the input space into different activation sets and computing the reachable set for each subset separately.¹⁹ For example, we can compute exact reachable sets for neural networks with ReLU activation functions (example 9.7). However, the number of subsets grows exponentially with the number of nodes in the network. Therefore, exact reachability analysis is often intractable for large neural networks, so it is common to instead use overapproximation techniques to bound the output set.

    ¹⁹ W. Xiang, H.-D. Tran, J. A. Rosenfeld, and T. T. Johnson, "Reachable Set Estimation and Safety Verification for Piecewise Linear Systems with Neural Network Controllers," in American Control Conference (ACC), 2018.

Figure 9.19. Example evaluations of the interval counterparts for three common neural network activation functions: ReLU, sigmoid, and tanh.

Figure 9.20. Computing the overapproximate reachable set of a two-layer neural network using natural inclusion functions. The true reachable set for each layer is shown in blue, and the interval overapproximation is shown in purple.

Similar to the nonlinear systems discussed earlier, we can use inclusion functions to overapproximate the output set of neural networks. By replacing each activation function with its interval counterpart, we obtain the natural inclusion function for a neural network. Figure 9.19 shows an example evaluation of the interval counterpart for the ReLU function. Evaluating the natural inclusion function for the network on a set of input intervals provides an overapproximation of the possible network outputs (figure 9.20).
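A minimal sketch of this interval propagation for a small two-layer ReLU network, with weights and input intervals chosen arbitrarily for illustration:

using IntervalArithmetic

relu(x) = max(x, zero(x))
layer(W, b, s) = relu.(W * s + b)   # equation (9.32) with φ = ReLU

W1, b1 = [1.0 -1.0; 0.5 2.0], [0.1, -0.2]
W2, b2 = [2.0 0.0; -1.0 1.0], [0.0, 0.3]

s1 = [interval(-1.0, 1.0), interval(0.0, 0.5)]  # input set as a vector of intervals
s2 = layer(W1, b1, s1)
s3 = layer(W2, b2, s2)   # interval overapproximation of the reachable output set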


Example 9.7. Exact reachability for a two-layer neural network with ReLU activation functions.

Suppose we want to propagate the input set S1 shown below through the first layer of the neural network in figure 9.18. We first apply the linear transformation to obtain the pre-activation region Z2 = W1 S1 ⊕ b1.

Next, we need to apply the nonlinear ReLU activation function to Z2 to compute S2. We divide Z2 into four subsets for which the ReLU function is linear. Each subset corresponds to a different activation pattern for the first layer. An activation pattern describes which nodes in the layer are active for a given input. A node is considered active if its input is greater than zero.

Each quadrant in Z2 corresponds to a different activation pattern and maps to a different subset in the output set S2. For example, in the first quadrant, both nodes are active, so the ReLU has no effect on the output. In the second quadrant, only the second node is active, so the inputs get mapped to a line. The output set S2 is the union of the four subsets. The plots demonstrate this process.

To compute the final output set of the neural network in figure 9.18, we would repeat this process for each of the subsets that comprise S2. The final output set will therefore be the union of 16 subsets.


As noted in section 9.2.1, natural inclusion functions tend to be overly conservative. Some techniques apply partitioning strategies to reduce overapproximation error.²⁰ Note that we cannot use Taylor inclusion functions or conservative linearization for ReLU networks because the ReLU function is not differentiable at zero. However, we can use other techniques to create inclusion functions that are tighter than the natural inclusion function.²¹

    ²⁰ W. Xiang, H.-D. Tran, and T. T. Johnson, "Output Reachable Set Estimation and Verification for Multilayer Neural Networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 11, pp. 5777–5783, 2018.
    ²¹ H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, and L. Daniel, "Efficient Neural Network Robustness Certification with General Activation Functions," Advances in Neural Information Processing Systems (NeurIPS), vol. 31, 2018.

We can also evaluate the support function of the output set of a neural network with d layers by solving the following optimization problem:

    minimize over s_d:   d⊤ s_d
    subject to:          s_1 ∈ S
                         s_d = f_n(s_1)        (9.33)

where f_n(s_1) is the neural network function and S is the input set. For ReLU networks, it is possible to write this optimization problem as a MILP by converting each ReLU activation function into its corresponding mixed-integer constraints (see example 9.5). To create the mixed-integer constraints, we need an upper and lower bound on the input to each ReLU. We can either select a sufficiently large bound for all nodes²² or compute specific bounds by evaluating the natural inclusion function.²³ To compute an overapproximation of the output set, we evaluate the support function in multiple directions.

    ²² M. Akintunde, A. Lomuscio, L. Maganti, and E. Pirovano, "Reachability Analysis for Neural Agent-Environment Systems," in International Conference on Principles of Knowledge Representation and Reasoning, 2018.
    ²³ V. Tjeng, K. Y. Xiao, and R. Tedrake, "Evaluating Robustness of Neural Networks with Mixed Integer Programming," in International Conference on Learning Representations (ICLR), 2018.

In addition to evaluating the support function, we can use the MILP formulation to check other properties of the neural network by changing the objective function or adding constraints.²⁴ For example, we can check if the output set intersects with a given avoid set or find the maximum disturbance that causes the network to change its output. In general, neural network verification approaches can be combined with the techniques discussed in this chapter to verify closed-loop properties of systems that contain neural networks.²⁵

    ²⁴ C. A. Strong, H. Wu, A. Zeljic, K. D. Julian, G. Katz, C. Barrett, and M. J. Kochenderfer, "Global Optimization of Objective Functions Represented by ReLU Networks," Machine Learning, vol. 112, pp. 3685–3712, 2023.
    ²⁵ M. Everett, G. Habibi, C. Sun, and J. P. How, "Reachability Analysis of Neural Feedback Loops," IEEE Access, vol. 9, pp. 163 938–163 953, 2021.

9.8 Summary

• Reachable sets for nonlinear systems are often nonconvex and difficult to compute exactly.

• We can apply a variety of techniques to overapproximate the reachable sets of nonlinear systems.


• Interval arithmetic allows us to propagate intervals through elementary functions.

• We can use interval arithmetic to create inclusion functions that provide overapproximate output intervals for nonlinear functions.

• Taylor inclusion functions overapproximate nonlinear functions by passing intervals through their Taylor series approximations.

• An nth-order Taylor model represents a set using a Taylor approximation of degree n − 1 and an interval remainder term that bounds the sum of the remaining terms in the Taylor series.

• While Taylor inclusion functions always output hyperrectangular sets, Taylor models represent more expressive reachable sets and tend to produce tighter overapproximations.

• We can sample the support function of the reachable set for nonlinear systems by solving an overapproximate linear program or mixed-integer linear program.

• Because nonlinear reachability methods tend to produce tighter overapproximations on smaller input sets, we can reduce overapproximation error by partitioning the input space into smaller sets and computing the reachable set for each smaller set.

• We can extend some of the techniques outlined in this chapter to analyze the output sets of neural networks.

10 Reachability for Discrete Systems

While the techniques in chapters 8 and 9 focus on reachability for systems with
continuous states, this chapter focuses on reachability for systems with discrete
states. We begin by representing the transitions of a discrete system as a directed
graph. This formulation allows us to use graph search algorithms to perform
reachability analysis. Next, we discuss techniques for probabilistic reachability
analysis, in which we calculate the probability of reaching a particular state or
set of states. We conclude by discussing a method to apply these techniques to
continuous systems by abstracting them into discrete systems.

10.1 Graph Formulation

Directed graphs are a natural way to represent the transitions of a discrete system. A directed graph consists of a set of nodes and a set of directed edges connecting the nodes. For discrete systems, each node represents a state of the system, and each edge represents a transition between states (figure 10.1). We can also associate a probability with each edge to represent the likelihood of the transition occurring.

Figure 10.1. Graph representation of a discrete system with two states, s1 and s2. The graph has a node for each state and an edge originating from each state for each possible transition. Each edge is labeled with the probability of the transition. For example, when we are in s1, we have a 0.8 probability of transitioning from s1 to s2.

Algorithm 10.1 creates a directed graph from a discrete system. For each discrete state, it computes the set of possible next states and their corresponding probabilities. It then adds an edge to the graph for each possible transition. Figure 10.2 shows the graph representation of the grid world system. For systems with large state spaces, it may be inefficient to store the full graph in memory. In these cases, we can represent the graph implicitly using a function that takes in a state and returns its successors and their probabilities.



Algorithm 10.1. Converting a discrete system to a directed weighted graph using an extension of the Graphs.jl package (see appendix D for more details). For each state returned by the states function, the algorithm calls the system-specific successors function to determine the set of possible next states and their corresponding probabilities. It then adds an edge to the graph for each possible transition.

function to_graph(sys)
    𝒮 = states(sys.env)
    g = WeightedGraph(𝒮)
    for s in 𝒮
        𝒮′, ws = successors(sys, s)
        for (s′, w) in zip(𝒮′, ws)
            add_edge!(g, s, s′, w)
        end
    end
    return g
end
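As a small illustration of this interface, the following sketch (not from the text) defines hypothetical states, 𝒮₁, and successors functions for the two-state system in figure 10.1, assuming the edge labels correspond to T(s1, s1) = 0.2, T(s1, s2) = 0.8, T(s2, s1) = 0.1, and T(s2, s2) = 0.9, and reusing the WeightedGraph-based to_graph of algorithm 10.1.

struct TwoStateEnv end                        # hypothetical environment type
states(env::TwoStateEnv) = [:s1, :s2]
𝒮₁(env::TwoStateEnv) = [:s1]                  # assumed initial state set
function successors(sys, s)                   # successor states and their probabilities
    return s == :s1 ? ([:s1, :s2], [0.2, 0.8]) : ([:s1, :s2], [0.1, 0.9])
end

sys = (env = TwoStateEnv(),)                  # minimal stand-in for a system object
g = to_graph(sys)                             # builds the two-node graph in figure 10.1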

Figure 10.2. Graph representation of the grid world system. Each node represents a grid world state, and each edge represents a possible transition between states. Darker edges have higher probabilities associated with them.


10.2 Reachable Sets

To compute reachable sets, we ignore the probabilities associated with the edges of the graph and focus only on its connectivity. The reachable sets are represented as collections of discrete states. We focus on two types of reachability analysis: forward reachability and backward reachability. Forward reachability analysis determines the set of states that can be reached from a given set of initial states within a specified time horizon.1 Backward reachability analysis determines the set of states from which a given set of target states can be reached within a specified time horizon. Figure 10.3 demonstrates the difference between the two types of reachability analysis, and the rest of this section presents algorithms for each type.

1. This process is sometimes referred to as bounded model checking.

Figure 10.3. Example of forward and backward reachability on a discrete system with six states. The forward reachability algorithm starts from the initial set S1 and progresses forward through the graph to determine the set of states that can be reached within a specified time horizon. The backward reachability algorithm starts from the target set ST and progresses backward through the graph to determine the set of states from which the target state can be reached within a specified time horizon.

10.2.1 Forward Reachability


To compute the forward reachable set from a set of initial states, we perform
a breadth-first search on the graph. We start with the initial set of states S1 = R1
and iteratively add the set of states reachable from the current set of states at the
next time step. In other words, Rd is the set of states for which the graph contains
an edge originating from a state in Rd−1 . We repeat this process for a specified
time horizon or until convergence. Algorithm 10.2 implements this technique,
and figure 10.4 shows the results on the grid world problem.


Algorithm 10.2. Forward reachability for discrete systems. The algorithm first creates the graph representation of the system by calling algorithm 10.1. For each depth d, it computes Rd by finding the set of states reachable from Rd−1 according to the edges in the graph and checks for convergence. The outneighbors function returns all nodes connected to the current node through an outgoing edge. The algorithm returns the union of all sets, which corresponds to R1:h.

struct DiscreteForward <: ReachabilityAlgorithm
    h # time horizon
end

function reachable(alg::DiscreteForward, sys)
    g = to_graph(sys)
    𝒮 = 𝒮₁(sys.env)
    ℛ = 𝒮
    for d in 2:alg.h
        𝒮 = Set(reduce(vcat, [outneighbors(g, s) for s in 𝒮]))
        ℛ == (ℛ ∪ 𝒮) && break
        ℛ = ℛ ∪ 𝒮
    end
    return ℛ
end
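Continuing the hypothetical two-state example defined after algorithm 10.1, a usage sketch of this algorithm might look as follows.

alg = DiscreteForward(5)    # horizon of 5 steps
ℛ = reachable(alg, sys)     # ℛ contains both :s1 and :s2, consistent with figure 10.1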

Figure 10.4. Forward reachable sets for the grid world system after 1, 5, and 10 steps and at convergence. Reachable states and their corresponding edges are highlighted in blue. In this example, the reachable set converges after 19 steps and shows that all states are reachable from the initial state.


The reachable set has converged once it no longer changes. If we find that R1:d = R1:d−1, the reachable set has converged, and R1:∞ = R1:d.2 We can also check for invariant sets by relaxing this condition. Specifically, if R1:d ⊆ R1:d−1, we can conclude that R1:d is an invariant set and that the system will remain within this set for all future time steps (R1:∞ ⊆ R1:d). Performing this check on discrete sets is straightforward because we can directly compare the states contained in each set.

2. This condition allows us to perform unbounded model checking, in which the output holds over all possible trajectories.
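As a two-line sketch of these checks on sets of discrete states (the variable names and example values here are hypothetical stand-ins for R1:d and R1:d−1):

ℛ_prev = Set([:s1, :s2])        # example value for R1:d−1
ℛ_curr = Set([:s1, :s2])        # example value for R1:d
converged = ℛ_curr == ℛ_prev    # R1:d = R1:d−1, so R1:∞ = R1:d
invariant = ℛ_curr ⊆ ℛ_prev     # R1:d ⊆ R1:d−1, so R1:∞ ⊆ R1:d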

10.2.2 Backward Reachability

In contrast with forward reachability, which starts from a set of initial states and progresses forward through the graph, backward reachability starts from a set of target states ST and progresses backward through the graph. The target set is often determined based on a specification for the system. For example, the target set may represent a set of goal states or a set of states that should be avoided. The backward reachable set B1:h represents the set of states from which the target set can be reached within the time horizon h.

Algorithm 10.3 computes backward reachable sets for discrete systems given a reachability specification. It has a structure similar to algorithm 10.2. However, instead of starting with the initial state set, it starts with the target set. It then iteratively computes Bd as the set of states from which the graph contains an edge ending at a state in Bd−1. We can check for convergence and invariance using the same conditions we use for forward reachability. Figure 10.5 shows the results of applying the algorithm to the grid world problem to compute the backward reachable sets from the goal and obstacle states.

10.3 Satisfiability

We can use the forward and backward reachable sets of discrete systems to
determine whether they satisfy a reachability specification (figure 10.6). For
forward reachability, we check whether the target set intersects with the forward
reachable set. For backward reachability, we check whether the initial set intersects
with the backward reachable set. In both cases, these checks require us to compute
the full forward or backward reachable set. This process can be computationally
expensive, especially for systems with large state spaces.
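The following sketch shows both checks on the hypothetical two-state system from section 10.1, assuming a reachability specification ψ whose set field holds the avoid set and reusing reachable and backward_reachable from algorithms 10.2 and 10.3.

ψ = (set = Set([:s2]),)                          # hypothetical avoid set
ℛ = reachable(DiscreteForward(10), sys)
forward_safe = isempty(ℛ ∩ ψ.set)                # false: s2 is forward reachable
ℬ = backward_reachable(DiscreteBackward(10), sys, ψ)
backward_safe = isempty(ℬ ∩ Set(𝒮₁(sys.env)))    # false: the initial state s1 can reach s2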


Algorithm 10.3. Backward reachability for discrete systems. The algorithm first creates the graph representation of the system by calling algorithm 10.1. For each depth d, it computes Bd by finding the set of states from which Bd−1 can be reached according to the edges in the graph and checks for convergence. The inneighbors function returns all nodes connected to the current node through an incoming edge. The algorithm returns the union of all sets, which corresponds to B1:h.

struct DiscreteBackward <: ReachabilityAlgorithm
    h # time horizon
end

function backward_reachable(alg::DiscreteBackward, sys, ψ)
    g = to_graph(sys)
    𝒮 = ψ.set
    ℬ = 𝒮
    for d in 2:alg.h
        𝒮 = Set(reduce(vcat, [inneighbors(g, s) for s in 𝒮]))
        ℬ == (ℬ ∪ 𝒮) && break
        ℬ = ℬ ∪ 𝒮
    end
    return ℬ
end

Figure 10.5. Backward reachable sets for the grid world system after 1, 2, and 5 steps and at convergence. The top row shows the backward reachable sets from the goal state (green), and the bottom row shows the backward reachable sets from the obstacle state (red). The reachable sets from the goal state converge after 14 steps, while the reachable sets from the obstacle state converge after 11 steps. The results show that the goal can be reached from any state outside the obstacle and that the obstacle can be reached from any state outside the goal.


Figure 10.6. Checking whether a discrete system satisfies a reachability specification using forward (blue) and backward (red) reachable sets. If the forward reachable set overlaps (purple) with the avoid set ST, the system is unsafe. Furthermore, if the backward reachable set overlaps (purple) with the initial set S1, the system is unsafe.

10.3.1 Counterexample Search

If our only goal is to check whether a system satisfies a reachability specification, we can use more efficient techniques that do not require us to compute the full reachable set. For example, we could perform the same breadth-first search we perform in algorithms 10.2 and 10.3 while only storing the states in the current and previous reachable sets, Rd and Rd−1, and performing a check for overlap with the target set at each iteration. This approach tends to be more memory-efficient than storing the full reachable set.

When the target set is an avoid set, we call this analysis counterexample search because reaching the target set represents a counterexample that proves that the system does not satisfy the specification.3 If the analysis converges without reaching any states in the avoid set, we can conclude that the avoid set is not reachable and the system satisfies the specification. Conversely, if we reach a state in the avoid set, we can terminate the search early and return the counterexample.

3. The term counterexample is another word for failure that is commonly used in formal verification.

If a counterexample exists, we can save computation by finding it early and terminating the search. In these cases, we may want to use a different graph traversal algorithm such as depth-first search. Depth-first search explores the graph by following a single path to its maximum depth before backtracking.4 It therefore allows us to more quickly begin searching over full trajectories, which could be more efficient for finding counterexamples.

4. We can also perform depth-first search to a fixed depth and increase the depth if no counterexamples are found. This process is known as iterative deepening.

The use of more sophisticated graph search algorithms may further increase efficiency.5

5. The Graphs.jl package in Julia implements a variety of graph search algorithms such as Dijkstra’s algorithm, the Floyd-Warshall algorithm, and heuristic search. More details on graph search are provided in S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 4th ed. Pearson, 2021.

Heuristic search algorithms such as the one introduced in section 5.3 further increase efficiency by using heuristics to prioritize paths that are more likely to lead to a counterexample. In cases where the system satisfies the specification and no counterexample exists, these algorithms have the same computational complexity as breadth-first search. Figure 10.7 compares the performance of breadth-first search, depth-first search, and heuristic search for finding counterexamples in the grid world problem.

Figure 10.7. Comparison of breadth-first, depth-first, and heuristic search for finding counterexamples in the grid world system. We use A∗ search as the heuristic search algorithm, which significantly improves efficiency over breadth-first and depth-first search.
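The following sketch (not the book’s implementation) illustrates the memory-efficient breadth-first variant described above. It reuses the to_graph, 𝒮₁, and outneighbors interfaces of algorithms 10.1 and 10.2, stores only the two most recent frontiers, and assumes ψ.set is the avoid set; counterexample_found is a hypothetical name.

function counterexample_found(sys, ψ, h)
    g = to_graph(sys)
    ℛprev, ℛcurr = Set(), Set(𝒮₁(sys.env))
    for d in 1:h
        !isempty(ℛcurr ∩ ψ.set) && return true    # avoid set reached: counterexample exists
        ℛnext = Set(reduce(vcat, [outneighbors(g, s) for s in ℛcurr]))
        ℛnext ⊆ ℛcurr ∪ ℛprev && return false     # converged without reaching the avoid set
        ℛprev, ℛcurr = ℛcurr, ℛnext
    end
    return false                                   # avoid set not reached within h steps
end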

10.3.2 Boolean Satisfiability

Graph search algorithms may be inefficient for systems with large state spaces, especially when each state has many neighboring states (example 10.1). In these cases, it may be more efficient to formulate the reachability problem as a Boolean satisfiability (SAT) problem. Solving a SAT problem involves searching for a satisfying assignment of the Boolean variables, or propositions, in a propositional logic formula (see section 3.4.1).6

In the context of reachability analysis, the Boolean variables in the SAT problem represent the discrete states of the system at each time step. The propositional logic formula encodes the possible initial states, transitions between states, and the failure condition. We can then pass the formula to a variety of different SAT solvers, which use heuristics to efficiently search for a satisfying assignment.7 If the SAT solver finds a satisfying assignment, the assignment corresponds to a counterexample, and the system does not satisfy the specification. If the SAT solver does not find a satisfying assignment, we can conclude that the system satisfies the specification.

6. Boolean satisfiability is also sometimes referred to as propositional satisfiability.

7. More information on SAT solvers is provided in A. Biere, Handbook of Satisfiability. IOS Press, 2009, vol. 185. The Satisfiability.jl package provides a Julia interface for many common SAT solvers. E. Soroka, M. J. Kochenderfer, and S. Lall, “Satisfiability.jl: Satisfiability Modulo Theories in Julia,” Journal of Open Source Software, vol. 9, no. 100, p. 6757, 2024.


Example 10.1. Demonstration of difficulties that arise when applying graph search algorithms to the wildfire problem.

The wildfire problem is an example of a problem in which graph search is intractable. Consider a wildfire scenario modeled as an n × n grid where each cell is either burning or not burning. At each time step, a burning cell has a nonzero probability of spreading the fire to each of its neighboring cells. A burning cell will also remain burning at the next time step with some probability. This problem has 2^(n²) possible states, and a state with b burning cells has as many as 2^(5b) possible successors. For a 5 × 5 grid, the state space has 2^25 = 3.4 × 10^7 states. For a 10 × 10 grid, that number increases to 2^100 = 1.27 × 10^30 possible states. The example below shows the successors for a state where only the cell in the center is burning.

Even though only one cell is burning, there are still 32 successor states. This number only increases as we increase the number of burning cells. A state with 10 burning cells has as many as 2^50 = 1.13 × 10^15 successors. For most grid sizes, even partially computing and storing the graph for the wildfire problem is intractable. For this reason, we cannot use graph search algorithms for this problem and must turn to other methods such as Boolean satisfiability.


For a system with states represented as Boolean vectors8 of length m and a given horizon h, we define s1:h as a set of Boolean variables each of length m representing the states at each time step. The initial state set is encoded as a propositional logic formula I(s1), which returns true if s1 is in the initial state set and false otherwise. Example 10.2 demonstrates how to encode the initial state of the wildfire problem. Next, we define a propositional logic formula for each state transition T(st, st+1) that returns true if st+1 is a successor of st and false otherwise (see example 10.3 for the wildfire problem). The failure condition is the negation of the specification ψ.

8. SAT solvers require Boolean variables. Satisfiability modulo theories (SMT) solvers extend SAT solvers to continuous variables. More information about SMT is provided in A. Biere, Handbook of Satisfiability. IOS Press, 2009, vol. 185.

Example 10.2. Encoding the initial state of the wildfire problem as a propositional logic formula using the Satisfiability.jl package.

Consider a wildfire problem with a 10 × 10 grid and a time horizon of h = 20. The state at a particular time step is represented as a set of Boolean variables that represent whether each grid cell is burning. The SAT problem will therefore have 100 × 20 = 2000 Boolean variables representing the states at each time step. We can represent the initial state as a propositional logic formula that evaluates to true when the bottom left cell is burning and all other cells are not burning. The following code implements this formula:

n = 10 # grid is n x n
h = 20 # time horizon
@satvariable(burning[1:n, 1:n, 1:h], Bool)
init = burning[1, 1, 1] # bottom left cell is burning
for i in 1:n, j in 1:n
    if i ≠ 1 || j ≠ 1 # all other cells are not burning
        init = init ∧ ¬burning[i, j, 1]
    end
end

Combining the initial state and transition formulas with the failure condition, we can create a single propositional logic formula that represents the reachability problem:

    I(s1) ∧ (T(s1, s2) ∧ T(s2, s3) ∧ · · · ∧ T(sh−1, sh)) ∧ ¬ψ(s1:h)    (10.1)

A SAT solver will search the space of possible values for the Boolean variables s1:h to find an assignment that satisfies the formula. A satisfying assignment corresponds to a feasible trajectory that satisfies the failure condition. Therefore, if the SAT solver determines that there are no satisfying assignments, we can conclude that the system satisfies the specification. Example 10.4 demonstrates how to use Boolean satisfiability to check reachability specifications for the wildfire problem.


Example 10.3. Encoding the transitions of the wildfire problem as a propositional logic formula using the Satisfiability.jl package.

The following code implements the propositional logic formula for the transitions of the wildfire problem:

transition = true
for i in 1:n, j in 1:n, t in 1:h-1
    transition = transition ∧ (
        burning[i, j, t+1] ⟹
            (burning[i, j, t] ∨
             burning[max(1, i-1), j, t] ∨
             burning[min(n, i+1), j, t] ∨
             burning[i, max(1, j-1), t] ∨
             burning[i, min(n, j+1), t])
    )
end

If a particular cell is burning at time t + 1, it must be the case that either it was burning at time t or one of its neighbors was burning at time t. The examples below show two evaluations of the transition proposition, one that evaluates to true and one that evaluates to false.

In the first case, both cells burning at time t + 1 were either burning at time t or had a neighbor that was burning at time t. In the second case, the cell at (3, 4) was not burning at time t, and none of its neighbors were burning at time t.


Example 10.4. Checking reachability specifications for the wildfire problem using Boolean satisfiability.

Suppose there is a densely populated area in the top right cell of the wildfire grid, and we want to determine whether it might burn. We can encode the failure condition as a propositional logic formula that evaluates to true when the top right cell is burning. We can then combine this formula with the initial state and transition formulas from equation (10.1) and pass it to a SAT solver to determine whether the top right cell is reachable. The following code demonstrates this process:

ψ = ¬burning[n, n, h]
reachable = sat!(init ∧ transition ∧ ¬ψ)

For a 10 × 10 grid with a horizon of 20, the burning variable has 10 × 10 × 20 = 2000 Boolean variables and therefore 2^2000 possible assignments. However, the SAT solver can efficiently search this space to find a satisfying assignment in a few seconds. If we decrease the time horizon to 18, the SAT solver is able to determine in a similar amount of time that none of the 2^1800 possible assignments satisfy the formula. This result indicates that the top right cell is not reachable within 18 time steps.

10.4 Probabilistic Reachability

Probabilistic reachability analysis computes the probability of reaching a target set by taking into account the probability of each transition between states.9 In some cases, the results allow us to build more confidence in a system than the reachable sets alone. For example, if the avoid set overlaps with the reachable set, our reachability analysis will conclude that the system is unsafe even if the probability of reaching the avoid set is very low. Example 10.5 demonstrates this property on the grid world problem. Probabilistic reachability analysis allows us to uncover these scenarios and provide a more useful safety assessment that focuses on actual risk.

9. When we consider the transition probabilities, the system represents a discrete-time Markov chain. More information on the analysis of Markov chains is provided in J. R. Norris, Markov Chains. Cambridge University Press, 1998. Open source software packages such as PRISM (M. Kwiatkowska, G. Norman, and D. Parker, “PRISM 4.0: Verification of Probabilistic Real-Time Systems,” in International Conference on Computer Aided Verification, 2011) and STORM (C. Hensel, S. Junges, J.-P. Katoen, T. Quatmann, and M. Volk, “The Probabilistic Model Checker Storm,” International Journal on Software Tools for Technology Transfer, pp. 1–22, 2022) implement these analysis techniques.
10.4.1 Probability of Occupancy
Determining the probability of occupancy involves computing a distribution over
reachable states at each time step. We denote this distribution as Pt , where Pt (s)
is the probability of occupying state s at time step t. The algorithm begins with an


Example 10.5. Comparison of reachable set analysis and probabilistic forward reachability analysis on the grid world problem.

Consider the grid world problem with a slip probability of 0.3. Running algorithm 10.2 with a time horizon h = 9 leads to the conclusion that the system is unsafe because the obstacle is included in the forward reachable set. However, the probability of reaching the obstacle after 9 steps when following the optimal policy is only 0.0004, and the system is more likely to be in a state near its nominal path to the goal. In this scenario, probabilistic reachability analysis provides a more useful assessment of the actual safety of the system. The plots below show the reachable set (left) and the results of a probabilistic reachability analysis (right).


initial state distribution P1. It then computes the distribution at each subsequent time step using the distribution from the previous time step and the transition probabilities between states as follows:

    Pt+1(s) = ∑s′∈S T(s′, s) Pt(s′)    (10.2)

where T(s′, s) is the probability of transitioning from state s′ to state s.
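As a small worked example, consider again the two-state system in figure 10.1 and assume the edge labels correspond to T(s1, s1) = 0.2, T(s1, s2) = 0.8, T(s2, s1) = 0.1, and T(s2, s2) = 0.9. If P1(s1) = 1 and P1(s2) = 0, equation (10.2) gives P2(s1) = 0.2 · 1 + 0.1 · 0 = 0.2 and P2(s2) = 0.8 · 1 + 0.9 · 0 = 0.8. Applying the update once more gives P3(s1) = 0.2 · 0.2 + 0.1 · 0.8 = 0.12 and P3(s2) = 0.8 · 0.2 + 0.9 · 0.8 = 0.88.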


Algorithm 10.4 implements probabilistic forward reachability using the graph representation of the system. The weights in the graph correspond to T(s′, s) in equation (10.2). The algorithm also uses the fact that the only nonzero terms in the sum in equation (10.2) are the terms corresponding to the incoming neighbors of s in the graph. The algorithm terminates after a desired time horizon h. Example 10.5 demonstrates this technique on the grid world problem.

Algorithm 10.4. Determining the probability of occupancy for discrete systems. The algorithm first creates the graph representation of the system using algorithm 10.1. It then begins with the initial state distribution and iteratively computes the distribution at each time step using equation (10.2). The inneighbors function returns all nodes in the graph that are connected to the current node through an incoming edge. The algorithm terminates when it has reached the time horizon.

struct ProbabilisticOccupancy <: ReachabilityAlgorithm
    h # time horizon
end

function reachable(alg::ProbabilisticOccupancy, sys)
    𝒮, g, dist = states(sys.env), to_graph(sys), Ps(sys.env)
    P = Dict(s => pdf(dist, s) for s in 𝒮)
    for t in 2:alg.h
        P = Dict(s => sum(get_weight(g, s′, s) * P[s′]
                 for s′ in inneighbors(g, s)) for s in 𝒮)
    end
    return SetCategorical(P)
end

10.4.2 Finite-Horizon Reachability

In addition to the distribution over states at a particular time step, we may also be interested in the probability that a target state or set of states is reached within a given time horizon. We denote this probability as Rt, where Rt(s) is the probability of reaching the target set ST when starting from state s within t time steps. Unlike the output of probabilistic occupancy analysis, Rt is not a distribution over states, and the probability values for all states will not sum to 1.


Example 10.6. Determining occupancy probabilities for the grid world problem.

The plots below show the results from probabilistic occupancy analysis on the grid world problem with a slip probability of 0.3. They show the distribution over reachable states at different time steps (P5, P10, P15, and P50), with reachable states appearing larger and darker states indicating a higher probability of reaching them. The nominal path is highlighted in gray.

While the obstacle state is reachable in three of the plots, the probability of occupying the obstacle state is low and the probability is much higher for states near the nominal path. After 50 time steps, most of the probability mass is in the goal state with a small portion in the obstacle state and the other grid cells. At this point, the probability of being in the goal state is 0.981 and the probability of being in the obstacle state is 0.018. We can use these numbers to draw conclusions about the overall safety of the system.


Similar to Pt, we can derive a recursive relationship to compute Rt such that

    Rt+1(s) = { 1                          if s ∈ ST
              { ∑s′∈S T(s, s′) Rt(s′)      otherwise        (10.3)

In other words, for states in the target set, the probability of reaching the target set is 1. For all other states, the probability of reaching the target set within t + 1 time steps is the sum of the probability of transitioning to each of its successors times the probability that they reach the target set within t time steps. We initialize R1 to be 1 for states in the target set and 0 otherwise.
We can use the results of this analysis to identify dangerous states for the system. Furthermore, if we know the initial state distribution P1 for the system, we can determine the probability of reaching the target set within a given time horizon by summing the probability of reaching the target set from each state weighted by the probability of occupying that state at time t = 1:

    Preach = ∑s∈S Rh(s) P1(s)    (10.4)
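As a small worked example with the same assumed transition probabilities for the two-state system in figure 10.1, suppose the target set is ST = {s2}. We initialize R1(s1) = 0 and R1(s2) = 1. Equation (10.3) then gives R2(s2) = 1 and R2(s1) = 0.2 · 0 + 0.8 · 1 = 0.8, and one more step gives R3(s1) = 0.2 · 0.8 + 0.8 · 1 = 0.96. If P1 places all probability on s1, equation (10.4) yields Preach = 0.96 for a horizon of h = 3.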

Algorithm 10.5 implements finite-horizon probabilistic reachability using the


graph representation of the system given a reachability specification, and fig-
ure 10.8 demonstrates this technique on the grid world problem.

Algorithm 10.5. Finite-horizon probabilistic reachability for discrete systems. The algorithm first creates the graph representation of the system using algorithm 10.1. It then initializes the probability of reaching the target set from each state and iteratively computes the probability at each time step using equation (10.3). The algorithm terminates when it has reached the time horizon. It returns the probability of reaching the target set in h time steps given the initial state distribution, which is computed according to equation (10.4).

struct ProbabilisticFiniteHorizon <: ReachabilityAlgorithm
    h # time horizon
end

function reachable(alg::ProbabilisticFiniteHorizon, sys, ψ)
    𝒮, g, dist = states(sys.env), to_graph(sys), Ps(sys.env)
    𝒮T = ψ.set
    R = Dict(s => s ∈ 𝒮T ? 1.0 : 0.0 for s in 𝒮)
    for d in 2:alg.h
        R = Dict(s => s ∈ 𝒮T ? 1.0 : sum(get_weight(g, s, s′) * R[s′]
                 for s′ in outneighbors(g, s)) for s in 𝒮)
    end
    return sum(R[s] * pdf(dist, s) for s in 𝒮)
end


Figure 10.8. Finite-horizon probabilistic reachability for the grid world problem with a slip probability of 0.6, shown for horizons of 5, 10, 15, and 200 time steps. The top row shows the result when we set the target set to the goal state, and the bottom row shows the result when we set the target set to the obstacle state. States with nonzero probability are colored according to the probability of reaching the target set. Given an initial state distribution that places all probability on the state in the bottom left corner, the probability of reaching the goal state within 200 time steps is 0.777, and the probability of reaching the obstacle state is 0.222.

10.4.3 Infinite-Horizon Reachability

If we run finite-horizon reachability analysis over a large horizon, the probability of reaching the target set will begin to converge (see figure 10.9). Therefore, in many scenarios, running the analysis for a sufficiently long time horizon is enough to draw conclusions about the overall safety of the system. However, it is also possible to compute the probability of reaching the target set in the limit as the time horizon approaches infinity. This probability is known as the infinite-horizon reachability probability, and we denote it as R∞(s).

Figure 10.9. Probability of reaching the goal state and the obstacle state in the grid world problem with a slip probability of 0.6 as a function of the time horizon h. We assume the system is initialized in the bottom left corner. As the horizon increases, the probabilities begin to converge.

To compute this probability, we rewrite the recursive relationship in equation (10.3) as

    Rt+1(s) = R1(s) + ∑s′∈S TR(s, s′) Rt(s′)    (10.5)

where

    TR(s, s′) = { 0           if s ∈ ST
                { T(s, s′)    otherwise           (10.6)

While this formulation is equivalent to equation (10.3), it allows us to compute the infinite-horizon reachability probability by solving a system of linear equations.


We can write this system in matrix form as

    Rt+1 = R1 + TR Rt    (10.7)

where Rt is a vector of length |S| such that the ith entry corresponds to Rt(si), and TR is a matrix of size |S| × |S| such that the entry in the ith row and jth column corresponds to TR(si, sj).10

10. This formulation is equivalent to a Markov reward process with an immediate reward of 1 for all states in the target set and 0 otherwise. The states in the target set are terminal states.

For an infinite horizon, we have that

    R∞ = R1 + TR R∞    (10.8)

We can solve for R∞ by rearranging the terms in equation (10.8) to get

    R∞ − TR R∞ = R1          (10.9)
    (I − TR) R∞ = R1         (10.10)
    R∞ = (I − TR)⁻¹ R1       (10.11)

Algorithm 10.6 implements infinite-horizon probabilistic reachability by converting the graph representation of the system to matrix form and solving the system of linear equations in equation (10.11). Example 10.7 shows the results on the grid world problem for different slip probabilities.
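As a minimal sketch of equation (10.11) using the LinearAlgebra standard library, consider again the two-state system in figure 10.1 with the transition probabilities assumed earlier and the target set {s2}.

using LinearAlgebra

T  = [0.2 0.8; 0.1 0.9]       # assumed transition matrix, entry (i, j) = T(sᵢ, sⱼ)
TR = copy(T); TR[2, :] .= 0   # zero the rows of target states (equation 10.6)
R₁ = [0.0, 1.0]               # 1 for target states, 0 otherwise
R∞ = (I - TR) \ R₁            # solve (I − TR) R∞ = R₁ (equation 10.11)
# R∞ ≈ [1.0, 1.0]: the target state s2 is eventually reached from both states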

Algorithm 10.6. Infinite-horizon probabilistic reachability for discrete systems. The algorithm creates R1 and TR from the graph representation of the system. The to_matrix function converts the graph to a transition matrix representation. The transition matrix can be represented as a sparse matrix if memory is constrained. The algorithm uses equation (10.11) to compute the infinite-horizon reachability probability from each state and uses the initial state distribution to compute the overall probability of reaching the target set.

struct ProbabilisticInfiniteHorizon <: ReachabilityAlgorithm end

function reachable(alg::ProbabilisticInfiniteHorizon, sys, ψ)
    𝒮, g, dist = states(sys.env), to_graph(sys), Ps(sys.env)
    𝒮Ti = [index(g, s) for s in ψ.set]
    R₁ = [i ∈ 𝒮Ti ? 1.0 : 0.0 for i in eachindex(𝒮)]
    TR = to_matrix(g)
    TR[𝒮Ti, :] .= 0
    R∞ = (I - TR) \ R₁
    return sum(R∞[i] * pdf(dist, state(g, i)) for i in eachindex(𝒮))
end

10.5 Discrete State Abstractions

The methods discussed in this chapter apply only to discrete systems. However,
we can use them to produce overapproximate reachability results for continuous
systems by creating a discrete state abstraction (DSA). To create a discrete state


Example 10.7. Infinite-horizon probability of reaching the obstacle for different slip probabilities in the grid world problem.

Suppose we want to understand the probability of reaching the obstacle state for grid world problems with different slip probabilities. The plots below show the results of infinite-horizon reachability analysis with the obstacle as the target set for slip probabilities of 0.3, 0.5, and 0.7. For each slip probability, we compute Pfail assuming we start in the bottom left corner of the grid; the resulting values are Pfail = 0.018, Pfail = 0.102, and Pfail = 0.49, respectively.

As the probability of slipping increases, the probability of reaching the obstacle state also increases, especially for states near the obstacle.

abstraction, we partition the continuous state space into a finite number of smaller
regions. We then create a graph where the nodes correspond to the regions, and
the edges correspond to transitions between regions. Figure 10.10 shows the
process of creating a DSA for the inverted pendulum problem.

10.5.1 Reachable Sets


To obtain overapproximate reachable sets of continuous systems using a DSA, it
is important to ensure that we overapproximate the transitions between regions.
In other words, if there exists a state in region S (i) that can transition to a state in
region S ( j) in one step, we add an edge between the nodes corresponding to S (i)
and S ( j) in the graph. This rule creates an overapproximation since there may be
some states in S (i) that cannot reach S ( j) in one step.
For continuous systems with bounded disturbances, we can calculate the reach-
able set using the algorithms in chapters 8 and 9 to determine the connectivity
of the graph. For each region in the partition S (i) ∈ S , we use a forward reach-
ability algorithm to compute the exact or overapproximate one-step reachable
set R(i) . For any region S ( j) ∈ S that intersects with R(i) , we add an edge from


Figure 10.10. Process of creating a discrete state abstraction for the inverted pendulum problem. In this particular example, we partition the continuous state space uniformly into 36 regions. We then create a graph where the nodes correspond to the regions and the edges correspond to the possible transitions between regions. The panels show the continuous state space, the partition, and the resulting DSA, with the angle θ (rad) on the horizontal axis and the angular velocity ω (rad/s) on the vertical axis.

the node corresponding to S (i) to the node corresponding to S ( j) . Example 10.8


implements this process to create a DSA for the inverted pendulum problem
using algorithm 10.1.
Once we have the graph representation of the DSA, we can apply algorithms 10.2
and 10.3 to determine its forward and backward reachable sets. We can then use
these results to determine overapproximate reachable sets for the continuous
system. Specifically, the overapproximate reachable set for the continuous sys-
tem is the union of all regions that correspond to a reachable node in the DSA.
Figure 10.11 shows this process for the inverted pendulum system.
The choice of partition in the DSA affects the amount of overapproximation error in the reachable sets. In general, a finer partition will result in less overapproximation error at the cost of increased computational complexity (figure 10.12). The examples in this chapter use a uniform partitioning strategy, which may be computationally prohibitive for high-dimensional systems. Adaptive partitioning strategies reduce the number of regions while maintaining a desired level of accuracy.11

11. S. M. Katz, K. D. Julian, C. A. Strong, and M. J. Kochenderfer, “Generating Probabilistic Safety Guarantees for Neural Network Controllers,” Machine Learning, vol. 112, pp. 2903–2931, 2023.

10.5.2 Probabilistic Reachability

For probabilistic reachability, the edges in the graph representation of the DSA correspond to overapproximate transition probabilities. Specifically, the weight on the edge from region S(i) to S(j) must be greater than or equal to the probability that any state in S(i) transitions to a state in S(j).


Figure 10.11. Forward reachability for the inverted pendulum system using a discrete state abstraction, shown after 1, 2, and 3 steps and at convergence. The top row shows the reachable sets (blue) of the DSA, and the bottom row shows the corresponding reachable sets (blue) in the continuous system. The x-axis represents the angle of the pendulum, and the y-axis represents the angular velocity.

The calculation of these overapproximated probabilities is system specific. Example 10.9 demonstrates this process for a continuum world problem with Gaussian disturbances on its transitions. Given these transition probabilities, we can apply algorithm 10.4 or algorithm 10.5 to determine the overapproximate probabilities of occupying or reaching a set of target states.12

12. Since the transition probabilities are overapproximations, we may calculate intermediate overapproximate probabilities that are greater than 1. In these cases, these probabilities should be clamped to a value of 1.

10.6 Summary

• We can represent discrete systems as directed graphs where the nodes represent states and the edges represent transitions between states.

• Forward reachable sets for discrete systems can be computed by applying breadth-first search from a set of initial states.

• Backward reachability algorithms begin with a set of target states and calculate the set of states that can reach the target set in a given time horizon.

• If our only goal is to check whether a system satisfies a reachability specification, we may be able to use more efficient algorithms that do not directly compute reachable sets, such as heuristic search or Boolean satisfiability.


Example 10.8. Creating a DSA for the inverted pendulum system using algorithm 10.1. The plots show the process of determining the connectivity of the graph for a single region S(i). The plot below shows the graph for the final DSA with a uniform partition of the state space into 64 regions.

We can create a DSA for the inverted pendulum system using algorithm 10.1 by defining the states function to partition the state space into a grid of regions and the successors function to determine the connectivity of the graph using a nonlinear forward reachability technique such as conservative linearization. Example implementations are as follows:

function states(env::InvertedPendulum; nθ=8, nω=8)
    θs, ωs = range(-1.2, 1.2, length=nθ+1), range(-1.2, 1.2, length=nω+1)
    𝒮 = [Hyperrectangle(low=[θlo, ωlo], high=[θhi, ωhi])
         for (θlo, θhi) in zip(θs[1:end-1], θs[2:end])
         for (ωlo, ωhi) in zip(ωs[1:end-1], ωs[2:end])]
    return 𝒮
end
function successors(sys, 𝒮⁽ⁱ⁾)
    _, 𝒳 = sets(sys, 2)
    ℛ⁽ⁱ⁾ = conservative_linearization(sys, 𝒮⁽ⁱ⁾ × 𝒳)
    ℛ⁽ⁱ⁾ = VPolytope([clamp.(v, -1.2, 1.2) for v in vertices_list(ℛ⁽ⁱ⁾)])
    𝒮⁽ʲ⁾s = filter(𝒮⁽ʲ⁾->!isempty(ℛ⁽ⁱ⁾ ∩ 𝒮⁽ʲ⁾), states(sys.env))
    return 𝒮⁽ʲ⁾s, ones(length(𝒮⁽ʲ⁾s))
end

The plots below demonstrate the successors function on an example state S(i). The function first computes R(i) using conservative linearization (left). It then determines the regions S(j) that intersect with R(i) (middle). Finally, the function returns these regions so that they can be connected in the graph (right). The edge weights can be ignored when computing reachable sets.

Algorithm 10.1 calls the successors function for each region in the partition to determine the connectivity of the graph. The result is shown in the caption.


Figure 10.12. Converged overapproximate forward reachable sets for the inverted pendulum using different resolutions of the DSA. As the resolution increases, the overapproximation error decreases. Each panel plots the angle θ (rad) against the angular velocity ω (rad/s).

• Probabilistic reachability analysis allows us to compute the probability of


reaching a set of target states in a finite or infinite time horizon.

• We can convert continuous systems into discrete systems by producing a


discrete state abstraction of the continuous system.

• We can apply the reachability algorithms for discrete systems to a DSA to


determine overapproximate reachable sets for its corresponding continuous
system.


Example 10.9. Overapproximation of the transition probabilities for the DSA of the continuum world system.

Suppose we have a continuum world problem with Gaussian disturbances on its transitions. For example, if the agent takes the up action, its next position is sampled from a Gaussian distribution with a mean 1 unit above its current state and a standard deviation of 1 in each direction. In other words, T(s, s′) = N(s′ | s + d, I), where d is the direction vector corresponding to the action taken in the state s. Our goal is to determine the overapproximated transition probabilities T(S, S′) for a DSA of the continuous system.

To obtain the probability of transitioning from a specific state s to a region in the partition S′, we integrate the transition function such that T(s, S′) = ∫S′ T(s, s′) ds′. To obtain an overapproximation of the transition probabilities, we select the transition from the current region S that results in the highest probability of reaching the target region S′ such that T(S, S′) = maxs∈S T(s, S′). The plots below show T(s, s′), T(s, S′), and T(S, S′) for a single state s and the regions in the DSA, demonstrating this process.

The maximization in the formula for T(S, S′) finds the state in S that puts the highest amount of probability mass in S′. The plots below demonstrate this maximization for three different next regions S′. This process produces an overapproximation of the transition probabilities since we assume all states in S transition to the worst-case next state.
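The following sketch (not the book’s implementation) computes this overapproximate transition probability for the Gaussian model above, assuming axis-aligned hyperrectangular regions described by lower and upper corner vectors and using the Distributions.jl package. Because the coordinates are independent under an isotropic Gaussian, the maximization over s ∈ S can be performed per coordinate: the probability mass assigned to an interval is largest when the mean is as close as possible to the interval midpoint.

using Distributions

function overapprox_transition(Slo, Shi, S′lo, S′hi, d)
    p = 1.0
    for k in eachindex(d)
        mid = (S′lo[k] + S′hi[k]) / 2                 # mean that maximizes the 1D mass
        μ = clamp(mid, Slo[k] + d[k], Shi[k] + d[k])  # restrict the mean to S shifted by d
        p *= cdf(Normal(μ, 1), S′hi[k]) - cdf(Normal(μ, 1), S′lo[k])
    end
    return p
end

# For example, the overapproximate probability of moving from the unit square
# [0, 1] × [0, 1] to the region [0, 1] × [1, 2] under the up action d = [0, 1]:
overapprox_transition([0, 0], [1, 1], [0, 1], [1, 2], [0, 1])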

11 Explainability

This chapter focuses on understanding system behavior through explanations. An


explanation is a description of a system’s behavior that helps a human understand
why it behaves in a particular manner. In this chapter, we discuss several types of
explanations. We begin by discussing policy visualization techniques that allow
us to interpret an agent’s policy. We then discuss feature importance techniques
to help us understand the input features that are most important to the behav-
ior of a system. For systems with complex policies, we discuss ways to create
interpretable surrogate models that approximate the system’s behavior. We also
discuss techniques for generating counterfactual explanations that change the
outcome of a particular scenario by making small changes to important features.
We conclude by introducing methods to categorize the failure modes of a system.

11.1 Explanations

We often desire explanations of system behavior when metrics such as failure


probabilities or reachable sets are insufficient for an adequate understanding of
the system. For example, it may be impossible to capture all possible edge-case
behaviors when creating a model of a complex system, which may cause us to miss
potential failure modes. We also may not be able to fully specify our objectives
for a system using metrics or logical specifications. This incompleteness may lead
to an alignment problem (section 1.1) in which the metrics and specifications
used to evaluate a system do not perfectly capture our true objectives. In this case,
explanations of system behavior may provide a better understanding of whether
the behavior is aligned with our objectives. The results can be used to debug the
system by informing changes to the policy, model, or specifications.

Figure 11.1. Visualization of the policies for the collision avoidance and inverted pendulum systems by plotting trajectory rollouts. We plot time on the horizontal axis and one of the state variables on the vertical axis (h (m) for collision avoidance and θ (rad) for the inverted pendulum).

Explanations are also important to the stakeholders of a system. They can be used to calibrate trust of end users by providing insight into a system’s decision-making process. This insight helps users understand the strengths of a system as well as its potential weaknesses. Explanations can also help stakeholders check the fairness of a system by identifying the factors that influence its decisions. Moreover, the end users of high-stakes decision-making systems such as loan approval systems often have a right to an explanation. The General Data Protection Regulation (GDPR) in the European Union requires that users be provided with an explanation of automated decisions that significantly affect them.1

The algorithms presented in this chapter provide descriptions of system behavior to human operators or stakeholders. A good description should be interpretable to humans in a way that allows them to explain and predict the system’s behavior.2 The interpretability of a description is the degree to which it can be readily parsed by humans. For example, a small decision tree with only a few nodes tends to be more interpretable than a large decision tree with hundreds of nodes. The explainability of a description is the degree to which it helps humans understand why a system behaves in a particular way.3

1. B. Goodman and S. Flaxman, “European Union Regulations on Algorithmic Decision-Making and a ‘Right to Explanation’,” AI Magazine, vol. 38, no. 3, pp. 50–57, 2017.

2. J. Colin, T. Fel, R. Cadène, and T. Serre, “What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods,” Advances in Neural Information Processing Systems (NeurIPS), pp. 2832–2845, 2022.

3. These definitions are often used interchangeably depending on the context.

11.2 Policy Visualization
One way to understand the behavior of an agent is to visualize its policy in a
way that is readily interpretable by humans. For example, we might visualize
how the policy affects the state of the system over time. We can generate these


plots by performing rollouts of the policy and plotting the state of the system at each time step. This visualization can help us understand how the policy behaves in different scenarios and identify potential failure modes. Figure 11.1 shows an example of this visualization technique for the aircraft collision avoidance and inverted pendulum policies.

Figure 11.2. Visualization of the actions taken by the inverted pendulum policy, plotted over the angle θ and angular velocity ω. The colors represent the torque applied in each state.

If the policy is Markov and therefore depends only on the current state, we can also visualize it directly by plotting the action taken by the agent in each state. If the state space is two-dimensional as in the inverted pendulum example, we can plot the action taken by the agent as a two-dimensional heatmap (figure 11.2). For higher-dimensional state spaces, we often need to apply dimensionality reduction techniques to visualize the policy. One common technique is to fix all but two of the state variables, which become associated with the vertical and horizontal axes. We can indicate the action for every state with a color. Example 11.1 demonstrates this technique for the collision avoidance policy.

Instead of fixing the hidden state variables, we could also use various techniques to aggregate over them (figure 11.3). One method involves partitioning the state space into a set of regions and keeping track of the actions taken in each region over a series of rollouts. We can then aggregate over these actions by plotting the mean or mode of the actions taken in each region. One benefit of this technique is that it relies only on rollouts of the policy and therefore extends to non-Markovian policies. Because all states may not be reachable in practice, some areas of the policy plot may have no data associated with them.
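As a minimal sketch of the two-dimensional heatmap visualization, the following code assumes the Plots.jl package and uses a stand-in policy function; π_pendulum and its form are hypothetical placeholders for the actual policy.

using Plots

π_pendulum(s) = clamp(-2.0 * s[1] - 0.5 * s[2], -1.0, 1.0)  # stand-in policy for illustration

θs = range(-1.0, 1.0, length=101)
ωs = range(-1.0, 1.0, length=101)
actions = [π_pendulum([θ, ω]) for ω in ωs, θ in θs]         # rows over ω, columns over θ
heatmap(θs, ωs, actions, xlabel="θ (rad)", ylabel="ω (rad/s)")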

11.3 Feature Importance

Feature importance algorithms allow us to understand the contribution of various


input features to the overall behavior of a system. We can use this analysis, for
example, to identify the features of an observation that are most important to
the agent’s decision or to identify the disturbances in a trajectory that have the
greatest effect on its outcome. In this section, we use the term feature to refer to
any component of a system trajectory that might affect the outcome. Features
could include the states, observations, actions, or disturbances of the system. We
can also derive more complex features by combining these basic features. For
example, we could create features for an aircraft collision avoidance system by
grouping states together into different configurations that represent different
relative positions of our aircraft and the intruder.


Example 11.1. Aircraft collision avoidance policy when the relative vertical rate is fixed at 0 m/s (left) and 4 m/s (right), and the previous action is fixed at no advisory. The colors represent the action taken by the agent in each state (no advisory, descend, or climb), plotted over the time to collision tcol (s) and the relative altitude h (m).

The figure in the caption shows policy plots for the aircraft collision avoidance policy when the relative vertical rate is fixed at 0 m/s and 4 m/s, and the previous action is fixed at no advisory. The red aircraft represents the relative location of the intruder aircraft. We can use these plots to explain the behavior of the policy in these scenarios. For example, we can see that when the relative vertical rate is fixed at zero, the policy advises our aircraft to climb when it is above the intruder and descend when it is below the intruder. This behavior is aligned with our objective of avoiding collisions.

The plot on the left also reveals some potentially unexpected behaviors. For example, when the time to collision is near zero and a collision is imminent, the policy results in no advisory. This behavior may prompt us to perform further analysis. For example, a counterfactual analysis (see section 11.5) reveals that a collision is inevitable in this scenario regardless of the action taken by the agent due to limits on the vertical rate of the aircraft.

Figure 11.3. Result of different aggregation methods for plotting the four-dimensional collision avoidance policy using data from 10,000 rollouts. On the left, we plot the mean of the vertical rate action taken by the agent in each state. On the right, we plot the action taken most frequently by the agent in each state. Regions with no data are also indicated.


The results of a feature importance analysis can lead to explanations of system behavior that allow us to check fairness and calibrate trust. For example, we could use feature importance to ensure that a loan approval system is not focusing on protected characteristics such as race or gender when making decisions. We could also use feature importance to calibrate trust by ensuring that the agent is focusing on the features required to make a decision. For instance, we could check that an image classifier is not focusing on irrelevant portions of the image when making decisions.4 We can also use feature importance to inform the design of future systems by identifying the features that are most important in causing the system to fail.

4. We want to ensure that the classifier is not exploiting spurious correlations between the input and output. Neural networks are prone to this type of behavior. R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann, “Shortcut Learning in Deep Neural Networks,” Nature Machine Intelligence, vol. 2, no. 11, pp. 665–673, 2020.

There are a variety of ways to define the importance of a particular feature. One definition involves holding all features other than the feature of interest constant and observing the effect on the system’s behavior when varying that feature. Techniques that use this definition are often referred to as sensitivity analysis techniques.5 This definition, however, focuses only on the effect of the feature of interest by itself and does not consider its interaction with other features. Another definition of feature importance involves examining the effect of the feature of interest in the context of the other features. Example 11.2 provides a scenario where considering the interactions between features produces a different result. This section presents techniques for determining feature importance using both definitions.

5. We could also imagine corrupting all features except the feature of interest and observing the effect on the system’s behavior. This technique is sometimes referred to as causal mediation analysis. J. Pearl, “Direct and Indirect Effects,” in Conference on Uncertainty in Artificial Intelligence (UAI), 2001.

11.3.1 Sensitivity Analysis

Figure 11.4. The trajectories show the effect of randomly changing the disturbance at a single time step on the rest of the trajectory (blue). The system on the right has a higher sensitivity to the disturbance applied at the time step than the system on the left, resulting in a wider variety of outcomes.

Sensitivity analysis techniques allow us to understand how a particular output changes when a single feature is changed. Examples of possible outputs include the robustness of a trajectory or the agent’s decision at a single time step. If we change the value of an input that has high sensitivity, we expect a large change in the output, while if we change an input with low sensitivity, we expect a small change in the output. Figure 11.4 illustrates this concept.

One way to approximate sensitivity is to randomly perturb a single feature and observe the effect on the system’s behavior. By repeating this process multiple times and taking the standard deviation or some other variability metric of the outcome of each trial, we obtain a measure of sensitivity. To evaluate the sensitivity of a decision at a single point in time, we often perturb one component of the observation and observe the effect on the decision (example 11.3). To evaluate the


Example 11.2. Motivation for considering the interactions between features when determining feature importance.

Consider a wildfire scenario modeled as a grid where each cell is either burning or not burning. At each time step, there is a 30% chance that a cell that was not burning at the previous time step will be burning if at least one of its neighbors was burning. The plots below show an example of a current state st and the probability that each cell is burning at the next time step p(st+1) (darker cells indicate higher probability). Suppose we are interested in understanding the features that are most important in determining the probability that the cell in the upper right corner will burn.

[Figure: grids showing st and p(st+1), with the upper right cell marked ∗.]

For this example, we will focus specifically on the feature that indicates whether the cell directly to the left of the upper right cell is burning. We can test the first definition of feature importance by changing that cell to not burning while holding all other cells constant and observing the effect on the probability that the upper right cell will burn. In this case (leftmost plots), the probability that the upper right cell will be burning at the next time step does not change. Therefore, we will conclude that this cell has no contribution to the output. However, if we remove fire from both this cell and the cell below the upper right cell (rightmost plots), the upper right cell changes to zero probability of burning at the next time step. The second definition of feature importance considers the interaction between these two features and would conclude that the cell does contribute to the output.

[Figure: grids showing s′t, p(s′t+1), s″t, and p(s″t+1), with the upper right cell marked ∗.]
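To make the update rule concrete, the following minimal sketch computes p(st+1) for every cell. It assumes the state is given as a Bool matrix where true means burning, and it additionally assumes (not stated in the example) that burning cells remain burning:

# Sketch of the wildfire transition in example 11.2: a non-burning cell
# ignites with probability 0.3 if at least one of its neighbors is burning.
function burn_probabilities(s::AbstractMatrix{Bool}; p_ignite=0.3)
    rows, cols = size(s)
    P = zeros(rows, cols)
    for i in 1:rows, j in 1:cols
        if s[i, j]
            P[i, j] = 1.0   # assumption: burning cells stay burning
        else
            neighbors = ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
            P[i, j] = any(1 ≤ a ≤ rows && 1 ≤ b ≤ cols && s[a, b]
                          for (a, b) in neighbors) ? p_ignite : 0.0
        end
    end
    return P
end

Holding all cells but one fixed and recomputing P corresponds to the first definition of feature importance, while removing combinations of burning cells corresponds to the second.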


Example 11.3. Sensitivity analysis at a single time step. Brighter pixels in the sensitivity map indicate pixels with higher sensitivity.

Suppose we have an agent that selects a steering angle for an aircraft based on runway images from a camera mounted on its wing. Given a particular input image, we can generate a sensitivity map to identify the pixels that are most important in determining the steering angle by fixing all but the pixel of interest and checking its effect on the steering angle output. The results are shown below, where the left image is the original image and the right image is the sensitivity map. This analysis indicates that the agent is focusing on the portion of the runway in front of it where the lines are most visible.

[Figure: Original Image and Sensitivity Map.]

Algorithm 11.1 estimates the sensitivity of the robustness of a particular trajectory with respect to the disturbance at each time step. For each time step in the trajectory, the algorithm perturbs the disturbance m times and computes the robustness of the resulting trajectories. The algorithm then returns the standard deviation of the change in robustness for each time step.6 The sensitivity of the robustness of a trajectory with respect to its disturbances can be used to identify the disturbances that have the greatest effect on the outcome of the trajectory.

6. This direct sampling algorithm is one of many different approaches to quantify the uncertainty in the output of a function given uncertainty in the input. Other methods such as those that use Taylor series approximations and polynomial chaos can be used instead. See the "Uncertainty Propagation" chapter in M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019.

Because algorithm 11.1 requires performing multiple rollouts for each time step, it tends to be inefficient for high-dimensional systems with many disturbances and long time horizons. If the output of interest is differentiable with respect to the input features, we can reduce computational cost by calculating the sensitivity using saliency maps. Saliency maps are a type of sensitivity map that use gradients to identify inputs that are most important, or salient, in determining a particular outcome. We can apply saliency maps to measure the sensitivity of both individual decisions and the outcomes of full trajectories.


Example 11.4. Sensitivity analysis over a full trajectory. Brighter colors in the sensitivity map of the inverted pendulum trajectory indicate higher sensitivity. The black line shows the true angle of the pendulum at each time step and the colored markers indicate the noisy observation of the current angle at each time step.

We can use sensitivity analysis to understand the effect of disturbances on the outcome of a trajectory. For example, consider an inverted pendulum system in which the agent's observation of its current angle is subject to a noise disturbance. We can estimate the sensitivity of the robustness of a trajectory with respect to its disturbances by perturbing the disturbances at each time step and observing the effect on the robustness of the trajectory. The results on a given failure trajectory are shown below. This analysis indicates that small changes in the disturbances at the beginning of the trajectory have a large effect on the robustness of the trajectory. Furthermore, the disturbances applied towards the end of the failure trajectory have little to no effect because the controller is saturated and the system cannot recover.

[Figure: θ (rad) versus Time (s) for the failure trajectory, colored by sensitivity.]

Algorithm 11.1. Algorithm for estimating the sensitivity of the robustness of a trajectory with respect to its disturbances. It takes as input a vector of trajectory features for the current trajectory we want to evaluate. These features can be converted to a trajectory using the system-specific extract function. The perturb function generates a new trajectory feature vector by perturbing the feature at a particular time step. The algorithm then computes the robustness of the perturbed trajectories and returns the standard deviation of the resulting change in robustness for each time step.

struct Sensitivity
    x       # vector of trajectory inputs (s, 𝐱 = extract(sys.env, x))
    perturb # x′ = perturb(x, t)
    m       # number of samples per time step
end

function describe(alg::Sensitivity, sys, ψ)
    m, x, perturb = alg.m, alg.x, alg.perturb
    s, 𝐱 = extract(sys.env, x)
    τ = rollout(sys, s, 𝐱)
    ρ₀ = robustness([step.s for step in τ], ψ.formula)
    sensitivities = zeros(length(τ))
    for t in eachindex(τ)
        x′s = [perturb(x, t) for i in 1:m]
        τ′s = [rollout(sys, extract(sys.env, x′)...) for x′ in x′s]
        ρs = [robustness([st.s for st in τ′], ψ.formula) for τ′ in τ′s]
        sensitivities[t] = std(abs.(ρs .- ρ₀))
    end
    return sensitivities
end


Algorithm 11.2. Algorithm for approximating the sensitivity of the robustness of a trajectory with respect to its disturbances using gradients. It takes as input a vector of trajectory features for the current trajectory we want to evaluate. These features can be converted to a trajectory using the system-specific extract function. The algorithm computes the robustness of the trajectory and returns the gradient of the robustness with respect to the input features.

struct GradientSensitivity
    x # vector of trajectory inputs (s, 𝐱 = extract(sys.env, x))
end

function describe(alg::GradientSensitivity, sys, ψ)
    function current_robustness(x)
        s, 𝐱 = extract(sys.env, x)
        τ = rollout(sys, s, 𝐱)
        return robustness([step.s for step in τ], ψ.formula)
    end
    return ForwardDiff.gradient(current_robustness, alg.x)
end
input features.

Figure 11.5. Sensitivity of the robustness of a trajectory with respect to its disturbances for an inverted pendulum system calculated using algorithm 11.2 compared to using algorithm 11.1. The black line shows the true angle of the pendulum at each time step and the colored markers indicate the noisy observation of the current angle at each time step. Brighter colors indicate higher sensitivity. The gradient calculation provides values similar to the sensitivity estimate from algorithm 11.1. (Panels: Sensitivity and Gradient Magnitude, plotting θ (rad) versus Time (s).)


A simple way to produce a saliency map given a set of inputs is to take the gradient of the output of interest with respect to the inputs.7 The saliency of a particular input is related to the magnitude of the gradient at that input. A high gradient magnitude indicates that small changes in the input will result in large changes in the output. In other words, inputs with high gradient values are more salient and indicate higher sensitivity. This method is often used to determine the components of an observation (such as the pixels of an image) that contribute most to an agent's decision.8 We can also use it to approximate sensitivity over a full trajectory by taking the gradient of a performance measure with respect to input features such as actions or disturbances. Algorithm 11.2 measures the sensitivity of the robustness of a trajectory with respect to its disturbances, and figure 11.5 shows an example on the inverted pendulum system.

7. D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, "How to Explain Individual Classification Decisions," Journal of Machine Learning Research, vol. 11, pp. 1803–1831, 2010.

8. K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps," in International Conference on Learning Representations (ICLR), 2014.

While algorithm 11.2 is more computationally efficient than algorithm 11.1, it is limited by its local nature. Important input features, for example, often saturate the output function of interest, causing the gradient to be small even when the feature is important.9 The integrated gradients10 algorithm addresses this limitation by averaging the gradient along the path between a baseline input and the input of interest (figure 11.6). The choice of baseline depends on the context. For images, a common choice is a black image (figure 11.7). For disturbances, we can set all disturbances to zero.

9. For image inputs in particular, it has also been shown that there are sometimes meaningless local variations in gradients that can lead to noisy sensitivity maps. D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg, "Smoothgrad: Removing Noise by Adding Noise," in International Conference on Machine Learning (ICML), 2017.

10. M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic Attribution for Deep Networks," in International Conference on Machine Learning (ICML), 2017.

Algorithm 11.3 calculates the sensitivity of the robustness of a trajectory with respect to the disturbances at each time step using integrated gradients. It takes m steps along the path between the baseline and the current input and computes the gradient of the robustness at each step. The algorithm then returns the average gradient at each time step.

Figure 11.6. Example of the integrated gradients algorithm for an input feature x. While x has a significant effect on the output function f(x), the gradient at the current value of x is small because f(x) is saturated (left). If we average the gradient of the output function f(x) along the path between a baseline input and the current value of the input of interest, we can capture this effect (right). Brighter colors of the gradient lines indicate higher magnitudes. (Panels: Saturated Gradient and Integrated Gradients.)


Figure 11.7. Comparison of the gradients of the output of an aircraft taxi network that selects a steering angle based on runway images from a camera mounted on its wing. As α moves from 0 to 1, the image moves from a baseline black image to the original image. The gradients are much higher for the pixel marked in red, indicating that it has a larger effect on the output of the network. However, if we only computed the gradient of the original image, the effect would appear similar to the effect of the pixel marked in blue. (Panels: Original Image and Gradient Values.)
Algorithm 11.3. Algorithm for approximating the sensitivity of the robustness of a trajectory with respect to its disturbances using integrated gradients. It takes as input a vector of trajectory features for the current trajectory we want to evaluate, a vector of baseline features, and the number of steps for numerical integration. For each step along the path between the baseline and the current input, the algorithm computes the gradient of the robustness. It then returns the average gradient at each time step.

struct IntegratedGradients
    x # vector of trajectory inputs (s, 𝐱 = extract(sys.env, x))
    b # vector of baseline inputs
    m # number of steps for numerical integration
end

function describe(alg::IntegratedGradients, sys, ψ)
    function current_robustness(x)
        s, 𝐱 = extract(sys.env, x)
        τ = rollout(sys, s, 𝐱)
        return robustness([step.s for step in τ], ψ.formula)
    end
    αs = range(0, stop=1, length=alg.m)
    xs = [(1 - α) * alg.b .+ α * alg.x for α in αs]
    grads = [ForwardDiff.gradient(current_robustness, x) for x in xs]
    return mean(hcat(grads...), dims=2)
end

Figure 11.8. Comparison of the sensitivity descriptions using algorithms 11.1 to 11.3 for an aircraft taxi system that selects a steering angle from an image observation. The sensitivity map focuses on the portion where the edge and center lines are most apparent, while the gradient-based methods focus only on the edges of the runway. The integrated gradients method provides a smoother map than the single gradient approach. (Panels: Original Image, Sensitivity, Gradient, Integrated Gradients.)

As m approaches infinity, the average gradient approaches the integral of the gradient along the path. Figure 11.8 compares the sensitivity estimates from algorithms 11.1 to 11.3 for an aircraft taxi system. All three methods produce slightly different descriptions of the agent's behavior, and in general, the most appropriate sensitivity estimate is application dependent.

11.3.2 Shapley Values

Computing the Shapley value of a feature allows us to evaluate its importance in the context of its interaction with other features. For example, it may be a combination of multiple disturbances that leads to a failure rather than a single disturbance. While sensitivity analysis techniques miss this interaction because they only vary one feature at a time, Shapley values capture it by varying all possible subsets of features.11

11. Shapley values were originally developed in the context of game theory in economics and are named for American mathematician and economist Lloyd Shapley (1923–2016). L. S. Shapley, "Notes on the N-Person Game—II: The Value of an N-Person Game," 1951.

Suppose we have a set of feature indices I = {1, ..., n}, and let Is ⊆ I be a subset of these features. Given a set of values for the features x and a function f that maps these values to an outcome, we define the following function to represent the expectation of the outcome while holding the features in Is constant:

    f_Is(x) = E[x′ | x′_i = x_i, i ∈ Is]    (11.1)

The Shapley value φi of feature i is then defined as the average marginal contribution of feature i to the expectation of the outcome over all possible subsets of features:

    φ_i(x) = ∑_{Is ⊆ I\{i}} [|Is|! (n − |Is| − 1)! / n!] (f_{Is ∪ {i}}(x) − f_{Is}(x))    (11.2)

Intuitively, computing the Shapley value of feature i involves looping over all possible subsets of features that do not include i and computing the difference in the expectation of the outcome when adding i to the subset. The constant factor in equation (11.2) ensures that subsets of different sizes are weighted equally. In general, Shapley values are expensive (often intractable) to compute due to the large number of possible subsets. For example, a function with 100 input features has 6.3 × 10²⁹ possible subsets.
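For a small number of features, however, equation (11.2) can be evaluated directly by enumerating subsets. The following sketch does this for a generic value function; value_fn is a hypothetical placeholder for an estimate of f_Is(x), such as a Monte Carlo estimate of the expected outcome with the features indexed by Is held fixed:

# Sketch of an exact Shapley value computation for feature i out of n features.
# value_fn(Is) is a hypothetical placeholder returning an estimate of f_Is(x).
function shapley_exact(value_fn, n, i)
    others = setdiff(1:n, i)
    ϕ = 0.0
    for mask in 0:(2^length(others) - 1)
        # decode the bitmask into a subset Is ⊆ I \ {i}
        Is = [others[j] for j in 1:length(others) if isodd(mask >> (j - 1))]
        w = factorial(length(Is)) * factorial(n - length(Is) - 1) / factorial(n)
        ϕ += w * (value_fn(sort([Is; i])) - value_fn(Is))
    end
    return ϕ
end

This loops over 2^(n−1) subsets, which is only practical for small n; the sampling approach described next avoids the enumeration.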


We can approximate the Shapley value using randomly sampled subsets.12 First, we rewrite equation (11.2) as follows:

    φ_i(x) = (1/n!) ∑_{P ∈ π(n)} [f_{P1:j}(x) − f_{P1:j−1}(x)]    (11.3)

where π(n) represents the set of all possible permutations of n elements, j is the index in the permutation P that corresponds to feature i, and P1:j represents the first j elements of P. We can then approximate the Shapley value using sampling. For each sample, we randomly permute the features and compute the difference in the expectation of the outcome when adding feature i to the features before it in the permutation.

12. E. Štrumbelj and I. Kononenko, "Explaining Prediction Models and Individual Predictions with Feature Contributions," Knowledge and Information Systems, vol. 41, pp. 647–665, 2014.

Algorithm 11.4 estimates the Shapley values for the disturbances in a trajectory to determine their contribution to the robustness of the trajectory. It takes in a current trajectory τ with disturbance trajectory x and a number of samples per time step m. For each time step in the trajectory, the algorithm randomly samples another disturbance trajectory w by performing a rollout using the nominal trajectory distribution.13 It then samples a random permutation P of the time steps and performs a rollout in which the disturbances are taken from x for the time steps in P1:j and from w for all other time steps. It similarly performs a rollout in which the disturbances are taken from x for the time steps in P1:j−1 and from w for all other time steps. The algorithm then computes the difference in the robustness of the two rollouts and averages the differences over m sampled permutations to estimate the Shapley value of each disturbance.

13. This step requires that the disturbances sampled at each time step are independent of one another. This assumption may break if the disturbances depend on the states, actions, or observations.

Figure 11.9 shows the Shapley values for the disturbances of the inverted pendulum trajectory used in example 11.4 and figure 11.5. The Shapley values differ from the sensitivity estimates because they account for interactions between disturbances. If we remove groups of disturbances with high Shapley values, it produces a large change in the outcome.

11.4 Policy Explanation through Surrogate Models

For agents with complex policies, it may be difficult to understand the reasoning
behind their decisions. In such cases, we can build surrogate models to approximate
the policy with a model that is easier to interpret. A good surrogate model should
have the following characteristics:


Algorithm 11.4. Estimating the Shapley values of the disturbances in a trajectory. The algorithm takes as input a trajectory τ and a number of samples per time step m. For each time step in the trajectory, the algorithm samples a random vector of disturbances by performing a rollout using the nominal trajectory distribution. It then samples a random permutation of the time steps and locates the current time step in the permutation. Using the shapley_rollout function, the algorithm computes the difference in the robustness of the trajectory when adding the disturbance at the current time step to the subset of disturbances in the permutation. It then averages the differences over m sampled permutations to estimate the Shapley value of each disturbance.

struct Shapley
    τ # current trajectory
    m # number of samples per time step
end

function shapley_rollout(sys, s, 𝐱, 𝐰, inds)
    τ = []
    for t in 1:length(𝐱)
        x = t ∈ inds ? 𝐱[t] : 𝐰[t]
        o, a, s′ = step(sys, s, x)
        push!(τ, (; s, o, a, x))
        s = s′
    end
    return τ
end

function describe(alg::Shapley, sys, ψ)
    τ, m = alg.τ, alg.m
    p = NominalTrajectoryDistribution(sys, length(alg.τ))
    𝐱 = [step.x for step in τ]
    ϕs = zeros(length(τ))
    for t in eachindex(τ)
        for _ in 1:m
            𝐰 = [step.x for step in rollout(sys, p)]
            𝒫 = randperm(length(τ))
            j = findfirst(𝒫 .== t)
            τ₊ = shapley_rollout(sys, τ[1].s, 𝐱, 𝐰, 𝒫[1:j])
            τ₋ = shapley_rollout(sys, τ[1].s, 𝐱, 𝐰, 𝒫[1:j-1])
            ϕs[t] += robustness([step.s for step in τ₊], ψ.formula) -
                     robustness([step.s for step in τ₋], ψ.formula)
        end
        ϕs[t] /= m
    end
    return ϕs
end


Figure 11.9. Shapley values for the disturbances in an inverted pendulum failure trajectory. The black line shows the true angle of the pendulum at each time step, and the colored markers indicate the noisy observation of the current angle at each time step. Brighter colors indicate higher Shapley values. The Shapley values are highest for disturbances that cause the agent to think that the pendulum is further from tipping over than it actually is, causing it to apply too small of a torque to move toward upright. The second plot shows the trajectory with the four disturbances with the highest Shapley values marked in red. If we remove these disturbances one at a time (blue trajectories in the third plot), we cause small changes in the outcome. However, if we remove all four disturbances (purple trajectory in the third plot) at once, we cause a large change in the outcome. (Panels: Shapley Values, Important Features, Removing Important Features, plotting θ (rad) versus Time (s).)


• High Fidelity: The surrogate model should accurately represent the policy. If the surrogate model does not adequately represent the policy, the explanations it provides may be misleading.

• High Interpretability: The surrogate model should be easily interpretable by humans. If the surrogate model is too complex, it may be difficult to understand the reasoning behind the decisions.

In general, there is a tradeoff between fidelity and interpretability. A more complex model may be higher fidelity but less interpretable, while a simpler model may be more interpretable but lower fidelity.

Surrogate models can provide local explanations or global explanations of a policy. Local explanations provide insight into a single decision, while global explanations provide insight into the full policy. To create a local surrogate model, we create a dataset of observations and corresponding decisions near the observation of interest and fit a model to this dataset.14 For global surrogate models, we gather data across the entire observation space. When selecting a model class to fit the data, we must consider the tradeoff between fidelity and interpretability. This section discusses this tradeoff for two common model classes used as surrogate models.

14. It is common to weight these data points with higher weights for observations that are closer to the observation of interest. M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why Should I Trust You?' Explaining the Predictions of Any Classifier," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
11.4.1 Linear Models

One common choice for a surrogate model is a linear model. Linear models have the form

    f(x) = ∑_{i=1}^{n} w_i x_i + b    (11.4)

where x_i is a feature of the observation, w_i is a weight for feature i, and b is the bias term. If the action space is discrete, we may apply the logistic or softmax function to the output of the linear model to obtain probabilities for each action. Linear surrogate models can be used to determine feature importance. The magnitudes of the weights of the linear model indicate the contribution of each feature to the agent's decision. Figure 11.10 demonstrates how to use a linear surrogate model to describe the behavior of a collision avoidance policy in two different regions of the observation space. This technique is particularly useful for high-dimensional observations, where it may be difficult to visualize the policy directly.
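As a rough sketch of how such a local surrogate might be fit, the code below samples observations around a point of interest, queries the policy, and solves a least-squares problem for the weights and bias. The helpers sample_near and policy_action are hypothetical, problem-specific functions, and the action is assumed to be a scalar continuous quantity:

using LinearAlgebra

# Sketch: fit a local linear surrogate around observation o₀.
# sample_near(o₀) draws an observation near o₀; policy_action(o) queries the agent.
function fit_local_linear(policy_action, sample_near, o₀; n_samples=1000)
    os = [sample_near(o₀) for _ in 1:n_samples]                 # local dataset
    ys = [policy_action(o) for o in os]                          # policy outputs
    X = permutedims(hcat([vcat(o, 1.0) for o in os]...))         # design matrix with bias column
    θ = X \ ys                                                    # least-squares fit
    return θ[1:end-1], θ[end]                                     # weights w, bias b
end

The magnitudes of the returned weights can then be read as local feature importances, as in figure 11.10. For a discrete action space, one would instead fit a logistic or softmax model, and the samples could be weighted by their distance to o₀ as in the LIME-style approach cited in the sidenote above.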


Figure 11.10. Linear surrogate model fit to samples in two different local regions (highlighted circles) of the observation space for a collision avoidance policy. The left column shows the original policy, where blue corresponds to the climb action, green corresponds to the descend action, and white corresponds to no advisory. The column in the middle shows the linear surrogate model fit to the samples in the highlighted circle indicated by the purple dots. The right column shows the feature weights of the linear surrogate model for each state variable. The linear surrogate model is accurate in the local region where it was fit, but may not be accurate in other regions of the observation space. Based on the feature weights, the linear surrogate model indicates that the agent's decision depends on both the relative altitude and time to collision when the relative altitude is around 50 m. In contrast, when the relative altitude is around 0 m, the agent's decision primarily depends on the relative altitude. (Columns: Original Policy, Linear Approximation, Feature Weights; axes h (m) versus tcol (s).)


For complex policies, a model that is simply a linear function of observation variables may not provide sufficient fidelity. We may need to add more complex features of the observation to the model. For example, we could add polynomial features of the observation to the linear model to capture nonlinear relationships between the features. Alternatively, we can train a neural network to learn a set of nonlinear features that can be linearly combined, but these features are generally not interpretable. Figure 11.11 shows the tradeoff between fidelity and interpretability for a linear model with polynomial features. A common technique to simplify linear models with many features is to encourage sparsity in the weights using a technique called LASSO regression.15 The features with nonzero weights in a sparse linear model tend to be the most relevant to the decision.

15. R. Tibshirani, "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 58, no. 1, pp. 267–288, 1996.
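A minimal sketch of this idea using proximal gradient descent (iterative soft-thresholding) is shown below; the step size η and regularization weight λ are illustrative values rather than recommendations, and in practice one would use a dedicated package:

# LASSO sketch: minimize (1/2m)‖Xw − y‖² + λ‖w‖₁ by iterative soft-thresholding.
soft_threshold(v, t) = sign(v) * max(abs(v) - t, 0.0)

function lasso(X, y; λ=0.1, η=1e-3, iterations=10_000)
    m = size(X, 1)
    w = zeros(size(X, 2))
    for _ in 1:iterations
        g = X' * (X * w - y) ./ m                  # gradient of the squared-error term
        w = soft_threshold.(w .- η .* g, η * λ)    # proximal step drives small weights to zero
    end
    return w
end

For the polynomial-feature surrogate in figure 11.11, X would hold the expanded features and y the sampled policy outputs.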

Figure 11.11. Tradeoff between interpretability and fidelity in a linear surrogate model. Each row corresponds to a linear model with first, second, and third order features, respectively. The left column shows the decision boundary of the surrogate model, with the black point indicating the state for which the model is evaluated. The right column shows the feature weights of the surrogate model. The plot below shows the original policy and the sampled points. As the order of the polynomial features increases, the model becomes more accurate in the local region where it was fit at the cost of lower interpretability. (Feature weights shown for h, tcol, t²col, h·tcol, h², t³col, h·t²col, h²·tcol, and h³; axes h (m) versus tcol (s); annotated from low fidelity / high interpretability at the top to high fidelity / low interpretability at the bottom.)


11.4.2 Decision Trees

Decision trees model policies as a series of simple decisions.16 Each node in the tree represents a decision based on a feature of the observation, and the leaf nodes represent the action taken by the agent. Decisions are typically represented as binary splits, where we follow the left branch in the tree if the feature value is less than a threshold and the right branch if the feature value is greater than or equal to the threshold. Example 11.5 shows a simple decision tree for a slice of the collision avoidance policy.

16. The DecisionTree.jl package can be used to train a decision tree model from data.

The maximum depth of the decision tree controls the tradeoff between fidelity and interpretability. Shallow decision trees tend to be more interpretable because they do not require many decisions to make a prediction. However, shallow trees are less expressive and may miss important features of the policy. Figure 11.12 shows the tradeoff between fidelity and interpretability for decision trees.
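Following the DecisionTree.jl suggestion in the sidenote, a sketch of fitting a shallow surrogate might look like the following; sample_observation and policy_action are hypothetical helpers, and the constructor and keyword names should be checked against the package documentation:

using DecisionTree

# Sketch: fit a depth-limited decision tree surrogate to a policy.
# sample_observation() draws an observation; policy_action(o) queries the agent.
function fit_tree_surrogate(policy_action, sample_observation; n=100_000, depth=2)
    os = [sample_observation() for _ in 1:n]
    X = permutedims(hcat(os...))            # n × d feature matrix
    y = [policy_action(o) for o in os]      # action labels
    model = DecisionTreeClassifier(max_depth=depth)
    fit!(model, X, y)
    return model
end

Once fit, predict(model, X) gives the surrogate's action for new observations, and the learned splits can be inspected to produce a diagram like the one in example 11.5.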

11.5 Counterfactual Explanations

A counterfactual is a hypothetical scenario that describes how an outcome would change if events had unfolded differently. Counterfactual explanations explain the behavior of a model by identifying the smallest change to the input that would result in a different outcome. We can frame the problem of generating a counterfactual explanation as a multiobjective optimization problem, in which our goal is to maximize the following four objectives:17

17. S. Dandl, C. Molnar, M. Binder, and B. Bischl, "Multi-Objective Counterfactual Explanations," in International Conference on Parallel Problem Solving from Nature, 2020.

1. Change in outcome: The counterfactual input should result in an outcome different from that obtained with the original input. If our goal is to change a trajectory τ from a failure to a success, we can use the temporal logic robustness metric as the objective:

    f_outcome(τ′) = robustness(τ′, ψ)    (11.5)

where τ′ is the counterfactual trajectory. By maximizing robustness, we move toward a trajectory that satisfies the safety property ψ.

2. Distance to original input: The counterfactual input should be close to the original input τ to ensure that the change is minimal, resulting in the following objective:

    f_close(τ′) = −‖τ′ − τ‖_p    (11.6)

where τ′ is the counterfactual input and ‖·‖_p is the p norm.

Example 11.5. Simple decision tree for the collision avoidance policy. The policy represented by the decision tree is shown in an accompanying plot of h (m) versus tcol (s).

Suppose we want to train a decision tree to approximate the slice of the collision avoidance policy shown in example 11.1. The following decision tree was trained on a dataset of 100,000 randomly sampled states from the policy slice. The decision tree has a maximum depth of 2 and uses the state variables to make decisions. Nodes that split using h are shown in black, nodes that split using tcol are shown in gray, and the colors of the square leaf nodes indicate the actions taken by the agent.

[Figure: decision tree. The root splits on h at 0; the h < 0 branch splits on h at −101 and the h ≥ 0 branch splits on h at 98. Leaf actions are climb, descend, and no advisory.]

With a maximum depth of 2, the decision tree only makes decisions based on h. If h is positive, the tree selects whether to climb or issue no advisory based on the magnitude of the relative altitude. Similarly, if h is negative, the tree selects whether to descend or issue no advisory based on the magnitude of the relative altitude. The policy represented by the decision tree is shown in the caption. This decision tree provides a simple, interpretable model of the agent's policy. However, the fidelity of the decision tree is limited by its depth, and it misses some key features of the policy that depend on the time to collision.


Figure 11.12. Tradeoff between fidelity and interpretability when training a decision tree surrogate model on the slice of the policy shown in example 11.1. Each row corresponds to a decision tree with a maximum depth of 2, 4, and 6, respectively. The left column shows the decision boundary of the surrogate model. The colors correspond to the colorscheme shown in example 11.5. As the maximum depth of the decision tree increases, the model becomes more accurate at the cost of lower interpretability. (Axes: h (m) versus tcol (s); annotated from low fidelity / high interpretability at the top to high fidelity / low interpretability at the bottom.)


3. Sparsity of the change: The difference between the original input and the counterfactual input should be sparse. In other words, the counterfactual input should differ in only a few features. We can use the following objective

    f_sparsity(τ′) = −‖τ′ − τ‖₀    (11.7)

where ‖·‖₀ returns the number of nonzero elements.18 This objective presents challenges for gradient-based optimization algorithms because its derivative is zero almost everywhere. To use gradient-based optimization, we can use the L1 norm, which encourages sparsity in the final solution.

18. This operation is sometimes referred to as the L0 norm; however, it is not a proper norm because it does not scale properly when multiplied by a scalar.

4. Plausibility: The new input should be a plausible input. We can check plausibility using the likelihood of the counterfactual trajectory as follows:

    f_plaus(τ′) = p(τ′)    (11.8)

The four counterfactual objectives are often at odds with one another. For example, only making small changes to the input is unlikely to change the outcome. We can use multiobjective optimization techniques to find counterfactual inputs that balance these objectives.19 Algorithm 11.5 creates a single objective function using a weighted sum of the objectives. We can apply a variety of optimization algorithms to find the counterfactual input that maximizes the objective function (see section 4.6).20 To ensure compatibility with gradient-based optimization techniques, we omit the objective in equation (11.7) and instead set p = 1 in equation (11.6) to encourage sparsity. Figure 11.13 shows the generation of a counterfactual explanation for a failure of the inverted pendulum system.

19. An overview of multiobjective optimization is provided in chapter 12 of M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019.

20. It is also common to use genetic algorithms that encourage diversity in the solutions to find a diverse set of counterfactual explanations. S. Dandl, C. Molnar, M. Binder, and B. Bischl, "Multi-Objective Counterfactual Explanations," in International Conference on Parallel Problem Solving from Nature, 2020.

Algorithm 11.5. Counterfactual objective function that combines equations (11.5), (11.6), and (11.8). The objective function takes in the counterfactual input x, the system sys, the specification ψ, the original input x₀, and a vector of weights ws. The algorithm computes each individual objective and returns their weighted sum. We take the logarithm of f_plaus for numerical stability.

function counterfactual_objective(x, sys, ψ, x₀; ws=ones(3))
    s, 𝐱 = extract(sys.env, x)
    τ = rollout(sys, s, 𝐱)
    foutcome = robustness([step.s for step in τ], ψ.formula)
    fclose = -norm(x - x₀, 1)
    fplaus = logpdf(NominalTrajectoryDistribution(sys, length(𝐱)), τ)
    return ws' * [foutcome, fclose, fplaus]
end
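As one possible way to use this objective, the sketch below hands its negative to a derivative-free optimizer from Optim.jl; the starting point is the original trajectory's features and the weights are illustrative values, not values from the text:

using Optim

# Sketch: search for a counterfactual input by maximizing the weighted objective.
# Optim minimizes, so we negate; x₀ holds the original trajectory features.
function find_counterfactual(sys, ψ, x₀; ws=[1.0, 0.1, 0.01])
    f(x) = -counterfactual_objective(x, sys, ψ, x₀; ws=ws)
    result = optimize(f, copy(x₀), NelderMead())
    return Optim.minimizer(result)
end

Decreasing the weight on the closeness term, as in figure 11.13, allows larger changes until the counterfactual trajectory is no longer a failure.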
Figure 11.13. Generation of a counterfactual trajectory for the inverted pendulum system by changing the disturbances on the measurement of θ. The original trajectory is shown in the plot on the top with the disturbance at each time step shown in black. The remaining plots show the counterfactual trajectories with different numbers of disturbances changed. The disturbances that differ from the original trajectory are shown in blue. We can create these trajectories by decreasing the relative weighting of the closeness objective in the counterfactual objective until we generate a trajectory that is no longer a failure. As noted in example 11.4, the disturbances at the beginning of the trajectory have the most significant impact on the outcome because the controller is saturated at the end of the trajectory. (Panels: Original, 6 changes, 7 changes, 8 changes, plotting θ versus Time (s).)


Figure 11.14. Counterfactual explanations (blue) for a failure of the collision avoidance system (black). The arrows represent the direction of the commanded collision avoidance advisory at each time step. No arrow indicates no advisory. The black arrows represent the original trajectory, while the blue arrow represents the action change used to generate the counterfactual trajectories. The plot on the left shows the counterfactual explanation when holding all other actions and disturbances constant. The plot on the right shows the counterfactual explanations when rolling out the trajectory for all other time steps. In this scenario, we can conclude that issuing a descend advisory a few time steps earlier would have prevented the failure. (Axes: h (m) versus tcol (s).)

We are often interested in producing counterfactual explanations for inputs that we can control. For example, a counterfactual explanation for a loan approval system that involves a change in the income of the applicant is more useful than an explanation that requires a change in their age. While we have control over the actions of an agent, we often do not have control over the disturbances that affect the system.
We can generate several different types of counterfactual explanations over the actions of the agent. The simplest type of counterfactual explanation involves changing an action at a particular time step or set of time steps while keeping all other actions constant. This technique is most similar to algorithm 11.5. A key assumption of this method is that the components of the input are independent of one another. However, changing the action at one time affects the state at the next time step, which in turn affects the action at the next time step. This cascading effect breaks the independence assumption and can lead to counterfactual explanations that are not plausible.

One way to account for the cascading effect is to select actions that maximize the expected change in the outcome given all possible future actions and disturbances. If we are searching for counterfactuals that only change the action at one time step, we can produce a set of counterfactual explanations by performing rollouts of the policy for the remaining time steps. We can then select the action that maximizes the expected change in the outcome. Figure 11.14 shows this technique for the aircraft collision avoidance example. Understanding the effects of changing multiple actions requires more sophisticated techniques.21

21. S. Dandl, C. Molnar, M. Binder, and B. Bischl, "Multi-Objective Counterfactual Explanations," in International Conference on Parallel Problem Solving from Nature, 2020.


11.6 Failure Mode Characterization

Another way to explain the behavior of a system is to characterize its failure modes. We can use clustering algorithms to create groupings of failure trajectories that are similar to one another. Identifying the similarities and differences between failures helps us understand their underlying causes. One common clustering algorithm is k-means22 (algorithm 11.6), which groups data points into k clusters based on their similarity to one another.23

22. This algorithm is also referred to as Lloyd's algorithm, named after Stuart P. Lloyd (1923–2007). S. Lloyd, "Least Squares Quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.

23. A detailed overview of clustering algorithms is provided in D. Xu and Y. Tian, "A Comprehensive Survey of Clustering Algorithms," Annals of Data Science, vol. 2, pp. 165–193, 2015.

To apply k-means, we must first extract a set of real-valued features from each failure trajectory to use for clustering. Let x represent the set of features from trajectory τ and φ be a feature extraction function such that x = φ(τ). To represent the clusters C, k-means keeps track of k cluster centroids µ1:k in feature space and assigns each trajectory to the cluster with the closest centroid to its features. We begin by initializing the centroids to the features of k random trajectories. At each iteration, k-means performs the following steps:

1. Assign each trajectory to the cluster with the closest centroid to its feature vector. In other words, τi is assigned to cluster Cj when

    j = arg min_{j ∈ 1:k} d(x_i, µ_j)    (11.9)

where d(·, ·) is a distance metric such as the L2 norm.

2. Update the centroids to the mean of the feature vectors of the trajectories in each cluster such that

    µ_j = (1/|C_j|) ∑_{τ ∈ C_j} φ(τ)    (11.10)

where |C_j| is the number of trajectories in cluster C_j.

The algorithm repeats until the centroids converge or a maximum number of iterations is reached. The k-means algorithm may converge to a local minimum depending on the initialization of the centroids, so it is common to run the algorithm multiple times with different initializations. Figure 11.15 shows the progression of the k-means algorithm on failure trajectories of the inverted pendulum system using the average angle and angular velocity of each trajectory as the features.


Algorithm 11.6. The k-means algorithm for clustering failure trajectories. The algorithm takes in a set of trajectories τs, a feature extraction function ϕ, a distance metric function d, the number of clusters k, and the maximum number of iterations max_iter. The algorithm first extracts the features from each trajectory and initializes the centroids to random trajectories. At each iteration, it assigns each trajectory to the cluster with the closest centroid according to the distance metric and updates the centroids to the mean of the feature vectors of the trajectories in each cluster.

struct Kmeans
    τs       # trajectories to cluster
    ϕ        # feature extraction function (x = ϕ(τ))
    d        # distance metric function (d(x[i], μⱼ))
    k        # number of clusters
    max_iter # maximum number of iterations
end

function describe(alg::Kmeans, sys, ψ)
    x = [alg.ϕ(τ) for τ in alg.τs]
    μ = x[randperm(length(x))[1:alg.k]]
    𝒞 = [Int[] for j in 1:alg.k]
    for _ in 1:alg.max_iter
        𝒞 = [Int[] for j in 1:alg.k]
        for i in eachindex(x)
            push!(𝒞[argmin([alg.d(x[i], μⱼ) for μⱼ in μ])], i)
        end
        for j in 1:alg.k
            if !isempty(𝒞[j])
                μ[j] = mean(x[i] for i in 𝒞[j])
            end
        end
    end
    return 𝒞, μ
end
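For the inverted pendulum clustering in figure 11.15, the feature extraction and distance functions passed to algorithm 11.6 might look like the following sketch; the state layout [θ, ω] and the variables τs_failures, sys, and ψ are assumptions for illustration:

using Statistics, LinearAlgebra

# Features: average angle and average angular velocity of a trajectory,
# assuming each state is a vector [θ, ω].
ϕ(τ) = [mean(step.s[1] for step in τ), mean(step.s[2] for step in τ)]
d(x, μ) = norm(x - μ)                      # Euclidean distance between feature vectors

alg = Kmeans(τs_failures, ϕ, d, 2, 100)    # cluster into k = 2 failure modes
𝒞, μ = describe(alg, sys, ψ)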


Figure 11.15. Progression of the k-means algorithm on failure trajectories of the inverted pendulum system with k = 2 and the average angle and angular velocity of each trajectory as the features. The colors represent the different clusters at each iteration and the black crosses represent the cluster centroids. The algorithm converges to a set of clusters that represent two distinct failure modes. One failure mode corresponds to trajectories in which the pendulum falls to the left, and the other cluster corresponds to trajectories in which the pendulum falls to the right. (Panels: Initialization, Iteration 1, Iteration 2, Converged; top row plots Average ω versus Average θ, bottom row plots θ (rad) versus Time (s).)

The clustering results help us understand the failure modes of the system. One way to interpret the clusters is to create a prototypical example for each cluster. The prototypical example for a given cluster is the trajectory that is closest to its centroid in feature space. By examining the prototypical examples, we can understand the characteristics of each failure mode. Figure 11.16 shows the prototypical examples for the final clusters in figure 11.15. At runtime, we can assign new failure trajectories to the cluster with the closest centroid to their features and use the prototypical examples to explain the failure mode of the trajectory.

Figure 11.16. Prototypical examples of failure modes in the inverted pendulum system using the clusters in figure 11.15. The prototypes reveal that one failure mode involves the pendulum falling to the left, while the other involves the pendulum falling to the right.

Algorithm 11.6 requires us to select the number of clusters k, the distance function d, and the feature extraction function φ. The clustering results are highly dependent on these choices. However, selecting the number of clusters and the features is often a subjective process that requires domain knowledge. To select the number of clusters, we can try different values for k and select the one that results in the most interpretable clusters or that minimizes a clustering objective such as the sum of the squared distances between each point and its cluster centroid.


We can also use domain knowledge to select features that are likely to capture the underlying causes of the failures. A simple way to select features is to create a feature vector by concatenating all of the states in the trajectory. We could create similar feature vectors for the actions, observations, and disturbances. However, these feature vectors will be high-dimensional and may not result in interpretable clusters (figure 11.17).

Figure 11.17. Clustering failure trajectories of the inverted pendulum system using features consisting of the states, actions, and disturbances of each trajectory, respectively. The colors represent the different clusters. The clusters based on the state and action features show interpretable failure modes, while the clusters based on the disturbance features are less interpretable. (Panels: State Features, Action Features, Disturbance Features, plotting θ (rad) versus Time (s).)

To improve interpretability of the clusters, we can cluster the trajectories based on temporal logic features. Specifically, we use the parameters of a parametric signal temporal logic (PSTL) formula as the features for clustering.24 PSTL is an extension of signal temporal logic (section 3.5.2), in which the time constants in the temporal operators and signal values in the atomic propositions may be replaced by parameters. PSTL expressions represent template formulas that can be instantiated to STL formulas with specific values for the parameters.

24. M. Vazquez-Chanlatte, J. V. Deshmukh, X. Jin, and S. A. Seshia, "Logical Clustering and Learning for Time-Series Data," in International Conference on Computer Aided Verification, 2017.

To perform clustering with PSTL features, we first select a template formula. We then set the features of each trajectory to the values of the parameters for which the formula is marginally satisfied. An STL formula is marginally satisfied by a trajectory if the robustness of the trajectory with respect to the formula is zero. By plugging these parameters into the template formula, we obtain a temporal logic formula that describes the behavior of the trajectory.

We can use optimization methods to find the values of the parameters that marginally satisfy the formula for each trajectory by finding the φ that minimizes ‖ρ(τ, ψφ)‖, where ρ is the robustness function and ψφ is the instantiated STL formula with parameters φ. Example 11.6 applies this idea to the inverted pendulum system. We can then perform clustering on the extracted PSTL features to identify failure modes.


Figure 11.18 shows the clusters of failure trajectories of the inverted pendulum system using the PSTL template in example 11.6.

Figure 11.18. Clusters of failure trajectories of the inverted pendulum system using the PSTL template in example 11.6 and k = 2. The algorithm results in two clusters. The pendulum falls over earlier in the blue trajectories compared to the purple trajectories.

Clustering using PSTL features requires us to select a template formula. The template formula should capture the key aspects of the system that are relevant to the failure modes. For systems with complex failure modes, it may be difficult to hand-design a template formula that captures all the failure modes. In these cases, we can use more sophisticated techniques that build decision trees using a grammar based on temporal logic.25

25. R. Lee, M. J. Kochenderfer, O. J. Mengshoel, and J. Silbermann, "Interpretable Categorization of Heterogeneous Time Series Data," in SIAM International Conference on Data Mining, 2018.

11.7 Summary

• Interpretable descriptions of system behavior are essential for understanding and calibrating trust.

• Policy visualization allows us to interpret the policy that a particular agent uses to make decisions.

• Feature importance methods such as saliency maps and Shapley values allow us to understand the impact of different features on the behavior of a system.

• Surrogate models allow us to explain the policy of a complex system using a simpler model and must balance between fidelity and interpretability.

• Counterfactual explanations provide insights into the decision-making process of a system by showing how changes in the input affect the output.

• We can characterize the failure modes of a system by clustering them using interpretable features.


Example 11.6. Example of a PSTL template formula for the inverted pendulum system. The plots show the robustness of the formula for different values of φ. Our goal is to find the value of φ that causes a given trajectory to marginally satisfy the formula.

The following STL formula specifies that the angle of the pendulum should not exceed π/4 for the first 200 time steps:

    ψ = □_[0,200] (θ < π/4)

If we replace the time bound with a parameter φ, we obtain the following PSTL template formula:

    ψφ = □_[0,φ] (θ < π/4)

The plots below show the robustness of the formula for different values of φ. The plot on the left shows a value for φ such that the trajectory satisfies the formula, the plot in the middle shows a value for φ that marginally satisfies the formula, and the plot on the right shows a value for φ such that the trajectory does not satisfy the formula.

[Figure: θ (rad) versus Time (s) for three cases: Satisfied (φ = 2.45), Marginally Satisfied (φ = 3.85), Not Satisfied (φ = 4.05).]

We can find the value of φ that marginally satisfies ψφ by searching for the value that causes the robustness to be as close as possible to zero. For this simple formula, we can solve the optimization problem using a grid search over the values of φ. The value of φ that marginally satisfies the formula will be the time just before the magnitude of the angle of the pendulum exceeds π/4.
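A minimal sketch of this grid search is shown below; always_lt(ϕ) is a hypothetical constructor that builds the instantiated STL formula □_[0,ϕ](θ < π/4) using the specification machinery from chapter 3, and the grid bounds are illustrative:

# Sketch: find the parameter ϕ at which a trajectory marginally satisfies ψ_ϕ.
# always_lt(ϕ) is a hypothetical stand-in for constructing □_[0,ϕ](θ < π/4).
function marginal_parameter(τ, always_lt; ϕs=range(0, stop=5, length=500))
    states = [step.s for step in τ]
    return argmin(ϕ -> abs(robustness(states, always_lt(ϕ))), ϕs)
end

The returned value of ϕ for each failure trajectory then serves as its feature for clustering.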

12 Runtime Monitoring

While the validation algorithms in the previous chapters are typically applied
offline prior to deployment, runtime monitoring techniques perform online assess-
ments that check the safety of the system during operation. The goal of a runtime
monitor is to flag situations that may be hazardous when they occur so that we
can trigger fallback mechanisms such as alerting a human operator, switching
to a safe mode or fallback policy, or updating the system’s model of the world.
Offline validation algorithms rely on a set of modeling assumptions about the
environment and disturbances that the system will encounter during operation.
If these models are incorrect or the environment changes, the validation results
may no longer be valid. This chapter begins by discussing techniques to identify
when we are operating outside the assumptions made during offline validation.
We then discuss techniques to monitor uncertainty in the behavior of the system.
Finally, we present techniques to monitor for potential failures in the system.

12.1 Operational Design Domain Monitoring

The operational design domain (ODD) of a system is the set of conditions under
which it is designed to operate safely. For example, the operational design domain
for an image-based aircraft taxi system may consist of the set of weather conditions,
times of day, and taxiways for which the system was designed. A good system
model should cover the ODD so that the validation results from the previous
chapters are valid within the ODD. If the system is operating outside the ODD, the
validation results may no longer be valid, and we cannot provide any guarantees
on the system’s safety. Therefore, it is important for us to monitor at runtime
whether a system is operating within its ODD.

There are multiple ways to represent the ODD of a system. One option is to specify the ODD as a set of hand-designed conditions. For example, we could write down the exact weather conditions, times of day, and taxiways that the aircraft taxi system is designed to operate under (figure 12.1). We can also specify the ODD in terms of acceptable ranges for the model parameters. For example, we might expect the variance of our sensor measurements to stay within a particular bound. A drawback of this approach is that it can require specialized domain knowledge to properly specify these conditions.

Figure 12.1. An example of a hand-designed operational design domain for an aircraft taxi system: no clouds, no glare, daytime, taxiway A.

We could also represent the ODD using data-driven approaches. These approaches rely on a data set of trajectory features that can be monitored at runtime and adequately capture the characteristics of the ODD. These features are problem dependent. For example, if the characteristics of the ODD are well-described by the state, the data could be the set of states observed during offline validation or training (figure 12.2). Some problems may require additional features to adequately represent the ODD. For example, the aircraft taxi system may perform differently depending on the image observation it receives. In this case, the data could be the set of images observed during offline validation or training. We could also use trajectory segments or full trajectories as the data.

Figure 12.2. For the continuum world problem, we can use the states visited during rollouts used for offline validation to derive a representation of the operational design domain.

Data-driven approaches to ODD monitoring use the data to define a set representation of the ODD. When the system encounters new data points at runtime, it can check whether they belong to the set and flag potentially dangerous behavior if they do not. The remainder of this section discusses techniques to define this set given a representative data set.
12.1.1 Nearest Neighbors Representation
One way to define the ODD given a data set is to use a nearest neighbors rep-
resentation. Specifically, we define the ODD as the set of points whose nearest
neighbor in the data set is within a certain threshold distance γ according to a
distance metric. A common distance metric is the Euclidean distance. The thresh-
old γ controls the conservatism of the set representation. A smaller γ results in
a more conservative representation in the sense of being less likely to include
situations that should not be included in the ODD. However, a value for γ that is
too small may be too conservative such that it misses out on situations that should
be included in the ODD. Figure 12.3 shows the ODD for the data in figure 12.2
defined using different threshold values.


Figure 12.3. ODD (blue) defined using nearest neighbors with different threshold values (γ = 0.1, 0.3, 0.5) for the data in figure 12.2. As the threshold value increases, the ODD covers more space and becomes less conservative.

Figure 12.4. ODD (blue) defined using k-nearest neighbors with different values of k (k = 1, 2, 5) for the data in figure 12.2. The distance threshold is held constant at γ = 0.5. As k increases, the ODD covers less space and becomes more conservative.

In addition to the threshold γ, we can also control the number of neighbors


used to define the ODD. Instead of only considering the nearest neighbor, we can
define the ODD as the set of points whose k-nearest neighbors in the data set are
within a certain threshold distance γ. Increasing k ensures that we do not include
points in the ODD that are near outliers in the data set. Figure 12.4 shows the
ODD defined using different values of k for the data in figure 12.2.
Algorithm 12.1 implements a runtime monitor that uses the nearest neighbors
representation to check whether a new input is within the ODD. It computes the
k-nearest neighbors to the input and checks if they are within the threshold γ. If
all of the neighbors are within the threshold, the monitor returns true, indicating
that the input is within the ODD.
A drawback of the nearest neighbors representation is that it requires storing the entire data set in memory at runtime, which may be infeasible for systems with memory limitations or large data sets. It may also require a significant amount of time to compute the nearest neighbors for each input, though a spatial index can help improve computational efficiency.¹

¹ H. Samet, Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006.

Algorithm 12.1. ODD monitoring using a set defined by k-nearest neighbors. The algorithm creates a k-d tree from the ODD data using the NearestNeighbors.jl package. A k-d tree is a data structure that allows for efficient nearest neighbor queries. The monitor then uses this data structure to find the k-nearest neighbors to the input and checks if they are within a threshold distance γ.

struct KNNMonitor
    data # ODD data matrix (each column is a datapoint)
    k    # number of neighbors
    γ    # threshold
end

function monitor(alg::KNNMonitor, input)
    kdtree = KDTree(alg.data)
    neighbors, distances = knn(kdtree, input, alg.k)
    return all(distances .< alg.γ)
end
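As a rough usage sketch (assuming algorithm 12.1 has been defined, the NearestNeighbors.jl package is loaded, and random data stands in for the ODD data set):

using NearestNeighbors

data = randn(2, 1000)            # stand-in for the ODD data (each column is a state)
alg = KNNMonitor(data, 5, 0.5)   # k = 5 neighbors, threshold γ = 0.5
monitor(alg, [0.1, -0.2])        # returns true if the input is within the ODD

Note that algorithm 12.1 rebuilds the k-d tree on every query; in practice, we might construct the tree once and store it in the monitor.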

One way to improve the memory efficiency of the nearest neighbors representation is to cluster the data into groups of similar points and compute the nearest neighbors to the center of each cluster (figure 12.5). A common clustering algorithm called k-means is described in section 11.6. We can increase the threshold γ to account for the distance between the input and the center of the cluster.

12.1.2 Polytope Representation


To avoid storing the entire data set in memory, we can use a more compact set representation such as a polytope (see section 8.3.1). One way to define the ODD using a polytope is with the convex hull of the data. We can then check if a new input is within the convex hull to determine if it is within the ODD. However, polytopes are not as expressive as the nearest neighbors representation and may produce representations that are insufficiently conservative when the ODD is nonconvex (figure 12.6).

Figure 12.5. Improving the efficiency of the nearest neighbors representation (blue) by clustering the data in figure 12.2 into 10 clusters and computing the nearest neighbors to the center of each cluster.

Figure 12.6. The ODD (blue) defined using the convex hull of the data in figure 12.2. This ODD representation is underconservative because it includes a large area for which there is very little data.

We can represent the ODD using a more expressive set by defining it as the union of multiple polytopes. For example, we could cluster the data set into k clusters using a clustering algorithm and take the union of the convex hulls of the clusters. Algorithm 12.2 implements this monitoring technique given a data set and a clustering of the data. Figure 12.7 shows the ODD for the data in figure 12.2 using different numbers of clusters. This approach, however, may still produce an ODD that contains regions of low data density if outliers are present.


Algorithm 12.2. ODD monitoring using a set defined by the convex hull of clustered data. For each cluster, the algorithm computes the convex hull and checks if the input is within it. If the input is within any of the convex hulls, the monitor returns true, indicating that the input is within the ODD.

struct HullMonitor
    data # ODD data matrix (each column is a datapoint)
    𝒞    # collection of vectors containing cluster column indices
end

function monitor(alg::HullMonitor, input)
    for (k, v) in alg.𝒞
        hull = convex_hull([alg.data[:, i] for i in v])
        if input ∈ VPolytope(hull)
            return true
        end
    end
    return false
end

Figure 12.7. The ODD defined using the convex hulls of multiple clusters (2, 3, and 5 clusters) for the data in figure 12.2. These ODDs are more expressive than the ODD that uses the convex hull of the entire data set. As we increase the number of clusters, the ODD more closely approximates the data at the expense of increased memory usage.


12.1.3 Superlevel Set Representation


Another way to define the ODD is to use a superlevel set. A superlevel set of a
function is the set of points for which the function is greater than a threshold.
For example, we could fit a distribution to the data set and define the ODD as a
superlevel set of its probability density function. Algorithm 12.3 implements this
monitoring technique given a distribution and a likelihood threshold.

Algorithm 12.3. ODD monitoring using a set defined by the superlevel set of a distribution. The algorithm checks whether the probability density function of the distribution evaluated at the input is greater than a threshold γ. If the probability density function is greater than the threshold, the monitor returns true, indicating that the input is within the ODD.

struct SuperlevelSetMonitor
    dist # distribution
    γ    # likelihood threshold
end

function monitor(alg::SuperlevelSetMonitor, input)
    return pdf(alg.dist, input) > alg.γ
end

When fitting a distribution to the data, it is important that we select a model class that is expressive enough to capture the characteristics of the ODD (see chapter 2). For example, figure 12.8 demonstrates a scenario in which fitting a mixture of Gaussians to the data results in a better representation of the ODD than fitting a single Gaussian. Because the distributions of many model classes can be fully specified using a small number of parameters, this approach is more memory efficient than the nearest neighbors representation. The method is also robust to outliers because the likelihood will be high where the data is dense.
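As a rough sketch using algorithm 12.3 (assuming the Distributions.jl package, with a single multivariate Gaussian as one possible model class and an assumed threshold):

using Distributions

data = randn(2, 1000)                   # stand-in for the ODD data (columns are states)
dist = fit(MvNormal, data)              # fit a multivariate Gaussian to the data
alg = SuperlevelSetMonitor(dist, 1e-2)  # γ = 1e-2 is an assumed likelihood threshold
monitor(alg, [0.0, 0.0])                # true if the density at the input exceeds γ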
Instead of using the superlevel set of a distribution fit to the data, we can also
use the superlevel set of a function that outputs the likelihood of the input being
in the ODD. For example, we can train a classifier to predict the likelihood of
a point being in the ODD and define the ODD as the set of points for which
the model outputs a likelihood greater than a threshold. Figure 12.9 shows an
example of this approach. One drawback of this approach is that training the
model typically requires data from inside and outside the ODD, which may be
difficult to obtain.

12.1.4 High-Dimensional Data


The methods presented in this section may struggle with high-dimensional data such as image data due to the curse of dimensionality. As the dimension of the data increases, the volume of the space the data must cover increases exponentially. This increase in volume makes it difficult to adequately cover the space with a limited amount of data and can cause distance metrics to lose meaning.

Figure 12.8. The ODD (blue, right column) defined using the superlevel set of a distribution (left column) for the data in figure 12.2. The top row shows the ODD defined using a Gaussian distribution, and the bottom row shows the ODD defined using a mixture of Gaussians. The mixture of Gaussians is a more expressive distribution that better captures the data and results in a more accurate ODD.

One way to address the curse of dimensionality is to gather more data and use more expressive models. For example, we could represent high-dimensional distributions using an expressive model such as a normalizing flow. However, this approach may lead to overfitting or poor generalization.² Another approach is to assume that the data lies on a lower-dimensional manifold and to use dimensionality reduction techniques³ to find this manifold.

² For example, researchers have shown that normalizing flows trained on images often assign higher likelihoods to images outside the ODD. P. Kirichenko, P. Izmailov, and A. G. Wilson, “Why Normalizing Flows Fail to Detect Out-Of-Distribution Data,” Advances in Neural Information Processing Systems, vol. 33, pp. 20578–20589, 2020.

³ Common approaches for dimensionality reduction include principal component analysis and autoencoders. A detailed overview is provided by B. Ghojogh, M. Crowley, F. Karray, and A. Ghodsi, Elements of Dimensionality Reduction and Manifold Learning. Springer, 2023.

Given a lower-dimensional projection of the data, we can use the methods described in this section to define the ODD. When we get new data at runtime, we can project it onto this lower-dimensional manifold and check whether it fits within the ODD. Figure 12.10 shows an example of a two-dimensional manifold for the aircraft taxi image observations. When creating a lower-dimensional representation of the data, it is important to ensure that the projection captures the relevant features of the data that define the ODD and that data outside the


Figure 12.9. The ODD (blue, right plot) defined using the superlevel set of a classifier (middle plot) trained on the data in figure 12.2 as well as additional data sampled uniformly outside the ODD. The classifier outputs the probability that a point is in the ODD. The superlevel set is defined as the set of points for which the classifier outputs a probability greater than 0.5, though this can be treated as a free parameter to control conservatism.

ODD in the original space is projected outside the ODD in the lower-dimensional space. A representation that is not expressive enough may result in feature collapse, where far points in the original space are projected to nearby points in the lower-dimensional space.⁴ Figure 12.11 shows an example of feature collapse in the two-dimensional manifold for the aircraft taxi example.

⁴ J. Postels, M. Segù, T. Sun, L. D. Sieber, L. Van Gool, F. Yu, and F. Tombari, “On the Practicality of Deterministic Epistemic Uncertainty,” in International Conference on Learning Representations (ICLR), 2022.

12.2 Uncertainty Quantification

Another important aspect of runtime monitoring is uncertainty quantification,


which allows us to understand our uncertainty in various aspects of the current
and future behavior of a system. If we have high uncertainty, we may want to
be more cautious and take actions to reduce this uncertainty. We can quantify
uncertainty by creating models that predict some aspect of the current or future
behavior of the system and understanding the uncertainty in their predictions.
For example, we could train a model to predict the next state of the system and
quantify the uncertainty in this prediction.
We may encounter two different types of uncertainty when monitoring a system. The first type of uncertainty is output uncertainty, which occurs when a single input can lead to multiple different outputs.⁵ For example, the output of the transition model of a system may be uncertain given the current state due to various sources of randomness in the real world such as the behavior of other agents. Another common cause of output uncertainty is sensor noise, which results in different possible observations for the same state. In previous chapters, we modeled this type of uncertainty using disturbance distributions.

⁵ This type of uncertainty occurs due to inherent stochasticity in the real world and is also referred to as aleatoric uncertainty or irreducible uncertainty.


Figure 12.10. Projection of the high-dimensional image data for the aircraft taxi example onto a two-dimensional manifold with coordinates φ1 and φ2. The black points represent the projection of runway images used for validation. The red points represent the projection of runway images with different lighting conditions. The blue region shows a nearest neighbors representation of the ODD defined using the black points. The images with different lighting conditions are projected outside the ODD. The projection uses the encoder of a variational autoencoder, which is a common dimensionality reduction technique for image data. Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, “Variational Autoencoder for Deep Learning of Images, Labels and Captions,” Advances in Neural Information Processing Systems (NeurIPS), vol. 29, 2016.

The second type of uncertainty is model uncertainty, which arises due to limitations of the model we are using to predict system behavior.⁶ A common limitation of data-driven models is a lack of data in certain regions of the input space. For example, if we train a model to predict system behavior using data from the ODD of the system, we cannot expect it to make accurate predictions outside the ODD. In other words, the model should have high uncertainty in regions of the input space with no data. This section outlines data-driven approaches to quantify both types of uncertainty. Output uncertainty is present in the data used to create the model, while model uncertainty occurs in regions of no data (figure 12.12).

⁶ This type of uncertainty is also referred to as epistemic uncertainty or reducible uncertainty.

12.2.1 Predicting Output Uncertainty


Because output uncertainty is inherent to the system, it will be present in the data we gather from the system. Therefore, we can use data to quantify output uncertainty by training a model to output the parameters of a distribution over its prediction. For example, if our goal is to predict the next state of the system, we can learn a model that outputs the mean and standard deviation of a Gaussian distribution conditioned on the input.


Figure 12.11. Example of feature collapse for the aircraft taxi system. The black points show the projection of runway images used for validation, and the blue region shows the nearest neighbors representation of the ODD. The red points show the projection of runway images with dark lighting conditions. While these images are outside the ODD in the original space, they are projected inside the ODD in the lower-dimensional space.

To learn the parameters of this model, we must minimize a proper scoring rule. A scoring rule is a function that takes as input both the true values and the predicted distribution and outputs a score that quantifies the quality of the predicted distribution. A proper scoring rule is a scoring rule that is minimized when the true distribution is equal to the predicted distribution. The negative log-likelihood is a common proper scoring rule for parameter learning.⁷

⁷ More details on proper scoring rules are provided by T. Gneiting and M. Katzfuss, “Probabilistic Forecasting,” Annual Review of Statistics and Its Application, vol. 1, no. 1, pp. 125–151, 2014.

A common model used to predict uncertainty in continuous outputs is the conditional Gaussian model, which outputs the parameters of a Gaussian distribution conditioned on the input.⁸ Example 12.1 derives the negative log-likelihood objective for learning the parameters of a conditional Gaussian model. Figure 12.13 shows the result of fitting a model using this objective to the data set in figure 12.12. The variance of the model will be larger in regions where the output data is more spread out, indicating higher output uncertainty.

⁸ We could also select a different continuous distribution to predict or train a model to predict quantiles of a distribution in a process known as quantile regression. R. Koenker and K. F. Hallock, “Quantile Regression,” Journal of Economic Perspectives, vol. 15, no. 4, pp. 143–156, 2001.


Figure 12.12. Output and model uncertainty in a data set. Output uncertainty is inherent uncertainty present in the data, while model uncertainty occurs where there is no data.

Figure 12.13. Fitting a conditional Gaussian model to the data set in figure 12.12 to quantify output uncertainty using the objective derived in example 12.1. We represent pθ(y | x) as a neural network that outputs a mean and log standard deviation. The solid blue line shows the mean prediction, and the shaded region shows 2σ around the mean and represents the output uncertainty in the prediction. The model correctly predicts higher output uncertainty for the data on the right.


Example 12.1. Learning the parameters of a conditional Gaussian model to quantify output uncertainty.

Suppose we want to quantify the output uncertainty when predicting a continuous variable y given an input x from a data set of (x, y) pairs. We can learn the parameters of the following conditional Gaussian model:

    pθ(y | x) = N(y | μθ(x), σθ(x)²)

where μθ(x) and σθ(x) are functions of x parameterized by θ. This model is similar to the model introduced in examples 2.4 and 2.5. However, instead of allowing only the mean to depend on the input x, we allow both the mean and variance to depend on x. We can learn the parameters θ by maximizing the likelihood of the data using the following optimization problem:

    θ̂ = arg max_θ ∑_{i=1}^m log pθ(yᵢ | xᵢ)
       = arg max_θ ∑_{i=1}^m log N(yᵢ | μθ(xᵢ), σθ(xᵢ)²)
       = arg max_θ ∑_{i=1}^m log[ (1 / √(2πσθ(xᵢ)²)) exp(−(yᵢ − μθ(xᵢ))² / (2σθ(xᵢ)²)) ]
       = arg max_θ ∑_{i=1}^m [ −log(√(2π)) − (1/2) log(σθ(xᵢ)²) − (yᵢ − μθ(xᵢ))² / (2σθ(xᵢ)²) ]
       = arg min_θ ∑_{i=1}^m [ (yᵢ − μθ(xᵢ))² / σθ(xᵢ)² + log(σθ(xᵢ)²) ]

Intuitively, the first term in the final objective encourages the model to predict a high variance when the squared error is high, and the second term penalizes high variances. This objective is commonly used in machine learning and is referred to as the Gaussian negative log-likelihood loss function. Figure 12.13 shows the result of fitting a model using this objective to the data set in figure 12.12.
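A minimal sketch of this objective in code (the model is a placeholder that is assumed to return the predicted mean and log standard deviation for an input):

# Gaussian negative log-likelihood loss from example 12.1;
# model(x) is assumed to return a tuple (μ, logσ)
function gaussian_nll(model, xs, ys)
    loss = 0.0
    for (x, y) in zip(xs, ys)
        μ, logσ = model(x)
        σ² = exp(2logσ)
        loss += (y - μ)^2 / σ² + log(σ²)
    end
    return loss
end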


For discrete outputs, we can train a model hθ(x) to output a vector y that specifies the relative likelihood of each possible output given the input. To ensure that the model outputs valid probabilities, we apply a softmax function to the output of the model to obtain a vector of probabilities p such that

    pᵢ = exp(yᵢ) / ∑_{j=1}^k exp(yⱼ)        (12.1)

where k is the total number of possible outputs. We can learn the parameters θ of the model by maximizing the likelihood of the data given this model.

Given a distribution over predicted outputs, we can quantify our uncertainty using the entropy of the distribution. Higher entropy indicates higher uncertainty in the prediction, and we may want to flag situations that result in outputs with high entropy as potentially dangerous. The entropy of a conditional Gaussian distribution is a function of the variance of the distribution. A higher variance results in a higher entropy. For discrete distributions, the entropy is defined as

    −∑_{i=1}^k pᵢ log pᵢ        (12.2)

where pᵢ is the probability of the ith output. A model that assigns equal probability to all outputs will have maximum entropy, indicating high uncertainty in the prediction (figure 12.14). We can use a threshold on entropy to create a runtime monitor that flags uncertain situations.

Figure 12.14. Entropy of a discrete distribution over four possible outputs. If the distribution assigns high probability to a single output, the entropy will be low. If the distribution assigns equal probability to all outputs, the entropy will be high.
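A minimal sketch of such an entropy-based monitor built from these equations (the model and the entropy threshold are assumptions):

softmax(y) = (e = exp.(y .- maximum(y)); e ./ sum(e))   # equation (12.1), shifted for numerical stability
entropy(p) = -sum(pᵢ * log(pᵢ) for pᵢ in p if pᵢ > 0)    # equation (12.2)

struct EntropyMonitor
    model     # maps an input to a vector of relative likelihoods
    threshold # entropy threshold (assumed tuning parameter)
end
monitor(alg::EntropyMonitor, input) = entropy(softmax(alg.model(input))) ≤ alg.threshold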

12.2.2 Calibrating Output Uncertainty


We can measure the performance of our uncertainty quantification model using
the visual diagnostics and summary metrics outlined in sections 2.5.1 and 2.5.2.
Because negative log-likelihood is a proper scoring rule, if we were able to perfectly
minimize the negative log-likelihood of the data, the resulting model would be
perfectly calibrated. However, in practice, we have limited model capacity and
imperfect optimization, so the model distribution may not exactly match the true
distribution. Therefore, it is common to calibrate the model after training to better
match the true distribution.


Calibration techniques typically rely on a separate set of calibration data. The calibration data should consist of pairs of inputs and outputs that are sampled independently from the data distribution the model will encounter at runtime. We can use this data to assess baseline performance by comparing the model distribution to the distribution of the calibration data (figure 12.15). We can then apply calibration techniques to adjust the model distribution to better match the calibration data distribution.

Figure 12.15. Calibration plot of a neural network model trained to predict the discrete actions of the continuum world agent using the data in figure 12.2. A calibrated model should match the dashed line. The model is poorly calibrated and tends to be underconfident in its predictions.

A common technique to calibrate a model is to perform histogram binning on the desired uncertainty metric. For discrete outputs, a common uncertainty metric for calibration is the predicted probability of the correct output. For continuous outputs predicted using a Gaussian distribution, a common uncertainty metric for calibration is the predicted variance of the distribution. The first step in histogram binning is to divide the calibration data into bins using the predicted values of the uncertainty metric. We typically select the bin boundaries to create bins of equal width or an equal number of samples.
After binning the data, we can calculate the actual value of the uncertainty metric for each bin. For example, for discrete outputs, we can calculate the actual probability of the correct output within each bin according to the calibration data. We can then adjust the
model predictions to match the actual values of the uncertainty metric in each bin.
For example, if the model is underconfident in its predictions, we can increase the
predicted probability of the model in the corresponding bins. When we obtain a
new data point, we calculate its predicted uncertainty metric to determine which
bin it belongs to. We can then use the actual value of the uncertainty metric in
that bin to adjust the model prediction. Example 12.2 provides an example of this
process.
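A minimal sketch of histogram binning for a binary output (the number of bins and data names are assumptions for illustration):

using Statistics

# ps: predicted probabilities on the calibration set, ys: true binary labels (0 or 1)
function histogram_binning(ps, ys; nbins=10)
    edges = collect(range(0, 1, length=nbins + 1))
    values = zeros(nbins)
    for b in 1:nbins
        idx = findall(p -> edges[b] ≤ p < edges[b+1] || (b == nbins && p == 1.0), ps)
        values[b] = isempty(idx) ? (edges[b] + edges[b+1]) / 2 : mean(ys[idx])
    end
    return edges, values
end

# adjust a new predicted probability by looking up the actual value in its bin
calibrate(p, edges, values) = values[clamp(searchsortedlast(edges, p), 1, length(values))]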
Histogram binning requires storing the bin edges and corresponding calibration values at runtime. Other techniques focus instead on fitting a single calibration parameter to the data. For example, a common calibration technique for models that output probabilities of discrete outputs using a softmax function is to introduce a precision parameter λ such that

    pᵢ = exp(λyᵢ) / ∑_{j=1}^k exp(λyⱼ)        (12.3)

This model is similar to the softmax response model introduced in section 2.4.2. We can select the precision parameter λ that minimizes a proper scoring rule such as the negative log-likelihood of the calibration data.⁹

⁹ Because we are fitting a single parameter, the risk of overfitting is low, so we do not necessarily need a separate set of calibration data. This method is also sometimes referred to as temperature scaling with temperature 1/λ. C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On Calibration of Modern Neural Networks,” in International Conference on Machine Learning (ICML), 2017.
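A minimal sketch of selecting λ by grid search on calibration data (the logits and labels variables are placeholder names):

# negative log-likelihood of the calibration data as a function of the precision λ;
# logits is a vector of model output vectors and labels holds the correct output indices
function calibration_nll(λ, logits, labels)
    nll = 0.0
    for (y, c) in zip(logits, labels)
        p = exp.(λ .* y) ./ sum(exp.(λ .* y))   # equation (12.3)
        nll -= log(p[c])
    end
    return nll
end

λ̂ = argmin(λ -> calibration_nll(λ, logits, labels), 0.1:0.1:4.0)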


Example 12.2. Histogram binning calibration for a model that predicts a binary discrete output. The rightmost plot shows the calibrated runtime data after applying the calibration technique. A well-calibrated model should follow the dashed line.

Suppose we have a model that predicts binary outputs for a system by predicting a probability p that the output is 1. Given a data set of calibration data, we can see that the model tends to be underconfident in its predictions (left). For example, for data points where the model predicts that the probability of the output being 1 is between 0.5 and 0.6, the actual probability of the output being 1 according to the calibration data is around 0.73. If we were to deploy this model at runtime without additional calibration, we would observe a similar trend (center).

[Plots: calibration data, runtime data, and calibrated data, showing the true probability ptrue versus the predicted probability pmodel (and the calibrated probability p̂model).]

We can use the bins of the model probability and their corresponding actual probabilities from the calibration data to adjust the model predictions. For example, suppose we get a new input at runtime that has a predicted probability according to the model of 0.52. This probability falls into the highlighted bin above, where the actual probability is 0.73. We should therefore adjust that probability such that p̂ = 0.73 and use this probability to make runtime monitoring decisions. The plot on the right shows the result of applying this calibration technique to the runtime data. The model is now well-calibrated and follows the dashed line.


Figure 12.16. Calibrating the continuum world action prediction model shown in figure 12.15 using a precision parameter λ in the softmax model. The plot shows the negative log-likelihood of the calibration data for different values of λ. The calibration plots show the calibrated model predictions for different values of λ. We should select the value of λ that minimizes the negative log-likelihood of the calibration data (shown in the center).

Figure 12.16 shows the result of applying this calibration technique to calibrate the model in figure 12.15. For Gaussian models, we can apply a similar technique by introducing a single scaling parameter to the predicted variance of the model.

While fitting a single calibration parameter reduces the complexity of the calibration procedure, it is not as expressive as histogram binning. For example, there may not be a single precision parameter λ that adequately calibrates all bins of the model. Other calibration techniques use more complex models with multiple parameters to adjust the model predictions.¹⁰

¹⁰ These techniques include Platt scaling and isotonic regression. A. Niculescu-Mizil and R. Caruana, “Predicting Good Probabilities with Supervised Learning,” in International Conference on Machine Learning (ICML), 2005.

12.2.3 Prediction Sets

Another approach to quantifying uncertainty is to predict a set of possible out-
desired level of coverage. The coverage of a prediction set is the probability that
the true output lies within the set. For example, a prediction set with coverage
of 0.95 should contain the true output 95 % of the time. A large prediction set
indicates high uncertainty, while a small prediction set indicates low uncertainty.
For example, a large prediction set for the location of an autonomous vehicle
indicates that we have high uncertainty in its true location.


Figure 12.17. Prediction sets with α = 0.95 for models that predict continuous and discrete outputs. The prediction sets are the regions between the dashed blue lines. The column on the left shows large prediction sets that indicate high uncertainty, while the column on the right shows small prediction sets that indicate low uncertainty.

Given a desired level of coverage c, we can derive a prediction set from a model trained using the methods in section 12.2.1 to predict a distribution over the output. For models that predict the parameters of a Gaussian distribution, it is common to create a prediction set centered around the predicted mean. We create this set by extending outward from the mean until the set includes c probability mass.¹¹ For models that predict the probabilities of discrete outputs, we can create a prediction set by adding outputs to the set in order of decreasing predicted probability until the sum of the probabilities of the outputs in the set exceeds c.

¹¹ Prediction sets do not need to be centered around the mean. Any prediction set that occupies c probability mass is a valid prediction set. However, the centered prediction set is the smallest possible set that contains c probability mass.

Figure 12.17 shows small and large prediction sets for both discrete and continuous models. Before generating prediction sets, it is important to ensure that the model is well-calibrated. If the model is not well-calibrated, the prediction sets may be too small or too large. For example, if the model is underconfident in its predictions, the prediction sets may be too small. We can calibrate the model using the techniques described in section 12.2.2.

We can also generate accurate prediction sets from an uncalibrated uncertainty measure using a technique known as conformal prediction.¹² Similar to the techniques described in section 12.2.2, conformal prediction uses a calibration set


to adjust the uncertainty measure. The data points in the calibration set must be exchangeable, meaning that the joint distribution of the data points does not change based on the order in which they appear.¹³ However, conformal prediction does not require a model that predicts a distribution over the output. Instead, it uses the calibration set to adjust any heuristic uncertainty measure.

¹² A detailed overview of conformal prediction is provided in A. N. Angelopoulos, S. Bates, et al., “Conformal Prediction: A Gentle Introduction,” Foundations and Trends in Machine Learning, vol. 16, no. 4, pp. 494–591, 2023.

¹³ Exchangeability is a more relaxed condition than independence. Any set of variables that are independent and identically distributed are also exchangeable.

The first step of conformal prediction involves identifying a heuristic notion of uncertainty. For example, we can use the parameters of an uncalibrated output distribution from a model trained using the methods in section 12.2.1. Next, conformal prediction requires a score function s(x, y) that encodes how well the predicted uncertainty in the output conditioned on the input x matches the true output y. The score should be lower when there is good agreement between the prediction and the true output. Example 12.3 shows a score function for a Gaussian model, and example 12.4 shows a score function for a model that predicts discrete outputs.

The final step of conformal prediction involves computing the score for each point in the calibration set. We then find the score q that corresponds to the ⌈(n + 1)c⌉/n quantile of the calibration scores, where ⌈·⌉ is the ceiling function and n is the number of points in the calibration set. Given a new input at runtime, the prediction set that guarantees a coverage of at least c is

    {y | s(x, y) ≤ q}        (12.4)

as long as the new input is exchangeable with the calibration set.¹⁴ Example 12.5 shows an example of a prediction set for a Gaussian model, and example 12.6 shows an example of a prediction set for a model that predicts discrete outputs.

¹⁴ A proof of this property is provided by A. N. Angelopoulos, S. Bates, et al., “Conformal Prediction: A Gentle Introduction,” Foundations and Trends in Machine Learning, vol. 16, no. 4, pp. 494–591, 2023.

The practicality of conformal prediction is highly dependent on the heuristic notion of uncertainty and the score function. If these choices do not accurately reflect the true uncertainty in the model, the prediction sets may be too large to be useful. It is also important to note that conformal prediction provides guarantees on the marginal coverage of the prediction sets and not the conditional coverage. In other words, the prediction sets will have a coverage of at least c on average, but the coverage may vary for different inputs (figure 12.18). Example 12.7 highlights a limitation of conformal prediction caused by this property.
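A minimal sketch of the conformal calibration step, using the Gaussian score that example 12.3 describes (the calibration data and model outputs are placeholders):

using Statistics

score(μ, σ, y) = abs(y - μ) / σ   # Gaussian score function (example 12.3)

# conformal threshold q from the calibration scores for a desired coverage c
function conformal_quantile(scores, c)
    n = length(scores)
    return quantile(scores, min(ceil((n + 1) * c) / n, 1.0))
end

# resulting prediction interval for a new input with predicted mean μ and standard deviation σ
prediction_interval(μ, σ, q) = (μ - q * σ, μ + q * σ)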


Example 12.3. Score function for conformal prediction using a conditional Gaussian model as a heuristic notion of uncertainty.

An example of a score function for the Gaussian model described in example 12.1 is

    s(x, y) = |y − μθ(x)| / σθ(x)

where μθ(x) and σθ(x) are the predicted mean and standard deviation. The score will be low for inputs where the predicted mean is close to the true output. The score will also be low if the predicted output is far from the true output but the predicted standard deviation is large enough to account for this variation.

The plots below show the score function for different predicted distributions and corresponding true outputs. Although the true output is further from the predicted mean in the center plot than it is in the left plot, the two data points produce the same score (s(x, y) = 1.5) because the predicted standard deviation is larger in the center plot. In the plot on the right, the predicted standard deviation does not account for the increased gap between the predicted mean and the true output, resulting in a higher score (s(x, y) = 3.0).

[Plots: three predicted Gaussian distributions with their means μθ(x) and true outputs y, with scores 1.5, 1.5, and 3.0.]


Example 12.4. Score function for conformal prediction using a model that predicts probabilities of discrete outputs.

Suppose we want to use the softmax probabilities of a model fθ(x) that predicts discrete outputs as a heuristic notion of uncertainty. We first define the function πj(x) to return the index of the output with the jth highest predicted probability. The score function can then be defined as

    s(x, y) = ∑_{j=1}^k fθ(x)_{πj(x)}

where k is selected such that y = πk(x). In other words, the procedure for determining the score is as follows. We first compute the predicted probabilities of the model for the input x. We then add the predicted probabilities to the score in decreasing order until we have added the probability for the true class.

The plots below show an example computation of the score function for a model that predicts probabilities of the action that a continuum world agent will take. The correct output is y = left. The left plot shows the predicted probabilities of the model for each action. The right plot shows the predicted probabilities sorted in decreasing order (down, left, up, right). The score function is the sum of the probabilities in the sorted list until the true action is reached, which results in a score of 0.7.

[Bar plots: predicted probabilities for each action (left) and the same probabilities sorted in decreasing order (right).]


Example 12.5. Prediction set from conformal prediction using a Gaussian model and the score function from example 12.3.

Plugging the score function from example 12.3 into equation (12.4) provides us with the following prediction set for a Gaussian model:

    {y | s(x, y) ≤ q} = {y : |y − μθ(x)| / σθ(x) ≤ q}
                      = {y : |y − μθ(x)| ≤ q σθ(x)}

where q is the quantile of the calibration scores that corresponds to the desired coverage c. This prediction set is centered around the predicted mean and extends outward q standard deviations from the mean.

Intuitively, conformal prediction scales the standard deviation based on the results of the calibration data. For example, suppose we want to create a prediction set with 95 % coverage. If the model was perfectly calibrated, we would expect q ≈ 2 because 95 % of the data should lie within two standard deviations of the mean. However, if the model was underconfident, we would expect q > 2 to produce larger prediction sets, and if the model was overconfident, we would expect q < 2 to produce smaller prediction sets.


Example 12.6. Prediction set from conformal prediction using a model that predicts probabilities of discrete outputs and the score function from example 12.4.

Plugging the score function from example 12.4 into equation (12.4) provides us with the following prediction set for a model that predicts probabilities of discrete outputs:

    {y | s(x, y) ≤ q} = {y : ∑_{j=1}^k fθ(x)_{πj(x)} ≤ q}

where q is the quantile of the calibration scores that corresponds to the desired coverage c. In other words, this prediction set is the set of outputs with the highest predicted probabilities that have a cumulative probability mass of at least c.

Suppose we found that the score q that corresponds to the 95 % quantile of the calibration scores is 0.7. The prediction set for two different inputs is shown in the plots below. The input on the left results in a prediction set with two actions ({left, down}), while the input on the right results in a prediction set with one action ({up}).

[Bar plots: predicted action probabilities for the two inputs, with prediction sets {left, down} and {up}.]


Figure 12.18. Comparison of marginal and conditional coverage of prediction sets. The blue lines indicate the prediction sets with 95 % coverage. Green points are inside the prediction set, while red points are outside the prediction set. While 95 % of the points are inside the prediction set on average in both plots, the plot on the left only provides marginal coverage, while the plot on the right provides both marginal and conditional coverage.

Example 12.7. Example of the limitations of the coverage guarantees provided by conformal prediction.

Suppose we perform conformal prediction on a model that predicts the location of an aircraft from runway images. We use calibration data in which 95 % of the images are taken during the day, and the remaining 5 % are taken at night. If we use conformal prediction to produce prediction sets with 95 % coverage, the prediction sets will have a coverage of at least 95 % on average. However, it is possible that the prediction sets for nighttime images have a coverage of 0 %, while the prediction sets for daytime images have a coverage of 100 %. In this case, if the aircraft is operating at night, the prediction sets will be inaccurate, resulting in potentially dangerous behavior. Therefore, it is important to consider the conditional coverage of the prediction sets when using conformal prediction.


12.2.4 Model Uncertainty


Model uncertainty arises from a lack of knowledge about the behavior of the system, which may occur when the system is operating outside its ODD. In these scenarios, we do not have data on the system's behavior, and we therefore cannot expect data-driven output uncertainty estimates to be accurate (figure 12.19). For this reason, we need other techniques to estimate model uncertainty.

Figure 12.19. We cannot expect data-driven uncertainty estimates to be calibrated in regions of the input space where we do not have data. For example, it is possible that the data we were missing (purple) when we trained the model in figure 12.13 lies well outside the 2σ region of the model's predictions.

One approach to estimating model uncertainty is to use a Bayesian approach in which we maintain a distribution over possible models. The key insight behind this approach is that there are many possible models that could have generated the data (figure 12.20), and we should account for this uncertainty when making predictions. We represent the distribution over possible models as p(θ | D), where θ are the parameters of the model and D is the data.
Given a new input, we can compute a distribution over the output using a process known as Bayesian model averaging. Bayesian model averaging uses the following equation to make predictions:

    p(y | x, D) = ∫ p(y | x, θ) p(θ | D) dθ        (12.5)

where p(y | x, θ) is the distribution over the prediction given the input and a specific instantiation of the parameters of the model. Intuitively, this equation computes the distribution over the output by averaging the predictions of all possible models weighted by the probability of each model given the data.

Figure 12.20. There are many possible models that could have generated the black data points. Each line represents a different model.

In general, equation (12.5) is intractable to compute because it requires integrating over the entire parameter space. However, we can use a variety of techniques to approximate this integral.¹⁵ One approach is to use MCMC to sample from


the posterior distribution over the parameters of the model p(θ | D) (see sections 2.3.2 and 6.3). We can then use these samples to approximate the integral in equation (12.5). One drawback of this approach is that it requires running MCMC to make predictions at runtime, which can be computationally expensive.

¹⁵ A. G. Wilson and P. Izmailov, “Bayesian Deep Learning and a Probabilistic Perspective of Generalization,” Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 4697–4708, 2020.

Figure 12.21. An example in which all models in the ensemble converge to the same local minimum. In this case, we will underestimate our uncertainty.

Another approach to approximating the integral in equation (12.5) is to create an ensemble consisting of a set of models M that all have high likelihood according to p(θ | D).¹⁶ One way to create these models is to train the same model multiple times with different initializations. We then approximate the integral as an equally weighted mixture of the predictions of each model as follows:

    p(y | x, D) ≈ (1/|M|) ∑_{θ∈M} p(y | x, θ)        (12.6)

¹⁶ B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles,” Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017.
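A minimal sketch of this mixture for discrete outputs (the ensemble is assumed to be a collection of models that each map an input to a probability vector):

# equally weighted Bayesian model averaging over an ensemble (equation (12.6))
ensemble_predict(models, x) = sum(model(x) for model in models) / length(models)

# flag the input if the averaged prediction has high entropy (assumed threshold)
is_uncertain(models, x; threshold=1.0) =
    -sum(p * log(p) for p in ensemble_predict(models, x) if p > 0) > threshold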

Figure 12.23 shows an example of an ensemble trained to predict the actions of an agent in the continuum world using the data in figure 12.2.

It is important to ensure that the models in the ensemble have sufficient diversity. By starting from different initializations, we encourage each model to find a different local minimum in the loss function (figure 12.22). However, this property is not guaranteed. It is possible that all models in the ensemble will still converge to the same local minimum, which results in overconfident uncertainty estimates (figure 12.21). Therefore, it may be necessary to incorporate other heuristics into the training to ensure that the models in the ensemble are diverse.¹⁷

Figure 12.22. By training models with different initializations for θ, we hope to arrive at different local minima in the loss function. The colored points represent local minima for each model in figure 12.20.

¹⁷ V. Dwaracherla, Z. Wen, I. Osband, X. Lu, S. M. Asghari, and B. Van Roy, “Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping,” Transactions on Machine Learning Research, 2022.

12.3 Failure Monitoring

Even when a system is operating with low uncertainty within its ODD, it may still end up in situations that lead to failure. Therefore, it is important to monitor

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-12-23 15:12:43-05:00, comments to [email protected]
310 chapter 12. runtime monitoring

Figure 12.23. Ensemble trained to


predict the actions of the contin-
uum world agent using the data
in figure 12.2. Brigher colors indi-
cate higher entropy (and therefore
high uncertainty) in the ensemble
output. The bar plots show the dis-
tribution over actions for three dif-
ferent inputs. When the input is
in the ODD (purple), the ensem-
up down left right ble is confident in a single action.
When the input is outside the ODD
(blue), the ensemble assigns equal
probability to all actions. The white
point is on the decision boundary
between the up and right actions,
so it assigns roughly equal proba-
bility to both actions and low prob-
ability to the others.

up down left right

up down left right

© 2024 Kochenderfer, Katz, Corso, and Moss shared under a Creative Commons CC-BY-NC-ND license.
2024-12-23 15:12:43-05:00, comments to [email protected]
12.3. failure monitoring 311

Figure 12.24. A simple failure mon-


itoring system for the aircraft colli-
Warning! sion avoidance problem that issues
a warning if the aircraft (black) en-
ters a buffer region (red ellipse)
around the intruder aircraft (red).

systems for potential dangerous situations within their ODD. A simple approach
to failure monitoring is to create a set of heuristic rules or properties that describe
dangerous scenarios using information that can be monitored at runtime. For
example, in an aircraft collision avoidance scenarios, we can monitor the distance
between the two aircraft and issue a warning if the distance falls below a certain
threshold (figure 12.24). We typically want to set this threshold to be conservative
so that we have time to take corrective actions mitigate the likelihood of a potential
failure.
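A minimal sketch of the distance-threshold rule described above (the state fields and the threshold value are assumptions for illustration):

using LinearAlgebra

# warn if the separation between ownship and intruder drops below a threshold;
# the state s is assumed to contain the positions of the two aircraft
struct DistanceMonitor
    threshold # minimum allowed separation, an assumed tuning parameter
end
monitor(alg::DistanceMonitor, s) = norm(s.ownship - s.intruder) ≥ alg.threshold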
In addition to heuristic rules, we can make predictions about whether a particular situation is likely to lead to failure by performing some additional computation at runtime. Specifically, we can use a model of the system to run validation algorithms online during deployment. For example, we could use one of the reachability algorithms discussed in chapters 8 to 10 to determine whether failure states are reachable from the current state (first row of figure 12.25). If the specification for the system can be written as an LTL formula, we can use the techniques discussed in section 3.6 to convert the specification into a reachability specification using an automaton. We can then monitor the system at runtime by traversing the automaton.¹⁸

¹⁸ A. Bauer, M. Leucker, and C. Schallhart, “Runtime Verification for LTL and TLTL,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 20, no. 4, pp. 1–64, 2011.

Monitors that use reachability analysis may be overly conservative, and we may instead only want to flag situations that have a significant probability of leading to failure. In this case, we can use the techniques discussed in chapter 7 to compute the probability of reaching a failure state from the current state. We can then use this probability to determine whether to issue a warning. The second row of figure 12.25 shows an example of a failure monitoring system that uses a probabilistic model to predict the likelihood of failure at each time step.
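A minimal sketch of such a probability-based monitor using Monte Carlo rollouts from the current state (the rollout and failure-check functions are placeholders for the system model and specification):

using Statistics

struct FailureProbabilityMonitor
    rollout    # s -> simulated trajectory from state s (placeholder)
    isfailure  # trajectory -> true if the specification is violated (placeholder)
    n          # number of rollouts
    threshold  # failure probability threshold
end

function monitor(alg::FailureProbabilityMonitor, s)
    pfail = mean(alg.isfailure(alg.rollout(s)) for _ in 1:alg.n)
    return pfail < alg.threshold   # true if the current state is considered safe
end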


Figure 12.25. Online estimation of the reachable set and the probability of failure for the aircraft collision avoidance problem at four time steps (with estimated failure probabilities of approximately 0.0, 0.08, 0.04, and 0.6). The reachable set estimation (top row) assumes a maximum acceleration and disturbance magnitude, and the online probability of failure estimation (second row) uses 50 rollouts for algorithm 7.1. The third row shows the result of computing failure probabilities over the entire state space offline and looking up the results during operation. Brighter red indicates a higher failure probability. Using only the reachable sets to check safety may be overly conservative. We may instead only want to issue a warning if the probability of failure exceeds a threshold. The axes show altitude h (m) versus time to collision tcol (s).

In addition to failure probabilities, we can also predict other quantities of interest, such as the distribution over STL robustness values.

When we run validation algorithms offline, we often need to run them over the entire state space of the system because we do not know which states the system will encounter at runtime. In contrast, when we run validation algorithms online, we can focus only on the states that the system actually reaches. This property of online validation often allows us to save computation. However, running validation algorithms online can still be computationally expensive.
One way to further reduce computational expense is to store the results of offline validation algorithms and query them at runtime. For example, we could compute the probability of failure from all states in the state space offline and store the results in a lookup table. At runtime, we can then query the lookup table to determine the probability of failure for the current state. The third row of figure 12.25 shows an example of a failure monitoring system that uses a lookup table to determine the probability of failure at each time step. For systems that have memory limitations, we can use a variety of compression techniques to store the results in a more compact form. For example, we could train a neural network to approximate the failure probability from any state in the state space.
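A minimal sketch of such a lookup-table monitor is shown below. The grid of representative states and their failure probabilities would be computed offline; the nearest-neighbor query and the threshold are assumptions for illustration.

using LinearAlgebra

# Lookup-table failure monitor (illustrative sketch).
struct LookupMonitor
    states    # vector of representative states computed offline
    pfail     # failure probability associated with each representative state
    threshold # warning threshold
end
function (monitor::LookupMonitor)(s)
    i = argmin([norm(s - sg) for sg in monitor.states])  # nearest precomputed state
    return monitor.pfail[i] > monitor.threshold          # issue a warning?
end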

12.4 Summary

• Runtime monitors allow us to monitor the safety of a system during deployment


and take corrective action to avoid unsafe situations.

• We can use runtime monitors to check whether the assumptions we used


during offline validation are still valid at runtime by checking whether the
system is operating within its operational design domain.

• We can represent the operational design domain using a variety of set rep-
resentations such as sets defined by nearest neighbors, convex hulls, or level
sets.

• Output uncertainty and model uncertainty are two types of uncertainty we


should monitor at runtime.

• Output uncertainty can be learned from data by training models to output


uncertainty measures.

• Calibration techniques can be used to ensure that the uncertainty measures


output by the model are accurate.

• Model uncertainty can be represented using an ensemble of models.

• Failure monitoring systems can be implemented using heuristic rules, reacha-


bility analysis, or probabilistic models.

Appendices
A Systems

This appendix summarizes some of the systems used as examples in this book.

A.1 Default Implementations

Each component of the system must take in both its typical inputs as well as a
disturbance. For components that do not use the disturbance, the disturbance is
ignored using the default implementation in algorithm A.1.

Algorithm A.1. Default implementation for components that do not use the disturbance.

(env::Environment)(s, a, x) = env(s, a)
(sensor::Sensor)(s, x) = sensor(s)
(agent::Agent)(o, x) = agent(o)

Each component must also have a disturbance distribution that takes in the
necessary inputs and returns a distribution over disturbances. Algorithm A.2
provides a default implementation for components that do not use the disturbance.
It returns a distribution object that specifies that the component is deterministic.
The environment component for each system must also have a function that
returns the default distribution over initial states.

Algorithm A.2. Default implementation of the disturbance distribution for components that do not use the disturbance.

Ds(env::Environment, s, a) = Deterministic()
Da(agent::Agent, o) = Deterministic()
Do(sensor::Sensor, s) = Deterministic()

A.2 Simple Gaussian System

The simple Gaussian system consists of an environment with a single state variable that is sampled from a Gaussian distribution with mean 0 and standard deviation 1. After sampling an initial state, the system will remain in that state for all time regardless of the action. In other words, the system has no agent, and the state is fully observable. Algorithm A.3 defines each component of the system. A typical specification (chapter 3) for the simple Gaussian system is

ψ = □(s > γ)   (A.1)

which requires that the state be above a specified threshold γ.

Figure A.1. The simple Gaussian system with failure threshold γ = −2. If the sampled state is below the threshold, the system fails.

Algorithm A.3. Components of the simple Gaussian system. It samples an initial state and stays in that state for all time. The system has no agent, and the state is fully observable.

struct SimpleGaussian <: Environment end
(env::SimpleGaussian)(s, a) = s
Ps(env::SimpleGaussian) = Normal()

struct NoAgent <: Agent end
(c::NoAgent)(s) = nothing

struct IdealSensor <: Sensor end
(sensor::IdealSensor)(s) = s
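As a quick illustration (not one of the book's algorithms), we can estimate the failure probability of the simple Gaussian system with γ = −2 by direct sampling:

using Distributions, Statistics

γ = -2.0
samples = rand(Normal(), 100_000)   # initial states of the simple Gaussian system
mean(samples .< γ)                  # ≈ cdf(Normal(), -2) ≈ 0.0228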

A.3 Multivariate Gaussian System

The multivariate Gaussian system is an extension of the simple Gaussian system in which the state has two dimensions. The state is sampled from a multivariate Gaussian distribution with its mean at the origin and a covariance of the identity matrix. Algorithm A.4 defines the environment for this system. Similar to the simple Gaussian system, the multivariate Gaussian system has no agent, and the state is fully observable. A typical specification for the multivariate Gaussian system is

ψ = □(s1 > γ1 ∧ s2 > γ2)   (A.2)

where s1 and s2 are the first and second components of the state. We can also string together multiple specifications of this form to create multiple possible failure modes (figure A.2).

Figure A.2. Multivariate Gaussian environment with two possible failure modes (red shaded regions). The contours show the log density of the distribution over states with brighter colors indicating higher density.


Algorithm A.4. The environment for the multivariate Gaussian system. It uses the same agent and sensor as the simple Gaussian system.

struct MvGaussian <: Environment end
(env::MvGaussian)(s, a) = s
Ps(env::MvGaussian) = MvNormal(zeros(2), I)
sor as the simple Gaussian system.

A.4 Mass-Spring-Damper System

A common example of a linear system is a mass-spring-damper system, which can be used to model a wide range of physical systems such as a car suspension or a bridge. The state of the system is the position (relative to the resting point) p and velocity v of the mass (s = [p, v]), the action is the force β applied to the mass, and the observation is a noisy measurement of the state. The equations of motion for a mass-spring-damper system are

p′ = p + v∆t
v′ = v + (−(k/m)p − (c/m)v + (1/m)β)∆t

where m is the mass, k is the spring constant, c is the damping coefficient, and ∆t is the discrete time step.

Figure A.3. A mass-spring-damper system with a mass m attached to a wall by a spring with spring constant k and a damper with damping coefficient c. The system is controlled by a force β applied to the mass.

For linear reachability analysis, it is helpful to write the dynamics in the form of equation (8.6) as

T(s, a, xs) = [1 ∆t; −(k/m)∆t 1 − (c/m)∆t] [p; v] + [0; ∆t/m] β + xs = Ts s + Ta a + xs   (A.3)

We control the mass-spring-damper using a proportional controller such that α = Πo⊤ = [0, −1] in equation (8.5) is the gain matrix. Similarly, we model the sensor as an additive noise sensor such that Os in equation (8.4) is the identity matrix and xo is the additive noise. Algorithm A.5 defines the components of the mass-spring-damper system. Typically, trajectories for this system will oscillate back and forth before coming to rest. In general, we want to ensure that the system remains stable, meaning that the position does not exceed some magnitude. We can write this specification in STL as

ψ = □(|p| < γ)   (A.4)

where γ is the maximum position magnitude of the mass. If the noise becomes too large, the system may become unstable.

Figure A.4. Example trajectories of the mass-spring-damper system. The system oscillates back and forth before coming to rest.


Algorithm A.5. Components of the mass-spring-damper system. The environment implements equation (A.3). We implement the initial state distribution Ps to sample the initial position from an interval around the resting point and a small initial velocity. The sensor is an additive noise sensor, which samples noise from Do and adds it to the state. The agent uses a proportional controller, which multiplies a gain matrix by the observation.

@with_kw struct MassSpringDamper <: Environment
    m = 1.0   # mass
    k = 10.0  # spring constant
    c = 2.0   # damping coefficient
    dt = 0.05 # time step
end

Ts(env::MassSpringDamper) = [1 env.dt;
                             -env.k*env.dt/env.m 1-env.c*env.dt/env.m]
Ta(env::MassSpringDamper) = [0 env.dt/env.m]'
function (env::MassSpringDamper)(s, a)
    return Ts(env) * s + Ta(env) * a
end
Ps(env::MassSpringDamper) = Product([Uniform(-0.2, 0.2),
                                     Uniform(-1e-12, 1e-12)])

struct AdditiveNoiseSensor <: Sensor
    Do # noise distribution
end
(sensor::AdditiveNoiseSensor)(s) = sensor(s, rand(Do(sensor, s)))
(sensor::AdditiveNoiseSensor)(s, x) = s + x
Do(sensor::AdditiveNoiseSensor, s) = sensor.Do
Os(sensor::AdditiveNoiseSensor) = I

struct ProportionalController <: Agent
    α # gain matrix (c = α' * o)
end
(c::ProportionalController)(o) = c.α' * o
Πo(agent::ProportionalController) = agent.α'
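The following is a hedged usage sketch of these components (it assumes the definitions above along with Distributions.jl and Parameters.jl for @with_kw). It simulates the environment with zero applied force and checks the position specification for an illustrative threshold:

env = MassSpringDamper()
function simulate(env, s; steps=100)
    positions = Float64[]
    for t in 1:steps
        s = env(s, [0.0])         # zero force applied to the mass
        push!(positions, s[1])
    end
    return positions
end
positions = simulate(env, rand(Ps(env)))
all(abs.(positions) .< 0.3)       # checks ψ = □(|p| < γ) with γ = 0.3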


A.5 Inverted Pendulum System

The inverted pendulum system is a classic nonlinear system in which we balance a pendulum by applying torques at its base. The state of the system is the angle θ of the pendulum and its angular velocity ω, the action is the torque applied, and the observation is a noisy measurement of the state. The next state for the inverted pendulum system is calculated as follows:

ω′ = ω + (3g/(2ℓ) sin θ + 3a/(mℓ²)) ∆t
θ′ = θ + ω′∆t   (A.5)

where m is the mass of the pendulum, ℓ is the length of the pendulum, g is the acceleration due to gravity, and ∆t is the discrete time step. Algorithm A.6 implements these dynamics. The magnitude of the angular velocity of the pendulum is limited to ωmax and the torque applied is limited to amax. The initial state for the pendulum is sampled uniformly from angles near upright with a small angular velocity.

Figure A.5. State, action, and observation for the inverted pendulum system (for example, s = [0.2, −0.2], o = [0.3, −0.1], a = −4.1). The goal is to apply a torque at each time step to balance the pendulum upright. The observation is a noisy measurement of the state.

Algorithm A.6. Environment for the inverted pendulum system. The initial state distribution Ps samples the initial angle from an interval near upright and the angular velocity from an interval near zero.

@with_kw struct InvertedPendulum <: Environment
    m::Float64 = 1.0     # mass of the pendulum
    l::Float64 = 1.0     # length of the pendulum
    g::Float64 = 10.0    # acceleration due to gravity
    dt::Float64 = 0.05   # time step
    ω_max::Float64 = 8.0 # maximum angular velocity
    a_max::Float64 = 2.0 # maximum torque
end

function (env::InvertedPendulum)(s, a)
    θ, ω = s[1], s[2]
    dt, g, m, l = env.dt, env.g, env.m, env.l
    a = clamp(a, -env.a_max, env.a_max)
    ω = ω + (3g / (2 * l) * sin(θ) + 3 * a / (m * l^2)) * dt
    θ = θ + ω * dt
    ω = clamp(ω, -env.ω_max, env.ω_max)
    return [θ, ω]
end
Ps(env::InvertedPendulum) = Product([Uniform(-π / 16, π / 16),
                                     Uniform(-1.0, 1.0)])

Throughout the book, we use sensor noise as the main source of randomness in the inverted pendulum system. Therefore, the inverted pendulum system uses the same additive noise sensor as the mass-spring-damper system in algorithm A.5. The pendulum also uses the proportional controller from algorithm A.5 with α ∝ [−15, −8]. The specification for the pendulum system requires that the angle of the pendulum remain within a specified range and is written in STL as

ψ = □(|θ| < π/4)   (A.6)

Figure A.6 shows an example plot of both a success and a failure trajectory for the inverted pendulum system.

Figure A.6. Example of a success (green) and failure (red) trajectory for the inverted pendulum system. The failure trajectory enters the red failure region when its angle θ exceeds π/4.
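The sketch below rolls out the closed-loop pendulum and checks this specification. It assumes the definitions above along with Distributions.jl and LinearAlgebra; the noise covariance and rollout depth are illustrative choices, not values from the book.

env = InvertedPendulum()
sensor = AdditiveNoiseSensor(MvNormal(zeros(2), 0.01 * I(2)))
agent = ProportionalController([-15.0, -8.0])
function rollout_angles(env, sensor, agent, s; steps=41)
    θs = Float64[]
    for t in 1:steps
        s = env(s, agent(sensor(s)))  # observe, select a torque, and step the dynamics
        push!(θs, s[1])
    end
    return θs
end
θs = rollout_angles(env, sensor, agent, rand(Ps(env)))
all(abs.(θs) .< π/4)                  # true for a success trajectory, false for a failure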
Time (s)

A.6 Grid World System

The grid world system is an example of a system with discrete states, actions, and disturbances. The state of the system is a two-dimensional position on a grid, and the action is a movement in one of four cardinal directions. Most of the time, the agent will move in the direction specified by the action, but with some probability, it will slip and move in a random direction. If the agent tries to move in a direction outside the grid, it will remain in its current cell. For example, if the agent moves to the left when it is in a cell on the leftmost side of the grid, it will remain in its current cell.
Algorithm A.7 implements the agent and environment for the grid world system. The policy of the agent is a lookup table that maps states to actions. Figure A.7 shows the action taken in each state for an agent that is trained to reach the green goal while avoiding the red obstacle. The system is fully observable and therefore uses the ideal sensor from algorithm A.3. The specification for the grid world system requires that the agent avoid the obstacle states and reach a goal state. We can write the specification in LTL as

ψ = ♦G(st) ∧ □¬F(st)   (A.7)

where G(st) returns true if st is a goal state and F(st) returns true if st is an obstacle state. Figure A.8 shows an example of the grid world system with an obstacle state in the center of the grid and a goal state near the upper right corner.

Figure A.7. Policy for the grid world agent.

Figure A.8. Example of a success (green) and failure (red) trajectory for the grid world system. The failure trajectory enters the cell with the obstacle.
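As a small illustration of the slip model (assuming the definitions in algorithm A.7 and Distributions.jl are loaded), the disturbance distribution for a commanded direction places probability tprob on that direction and spreads the remainder over the others:

env = GridWorld()
Ds(env, [1, 1], 1)       # Categorical([0.7, 0.1, 0.1, 0.1]): moves up with probability 0.7
x = rand(Ds(env, [1, 1], 1))
env([1, 1], 1, x)        # apply the sampled direction to the current cell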

A.7 Continuum World System

The continuum world system is an extension of the grid world system with continuous states and disturbances.


Algorithm A.7. Environment and agent for the grid world system. The actions correspond to indices in the directions field of the environment. The agent will move in the direction specified by the action with probability tprob and slip with probability 1 − tprob. The agent uses a vector of actions called policy and a function that maps states to their corresponding indices in the policy vector.

@with_kw struct GridWorld <: Environment
    size = (10, 10)                          # dimensions of the grid
    terminal_states = [[5,5],[7,8]]          # goal and obstacle states
    directions = [[0,1],[0,-1],[-1,0],[1,0]] # up, down, left, right
    tprob = 0.7                              # probability do not slip
end

function Ds(env::GridWorld, s, a)
    slip_prob = (1 - env.tprob) / (length(env.directions) - 1)
    probs = fill(slip_prob, length(env.directions))
    probs[a] = env.tprob
    return Categorical(probs)
end
(env::GridWorld)(s, a) = env(s, a, rand(Ds(env, s, a)))
function (env::GridWorld)(s, a, x)
    if s in env.terminal_states
        return s
    else
        dir = env.directions[x]
        return clamp.(s .+ dir, [1, 1], env.size)
    end
end
Ps(env::GridWorld) = SetCategorical([[1, 1]])

struct DiscreteAgent <: Agent
    policy # dictionary mapping states to actions
end
(c::DiscreteAgent)(o) = c.policy[o]


The action is still selected as one of the cardinal directions; however, instead of slipping in one of the other cardinal directions, the agent slips in a random direction on the unit circle with higher probability of slipping in directions close to the desired direction. Specifically, we model this process by first adding a random vector x sampled from a multivariate Gaussian distribution to the desired direction and then normalizing the result to have a magnitude of 1.

The agent for the continuum world problem maps continuous states to discrete actions by interpolating a policy defined on a grid of discrete points. Specifically, each state in the grid corresponds to a set of values that represent the expected future return when taking each action from the state.1 Given a new state that is not part of the grid, the agent uses multilinear interpolation to estimate the expected future return for each action. It then takes the action with the highest expected return. Figure A.9 shows the resulting policy for an agent trained to reach the green goal while avoiding the red obstacle. Algorithm A.8 defines the agent and environment for the continuum world problem.

1. These values make up the state-action value function. More details are provided by M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for Decision Making. MIT Press, 2022.

Figure A.9. Policy for the continuum world agent.

Algorithm A.8. Environment and agent for the continuum world system. Instead of slipping in one of the cardinal directions, the agent slips in a random direction on the unit circle. The agent interpolates on a grid of discrete states using the GridInterpolations.jl package to determine the expected future return for each action. It then takes the action with the highest expected return.

@with_kw struct ContinuumWorld <: Environment
    size = [10, 10]                          # dimensions
    terminal_centers = [[4.5,4.5],[6.5,7.5]] # obstacle and goal centers
    terminal_radii = [0.5, 0.5]              # radius of obstacle and goal
    directions = [[0,1],[0,-1],[-1,0],[1,0]] # up, down, left, right
    Σ = 0.5 * I(2)
end

Ds(env::ContinuumWorld, s, a) = MvNormal(zeros(2), env.Σ)
(env::ContinuumWorld)(s, a) = env(s, a, rand(Ds(env, s, a)))
function (env::ContinuumWorld)(s, a, x)
    is_terminal = [norm(s .- c) ≤ r
                   for (c, r) in zip(env.terminal_centers, env.terminal_radii)]
    if any(is_terminal)
        return s
    else
        dir = normalize(env.directions[a] .+ x)
        return clamp.(s .+ dir, [0, 0], env.size)
    end
end
Ps(env::ContinuumWorld) = SetCategorical([[0.5, 0.5]])

struct InterpAgent <: Agent
    grid # grid of discrete states using GridInterpolations.jl
    Q    # corresponding state-action values
end
(c::InterpAgent)(s) = argmax(interpolate(c.grid, q, s) for q in c.Q)


The continuum world agent uses the same specification as the grid world system, with continuous states instead of discrete states. The obstacle and goal for the continuum world system are represented as balls.

Figure A.10. Example of a success (green) and failure (red) trajectory for the continuum world system. The failure trajectory enters the circle with the obstacle.

A.8 Aircraft Collision Avoidance System

The aircraft collision avoidance system involves issuing climb or descend advisories to an aircraft to avoid an intruder aircraft.2 There are three actions corresponding to no advisory, commanding a 5 m/s descent, and commanding a 5 m/s climb. The intruder is approaching us head on, with a constant horizontal closing speed. The state is specified by the altitude h of our aircraft measured relative to the intruder aircraft, our vertical rate ḣ measured relative to the intruder aircraft, the previous action aprev, and the time to potential collision tcol. Figure A.11 illustrates the problem scenario.

2. This formulation is a highly simplified version of the problem described by M. J. Kochenderfer and J. P. Chryssanthacopoulos, “Robust Airborne Collision Avoidance Through Dynamic Programming,” Massachusetts Institute of Technology, Lincoln Laboratory, Project Report ATC-371, 2011.

Figure A.11. State variables for the aircraft collision avoidance system.

Given action a, the state variables are updated as follows:

h′ = h + ḣ∆t   (A.8)
ḣ′ = ḣ + (ḧ + xs)   (A.9)
a′prev = a   (A.10)
t′col = tcol − ∆t   (A.11)

where ∆t = 1 s and xs is noise added to the relative vertical rate to account for variations in intruder behavior. The value ḧ is given by

ḧ = 0 if a = no advisory; ḧ = a/∆t if |a − ḣ|/∆t < ḧlimit; and ḧ = sign(a − ḣ)ḧlimit otherwise   (A.12)

where ḧlimit is the maximum allowable vertical acceleration.

The agent for the aircraft collision avoidance problem is similar to the continuum world agent in that it interpolates on a grid of discrete states to determine the action to take. We determine that the aircraft have nearly collided if their relative altitude is less than 50 m when the time to collision is zero. A time to collision of zero occurs at time step 41, resulting in the following STL specification:

ψ = □[40,41](|h| > 50)   (A.13)


Figure A.12. Aircraft collision avoidance policy when the relative vertical rate is fixed at 0 m/s (left) and 4 m/s (right), and the previous action is fixed at no advisory. The colors represent the action taken by the agent (no advisory, descend, or climb) in each state. The red aircraft represents the location of the intruder aircraft.

Algorithm A.9. Collision avoidance environment. The initial state distribution Ps samples the altitude h from a uniform distribution over [−100, 100] and the vertical rate ḣ from a uniform distribution over [−10, 10]. The previous action is fixed at no advisory, and we always begin with 40 seconds until a potential collision.

@with_kw struct CollisionAvoidance <: Environment
    ddh_max::Float64 = 1.0                # maximum vertical acceleration
    𝒜::Vector{Float64} = [-5.0, 0.0, 5.0] # vertical rate commands
    Ds::Sampleable = Normal()             # vertical rate noise
end

Ds(env::CollisionAvoidance, s, a) = env.Ds
(env::CollisionAvoidance)(s, a) = env(s, a, rand(Ds(env, s, a)))
function (env::CollisionAvoidance)(s, a, x)
    a = env.𝒜[a]
    h, dh, a_prev, τ = s
    h = h + dh
    if a != 0.0
        if abs(a - dh) < env.ddh_max
            dh += a
        else
            dh += sign(a - dh) * env.ddh_max
        end
    end
    a_prev = a
    τ = max(τ - 1.0, -1.0)
    return [h, dh + x, a_prev, τ]
end
Ps(env::CollisionAvoidance) = product_distribution(Uniform(-100, 100),
                                                   Uniform(-10, 10),
                                                   DiscreteNonParametric([0], [1.0]),
                                                   DiscreteNonParametric([40], [1.0]))
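A hedged usage sketch of this environment (assuming the definitions above and Distributions.jl): roll out 40 steps with no advisory and check the relative altitude when the time to collision reaches zero.

env = CollisionAvoidance()
function final_state(env, s; steps=40)
    for t in 1:steps
        s = env(s, 2)     # action index 2 selects 0.0 m/s, i.e., no advisory
    end
    return s
end
s_end = final_state(env, rand(Ps(env)))
abs(s_end[1]) > 50        # specification (A.13) requires this to hold at tcol = 0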

B Mathematical Concepts

This appendix provides a brief overview of some of the mathematical concepts


used in this book.

B.1 Measure Spaces

Before introducing the definition of a measure space, we will first discuss the
notion of a sigma-algebra over a set Ω. A sigma-algebra is a collection Σ of subsets
of Ω such that

1. Ω ∈ Σ.

2. If E ∈ Σ, then Ω \ E ∈ Σ (closed under complementation).

3. If E1 , E2 , E3 , . . . ∈ Σ, then E1 ∪ E2 ∪ E3 . . . ∈ Σ (closed under countable unions).

An element E ∈ Σ is called a measurable set.


A measure space is defined by a set Ω, a sigma-algebra Σ, and a measure µ : Σ → R ∪ {∞}. For µ to be a measure, the following properties must hold:

1. If E ∈ Σ, then µ( E) ≥ 0 (nonnegativity).

2. µ(∅) = 0.

3. If E1 , E2 , E3 , . . . ∈ Σ are pairwise disjoint, then µ( E1 ∪ E2 ∪ E3 . . .) = µ( E1 ) +


µ( E2 ) + µ( E3 ) + · · · (countable additivity).

B.2 Probability Spaces

A probability space is a measure space (Ω, Σ, µ) with the requirement that µ(Ω) = 1. In the context of probability spaces, Ω is called the sample space, Σ is called the event space, and µ (or, more commonly, P) is the probability measure. The probability axioms1 refer to the nonnegativity and countable additivity properties of measure spaces, together with the requirement that µ(Ω) = 1.

1. These axioms are sometimes called the Kolmogorov axioms. A. Kolmogorov, Foundations of the Theory of Probability, 2nd ed. Chelsea, 1956.

B.3 Metric Spaces

A set X with a metric is called a metric space. A metric d, sometimes called a distance metric, is a function that maps pairs of elements in X to nonnegative real numbers such that for all x, y, z ∈ X:2

1. d(x, y) = 0 if and only if x = y (identity of indiscernibles).

2. d(x, y) = d(y, x) (symmetry).

3. d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality).

2. In section 3.2, we use a less precise definition of metric to refer to any function that maps system behavior to a real value.

B.4 Normed Vector Spaces

A normed vector space consists of a vector space X and a norm ‖·‖ that maps elements of X to nonnegative real numbers such that for all scalars α and vectors x, y ∈ X:

1. ‖x‖ = 0 if and only if x = 0.

2. ‖αx‖ = |α|‖x‖ (absolutely homogeneous).

3. ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality).

The Lp norms are a commonly used set of norms parameterized by a scalar p ≥ 1. The Lp norm of vector x is

‖x‖p = lim_{ρ→p} (|x1|^ρ + |x2|^ρ + · · · + |xn|^ρ)^{1/ρ}   (B.1)

where the limit is necessary for defining the infinity norm, L∞. Several Lp norms are shown in figure B.1.
Norms can be used to induce distance metrics in vector spaces by defining the metric d(x, y) = ‖x − y‖. We can then, for example, use an Lp norm to define distances.
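For example, the norm function from Julia's LinearAlgebra standard library computes Lp norms, which can be used as distance metrics:

using LinearAlgebra

x = [1.0, -2.0, 3.0]
y = [0.0, 1.0, 1.0]
norm(x - y, 1)     # L1 (taxicab) distance
norm(x - y)        # L2 (Euclidean) distance, the default
norm(x - y, Inf)   # L∞ (max) distance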


Figure B.1. Common Lp norms. The illustrations show the shape of the norm contours in two dimensions. All points on the contour are equidistant from the origin under that norm.

L1: ‖x‖1 = |x1| + |x2| + · · · + |xn|. This metric is often referred to as the taxicab norm.

L2: ‖x‖2 = √(x1² + x2² + · · · + xn²). This metric is often referred to as the Euclidean norm.

L∞: ‖x‖∞ = max(|x1|, |x2|, · · · , |xn|). This metric is often referred to as the max norm, Chebyshev norm, or chessboard norm. The latter name comes from the minimum number of moves that a king needs to move between two squares in chess.


B.5 Positive Definiteness

A symmetric matrix A is positive definite if x⊤Ax is positive for all points other than the origin. In other words, x⊤Ax > 0 for all x ≠ 0. A symmetric matrix A is positive semidefinite if x⊤Ax is always nonnegative. In other words, x⊤Ax ≥ 0 for all x.

B.6 Information Content

If we have a discrete distribution that assigns probability P(x) to value x, the information content3 of observing x is given by

I(x) = − log P(x)   (B.2)

The unit of information content depends on the base of the logarithm. We generally assume natural logarithms (with base e), making the unit nat, which is short for natural. In information theoretic contexts, the base is often 2, making the unit bit. We can think of this quantity as the number of bits required to transmit the value x according to an optimal message encoding when the distribution over messages follows the specified distribution.

3. Sometimes information content is referred to as Shannon information, in honor of Claude Shannon, the founder of the field of information theory. C. E. Shannon, “A Mathematical Theory of Communication,” Bell System Technical Journal, vol. 27, no. 4, pp. 623–656, 1948.

B.7 Entropy

Entropy is an information theoretic measure of uncertainty. The entropy associated with a discrete random variable X is the expected information content:

H(X) = E_x[I(x)] = ∑_x P(x)I(x) = −∑_x P(x) log P(x)   (B.3)

where P(x) is the mass assigned to x. For a continuous distribution where p(x) is the density assigned to x, the differential entropy (also known as continuous entropy) is defined to be

h(X) = ∫ p(x)I(x) dx = −∫ p(x) log p(x) dx   (B.4)


B.8 Cross Entropy

The cross entropy of one distribution relative to another can be defined in terms of expected information content. If we have one discrete distribution with mass function P(x) and another with mass function Q(x), then the cross entropy of P relative to Q is given by

H(P, Q) = −E_{x∼P}[log Q(x)] = −∑_x P(x) log Q(x)   (B.5)

For continuous distributions with density functions p(x) and q(x), we have

H(p, q) = −∫ p(x) log q(x) dx   (B.6)

B.9 Relative Entropy

Relative entropy, also called the Kullback-Leibler (KL) divergence, is a measure of how one probability distribution is different from a reference distribution.4 If P(x) and Q(x) are mass functions, then the KL divergence from Q to P is the expectation of the logarithmic differences, with the expectation using P:

D_KL(P ‖ Q) = ∑_x P(x) log (P(x)/Q(x)) = −∑_x P(x) log (Q(x)/P(x))   (B.7)

This quantity is defined only if the support of P is a subset of that of Q. The summation is over the support of P to avoid division by zero.

For continuous distributions with density functions p(x) and q(x), we have

D_KL(p ‖ q) = ∫ p(x) log (p(x)/q(x)) dx = −∫ p(x) log (q(x)/p(x)) dx   (B.8)

Similarly, this quantity is defined only if the support of p is a subset of that of q. The integral is over the support of p to avoid division by zero.

4. Named for the two American mathematicians who introduced this measure, Solomon Kullback (1907–1994) and Richard A. Leibler (1914–2003). S. Kullback and R. A. Leibler, “On Information and Sufficiency,” Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951. S. Kullback, Information Theory and Statistics. Wiley, 1959.
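The following short sketch evaluates equations (B.3), (B.5), and (B.7) for discrete distributions represented as probability vectors (the variable names are ours):

P = [0.5, 0.25, 0.25]
Q = [0.4, 0.4, 0.2]
entropy_P = -sum(p * log(p) for p in P if p > 0)                     # equation (B.3)
cross_entropy = -sum(p * log(q) for (p, q) in zip(P, Q))             # equation (B.5)
kl_divergence = sum(p * log(p / q) for (p, q) in zip(P, Q) if p > 0) # equation (B.7)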

B.10 Taylor Expansion

The Taylor expansion,5 also called the Taylor series, of a function is important to many approximations used in this book. From the first fundamental theorem of calculus,6 we know that

f(x + h) = f(x) + ∫_0^h f′(x + a) da   (B.9)

5. Named for the English mathematician Brook Taylor (1685–1731) who introduced the concept.

6. The first fundamental theorem of calculus relates a function to the integral of its derivative: f(b) − f(a) = ∫_a^b f′(x) dx

Nesting this definition produces the Taylor expansion of f about x:

f(x + h) = f(x) + ∫_0^h ( f′(x) + ∫_0^a f″(x + b) db ) da   (B.10)
= f(x) + f′(x)h + ∫_0^h ∫_0^a f″(x + b) db da   (B.11)
= f(x) + f′(x)h + ∫_0^h ∫_0^a ( f″(x) + ∫_0^b f‴(x + c) dc ) db da   (B.12)
= f(x) + f′(x)h + (f″(x)/2!)h² + ∫_0^h ∫_0^a ∫_0^b f‴(x + c) dc db da   (B.13)
⋮   (B.14)
= f(x) + (f′(x)/1!)h + (f″(x)/2!)h² + (f‴(x)/3!)h³ + · · ·   (B.15)
= ∑_{n=0}^∞ (f⁽ⁿ⁾(x)/n!) hⁿ   (B.16)

In the formulation given here, x is typically fixed and the function is evaluated in terms of h. It is often more convenient to write the Taylor expansion of f(x) about a point a such that it remains a function of x:

f(x) = ∑_{n=0}^∞ (f⁽ⁿ⁾(a)/n!) (x − a)ⁿ   (B.17)

The Taylor expansion represents a function as an infinite sum of polynomial terms based on repeated derivatives at a single point. Any analytic function can be represented by its Taylor expansion within a local neighborhood.

A function can be locally approximated by using the first few terms of the Taylor expansion. Figure B.2 shows increasingly better approximations for cos(x) about x = 1. Including more terms increases the accuracy of the local approximation, but error still accumulates as one moves away from the expansion point.

A linear Taylor approximation uses the first two terms of the Taylor expansion:

f(x) ≈ f(a) + f′(a)(x − a)   (B.18)

A quadratic Taylor approximation uses the first three terms:

f(x) ≈ f(a) + f′(a)(x − a) + (1/2) f″(a)(x − a)²   (B.19)

and so on.
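As a small illustration of equations (B.18) and (B.19), the following approximates cos about a = 1 with its linear and quadratic Taylor expansions:

f(x) = cos(x); df(x) = -sin(x); ddf(x) = -cos(x)
a = 1.0
linear(x) = f(a) + df(a) * (x - a)                              # equation (B.18)
quadratic(x) = f(a) + df(a) * (x - a) + ddf(a) * (x - a)^2 / 2  # equation (B.19)
(cos(1.2), linear(1.2), quadratic(1.2))   # accuracy improves with more terms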


Figure B.2. Successive approximations of cos(x) about 1 based on the first n terms of the Taylor expansion (0th through 5th degree).

In multiple dimensions, the Taylor expansion about a generalizes to

f(x) = f(a) + ∇f(a)⊤(x − a) + (1/2)(x − a)⊤∇²f(a)(x − a) + · · ·   (B.20)

The first two terms form the tangent plane at a. The third term incorporates local curvature. This book will use only the first three terms shown here.

C Neural Representations

Neural networks are parametric representations of nonlinear functions.1 The function represented by a neural network is differentiable, allowing gradient-based optimization algorithms such as stochastic gradient descent to optimize their parameters to better approximate desired input-output relationships.2 Neural representations can be helpful in a variety of contexts related to validation, especially for tasks that require expressive, flexible models. We also may want to validate systems with neural network components, as discussed in section 9.7.

1. The name derives from the loose inspiration of networks of neurons in biological brains. We will not discuss these biological connections, but an overview and historical perspective is provided by B. Müller, J. Reinhardt, and M. T. Strickland, Neural Networks. Springer, 1995.

2. This optimization process, when applied to neural networks with many layers, as we will discuss shortly, is often called deep learning. Several textbooks are dedicated entirely to these techniques, including I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. The Julia package Flux.jl provides efficient implementations of various learning algorithms.

A neural network is a differentiable function y = fθ(x) that maps inputs x to produce outputs y and is parameterized by θ. Modern neural networks may have millions of parameters and can be used to convert inputs in the form of high-dimensional images or video into high-dimensional outputs like multidimensional classifications or speech.

The parameters of the network θ are generally tuned to minimize a scalar loss function ℓ(fθ(x), y) that is related to how far the network output is from the desired output. Both the loss function and the neural network are differentiable, allowing us to use the gradient of the loss function with respect to the parameterization, ∇θℓ, to iteratively improve the parameterization. This process is often referred to as neural network training or parameter tuning. It is demonstrated in example C.1.

Neural networks are typically trained on a data set of input-output pairs D. In this case, we tune the parameters to minimize the aggregate loss over the data set:

arg min_θ ∑_{(x,y)∈D} ℓ(fθ(x), y)   (C.1)

Data sets for modern problems tend to be very large, making the gradient of equation (C.1) expensive to evaluate. It is common to sample random subsets of the training data in each iteration, using these batches to compute the loss gradient.

Example C.1. The fundamentals of neural networks and parameter tuning.

Consider a very simple neural network, fθ(x) = θ1 + θ2x. We wish our neural network to take the square footage x of a home and predict its price ypred. We want to minimize the squared deviation between the predicted housing price and the true housing price by the loss function ℓ(ypred, ytrue) = (ypred − ytrue)². Given a training pair, we can compute the gradient:

∇θ ℓ(f(x), ytrue) = ∇θ (θ1 + θ2x − ytrue)² = [2(θ1 + θ2x − ytrue), 2(θ1 + θ2x − ytrue)x]

If our initial parameterization were θ = [10,000, 123] and we had the input-output pair (x = 2,500, ytrue = 360,000), then the loss gradient would be ∇θℓ = [−85,000, −2.125 × 10⁸]. We would take a small step in the opposite direction to improve our function approximation.

In addition to reducing computation, computing gradients with smaller batch sizes introduces some stochasticity to the gradient, which helps training to avoid getting stuck in local minima.
Neural networks are typically constructed to pass the input through a series of layers.3 Networks with many layers are often called deep. In feedforward networks, each layer applies an affine transform, followed by a nonlinear activation function applied elementwise:4

x′ = φ.(Wx + b)   (C.2)

where matrix W and vector b are parameters associated with the layer. A fully connected layer is shown in figure C.1. The dimension of the output layer is different from that of the input layer when W is nonsquare. Figure C.2 shows a more compact depiction of the same network.

3. A sufficiently large, single-layer neural network can, in theory, approximate any function. See A. Pinkus, “Approximation Theory of the MLP Model in Neural Networks,” Acta Numerica, vol. 8, pp. 143–195, 1999.

4. The nonlinearity introduced by the activation function provides something analogous to the activation behavior of biological neurons, in which input buildup eventually causes a neuron to fire. A. L. Hodgkin and A. F. Huxley, “A Quantitative Description of Membrane Current and Its Application to Conduction and Excitation in Nerve,” Journal of Physiology, vol. 117, no. 4, pp. 500–544, 1952.
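A minimal sketch of the layer in equation (C.2), using an elementwise relu activation (the weights, bias, and sizes are arbitrary):

relu(x) = max(zero(x), x)
W = randn(5, 3)               # maps a 3-component input to a 5-component output
b = randn(5)
layer(x) = relu.(W * x .+ b)  # equation (C.2)
layer([1.0, 2.0, 3.0])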


Figure C.1. A fully connected layer with a three-component input and a five-component output:
x′1 = φ(w1,1 x1 + w1,2 x2 + w1,3 x3 + b1)
x′2 = φ(w2,1 x1 + w2,2 x2 + w2,3 x3 + b2)
x′3 = φ(w3,1 x1 + w3,2 x2 + w3,3 x3 + b3)
x′4 = φ(w4,1 x1 + w4,2 x2 + w4,3 x3 + b4)
x′5 = φ(w5,1 x1 + w5,2 x2 + w5,3 x3 + b5)

If there are no activation functions between them, multiple successive affine transformations can be collapsed into a single, equivalent affine transform:

W2(W1x + b1) + b2 = W2W1x + (W2b1 + b2)   (C.3)

These nonlinearities are necessary to allow neural networks to adapt to fit arbitrary target functions. To illustrate, figure C.3 shows the output of a neural network trained to approximate a nonlinear function.

Figure C.2. A more compact depiction of figure C.1 (x ∈ ℝ³ → fully connected + φ → x′ ∈ ℝ⁵). Neural network layers are often represented as blocks or slices for simplicity.

Figure C.3. A deep neural network fit to samples from a nonlinear function so as to minimize the squared error. This neural network has four affine layers, with 10 neurons in each intermediate representation.

There are many types of activation functions that are commonly used. Similar
to their biological inspiration, they tend to be close to zero when their input is
low and large when their input is high. Some common activation functions are
shown in figure C.5.
Sometimes special layers are incorporated to achieve certain effects. For ex-
ample, in figure C.4, we used a softmax layer at the end to force the output to
represent a two-element categorical distribution. The softmax function applies
the exponential function to each element, which ensures that they are positive


Figure C.4. A simple, two-layer, fully connected network trained to classify whether a given coordinate lies within a circle (shown in white). The first layer uses a sigmoid activation and the output layer uses a softmax. The nonlinearities allow neural networks to form complicated, nonlinear decision boundaries.

Figure C.5. Several common activation functions: sigmoid, 1/(1 + exp(−x)); tanh, tanh(x); softplus, log(1 + exp(x)); relu, max(0, x); leaky relu, max(αx, x); and swish, x·sigmoid(x).


and then renormalizes the resulting values:

softmax(x)_i = exp(x_i) / ∑_j exp(x_j)   (C.4)

The outputs of the softmax function can be interpreted as probabilities (section 12.2.1).
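A direct implementation of equation (C.4) is shown below. Subtracting the maximum before exponentiating, a common trick not shown in the equation, avoids numerical overflow without changing the result:

softmax(x) = exp.(x .- maximum(x)) ./ sum(exp.(x .- maximum(x)))
softmax([1.0, 2.0, 3.0])   # elements are positive and sum to 1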
Gradients for neural networks are typically computed using reverse accumulation.5 The method begins with a forward step, in which the neural network is evaluated using all input parameters. In the backward step, the gradient of each term of interest is computed working from the output back to the input. Reverse accumulation uses the chain rule for derivatives:

∂f(g(h(x)))/∂x = (∂f(g(h))/∂h)(∂h(x)/∂x) = (∂f(g)/∂g)(∂g(h)/∂h)(∂h(x)/∂x)   (C.5)

Example C.2 demonstrates this process. Many deep learning packages compute gradients using such automatic differentiation techniques.6 Users rarely have to provide their own gradients.

5. This process is commonly called backpropagation, which specifically refers to reverse accumulation applied to a scalar loss function. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning Representations by Back-Propagating Errors,” Nature, vol. 323, pp. 533–536, 1986.

6. A. Griewank and A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd ed. SIAM, 2008.


Example C.2. How reverse accumulation is used to compute parameter gradients given training data.

Recall the neural network and loss function from example C.1. Here we have drawn the computational graph for the loss calculation: the inputs x, θ2, θ1, and ytrue feed into the intermediate values c1 = θ2x, ypred = θ1 + c1, and c2 = ypred − ytrue, and the loss is ℓ = c2².

Reverse accumulation begins with a forward pass, in which the computational graph is evaluated. We will again use θ = [10,000, 123] and the input-output pair (x = 2,500, ytrue = 360,000), giving c1 = 307,500, ypred = 317,500, c2 = −42,500, and ℓ = 1.81 × 10⁹.

The gradient is then computed by working back up the tree, using ∂ℓ/∂c2 = −85,000, ∂c2/∂ypred = 1, ∂ypred/∂c1 = 1, ∂ypred/∂θ1 = 1, and ∂c1/∂θ2 = 2,500.

Finally, we compute:

∂ℓ/∂θ1 = (∂ℓ/∂c2)(∂c2/∂ypred)(∂ypred/∂θ1) = −85,000 · 1 · 1 = −85,000
∂ℓ/∂θ2 = (∂ℓ/∂c2)(∂c2/∂ypred)(∂ypred/∂c1)(∂c1/∂θ2) = −85,000 · 1 · 1 · 2,500 = −2.125 × 10⁸

D Julia

Julia is a scientific programming language that is free and open source.1 It is a relatively new language that borrows inspiration from languages like Python, MATLAB, and R. It was selected for use in this book because it is sufficiently high level2 so that the algorithms can be compactly expressed and readable while also being fast. This book is compatible with Julia version 1.11. This appendix introduces the concepts necessary for understanding the included code, omitting many of the advanced features of the language.

1. Julia may be obtained from http://julialang.org.

2. In contrast with languages like C++, Julia does not require programmers to worry about memory management and other lower-level details, yet it allows low-level control when needed.

D.1 Types

Julia has a variety of basic types that can represent data given as truth values,
numbers, strings, arrays, tuples, and dictionaries. Users can also define their own
types. This section explains how to use some of the basic types and how to define
new types.

D.1.1 Booleans
The Boolean type in Julia, written as Bool, includes the values true and false. We
can assign these values to variables. Variable names can be any string of characters,
including Unicode, with a few restrictions.
α = true
done = false

The variable name appears on the left side of the equal sign; the value that variable
is to be assigned is on the right side.

We can make assignments in the Julia console. The console, or REPL (for read,
eval, print, loop), will return a response to the expression being evaluated. The #
symbol indicates that the rest of the line is a comment.
julia> x = true
true
julia> y = false; # semicolon suppresses the console output
julia> typeof(x)
Bool
julia> x == y # test for equality
false

The standard Boolean operators are supported:


julia> !x # not
false
julia> x && y # and
false
julia> x || y # or
true

D.1.2 Numbers
Julia supports integer and floating-point numbers, as shown here:
julia> typeof(42)
Int64
julia> typeof(42.0)
Float64

Here, Int64 denotes a 64-bit integer, and Float64 denotes a 64-bit floating-point value.3 We can perform the standard mathematical operations:

3. On 32-bit machines, an integer literal like 42 is interpreted as an Int32.

julia> x = 4
4
4
julia> y = 2
2
julia> x + y
6
julia> x - y
2
julia> x * y
8
julia> x / y
2.0


julia> x ^ y # exponentiation
16
julia> x % y # remainder from division
0
julia> div(x, y) # truncated division returns an integer
2

Note that the result of x / y is a Float64, even when x and y are integers. We
can also perform these operations at the same time as an assignment. For example,
x += 1 is shorthand for x = x + 1.
We can also make comparisons:
julia> 3 > 4
false
julia> 3 >= 4
false
julia> 3 ≥ 4 # unicode also works, use \ge[tab] in console
false
julia> 3 < 4
true
julia> 3 <= 4
true
julia> 3 ≤ 4 # unicode also works, use \le[tab] in console
true
julia> 3 == 4
false
julia> 3 < 4 < 5
true

D.1.3 Strings
A string is an array of characters. Strings are not used very much in this textbook
except to report certain errors. An object of type String can be constructed using
" characters. For example:

julia> x = "optimal"
"optimal"
julia> typeof(x)
String


D.1.4 Symbols
A symbol represents an identifier. It can be written using the : operator or con-
structed from strings:
julia> :A
:A
julia> :Battery
:Battery
julia> Symbol("Failure")
:Failure

D.1.5 Vectors
A vector is a one-dimensional array that stores a sequence of values. We can
construct a vector using square brackets, separating elements by commas:
julia> x = []; # empty vector
julia> x = trues(3); # Boolean vector containing three trues
julia> x = ones(3); # vector of three ones
julia> x = zeros(3); # vector of three zeros
julia> x = rand(3); # vector of three random numbers between 0 and 1
julia> x = [3, 1, 4]; # vector of integers
julia> x = [3.1415, 1.618, 2.7182]; # vector of floats

An array comprehension can be used to create vectors:


julia> [sin(x) for x in 1:5]
5-element Vector{Float64}:
0.8414709848078965
0.9092974268256817
0.1411200080598672
-0.7568024953079282
-0.9589242746631385

We can inspect the type of a vector:


julia> typeof([3, 1, 4]) # 1-dimensional array of Int64s
Vector{Int64} (alias for Array{Int64, 1})
julia> typeof([3.1415, 1.618, 2.7182]) # 1-dimensional array of Float64s
Vector{Float64} (alias for Array{Float64, 1})
julia> Vector{Float64} # alias for a 1-dimensional array
Vector{Float64} (alias for Array{Float64, 1})

We index into vectors using square brackets:


julia> x[1] # first element is indexed by 1


3.1415
julia> x[3] # third element
2.7182
julia> x[end] # use end to reference the end of the array
2.7182
julia> x[end-1] # this returns the second to last element
1.618

We can pull out a range of elements from an array. Ranges are specified using
a colon notation:
julia> x = [1, 2, 5, 3, 1]
5-element Vector{Int64}:
1
2
5
3
1
julia> x[1:3] # pull out the first three elements
3-element Vector{Int64}:
1
2
5
julia> x[1:2:end] # pull out every other element
3-element Vector{Int64}:
1
5
1
julia> x[end:-1:1] # pull out all the elements in reverse order
5-element Vector{Int64}:
1
3
5
2
1

We can perform a variety of operations on arrays. The exclamation mark at the


end of function names is used to indicate that the function mutates (i.e., changes)
the input:
julia> length(x)
5
julia> [x, x] # concatenation
2-element Vector{Vector{Int64}}:


[1, 2, 5, 3, 1]
[1, 2, 5, 3, 1]
julia> push!(x, -1) # add an element to the end
6-element Vector{Int64}:
1
2
5
3
1
-1
julia> pop!(x) # remove an element from the end
-1
julia> append!(x, [2, 3]) # append [2, 3] to the end of x
7-element Vector{Int64}:
1
2
5
3
1
2
3
julia> sort!(x) # sort the elements, altering the same vector
7-element Vector{Int64}:
1
1
2
2
3
3
5
julia> sort(x); # sort the elements as a new vector
julia> x[1] = 2; print(x) # change the first element to 2
[2, 1, 2, 2, 3, 3, 5]
julia> x = [1, 2];
julia> y = [3, 4];
julia> x + y # add vectors
2-element Vector{Int64}:
4
6
julia> 3x - [1, 2] # multiply by a scalar and subtract
2-element Vector{Int64}:
2
4
julia> using LinearAlgebra


julia> dot(x, y) # dot product available after using LinearAlgebra


11
julia> x⋅y # dot product using unicode character, use \cdot[tab] in console
11
julia> prod(y) # product of all the elements in y
12

It is often useful to apply various functions elementwise to vectors. This is a


form of broadcasting. With infix operators (e.g., +, *, and ^), a dot is prefixed to
indicate elementwise broadcasting. With functions like sqrt and sin, the dot is
postfixed:
julia> x .* y # elementwise multiplication
2-element Vector{Int64}:
3
8
julia> x .^ 2 # elementwise squaring
2-element Vector{Int64}:
1
4
julia> sin.(x) # elementwise application of sin
2-element Vector{Float64}:
0.8414709848078965
0.9092974268256817
julia> sqrt.(x) # elementwise application of sqrt
2-element Vector{Float64}:
1.0
1.4142135623730951

D.1.6 Matrices
A matrix is a two-dimensional array. Like a vector, it is constructed using square
brackets. We use spaces to delimit elements in the same row and semicolons to
delimit rows. We can also index into the matrix and output submatrices using
ranges:
julia> X = [1 2 3; 4 5 6; 7 8 9; 10 11 12];
julia> typeof(X) # a 2-dimensional array of Int64s
Matrix{Int64} (alias for Array{Int64, 2})
julia> X[2] # second element using column-major ordering
4
julia> X[3,2] # element in third row and second column
8


julia> X[1,:] # extract the first row


3-element Vector{Int64}:
1
2
3
julia> X[:,2] # extract the second column
4-element Vector{Int64}:
2
5
8
11
julia> X[:,1:2] # extract the first two columns
4×2 Matrix{Int64}:
1 2
4 5
7 8
10 11
julia> X[1:2,1:2] # extract a 2x2 submatrix from the top left of x
2×2 Matrix{Int64}:
1 2
4 5
julia> Matrix{Float64} # alias for a 2-dimensional array
Matrix{Float64} (alias for Array{Float64, 2})

We can also construct a variety of special matrices and use array comprehen-
sions:
julia> Matrix(1.0I, 3, 3) # 3x3 identity matrix
3×3 Matrix{Float64}:
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
julia> Matrix(Diagonal([3, 2, 1])) # 3x3 diagonal matrix with 3, 2, 1 on diagonal
3×3 Matrix{Int64}:
3 0 0
0 2 0
0 0 1
julia> zeros(3,2) # 3x2 matrix of zeros
3×2 Matrix{Float64}:
0.0 0.0
0.0 0.0
0.0 0.0
julia> rand(3,2) # 3x2 random matrix
3×2 Matrix{Float64}:
0.892384 0.649514


0.415807 0.196564
0.543603 0.382245
julia> [sin(x + y) for x in 1:3, y in 1:2] # array comprehension
3×2 Matrix{Float64}:
0.909297 0.14112
0.14112 -0.756802
-0.756802 -0.958924

Matrix operations include the following:


julia> X' # complex conjugate transpose
3×4 adjoint(::Matrix{Int64}) with eltype Int64:
1 4 7 10
2 5 8 11
3 6 9 12
julia> 3X .+ 2 # multiplying by scalar and adding scalar
4×3 Matrix{Int64}:
5 8 11
14 17 20
23 26 29
32 35 38
julia> X = [1 3; 3 1]; # create an invertible matrix
julia> inv(X) # inversion
2×2 Matrix{Float64}:
-0.125 0.375
0.375 -0.125
julia> pinv(X) # pseudoinverse (requires LinearAlgebra)
2×2 Matrix{Float64}:
-0.125 0.375
0.375 -0.125
julia> det(X) # determinant (requires LinearAlgebra)
-8.0
julia> [X X] # horizontal concatenation, same as hcat(X, X)
2×4 Matrix{Int64}:
1 3 1 3
3 1 3 1
julia> [X; X] # vertical concatenation, same as vcat(X, X)
4×2 Matrix{Int64}:
1 3
3 1
1 3
3 1
julia> sin.(X) # elementwise application of sin
2×2 Matrix{Float64}:
0.841471 0.14112


0.14112 0.841471
julia> map(sin, X) # elementwise application of sin
2×2 Matrix{Float64}:
0.841471 0.14112
0.14112 0.841471
julia> vec(X) # reshape an array as a vector
4-element Vector{Int64}:
1
3
3
1

D.1.7 Tuples
A tuple is an ordered list of values, potentially of different types. They are con-
structed with parentheses. They are similar to vectors, but they cannot be mutated:
julia> x = () # the empty tuple
()
julia> isempty(x)
true
julia> x = (1,) # tuples of one element need the trailing comma
(1,)
julia> typeof(x)
Tuple{Int64}
julia> x = (1, 0, [1, 2], 2.5029, 4.6692) # third element is a vector
(1, 0, [1, 2], 2.5029, 4.6692)
julia> typeof(x)
Tuple{Int64, Int64, Vector{Int64}, Float64, Float64}
julia> x[2]
0
julia> x[end]
4.6692
julia> x[4:end]
(2.5029, 4.6692)
julia> length(x)
5
julia> x = (1, 2)
(1, 2)
julia> a, b = x;
julia> a
1
julia> b
2


D.1.8 Named Tuples


A named tuple is like a tuple, but each entry has its own name:
julia> x = (a=1, b=-Inf)
(a = 1, b = -Inf)
julia> x isa NamedTuple
true
julia> x.a
1
julia> a, b = x;
julia> a
1
julia> (; :a=>10)
(a = 10,)
julia> (; :a=>10, :b=>11)
(a = 10, b = 11)
julia> merge(x, (d=3, e=10)) # merge two named tuples
(a = 1, b = -Inf, d = 3, e = 10)

D.1.9 Dictionaries
A dictionary is a collection of key-value pairs. Key-value pairs are indicated with
a double arrow operator =>. We can index into a dictionary using square brackets,
just as with arrays and tuples:
julia> x = Dict(); # empty dictionary
julia> x[3] = 4 # associate key 3 with value 4
4
julia> x = Dict(3=>4, 5=>1) # create a dictionary with two key-value pairs
Dict{Int64, Int64} with 2 entries:
5 => 1
3 => 4
julia> x[5] # return the value associated with key 5
1
julia> haskey(x, 3) # check whether dictionary has key 3
true
julia> haskey(x, 4) # check whether dictionary has key 4
false
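
We can also look up a key while supplying a default value to return if the key is missing; a short illustrative session:

julia> get(x, 5, 0) # value for key 5, or 0 if the key is missing
1
julia> get(x, 4, 0)
0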


D.1.10 Composite Types


A composite type is a collection of named fields. By default, an instance of a com-
posite type is immutable (i.e., it cannot change). We use the struct keyword and
then give the new type a name and list the names of the fields:
struct A
    a
    b
end

Adding the keyword mutable makes it so that an instance can change:


mutable struct B
    a
    b
end

Composite types are constructed using parentheses, between which we pass in values for each field:
x = A(1.414, 1.732)
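
The fields of an instance are then accessed by name using dot syntax; for example:

julia> x = A(1.414, 1.732);
julia> x.a
1.414
julia> x.b
1.732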

The double-colon operator can be used to specify the type for any field:
struct A
    a::Int64
    b::Float64
end

These type annotations require that we pass in an Int64 for the first field and
a Float64 for the second field. For compactness, this book does not use type
annotations, though this comes at the expense of performance. Type annotations
allow Julia to improve runtime performance because the compiler can optimize the
underlying code for specific types.

D.1.11 Abstract Types

So far we have discussed concrete types, which are types that we can construct.
However, concrete types are only part of the type hierarchy. There are also abstract
types, which are supertypes of concrete types and other abstract types.

Figure D.1. The type hierarchy for the Float64 type: Float64 <: AbstractFloat <: Real <: Number <: Any, where AbstractFloat also has the subtypes Float32, Float16, and BigFloat.

We can explore the type hierarchy of the Float64 type shown in figure D.1
using the supertype and subtypes functions:


julia> supertype(Float64)
AbstractFloat
julia> supertype(AbstractFloat)
Real
julia> supertype(Real)
Number
julia> supertype(Number)
Any
julia> supertype(Any) # Any is at the top of the hierarchy
Any
julia> using InteractiveUtils # required for using subtypes in scripts
julia> subtypes(AbstractFloat) # different types of AbstractFloats
4-element Vector{Any}:
BigFloat
Float16
Float32
Float64
julia> subtypes(Float64) # Float64 does not have any subtypes
Type[]

We can define our own abstract types:


abstract type C end
abstract type D <: C end # D is an abstract subtype of C
struct E <: D # E is a composite type that is a subtype of D
    a
end
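
We can query these relationships with the <: operator and the isa function; a short illustrative session using the types just defined:

julia> E <: C # E is a subtype of C through D
true
julia> E(1) isa D # an instance of E is also an instance of the abstract type D
true
julia> isabstracttype(D)
true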

D.1.12 Parametric Types


Julia supports parametric types, which are types that take parameters. The param-
eters to a parametric type are given within braces and delimited by commas. We
have already seen a parametric type with our dictionary example:
julia> x = Dict(3=>1.4, 1=>5.9)
Dict{Int64, Float64} with 2 entries:
3 => 1.4
1 => 5.9

For dictionaries, the first parameter specifies the key type, and the second param-
eter specifies the value type. The example has Int64 keys and Float64 values,
making the dictionary of type Dict{Int64,Float64}. Julia was able to infer these
types based on the input, but we could have specified it explicitly:


julia> x = Dict{Int64,Float64}(3=>1.4, 1=>5.9);

While it is possible to define our own parametric types, we do not need to do so in this text.
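
For completeness, a minimal sketch of such a definition might look like the following (the Point type here is purely illustrative and is not used elsewhere):

struct Point{T}
    x::T
    y::T
end

Constructing Point(1, 2) produces a Point{Int64}, while Point(1.0, 2.0) produces a Point{Float64}.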

D.2 Functions

A function maps its arguments, given as a tuple, to a result that is returned.

D.2.1 Named Functions


One way to define a named function is to use the function keyword, followed by
the name of the function and a tuple of names of arguments:
function f(x, y)
    return x + y
end

We can also define functions compactly using assignment form:


julia> f(x, y) = x + y;
julia> f(3, 0.1415)
3.1415

D.2.2 Anonymous Functions


An anonymous function is not given a name, though it can be assigned to a named
variable. One way to define an anonymous function is to use the arrow operator:
julia> h = x -> x^2 + 1 # assign anonymous function with input x to a variable h
#1 (generic function with 1 method)
julia> h(3)
10
julia> g(f, a, b) = [f(a), f(b)]; # applies function f to a and b and returns array
julia> g(h, 5, 10)
2-element Vector{Int64}:
26
101
julia> g(x->sin(x)+1, 10, 20)
2-element Vector{Float64}:
0.4559788891106302
1.9129452507276277


D.2.3 Callable Objects


We can define a type and associate functions with it, allowing objects of that type
to be callable:

julia> (x::A)() = x.a + x.b # adding a zero-argument function to the type A defined earlier
julia> (x::A)(y) = y*x.a + x.b # adding a single-argument function
julia> x = A(22, 8);
julia> x()
30
julia> x(2)
52

D.2.4 Optional Arguments


We can assign a default value to an argument, making the specification of that
argument optional:
julia> f(x=10) = x^2;
julia> f()
100
julia> f(3)
9
julia> f(x, y, z=1) = x*y + z;
julia> f(1, 2, 3)
5
julia> f(1, 2)
3

D.2.5 Keyword Arguments


Functions may use keyword arguments, which are arguments that are named
when the function is called. Keyword arguments are given after all the positional
arguments. A semicolon is placed before any keywords, separating them from
the other arguments:
julia> f(; x = 0) = x + 1;
julia> f()
1
julia> f(x = 10)
11
julia> f(x, y = 10; z = 2) = (x + y)*z;
julia> f(1)
22
julia> f(2, z = 3)
36
julia> f(2, 3)
10
julia> f(2, 3, z = 1)
5

D.2.6 Dispatch
The types of the arguments passed to a function can be specified using the double
colon operator. If multiple methods of the same function are provided, Julia will
execute the appropriate method. The mechanism for choosing which method to
execute is called dispatch:
julia> f(x::Int64) = x + 10;
julia> f(x::Float64) = x + 3.1415;
julia> f(1)
11
julia> f(1.0)
4.141500000000001
julia> f(1.3)
4.4415000000000004

The method with a type signature that best matches the types of the arguments
given will be used:
julia> f(x) = 5;
julia> f(x::Float64) = 3.1415;
julia> f([3, 2, 1])
5
julia> f(0.00787499699)
3.1415

D.2.7 Splatting
It is often useful to splat the elements of a vector or a tuple into the arguments to
a function using the ... operator:


julia> f(x,y,z) = x + y - z;
julia> a = [3, 1, 2];
julia> f(a...)
2
julia> b = (2, 2, 0);
julia> f(b...)
4
julia> c = ([0,0],[1,1]);
julia> f([2,2], c...)
2-element Vector{Int64}:
1
1

D.3 Control Flow

We can control the flow of our programs using conditional evaluation and loops.
This section provides some of the syntax used in the book.

D.3.1 Conditional Evaluation


Conditional evaluation will check the value of a Boolean expression and then
evaluate the appropriate block of code. One of the most common ways to do this
is with an if statement:
if x < y
    # run this if x < y
elseif x > y
    # run this if x > y
else
    # run this if x == y
end

We can also use the ternary operator with its question mark and colon syntax.
It checks the Boolean expression before the question mark. If the expression
evaluates to true, then it returns what comes before the colon; otherwise, it
returns what comes after the colon:
julia> f(x) = x > 0 ? x : 0;
julia> f(-10)
0
julia> f(10)
10
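
Julia also supports short-circuit evaluation with && and ||, which can serve as compact conditionals; a small illustrative session:

julia> x = 5;
julia> x > 0 && println("x is positive") # right side runs only if the left is true
x is positive
julia> x < 0 || println("x is not negative") # right side runs only if the left is false
x is not negative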


D.3.2 Loops
A loop allows for repeated evaluation of expressions. One type of loop is the
while loop, which repeatedly evaluates a block of expressions as long as the
condition after the while keyword holds. The following example sums the values
in the array X:
X = [1, 2, 3, 4, 6, 8, 11, 13, 16, 18]
s = 0
while !isempty(X)
    s += pop!(X)
end

Another type of loop is the for loop, which uses the for keyword. The following
example will also sum over the values in the array X but will not modify X:
X = [1, 2, 3, 4, 6, 8, 11, 13, 16, 18]
s = 0
for y in X
    s += y
end

The in keyword can be replaced by = or ∈. The following code block is equivalent:


X = [1, 2, 3, 4, 6, 8, 11, 13, 16, 18]
s = 0
for i = 1:length(X)
    s += X[i]
end
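
Loops can also exit early with break or skip to the next iteration with continue; a minimal sketch:

X = [1, 2, 3, 4, 6, 8, 11, 13, 16, 18]
s = 0
for y in X
    y > 10 && break # stop once a value exceeds 10
    isodd(y) && continue # skip odd values
    s += y
end

Here, s accumulates the even values encountered before the first value greater than 10.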

D.3.3 Iterators
We can iterate over collections in contexts such as for loops and array comprehen-
sions. To demonstrate various iterators, we will use the collect function, which
returns an array of all items generated by an iterator:

julia> X = ["feed", "sing", "ignore"];


julia> collect(enumerate(X)) # return the count and the element
3-element Vector{Tuple{Int64, String}}:
(1, "feed")
(2, "sing")
(3, "ignore")
julia> collect(eachindex(X)) # equivalent to 1:length(X)
3-element Vector{Int64}:
1
2
3
julia> Y = [-5, -0.5, 0];
julia> collect(zip(X, Y)) # iterate over multiple iterators simultaneously
3-element Vector{Tuple{String, Float64}}:
("feed", -5.0)
("sing", -0.5)
("ignore", 0.0)
julia> import IterTools: subsets
julia> collect(subsets(X)) # iterate over all subsets
8-element Vector{Vector{String}}:
[]
["feed"]
["sing"]
["feed", "sing"]
["ignore"]
["feed", "ignore"]
["sing", "ignore"]
["feed", "sing", "ignore"]
julia> collect(eachindex(X)) # iterate over indices into a collection
3-element Vector{Int64}:
1
2
3
julia> Z = [1 2; 3 4; 5 6];
julia> import Base.Iterators: product
julia> collect(product(X,Y)) # iterate over Cartesian product of multiple iterators
3×3 Matrix{Tuple{String, Float64}}:
("feed", -5.0) ("feed", -0.5) ("feed", 0.0)
("sing", -5.0) ("sing", -0.5) ("sing", 0.0)
("ignore", -5.0) ("ignore", -0.5) ("ignore", 0.0)

D.4 Packages

A package is a collection of Julia code and possibly other external libraries that
can be imported to provide additional functionality. This section briefly reviews
a few of the key packages that we build upon in this book. To add a registered
package like Distributions.jl, we can run
using Pkg
Pkg.add("Distributions")

To update packages, we use


Pkg.update()

To use a package, we use the keyword using as follows:


using Distributions

D.4.1 Distributions.jl
We use the Distributions.jl package (version 0.25) to represent, fit, and sample
from probability distributions:
julia> using Distributions
julia> dist = Categorical([0.3, 0.5, 0.2]) # create a categorical distribution
Distributions.Categorical{Float64, Vector{Float64}}(support=Base.OneTo(3), p=[0.3, 0.5, 0.2])
julia> data = rand(dist) # generate a sample
1
julia> data = rand(dist, 2) # generate two samples
2-element Vector{Int64}:
3
2
julia> μ, σ = 5.0, 2.5; # define parameters of a normal distribution
julia> dist = Normal(μ, σ) # create a normal distribution
Distributions.Normal{Float64}(μ=5.0, σ=2.5)
julia> rand(dist) # sample from the distribution
4.944128552366248
julia> data = rand(dist, 3) # generate three samples
3-element Vector{Float64}:
-1.3094336870488144
6.84427722292975
2.0861877312652815
julia> data = rand(dist, 1000); # generate many samples
julia> Distributions.fit(Normal, data) # fit a normal distribution to the samples
Distributions.Normal{Float64}(μ=4.92941470988734, σ=2.3738926505677984)
julia> μ = [1.0, 2.0];
julia> Σ = [1.0 0.5; 0.5 2.0];
julia> dist = MvNormal(μ, Σ) # create a multivariate normal distribution
FullNormal(
dim: 2
μ: [1.0, 2.0]
Σ: [1.0 0.5; 0.5 2.0]
)
julia> rand(dist, 3) # generate three samples
2×3 Matrix{Float64}:
1.02625 0.648017 1.25499
1.04747 3.78303 1.35686

julia> dist = Dirichlet(ones(3)) # create a Dirichlet distribution Dir(1,1,1)
Distributions.Dirichlet{Float64, Vector{Float64}, Float64}(alpha=[1.0, 1.0, 1.0])
julia> rand(dist) # sample from the distribution
3-element Vector{Float64}:
0.08630277818178314
0.6661794918224854
0.24751772999573152
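
The package also provides functions for evaluating densities and distribution functions; a short illustrative session:

julia> dist = Normal(0.0, 1.0);
julia> pdf(dist, 0.0) # density at the mean, 1/sqrt(2π)
0.3989422804014327
julia> cdf(dist, 0.0) # cumulative probability up to the mean
0.5
julia> quantile(dist, 0.5) # median of the distribution
0.0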

D.4.2 JuMP.jl
We use the JuMP.jl package (version 1.23) to specify optimization problems that
we can then solve using a variety of solvers, such as those included in GLPK.jl
and Ipopt.jl:
julia> using JuMP
julia> using GLPK
julia> model = Model(GLPK.Optimizer) # create model and use GLPK as solver
A JuMP Model
├ solver: GLPK
├ objective_sense: FEASIBILITY_SENSE
├ num_variables: 0
├ num_constraints: 0
└ Names registered in the model: none
julia> @variable(model, x[1:3]) # define variables x[1], x[2], and x[3]
3-element Vector{JuMP.VariableRef}:
x[1]
x[2]
x[3]
julia> @objective(model, Max, sum(x) - x[2]) # define maximization objective
x[1] + 0 x[2] + x[3]
julia> @constraint(model, x[1] + x[2] ≤ 3) # add constraint
x[1] + x[2] ≤ 3
julia> @constraint(model, x[2] + x[3] ≤ 2) # add another constraint
x[2] + x[3] ≤ 2
julia> @constraint(model, x[2] ≥ 0) # add another constraint
x[2] ≥ 0
julia> optimize!(model) # solve
julia> value.(x) # extract optimal values for elements in x
3-element Vector{Float64}:
3.0
0.0
2.0


D.4.3 Optim.jl
We use the Optim.jl package (version 1.9) to solve unconstrained optimization
problems using a variety of techniques:
julia> using Optim
julia> f(x) = x[1]^4+x[1]^2-x[1]+x[2]^2-20*x[1]^2*x[2]^2; # function to minimize
julia> x₀ = [0.0, 0.0]; # initial guess
julia> result = optimize(f, x₀, LBFGS()) # minimize f starting at x₀ using LBFGS
* Status: success
* Candidate solution
Final objective value: -2.148047e-01
* Found with
Algorithm: L-BFGS
* Convergence measures
|x - x'| = 1.94e-04 ≰ 0.0e+00
|x - x'|/|x'| = 5.04e-04 ≰ 0.0e+00
|f(x) - f(x')| = 7.13e-08 ≰ 0.0e+00
|f(x) - f(x')|/|f(x')| = 3.32e-07 ≰ 0.0e+00
|g(x)| = 5.12e-09 ≤ 1.0e-08
* Work counters
Seconds run: 0 (vs limit Inf)
Iterations: 4
f(x) calls: 9
∇f(x) calls: 9
julia> result.minimizer # extract the optimal value for x
2-element Vector{Float64}:
0.3854584971606701
0.0
julia> result.minimum # extract the optimal value of the function
-0.21480474685286194

D.4.4 SimpleWeightedGraphs.jl
We extend the SimpleWeightedGraphs.jl package (version 1.4) to represent
graphs with weighted edges for discrete reachability analysis in chapter 10. Specif-
ically, we extend the package to create a WeightedGraph type that can represent a
graph with weighted edges between states. The code for this extension is provided
in the ancillaries for this book, and the following code demonstrates its usage:

julia> using SimpleWeightedGraphs


julia> states = [:s1, :s2, :s3]; # define states
julia> g = WeightedGraph(states); # weighted graph with a node for each state
julia> add_edge!(g, :s1, :s2, 1.0); # add an edge from s1 to s2 with weight 1.0
julia> get_weight(g, :s1, :s2) # get the weight of the edge from s1 to s2
1.0
julia> inneighbors(g, :s2) # get the nodes with an edge pointing to s2
1-element Vector{Symbol}:
:s1
julia> outneighbors(g, :s1) # get the nodes with an edge starting at s1
1-element Vector{Symbol}:
:s2

D.5 Convenience Functions

We define SetCategorical to represent distributions over discrete sets:


using Distributions, LinearAlgebra # provides Categorical, norm, and normalize

struct SetCategorical{S}
    elements::Vector{S} # Set elements (could be repeated)
    distr::Categorical  # Categorical distribution over set elements

    function SetCategorical(elements::AbstractVector{S}) where S
        weights = ones(length(elements))
        return new{S}(elements, Categorical(normalize(weights, 1)))
    end

    function SetCategorical(
        elements::AbstractVector{S},
        weights::AbstractVector{Float64}
    ) where S
        ℓ₁ = norm(weights, 1)
        if ℓ₁ < 1e-6 || isinf(ℓ₁)
            return SetCategorical(elements)
        end
        distr = Categorical(normalize(weights, 1))
        return new{S}(elements, distr)
    end
end

Distributions.rand(D::SetCategorical) = D.elements[rand(D.distr)]
Distributions.rand(D::SetCategorical, n::Int) = D.elements[rand(D.distr, n)]
function Distributions.pdf(D::SetCategorical, x)
    return sum(e == x ? w : 0.0 for (e, w) in zip(D.elements, D.distr.p))
end


julia> D = SetCategorical(["up", "down", "left", "right"],[0.4, 0.2, 0.3, 0.1]);


julia> rand(D)
"left"
julia> rand(D, 5)
5-element Vector{String}:
"up"
"up"
"up"
"up"
"up"
julia> pdf(D, "up")
0.3999999999999999

Index

H-polytope, 187 bit, 330 concrete types, 352


V -polytope, 187 black-box simulators, 93 concretize, 216
k-means, 279 Boolean, 341 conditional coverage, 302
Boolean satisfiability, 238 conditional distribution, 25
absolutely homogeneous, 328 bootstrap method, 39 conditional Gaussian distribution, 27
abstract types, 352 bounded model checking, 233 conditional value at risk, 56
activation function, 336 breadth-first search, 233 conformal prediction, 301
activation pattern, 228 bridge density, 164 conjugate prior, 34
Adaptive importance sampling, 151 bridge sampling, 164 conjunction, 63
adaptive stress testing, 114 broadcasting, 347 conservative linearization, 212
admissible, 112 burn-in, 126 consistent, 141
adversary, 116 Büchi automaton, 74 continuous entropy, 330, see differential
aleatoric uncertainty, 292, see output entropy
uncertainty convex, 186
alignment problem, 2, 255 calibration, 297 convex combinations, 187
anonymous function, 354 calibration plot, 44 convex hull, 187
array comprehension, 344 callable, 355 convex set, 186
atomic proposition, 62 cascading errors, 40 countable additivity, 327
average dispersion, 106 Chebyshev norm, 329 counterexample, 237
avoid set, 178 chessboard norm, 329 counterexample search, 237
closed under complementation, 327 counterexamples, 81
backpropagation, 339 closed under countable unions, 327 counterfactual, 273
backward reachability, 233 clustering, 279 counterfactual explanation, 273
batch, 335 coefficient of variation, 174 coverage, 105, 300
Bayesian model averaging, 308 coherent risk measure, 56 coverage metrics, 105
Bayesian parameter learning, 34 composite metric, 58 cross entropy, 151, 331
behavior model, 41 composite metrics, 58 cross entropy method, 151
behavioral cloning, 39 composite type, 352 CTL, see Computation tree logic
best response, 41 Computation tree logic, 67 curse of dimensionality, 290
biconditional, 63 Concrete reachability, 216

CVaR, see conditional value at risk exploration, 112 inclusion function, 205
independently and identically
DAgger, see data set aggregation F-divergences, 46 distributed, 28
data set aggregation, 40 failure region, 104 individuals, 95
decision tree, 273 failure trajectories, 81 information content, 330
deep learning, 335 failures, 81 integrated gradients, 264
deep neural network, 336 Falsification, 13 interaction models, 42
defect, 97 falsification, 81 interpretability, 256
dependency effect, 182 falsifying trajectories, 81 interval arithmetic, 203
depth of rationality, 42 feature, 257 interval box, 203, see hyperrectangle
dictionary, 351 feature collapse, 292 interval counterpart, 204
differential entropy, 330 feature importance, 257 interval hull, 204
direct methods, 93 feedforward network, 336 intervals, 189
discrepancy, 106 first fundamental theorem of calculus, invariant set, 185
discrete state abstraction, 248 331 inverse reinforcement learning, 40
disjunction, 63 first-order, 93 irreducible uncertainty, 292, see output
dispatch, 356 first-order logic, 63 uncertainty
dispersion, 105 forward reachability, 177, 233 iterative deepening, 237
distance metric, 328 function, 354 iterative refinement, 194
disturbance distribution, 84 fuzzing, 85
disturbances, 81, 82 joint distribution, 25
double progressive widening, 116 Gaussian distribution, 21, 22
Gaussian mixture model, 23 k-fold cross validation, 39
earth mover’s distance, 46 generalization performance, 35 K-L divergence, see Kullback-Leibler
ECE, see expected calibration error generative adversarial networks, 24 divergence
elite samples, 153 generative models, 24 K-S statistic, see Kolmogorov-Smirnov
EM, see expectation-maximization generators, 188 statistic
entropic value at risk, 56 global explanation, 270 kernel, 126
entropy, 297, 330 keyword argument, 355
environment, 8 half space, 186 Kolmogorov-Smirnov statistic, 46
episode, 116 Hausdorff distance, 190 Kolmorogov axioms, 328
epistemic uncertainty, 293 hierarchical, 42 Kullback-Leibler divergence, 44, 331,
Euclidean norm, 329 hierarchical softmax, 42 see relative entropy
event space, 328 histogram binning, 298
exchangeable, 302 holdout method, 39 Lagrange remainder, 212
existential quantifier, 65 hyperrectangle, 189 least-squares, 28, 29
expectation-maximization, 34 linear inequalities, 186
expected calibration error, 46 identity of indiscernibles, 328 linear model, 270
expected value, 54 imitation game, 48 linear program, 195
explainability, 256 imitation learning, 39 linear systems, 178
explanation, 255 implication, 63 Linear temporal logic, 67
exploitation, 112 Importance sampling, 145 linearization, 208


local descent methods, 93
local explanation, 270
log-likelihood, 28
logic gates, 63
logical formula, 62
logical specification, 62
logit response, 41
logit-level-k, 42
loop, 358
loss function, 335
lower confidence bound (LCB), 114
LTL, see Linear temporal logic
machine learning, 28
marginal coverage, 302
marginal satisfaction, 282
Markov chain, 126
Markov chain Monte Carlo, 126
matrix, 347
max norm, 329
maximum a posteriori, 34
maximum calibration error, 46
maximum entropy inverse reinforcement learning, 40
maximum likelihood estimate, 28
maximum likelihood parameter learning, 28
maximum margin inverse reinforcement learning, 40
MCE, see maximum calibration error
mean excess loss, 56
mean shortfall, 56
mean value theorem, 206
measurable set, 327
measure, 327
measure space, 327
metric, 328
metric space, 328
metrics, 53
Metropolis-Adjusted Langevin Algorithm, 131
Metropolis-Hastings, 126
Minkowski sum, 182
mixed-integer linear program, 221
mixture models, 21
mode, 34
model class, 19
model uncertainty, 293
Monte Carlo tree search (MCTS), 112
multimodal, 21
multiple importance sampling, 148
multiple shooting, 98
multivariate distribution, 25
multivariate Gaussian distribution, 25
mutate, 345
named function, 354
named tuple, 351
nat, 330
natural, 330
natural inclusion function, 206
negation, 63
neural network, 335
neural network verification, 226
nonnegativity, 327
nonparametric, 161
normal distribution, 21, 22
normalizing flows, 23
normed vector space, 328
ODD, see operational design domain
offline, 285
online, 285
operational design domain, 285
output uncertainty, 292
overapproximation, 190
overapproximation error, 190
overfitting, 39
package, 359
parameter learning, 27
parameter tuning, 335
parametric signal temporal logic, 282
parametric types, 353
Pareto frontier, 58
Pareto optimal, 58
partially observable Markov decision process, 10
partitioning, 222
polyhedron, 186
polynomial zonotopes, 212
polytope, 186
polytopes, 186
population, 155
Population methods, 95
Population Monte Carlo, 155
positive definite, 330
positive semidefinite, 330
positive spanning set, 193
posterior distribution, 34
precision parameter, 41, 298
predicate function, 63
predicates, 63
preference elicitation, 60
prior distribution, 34
Probabilistic programming, 134
probabilistic programming, 35
probability, 20
probability axioms, 328
probability density functions, 21
probability distribution, 20
probability mass function, 20
probability measure, 328
probability space, 328
product system, 77
progressive widening, 114
proper scoring rule, 294
proposal distribution, 122
proposition, 62
propositional logic, 62
prototypical example, 281
pseudoinverse, 33
pseudorandom number generator, 24
PSTL, see parametric signal temporal logic
Q-Q plot, see quantile-quantile plot
quantal response, 41
quantal-level-k, 42
quantifiers, 63
quantile-quantile plot, 44
rapidly exploring random trees (RRT), 100
ratio importance sampling, 163
reachability specification, 73
reachable set, 178
rectified linear unit, 222
reducible uncertainty, 293, see model uncertainty
Reinforcement learning, 116
Rejection sampling, 122
relative entropy, 331
reverse accumulation, 339
risk metric, 55
robustness, 71
rollout, 10
rotation estimation, 39
runtime monitoring, 285
safety case, 14
saliency maps, 261
sample efficiency, 117
sample space, 328
SAT, see Boolean satisfiability
Satisfiability modulo theories, 240
second-order, 93
Self-normalized importance sampling, 165
sensitivity analysis, 259
sensor, 9
sequential interactive demonstration, 40
sequential Monte Carlo, 157
Set propagation, 179
Shannon information, 330
Shapley value, 266
Shooting methods, 97
sigmoid models, 27
signal, 69
Signal temporal logic, 69
single shooting, 98
smooth robustness, 73
Smoothing, 129
SMT, see Satisfiability modulo theories
softmax, 73, 297, 337
softmax response, 41
softmin, 73
spatial index, 287
specification, 8, 11
specifications, 53
splat, 356
spurious correlations, 259
standard error, 140
Star discrepancy, 107
star sets, 212
stationary, 86
STL, see Signal temporal logic
string, 343
superlevel set, 290
support, 21
support function, 191
support vector, 193
supremum, 105
surrogate model, 267
Swiss cheese model, 14
symbol, 344
symbolic reachability, 216
symmetry, 328
system, 8
tail value at risk, 56
taxicab norm, 329
Taylor approximation, 332
Taylor expansion, 331
Taylor inclusion function, 208
Taylor models, 212
Taylor series, 331
temperature scaling, 298
temporal logic, 65
ternary operator, 357
test set, 39
testing, 1
training, 27, 335
training set, 39
trajectory, 10
triangle inequality, 328
truth table, 63
tuple, 350
Turing test, 48
umbrella sampling, 163
unbiased, 141
unbounded model checking, 235
uncertainty quantification, 292
unimodal, 21
univariate distribution, 25
universal quantifier, 65
utopia point, 58
V model, 5
validation, 1
value at risk, 55
VaR, see value at risk
variable, 63
variance, 55
vector, 344
vector space, 328
verification, 1
Wasserstein distance, 46
waterfall model, 4
weighted exponential sum, 60
weighted metrics, 58
weighted sum, 58
wrapping effect, 218
zonotope, 188