
Abstract Dynamic Programming

THIRD EDITION

Dimitri P. Bertsekas
Arizona State University

Massachusetts Institute of Technology

WWW site for book information and orders

http://www.athenasc.com

Athena Scientific, Belmont, Massachusetts


Athena Scientific
Post Office Box 805
Nashua, NH 03061-0805
U.S.A.

Email: [email protected]
WWW: http://www.athenasc.com

Cover design: Dimitri Bertsekas

© 2022 Dimitri P. Bertsekas


All rights reserved. No part of this book may be reproduced in any form
by any electronic or mechanical means (including photocopying, recording,
or information storage and retrieval) without permission in writing from
the publisher.

Publisher’s Cataloging-in-Publication Data


Bertsekas, Dimitri P.
Abstract Dynamic Programming: Third Edition
Includes bibliographical references and index
1. Mathematical Optimization. 2. Dynamic Programming. I. Title.
QA402.5 .B465 2022 519.703 01-75941

ISBN-10: 1-886529-47-7, ISBN-13: 978-1-886529-47-2


ABOUT THE AUTHOR

Dimitri Bertsekas studied Mechanical and Electrical Engineering at the
National Technical University of Athens, Greece, and obtained his Ph.D.
in system science from the Massachusetts Institute of Technology. He
has held faculty positions with the Engineering-Economic Systems Depart-
ment, Stanford University, and the Electrical Engineering Department of
the University of Illinois, Urbana. From 1979 to 2019 he was a professor at
the Electrical Engineering and Computer Science Department of the Mas-
sachusetts Institute of Technology (M.I.T.), where he continues to hold the
title of McAfee Professor of Engineering. In 2019, he joined the School of
Computing and Augmented Intelligence at the Arizona State University,
Tempe, AZ, as Fulton Professor of Computational Decision Making.
Professor Bertsekas’ teaching and research have spanned several fields,
including deterministic optimization, dynamic programming and stochastic
control, large-scale and distributed computation, artificial intelligence, and
data communication networks. He has authored or coauthored numerous
research papers and twenty books, several of which are currently used as
textbooks in MIT classes, including “Dynamic Programming and Optimal
Control,” “Data Networks,” “Introduction to Probability,” and “Nonlinear
Programming.” At ASU, he has been focusing on teaching and research in
reinforcement learning, and he has written several textbooks and research
monographs in this field since 2019.
Professor Bertsekas was awarded the INFORMS 1997 Prize for Re-
search Excellence in the Interface Between Operations Research and Com-
puter Science for his book “Neuro-Dynamic Programming” (co-authored
with John Tsitsiklis), the 2001 AACC John R. Ragazzini Education Award,
the 2009 INFORMS Expository Writing Award, the 2014 AACC Richard
Bellman Heritage Award, the 2014 INFORMS Khachiyan Prize for Life-
Time Accomplishments in Optimization, the 2015 MOS/SIAM George B.
Dantzig Prize, and the 2022 IEEE Control Systems Award. In 2018 he
shared with his coauthor, John Tsitsiklis, the INFORMS John von
Neumann Theory Prize for the contributions of the research monographs
“Parallel and Distributed Computation” and “Neuro-Dynamic Program-
ming.” Professor Bertsekas was elected in 2001 to the United States Na-
tional Academy of Engineering for “pioneering contributions to fundamen-
tal research, practice and education of optimization/control theory.”
ATHENA SCIENTIFIC
OPTIMIZATION AND COMPUTATION SERIES

1. A Course in Reinforcement Learning by Dimitri P. Bertsekas, 2023,
ISBN 978-1-886529-49-6, 424 pages
2. Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive
Control by Dimitri P. Bertsekas, 2022, ISBN 978-1-886529-17-5, 245
pages
3. Abstract Dynamic Programming, 3rd Edition, by Dimitri P. Bert-
sekas, 2022, ISBN 978-1-886529-47-2, 420 pages
4. Rollout, Policy Iteration, and Distributed Reinforcement Learning, by
Dimitri P. Bertsekas, 2020, ISBN 978-1-886529-07-6, 480 pages
5. Reinforcement Learning and Optimal Control, by Dimitri P. Bert-
sekas, 2019, ISBN 978-1-886529-39-7, 388 pages
6. Dynamic Programming and Optimal Control, Two-Volume Set, by
Dimitri P. Bertsekas, 2017, ISBN 1-886529-08-6, 1270 pages
7. Nonlinear Programming, 3rd Edition, by Dimitri P. Bertsekas, 2016,
ISBN 1-886529-05-1, 880 pages
8. Convex Optimization Algorithms, by Dimitri P. Bertsekas, 2015, ISBN
978-1-886529-28-1, 576 pages
9. Convex Optimization Theory, by Dimitri P. Bertsekas, 2009, ISBN
978-1-886529-31-1, 256 pages
10. Introduction to Probability, 2nd Edition, by Dimitri P. Bertsekas and
John N. Tsitsiklis, 2008, ISBN 978-1-886529-23-6, 544 pages
11. Convex Analysis and Optimization, by Dimitri P. Bertsekas, Angelia
Nedić, and Asuman E. Ozdaglar, 2003, ISBN 1-886529-45-0, 560 pages
12. Network Optimization: Continuous and Discrete Models, by Dimitri
P. Bertsekas, 1998, ISBN 1-886529-02-7, 608 pages
13. Network Flows and Monotropic Optimization, by R. Tyrrell Rockafel-
lar, 1998, ISBN 1-886529-06-X, 634 pages
14. Introduction to Linear Optimization, by Dimitris Bertsimas and John
N. Tsitsiklis, 1997, ISBN 1-886529-19-1, 608 pages
15. Parallel and Distributed Computation: Numerical Methods, by Dim-
itri P. Bertsekas and John N. Tsitsiklis, 1997, ISBN 1-886529-01-9,
718 pages
16. Neuro-Dynamic Programming, by Dimitri P. Bertsekas and John N.
Tsitsiklis, 1996, ISBN 1-886529-10-8, 512 pages
17. Constrained Optimization and Lagrange Multiplier Methods, by Dim-
itri P. Bertsekas, 1996, ISBN 1-886529-04-3, 410 pages
18. Stochastic Optimal Control: The Discrete-Time Case, by Dimitri P.
Bertsekas and Steven E. Shreve, 1996, ISBN 1-886529-03-5, 330 pages
Contents

1. Introduction
   1.1. Structure of Dynamic Programming Problems
   1.2. Abstract Dynamic Programming Models
        1.2.1. Problem Formulation
        1.2.2. Monotonicity and Contraction Properties
        1.2.3. Some Examples
        1.2.4. Reinforcement Learning - Projected and Aggregation Bellman Equations
        1.2.5. Reinforcement Learning - Temporal Difference and Proximal Algorithms
   1.3. Reinforcement Learning - Approximation in Value Space
        1.3.1. Approximation in Value Space for Markovian Decision Problems
        1.3.2. Approximation in Value Space and Newton's Method
        1.3.3. Policy Iteration and Newton's Method
        1.3.4. Approximation in Value Space for General Abstract Dynamic Programming
   1.4. Organization of the Book
   1.5. Notes, Sources, and Exercises

2. Contractive Models
   2.1. Bellman's Equation and Optimality Conditions
   2.2. Limited Lookahead Policies
   2.3. Value Iteration
        2.3.1. Approximate Value Iteration
   2.4. Policy Iteration
        2.4.1. Approximate Policy Iteration
        2.4.2. Approximate Policy Iteration Where Policies Converge
   2.5. Optimistic Policy Iteration and λ-Policy Iteration
        2.5.1. Convergence of Optimistic Policy Iteration
        2.5.2. Approximate Optimistic Policy Iteration
        2.5.3. Randomized Optimistic Policy Iteration
   2.6. Asynchronous Algorithms
        2.6.1. Asynchronous Value Iteration
        2.6.2. Asynchronous Policy Iteration
        2.6.3. Optimistic Asynchronous Policy Iteration with a Uniform Fixed Point
   2.7. Notes, Sources, and Exercises

3. Semicontractive Models
   3.1. Pathologies of Noncontractive DP Models
        3.1.1. Deterministic Shortest Path Problems
        3.1.2. Stochastic Shortest Path Problems
        3.1.3. The Blackmailer's Dilemma
        3.1.4. Linear-Quadratic Problems
        3.1.5. An Intuitive View of Semicontractive Analysis
   3.2. Semicontractive Models and Regular Policies
        3.2.1. S-Regular Policies
        3.2.2. Restricted Optimization over S-Regular Policies
        3.2.3. Policy Iteration Analysis of Bellman's Equation
        3.2.4. Optimistic Policy Iteration and λ-Policy Iteration
        3.2.5. A Mathematical Programming Approach
   3.3. Irregular Policies/Infinite Cost Case
   3.4. Irregular Policies/Finite Cost Case - A Perturbation Approach
   3.5. Applications in Shortest Path and Other Contexts
        3.5.1. Stochastic Shortest Path Problems
        3.5.2. Affine Monotonic Problems
        3.5.3. Robust Shortest Path Planning
        3.5.4. Linear-Quadratic Optimal Control
        3.5.5. Continuous-State Deterministic Optimal Control
   3.6. Algorithms
        3.6.1. Asynchronous Value Iteration
        3.6.2. Asynchronous Policy Iteration
   3.7. Notes, Sources, and Exercises

4. Noncontractive Models
   4.1. Noncontractive Models - Problem Formulation
   4.2. Finite Horizon Problems
   4.3. Infinite Horizon Problems
        4.3.1. Fixed Point Properties and Optimality Conditions
        4.3.2. Value Iteration
        4.3.3. Exact and Optimistic Policy Iteration - λ-Policy Iteration
   4.4. Regularity and Nonstationary Policies
        4.4.1. Regularity and Monotone Increasing Models
        4.4.2. Nonnegative Cost Stochastic Optimal Control
        4.4.3. Discounted Stochastic Optimal Control
        4.4.4. Convergent Models
   4.5. Stable Policies for Deterministic Optimal Control
        4.5.1. Forcing Functions and p-Stable Policies
        4.5.2. Restricted Optimization over Stable Policies
        4.5.3. Policy Iteration Methods
   4.6. Infinite-Spaces Stochastic Shortest Path Problems
        4.6.1. The Multiplicity of Solutions of Bellman's Equation
        4.6.2. The Case of Bounded Cost per Stage
   4.7. Notes, Sources, and Exercises

5. Sequential Zero-Sum Games and Minimax Control
   5.1. Introduction
   5.2. Relations to Single Player Abstract DP Formulations
   5.3. A New PI Algorithm for Abstract Minimax DP Problems
   5.4. Convergence Analysis
   5.5. Approximation by Aggregation
   5.6. Notes and Sources

Appendix A: Notation and Mathematical Conventions
   A.1. Set Notation and Conventions
   A.2. Functions

Appendix B: Contraction Mappings
   B.1. Contraction Mapping Fixed Point Theorems
   B.2. Weighted Sup-Norm Contractions

References

Index
Preface of the First Edition

This book aims at a unified and economical development of the core the-
ory and algorithms of total cost sequential decision problems, based on
the strong connections of the subject with fixed point theory. The analy-
sis focuses on the abstract mapping that underlies dynamic programming
(DP for short) and defines the mathematical character of the associated
problem. Our discussion centers on two fundamental properties that this
mapping may have: monotonicity and (weighted sup-norm) contraction. It
turns out that the nature of the analytical and algorithmic DP theory is
determined primarily by the presence or absence of these two properties,
and the rest of the problem’s structure is largely inconsequential.
In this book, with some minor exceptions, we will assume that mono-
tonicity holds. Consequently, we organize our treatment around the con-
traction property, and we focus on four main classes of models:
(a) Contractive models, discussed in Chapter 2, which have the richest
and strongest theory, and are the benchmark against which the the-
ory of other models is compared. Prominent among these models are
discounted stochastic optimal control problems. The development of
these models is quite thorough and includes the analysis of recent ap-
proximation algorithms for large-scale problems (neuro-dynamic pro-
gramming, reinforcement learning).
(b) Semicontractive models, discussed in Chapter 3 and parts of Chap-
ter 4. The term “semicontractive” is used qualitatively here, to refer
to a variety of models where some policies have a regularity/contrac-
tion-like property but others do not. A prominent example is stochas-
tic shortest path problems, where one aims to drive the state of
a Markov chain to a termination state at minimum expected cost.
These models also have a strong theory under certain conditions, of-
ten nearly as strong as those of the contractive models.
(c) Noncontractive models, discussed in Chapter 4, which rely on just
monotonicity. These models are more complex than the preceding
ones and much of the theory of the contractive models generalizes in
weaker form, if at all. For example, in general the associated Bell-
man equation need not have a unique solution, the value iteration
method may work starting with some functions but not with others,
and the policy iteration method may not work at all. Infinite hori-
zon examples of these models are the classical positive and negative
DP problems, first analyzed by Dubins and Savage, Blackwell, and
Strauch, which are discussed in various sources. Some new semicontractive
models are also discussed in this chapter, further bridging
the gap between contractive and noncontractive models.
(d) Restricted policies and Borel space models, which are discussed
in Chapter 5. These models are motivated in part by the complex
measurability questions that arise in mathematically rigorous theories
of stochastic optimal control involving continuous probability spaces.
Within this context, the admissible policies and DP mapping are
restricted to have certain measurability properties, and the analysis
of the preceding chapters requires modifications. Restricted policy
models are also useful when there is a special class of policies with
favorable structure, which is “closed” with respect to the standard DP
operations, in the sense that analysis and algorithms can be confined
within this class.
We do not consider average cost DP problems, whose character bears
a much closer connection to stochastic processes than to total cost prob-
lems. We also do not address specific stochastic characteristics underlying
the problem, such as for example a Markovian structure. Thus our re-
sults apply equally well to Markovian decision problems and to sequential
minimax problems. While this makes our development general and a con-
venient starting point for the further analysis of a variety of different types
of problems, it also ignores some of the interesting characteristics of special
types of DP problems that require an intricate probabilistic analysis.
Let us describe the research content of the book in summary, de-
ferring a more detailed discussion to the end-of-chapter notes. A large
portion of our analysis has been known for a long time, but in a somewhat
fragmentary form. In particular, the contractive theory, first developed by
Denardo [Den67], has been known for the case of the unweighted sup-norm,
but does not cover the important special case of stochastic shortest path
problems where all policies are proper. Chapter 2 transcribes this theory
to the weighted sup-norm contraction case. Moreover, Chapter 2 develops
extensions of the theory to approximate DP, and includes material on asyn-
chronous value iteration (based on the author’s work [Ber82], [Ber83]), and
asynchronous policy iteration algorithms (based on the author’s joint work
with Huizhen (Janey) Yu [BeY10a], [BeY10b], [YuB11a]). Most of this
material is relatively new, having been presented in the author’s recent
book [Ber12a] and survey paper [Ber12b], with detailed references given
there. The analysis of infinite horizon noncontractive models in Chapter 4
was first given in the author’s paper [Ber77], and was also presented in the
book by Bertsekas and Shreve [BeS78], which in addition contains much
of the material on finite horizon problems, restricted policies models, and
Borel space models. These were the starting point and main sources for
our development.
The new research presented in this book is primarily on the semicontractive
models of Chapter 3 and parts of Chapter 4. Traditionally,
the theory of total cost infinite horizon DP has been bordered by two ex-
tremes: discounted models, which have a contractive nature, and positive
and negative models, which do not have a contractive nature, but rely
on an enhanced monotonicity structure (monotone increase and monotone
decrease models, or in classical DP terms, positive and negative models).
Between these two extremes lies a gray area of problems that are not con-
tractive, and either do not fit into the categories of positive and negative
models, or possess additional structure that is not exploited by the theory
of these models. Included are stochastic shortest path problems, search
problems, linear-quadratic problems, a host of queueing problems, multi-
plicative and exponential cost models, and others. Together these problems
represent an important part of the infinite horizon total cost DP landscape.
They possess important theoretical characteristics, not generally available
for positive and negative models, such as the uniqueness of solution of Bell-
man’s equation within a subset of interest, and the validity of useful forms
of value and policy iteration algorithms.
Our semicontractive models aim to provide a unifying abstract DP
structure for problems in this gray area between contractive and noncon-
tractive models. The analysis is motivated in part by stochastic shortest
path problems, where there are two types of policies: proper, which are
the ones that lead to the termination state with probability one from all
starting states, and improper, which are the ones that are not proper.
Proper and improper policies can also be characterized through their Bell-
man equation mapping: for the former this mapping is a contraction, while
for the latter it is not. In our more general semicontractive models, policies
are also characterized in terms of their Bellman equation mapping, through
a notion of regularity, which generalizes the notion of a proper policy and
is related to classical notions of asymptotic stability from control theory.
In our development a policy is regular within a certain set if its cost
function is the unique asymptotically stable equilibrium (fixed point) of
the associated DP mapping within that set. We assume that some policies
are regular while others are not, and impose various assumptions to ensure
that attention can be focused on the regular policies. From an analytical
point of view, this brings to bear the theory of fixed points of monotone
mappings. From the practical point of view, this allows application to a
diverse collection of interesting problems, ranging from stochastic short-
est path problems of various kinds, where the regular policies include the
proper policies, to linear-quadratic problems, where the regular policies
include the stabilizing linear feedback controllers.
The definition of regularity is introduced in Chapter 3, and its theoret-
ical ramifications are explored through extensions of the classical stochastic
shortest path and search problems. In Chapter 4, semicontractive models
are discussed in the presence of additional monotonicity structure, which
brings to bear the properties of positive and negative DP models. With the
aid of this structure, the theory of semicontractive models can be strengthened
and can be applied to several additional problems, including risk-
sensitive/exponential cost problems.
The book has a theoretical research monograph character, but re-
quires a modest mathematical background for all chapters except the last
one, essentially a first course in analysis. Of course, prior exposure to DP
will definitely be very helpful to provide orientation and context. A few
exercises have been included, either to illustrate the theory with exam-
ples and counterexamples, or to provide applications and extensions of the
theory. Solutions of all the exercises can be found in Appendix D, at the
book’s internet site
http://www.athenasc.com/abstractdp.html
and at the author’s web site
http://web.mit.edu/dimitrib/www/home.html
Additional exercises and other related material may be added to these sites
over time.
I would like to express my appreciation to a few colleagues for inter-
actions, recent and old, which have helped shape the form of the book. My
collaboration with Steven Shreve on our 1978 book provided the motivation
and the background for the material on models with restricted policies and
associated measurability questions. My collaboration with John Tsitsiklis
on stochastic shortest path problems provided inspiration for the work on
semicontractive models. My collaboration with Janey (Huizhen) Yu played
an important role in the book’s development, and is reflected in our joint
work on asynchronous policy iteration, on perturbation models, and on
risk-sensitive models. Moreover Janey contributed significantly to the ma-
terial on semicontractive models with many insightful suggestions. Finally,
I am thankful to Mengdi Wang, who went through portions of the book
with care, and gave several helpful comments.

Dimitri P. Bertsekas
Spring 2013

Preface to the Second Edition

The second edition aims primarily to amplify the presentation of the semi-
contractive models of Chapter 3 and Chapter 4, and to supplement it with
a broad spectrum of research results that I obtained and published in jour-
nals and reports since the first edition was written. As a result, the size
of this material more than doubled, and the size of the book increased by
about 40%.
In particular, I have thoroughly rewritten Chapter 3, which deals with
semicontractive models where stationary regular policies are sufficient. I
expanded and streamlined the theoretical framework, and I provided new
analyses of a number of shortest path-type applications (deterministic,
stochastic, affine monotonic, exponential cost, and robust/minimax), as
well as several types of optimal control problems with continuous state
space (including linear-quadratic, regulation, and planning problems).
In Chapter 4, I have extended the notion of regularity to nonstation-
ary policies (Section 4.4), aiming to explore the structure of the solution set
of Bellman’s equation, and the connection of optimality with other struc-
tural properties of optimal control problems. As an application, I have
discussed in Section 4.5 the relation of optimality with classical notions
of stability and controllability in continuous-spaces deterministic optimal
control. In Section 4.6, I have similarly extended the notion of a proper
policy to continuous-spaces stochastic shortest path problems.
I have also revised Chapter 1 a little (mainly with the addition of
Section 1.2.5 on the relation between proximal algorithms and temporal
difference methods), added to Chapter 2 some analysis relating to λ-policy
iteration and randomized policy iteration algorithms (Section 2.5.3), and I
have also added several new exercises (with complete solutions) to Chapters
1-4. Additional material relating to various applications can be found in
some of my journal papers, reports, and video lectures on semicontractive
models, which are posted at my web site.
In addition to the changes in Chapters 1-4, I have also eliminated from
the second edition the analysis that deals with restricted policies (Chap-
ter 5 and Appendix C of the first edition). This analysis is motivated in
part by the complex measurability questions that arise in mathematically
rigorous theories of stochastic optimal control with Borel state and control
spaces. This material is covered in Chapter 6 of the monograph by Bert-
sekas and Shreve [BeS78], and followup research on the subject has been
limited. Thus, I decided to just post Chapter 5 and Appendix C of the first
edition at the book’s web site (40 pages), and omit them from the second
edition. As a result of this choice, the entire book now requires only a
modest mathematical background, essentially a first course in analysis and
in elementary probability.
The range of applications of dynamic programming has grown enor-
mously in the last 25 years, thanks to the use of approximate simulation-
based methods for large and challenging problems. Because approximations
are often tied to special characteristics of specific models, their coverage in
this book is limited to general discussions in Chapter 1 and to error bounds
given in Chapter 2. However, much of the work on approximation methods
so far has focused on finite-state discounted, and relatively simple deter-
ministic and stochastic shortest path problems, for which there is solid and
robust analytical and algorithmic theory (part of Chapters 2 and 3 in this
monograph). As the range of applications becomes broader, I expect that
the level of mathematical understanding projected in this book will become
essential for the development of effective and reliable solution methods. In
particular, much of the new material in this edition deals with infinite-state
and/or complex shortest path-type problems, whose approximate solution
will require new methodologies that transcend the current state of the art.

Dimitri P. Bertsekas
January 2018

Preface to the Third Edition


The third edition is based on the same theoretical framework as the sec-
ond edition, but contains two major additions. The first is to highlight
the central role of abstract DP methods in the conceptualization of re-
inforcement learning and approximate DP methods, as described in the
author’s recent book “Lessons from AlphaZero for Optimal, Model Predic-
tive, and Adaptive Control,” Athena Scientific, 2022. The main idea here
is that approximation in value space with one-step lookahead amounts to
a step of Newton’s method for solving the abstract Bellman’s equation.
This material is included in summary form in view of its strong reliance on
abstract DP visualization. Our presentation relies primarily on geometric
illustrations rather than mathematical analysis, and is given in Section 1.3.
The second addition is a new Chapter 5 on abstract DP methods for
minimax and zero sum game problems, which is based on the author’s re-
cent paper [Ber21c]. A primary motivation here is the resolution of some
long-standing convergence difficulties of the “natural” policy iteration algo-
rithm, which have been known since the Pollatschek and Avi-Itzhak method
[PoA69] for finite-state Markov games. Mathematically, this “natural” al-
gorithm is a form of Newton’s method for solving the corresponding Bell-
man’s equation, but Newton’s method, contrary to the case of single-player
DP problems, is not globally convergent in the case of a minimax problem,
because the Bellman operator may have components that are neither con-
vex nor concave. Our approach in Chapter 5 has been to introduce a special
type of abstract Bellman operator for minimax problems, and modify the
standard PI algorithm along the lines of the asynchronous optimistic PI al-
gorithm of Section 2.6.3, which involves a parametric contraction mapping
with a uniform fixed point.
The third edition also contains a number of small corrections and
editorial changes. The author wishes to thank the contributions of several
colleagues in this regard, and particularly Yuchao Li, who proofread with
care large portions of the book.

Dimitri P. Bertsekas
February 2022
1

Introduction

Contents

1.1. Structure of Dynamic Programming Problems
1.2. Abstract Dynamic Programming Models
     1.2.1. Problem Formulation
     1.2.2. Monotonicity and Contraction Properties
     1.2.3. Some Examples
     1.2.4. Reinforcement Learning - Projected and Aggregation Bellman Equations
     1.2.5. Reinforcement Learning - Temporal Difference and Proximal Algorithms
1.3. Reinforcement Learning - Approximation in Value Space
     1.3.1. Approximation in Value Space for Markovian Decision Problems
     1.3.2. Approximation in Value Space and Newton's Method
     1.3.3. Policy Iteration and Newton's Method
     1.3.4. Approximation in Value Space for General Abstract Dynamic Programming
1.4. Organization of the Book
1.5. Notes, Sources, and Exercises


1.1 STRUCTURE OF DYNAMIC PROGRAMMING PROBLEMS

Dynamic programming (DP for short) is the principal method for analysis
of a large and diverse class of sequential decision problems. Examples are
deterministic and stochastic optimal control problems with a continuous
state space, Markov and semi-Markov decision problems with a discrete
state space, minimax problems, and sequential zero-sum games. While the
nature of these problems may vary widely, their underlying structures turn
out to be very similar. In all cases there is an underlying mapping that
depends on an associated controlled dynamic system and corresponding
cost per stage. This mapping, the DP (or Bellman) operator, provides a
compact “mathematical signature” of the problem. It defines the cost func-
tion of policies and the optimal cost function, and it provides a convenient
shorthand notation for algorithmic description and analysis.
More importantly, the structure of the DP operator defines the math-
ematical character of the associated problem. The purpose of this book is to
provide an analysis of this structure, centering on two fundamental prop-
erties: monotonicity and (weighted sup-norm) contraction. It turns out
that the nature of the analytical and algorithmic DP theory is determined
primarily by the presence or absence of one or both of these two properties,
and the rest of the problem’s structure is largely inconsequential.

A Deterministic Optimal Control Example

To illustrate our viewpoint, let us consider a discrete-time deterministic
optimal control problem described by a system equation
$$x_{k+1} = f(x_k, u_k), \qquad k = 0, 1, \ldots. \tag{1.1}$$
Here $x_k$ is the state of the system taking values in a set $X$ (the state space),
and $u_k$ is the control taking values in a set $U$ (the control space).† At stage
$k$, there is a cost
$$\alpha^k g(x_k, u_k)$$
incurred when $u_k$ is applied at state $x_k$, where $\alpha$ is a scalar in $(0, 1]$ that has
the interpretation of a discount factor when $\alpha < 1$. The controls are chosen
as a function of the current state, subject to a constraint that depends on
that state. In particular, at state $x$ the control is constrained to take values
in a given set $U(x) \subset U$. Thus we are interested in optimization over the
set of (nonstationary) policies
$$\Pi = \bigl\{ \{\mu_0, \mu_1, \ldots\} \mid \mu_k \in \mathcal{M},\ k = 0, 1, \ldots \bigr\},$$

† Our discussion of this section is somewhat informal, without strict adherence
to mathematical notation and rigor. We will introduce a rigorous mathematical
framework later.
where $\mathcal{M}$ is the set of functions $\mu : X \mapsto U$ defined by
$$\mathcal{M} = \bigl\{ \mu \mid \mu(x) \in U(x),\ \forall\, x \in X \bigr\}.$$

The total cost of a policy $\pi = \{\mu_0, \mu_1, \ldots\}$ over an infinite number of
stages (an infinite horizon) and starting at an initial state $x_0$ is the limit
superior of the $N$-step costs
$$J_\pi(x_0) = \limsup_{N \to \infty} \sum_{k=0}^{N-1} \alpha^k g\bigl(x_k, \mu_k(x_k)\bigr), \tag{1.2}$$
where the state sequence $\{x_k\}$ is generated by the deterministic system
(1.1) under the policy $\pi$:
$$x_{k+1} = f\bigl(x_k, \mu_k(x_k)\bigr), \qquad k = 0, 1, \ldots.$$

(We use limit superior rather than limit to cover the case where the limit
does not exist.) The optimal cost function is

$$J^*(x) = \inf_{\pi \in \Pi} J_\pi(x), \qquad x \in X.$$

For any policy $\pi = \{\mu_0, \mu_1, \ldots\}$, consider the policy $\pi_1 = \{\mu_1, \mu_2, \ldots\}$
and write by using Eq. (1.2),
$$J_\pi(x) = g\bigl(x, \mu_0(x)\bigr) + \alpha J_{\pi_1}\bigl(f(x, \mu_0(x))\bigr).$$
We have for all $x \in X$
$$\begin{aligned}
J^*(x) &= \inf_{\pi = \{\mu_0, \pi_1\} \in \Pi} \Bigl\{ g\bigl(x, \mu_0(x)\bigr) + \alpha J_{\pi_1}\bigl(f(x, \mu_0(x))\bigr) \Bigr\} \\
&= \inf_{\mu_0 \in \mathcal{M}} \Bigl\{ g\bigl(x, \mu_0(x)\bigr) + \alpha \inf_{\pi_1 \in \Pi} J_{\pi_1}\bigl(f(x, \mu_0(x))\bigr) \Bigr\} \\
&= \inf_{\mu_0 \in \mathcal{M}} \Bigl\{ g\bigl(x, \mu_0(x)\bigr) + \alpha J^*\bigl(f(x, \mu_0(x))\bigr) \Bigr\}.
\end{aligned}$$
The minimization over $\mu_0 \in \mathcal{M}$ can be written as minimization over all
$u \in U(x)$, so we can write the preceding equation as
$$J^*(x) = \inf_{u \in U(x)} \Bigl\{ g(x, u) + \alpha J^*\bigl(f(x, u)\bigr) \Bigr\}, \qquad \forall\, x \in X. \tag{1.3}$$

This equation is an example of Bellman's equation, which plays a
central role in DP analysis and algorithms. If it can be solved for $J^*$,
an optimal stationary policy $\{\mu^*, \mu^*, \ldots\}$ may typically be obtained by
minimization of the right-hand side for each $x$, i.e.,
$$\mu^*(x) \in \arg\min_{u \in U(x)} \Bigl\{ g(x, u) + \alpha J^*\bigl(f(x, u)\bigr) \Bigr\}, \qquad \forall\, x \in X. \tag{1.4}$$

We now note that both Eqs. (1.3) and (1.4) can be stated in terms of
the expression
$$H(x, u, J) = g(x, u) + \alpha J\bigl(f(x, u)\bigr), \qquad x \in X,\ u \in U(x).$$
Defining
$$(T_\mu J)(x) = H\bigl(x, \mu(x), J\bigr), \qquad x \in X,$$
and
$$(TJ)(x) = \inf_{u \in U(x)} H(x, u, J), \qquad x \in X,$$
we see that Bellman's equation (1.3) can be written compactly as
$$J^* = TJ^*,$$
i.e., $J^*$ is the fixed point of $T$, viewed as a mapping from the set of functions
on $X$ into itself. Moreover, it can be similarly seen that $J_\mu$, the cost function
of the stationary policy $\{\mu, \mu, \ldots\}$, is a fixed point of $T_\mu$. In addition, the
optimality condition (1.4) can be stated compactly as
$$T_{\mu^*} J^* = TJ^*.$$
We will see later that additional properties, as well as a variety of algorithms
for finding $J^*$, can be stated and analyzed using the mappings $T$ and $T_\mu$.
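To make the operator viewpoint concrete, here is a minimal sketch in Python (not taken from the book) of how $H$, $T_\mu$, and $T$ might be coded for a toy deterministic problem with finitely many states and controls. The state space, control constraints, dynamics $f$, cost $g$, and discount factor below are hypothetical choices made purely for illustration.

# A minimal sketch (illustration only): the abstract mappings T_mu and T
# for a toy deterministic problem with finite X and U(x).  All problem data
# (states, controls, f, g, ALPHA) are hypothetical.

ALPHA = 0.9
X = [0, 1, 2]                          # state space
U = {0: [0, 1], 1: [0, 1], 2: [0]}     # feasible controls U(x)

def f(x, u):                           # deterministic dynamics x_{k+1} = f(x_k, u_k)
    return min(x + u, 2)

def g(x, u):                           # cost per stage
    return 1.0 if x < 2 else 0.0

def H(x, u, J):                        # H(x, u, J) = g(x, u) + alpha * J(f(x, u))
    return g(x, u) + ALPHA * J[f(x, u)]

def T_mu(mu, J):                       # (T_mu J)(x) = H(x, mu(x), J)
    return {x: H(x, mu(x), J) for x in X}

def T(J):                              # (T J)(x) = min over u in U(x) of H(x, u, J)
    return {x: min(H(x, u, J) for u in U[x]) for x in X}

J_bar = {x: 0.0 for x in X}            # the zero function
print(T(J_bar))                        # one application of the Bellman operator

In this notation, Bellman's equation $J^* = TJ^*$ says that the optimal cost function is a fixed point of the function T above, which is exactly what the value and policy iteration methods discussed later exploit.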
The mappings $T_\mu$ can also be used in the context of DP problems
with a finite number of stages (a finite horizon). In particular, for a given
policy $\pi = \{\mu_0, \mu_1, \ldots\}$ and a terminal cost $\alpha^N \bar{J}(x_N)$ for the state $x_N$ at
the end of $N$ stages, consider the $N$-stage cost function
$$J_{\pi,N}(x_0) = \alpha^N \bar{J}(x_N) + \sum_{k=0}^{N-1} \alpha^k g\bigl(x_k, \mu_k(x_k)\bigr). \tag{1.5}$$
Then it can be verified by induction that for all initial states $x_0$, we have
$$J_{\pi,N}(x_0) = (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} \bar{J})(x_0). \tag{1.6}$$
Here $T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}}$ is the composition of the mappings $T_{\mu_0}, T_{\mu_1}, \ldots, T_{\mu_{N-1}}$,
i.e., for all $J$,
$$(T_{\mu_0} T_{\mu_1} J)(x) = \bigl(T_{\mu_0}(T_{\mu_1} J)\bigr)(x), \qquad x \in X,$$
and more generally
$$(T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} J)(x) = \bigl(T_{\mu_0}(T_{\mu_1}(\cdots(T_{\mu_{N-1}} J)))\bigr)(x), \qquad x \in X$$
(our notational conventions are summarized in Appendix A). Thus the
finite horizon cost functions $J_{\pi,N}$ of $\pi$ can be defined in terms of the mappings
$T_\mu$ [cf. Eq. (1.6)], and so can the infinite horizon cost function $J_\pi$:
$$J_\pi(x) = \limsup_{N \to \infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} \bar{J})(x), \qquad x \in X, \tag{1.7}$$
where $\bar{J}$ is the zero function, $\bar{J}(x) = 0$ for all $x \in X$.
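Equation (1.6) can also be mirrored directly in code: the $N$-stage cost of a nonstationary policy is obtained by applying the operators $T_{\mu_{N-1}}, \ldots, T_{\mu_0}$ to the terminal function, innermost first. The short sketch below reuses T_mu, X, U, and J_bar from the preceding sketch; the three-stage policy is again a hypothetical choice.

def n_stage_cost(policies, J_terminal):
    # Compute (T_{mu_0} T_{mu_1} ... T_{mu_{N-1}} J_terminal): the rightmost
    # operator T_{mu_{N-1}} acts on J_terminal first [cf. Eq. (1.6)].
    J = dict(J_terminal)
    for mu in reversed(policies):
        J = T_mu(mu, J)
    return J

mu_a = lambda x: 0                     # two stationary "building block" policies
mu_b = lambda x: U[x][-1]
pi = [mu_a, mu_b, mu_a]                # a 3-stage nonstationary policy {mu_0, mu_1, mu_2}
print(n_stage_cost(pi, J_bar))         # J_{pi,3}(x) for each state x

Letting the number of stages grow and taking the lim sup of these values gives $J_\pi$ as in Eq. (1.7).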

Connection with Fixed Point Methodology

The Bellman equation (1.3) and the optimality condition (1.4), stated in
terms of the mappings Tµ and T , highlight a central theme of this book,
which is that DP theory is intimately connected with the theory of abstract
mappings and their fixed points. Analogs of the Bellman equation, J * =
T J * , optimality conditions, and other results and computational methods
hold for a great variety of DP models, and can be stated compactly as
described above in terms of the corresponding mappings Tµ and T . The
gain from this abstraction is greater generality and mathematical insight,
as well as a more unified, economical, and streamlined analysis.

1.2 ABSTRACT DYNAMIC PROGRAMMING MODELS

In this section we formally introduce and illustrate with examples an abstract
DP model, which embodies the ideas just discussed in Section 1.1.

1.2.1 Problem Formulation

Let $X$ and $U$ be two sets, which we loosely refer to as a set of "states"
and a set of "controls," respectively. For each $x \in X$, let $U(x) \subset U$ be a
nonempty subset of controls that are feasible at state $x$. We denote by $\mathcal{M}$
the set of all functions $\mu : X \mapsto U$ with $\mu(x) \in U(x)$, for all $x \in X$.
In analogy with DP, we refer to sequences π = {µ0 , µ1 , . . .}, with
µk ∈ M for all k, as “nonstationary policies,” and we refer to a sequence
{µ, µ, . . .}, with µ ∈ M, as a “stationary policy.” In our development,
stationary policies will play a dominant role, and with slight abuse of ter-
minology, we will also refer to any µ ∈ M as a “policy” when confusion
cannot arise.
Let $R(X)$ be the set of real-valued functions $J : X \mapsto \Re$, and let
$H : X \times U \times R(X) \mapsto \Re$ be a given mapping.† For each policy $\mu \in \mathcal{M}$, we
consider the mapping $T_\mu : R(X) \mapsto R(X)$ defined by
$$(T_\mu J)(x) = H\bigl(x, \mu(x), J\bigr), \qquad \forall\, x \in X,\ J \in R(X),$$
and we also consider the mapping $T$ defined by‡
$$(TJ)(x) = \inf_{u \in U(x)} H(x, u, J), \qquad \forall\, x \in X,\ J \in R(X).$$

† Our notation and mathematical conventions are outlined in Appendix A.
In particular, we denote by $\Re$ the set of real numbers, and by $\Re^n$ the space of
$n$-dimensional vectors with real components.
‡ We assume that $H$, $T_\mu J$, and $TJ$ are real-valued for $J \in R(X)$ in the
present chapter and in Chapter 2. In Chapters 3 and 4 we will allow $H(x, u, J)$,
and hence also $(T_\mu J)(x)$ and $(TJ)(x)$, to take the values $\infty$ and $-\infty$.

We will generally refer to $T$ and $T_\mu$ as the (abstract) DP mappings or DP
operators or Bellman operators (the latter name is common in the artificial
intelligence and reinforcement learning literature).
Similar to the deterministic optimal control problem of the preceding
section, the mappings $T_\mu$ and $T$ serve to define a multistage optimization
problem and a DP-like methodology for its solution. In particular, for some
function $\bar{J} \in R(X)$, and nonstationary policy $\pi = \{\mu_0, \mu_1, \ldots\}$, we define
for each integer $N \ge 1$ the functions
$$J_{\pi,N}(x) = (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} \bar{J})(x), \qquad x \in X,$$
where $T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}}$ denotes the composition of the mappings $T_{\mu_0}, T_{\mu_1},
\ldots, T_{\mu_{N-1}}$, i.e.,
$$T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} J = T_{\mu_0}\bigl(T_{\mu_1}(\cdots(T_{\mu_{N-2}}(T_{\mu_{N-1}} J)))\cdots\bigr), \qquad J \in R(X).$$
We view $J_{\pi,N}$ as the "$N$-stage cost function" of $\pi$ [cf. Eq. (1.5)]. Consider
also the function
$$J_\pi(x) = \limsup_{N \to \infty} J_{\pi,N}(x) = \limsup_{N \to \infty} (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} \bar{J})(x), \qquad x \in X,$$
which we view as the "infinite horizon cost function" of $\pi$ [cf. Eq. (1.7); we
use lim sup for generality, since we are not assured that the limit exists].
We want to minimize $J_\pi$ over $\pi$, i.e., to find
$$J^*(x) = \inf_\pi J_\pi(x), \qquad x \in X,$$
and a policy $\pi^*$ that attains the infimum, if one exists.


The key connection with fixed point methodology is that $J^*$ "typically"
(under mild assumptions) can be shown to satisfy
$$J^*(x) = \inf_{u \in U(x)} H(x, u, J^*), \qquad \forall\, x \in X,$$
i.e., it is a fixed point of $T$. We refer to this as Bellman's equation [cf. Eq.
(1.3)]. Another fact is that if an optimal policy $\pi^*$ exists, it "typically" can
be selected to be stationary, $\pi^* = \{\mu^*, \mu^*, \ldots\}$, with $\mu^* \in \mathcal{M}$ satisfying an
optimality condition, such as for example
$$(T_{\mu^*} J^*)(x) = (TJ^*)(x), \qquad x \in X,$$
[cf. Eq. (1.4)]. Several other results of an analytical or algorithmic nature
also hold under appropriate conditions, which will be discussed in detail
later.
However, Bellman’s equation and other related results may not hold
without Tµ and T having some special structural properties. Prominent
among these are a monotonicity assumption that typically holds in DP
problems, and a contraction assumption that holds for some important
classes of problems. We describe these assumptions next.

1.2.2 Monotonicity and Contraction Properties

Let us now formalize the monotonicity and contraction assumptions. We
will require that both of these assumptions hold for most of the next chapter,
and we will gradually relax the contraction assumption in Chapters 3
and 4. Recall also our assumption that $T_\mu$ and $T$ map $R(X)$ (the space
of real-valued functions over $X$) into $R(X)$. In Chapters 3 and 4 we will
relax this assumption as well.

Assumption 1.2.1: (Monotonicity) If $J, J' \in R(X)$ and $J \le J'$,
then
$$H(x, u, J) \le H(x, u, J'), \qquad \forall\, x \in X,\ u \in U(x).$$

Note that by taking infimum over $u \in U(x)$, we have
$$J(x) \le J'(x),\ \forall\, x \in X \quad \Rightarrow \quad \inf_{u \in U(x)} H(x, u, J) \le \inf_{u \in U(x)} H(x, u, J'),\ \forall\, x \in X,$$
or equivalently,†
$$J \le J' \quad \Rightarrow \quad TJ \le TJ'.$$
Another way to arrive at this relation is to note that the monotonicity
assumption is equivalent to
$$J \le J' \quad \Rightarrow \quad T_\mu J \le T_\mu J', \qquad \forall\, \mu \in \mathcal{M},$$
and to use the simple but important fact
$$\inf_{u \in U(x)} H(x, u, J) = \inf_{\mu \in \mathcal{M}} (T_\mu J)(x), \qquad \forall\, x \in X,\ J \in R(X),$$
i.e., for a fixed $x \in X$, infimum over $u$ is equivalent to infimum over $\mu$.
This is true because for any $\mu$, there is no coupling constraint between the
controls $\mu(x)$ and $\mu(x')$ that correspond to two different states $x$ and $x'$, i.e.,
the set $\mathcal{M} = \bigl\{ \mu \mid \mu(x) \in U(x),\ \forall\, x \in X \bigr\}$ can be viewed as the Cartesian
product $\prod_{x \in X} U(x)$. We will be writing this relation as $TJ = \inf_{\mu \in \mathcal{M}} T_\mu J$.
For the contraction assumption, we introduce a function $v : X \mapsto \Re$
with
$$v(x) > 0, \qquad \forall\, x \in X.$$

† Unless otherwise stated, in this book, inequalities involving functions, minima
and infima of a collection of functions, and limits of function sequences are
meant to be pointwise; see Appendix A for our notational conventions.

Figure 1.2.1. Illustration of the monotonicity and the contraction assumptions in
one dimension. The mapping $T_\mu$ on the left is monotone but is not a contraction.
The mapping $T_\mu$ on the right is both monotone and a contraction. It has a unique
fixed point at $J_\mu$.

Let us denote by $B(X)$ the space of real-valued functions $J$ on $X$ such
that $J(x)/v(x)$ is bounded as $x$ ranges over $X$, and consider the weighted
sup-norm
$$\|J\| = \sup_{x \in X} \frac{|J(x)|}{v(x)}$$
on $B(X)$. The properties of $B(X)$ and some of the associated fixed point
theory are discussed in Appendix B. In particular, as shown there, $B(X)$
is a complete normed space, so any mapping from $B(X)$ to $B(X)$ that is a
contraction or an $m$-stage contraction for some integer $m > 1$, with respect
to $\|\cdot\|$, has a unique fixed point (cf. Props. B.1 and B.2).
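As a small numerical illustration (not from the text), the weighted sup-norm on a finite state space, together with the induced distance that appears in the contraction condition below, can be computed as follows; the weight function v is an arbitrary positive choice.

# Sketch: the weighted sup-norm ||J|| = sup_x |J(x)| / v(x) on a finite state
# space, and the induced distance ||J1 - J2||.  The weights v(x) > 0 are an
# arbitrary positive choice for illustration.

X = [0, 1, 2]
v = {0: 1.0, 1: 2.0, 2: 4.0}

def weighted_sup_norm(J):
    return max(abs(J[x]) / v[x] for x in X)

def distance(J1, J2):
    return weighted_sup_norm({x: J1[x] - J2[x] for x in X})

J1 = {0: 1.0, 1: -3.0, 2: 2.0}
J2 = {0: 0.0, 1: 1.0, 2: 6.0}
print(weighted_sup_norm(J1), distance(J1, J2))

Checking the contraction condition (1.8) for a given mapping then amounts to verifying that the distance between $T_\mu J_1$ and $T_\mu J_2$ is at most $\alpha$ times the distance between $J_1$ and $J_2$.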

Assumption 1.2.2: (Contraction) For all $J \in B(X)$ and $\mu \in \mathcal{M}$,
the functions $T_\mu J$ and $TJ$ belong to $B(X)$. Furthermore, for some
$\alpha \in (0, 1)$, we have
$$\|T_\mu J - T_\mu J'\| \le \alpha \|J - J'\|, \qquad \forall\, J, J' \in B(X),\ \mu \in \mathcal{M}. \tag{1.8}$$

Figure 1.2.1 illustrates the monotonicity and the contraction assumptions.
It can be shown that the contraction condition (1.8) implies that
$$\|TJ - TJ'\| \le \alpha \|J - J'\|, \qquad \forall\, J, J' \in B(X), \tag{1.9}$$
so that $T$ is also a contraction with modulus $\alpha$. To see this we use Eq.
(1.8) to write
$$(T_\mu J)(x) \le (T_\mu J')(x) + \alpha \|J - J'\|\, v(x), \qquad \forall\, x \in X,$$
from which, by taking infimum of both sides over $\mu \in \mathcal{M}$, we have
$$\frac{(TJ)(x) - (TJ')(x)}{v(x)} \le \alpha \|J - J'\|, \qquad \forall\, x \in X.$$
Reversing the roles of $J$ and $J'$, we also have
$$\frac{(TJ')(x) - (TJ)(x)}{v(x)} \le \alpha \|J - J'\|, \qquad \forall\, x \in X,$$
and combining the preceding two relations, and taking the supremum of
the left side over $x \in X$, we obtain Eq. (1.9).
Nearly all mappings related to DP satisfy the monotonicity assump-
tion, and many important ones satisfy the weighted sup-norm contraction
assumption as well. When both assumptions hold, the most powerful an-
alytical and computational results can be obtained, as we will show in
Chapter 2. These are:
(a) Bellman's equation has a unique solution, i.e., $T$ and $T_\mu$ have unique
fixed points, which are the optimal cost function $J^*$ and the cost
functions $J_\mu$ of the stationary policies $\{\mu, \mu, \ldots\}$, respectively [cf. Eq.
(1.3)].
(b) A stationary policy $\{\mu^*, \mu^*, \ldots\}$ is optimal if and only if
$$T_{\mu^*} J^* = TJ^*$$
[cf. Eq. (1.4)].
(c) $J^*$ and $J_\mu$ can be computed by the value iteration method,
$$J^* = \lim_{k \to \infty} T^k J, \qquad J_\mu = \lim_{k \to \infty} T_\mu^k J,$$
starting with any $J \in B(X)$.
(d) $J^*$ can be computed by the policy iteration method, whereby we generate
a sequence of stationary policies via
$$T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k},$$
starting from some initial policy $\mu^0$ [here $J_{\mu^k}$ is obtained as the fixed
point of $T_{\mu^k}$ by several possible methods, including value iteration as
in (c) above].
These are the most favorable types of results one can hope for in
the DP context, and they are supplemented by a host of other results,
involving approximate and/or asynchronous implementations of the value
and policy iteration methods, and other related methods that combine
features of both. As the contraction property is relaxed and is replaced
by various weaker assumptions, some of the preceding results may hold
in weaker form. For example J * turns out to be a solution of Bellman’s
equation in most of the models to be discussed, but it may not be the
unique solution. The interplay between the monotonicity and contraction-
like properties, and the associated results of the form (a)-(d) described
above is a recurring analytical theme in this book.
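The value iteration and policy iteration methods in (c) and (d) above can be sketched generically in terms of the mappings $T$ and $T_\mu$. The sketch below uses a small randomly generated finite-state discounted MDP; the data P, g, and ALPHA are hypothetical, and the tolerance-based stopping rule is a practical choice rather than part of the theory.

import numpy as np

# Sketch of value iteration (c) and policy iteration (d) for a small
# finite-state discounted MDP, written in terms of T and T_mu.  The data
# (P, g, ALPHA) are hypothetical; the stopping rule is a simple tolerance.

ALPHA = 0.9
n_states, n_controls = 3, 2
rng = np.random.default_rng(0)
P = rng.random((n_controls, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)        # P[u, x, y] = p_xy(u)
g = rng.random((n_controls, n_states))   # g[u, x] = expected cost of u at x

def T_mu(mu, J):                         # (T_mu J)(x) = g(x, mu(x)) + alpha * sum_y p_xy(mu(x)) J(y)
    return np.array([g[mu[x], x] + ALPHA * P[mu[x], x] @ J for x in range(n_states)])

def T(J):                                # (T J)(x) = min_u [ g(x, u) + alpha * sum_y p_xy(u) J(y) ]
    Q = g + ALPHA * P @ J                # Q[u, x]
    return Q.min(axis=0), Q.argmin(axis=0)

def value_iteration(J, tol=1e-10):
    while True:
        TJ, _ = T(J)
        if np.max(np.abs(TJ - J)) < tol:
            return TJ
        J = TJ

def policy_evaluation(mu, J, tol=1e-10):
    while True:                          # fixed point of T_mu, found via value iteration
        TJ = T_mu(mu, J)
        if np.max(np.abs(TJ - J)) < tol:
            return TJ
        J = TJ

def policy_iteration(mu):
    while True:
        J_mu = policy_evaluation(mu, np.zeros(n_states))
        _, mu_new = T(J_mu)              # T_{mu^{k+1}} J_{mu^k} = T J_{mu^k}
        if np.array_equal(mu_new, mu):
            return J_mu, mu
        mu = mu_new

J_star = value_iteration(np.zeros(n_states))
J_pi, mu_star = policy_iteration(np.zeros(n_states, dtype=int))
print(np.max(np.abs(J_star - J_pi)))     # the two methods agree (up to tolerance)

With the contraction property, both loops terminate: $T$ and $T_\mu$ are contractions, and policy iteration can only cycle through finitely many policies.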

1.2.3 Some Examples

In what follows in this section, we describe a few special cases, which indi-
cate the connections of appropriate forms of the mapping H with the most
popular total cost DP models. In all these models the monotonicity As-
sumption 1.2.1 (or some closely related version) holds, but the contraction
Assumption 1.2.2 may not hold, as we will indicate later. Our descriptions
are by necessity brief, and the reader is referred to the relevant textbook
literature for more detailed discussion.

Example 1.2.1 (Stochastic Optimal Control - Markovian Decision Problems)

Consider the stationary discrete-time dynamic system
$$x_{k+1} = f(x_k, u_k, w_k), \qquad k = 0, 1, \ldots, \tag{1.10}$$

where for all $k$, the state $x_k$ is an element of a space $X$, the control $u_k$ is
an element of a space $U$, and $w_k$ is a random "disturbance," an element of a
space W . We consider problems with infinite state and control spaces, as well
as problems with discrete (finite or countable) state space (in which case the
underlying system is a Markov chain). However, for technical reasons that
relate to measure-theoretic issues, we assume that W is a countable set.
The control uk is constrained to take values in a given nonempty subset
U (xk ) of U , which depends on the current state xk [uk ∈ U (xk ), for all
xk ∈ X]. The random disturbances wk , k = 0, 1, . . ., are characterized by
probability distributions P (· | xk , uk ) that are identical for all k, where P (wk |
xk , uk ) is the probability of occurrence of wk , when the current state and
control are xk and uk , respectively. Thus the probability of wk may depend
explicitly on xk and uk , but not on values of prior disturbances wk−1 , . . . , w0 .
Given an initial state $x_0$, we want to find a policy $\pi = \{\mu_0, \mu_1, \ldots\}$,
where $\mu_k : X \mapsto U$, $\mu_k(x_k) \in U(x_k)$, for all $x_k \in X$, $k = 0, 1, \ldots$, that
minimizes the cost function
$$J_\pi(x_0) = \limsup_{N \to \infty} \mathop{E}_{\substack{w_k \\ k=0,1,\ldots}} \left\{ \sum_{k=0}^{N-1} \alpha^k g\bigl(x_k, \mu_k(x_k), w_k\bigr) \right\}, \tag{1.11}$$

where $\alpha \in (0, 1]$ is a discount factor, subject to the system equation constraint
$$x_{k+1} = f\bigl(x_k, \mu_k(x_k), w_k\bigr), \qquad k = 0, 1, \ldots.$$

This is a classical problem, which is discussed extensively in various sources,
including the author's text [Ber12a]. It is usually referred to as the stochastic
optimal control problem or the Markovian Decision Problem (MDP for short).
Note that the expected value of the $N$-stage cost of $\pi$,
$$\mathop{E}_{\substack{w_k \\ k=0,1,\ldots}} \left\{ \sum_{k=0}^{N-1} \alpha^k g\bigl(x_k, \mu_k(x_k), w_k\bigr) \right\},$$
is defined as a (possibly countably infinite) sum, since the disturbances $w_k$,
$k = 0, 1, \ldots$, take values in a countable set. Indeed, the reader may verify
that all the subsequent mathematical expressions that involve an expected
value can be written as summations over a finite or a countable set, so they
make sense without resort to measure-theoretic integration concepts.†
In what follows we will often impose appropriate assumptions on the
cost per stage $g$ and the scalar $\alpha$, which guarantee that the infinite horizon
cost $J_\pi(x_0)$ is defined as a limit (rather than as a lim sup):
$$J_\pi(x_0) = \lim_{N \to \infty} \mathop{E}_{\substack{w_k \\ k=0,1,\ldots}} \left\{ \sum_{k=0}^{N-1} \alpha^k g\bigl(x_k, \mu_k(x_k), w_k\bigr) \right\}.$$

In particular, it can be shown that the limit exists if $\alpha < 1$ and the expected
value of $|g|$ is uniformly bounded, i.e., for some $B > 0$,
$$E\bigl\{ \bigl| g(x, u, w) \bigr| \bigr\} \le B, \qquad \forall\, x \in X,\ u \in U(x). \tag{1.12}$$

In this case, we obtain the classical discounted infinite horizon DP problem,
which generally has the most favorable structure of all infinite horizon
stochastic DP models (see [Ber12a], Chapters 1 and 2).
To make the connection with abstract DP, let us define
$$H(x, u, J) = E\bigl\{ g(x, u, w) + \alpha J\bigl(f(x, u, w)\bigr) \bigr\},$$
so that
$$(T_\mu J)(x) = E\bigl\{ g\bigl(x, \mu(x), w\bigr) + \alpha J\bigl(f(x, \mu(x), w)\bigr) \bigr\},$$
and
$$(TJ)(x) = \inf_{u \in U(x)} E\bigl\{ g(x, u, w) + \alpha J\bigl(f(x, u, w)\bigr) \bigr\}.$$

Similar to the deterministic optimal control problem of Section 1.1, the
$N$-stage cost of $\pi$ can be expressed in terms of $T_\mu$:
$$(T_{\mu_0} \cdots T_{\mu_{N-1}} \bar{J})(x_0) = \mathop{E}_{\substack{w_k \\ k=0,1,\ldots}} \left\{ \sum_{k=0}^{N-1} \alpha^k g\bigl(x_k, \mu_k(x_k), w_k\bigr) \right\},$$

† As noted in Appendix A, the formula for the expected value of a random
variable $w$ defined over a space $\Omega$ is
$$E\{w\} = E\{w^+\} + E\{w^-\},$$
where $w^+$ and $w^-$ are the positive and negative parts of $w$,
$$w^+(\omega) = \max\bigl\{0, w(\omega)\bigr\}, \qquad w^-(\omega) = \min\bigl\{0, w(\omega)\bigr\}, \qquad \forall\, \omega \in \Omega.$$
In this way, taking also into account the rule $\infty - \infty = \infty$ (see Appendix A), $E\{w\}$
is well-defined as an extended real number if $\Omega$ is finite or countably infinite.
where $\bar{J}$ is the zero function, $\bar{J}(x) = 0$ for all $x \in X$. The same is true for
the infinite-stage cost [cf. Eq. (1.11)]:
$$J_\pi(x_0) = \limsup_{N \to \infty} (T_{\mu_0} \cdots T_{\mu_{N-1}} \bar{J})(x_0).$$

It can be seen that the mappings $T_\mu$ and $T$ are monotone, and it is
well-known that if $\alpha < 1$ and the boundedness condition (1.12) holds, they
are contractive as well (under the unweighted sup-norm); see e.g., [Ber12a],
Chapter 1. In this case, the model has the powerful analytical and algorithmic
properties (a)-(d) mentioned at the end of the preceding subsection. In
particular, the optimal cost function $J^*$ [i.e., $J^*(x) = \inf_\pi J_\pi(x)$ for all $x \in X$]
can be shown to be the unique solution of the fixed point equation $J^* = TJ^*$,
also known as Bellman's equation, which has the form
$$J^*(x) = \inf_{u \in U(x)} E\bigl\{ g(x, u, w) + \alpha J^*\bigl(f(x, u, w)\bigr) \bigr\}, \qquad x \in X,$$
and parallels the one given for deterministic optimal control problems [cf. Eq.
(1.3)].
These properties can be expressed and analyzed in an abstract setting
by using just the mappings Tµ and T , both when Tµ and T are contractive
(see Chapter 2), and when they are only monotone and not contractive while
either g ≥ 0 or g ≤ 0 (see Chapter 4). Moreover, under some conditions, it is
possible to analyze these properties in cases where Tµ is contractive for some
but not all µ (see Chapter 3, and Section 4.4).
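When the disturbance takes finitely many values, the expectation in the definition of $H$ is a plain finite sum, so the mapping can be written down directly. The sketch below is an illustration only; the dynamics, cost, and disturbance distribution are hypothetical.

# Sketch of H(x, u, J) = E{ g(x, u, w) + alpha * J(f(x, u, w)) } when the
# disturbance w takes finitely many values.  Dynamics f, cost g, and the
# distribution P(w | x, u) below are hypothetical.

ALPHA = 0.95
W = [-1, 0, 1]                                   # disturbance values

def p_w(w, x, u):                                # P(w | x, u): uniform, for illustration
    return 1.0 / len(W)

def f(x, u, w):
    return max(0, min(4, x + u + w))             # keep the state in {0, ..., 4}

def g(x, u, w):
    return x**2 + u**2

def H(x, u, J):
    return sum(p_w(w, x, u) * (g(x, u, w) + ALPHA * J[f(x, u, w)]) for w in W)

J = {x: 0.0 for x in range(5)}
print(H(2, 1, J))                                # one evaluation of the mapping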

Example 1.2.2 (Finite-State Discounted Markovian Decision Problems)

In the special case of the preceding example where the number of states is
finite, the system equation (1.10) may be defined in terms of the transition
probabilities
$$p_{xy}(u) = \mathrm{Prob}\bigl( y = f(x, u, w) \mid x \bigr), \qquad x, y \in X,\ u \in U(x),$$
so $H$ takes the form
$$H(x, u, J) = \sum_{y \in X} p_{xy}(u) \bigl( g(x, u, y) + \alpha J(y) \bigr).$$
When $\alpha < 1$ and the boundedness condition
$$\bigl| g(x, u, y) \bigr| \le B, \qquad \forall\, x, y \in X,\ u \in U(x),$$
[cf. Eq. (1.12)] holds (or more simply, when $U$ is a finite set), the mappings $T_\mu$
and $T$ are contraction mappings with respect to the standard (unweighted)
sup-norm. This is a classical model, referred to as discounted finite-state
MDP, which has a favorable theory and has found extensive applications (cf.
[Ber12a], Chapters 1 and 2). The model is additionally important, because it
is often used for computational solution of continuous state space problems
via discretization.
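A transition-probability description of this kind can be obtained from a system-equation description as in Example 1.2.1 by aggregating, for each pair $(x, u)$, the probabilities of the disturbance values that lead to the same successor state. The sketch below uses hypothetical data and is only an illustration of that bookkeeping.

# Sketch: building p_xy(u) = Prob(y = f(x, u, w) | x) from a system equation
# and a finite disturbance distribution, then forming
# H(x, u, J) = sum_y p_xy(u) * (g(x, u, y) + alpha * J(y)).
# All problem data are hypothetical.

from collections import defaultdict

ALPHA = 0.95
STATES = range(5)
W = [(-1, 0.25), (0, 0.5), (1, 0.25)]            # (disturbance value, probability)

def f(x, u, w):
    return max(0, min(4, x + u + w))

def transition_probs(x, u):
    p = defaultdict(float)
    for w, prob in W:
        p[f(x, u, w)] += prob                    # aggregate outcomes that hit the same y
    return p

def H(x, u, J, g):
    return sum(p * (g(x, u, y) + ALPHA * J[y]) for y, p in transition_probs(x, u).items())

g = lambda x, u, y: abs(y - 2)                   # a hypothetical per-transition cost
J = {y: 0.0 for y in STATES}
print(H(0, 1, J, g))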

Example 1.2.3 (Discounted Semi-Markov Problems)

With $x$, $y$, and $u$ as in Example 1.2.2, consider a mapping of the form
$$H(x, u, J) = G(x, u) + \sum_{y \in X} m_{xy}(u) J(y),$$
where $G$ is some function representing expected cost per stage, and $m_{xy}(u)$
are nonnegative scalars with
$$\sum_{y \in X} m_{xy}(u) < 1, \qquad \forall\, x \in X,\ u \in U(x).$$
The equation $J^* = TJ^*$ is Bellman's equation for a finite-state continuous-time
semi-Markov decision problem, after it is converted into an equivalent
discrete-time problem (cf. [Ber12a], Section 1.4). Again, the mappings $T_\mu$ and
$T$ are monotone and can be shown to be contraction mappings with respect
to the unweighted sup-norm.

Example 1.2.4 (Discounted Zero-Sum Dynamic Games)

Let us consider a zero-sum game analog of the finite-state MDP Example
1.2.2. Here there are two players that choose actions at each stage: the
first (called the minimizer) may choose a move $i$ out of $n$ moves and the
second (called the maximizer) may choose a move $j$ out of $m$ moves. Then
the minimizer gives a specified amount $a_{ij}$ to the maximizer, called a payoff.
The minimizer wishes to minimize $a_{ij}$, and the maximizer wishes to maximize
$a_{ij}$.
The players use mixed strategies, whereby the minimizer selects a probability
distribution $u = (u_1, \ldots, u_n)$ over his $n$ possible moves and the maximizer
selects a probability distribution $v = (v_1, \ldots, v_m)$ over his $m$ possible
moves. Thus the probability of selecting $i$ and $j$ is $u_i v_j$, and the expected
payoff for this stage is $\sum_{i,j} a_{ij} u_i v_j$ or $u'Av$, where $A$ is the $n \times m$ matrix
with components $a_{ij}$.
In a single-stage version of the game, the minimizer must minimize $\max_{v\in V} u'Av$ and the maximizer must maximize $\min_{u\in U} u'Av$, where U and V are the sets of probability distributions over {1, . . . , n} and {1, . . . , m}, respectively. A fundamental result (which will not be proved here) is that these two values are equal:
$$\min_{u\in U} \max_{v\in V} u'Av = \max_{v\in V} \min_{u\in U} u'Av. \tag{1.13}$$

Let us consider the situation where a separate game of the type just
described is played at each stage. The game played at a given stage is repre-
sented by a “state” x that takes values in a finite set X. The state evolves
according to transition probabilities qxy (i, j) where i and j are the moves
selected by the minimizer and the maximizer, respectively (here y represents

the next game to be played after moves i and j are chosen at the game represented by x). When the state is x, under u ∈ U and v ∈ V, the one-stage expected payoff is $u'A(x)v$, where A(x) is the n × m payoff matrix, and the state transition probabilities are
$$p_{xy}(u,v) = \sum_{i=1}^n \sum_{j=1}^m u_i v_j\, q_{xy}(i,j) = u' Q_{xy} v,$$

where Qxy is the n × m matrix that has components qxy (i, j). Payoffs are
discounted by α ∈ (0, 1), and the objectives of the minimizer and maximizer,
roughly speaking, are to minimize and to maximize the total discounted ex-
pected payoff. This requires selections of u and v to strike a balance between
obtaining favorable current stage payoffs and playing favorable games in fu-
ture stages.
We now introduce an abstract DP framework related to the sequential
move selection process just described. We consider the mapping G given by
$$G(x,u,v,J) = u'A(x)v + \alpha \sum_{y\in X} p_{xy}(u,v)\, J(y) = u'\Big(A(x) + \alpha \sum_{y\in X} Q_{xy} J(y)\Big) v, \tag{1.14}$$

where α ∈ (0, 1) is the discount factor, and the mapping H given by
$$H(x,u,J) = \max_{v\in V} G(x,u,v,J).$$

The corresponding mappings Tµ and T are


$$(T_\mu J)(x) = \max_{v\in V} G\big(x,\mu(x),v,J\big), \qquad x \in X,$$
and
$$(TJ)(x) = \min_{u\in U} \max_{v\in V} G(x,u,v,J).$$

It can be shown that Tµ and T are monotone and (unweighted) sup-norm contractions. Moreover, the unique fixed point J∗ of T satisfies
$$J^*(x) = \min_{u\in U} \max_{v\in V} G(x,u,v,J^*), \qquad \forall\ x \in X,$$
(see [Ber12a], Section 1.6.2).


We now note that since
$$A(x) + \alpha \sum_{y\in X} Q_{xy} J(y)$$
[cf. Eq. (1.14)] is a matrix that is independent of u and v, we may view J∗(x) as the value of a static game (which depends on the state x). In particular,
from the fundamental minimax equality (1.13), we have

$$\min_{u\in U} \max_{v\in V} G(x,u,v,J^*) = \max_{v\in V} \min_{u\in U} G(x,u,v,J^*), \qquad \forall\ x \in X.$$

This implies that J∗ is also the unique fixed point of the mapping
$$(\overline{T} J)(x) = \max_{v\in V} \overline{H}(x,v,J),$$
where
$$\overline{H}(x,v,J) = \min_{u\in U} G(x,u,v,J),$$
i.e., J∗ is the fixed point regardless of the order in which the minimizer and the maximizer select mixed strategies at each stage.
In the preceding development, we have introduced J∗ as the unique fixed point of the mappings T and $\overline{T}$. However, J∗ also has an interpretation in game theoretic terms. In particular, it can be shown that J∗(x) is the value
in game theoretic terms. In particular, it can be shown that J ∗ (x) is the value
of a dynamic game, whereby at state x the two opponents choose multistage
(possibly nonstationary) policies that consist of functions of the current state,
and continue to select moves using these policies over an infinite horizon. For
further discussion of this interpretation, we refer to [Ber12a] and to books on
dynamic games such as [FiV96]; see also [PaB99] and [Yu14] for an analysis
of the undiscounted case (α = 1) where there is a termination state, as in
the stochastic shortest path problems of the subsequent Example 1.2.6. An
alternative and more general formulation of sequential zero-sum games, which
allows for an infinite state space, will be given in Chapter 5.
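
As a computational illustration of the preceding construction (a minimal Python sketch with hypothetical payoff and transition data, and using scipy's general-purpose LP solver, which is of course not part of the text), the value of the static matrix game that defines (T J)(x) can be computed by the standard LP over the minimizer's mixed strategies, and T can then be iterated to its fixed point.

import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    # Value of min_u max_v u'Mv over mixed strategies u (n-vector), v (m-vector):
    # minimize t subject to M'u <= t*1, sum(u) = 1, u >= 0.
    n, m = M.shape
    c = np.zeros(n + 1); c[-1] = 1.0                        # objective: minimize t
    A_ub = np.hstack([M.T, -np.ones((m, 1))])               # M'u - t <= 0
    b_ub = np.zeros(m)
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])   # sum_i u_i = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def bellman_game(J, A, Q, alpha):
    # (T J)(x) = value of the matrix game with matrix A(x) + alpha * sum_y Q_xy J(y).
    X = len(A)
    return np.array([matrix_game_value(A[x] + alpha * sum(Q[x][y] * J[y] for y in range(X)))
                     for x in range(X)])

# Hypothetical 2-state example: 2x2 payoff matrices A(x) and matrices Q_xy of q_xy(i,j).
alpha = 0.9
A = [np.array([[1.0, 3.0], [2.0, 0.0]]), np.array([[0.0, 1.0], [4.0, 2.0]])]
Q = [[np.array([[0.5, 0.7], [0.2, 0.4]]), np.array([[0.5, 0.3], [0.8, 0.6]])],
     [np.array([[0.6, 0.1], [0.3, 0.9]]), np.array([[0.4, 0.9], [0.7, 0.1]])]]
J = np.zeros(2)
for _ in range(200):                                        # VI converges since T is a contraction
    J = bellman_game(J, A, Q, alpha)
print("fixed point J*:", J)
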

Example 1.2.5 (Minimax Problems)

Consider a minimax version of Example 1.2.1, where w is not random but is rather chosen from within a set W(x, u) by an antagonistic opponent. Let
$$H(x,u,J) = \sup_{w\in W(x,u)} \big\{ g(x,u,w) + \alpha J\big(f(x,u,w)\big) \big\}.$$

Then the equation J∗ = T J∗ is Bellman's equation for an infinite horizon minimax DP problem. A special case of this mapping arises in zero-sum dynamic games (cf. Example 1.2.4). We will also discuss alternative and more general abstract DP formulations of minimax problems in Chapter 5.

Example 1.2.6 (Stochastic Shortest Path Problems)

The stochastic shortest path (SSP for short) problem is the special case of
the stochastic optimal control Example 1.2.1 where:
(a) There is no discounting (α = 1).
(b) The state space is X = {t, 1, . . . , n} and we are given transition proba-
bilities, denoted by

pxy (u) = P (xk+1 = y | xk = x, uk = u), x, y ∈ X, u ∈ U (x).

(c) The control constraint set U (x) is finite for all x ∈ X.



(d) A cost g(x, u) is incurred when control u ∈ U (x) is selected at state x.


(e) State t is a special termination state, which is cost-free and absorbing,
i.e., for all u ∈ U (t),

g(t, u) = 0, ptt (u) = 1.

To simplify the notation, we have assumed that the cost per stage does not
depend on the successor state, which amounts to using expected cost per
stage in all calculations.
Since the termination state t is cost-free, the cost starting from t is zero
for every policy. Accordingly, for all cost functions, we ignore the component
that corresponds to t, and define
$$H(x,u,J) = g(x,u) + \sum_{y=1}^n p_{xy}(u)\, J(y), \qquad x = 1,\ldots,n,\ u \in U(x),\ J \in \Re^n.$$

The mappings Tµ and T are defined by

$$(T_\mu J)(x) = g\big(x,\mu(x)\big) + \sum_{y=1}^n p_{xy}\big(\mu(x)\big)\, J(y), \qquad x = 1,\ldots,n,$$
$$(TJ)(x) = \min_{u\in U(x)} \Big[ g(x,u) + \sum_{y=1}^n p_{xy}(u)\, J(y) \Big], \qquad x = 1,\ldots,n.$$

Note that the matrix that has components pxy (u), x, y = 1, . . . , n, is sub-
stochastic (some of its row sums may be less than 1) because there may be
a positive transition probability from a state x to the termination state t.
Consequently Tµ may be a contraction for some µ, but not necessarily for all
µ ∈ M.
The SSP problem has been discussed in many sources, including the
books [Pal67], [Der70], [Whi82], [Ber87], [BeT89], [HeL99], [Ber12a], and
[Ber17a], where it is sometimes referred to by earlier names such as “first
passage problem” and “transient programming problem.” In the framework
that is most relevant to our purposes, given in the paper by Bertsekas and
Tsitsiklis [BeT91], there is a classification of stationary policies for SSP into
proper and improper . We say that µ ∈ M is proper if, when using µ, there is
positive probability that termination will be reached after at most n stages,
regardless of the initial state; i.e., if
$$\rho_\mu = \max_{x=1,\ldots,n} P\{ x_n \ne t \mid x_0 = x,\ \mu \} < 1.$$

Otherwise, we say that µ is improper. It can be seen that µ is proper if and only if in the Markov chain corresponding to µ, each state x is connected to the termination state with a path of positive probability transitions.
For a proper policy µ, it can be shown that Tµ is a weighted sup-norm
contraction, as well as an n-stage contraction with respect to the unweighted

sup-norm. For an improper policy µ, Tµ is not a contraction with respect to any norm. Moreover, T also need not be a contraction with respect to any
norm (think of the case where there is only one policy, which is improper).
However, T is a weighted sup-norm contraction in the important special case
where all policies are proper (see [BeT96], Prop. 2.2, or [Ber12a], Chapter 3).
Nonetheless, even in the case where there are improper policies and T
is not a contraction, results comparable to the case of discounted finite-state
MDP are available for SSP problems assuming that:
(a) There exists at least one proper policy.
(b) For every improper policy there is an initial state that has infinite cost
under this policy.
Under the preceding two assumptions, referred to as the strong SSP conditions
in Section 3.5.1, it was shown in [BeT91] that T has a unique fixed point J ∗ ,
the optimal cost function of the SSP problem. Moreover, a policy {µ∗ , µ∗ , . . .}
is optimal if and only if

Tµ∗ J ∗ = T J ∗ .
In addition, J ∗ and Jµ can be computed by value iteration,

$$J^* = \lim_{k\to\infty} T^k J, \qquad J_\mu = \lim_{k\to\infty} T_\mu^k J,$$

starting with any $J \in \Re^n$ (see [Ber12a], Chapter 3, for a textbook account).
These properties are in analogy with the desirable properties (a)-(c), given at
the end of the preceding subsection in connection with contractive models.
Regarding policy iteration, it works in its strongest form when there are
no improper policies, in which case the mappings Tµ and T are weighted sup-
norm contractions. When there are improper policies, modifications to the
policy iteration method are needed; see [Ber12a], [YuB13a], and also Section
3.6.2, where these modifications will be discussed in an abstract setting.
In Section 3.5.1 we will also consider SSP problems where the strong
SSP conditions (a) and (b) above are not satisfied. Then we will see that
unusual phenomena can occur, including that J ∗ may not be a solution of
Bellman’s equation. Still our line of analysis of Chapter 3 will apply to such
problems.
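
The graph characterization of properness just given is easy to check in code. The following minimal Python sketch (with hypothetical transition matrices for the chain induced by a policy µ) performs a backward reachability computation from the termination state t over positive-probability transitions.

import numpy as np

def is_proper(P_mu):
    # P_mu is an (n+1) x (n+1) transition matrix of the chain under policy mu,
    # with index n standing for the termination state t.
    # mu is proper iff every state can reach t along positive-probability transitions.
    n = P_mu.shape[0] - 1
    reach_t = {n}
    changed = True
    while changed:
        changed = False
        for x in range(n):
            if x not in reach_t and any(P_mu[x, y] > 0 for y in reach_t):
                reach_t.add(x)
                changed = True
    return all(x in reach_t for x in range(n))

# Hypothetical 3-state examples plus termination state t (index 3).
P_proper = np.array([[0.0, 0.5, 0.0, 0.5],
                     [0.0, 0.0, 0.9, 0.1],
                     [0.0, 0.0, 0.0, 1.0],
                     [0.0, 0.0, 0.0, 1.0]])
P_improper = np.array([[0.0, 1.0, 0.0, 0.0],     # states 0 and 1 cycle and never reach t
                       [1.0, 0.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 1.0],
                       [0.0, 0.0, 0.0, 1.0]])
print(is_proper(P_proper), is_proper(P_improper))   # True, False
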

Example 1.2.7 (Deterministic Shortest Path Problems)

The special case of the SSP problem where the state transitions are determin-
istic is the classical shortest path problem. Here, we have a graph of n nodes
x = 1, . . . , n, plus a destination t, and an arc length axy for each directed arc
(x, y). At state/node x, a policy µ chooses an outgoing arc from x. Thus the
controls available at x can be identified with the outgoing neighbors of x [the
nodes u such that (x, u) is an arc]. The corresponding mapping H is
$$H(x,u,J) = \begin{cases} a_{xu} + J(u) & \text{if } u \ne t, \\ a_{xt} & \text{if } u = t, \end{cases} \qquad x = 1,\ldots,n.$$
A stationary policy µ defines a graph whose arcs are $\big(x,\mu(x)\big)$, x =
1, . . . , n. The policy µ is proper if and only if this graph is acyclic (it consists of

a tree of directed paths leading from each node to the destination). Thus there
exists a proper policy if and only if each node is connected to the destination
with a directed path. Furthermore, an improper policy has finite cost starting
from every initial state if and only if all the cycles of the corresponding graph
have nonnegative cycle cost. It follows that the favorable analytical and
algorithmic results described for SSP in the preceding example hold if the
given graph is connected and the costs of all its cycles are positive. We will
see later that significant complications result if the cycle costs are allowed to
be zero, even though the shortest path problem is still well posed in the sense
that shortest paths exist if the given graph is connected (see Section 3.1).
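
As a small computational sketch of this example (with a hypothetical acyclic graph, so that all policies are proper), the VI algorithm Jk+1 = T Jk based on the mapping H above reduces to a Bellman-Ford-type iteration.

# Hypothetical graph: nodes 1,...,4 plus destination 't'; arcs[x][u] = a_xu.
arcs = {1: {2: 1.0, 't': 10.0},
        2: {3: 2.0, 't': 5.0},
        3: {'t': 1.0},
        4: {1: 3.0, 't': 8.0}}
nodes = [1, 2, 3, 4]

def T(J):
    # (T J)(x) = min over outgoing arcs (x, u) of a_xu + J(u), with the convention J(t) = 0.
    return {x: min(a + (0.0 if u == 't' else J[u]) for u, a in arcs[x].items())
            for x in nodes}

J = {x: 0.0 for x in nodes}
for _ in range(100):                         # value iteration; converges finitely here
    J_new = T(J)
    if max(abs(J_new[x] - J[x]) for x in nodes) < 1e-12:
        break
    J = J_new
print(J_new)                                 # shortest path lengths to the destination t
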

Example 1.2.8 (Multiplicative and Risk-Sensitive Models)

With x, y, u, and transition probabilities pxy(u), as in the finite-state MDP of Example 1.2.2, consider the mapping
$$H(x,u,J) = \sum_{y\in X} p_{xy}(u)\, g(x,u,y)\, J(y) = E\big\{ g(x,u,y)\, J(y) \mid x, u \big\}, \tag{1.15}$$

where g is a scalar function satisfying g(x, u, y) ≥ 0 for all x, y, u (this is necessary for H to be monotone). This mapping corresponds to the multiplicative model of minimizing over all π = {µ0, µ1, . . .} the cost
$$J_\pi(x_0) = \limsup_{N\to\infty} E\Big\{ g\big(x_0,\mu_0(x_0),x_1\big)\, g\big(x_1,\mu_1(x_1),x_2\big) \cdots g\big(x_{N-1},\mu_{N-1}(x_{N-1}),x_N\big) \,\Big|\, x_0 \Big\}, \tag{1.16}$$

where the state sequence {x0, x1, . . .} is generated using the transition probabilities $p_{x_k x_{k+1}}\big(\mu_k(x_k)\big)$.
To see that the mapping H of Eq. (1.15) corresponds to the cost function
(1.16), let us consider the unit function

$$\bar J(x) \equiv 1, \qquad x \in X,$$

and verify that for all x0 ∈ X, we have


$$(T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} \bar J)(x_0) = E\Big\{ g\big(x_0,\mu_0(x_0),x_1\big)\, g\big(x_1,\mu_1(x_1),x_2\big) \cdots g\big(x_{N-1},\mu_{N-1}(x_{N-1}),x_N\big) \,\Big|\, x_0 \Big\}, \tag{1.17}$$
so that
$$J_\pi(x) = \limsup_{N\to\infty}\, (T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} \bar J)(x), \qquad x \in X.$$

Indeed, taking into account that $\bar J(x) \equiv 1$, we have
$$(T_{\mu_{N-1}} \bar J)(x_{N-1}) = E\big\{ g\big(x_{N-1},\mu_{N-1}(x_{N-1}),x_N\big)\, \bar J(x_N) \mid x_{N-1} \big\} = E\big\{ g\big(x_{N-1},\mu_{N-1}(x_{N-1}),x_N\big) \mid x_{N-1} \big\},$$
$$(T_{\mu_{N-2}} T_{\mu_{N-1}} \bar J)(x_{N-2}) = \big(T_{\mu_{N-2}} (T_{\mu_{N-1}} \bar J)\big)(x_{N-2}) = E\Big\{ g\big(x_{N-2},\mu_{N-2}(x_{N-2}),x_{N-1}\big)\, E\big\{ g\big(x_{N-1},\mu_{N-1}(x_{N-1}),x_N\big) \mid x_{N-1} \big\} \,\Big|\, x_{N-2} \Big\},$$
and continuing similarly,
$$(T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{N-1}} \bar J)(x_0) = E\Big\{ g\big(x_0,\mu_0(x_0),x_1\big)\, E\big\{ g\big(x_1,\mu_1(x_1),x_2\big) \cdots E\big\{ g\big(x_{N-1},\mu_{N-1}(x_{N-1}),x_N\big) \mid x_{N-1} \big\} \mid x_{N-2} \big\} \cdots \Big|\, x_0 \Big\},$$

which by using the iterated expectations formula (see e.g., [BeT08]) proves
the expression (1.17).
An important special case of a multiplicative model is when g has the
form
$$g(x,u,y) = e^{h(x,u,y)}$$
for some one-stage cost function h. We then obtain a finite-state MDP with
an exponential cost function,
$$J_\pi(x_0) = \limsup_{N\to\infty} E\Big\{ e^{h(x_0,\mu_0(x_0),x_1) + \cdots + h(x_{N-1},\mu_{N-1}(x_{N-1}),x_N)} \Big\},$$

which is often used to introduce risk aversion in the choice of policy through
the convexity of the exponential.
There is also a multiplicative version of the infinite state space stochas-
tic optimal control problem of Example 1.2.1. The mapping H takes the
form
$$H(x,u,J) = E\big\{ g(x,u,w)\, J\big(f(x,u,w)\big) \big\},$$
where xk+1 = f (xk , uk , wk ) is the underlying discrete-time dynamic system;
cf. Eq. (1.10).
Multiplicative models and related risk-sensitive models are discussed
extensively in the literature, mostly for the exponential cost case and under
different assumptions than ours; see e.g., [HoM72], [Jac73], [Rot84], [ChS87],
[Whi90], [JBE94], [FlM95], [HeM96], [FeM97], [BoM99], [CoM99], [BoM02],
[BBB08], [Ber16a]. The works of references [DeR79], [Pat01], and [Pat07]
relate to the stochastic shortest path problems of Example 1.2.6, and are the
closest to the semicontractive models discussed in Chapters 3 and 4, based
on the author’s paper [Ber16a]; see the next example and Section 3.5.2.

Example 1.2.9 (Affine Monotonic Models)

Consider a finite state space X = {1, . . . , n} and a (possibly infinite) control constraint set U(x) for each state x. For each policy µ, let the mapping Tµ be given by
$$T_\mu J = b_\mu + A_\mu J, \tag{1.18}$$
where bµ is a vector of $\Re^n$ with components $b\big(x,\mu(x)\big)$, x = 1, . . . , n, and Aµ is an n × n matrix with components $A_{xy}\big(\mu(x)\big)$, x, y = 1, . . . , n. We assume that b(x, u) and Axy(u) are nonnegative,
$$b(x,u) \ge 0, \qquad A_{xy}(u) \ge 0, \qquad \forall\ x, y = 1,\ldots,n,\ u \in U(x).$$

Thus Tµ and T map nonnegative functions to nonnegative functions J : X ↦ [0, ∞].
This model was introduced in the first edition of this book, and was elab-
orated on in the author’s paper [Ber16a]. Special cases of the model include
the finite-state Markov and semi-Markov problems of Examples 1.2.1-1.2.3,
and the stochastic shortest path problem of Example 1.2.6, with Aµ being the
transition probability matrix of µ (perhaps appropriately discounted), and bµ
being the cost per stage vector of µ, which is assumed nonnegative. An in-
teresting affine monotonic model of a different type is the multiplicative cost
model of the preceding example, where the initial function is $\bar J(x) \equiv 1$ and
the cost accumulates multiplicatively up to reaching a termination state t. In
the exponential case of this model, the cost of a generated path starting from
some initial state accumulates additively as in the SSP case, up to reaching
t. However, the cost of the model is the expected value of the exponentiated
cost of the path up to reaching t. It can be shown then that the mapping Tµ
has the form
$ % $ %
(Tµ J)(x) = pxt µ(x) exp g(x, µ(x), t)
n
# $ %
+ pxy (µ(x))exp g(x, µ(x), y) J(y), x ∈ X,
y=1

where pxy (u) is the probability of transition from x to y under u, and g(x, u, y)
is the cost of the transition; see Section 3.5.2 for a detailed derivation. Clearly
Tµ has the affine monotonic form (1.18).
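
As a small computational sketch (with hypothetical transition probabilities and transition costs, chosen so that the iteration converges), the vector bµ and the matrix Aµ of the affine monotonic form (1.18) for this exponential case can be assembled directly, and Tµ can be iterated starting from the unit function.

import numpy as np

# Hypothetical 3-state exponential-cost example under a fixed policy mu.
p  = np.array([[0.2, 0.3, 0.0],              # p[x, y] = p_xy(mu(x)), y = 1,...,n
               [0.1, 0.2, 0.3],
               [0.0, 0.1, 0.4]])
pt = 1.0 - p.sum(axis=1)                     # p_xt(mu(x)): remaining mass goes to t
g  = np.array([[0.1, 0.2, 0.0],              # g(x, mu(x), y)
               [0.1, 0.1, 0.2],
               [0.0, 0.2, 0.1]])
gt = np.array([0.3, 0.2, 0.4])               # g(x, mu(x), t)

# Affine monotonic form (1.18): T_mu J = b_mu + A_mu J with nonnegative b_mu, A_mu.
b_mu = pt * np.exp(gt)
A_mu = p * np.exp(g)

J = np.ones(3)                               # initial (unit) function, J(x) = 1
for _ in range(1000):                        # converges here since A_mu has spectral radius < 1
    J = b_mu + A_mu @ J
print("J_mu (expected exponentiated cost up to termination):", J)
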

Example 1.2.10 (Aggregation)

Aggregation is an approximation approach that simplifies a large dynamic programming (DP) problem by “combining” multiple states into aggregate
states. This results in a reduced or “aggregate” problem with fewer states,
which can often be solved using exact DP methods. The optimal cost-to-go
function derived from this aggregate problem then serves as an approximation
of the optimal cost function for the original problem.
Consider an n-state Markovian decision problem with transition prob-
abilities pij (u). To construct an aggregation framework, we introduce a finite
set A of aggregate states. We generically denote the aggregate states by let-
ters such as x and y, and the original system states by letters such as i and j.
The approximation framework is constructed by combining in various ways
the aggregate states and the original system states to form a larger system
(see Fig. 1.2.2). To specify the probabilistic structure of this system, we in-
troduce two (somewhat arbitrary) choices of probability distributions, which
relate the original system states with the aggregate states:
(1) For each aggregate state x and original system state i, we specify the
disaggregation probability dxi . We assume that dxi ≥ 0 and

$$\sum_{i=1}^n d_{xi} = 1, \qquad \forall\ x \in A.$$

Figure 1.2.2 Illustration of the relation between aggregate and original sys-
tem states.

Roughly, dxi may be interpreted as the “degree to which x is represented by i.”
(2) For each aggregate state y and original system state j, we specify the
aggregation probability φjy . We assume that φjy ≥ 0 and
$$\sum_{y\in A} \phi_{jy} = 1, \qquad \forall\ j = 1,\ldots,n.$$

Roughly, φjy may be interpreted as the “degree of membership of j in the aggregate state y.”
The aggregation and disaggregation probabilities specify a dynamic sys-
tem involving both aggregate and original system states (cf. Fig. 1.2.2). In
this system:
(i) From aggregate state x, we generate original system state i according
to dxi .
(ii) We generate transitions from original system state i to original system
state j according to pij (u), with cost g(i, u, j).
(iii) From original system state j, we generate aggregate state y according
to φjy .
Illustrative examples of aggregation frameworks are given in the books
[Ber12a] and [Ber17a]. One possibility is hard aggregation, where aggregate
states are identified with the sets of a partition of the state space. For another
type of common scheme, think of the case where the original system states
form a fine grid in some space, which is “aggregated” into a much coarser grid.
In particular let us choose a collection of “representative” original system
states, and associate each one of them with an aggregate state. Thus, each
aggregate state x is associated with a unique representative state ix , and the

Figure 1.2.3 Aggregation based on a small subset of representative states (these are shown with larger dark circles, while the other (nonrepresentative) states are shown with smaller dark circles). In this figure, from representative state x = i, there are three possible transitions, to states j1, j2, and j3, according to pij1(u), pij2(u), pij3(u), and each of these states is associated with a convex combination of representative states using the aggregation probabilities. For example, j1 is associated with $\phi_{j_1 y_1} y_1 + \phi_{j_1 y_2} y_2 + \phi_{j_1 y_3} y_3$.

disaggregation probabilities are

$$d_{xi} = \begin{cases} 1 & \text{if } i = i_x, \\ 0 & \text{if } i \ne i_x. \end{cases} \tag{1.19}$$

The aggregation probabilities are chosen to represent each original system state j with a convex combination of aggregate/representative states; see Fig. 1.2.3. It is also natural to assume that the aggregation probabilities map representative states to themselves, i.e.,
$$\phi_{jy} = \begin{cases} 1 & \text{if } j = j_y, \\ 0 & \text{if } j \ne j_y. \end{cases}$$

This scheme makes intuitive geometrical sense as an interpolation scheme in the special case where both the original and the aggregate states are associated with points in a Euclidean space. The scheme may also be extended
ciated with points in a Euclidean space. The scheme may also be extended
to problems with a continuous state space. In this case, the state space is
discretized with a finite grid, and the states of the grid are viewed as the ag-
gregate states. The disaggregation probabilities are still given by Eq. (1.19),
while the aggregation probabilities may be arbitrarily chosen to represent each
original system state with a convex combination of representative states.
As an extension of the preceding schemes, suppose that through some
special insight into the problem’s structure or some preliminary calculation,
we know some features of the system’s state that can “predict well” its cost.
Then it seems reasonable to form the aggregate states by grouping together

states with “similar features,” or to form aggregate states by using “representative features” instead of representative states. This is called “feature-based
aggregation;” see the books [BeT96] (Section 3.1) and [Ber12a] (Section 6.5)
for a description and analysis.
Given aggregation and disaggregation probabilities, we may define an
aggregate problem whose states are the aggregate states. This problem in-
volves an aggregate discrete-time system, which we will describe shortly. We
require that the control is applied with knowledge of the current aggregate
state only (rather than the original system state).† To this end, we assume
that the control constraint set U (i) is independent of the state i, and we de-
note it by U . Then, by adding the probabilities of all the relevant paths in
Fig. 1.2.2, it can be seen that the transition probability from aggregate state
x to aggregate state y under control u ∈ U is
$$\hat p_{xy}(u) = \sum_{i=1}^n \sum_{j=1}^n d_{xi}\, p_{ij}(u)\, \phi_{jy}.$$

The corresponding expected transition cost is given by


$$\hat g(x,u) = \sum_{i=1}^n \sum_{j=1}^n d_{xi}\, p_{ij}(u)\, g(i,u,j).$$

These transition probabilities and costs define the aggregate problem. We may compute the optimal costs-to-go $\hat J(x)$, x ∈ A, of this problem by using some exact DP method. Then, the costs-to-go of each state j of the original problem are usually approximated by
$$\tilde J(j) = \sum_{y\in A} \phi_{jy}\, \hat J(y).$$
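
The construction of the aggregate problem is easy to express in code. The following minimal Python sketch (with hypothetical data: 5 original states, 2 controls, and 2 aggregate states) forms p̂xy(u) and ĝ(x, u) from the formulas above, solves the aggregate problem by value iteration, and interpolates the result back to the original states.

import numpy as np

np.random.seed(1)
n, m, nA, alpha = 5, 2, 2, 0.9                # original states, controls, aggregate states
p = np.random.rand(m, n, n); p /= p.sum(axis=2, keepdims=True)   # p[u, i, j] = p_ij(u)
g = np.random.rand(m, n, n)                                      # g(i, u, j) stored as g[u, i, j]

D = np.array([[0.5, 0.5, 0.0, 0.0, 0.0],      # d_xi: disaggregation probabilities (rows sum to 1)
              [0.0, 0.0, 1/3, 1/3, 1/3]])
Phi = np.array([[1.0, 0.0],                   # phi_jy: aggregation probabilities (rows sum to 1)
                [0.7, 0.3],
                [0.4, 0.6],
                [0.0, 1.0],
                [0.2, 0.8]])

# Aggregate transition probabilities and expected transition costs.
p_hat = np.einsum('xi,uij,jy->uxy', D, p, Phi)
g_hat = np.einsum('xi,uij,uij->ux', D, p, g)

# Solve the aggregate problem by value iteration.
J_hat = np.zeros(nA)
for _ in range(500):
    J_hat = (g_hat + alpha * p_hat @ J_hat).min(axis=0)

# Approximate costs-to-go of the original states: J_tilde(j) = sum_y phi_jy * J_hat(y).
J_tilde = Phi @ J_hat
print("aggregate costs J_hat:", J_hat)
print("interpolated approximation J_tilde:", J_tilde)
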

Example 1.2.11 (Distributed Aggregation)


The abstract DP framework is useful not only in modeling DP problems,
but also in modeling algorithms arising in DP and even other contexts. We
illustrate this with an example from Bertsekas and Yu [BeY10] that relates
to the distributed solution of large-scale discounted finite-state MDP using
cost function approximation based on aggregation. ‡ It involves a partition of
the n states into m subsets for the purposes of distributed computation, and
yields a corresponding approximation (V1 , . . . , Vm ) to the cost vector J ∗ .
In particular, we have a discounted n-state MDP (cf. Example 1.2.2),
and we introduce aggregate states S1 , . . . , Sm , which are disjoint subsets of

† An alternative form of aggregate problem, where the control may depend on the original system state, is discussed in Section 6.5.2 of the book [Ber12a].
‡ See [Ber12a], Section 6.5.2, for a more detailed discussion. Other examples
of algorithmic mappings that come under our framework arise in asynchronous
policy iteration (see Sections 2.6.3, 3.6.2, and [BeY10], [BeY12], [YuB13a]), and
in constrained forms of policy iteration (see [Ber11c], or [Ber12a], Exercise 2.7).

the original state space with $S_1 \cup \cdots \cup S_m = \{1, \ldots, n\}$. We envision a network of processors ℓ = 1, . . . , m, each assigned to the computation of a local cost function Vℓ, defined on the corresponding aggregate state/subset Sℓ:
$$V_\ell = \{ V_{\ell y} \mid y \in S_\ell \}.$$
Processor ℓ also maintains a scalar aggregate cost Rℓ for its aggregate state, which is a weighted average of the detailed cost values Vℓx within Sℓ:
$$R_\ell = \sum_{x\in S_\ell} d_{\ell x}\, V_{\ell x},$$
where $d_{\ell x}$ are given probabilities with $d_{\ell x} \ge 0$ and $\sum_{x\in S_\ell} d_{\ell x} = 1$. The aggregate costs Rℓ are communicated between processors and are used to perform the computation of the local cost functions Vℓ (we will discuss computation models of this type in Section 2.6).
We denote J = (V1 , . . . , Vm , R1 , . . . , Rm ). We introduce the mapping
H(x, u, J) defined for each of the n states x by

$$H(x,u,J) = W_\ell(x,u,V_\ell,R_1,\ldots,R_m), \qquad \text{if } x \in S_\ell,$$
where for $x \in S_\ell$
$$W_\ell(x,u,V_\ell,R_1,\ldots,R_m) = \sum_{y=1}^n p_{xy}(u)\, g(x,u,y) + \alpha \sum_{y\in S_\ell} p_{xy}(u)\, V_{\ell y} + \alpha \sum_{y\notin S_\ell} p_{xy}(u)\, R_{s(y)},$$

and for each original system state y, we denote by s(y) the index of the subset
to which y belongs [i.e., y ∈ Ss(y) ].
We may view H as an abstract mapping on the space of J, and aim to find its fixed point $J^* = (V_1^*, \ldots, V_m^*, R_1^*, \ldots, R_m^*)$. Then, for ℓ = 1, . . . , m, we may view $V_\ell^*$ as an approximation to the optimal cost vector of the original MDP starting at states x ∈ Sℓ, and we may view $R_\ell^*$ as a form of aggregate cost for Sℓ. The advantage of this formulation is that it involves significant decomposition and parallelization of the computations among the processors, when performing various DP algorithms. In particular, the computation of $W_\ell(x,u,V_\ell,R_1,\ldots,R_m)$ depends on just the local vector Vℓ, whose dimension may be potentially much smaller than n.

1.2.4 Reinforcement Learning - Projected and Aggregation Bellman Equations

Given an abstract DP model described by a mapping H, we may be interested in fixed points of related mappings other than T and Tµ. Such
mappings may arise in various contexts, such as for example distributed

asynchronous aggregation in Example 1.2.11. An important context is subspace approximation, whereby Tµ and T are restricted onto a subspace of functions for the purpose of approximating their fixed points. Much of the theory of approximate DP, neuro-dynamic programming, and reinforcement learning relies on such approximations (there are quite a few books, which collectively contain extensive accounts of these subjects, such as Bert-
sekas and Tsitsiklis [BeT96], Sutton and Barto [SuB98], Gosavi [Gos03],
Cao [Cao07], Chang, Fu, Hu, and Marcus [CFH07], Meyn [Mey07], Powell
[Pow07], Borkar [Bor08], Haykin [Hay08], Busoniu, Babuska, De Schut-
ter, and Ernst [BBD10], Szepesvari [Sze10], Bertsekas [Ber12a], [Ber17a],
[Ber19b], [Ber20], and Vrabie, Vamvoudakis, and Lewis [VVL13]).
For an illustration, consider the approximate evaluation of the cost
vector of a discrete-time Markov chain with states i = 1, . . . , n. We assume
that state transitions (i, j) occur at time k according to given transition
probabilities pij , and generate a cost αk g(i, j), where α ∈ (0, 1) is a discount
factor. The cost function over an infinite number of stages can be shown to
be the unique fixed point of the Bellman equation mapping $T : \Re^n \mapsto \Re^n$ whose components are given by
$$(TJ)(i) = \sum_{j=1}^n p_{ij}\big(g(i,j) + \alpha J(j)\big), \qquad i = 1,\ldots,n,\ J \in \Re^n.$$
This is the same as the mapping T in the discounted finite-state MDP Ex-
ample 1.2.2, except that we restrict attention to a single policy. Finding
the cost function of a fixed policy is the important policy evaluation sub-
problem that arises prominently within the context of policy iteration. It
also arises in the context of a simplified form of policy iteration, the roll-
out algorithm; see e.g., [BeT96], [Ber12a], [Ber17a], [Ber19b], [Ber20]. In
some artificial intelligence contexts, policy iteration is referred to as self-
learning, and in these contexts the policy evaluation is almost always done
approximately, sometimes with the use of neural networks.
A prominent approach for approximation of the fixed point of T is
based on the solution of lower-dimensional equations defined on the sub-
space $S = \{\Phi r \mid r \in \Re^s\}$ that is spanned by the columns of a given n × s matrix
Φ. Two such approximating equations have been studied extensively (see
[Ber12a], Chapter 6, for a detailed account and references; also [BeY07],
[BeY09], [YuB10], [Ber11a] for extensions to abstract contexts beyond ap-
proximate DP). These are:
(a) The projected equation
$$\Phi r = \Pi_\xi T(\Phi r), \tag{1.20}$$
where Πξ denotes projection onto S with respect to a weighted Euclidean norm
$$\|J\|_\xi = \sqrt{\sum_{i=1}^n \xi_i \big(J(i)\big)^2} \tag{1.21}$$

with ξ = (ξ1, . . . , ξn) being a probability distribution with positive components (sometimes a seminorm projection is used, whereby some of the components ξi may be zero; see Yu and Bertsekas [YuB12]).
(b) The aggregation equation

Φr = ΦDT (Φr), (1.22)

with D being an s × n matrix whose rows are restricted to be probability distributions; these are the disaggregation probabilities of Example 1.2.10. Also, in this approach, the rows of Φ are restricted to be probability distributions; these are the aggregation probabilities of Example 1.2.10.
We now see that solving the projected equation (1.20) and the aggre-
gation equation (1.22) amounts to finding a fixed point of the mappings
Πξ T and ΦDT , respectively. These mappings derive their structure from
the DP operator T , so they have some DP-like properties, which can be
exploited for analysis and computation.
An important fact is that the aggregation mapping ΦDT preserves
the monotonicity and the sup-norm contraction property of T , while the
projected equation mapping Πξ T generally does not. The reason for preser-
vation of monotonicity is the nonnegativity of the components of the ma-
trices Φ and D (see the author’s survey paper [Ber11c] for a discussion of
the importance of preservation of monotonicity in various DP operations).
The reason for preservation of sup-norm contraction is that the matrices
Φ and D are sup-norm nonexpansive, because their rows are probability
distributions. In fact, it can be verified that the solution r of Eq. (1.22)
can be viewed as the exact DP solution of the “aggregate” DP problem
that represents a lower-dimensional approximation of the original (see Ex-
ample 1.2.10). The preceding observations are important for our purposes,
as they indicate that much of the theory developed in this book applies to
approximation-related mappings based on aggregation.
By contrast, the projected equation mapping Πξ T need not be mono-
tone, because the components of Πξ need not be nonnegative. Moreover
while the projection Πξ is nonexpansive with respect to the projection norm
+·+ξ , it need not be nonexpansive with respect to the sup-norm. As a result
the projected equation mapping Πξ T need not be a sup-norm contraction.
These facts play a significant role in approximate DP methodology.
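
For a fixed policy with linear mapping T J = g + αP J, both the projected equation (1.20) and the aggregation equation (1.22) reduce to small linear systems in r. The following minimal Python sketch (with hypothetical P, g, Φ, D, and ξ) uses the standard orthogonality condition Φ′Ξ(Φr − T(Φr)) = 0 for the projected equation, and the equivalent aggregate form r = DT(Φr) for the aggregation equation.

import numpy as np

np.random.seed(2)
n, s, alpha = 6, 2, 0.9
P = np.random.rand(n, n); P /= P.sum(axis=1, keepdims=True)     # single-policy transition matrix
gvec = np.random.rand(n)                                        # expected cost per stage
Phi = np.random.rand(n, s)                                      # basis matrix (columns span S)
xi = np.ones(n) / n                                             # weights of the projection norm
Xi = np.diag(xi)

# Linear single-policy Bellman mapping: T J = gvec + alpha * P @ J.

# (a) Projected equation Phi r = Pi_xi T(Phi r):
#     the residual Phi r - T(Phi r) must be Xi-orthogonal to the columns of Phi.
C = Phi.T @ Xi @ (np.eye(n) - alpha * P) @ Phi
d = Phi.T @ Xi @ gvec
r_proj = np.linalg.solve(C, d)

# (b) Aggregation equation Phi r = Phi D T(Phi r): any r with r = D T(Phi r) works,
#     i.e., (I - alpha * D P Phi) r = D gvec.  Rows of D and Phi are probability distributions here.
D = np.random.rand(s, n); D /= D.sum(axis=1, keepdims=True)
Phi_agg = np.random.rand(n, s); Phi_agg /= Phi_agg.sum(axis=1, keepdims=True)
r_agg = np.linalg.solve(np.eye(s) - alpha * D @ P @ Phi_agg, D @ gvec)

print("projected-equation solution r:", r_proj)
print("aggregation-equation solution r:", r_agg)
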

1.2.5 Reinforcement Learning - Temporal Difference and Proximal Algorithms

An important possibility for finding a fixed point of T is to replace T with another mapping, say F, such that F and T have the same fixed
points. For example, F may offer some advantages in terms of algorithmic
convenience or quality of approximation when used in conjunction with

projection or aggregation [cf. Eqs. (1.20) and (1.22)]. Alternatively, F may be the mapping of some iterative method that is suitable for computing
fixed points of T .
In this book we will not consider in much detail the possibility of using
an alternative mapping F to find a fixed point of a mapping T . We will just
mention here some multistep versions of T , which have been used widely
for approximations in reinforcement learning. An important example is the mapping $T^{(\lambda)} : \Re^n \mapsto \Re^n$, defined for a given λ ∈ (0, 1) as follows: $T^{(\lambda)}$ transforms a vector $J \in \Re^n$ to the vector $T^{(\lambda)} J \in \Re^n$, whose n components are given by
$$\big(T^{(\lambda)} J\big)(i) = (1-\lambda) \sum_{\ell=0}^\infty \lambda^\ell\, (T^{\ell+1} J)(i), \qquad i = 1,\ldots,n,\ J \in \Re^n,$$
for λ ∈ (0, 1), where $T^\ell$ is the ℓ-fold composition of T with itself ℓ times.
Here there should be conditions that guarantee the convergence of the
infinite series in the preceding definition. The multistep analog of the
projected Eq. (1.20) is
$$\Phi r = \Pi_\xi T^{(\lambda)}(\Phi r).$$
The popular temporal difference methods, such as TD(λ), LSTD(λ), and
LSPE(λ), aim to solve this equation (see the book references on approx-
imate DP, neuro-dynamic programming, and reinforcement learning cited
earlier). The mapping T (λ) also forms the basis for the λ-policy iteration
method to be discussed in Sections 2.5, 3.2.4, and 4.3.3.
The multistep analog of the aggregation Eq. (1.22) is

$$\Phi r = \Phi D\, T^{(\lambda)}(\Phi r),$$

and methods that are similar to the temporal difference methods can be
used for its solution. In particular, a multistep method based on the map-
ping $T^{(\lambda)}$ is the so-called λ-aggregation method (see [Ber12a], Chapter
6), as well as other forms of aggregation (see [Ber12a], [YuB12]).
In the case where T is a linear mapping of the form

T J = AJ + b,

where b is a vector in $\Re^n$, and A is an n × n matrix with eigenvalues strictly within the unit circle, there is an interesting connection between the multistep mapping $T^{(\lambda)}$ and another mapping of major importance in numerical convex optimization. This is the proximal mapping, associated with T and a scalar c > 0, and denoted by $P^{(c)}$. In particular, for a given $J \in \Re^n$, the vector $P^{(c)} J$ is defined as the unique vector $Y \in \Re^n$ that solves the equation
$$Y - AY - b = \frac{1}{c}\,(J - Y).$$

Equivalently,
$$P^{(c)} J = \Big(\frac{c+1}{c}\, I - A\Big)^{-1}\Big(b + \frac{1}{c}\, J\Big), \tag{1.23}$$
where I is the identity matrix. Then it can be shown (see Exercise 1.2 or the papers [Ber16b], [Ber18c]) that if
$$c = \frac{\lambda}{1-\lambda},$$
we have
$$T^{(\lambda)} = T \cdot P^{(c)} = P^{(c)} \cdot T.$$
Moreover, the vectors J, $P^{(c)} J$, and $T^{(\lambda)} J$ are colinear and satisfy
$$T^{(\lambda)} J = J + \frac{c+1}{c}\,\big(P^{(c)} J - J\big).$$
The preceding formulas show that T (λ) and P (c) are closely related, and
that iterating with T (λ) is “faster” than iterating with P (c) , since the eigen-
values of A are within the unit circle, so that T is a contraction. In addition,
methods such as TD(λ), LSTD(λ), LSPE(λ), and their projected versions,
which are based on T (λ) , can be adapted to be used with P (c) .
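
These relations are easy to verify numerically for a linear mapping T J = AJ + b. The following minimal Python sketch (with hypothetical A and b, and with the infinite series defining T(λ) truncated at a large index) checks the identity T(λ)J = T(P(c)J) with c = λ/(1 − λ).

import numpy as np

np.random.seed(3)
n, lam = 5, 0.7
c = lam / (1.0 - lam)
A = np.random.rand(n, n)
A /= A.sum(axis=1, keepdims=True) / 0.9            # row sums 0.9, so eigenvalues are inside the unit circle
b = np.random.rand(n)
J = np.random.rand(n)

T = lambda Y: A @ Y + b

# Proximal mapping P(c) J, cf. Eq. (1.23).
P_c = np.linalg.solve((c + 1.0) / c * np.eye(n) - A, b + J / c)

# Multistep mapping T(lambda) J = (1 - lam) * sum_{l>=0} lam^l * T^{l+1} J (series truncated).
Y, T_lam = T(J), np.zeros(n)
for l in range(2000):
    T_lam += (1.0 - lam) * lam**l * Y
    Y = T(Y)

print("max |T(lambda)J - T(P(c)J)| =", np.max(np.abs(T_lam - T(P_c))))   # ~0 up to truncation error
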
A more general form of multistep approach, introduced and studied in the paper [YuB12], replaces $T^{(\lambda)}$ with a mapping $T^{(w)} : \Re^n \mapsto \Re^n$ that has components
$$\big(T^{(w)} J\big)(i) = \sum_{\ell=1}^\infty w_{i\ell}\, (T^\ell J)(i), \qquad i = 1,\ldots,n,\ J \in \Re^n,$$
where w is a vector sequence whose ith component, $(w_{i1}, w_{i2}, \ldots)$, is a probability distribution over the positive integers. Then the multistep analog of the projected equation (1.20) is
$$\Phi r = \Pi_\xi T^{(w)}(\Phi r), \tag{1.24}$$
while the multistep analog of the aggregation equation (1.22) is
$$\Phi r = \Phi D\, T^{(w)}(\Phi r). \tag{1.25}$$
The mapping $T^{(\lambda)}$ is obtained for $w_{i\ell} = (1-\lambda)\lambda^{\ell-1}$, independently of the state i. A more general version, where λ depends on the state i, is obtained for $w_{i\ell} = (1-\lambda_i)\lambda_i^{\ell-1}$. The solution of Eqs. (1.24) and (1.25) by simulation-based methods is discussed in the paper [YuB12]; see also Exercise 1.3.
Let us also note that there is a connection between projected equa-
tions of the form (1.24) and aggregation equations of the form (1.25). This
connection is based on the use of a seminorm [this is given by the same
expression as the norm ‖ · ‖ξ of Eq. (1.21), with some of the components
of ξ allowed to be 0]. In particular, the most prominent cases of aggrega-
tion equations can be viewed as seminorm projected equations because, for
these cases, ΦD is a seminorm projection (see [Ber12a], p. 639, [YuB12],
Section 4). Moreover, they can also be viewed as projected equations where
the projection is oblique (see [Ber12a], Section 7.3.6).

1.3 REINFORCEMENT LEARNING - APPROXIMATION IN VALUE SPACE
In this section we will use geometric illustrations to obtain insight into Bell-
man’s equation, the algorithms of value iteration (VI) and policy iteration
(PI), and an approximation methodology, which is prominent in reinforce-
ment learning and is known as approximation in value space.† Throughout
this section, we will make use of the following two properties:
(a) T and Tµ are monotone, i.e., they satisfy Assumption 1.2.1.
(b) We have
$$(TJ)(x) = \min_{\mu\in\mathcal{M}}\, (T_\mu J)(x), \qquad \text{for all } x, \tag{1.26}$$

where M is the set of stationary policies. This is true because for any policy µ, there is no coupling constraint between the controls µ(x) and µ(x′) that correspond to two different states x and x′.
We will first focus on the discounted version of the Markovian decision
problem of Example 1.2.1, and we will then consider more general cases.

1.3.1 Approximation in Value Space for Markovian Decision Problems
In Markovian decision problems the mappings Tµ and T are given by
$$(T_\mu J)(x) = E\big\{ g\big(x,\mu(x),w\big) + \alpha J\big(f(x,\mu(x),w)\big) \big\}, \qquad \text{for all } x, \tag{1.27}$$
and
$$(TJ)(x) = \inf_{u\in U(x)} E\big\{ g(x,u,w) + \alpha J\big(f(x,u,w)\big) \big\}, \qquad \text{for all } x, \tag{1.28}$$

where α ∈ (0, 1]; cf. Example 1.2.1.
In addition to monotonicity, we have an additional important property: Tµ is linear, in the sense that it has the form
$$T_\mu J = G_\mu + A_\mu J,$$
where $G_\mu \in R(X)$ is some function and $A_\mu : R(X) \mapsto R(X)$ is an operator such that for any functions J1, J2, and scalars γ1, γ2, we have
$$A_\mu(\gamma_1 J_1 + \gamma_2 J_2) = \gamma_1 A_\mu J_1 + \gamma_2 A_\mu J_2.$$

† The major alternative reinforcement learning approach is approximation in policy space, whereby a suboptimal policy is selected from within a class of parametrized policies, usually by means of some optimization procedure, such as random search, or gradient descent; see e.g., the author's reinforcement learning book [Ber19b].

This is true because of the linearity of the expected value operation in Eq.
(1.27). The linearity of Tµ implies another important property: (T J)(x) is
a concave function of J for every x. By this we mean that the set
$$C_x = \big\{ (J,\xi) \mid (TJ)(x) \ge \xi,\ J \in R(X),\ \xi \in \Re \big\} \tag{1.29}$$
is convex for all x ∈ X, where R(X) is the set of real-valued functions over the state space X, and ℜ is the set of real numbers. This follows from the
linearity of Tµ , the alternative definition of T given by Eq. (1.26), and the
fact that for a fixed x, the minimum of the linear functions (Tµ J)(x) over
µ ∈ M is concave as a function of J.
We illustrate these properties graphically with an example.

Example 1.3.1 (A Two-State and Two-Control Example)

Assume that there are two states 1 and 2, and two controls u and v. Consider
the policy µ that applies control u at state 1 and control v at state 2. Then
the operator Tµ takes the form

$$(T_\mu J)(1) = \sum_{y=1}^2 p_{1y}(u)\big(g(1,u,y) + \alpha J(y)\big), \tag{1.30}$$
$$(T_\mu J)(2) = \sum_{y=1}^2 p_{2y}(v)\big(g(2,v,y) + \alpha J(y)\big), \tag{1.31}$$

where pxy (u) and pxy (v) are the probabilities that the next state will be y,
when the current state is x, and the control is u or v, respectively. Clearly,
(Tµ J)(1) and (Tµ J)(2) are linear functions of J. Also the operator T of the
Bellman equation J = T J takes the form
$$(TJ)(1) = \min\Big[\, \sum_{y=1}^2 p_{1y}(u)\big(g(1,u,y) + \alpha J(y)\big),\ \sum_{y=1}^2 p_{1y}(v)\big(g(1,v,y) + \alpha J(y)\big) \Big], \tag{1.32}$$
$$(TJ)(2) = \min\Big[\, \sum_{y=1}^2 p_{2y}(u)\big(g(2,u,y) + \alpha J(y)\big),\ \sum_{y=1}^2 p_{2y}(v)\big(g(2,v,y) + \alpha J(y)\big) \Big]. \tag{1.33}$$

Thus, (T J)(1) and (T J)(2) are concave and piecewise linear as functions of the two-dimensional vector J (with two pieces; more generally, as many linear pieces as the number of controls). This concavity property holds in general since (T J)(x) is the minimum of a collection of linear functions of J, one for each u ∈ U(x). Figure 1.3.1 illustrates (Tµ J)(1) for the cases where µ(1) = u and µ(1) = v, (Tµ J)(2) for the cases where µ(2) = u and µ(2) = v, (T J)(1), and (T J)(2), as functions of J = (J(1), J(2)).

Critical properties from the DP point of view are whether T and Tµ have fixed points; equivalently, whether the Bellman equations J = T J and J = Tµ J have solutions within the class of real-valued functions, and
and J = Tµ J have solutions within the class of real-valued functions, and
whether the set of solutions includes J * and Jµ , respectively. It may thus
be important to verify that T or Tµ are contraction mappings. This is
true for example in the benign case of discounted problems (α < 1) with
bounded cost per stage. However, for undiscounted problems, asserting
the contraction property of T or Tµ may be more complicated, and even
impossible. In this book we will deal extensively with such questions and
related issues regarding the solution set of the Bellman equation.

Geometrical Interpretations

We will now interpret the Bellman operators geometrically, starting with Tµ, which is linear as noted earlier. Figure 1.3.2 illustrates its form. Note
here that the functions J and Tµ J are multidimensional. They have as
many scalar components J(x) and (Tµ J)(x), respectively, as there are
states x, but they can only be shown projected onto one dimension. The
cost function Jµ satisfies Jµ = Tµ Jµ , so it is obtained from the intersec-
tion of the graph of Tµ J and the 45 degree line, when Jµ is real-valued.
We interpret the situation where Jµ is not real-valued with lack of system
stability under µ [so µ will be viewed as unstable if we have Jµ (x) = ∞
for some initial states x]. For further discussion of stability issues, see the
book [Ber22].
The form of the Bellman operator T is illustrated in Fig. 1.3.3. Again
the functions J, J * , T J, Tµ J, etc, are multidimensional, but they are shown
projected onto one dimension. The Bellman equation J = T J may have
one or many real-valued solutions. It may also have no real-valued solution
in exceptional situations, as we will discuss later. The figure assumes that
the Bellman equations J = T J and J = Tµ J have a unique real-valued
solution, which is true if T and Tµ are contraction mappings, as is the case
for discounted problems with bounded cost per stage. Otherwise, these
equations may have no solution or multiple solutions within the class of
real-valued functions. The equation J = T J typically has J * as a solution,
but may have more than one solution in cases where either α = 1 or α < 1,
and the cost per stage is unbounded.

Example 1.3.2 (A Two-State and Infinite Controls Problem)

Let us consider the mapping T for a problem that involves two states, 1 and
2, but an infinite number of controls. In particular, the control space at both


Figure 1.3.1 Geometric illustrations of the Bellman operators Tµ and T for states 1 and 2 in Example 1.3.1; cf. Eqs. (1.30)-(1.33). The problem's transition probabilities are: p11(u) = 0.3, p12(u) = 0.7, p21(u) = 0.4, p22(u) = 0.6, p11(v) = 0.6, p12(v) = 0.4, p21(v) = 0.9, p22(v) = 0.1. The stage costs are g(1, u, 1) = 3, g(1, u, 2) = 10, g(2, u, 1) = 0, g(2, u, 2) = 6, g(1, v, 1) = 7, g(1, v, 2) = 5, g(2, v, 1) = 3, g(2, v, 2) = 12. The discount factor is α = 0.9, and the optimal costs are J∗(1) = 50.59 and J∗(2) = 47.41. The optimal policy is µ∗(1) = v and µ∗(2) = u. The figure also shows the one-dimensional “slices” of T that pass through J∗.
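
The data in the caption of Fig. 1.3.1 can be checked numerically with a few lines of code. A minimal Python sketch of VI for this two-state, two-control example, based on Eqs. (1.30)-(1.33) (the code itself is of course not part of the text):

import numpy as np

alpha = 0.9
# p[control][x, y] and g[control][x, y] for controls u, v and states 1, 2 (0-indexed here).
p = {'u': np.array([[0.3, 0.7], [0.4, 0.6]]), 'v': np.array([[0.6, 0.4], [0.9, 0.1]])}
g = {'u': np.array([[3.0, 10.0], [0.0, 6.0]]), 'v': np.array([[7.0, 5.0], [3.0, 12.0]])}

def T(J):
    Q = {c: (p[c] * (g[c] + alpha * J)).sum(axis=1) for c in ('u', 'v')}   # Q[c][x]
    TJ = np.minimum(Q['u'], Q['v'])
    policy = ['u' if Q['u'][x] <= Q['v'][x] else 'v' for x in range(2)]
    return TJ, policy

J = np.zeros(2)
for _ in range(500):                          # value iteration J_{k+1} = T J_k
    J, mu = T(J)
print(J, mu)   # approximately [50.59, 47.41] with mu = ['v', 'u'], as stated in the caption
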


Figure 1.3.2 Geometric interpretation of the linear Bellman operator Tµ and the
corresponding Bellman equation. The graph of Tµ is a plane in the space ! × !,
and when projected on a one-dimensional plane that corresponds to a single state
and passes through Jµ , it becomes a line. Then there are three cases:
(a) The line has slope less than 45 degrees, so it intersects the 45-degree line at
a unique point, which is equal to Jµ , the solution of the Bellman equation
J = Tµ J. This is true if Tµ is a contraction mapping, as is the case for
discounted problems with bounded cost per stage.
(b) The line has slope greater than 45 degrees. Then it intersects the 45-degree line at a unique point, which is a solution of the Bellman equation J = Tµ J, but is not equal to Jµ. Then Jµ is not real-valued; we consider such a policy µ to be unstable.
(c) The line has slope exactly equal to 45 degrees. This is an exceptional case
where the Bellman equation J = Tµ J has an infinite number of real-valued
solutions or no real-valued solution at all; we will provide examples where
this occurs later.

states is the unit interval, U (1) = U (2) = [0, 1]. Here (T J)(1) and (T J)(2)
are given by

$$(TJ)(1) = \min_{u\in[0,1]} \big\{ g_1 + r_{11} u^2 + r_{12}(1-u)^2 + \alpha u J(1) + \alpha(1-u) J(2) \big\},$$
$$(TJ)(2) = \min_{u\in[0,1]} \big\{ g_2 + r_{21} u^2 + r_{22}(1-u)^2 + \alpha u J(1) + \alpha(1-u) J(2) \big\}.$$


Figure 1.3.3 Geometric interpretation of the Bellman operator T , and the cor-
responding Bellman equation. For a fixed x, the function (T J)(x) can be written
as minµ (Tµ J)(x), so it is concave as a function of J. The optimal cost function
J ∗ satisfies J ∗ = T J ∗ , so it is obtained from the intersection of the graph of T J
and the 45 degree line shown, assuming J ∗ is real-valued.
Note that the graph of T lies below the graph of every operator Tµ , and
is in fact obtained as the lower envelope of the graphs of Tµ as µ ranges over
the set of policies M. In particular, for any given function J̃, for every x, the value (T J̃)(x) is obtained by finding a support hyperplane/subgradient of the graph of the concave function (T J)(x) at J̃, as shown in the figure. This support hyperplane is defined by the control µ̃(x) of a policy µ̃ that attains the minimum of (Tµ J̃)(x) over µ:
$$\tilde\mu(x) \in \arg\min_{\mu\in\mathcal{M}}\, (T_\mu \tilde J)(x)$$
(there may be multiple policies attaining this minimum, defining multiple support hyperplanes). This construction also shows how the minimization
$$(T\tilde J)(x) = \min_{\mu\in\mathcal{M}}\, (T_\mu \tilde J)(x)$$
corresponds to a linearization of the mapping T at the point J̃.

The control u at each state x = 1, 2 has the meaning of a probability that we must select at that state. In particular, we control the probabilities u and
(1 − u) of moving to states y = 1 and y = 2, at a control cost that is quadratic
in u and (1 − u), respectively. For this problem (T J)(1) and (T J)(2) can be
calculated in closed form, so they are easy to plot and understand. They are
piecewise quadratic, unlike the corresponding plots of Fig. 1.3.1, which are
piecewise linear; see Fig. 1.3.4.

Figure 1.3.4 Illustration of the Bellman operator T for states 1 and 2 in Example
1.3.2. The parameter values are g1 = 5, g2 = 3, r11 = 3, r12 = 15, r21 = 9,
r22 = 1, and the discount factor is α = 0.9. The optimal costs are J ∗ (1) = 49.7
and J ∗ (2) = 40.0, and the optimal policy is µ∗ (1) = 0.59 and µ∗ (2) = 0. The
figure also shows the one-dimensional slices of the operators at J(1) = 15 and
J(2) = 30, together with the corresponding 45-degree lines.

Visualization of Value Iteration


The operator notation simplifies algorithmic descriptions, derivations, and
proofs related to DP. For example, the value iteration (VI) algorithm can
be written in the compact form

Jk+1 = T Jk , k = 0, 1, . . . ,

as illustrated in Fig. 1.3.5. Moreover, the VI algorithm for a given policy µ can be written as

Jk+1 = Tµ Jk , k = 0, 1, . . . ,

and it can be similarly interpreted, except that the graph of the function
Tµ J is linear. Also we will see shortly that there is a similarly compact
description for the policy iteration algorithm.

1.3.2 Approximation in Value Space and Newton’s Method


Let us now interpret approximation in value space in terms of abstract geometric constructions. Here we approximate J∗ with some function J̃, and we obtain by minimization a corresponding policy, called a one-step lookahead policy. In particular, for a given J̃, a one-step lookahead policy µ̃ is characterized by the equation
$$T_{\tilde\mu} \tilde J = T\tilde J,$$


Figure 1.3.5 Geometric interpretation of the VI algorithm Jk+1 = T Jk, starting from some initial function J0. Successive iterates are obtained through the staircase construction shown in the figure. The VI algorithm Jk+1 = Tµ Jk for a given policy µ can be similarly interpreted, except that the graph of the function Tµ J is linear.

as in Fig. 1.3.6. This equation implies that the graph of Tµ̃ J just touches the graph of T J at J̃, as shown in the figure. Moreover, for each state x ∈ X the hyperplane Hµ̃(x)
$$H_{\tilde\mu}(x) = \big\{ \big(J(x), \xi\big) \mid (T_{\tilde\mu} J)(x) \ge \xi \big\},$$
supports from above the convex set
$$\big\{ \big(J(x), \xi\big) \mid (TJ)(x) \ge \xi \big\}$$
at the point $\big(\tilde J(x), (T\tilde J)(x)\big)$ and defines a subgradient of $(T\tilde J)(x)$ at J̃. Note that the one-step lookahead policy µ̃ need not be unique, since T need not be differentiable.

J = Tµ̃ J

is a pointwise (for each x) linearization of the equation

J = TJ
Sec. 1.3 Reinforcement Learning - Approximation in Value Space 37

Tµ̃ J
Corresponds to One-Step Lookahead Policy ˜
One-Step Lookahead Policy Cost
One-Step Lookahead Policy µ̃
TJ

1 J J

also Newton Step


Approximations Result of
Newton step from J˜
J˜ for solving J = T J
Optimal cost Cost of rollout policy ˜
J J∗ = T J∗
0 Prob. = 1
Stability Region 0 J˜ J˜ Jµ̃ = Tµ̃ Jµ̃ 1 J J

One-Step
Cost Approximation Value Space Lookahead Policy Cost l
Approximation
One-Step Lookahead Policy Cost l

Figure 1.3.6 Geometric interpretation of approximation in value space and the one-step lookahead policy µ̃ as a step of Newton's method. Given J̃, we find a policy µ̃ that attains the minimum in the relation
$$T\tilde J = \min_\mu T_\mu \tilde J.$$
This policy satisfies $T\tilde J = T_{\tilde\mu}\tilde J$, so the graphs of T J and Tµ̃ J touch at J̃, as shown. It may not be unique. Because T J has concave components, the equation
$$J = T_{\tilde\mu} J$$
is the linearization of the equation J = T J at J̃. The linearized equation is solved at the typical step of Newton's method to provide the next iterate, which is just Jµ̃.

at J̃, and its solution, Jµ̃, can be viewed as the result of a Newton iteration at the point J̃. In summary, the Newton iterate at J̃ is Jµ̃, the solution of the linearized equation J = Tµ̃ J.†
We may also consider approximation in value space with ℓ-step lookahead using J̃. This is the same as approximation in value space with one-step lookahead using the (ℓ − 1)-fold operation of T on J̃, $T^{\ell-1}\tilde J$. Thus it can be interpreted as a Newton step starting from $T^{\ell-1}\tilde J$, the result of ℓ − 1 value iterations applied to J̃. This is illustrated in Fig. 1.3.7.†
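
In computational terms, approximation in value space with ℓ-step lookahead amounts to applying ℓ − 1 value iterations to J̃ and then performing the one-step lookahead minimization at the result. A minimal Python sketch (with an arbitrary randomly generated finite-state discounted MDP, used only for illustration):

import numpy as np

np.random.seed(4)
n, m, alpha, ell = 4, 3, 0.9, 3
P = np.random.rand(m, n, n); P /= P.sum(axis=2, keepdims=True)   # P[u, x, y] = p_xy(u)
g = np.random.rand(m, n, n)                                      # g(x, u, y) stored as g[u, x, y]

def Q_values(J):
    return (P * (g + alpha * J[None, None, :])).sum(axis=2)      # Q[u, x]

def T(J):
    return Q_values(J).min(axis=0)

# ell-step lookahead: apply ell-1 value iterations to J_tilde, then take the
# one-step lookahead (Newton step) at T^{ell-1} J_tilde.
J_tilde = np.zeros(n)
J_look = J_tilde.copy()
for _ in range(ell - 1):
    J_look = T(J_look)
mu_tilde = Q_values(J_look).argmin(axis=0)
print("lookahead policy (control index per state):", mu_tilde)
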

1.3.3 Policy Iteration and Newton’s Method


Another major class of infinite horizon algorithms is based on policy it-
eration (PI for short). We will discuss several abstract versions of PI in
subsequent chapters, under a variety of assumptions. Generally, each iter-
ation of the PI algorithm starts with a policy (which we call current or base
policy), and generates another policy (which we call new or rollout policy,
respectively). For the stochastic optimal control problem of Example 1.2.1,
given the base policy µ, a policy iteration consists of two phases:

† The classical Newton's method for solving a fixed point problem of the form y = T(y), where y is an n-dimensional vector, operates as follows: At the current iterate yk, we linearize T and find the solution yk+1 of the corresponding linear fixed point problem. Assuming T is differentiable, the linearization is obtained by using a first order Taylor expansion:
$$y_{k+1} = T(y_k) + \frac{\partial T(y_k)}{\partial y}\,(y_{k+1} - y_k),$$
where ∂T(yk)/∂y is the n × n Jacobian matrix of T evaluated at the vector yk. The most commonly given convergence rate property of Newton's method is
quadratic convergence. It states that near the solution y ∗ , we have

$$\|y_{k+1} - y^*\| = O\big(\|y_k - y^*\|^2\big),$$
where ‖ · ‖ is the Euclidean norm, and holds assuming the Jacobian matrix ex-
ists and is Lipschitz continuous (see [Ber16], Section 1.4). There are extensions
of Newton’s method that are based on solving a linearized system at the cur-
rent iterate, but relax the differentiability requirement to piecewise differentiabil-
ity, and/or component concavity, while maintaining the superlinear convergence
property of the method.
The structure of the Bellman operators (1.28) and (1.27), with their mono-
tonicity and concavity properties, tends to enhance the convergence and rate of
convergence properties of Newton’s method, even in the absence of differentiabil-
ity, as evidenced by the convergence analysis of PI, and the extensive favorable
experience with rollout, PI, and MPC. In this connection, it is worth noting that
in the case of Markov games, where the concavity property does not hold, the
PI method may oscillate, as shown by Pollatschek and Avi-Itzhak [PoA69], and
needs to be modified to restore its global convergence; see the author’s paper
[Ber21c]. We will discuss abstract versions of game and minimax contexts in Chapter 5.
† There are variants of Newton's method that involve combinations of first order iterative methods, such as the Gauss-Seidel and Jacobi algorithms, and Newton's method; they belong to the general family of Newton-SOR methods (SOR stands for “successive over-relaxation”); see the classic book by Ortega and Rheinboldt [OrR70] (Section 13.4).


Figure 1.3.7 Geometric interpretation of approximation in value space with ℓ-step lookahead (in this figure ℓ = 3). It is the same as approximation in value space with one-step lookahead using $T^{\ell-1}\tilde J$ as cost approximation. It can be viewed as a Newton step at the point $T^{\ell-1}\tilde J$, the result of ℓ − 1 value iterations applied to J̃. Note that as ℓ increases the cost function Jµ̃ of the ℓ-step lookahead policy µ̃ approaches more closely the optimal J∗, and that $\lim_{\ell\to\infty} J_{\tilde\mu} = J^*$.

(a) Policy evaluation, which computes the cost function Jµ . One possi-
bility is to solve the corresponding Bellman equation
$$J_\mu(x) = E\big\{ g\big(x,\mu(x),w\big) + \alpha J_\mu\big(f(x,\mu(x),w)\big) \big\}, \qquad \text{for all } x. \tag{1.34}$$
However, the value Jµ (x) for any x can also be computed by Monte
Carlo simulation, by averaging over many randomly generated tra-
jectories the cost of the policy starting from x. Other possibilities
include the use of specialized simulation-based methods, based on
the projected and aggregation Bellman equations discussed in Sec-
tion 1.2.4, for which there is extensive literature (see e.g., the books
[BeT96], [SuB98], [Ber12a], [Ber19b]).
(b) Policy improvement, which computes the rollout policy µ̃ using the one-step lookahead minimization
$$\tilde\mu(x) \in \arg\min_{u\in U(x)} E\big\{ g(x,u,w) + \alpha J_\mu\big(f(x,u,w)\big) \big\}, \qquad \text{for all } x. \tag{1.35}$$


Figure 1.3.8 Geometric interpretation of a single policy iteration. Starting from the stable current policy µk, it evaluates the corresponding cost function Jµk, and computes the next policy µk+1 according to Tµk+1 Jµk = T Jµk. The corresponding cost function Jµk+1 is obtained as the solution of the linearized equation J = Tµk+1 J, so it is the result of a Newton step for solving the Bellman equation J = T J, starting from Jµk. Note that in policy iteration, the Newton step always starts at a function Jµ, which satisfies Jµ ≥ J∗.

It is generally expected (and can be proved under mild conditions)


that the rollout policy is improved in the sense that Jµ̃ (x) ≤ Jµ (x)
for all x.
Thus the PI process generates a sequence of policies {µk }, by obtain-
ing µk+1through a policy improvement operation using Jµk in place of Jµ
in Eq. (1.35), which is obtained through policy evaluation of the preceding
policy µk using Eq. (1.34). In subsequent chapters, we will show under ap-
propriate assumptions that general forms of PI have interesting and often
solid convergence properties, which may hold even when the method is im-
plemented (with appropriate modifications) in unconventional computing
environments, involving asynchronous distributed computation.
In terms of our abstract notation, the PI algorithm can be written
in a compact form. For the generated policy sequence {µk }, the policy
evaluation phase obtains Jµk from the equation
Jµk = Tµk Jµk , (1.36)
while the policy improvement phase obtains µk+1 through the equation
Tµk+1 Jµk = T Jµk . (1.37)
Sec. 1.4 Organization of the Book 41

As Fig. 1.3.8 illustrates, PI can be viewed as Newton’s method for solv-


ing the Bellman equation in the function space of cost functions J. In
particular, the policy improvement Eq. (1.37) is the Newton step starting
from Jµk , and yields µk+1 as the corresponding one-step lookahead/rollout
policy.
The interpretation of PI as a form of Newton’s method has a long his-
tory, for which we refer to the original works for linear quadratic problems
by Kleinman [Klei68],† and for finite-state infinite horizon discounted and
Markov game problems by Pollatschek and Avi-Itzhak [PoA69] (who also
showed that the method may oscillate in the game case; see the discussion
in Chapter 5).

1.3.4 Approximation in Value Space for General Abstract


Dynamic Programming

Let us now consider the general case where the mapping Tµ is not assumed
linear for all stationary policies µ ∈ M. In this case we still have the
alternative description of T
(T J)(x) = min (Tµ J)(x), for all x,
µ∈M

but T need not be concave, i.e., for some x ∈ X, the function (T J)(x) may
not be concave as a function of J. We illustrate this fact in Fig. 1.3.9.
The nonlinearity of the mapping Tµ can have profound consequences
on the validity of the PI algorithm and its interpretation in terms of New-
ton’s method. A prominent case where this is so arises in minimax problems
and related two-person zero sum game settings (cf. Example 1.2.5). We will
discuss this case in Chapter 5, where we will introduce modifications to the
PI algorithm that restore its convergence property.
We note, however, that it is possible that the mappings Tµ are non-
linear and convex, but that T has concave and differentiable components
(T J)(x), in which case the Newton step interpretation applies. This occurs
in particular in the important case of zero-sum dynamic games involving a
linear system and a quadratic cost function.

1.4 ORGANIZATION OF THE BOOK

The examples in the preceding sections demonstrate that while the mono-
tonicity assumption is satisfied for most DP models, the contraction as-
sumption may or may not hold. In particular, the contraction assumption
† This was part of Kleinman’s Ph.D. thesis [Kle67] at M.I.T., supervised by
M. Athans. Kleinman gives credit for the one-dimensional version of his results to
Bellman and Kalaba [BeK65]. Note also that the first proposal of the PI method
was given by Bellman in his classic book [Bel57], under the name “approximation
in policy space.”
42 Introduction Chap. 1

Tµ! J

Tµ J

T J = min{Tµ , Tµ! }

(1) = 0 1 J J

Figure 1.3.9 Geometric interpretation of the Bellman operator, in the general


case where the policy mappings Tµ are not linear. The figure illustrates the case
of two policies µ and µ# , whose mappings Tµ and Tµ! are piecewise linear and
! "
convex. In this case the mapping T , given by (T J)(x) = min Tµ J(x), Tµ! J(x) ,
is piecewise linear, but it is neither convex nor concave, and the Newton step
interpretation breaks down; see also Chapter 5.

is satisfied for the mapping H in Examples 1.2.1-1.2.5, provided there is


discounting and that the cost per stage is bounded. However, it need not
hold in the SSP Example 1.2.6, the multiplicative Example 1.2.8, and the
affine monotonic Example 1.2.9.
The book’s central theme is that the presence or absence of mono-
tonicity and contraction fundamentally shapes the analytical and algorith-
mic theories for abstract DP. In our development, with few exceptions, we
will assume that monotonicity holds. Consequently, the book is organized
around the presence or absence of the contraction property. In the next
three chapters we will discuss three types of DP models.
(a) Contractive models: These models, discussed in Chapter 2, have
the richest and strongest algorithmic theory, and serve as a bench-
mark for other models. Notable examples include discounted stochas-
tic optimal control problems (cf. Example 1.2.1), finite-state dis-
counted MDP (cf. Example 1.2.2), and some special types of SSP
problems (cf. Example 1.2.6).
(b) Semicontractive models: In these models, Tµ is monotone but
is not a contraction for all µ ∈ M. Most practical deterministic,
stochastic, and minimax-type shortest path problems fall into this
Sec. 1.4 Organization of the Book 43

category. One challenge here is that, under certain conditions, some


of the problem’s cost functions may take the values +∞ or −∞, and
the mappings Tµ and T must be able to deal with such functions.
The distinguishing feature of semicontractive models is the separa-
tion of policies into those that “behave well” within our optimization
framework and those that do not. Contraction-based analysis is in-
sufficient to deal with “ill-behaved” policies, so we introduce a notion
of “regularity,” which is connected to contraction, but is more gen-
eral. In particular, a policy µ is considered “regular” if the dynamic
system underlying Tµ has Jµ has an asymptotically stable equilibrium
within a suitable domain. Our models and analysis are patterned to
a large extent after the SSP problems of Example 1.2.6 (the regular
µ correspond to the proper policies). We show that the (restricted)
optimal cost function over just the regular policies can typically be
obtained with value and policy iteration algorithms. By contrast, the
optimal cost function over all policies J * may not be obtainable by
these algorithms, and indeed J * may not even be a solution of Bell-
man’s equation, as we will show with a simple example in Section
3.1.2.
The key idea is that under certain conditions, the restricted opti-
mization (the one that optimizes over the regular policies only) is
well behaved, both analytically and algorithmically. Under additional
conditions, which directly or indirectly ensure the existence of an opti-
mal regular policy, we obtain semicontractive models with properties
nearly as robust as contractive models.
In Chapter 3, we develop the basic theory of semicontractive models
for the case where the regular policies are stationary, while in Chapter
4 (Section 4.4), we extend the notion of regularity to nonstationary
policies. Moreover, we illustrate the theory with a variety of interest-
ing shortest path-type problems (stochastic, minimax, affine mono-
tonic, and risk sensitive/exponential cost), linear-quadratic optimal
control problems, and deterministic and stochastic optimal control
problems.
(c) Noncontractive models: These models rely on just the monotonic-
ity property of Tµ , and are more complex than the preceding ones.
Like semicontractive models, the problem’s cost functions may take
the values of +∞ or −∞, and in fact the optimal cost function may
take the values ∞ and −∞ as a matter of course (rather than on
an exceptional basis, as in semicontractive models). This complexity
presents considerable challenges, as much of the contractive model
theory either does not extend or does so in a weaker form only. For
instance, the fixed point equation J = T J may lack a unique solu-
tion, value iteration may succeed starting with some functions but
44 Introduction Chap. 1

not with others, and policy iteration may fail altogether. Some of
these issues may be mitigated when additional structure is present,
which we discuss in Sections 4.4-4.6, focusing on noncontractive mod-
els that also have some semicontractive structure, and corresponding
favorable properties.
Examples of DP problems from each of the model categories above,
primarily special cases of the specific DP models discussed in Section 1.2,
are scattered throughout the book. They serve both to illustrate the theory
and its exceptions, and to highlight the beneficial role of additional special
structure.
We finally note some other types of models where there are restric-
tions to the set of policies, i.e., M may be a strict subset of the set of
functions µ : X #→ U with µ(x) ∈ U (x) for all x ∈ X. Such restrictions
may include measurability (needed to establish a mathematically rigorous
probabilistic framework) or special structure that enhances the characteri-
zation of optimal policies and facilitates their computation. These models
were treated in Chapter 5 of the first edition of this book, and also in
Chapter 6 of [BeS78].†

Algorithms

Our discussion of algorithms centers on abstract forms of value and policy


iteration, and is organized along three characteristics: exact, approximate,
and asynchronous. The exact algorithms represent idealized versions, the
approximate represent implementations that use approximations of various
kinds, and the asynchronous involve irregular computation orders, where
the costs and controls at different states are updated at different iterations
(for example the cost of a single state being iterated at a time, as in Gauss-
Seidel and other methods; see [Ber12a] for several examples of distributed
asynchronous DP algorithms).
Approximate and asynchronous implementations have been the sub-
ject of intensive investigations since the 1980s, in the context of the solution
of large-scale problems. Some of this methodology relies on the use of sim-
ulation, which is asynchronous by nature and is prominent in approximate
DP. Generally, the monotonicity and sup-norm contraction structures of
many prominent DP models favors the use of asynchronous algorithms in
DP, as first shown in the author’s paper [Ber82], and discussed at vari-
ous points in this book: Section 2.6 for contractive models, Section 3.6 for
semicontractive models, and Sections 5.3-5.4 for minimax problems and
zero-sum games.

† Chapter 5 of the first edition is accessible from the author’s web site and
the book’s web page, and uses terminology and notation that are consistent with
the present edition.
Sec. 1.5 Notes, Sources, and Exercises 45

1.5 NOTES, SOURCES, AND EXERCISES


This monograph is written in a mathematical style that emphasizes sim-
plicity and abstraction. According to the relevant Wikipedia article:
“Abstraction in mathematics is the process of extracting the underlying
essence of a mathematical concept, removing any dependence on real world
objects with which it might originally have been connected, and generaliz-
ing it so that it has wider applications or matching among other abstract
descriptions of equivalent phenomena ... The advantages of abstraction
are:
(1) It reveals deep connections between different areas of mathematics.
(2) Known results in one area can suggest conjectures in a related area.
(3) Techniques and methods from one area can be applied to prove results
in a related area.
One disadvantage of abstraction is that highly abstract concepts can be
difficult to learn. A degree of mathematical maturity and experience may
be needed for conceptual assimilation of abstractions.”
Consistent with the preceding view of abstraction, our aim has been
to construct a minimalist framework, where the important mathematical
structures stand out, while the application context is deliberately blurred.
Of course, our development has to pass the test of relevance to applica-
tions. In this connection, we note that our presentation has integrated the
relation of our abstract DP models with the applications of Section 1.2,
and particularly discounted stochastic optimal control models (Chapter 2),
shortest path-type models (Chapters 3 and 4), undiscounted determinis-
tic and stochastic optimal control models (Chapter 4), and minimax and
zero-sum game problems (Chapter 5). We have given illustrations of the
abstract mathematical theory using these models and others throughout
the text. A much broader and accessible account of applications is given
in the author’s two-volume DP textbook.
Section 1.2: The abstract style of mathematical development has a long
history in DP. In particular, the connection between DP and fixed point the-
ory may be traced to Shapley [Sha53], who exploited contraction mapping
properties in analysis of the two-player dynamic game model of Example
1.2.4. Since then, the underlying contraction properties of discounted DP
problems with bounded cost per stage have been explicitly or implicitly
used by most authors that have dealt with the subject. Moreover, the
value of the abstract viewpoint as the basis for economical and insightful
analysis has been widely recognized.
An abstract DP model, based on unweighted sup-norm contraction
assumptions, was introduced in the paper by Denardo [Den67]. This model
pointed to the fundamental connections between DP and fixed point the-
ory, and provided generality and insight into the principal analytical and
46 Introduction Chap. 1

algorithmic ideas underlying the discounted DP research up to that time.


Abstract DP ideas were also researched earlier, notably in the paper by
Mitten (Denardo’s Ph.D. thesis advisor) [Mit64]; see also Denardo and
Mitten [DeM67]. The properties of monotone contractions were also used
in the analysis of sequential games by Zachrisson [Zac64].
Two abstract DP models that rely only on monotonicity properties
were given by the author in the papers [Ber75], [Ber77]. They were pat-
terned after the negative cost DP problem of Blackwell [Bla65] and the
positive cost DP problem of Strauch [Str66] (see the monotone decreasing
and monotone increasing models of Section 4.3). These two abstract DP
models, together with the finite horizon models of Section 4.2, were used
extensively in the book by Bertsekas and Shreve [BeS78] for the analysis
of both discounted and undiscounted DP problems, ranging over MDP,
minimax, multiplicative, and Borel space models.
Extensions of the monotonicity-based analysis of the author’s paper
[Ber77] were given by Verdu and Poor [VeP87], who introduced additional
structure for developing backward and forward value iterations, and by
Szepesvari [Sze98a, Sze98b], who incorporated non-Markovian policies into
the abstract DP framework. The model from [Ber77] also provided a foun-
dation for asynchronous value and policy iteration methods for abstract
contractive and noncontractive DP models in Bertsekas [Ber82] and Bert-
sekas and Yu [BeY10]. An extended contraction framework, whereby the
sup-norm contraction norm is allowed to be weighted, was given in the au-
thor’s paper [Ber12b]. Another line of related research involving abstract
DP mappings that are not necessarily scalar-valued was initiated by Mit-
ten [Mit74], and was followed up by a number of authors, including Sobel
[Sob75], Morin [Mor82], and Carraway and Morin [CaM88].
Section 1.3: The central role of Newton’s method for understanding ap-
proximation value space, rollout, and other reinforcement learning and ap-
proximate DP methods, was articulated in the author’s monograph [Ber20],
and was described in more detail in the book [Ber22].
Section 1.4: Generally, noncontractive total cost DP models with some
special structure beyond monotonicity, fall in three major categories: mono-
tone increasing models, principally represented by positive cost DP, mono-
tone decreasing models, principally represented by negative cost DP, and
transient models, exemplified by the SSP model of Example 1.2.6, where
the decision process terminates after a period that is random and subject to
control. Abstract DP models patterned after the first two categories have
been known since the author’s papers [Ber75], [Ber77], and are further
discussed in Section 4.3.
The semicontractive models, further discussed Chapter 3 and Sec-
tions 4.4-4.6, are patterned after the third category. They were introduced
and analyzed in the first edition of this book, as well as the subsequent
series of papers and reports, [Ber15], [Ber16a], [BeY16], [Ber17b], [Ber17c],
Sec. 1.5 Notes, Sources, and Exercises 47

[Ber17d], [Ber19c]. Their analysis is based on the idea of separating poli-


cies into those that are well-behaved (these are called regular , and have
contraction-like properties) and those that are not (these are called irregu-
lar ). The objective of the analysis is then to explain the detrimental effects
of the irregular policies, and to delineate the kind of model structure that
can limit these effects. As far as the author knows, this idea is new in the
context of abstract DP. One of the aims of the present monograph is to
develop this idea and to show that it leads to an important and insightful
paradigm for conceptualization and solution of major classes of practical
DP problems.

EXERCISES

1.1 (Multistep Contraction Mappings)

This exercise shows how starting with an abstract mapping, we can obtain mul-
tistep mappings with the same fixed points and a stronger contraction modulus.
Consider a set of mappings Tµ : B(X) (→ B(X), µ ∈ M, satisfying the con-
traction Assumption 1.2.2, let m be a positive integer, and let Mm be the set
of m-tuples ν = (µ0 , . . . , µm−1 ), where µk ∈ M, k = 1, . . . , m − 1. For each
ν = (µ0 , . . . , µm−1 ) ∈ Mm , define the mapping T ν , by

T ν J = Tµ0 · · · Tµm−1 J, ∀ J ∈ B(X).

Show the contraction properties

/T ν J − T ν J # / ≤ αm /J − J # /, ∀ J, J # ∈ B(X), (1.39)

and
/T J − T J # / ≤ αm /J − J # /, ∀ J, J # ∈ B(X), (1.40)
where T is defined by

(T J)(x) = inf (Tµ0 · · · Tµm−1 J)(x), ∀ J ∈ B(X), x ∈ X.


(µ0 ,...,µm−1 )∈Mm

Solution: By the contraction property of Tµ0 , . . . , Tµm−1 , we have for all J, J # ∈


B(X),
/T ν J − T ν J # / = /Tµ0 · · · Tµm−1 J − Tµ0 · · · Tµm−1 J # /
≤ α/Tµ1 · · · Tµm−1 J − Tµ1 · · · Tµm−1 J # /
≤ α2 /Tµ2 · · · Tµm−1 J − Tµ2 · · · Tµm−1 J # /
..
.
≤ αm /J − J # /,
48 Introduction Chap. 1

thus showing Eq. (1.39).


We have from Eq. (1.39)

(Tµ0 · · · Tµm−1 J)(x) ≤ (Tµ0 · · · Tµm−1 J # )(x) + αm /J − J # / v(x), ∀ x ∈ X,

and by taking infimum of both sides over (Tµ0 · · · Tµm−1 ) ∈ Mm and dividing by
v(x), we obtain

(T J)(x) − (T J # )(x)
≤ αm /J − J # /, ∀ x ∈ X.
v(x)

Similarly
(T J # )(x) − (T J)(x)
≤ αm /J − J # /, ∀ x ∈ X,
v(x)
and by combining the last two relations and taking supremum over x ∈ X, Eq.
(1.40) follows.

1.2 (Relation of Temporal Difference Methods and Proximal


Algorithms [Ber16b], [Ber18c])

The purpose of this exercise is establish a close connection between the map-
pings underlying temporal difference and proximal methods (cf. Section 1.2.5).
Consider a linear mapping of the form

T J = AJ + b,

where b is a vector in !n , and A is an n × n matrix with eigenvalues strictly


λ
within the unit circle. Let λ ∈ (0, 1) and c = 1−λ , and consider the multistep
(λ)
mapping T given by

#
T (λ) J = (1 − λ) λ" T "+1 J, J ∈ !n ,
"=0

and the proximal mapping P (c) given by


8−1 7
c+1 1
7 8
P (c) J = I−A b+ J , J ∈ !n ;
c c

cf. Eq. (1.23) [equivalently, for a given J, P (c) J is the unique vector Y ∈ !n that
solves the equation
1
Y − T Y = (J − Y ),
c
(cf. Fig. 1.5.1)].
(a) Show that P (c) is given by

#
P (c) = (1 − λ) λ" T " ,
"=0
Sec. 1.5 Notes, Sources, and Exercises 49

Graph of the mapping Y − T Y V


c λ
λ= , c=
c+1 1−λ
Jˆ − T Jˆ = 1c (J − J)
ˆ

1
δ c (J −Y)

ˆ J* J Y
T (λ) J = T Jˆ Jˆ = P (c) J J

Figure 1.5.1. Illustration of the iterates T (λ) J and P (c) J for finding the fixed
point J ∗ of a linear mapping T . $Given %J, we find the proximal iterate Jˆ =
P (c) J and then add the amount 1c Jˆ − J to obtain T (λ) J = T P (c) J. If T is a
contraction mapping, T (λ) J is closer to J ∗ than P (c) J.

and can be written as


(λ) (λ)
P (c) J = A J +b ,
where
∞ ∞
(λ) (λ)
# #
A = (1 − λ) λ" A " , b = λ"+1 A" b.
"=0 "=0

(b) Verify that


T (λ) J = A(λ) J + b(λ) ,
where

# ∞
#
A(λ) = (1 − λ) λ" A"+1 , b(λ) = λ" A" b,
"=0 "=0

and show that


T (λ) = T P (c) = P (c) T, (1.41)
n
and that for all J ∈ ! ,
c + 1 $ (c)
P (c) J = J + λ T (λ) J − J , T (λ) J = J +
$ % %
P J −J . (1.42)
c
Thus T (λ) J is obtained by extrapolation along the line segment P (c) J − J,
as illustrated in Fig. 1.5.1. Note that since T is a contraction mapping,
T (λ) J is closer to J ∗ than P (c) J.
(c) Show that for a given J ∈ !n , the multistep and proximal iterates T (λ) J
and P (c) J are the unique fixed points of the contraction mappings WJ and
W J given by
WJ Y = (1 − λ)T J + λT Y, W J Y = (1 − λ)J + λT Y, Y ∈ !n ,
50 Introduction Chap. 1

respectively.
(d) Show that the fixed point property of part (c) yields the following formula
for the multistep mapping T (λ) :

T (λ) J = (1 − λA)−1 b + (1 − λ)AJ .


$ %
(1.43)

(e) (Multistep Contraction Property for Nonexpansive A [BeY09] ) Instead of


assuming that A has eigenvalues strictly within the unit circle, assume
that the matrix I − A is invertible and A is nonexpansive [i.e., has all its
eigenvalues within the unit circle (possibly on the unit circle)]. Show that
A(λ) is contractive (i.e., has eigenvalues that lie strictly within the unit
circle) and its eigenvalues have the form


# ζi (1 − λ)
θi = (1 − λ) λ" ζi"+1 = , i = 1, . . . , n, (1.44)
1 − ζi λ
"=0

where ζi , i = 1, . . . , n, are the eigenvalues of A. Note: For an intuitive


explanation of the result, note that the eigenvalues of A(λ) can be viewed
as convex combinations of complex numbers from the unit circle at least
two of which are different from each other, since ζi ,= 1 by assumption
(the nonzero corresponding eigenvalues of A and A2 are different from each
other). As a result the eigenvalues of A(λ) lie strictly within the unit circle.
(f) (Contraction Property of Projected Multistep Mappings) Under the assump-
tions of part (e), show that limλ→1 A(λ) = 0. Furthermore, for any n × n
matrix W , the matrix W A(λ) is contractive for λ sufficiently close to 1.
In particular the projected mapping ΠA(λ) and corresponding projected
proximal mapping (cf. Section 1.2.5) become contractions as λ → 1.

Solution: (a) The inverse in the definition of P (c) is written as

8−1 8−1 ∞
c+1 1
7 7 #
I−A = I−A = λ(I − λA)−1 = λ (λA)" .
c λ
"=0

1 1−λ
Thus, using the equation c
= λ
,

7c + 1 8−1 7 1
8
P (c) J = I −A b+ J
c c

1−λ
# 7 8
=λ (λA)" b + J
λ
"=0
∞ ∞
# #
= (1 − λ) (λA)" J + λ (λA)" b,
"=0 "=0

(λ) (λ) +∞
which is equal to A J + b . The formula P (c) = (1 − λ) "=0
λ" T " follows
from this expression.
Sec. 1.5 Notes, Sources, and Exercises 51

(b) The formula T (λ) J = A(λ) J + b(λ) is verified by straightforward calculation.


We have,
(λ) (λ)
T P (c) J = A A
$ %
J +b +b
∞ ∞
# #
= (1 − λ) λ" A"+1 J + λ"+1 A"+1 b + b = A(λ) J + b(λ)
"=0 "=0
(λ)
=T J,
thus proving the left side of Eq. (1.41). The right side is proved similarly. The in-
terpolation/extrapolation formula (1.42) follows by a straightforward calculation
from the definition of T (λ) . As an example, to show the left side of Eq. (1.42),
we write
J + λ T (λ) J − J = (1 − λ)J + λT (λ) J
$ %
, ∞ ∞
-
# " "+1
# " "
= (1 − λ)J + λ (1 − λ) λA J+ λ Ab
"=0 "=0
, ∞
- ∞
# " "
#
= (1 − λ) J+ λ AJ + λ"+1 A" b
"=1 "=0
(λ) (λ)
=A J +b
= P (c) J.

(c) To show that T (λ) J is the fixed point of WJ , we must verify that

T (λ) J = WJ T (λ) J ,
$ %

or equivalently that
T (λ) J = (1 − λ)T J + λT T (λ) J) = (1 − λ)T J + λT (λ) (T J).
$

The right-hand side, in view of the interpolation formula


(1 − λ)J + λT (λ) J = P (c) J, ∀ x ∈ !n ,
is equal to P (c) (T J), which from the formula T (λ) = P (c) T [cf. part (b)], is equal
to T (λ) J. The proof is similar for W J .
(d) The fixed point property of part (c) states that T (λ) J is the unique solution
of the following equation in Y :
Y = (1 − λ)T J + λT Y = (1 − λ)(AJ + b) + λ(AY + b),
from which the desired relation follows.
(e), (f) The formula (1.44) follows from the expression for A(λ) given in part (b).
This formula can be used to show that the eigenvalues of A(λ) lie strictly within
the unit circle, using also the fact that the matrices Am , m ≥ 1, and A(λ) have
the same eigenvectors (see [BeY09] for details). Moreover, the eigenvalue formula
shows that all eigenvalues of A(λ) converge to 0 as λ → 1, so that limλ→1 A(λ) = 0.
This also implies that W A(λ) is contractive for λ sufficiently close to 1.
52 Introduction Chap. 1

1.3 (State-Dependent Weighted Multistep Mappings [YuB12])

Consider a set of mappings Tµ : B(X) (→ B(X), µ ∈ M, satisfying the contraction


(w)
Assumption 1.2.2. Consider also the mappings Tµ : B(X) (→ B(X) defined by

#
(Tµ(w) J)(x) = w" (x) Tµ" J (x),
$ %
x ∈ X, J ∈ B(X),
"=1

where w" (x) are nonnegative scalars such that for all x ∈ X,

#
w" (x) = 1.
"=1

Show that

( (w)
((Tµ J)(x) − (Tµ(w) J # )(x)(
(
#
≤ w" (x) α" /J − J # /, ∀ x ∈ X,
v(x)
"=1

(w)
so that Tµ is a contraction with modulus

#
ᾱ = sup w" (x) α" ≤ α < 1.
x∈X
"=1

(w)
Moreover, for all µ ∈ M, the mappings Tµ and Tµ have the same fixed point.

Solution: By the contraction property of Tµ , we have for all J, J # ∈ B(X) and


x ∈ X,
( (w)
((Tµ J)(x) − (Tµ(w) J # )(x)(
( (+∞ +∞ (
( w (x)(Tµ" J)(x) − "=1 w" (x)(Tµ" J # )(x)(
"=1 "
=
v(x) v(x)

#
≤ w" (x)/Tµ" J − Tµ" J # /
"=1
,∞ -
# "
≤ w" (x)α /J − J # /,
"=1

(w)
showing the contraction property of Tµ .
Let Jµ be the fixed point of Tµ . By using the relation (Tµ" Jµ )(x) = Jµ (x),
we have for all x ∈ X,

, ∞
-
# #
(Tµ(w) Jµ )(x) Tµ" Jµ
$ %
= w" (x) (x) = w" (x) Jµ (x) = Jµ (x),
"=1 "=1

(w) (w)
so Jµ is the fixed point of Tµ [which is unique since Tµ is a contraction].
2
Contractive Models

Contents

2.1. Bellman’s Equation and Optimality Conditions . . . . . p. 54


2.2. Limited Lookahead Policies . . . . . . . . . . . . . p. 61
2.3. Value Iteration . . . . . . . . . . . . . . . . . . . p. 66
2.3.1. Approximate Value Iteration . . . . . . . . . . . p. 67
2.4. Policy Iteration . . . . . . . . . . . . . . . . . . . p. 70
2.4.1. Approximate Policy Iteration . . . . . . . . . . p. 73
2.4.2. Approximate Policy Iteration Where Policies . . . . .
Converge . . . . . . . . . . . . . . . . . . . p. 75
2.5. Optimistic Policy Iteration and λ-Policy Iteration . . . . p. 77
2.5.1. Convergence of Optimistic Policy Iteration . . . . p. 79
2.5.2. Approximate Optimistic Policy Iteration . . . . . p. 84
2.5.3. Randomized Optimistic Policy Iteration . . . . . . p. 87
2.6. Asynchronous Algorithms . . . . . . . . . . . . . . p. 91
2.6.1. Asynchronous Value Iteration . . . . . . . . . . p. 91
2.6.2. Asynchronous Policy Iteration . . . . . . . . . . p. 98
2.6.3. Optimistic Asynchronous Policy Iteration with a . . . .
Uniform Fixed Point . . . . . . . . . . . . . p. 103
2.7. Notes, Sources, and Exercises . . . . . . . . . . . . p. 110

53
54 Contractive Models Chap. 2

In this chapter we consider the abstract DP model of Section 1.2 under the
most favorable assumptions: monotonicity and weighted sup-norm contrac-
tion. Important special cases of this model are the discounted problems
with bounded cost per stage (Example 1.2.1-1.2.5), the stochastic shortest
path problem of Example 1.2.6 in the case where all policies are proper, as
well as other problems involving special structures.
We first provide some basic analytical results and then focus on two
types of algorithms: value iteration and policy iteration. In addition to
exact forms of these algorithms, we discuss combinations and approximate
versions, as well as asynchronous distributed versions.

2.1 BELLMAN’S EQUATION AND OPTIMALITY CONDITIONS

In this section we recall the abstract DP model of Section 1.2, and derive
some of its basic properties under the monotonicity and contraction as-
sumptions of Section 1.3. We consider a set X of states and a set U of
controls, and for each x ∈ X, a nonempty control constraint set U (x) ⊂ U .
We denote by M the set of all functions µ : X #→ U with µ(x) ∈ U (x)
for all x ∈ X, which we refer to as policies (or “stationary policies,” when
we want to emphasize the distinction from nonstationary policies, to be
discussed later).
We denote by R(X) the set of real-valued functions J : X #→ %. We
have a mapping H : X × U × R(X) #→ % and for each policy µ ∈ M, we
consider the mapping Tµ : R(X) #→ R(X) defined by
! "
(Tµ J)(x) = H x, µ(x), J , ∀ x ∈ X.
We also consider the mapping T defined by
(T J)(x) = inf H(x, u, J) = inf (Tµ J)(x), ∀ x ∈ X.
u∈U(x) µ∈M

[We will use frequently the second equality above, which holds because M
can be viewed as the Cartesian product Πx∈X U (x).] We want to find a
function J * ∈ R(X) such that
J * (x) = inf H(x, u, J * ), ∀ x ∈ X,
u∈U(x)

i.e., to find a fixed point of T within R(X). We also want to obtain a policy
µ∗ ∈ M such that Tµ∗ J * = T J * .
Let us restate for convenience the contraction and monotonicity as-
sumptions of Section 1.2.2.

Assumption 2.1.1: (Monotonicity) If J, J # ∈ R(X) and J ≤ J # ,


then
H(x, u, J) ≤ H(x, u, J # ), ∀ x ∈ X, u ∈ U (x).
Sec. 2.1 Bellman’s Equation and Optimality Conditions 55

Note that the monotonicity assumption implies the following proper-


ties, for all J, J # ∈ R(X) and k = 0, 1, . . ., which we will use extensively:

J ≤ J# ⇒ T kJ ≤ T kJ #, Tµk J ≤ Tµk J # , ∀ µ ∈ M,

J ≤ TJ ⇒ T k J ≤ T k+1 J, Tµk J ≤ Tµk+1 J, ∀ µ ∈ M.


Here T k and Tµk denotes the k-fold composition of T and Tµ , respectively.
For the contraction assumption, we introduce a function v : X #→ %
with
v(x) > 0, ∀ x ∈ X.
We consider the weighted sup-norm
# #
#J(x)#
*J* = sup
x∈X v(x)

on B(X), the space of real-valued functions J on X such that J(x)/v(x) is


bounded over x ∈ X (see Appendix B for a discussion of the properties of
this space).

Assumption 2.1.2: (Contraction) For all J ∈ B(X) and µ ∈ M,


the functions Tµ J and T J belong to B(X). Furthermore, for some
α ∈ (0, 1), we have

*Tµ J − Tµ J # * ≤ α*J − J # *, ∀ J, J # ∈ B(X), µ ∈ M.

The classical DP models where both the monotonicity and contrac-


tion assumptions are satisfied are the discounted finite-state Markovian
decision problem of Example 1.2.2, and the stochastic shortest path prob-
lem of Example 1.2.6 in the special case where all policies are proper; see
the textbook [Ber12a] for an extensive discussion. In the context of these
problems, the fixed point equation J = T J is called Bellman’s equation, a
term that we will use more generally in this book as well. The following
proposition summarizes some of the basic consequences of the contraction
assumption.

Proposition 2.1.1: Let the contraction Assumption 2.1.2 hold. Then:


(a) The mappings Tµ and T are contraction mappings with modulus
α over B(X), and have unique fixed points in B(X), denoted Jµ
and J * , respectively.
56 Contractive Models Chap. 2

(b) For any J ∈ B(X) and µ ∈ M,

lim *J * − T k J* = 0, lim *Jµ − Tµk J* = 0.


k→∞ k→∞

(c) We have Tµ J * = T J * if and only if Jµ = J * .


(d) For any J ∈ B(X),

1 α
*J * − J* ≤ *T J − J*, *J * − T J* ≤ *T J − J*.
1−α 1−α

(e) For any J ∈ B(X) and µ ∈ M,

1 α
*Jµ − J* ≤ *Tµ J − J*, *Jµ − Tµ J* ≤ *Tµ J − J*.
1−α 1−α

Proof: We showed in Section 1.2.2 that T is a contraction with modulus


α over B(X). Parts (a) and (b) follow from Prop. B.1 of Appendix B.
To show part (c), note that if Tµ J * = T J * , then in view of T J * = J * ,
we have Tµ J * = J * , which implies that J * = Jµ , since Jµ is the unique
fixed point of Tµ . Conversely, if J * = Jµ , we have Tµ J * = Tµ Jµ = Jµ =
J * = T J *.
To show part (d), we use the triangle inequality to write for every k,
k
$ k
$
*T k J − J* ≤ *T ! J − T !−1 J* ≤ α!−1 *T J − J*.
!=1 !=1
Taking the limit as k → ∞ and using part (b), the left-hand side inequality
follows. The right-hand side inequality follows from the left-hand side and
the contraction property of T . The proof of part (e) is similar to part (d)
[indeed it
% is the
& special case of part (d) where T is equal to Tµ , i.e., when
U (x) = µ(x) for all x ∈ X]. Q.E.D.

Part (c) of the preceding proposition shows that there exists a µ ∈ M


such that Jµ = J * if and only if the minimum of H(x, u, J * ) over U (x) is
attained for all x ∈ X. Of course the minimum is attained if U (x) is
finite for every x, but otherwise this is not guaranteed in the absence of
additional assumptions. Part (d) provides a useful error bound: we can
evaluate the proximity of any function J ∈ B(X) to the fixed point J * by
applying T to J and computing *T J − J*. The left-hand side inequality of
part (e) (with J = J * ) shows that for every " > 0, there exists a µ" ∈ M
such that *Jµ! − J * * ≤ ", which may be obtained by letting µ" (x) minimize
H(x, u, J * ) over U (x) within an error of (1 − α)" v(x), for all x ∈ X.
Sec. 2.1 Bellman’s Equation and Optimality Conditions 57

The preceding proposition and some of the subsequent results may


also be proved if B(X) is replaced by a closed subset B(X) ⊂ B(X). This is
because the contraction mapping fixed point theorem (Prop. B.1) applies to
closed subsets of complete spaces. For simplicity, however, we will disregard
this possibility in the present chapter.
An important consequence of monotonicity of H, when it holds in
addition to contraction, is that it implies that J * , the unique fixed point
of T , is the infimum over µ ∈ M of Jµ , the unique fixed point of Tµ .

Proposition 2.1.2: Let the monotonicity and contraction Assump-


tions 2.1.1 and 2.1.2 hold. Then

J * (x) = inf Jµ (x), ∀ x ∈ X.


µ∈M

Furthermore, for every " > 0, there exists µ" ∈ M such that

J * (x) ≤ Jµ! (x) ≤ J * (x) + ", ∀ x ∈ X. (2.1)

Proof: We note that the right-hand side of Eq. (2.1) holds by Prop.
2.1.1(e) (see the remark following its proof). Thus inf µ∈M Jµ (x) ≤ J * (x)
for all x ∈ X. To show the reverse inequality as well as the left-hand side
of Eq. (2.1), we note that for all µ ∈ M, we have T J * ≤ Tµ J * , and since
J * = T J * , it follows that J * ≤ Tµ J * . By applying repeatedly Tµ to both
sides of this inequality and by using the monotonicity Assumption 2.1.1,
we obtain J * ≤ Tµk J * for all k > 0. Taking the limit as k → ∞, we see
that J * ≤ Jµ for all µ ∈ M, so that J * (x) ≤ inf µ∈M Jµ (x) for all x ∈ X.
Q.E.D.

Note that without monotonicity, we may have inf µ∈M Jµ (x) < J * (x)
for some x. This is illustrated by the following example.
Example 2.1.1 (Counterexample Without Monotonicity)

Let X = {x1 , x2 }, U = {u1 , u2 }, and let


' (
−αJ(x2 ) if u = u1 , 0 if u = u1 ,
H(x1 , u, J) = H(x2 , u, J) =
−1 + αJ(x1 ) if u = u2 , B if u = u2 ,
where B is a positive scalar. Then it can be seen that
1
J ∗ (x1 ) = − , J ∗ (x2 ) = 0,
1−α
and Jµ∗ = J ∗ where µ∗ (x1 ) = u2 and µ∗ (x2 ) = u1 . On the other hand, for
µ(x1 ) = u1 and µ(x2 ) = u2 , we have Jµ (x1 ) = −αB and Jµ (x2 ) = B, so
Jµ (x1 ) < J ∗ (x1 ) for B sufficiently large.
58 Contractive Models Chap. 2

Optimality over Nonstationary Policies

The connection with DP motivates us to consider the set Π of all sequences


π = {µ0 , µ1 , . . .} with µk ∈ M for all k (nonstationary policies in the DP
context), and define
¯
Jπ (x) = lim sup (Tµ0 · · · Tµk J)(x), ∀ x ∈ X,
k→∞

with J¯ being some function in B(X), where Tµ0 · · · Tµk J denotes the com-
position of the mappings Tµ0 , . . . , Tµk applied to J, i.e.,
! "
Tµ0 · · · Tµk J = Tµ0 Tµ1 · · · (Tµk−1 (Tµk J)) · · · .

Note that under the contraction Assumption 2.1.2, the choice of J¯ in


the definition of Jπ does not matter , since for any two J, J # ∈ B(X), we
have
*Tµ0 Tµ1 · · · Tµk J − Tµ0 Tµ1 · · · Tµk J # * ≤ αk+1 *J − J # *,
so the value of Jπ (x) is independent of J. ¯ Since by Prop. 2.1.1(b), Jµ (x) =
k
limk→∞ (Tµ J)(x) for all µ ∈ M, J ∈ B(X), and x ∈ X, in the DP context
we recognize Jµ as the cost function of the stationary policy {µ, µ, . . .}.
We now claim that under the monotonicity and contraction Assump-
tions 2.1.1 and 2.1.2, J * , which was defined as the unique fixed point of T ,
is equal to the optimal value of Jπ , i.e.,

J * (x) = inf Jπ (x), ∀ x ∈ X.


π∈Π

Indeed, since M defines a subset of Π, we have from Prop. 2.1.2,

J * (x) = inf Jµ (x) ≥ inf Jπ (x), ∀ x ∈ X,


µ∈M π∈Π

while for every π ∈ Π and x ∈ X, we have


¯
Jπ (x) = lim sup (Tµ0 Tµ1 · · · Tµk J)(x) ¯
≥ lim (T k+1 J)(x) = J * (x)
k→∞ k→∞

[the monotonicity Assumption 2.1.1 can be used to show that

Tµ0 Tµ1 · · · Tµk J¯ ≥ T k+1 J,


¯

and the last equality holds by Prop. 2.1.1(b)]. Combining the preceding
relations, we obtain J * (x) = inf π∈Π Jπ (x).
Thus, in DP terms, we may view J * as an optimal cost function over
all policies, including nonstationary ones. At the same time, Prop. 2.1.2
states that stationary policies are sufficient in the sense that the optimal
cost can be attained to within arbitrary accuracy with a stationary policy
[uniformly for all x ∈ X, as Eq. (2.1) shows].
Sec. 2.1 Bellman’s Equation and Optimality Conditions 59

Error Bounds and Other Inequalities

The analysis of abstract DP algorithms and related approximations requires


the use of some basic inequalities that follow from the assumptions of con-
traction and monotonicity. We have obtained two such results in Prop.
2.1.1(d),(e), which assume only the contraction assumption. These results
can be strengthened if in addition to contraction, we have monotonicity.
To this end we first show the following useful characterization.

Proposition 2.1.3: The monotonicity and contraction Assumptions


2.1.1 and 2.1.2 hold if and only if for all J, J # ∈ B(X), µ ∈ M, and
scalar c ≥ 0, we have

J ≤ J# + c v ⇒ Tµ J ≤ Tµ J # + αc v, (2.2)

where v is the weight function of the weighted sup-norm * · *.

Proof: Let the contraction and monotonicity assumptions hold. If J ≤


J # + c v, we have

H(x, u, J) ≤ H(x, u, J # + c v) ≤ H(x, u, J # ) + αc v(x),


∀ x ∈ X, u ∈ U (x),
(2.3)
where the left-side inequality follows from the monotonicity assumption and
the right-side inequality follows from the contraction assumption, which
together with *v* = 1, implies that

H(x, u, J + c v) − H(x, u, J)
≤ α*J + c v − J* = αc.
v(x)

The condition (2.3) implies the desired condition (2.2). Conversely, con-
dition (2.2) for c = 0 yields the monotonicity assumption, while for c =
*J # − J* it yields the contraction assumption. Q.E.D.

We can now derive the following useful variant of Prop. 2.1.1(d),(e),


which involves one-sided inequalities. This variant will be used in the
derivation of error bounds for various computational methods.

Proposition 2.1.4: (Error Bounds Under Contraction and


Monotonicity) Let the monotonicity and contraction Assumptions
2.1.1 and 2.1.2 hold. Then:
60 Contractive Models Chap. 2

(a) For any J ∈ B(X) and c ≥ 0, we have


c
TJ ≤ J + cv ⇒ J* ≤ J + v,
1−α
c
J ≤ TJ + cv ⇒ J ≤ J* + v.
1−α

(b) For any J ∈ B(X), µ ∈ M, and c ≥ 0, we have


c
Tµ J ≤ J + c v ⇒ Jµ ≤ J + v,
1−α
c
J ≤ Tµ J + c v ⇒ J ≤ Jµ + v.
1−α

(c) For all J ∈ B(X), c ≥ 0, and k = 0, 1, . . ., we have

αk c
TJ ≤ J + cv ⇒ J * ≤ T kJ + v,
1−α

αk c
J ≤ TJ + cv ⇒ T kJ ≤ J * + v.
1−α

Proof: (a) We show the first relation. Applying Eq. (2.2) with J # and J
replaced by J and T J, respectively, and taking infimum over µ ∈ M, we
see that if T J ≤ J + c v, then T 2 J ≤ T J + αc v. Proceeding similarly, it
follows that
T ! J ≤ T !−1 J + α!−1 c v.
We now write for every k,
k
$ k
$
T kJ − J = (T ! J − T !−1 J) ≤ α!−1 c v,
!=1 !=1

from which, by taking the limit as k → ∞, we obtain


c
J* ≤ J + v.
1−α
The second relation follows similarly.
(b) This part is the special case of part (a) where T is equal to Tµ .
(c) Similar to the proof of part (a), the inequality

TJ ≤ J + cv
Sec. 2.2 Limited Lookahead Policies 61

implies that for all k we have

T k+1 J ≤ T k J + αk c v.

Applying part (a) with J and c replaced by T k J and αk c, respectively,


we obtain the first desired relation. The second relation follows similarly.
Q.E.D.

2.2 LIMITED LOOKAHEAD POLICIES

In this section, we discuss a basic building block in the algorithmic method-


ology of abstract DP. Given some function J˜ that approximates J * , we
obtain a policy by solving a finite-horizon problem where J˜ is the termi-
nal cost function. The simplest possibility is a one-step lookahead policy µ
defined by
˜
µ(x) ∈ arg min H(x, u, J), x ∈ X. (2.4)
u∈U(x)

Its cost function Jµ was interpreted in Section 1.3.1 as the result of a


Newton iteration that starts from J˜ and aims to solve the Bellman equation
J = T J. The following proposition gives some bounds for its performance.

Proposition 2.2.1: (One-Step Lookahead Error Bounds) Let


the contraction Assumption 2.1.2 hold, and let µ be a one-step looka-
head policy obtained by minimization in Eq. (2.4), i.e., satisfying
Tµ J˜ = T J.
˜ Then

˜ ≤ α
*Jµ − T J* *T J˜ − J*,
˜ (2.5)
1−α

where * · * denotes the weighted sup-norm. Moreover

2α ˜
*Jµ − J * * ≤ *J − J * *, (2.6)
1−α

and
2 ˜
*Jµ − J * * ≤ *T J̃ − J*. (2.7)
1−α

Proof: Equation (2.5) follows from the second relation of Prop. 2.1.1(e)
˜ Also from the first relation of Prop. 2.1.1(e) with J = J * , we
with J = J.
have
1
*Jµ − J * * ≤ *Tµ J * − J * *.
1−α
62 Contractive Models Chap. 2

By using the triangle inequality, and the relations Tµ J˜ = T J˜ and J * = T J * ,


we obtain
˜ + *Tµ J˜ − T J*
*Tµ J * − J * * ≤ *Tµ J * − Tµ J* ˜ + *T J̃ − J * *
˜ + *T J̃ − T J * *
= *Tµ J * − Tµ J*
˜ + α*J˜ − J * *
≤ α*J * − J*
= 2α *J˜ − J * *,

and Eq. (2.6) follows by combining the preceding two relations.


˜
Also, from the first relation of Prop. 2.1.1(d) with J = J,
˜ ≤ 1 *T J̃ − J*.
*J * − J* ˜ (2.8)
1−α
Thus
˜ + *T J˜ − J*
*Jµ − J * * ≤ *Jµ − T J* ˜ + *J˜ − J * *
α ˜ + 1 *T J̃ − J*
≤ *T J˜ − J*
˜ + *T J˜ − J* ˜
1−α 1−α
2
= *T J˜ − J*,
˜
1−α

where the second inequality follows from Eqs. (2.5) and (2.8). This proves
Eq. (2.7). Q.E.D.

Equation (2.5) provides a computable bound on the cost function Jµ


of the one-step lookahead policy. The bound (2.6) says that if the one-step
lookahead approximation J˜ is within " of the optimal, the performance of
the one-step lookahead policy is within

2α"
1−α

of the optimal. Unfortunately, this is not very reassuring when α is close


to 1, in which case the error bound is large relative to ". Nonetheless, the
following example from [BeT96], Section 6.1.1, shows that this bound is
tight, i.e., for any α < 1, there is a problem with just two states where
the error bound is satisfied with equality. What is happening is that an
O(")
! difference" in single stage cost between two controls can generate an
O "/(1 − α) difference in policy costs, yet it can be “nullified” in the fixed
point equation J * = T J * by an O(") difference between J * and J˜.

Example 2.2.1

Consider a discounted optimal control problem with two states, 1 and 2, and
deterministic transitions. State 2 is absorbing, but at state 1 there are two
possible decisions: move to state 2 (policy µ∗ ) or stay at state 1 (policy µ).
Sec. 2.2 Limited Lookahead Policies 63

The cost of each transition is 0 except for the transition from 1 to itself under
policy µ, which has cost 2α", where " is a positive scalar and α ∈ [0, 1) is the
discount factor. The optimal policy µ∗ is to move from state 1 to state 2, and
the optimal cost-to-go function is J ∗ (1) = J ∗ (2) = 0. Consider the vector J˜
˜ = −" and J˜(2) = ", so that
with J(1)

#J˜ − J ∗ # = ",

as assumed in Eq. (2.6) (cf. Prop. 2.2.1). The policy µ that decides to stay
at state 1 is a one-step lookahead policy based on J˜, because

2α" + αJ˜(1) = α" = 0 + αJ˜(2).

We have
2α" 2α
Jµ (1) = = #J˜ − J ∗ #,
1−α 1−α
so the bound of Eq. (2.6) holds with equality.

Multistep Lookahead Policies with Approximations

Let us now consider a more general form of lookahead involving multiple


stages as well as other approximations of the type that we will consider later
in the implementation of various approximate value and policy iteration
algorithms. In particular, we will assume that given any J ∈ B(X), we
cannot compute exactly T J, but instead we can compute J˜ ∈ B(X) and
µ ∈ M such that

*J˜ − T J* ≤ δ, *Tµ J − T J* ≤ ", (2.9)

where δ and " are nonnegative scalars. These scalars are usually unknown,
so the resulting analysis will have a mostly qualitative character.
The case δ > 0 arises when the state space is either infinite or it is
finite but very large. Then instead of calculating (T J)(x) for all states x,
one may do so only for some states and estimate (T J)(x) for the remain-
ing states x by some form of interpolation. Alternatively, one may use
simulation data [e.g., noisy values of (T J)(x) for some or all x] and some
kind of least-squares error fit of (T J)(x) with a function from a suitable
parametric class. The function J˜ thus obtained will satisfy *J˜ − T J* ≤ δ
with δ > 0. Note that δ may not be small in this context, and the resulting
performance degradation may be a primary concern.
Cases where " > 0 may arise when the control space is infinite or
finite but large, and the minimization involved in the calculation of (T J)(x)
cannot be done exactly. Note, however, that it is possible that

δ > 0, " = 0,
64 Contractive Models Chap. 2

and in fact this occurs often in practice. In an alternative scenario, we may


first obtain the policy µ subject to a restriction that it belongs to a certain
subset of structured policies, so it satisfies

*Tµ J − T J* ≤ "

for some " > 0, and then we may set J˜ = Tµ J. In this case we have " = δ
in Eq. (2.9).
In a multistep method with approximations, we are given a posi-
tive integer m and a lookahead function Jm , and we successively compute
(backwards in time) Jm−1 , . . . , J0 and policies µm−1 , . . . , µ0 satisfying

*Jk − T Jk+1 * ≤ δ, *Tµk Jk+1 − T Jk+1 * ≤ ", k = 0, . . . , m− 1. (2.10)

Note that in the context of MDP, Jk can be viewed as an approximation to


the optimal cost function of an (m − k)-stage problem with terminal cost
function Jm . We have the following proposition.

Proposition 2.2.2: (Multistep Lookahead Error Bound) Let


the contraction Assumption 2.1.2 hold. The periodic policy

π = {µ0 , . . . , µm−1 , µ0 , . . . , µm−1 , . . .}

generated by the method of Eq. (2.10) satisfies

2αm " α(" + 2δ)(1 − αm−1 )


*Jπ − J * * ≤ *J m − J ** + + .
1 − αm 1 − αm (1 − α)(1 − αm )
(2.11)

Proof: Using the triangle inequality, Eq. (2.10), and the contraction prop-
erty of T , we have for all k

*Jm−k − T k Jm * ≤ *Jm−k − T Jm−k+1 * + *T Jm−k+1 − T 2 Jm−k+2 *


+ · · · + *T k−1 Jm−1 − T k Jm *
≤ δ + αδ + · · · + αk−1 δ,
(2.12)
showing that

δ(1 − αk )
*Jm−k − T k Jm * ≤ , k = 1, . . . , m. (2.13)
1−α
Sec. 2.2 Limited Lookahead Policies 65

From Eq. (2.10), we have *Jk − Tµk Jk+1 * ≤ δ + ", so for all k

*Jm−k − Tµm−k · · · Tµm−1 Jm * ≤ *Jm−k − Tµm−k Jm−k+1 *


+ *Tµm−k Jm−k+1 − Tµm−k Tµm−k+1 Jm−k+2 *
+ ···
+ *Tµm−k · · · Tµm−2 Jm−1 − Tµm−k · · · Tµm−1 Jm *
≤ (δ + ") + α(δ + ") + · · · + αk−1 (δ + "),

showing that

(δ + ")(1 − αk )
*Jm−k − Tµm−k · · · Tµm−1 Jm * ≤ , k = 1, . . . , m.
1−α
(2.14)
Using the fact *Tµ0 J1 − T J1 * ≤ " [cf. Eq. (2.10)], we obtain

*Tµ0 · · · Tµm−1 Jm − T m Jm * ≤ *Tµ0 · · · Tµm−1 Jm − Tµ0 J1 *


+ *Tµ0 J1 − T J1 * + *T J1 − T m Jm *
≤ α*Tµ1 · · · Tµm−1 Jm − J1 * + " + α*J1 − T m−1 Jm *
α(" + 2δ)(1 − αm−1 )
≤"+ ,
1−α

where the last inequality follows from Eqs. (2.13) and (2.14) for k = m − 1.
From this relation and the fact that Tµ0 · · · Tµm−1 and T m are con-
tractions with modulus αm , we obtain

*Tµ0 · · · Tµm−1 J * − J * * ≤ *Tµ0 · · · Tµm−1 J * − Tµ0 · · · Tµm−1 Jm *


+ *Tµ0 · · · Tµm−1 Jm − T m Jm * + *T mJm − J * *
α(" + 2δ)(1 − αm−1 )
≤ 2αm *J * − Jm * + " + .
1−α

We also have using Prop. 2.1.1(e), applied in the context of the multistep
mapping of Example 1.3.1,

1
*Jπ − J * * ≤ *Tµ0 · · · Tµm−1 J * − J * *.
1 − αm

Combining the last two relations, we obtain the desired result. Q.E.D.

Note that for m = 1 and δ = " = 0, i.e., the case of one-step lookahead
policy µ with lookahead function J1 and no approximation error in the
minimization involved in T J1 , Eq. (2.11) yields the bound


*Jµ − J * * ≤ *J1 − J * *,
1−α
66 Contractive Models Chap. 2

which coincides with the bound (2.6) derived earlier.


Also, in the special case where " = δ and Jk = Tµk Jk+1 (cf. the
discussion preceding Prop. 2.2.2), the bound (2.11) can be strengthened
somewhat. In particular, we have for all k, Jm−k = Tµm−k · · · Tµm−1 Jm , so
the right-hand side of Eq. (2.14) becomes 0 and the preceding proof yields,
with some calculation,

2αm δ αδ(1 − αm−1 )


*Jπ − J * * ≤ m
*Jm − J * * + m
+
1−α 1−α (1 − α)(1 − αm )
2αm δ
= *Jm − J * * + .
1 − αm 1−α
We finally note that Prop. 2.2.2 shows that as the lookahead size m
increases, the corresponding bound for *Jπ −J * * tends to "+α("+2δ)/(1−
α), or
" + 2αδ
lim sup *Jπ − J * * ≤ .
m→∞ 1−α
We will see that this error bound is superior to corresponding error bounds
for approximate versions of value and policy iteration by essentially a factor
1/(1 − α). In practice, however, periodic suboptimal policies, as required
by Prop. 2.2.2, are typically not used.
There is an alternative and often used form of on-line multistep
lookahead, whereby at the current state x we compute a multistep pol-
icy {µ0 , . . . , µm−1 }, we apply the first component µ0 (x) of that policy at
state x, then at the next state x̄ we recompute a new multistep policy
{µ̄0 , . . . , µ̄m−1 }, apply µ̄0 (x̄), etc. However, no error bound similar to the
one of Prop. 2.2.2 is currently known for this type of lookahead.

2.3 VALUE ITERATION

In this section, we discuss value iteration (VI for short), the algorithm
that starts with some J ∈ B(X), and generates T J, T 2 J, . . .. Since T is
a weighted sup-norm contraction under Assumption 2.1.2, the algorithm
converges to J * , and the rate of convergence is governed by

*T k J − J * * ≤ αk *J − J * *, k = 0, 1, . . . .

Similarly, for a given policy µ ∈ M, we have

*Tµk J − Jµ * ≤ αk *J − Jµ *, k = 0, 1, . . . .

From Prop. 2.1.1(d), we also have the error bound


α
*T k+1 J − J * * ≤ *T k+1 J − T k J*, k = 0, 1, . . . .
1−α
Sec. 2.3 Value Iteration 67

This bound does not rely on the monotonicity Assumption 2.1.1.


The VI algorithm is often used to compute an approximation J˜ to
* ˜ over u ∈ U (x)
J , and then to obtain a policy µ by minimizing H(x, u, J)
˜
for each x ∈ X. In other words J and µ satisfy

*J˜ − J * * ≤ γ, Tµ J˜ = T J,
˜

where γ is some positive scalar. Then by using Eq. (2.6), we have

2α γ
*Jµ − J * * ≤ . (2.15)
1−α

If the set of policies is finite, this procedure can be used to compute an


optimal policy with a finite but sufficiently large number of exact VI, as
shown in the following proposition.

Proposition 2.3.1: Let the contraction Assumption 2.1.2 hold and


let J ∈ B(X). If the set of policies M is finite, there exists an integer
k ≥ 0 such that Jµ∗ = J * for all µ∗ and k ≥ k with Tµ∗ T k J = T k+1 J.

Proof: Let M̃ be the set of policies such that Jµ .= J * . Since M̃ is finite,


we have
inf *Jµ − J * * > 0,
µ∈M̃

so by Eq. (2.15), there exists sufficiently small β > 0 such that

*J˜ − J * * ≤ β and Tµ J˜ = T J˜ ⇒ *Jµ − J * * = 0 µ∈⇒


/ M̃.
(2.16)
It follows that if k is sufficiently large so that *T k J − J * * ≤ β, then
/ M̃ so Jµ∗ = J * . Q.E.D.
Tµ∗ T k J = T k+1 J implies that µ∗ ∈

2.3.1 Approximate Value Iteration


We will now consider situations where the VI method may be imple-
mentable only through approximations. In particular, given a function
J, assume that we may only be able to calculate an approximation J˜ to
T J such that ) )
)J˜ − T J ) ≤ δ,

where δ is a given positive scalar. In the corresponding approximate VI


method, we start from an arbitrary bounded function J0 , and we generate
a sequence {Jk } satisfying

*Jk+1 − T Jk * ≤ δ, k = 0, 1, . . . . (2.17)
68 Contractive Models Chap. 2

This approximation may be the result of representing Jk+1 compactly, as a


linear combination of basis functions, through a projection or aggregation
process, as is common in approximate DP (cf. the discussion of Section
1.2.4).
We may also simultaneously generate a sequence of policies {µk } such
that
*Tµk Jk − T Jk * ≤ ", k = 0, 1, . . . , (2.18)
where " is some scalar [which could be equal to 0, as in case of Eq. (2.10),
considered earlier]. The following proposition shows that the corresponding
!
cost functions Jµk “converge” to J * to within an error of order O δ/(1 −
" ! "
α)2 [plus a less significant error of order O "/(1 − α) ].

Proposition 2.3.2: (Error Bounds for Approximate VI) Let


the contraction Assumption 2.1.2 hold. A sequence {Jk } generated by
the approximate VI method (2.17)-(2.18) satisfies

δ
lim sup *Jk − J * * ≤ , (2.19)
k→∞ 1−α

while the corresponding sequence of policies {µk } satisfies

" 2αδ
lim sup *Jµk − J * * ≤ + . (2.20)
k→∞ 1 − α (1 − α)2

Proof: Using the triangle inequality, Eq. (2.17), and the contraction prop-
erty of T , we have

*Jk − T k J0 * ≤ *Jk − T Jk−1 *


+ *T Jk−1 − T 2 Jk−2 * + · · · + *T k−1 J1 − T k J0 *
≤ δ + αδ + · · · + αk−1 δ,

and finally
(1 − αk )δ
*Jk − T k J0 * ≤ , k = 0, 1, . . . . (2.21)
1−α
By taking limit as k → ∞ and by using the fact limk→∞ T k J0 = J * , we
obtain Eq. (2.19).
We also have using the triangle inequality and the contraction prop-
erty of Tµk and T ,

*Tµk J * − J * * ≤ *Tµk J * − Tµk Jk * + *Tµk Jk − T Jk * + *T Jk − J * *


≤ α*J * − Jk * + " + α*Jk − J * *,
Sec. 2.3 Value Iteration 69

while by using also Prop. 2.1.1(e), we obtain


1 " 2α
*Jµk − J * * ≤ *T k J * − J * * ≤ + *Jk − J * *.
1−α µ 1−α 1−α
By combining this relation with Eq. (2.19), we obtain Eq. (2.20). Q.E.D.

The error bound (2.20) relates to stationary policies obtained from


the functions Jk by one-step lookahead. We may also obtain an m-step
periodic policy π from Jk by using m-step lookahead. Then Prop. 2.2.2
shows that the corresponding bound for *Jπ −J * * tends to ("+2αδ)/(1−α)
as m → ∞, which improves on the error bound (2.20) by a factor 1/(1 − α).
Finally, let us note that the error bound of Prop. 2.3.2 is predicated
upon generating a sequence {Jk } satisfying *Jk+1 − T Jk * ≤ δ for all k [cf.
Eq. (2.17)]. Unfortunately, some practical approximation schemes guar-
antee the existence of such a δ only if {Jk } is a bounded sequence. The
following example from [BeT96], Section 6.5.3, shows that boundedness of
the iterates is not automatically guaranteed, and is a serious issue that
should be addressed in approximate VI schemes.
Example 2.3.1 (Error Amplification in Approximate
Value Iteration)

Consider a two-state α-discounted MDP with states 1 and 2, and a single


policy. The transitions are deterministic: from state 1 to state 2, and from
state 2 to state 2. These transitions are also cost-free. Thus we have (T J(1) =
(T J)(2) = αJ(2), and J ∗ (1) = J ∗ (2) = 0.
We consider a VI scheme that approximates%cost functions &within the
one-dimensional subspace of linear functions S = (r, 2r) | r ∈ $ by using
a weighted least squares minimization; i.e., we approximate a vector J by its
weighted Euclidean projection onto S. In particular, given Jk = (rk , 2rk ), we
find Jk+1 = (rk+1 , 2rk+1 ), where for weights ξ1 , ξ2 > 0, rk+1 is obtained as
* ! "2 ! "2 +
rk+1 ∈ arg min ξ1 r − (T Jk )(1) + ξ2 2r − (T Jk )(2) .
r

Since for a zero cost per stage and the given deterministic transitions, we
have T Jk = (2αrk , 2αrk ), the preceding minimization is written as
, -
rk+1 ∈ arg min ξ1 (r − 2αrk )2 + ξ2 (2r − 2αrk )2 ,
r

which by writing the corresponding optimality condition yields rk+1 = αβrk ,


where β = 2(ξ1 + 2ξ2 )(ξ1 + 4ξ2 ) > 1. Thus if α > 1/β, the sequence {rk }
diverges and so does {Jk }. Note that in this example the optimal cost func-
tion J ∗ = (0, 0) belongs to the subspace S. The difficulty here is that the
approximate VI mapping that generates Jk+1 as the weighted Euclidean pro-
jection of T Jk is not a contraction (this is a manifestation of an important
issue in approximate DP and projected equation approximation, namely that
the projected mapping ΠT need not be a contraction even if T is a sup-norm
contraction; see [DFV00], [Ber12b] for examples and related discussions). At
the same time there is no δ such that #Jk+1 − T Jk # ≤ δ for all k, because of
error amplification in each approximate VI.
70 Contractive Models Chap. 2

2.4 POLICY ITERATION

In this section, we discuss policy iteration (PI for short), an algorithm


whereby we maintain and update a policy µk , starting from some initial
policy µ0 . The typical iteration has the following form (see Fig. 2.4.1 for a
one-dimensional illustration).

Policy iteration given the current policy µk :


Policy evaluation: We compute Jµk as the unique solution of the
equation
Jµk = Tµk Jµk .

Policy improvement: We obtain a policy µk+1 that satisfies

Tµk+1 Jµk = T Jµk .
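
For concreteness, these two steps can be sketched as follows for a finite-state, finite-control discounted MDP, where H(x, u, J) = g(x, u) + α Σ_y p_{xy}(u) J(y). This is a minimal sketch, not code from the text; the function name and the model arrays P and g are placeholders.

import numpy as np

def policy_iteration(P, g, alpha, max_iter=100):
    """Exact PI for a finite discounted MDP.

    P[u] is the n x n transition matrix under control u, g[x, u] is the stage
    cost, and alpha in (0, 1) is the discount factor, so that
    (T_mu J)(x) = g(x, mu(x)) + alpha * sum_y P[mu(x)][x, y] * J(y)."""
    n, m = g.shape
    mu = np.zeros(n, dtype=int)                       # initial policy mu^0
    for _ in range(max_iter):
        # Policy evaluation: solve the linear system J = T_mu J.
        P_mu = np.array([P[mu[x]][x] for x in range(n)])
        g_mu = g[np.arange(n), mu]
        J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        # Policy improvement: mu^{k+1}(x) attains the minimum of H(x, u, J_mu).
        Q = g + alpha * np.stack([P[u] @ J_mu for u in range(m)], axis=1)
        mu_next = Q.argmin(axis=1)
        if np.array_equal(mu_next, mu):               # no improvement possible
            break
        mu = mu_next
    return mu, J_mu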

We assume that the minimum of H(x, u, Jµk ) over u ∈ U (x) is at-


tained for all x ∈ X, so that the improved policy µk+1 is defined (we
use this assumption for all the PI algorithms of the book). The following
proposition establishes a basic cost improvement property, as well as finite
convergence for the case where the set of policies is finite.

Proposition 2.4.1: (Convergence of PI) Let the monotonicity


and contraction Assumptions 2.1.1 and 2.1.2 hold, and let {µk } be
a sequence generated by the PI algorithm. Then for all k, we have
Jµk+1 ≤ Jµk , with equality if and only if Jµk = J * . Moreover,

    lim_{k→∞} ‖J_{µ^k} − J*‖ = 0,

and if the set of policies is finite, we have Jµk = J * for some k.

Proof: We have

    T_{µ^{k+1}} J_{µ^k} = T J_{µ^k} ≤ T_{µ^k} J_{µ^k} = J_{µ^k}.

Applying T_{µ^{k+1}} to this inequality while using the monotonicity Assumption
2.1.1, we obtain

    T_{µ^{k+1}}^2 J_{µ^k} ≤ T_{µ^{k+1}} J_{µ^k} = T J_{µ^k} ≤ T_{µ^k} J_{µ^k} = J_{µ^k}.



[Figure 2.4.1: one-dimensional plots of T J and T_µ J against the 45 degree line, illustrating VI (top) and a single policy iteration with exact policy evaluation (bottom); see the caption below.]
Figure 2.4.1 Geometric interpretation of PI and VI in one dimension (a single


state). Each policy µ defines the mapping Tµ , and T J is the function minµ Tµ J.
When the number of policies is finite, T J is a piecewise linear concave function,
with each piece being a linear function Tµ J that corresponds to a policy µ. The
optimal cost function J ∗ satisfies J ∗ = T J ∗ , so it is obtained from the intersection
of the graph of T J and the 45 degree line shown. Similarly Jµ is the intersection of
the graph of Tµ J and the 45 degree line. The VI sequence is indicated in the top
figure by the staircase construction, which asymptotically leads to J*. A single
policy iteration is illustrated in the bottom figure; it also illustrates the connection
of PI with Newton’s method that was discussed in Section 1.3.2.

Similarly, we have for all m > 0,

    T_{µ^{k+1}}^m J_{µ^k} ≤ T J_{µ^k} ≤ J_{µ^k},

and by taking the limit as m → ∞, we obtain

    J_{µ^{k+1}} ≤ T J_{µ^k} ≤ J_{µ^k},    k = 0, 1, . . . .    (2.22)

If J_{µ^{k+1}} = J_{µ^k}, it follows that T J_{µ^k} = J_{µ^k}, so J_{µ^k} is a fixed point of T and
must be equal to J*. Moreover, by using induction, Eq. (2.22) implies that

    J_{µ^k} ≤ T^k J_{µ^0},    k = 0, 1, . . . .

Since

    J* ≤ J_{µ^k},    lim_{k→∞} ‖T^k J_{µ^0} − J*‖ = 0,

it follows that lim_{k→∞} ‖J_{µ^k} − J*‖ = 0.
    Finally, if the number of policies is finite, Eq. (2.22) implies that there
can be only a finite number of iterations for which J_{µ^{k+1}}(x) < J_{µ^k}(x) for
some x. Thus we must have J_{µ^{k+1}} = J_{µ^k} for some k, at which time
J_{µ^k} = J* as shown earlier [cf. Eq. (2.22)]. Q.E.D.

In the case where the set of policies is infinite, we may assert the
convergence of the sequence of generated policies under some compactness
and continuity conditions. In particular, we will assume that the state
space is finite, X = {1, . . . , n}, and that each control constraint set U (x)
is a compact subset of ℜ^m. We will view a cost function J as an element
of ℜ^n, and a policy µ as an element of the set U (1) × · · · × U (n) ⊂ ℜ^{mn},
which is compact. Then {µk } has at least one limit point µ, which must
be an admissible policy. The following proposition guarantees, under an
additional continuity assumption for H(x, ·, ·), that every limit point µ is
optimal.

Assumption 2.4.1: (Compactness and Continuity)


(a) The state space is finite, X = {1, . . . , n}.
(b) Each control constraint set U (x), x = 1, . . . , n, is a compact
subset of ℜ^m.
(c) Each function H(x, ·, ·), x = 1, . . . , n, is continuous over U (x) × ℜ^n.

Proposition 2.4.2: Let the monotonicity and contraction Assump-


tions 2.1.1 and 2.1.2 hold, together with Assumption 2.4.1, and let
{µk } be a sequence generated by the PI algorithm. Then for every
limit point µ of {µk }, we have Jµ = J ∗ .

Proof: We have Jµk → J * by Prop. 2.4.1. Let µ be the limit of a sub-


sequence {µk }k∈K . We will show that Tµ J * = T J * , from which it follows
that Jµ = J * [cf. Prop. 2.1.1(c)]. Indeed, we have Tµ J * ≥ T J * , so we focus
on showing the reverse inequality. From the equation

Tµk Jµk−1 = T Jµk−1 ,

we have
    H(x, µ^k(x), J_{µ^{k−1}}) ≤ H(x, u, J_{µ^{k−1}}),    x = 1, . . . , n, u ∈ U (x).

By taking limit in this relation as k → ∞, k ∈ K, and by using the


continuity of H(x, ·, ·) [cf. Assumption 2.4.1(c)], we obtain
    H(x, µ(x), J*) ≤ H(x, u, J*),    x = 1, . . . , n, u ∈ U (x).

By taking the minimum of the right-hand side over u ∈ U (x), we obtain


Tµ J * ≤ T J * . Q.E.D.

2.4.1 Approximate Policy Iteration

We now consider the PI method where the policy evaluation step and/or
the policy improvement step of the method are implemented through ap-
proximations. This method generates a sequence of policies {µk } and a
corresponding sequence of approximate cost functions {Jk } satisfying

    ‖J_k − J_{µ^k}‖ ≤ δ,    ‖T_{µ^{k+1}} J_k − T J_k‖ ≤ ε,    k = 0, 1, . . . ,    (2.23)

where δ and ε are some scalars, and ‖·‖ denotes the weighted sup-norm (the
one used in the contraction Assumption 2.1.2). The following proposition
provides an error bound for this algorithm.

Proposition 2.4.3: (Error Bound for Approximate PI) Let


the monotonicity and contraction Assumptions 2.1.1 and 2.1.2 hold.
The sequence {µk } generated by the approximate PI algorithm (2.23)
satisfies

    lim sup_{k→∞} ‖J_{µ^k} − J*‖ ≤ (ε + 2αδ)/(1 − α)^2.    (2.24)

The essence of the proof is contained in the following proposition,


which quantifies the amount of approximate policy improvement at each
iteration.

Proposition 2.4.4: Let the monotonicity and contraction Assump-


tions 2.1.1 and 2.1.2 hold. Let J, µ, and µ̄ satisfy

    ‖J − J_µ‖ ≤ δ,    ‖T_µ̄ J − T J‖ ≤ ε,

where δ and ε are some scalars. Then

    ‖J_µ̄ − J*‖ ≤ α‖J_µ − J*‖ + (ε + 2αδ)/(1 − α).    (2.25)

Proof: We denote by v the weight function corresponding to the weighted
sup-norm. Using the contraction property of T and T_µ̄, which implies that
‖T_µ̄ J_µ − T_µ̄ J‖ ≤ αδ and ‖T J − T J_µ‖ ≤ αδ, and hence T_µ̄ J_µ ≤ T_µ̄ J + αδ v
and T J ≤ T J_µ + αδ v, we have

    T_µ̄ J_µ ≤ T_µ̄ J + αδ v ≤ T J + (ε + αδ) v ≤ T J_µ + (ε + 2αδ) v.    (2.26)

Since T J_µ ≤ T_µ J_µ = J_µ, this relation yields

    T_µ̄ J_µ ≤ J_µ + (ε + 2αδ) v,

and applying Prop. 2.1.4(b) with µ = µ̄, J = J_µ, and c = ε + 2αδ, we
obtain

    J_µ̄ ≤ J_µ + ((ε + 2αδ)/(1 − α)) v.    (2.27)

Using this relation, we have

    J_µ̄ = T_µ̄ J_µ̄ = T_µ̄ J_µ + (T_µ̄ J_µ̄ − T_µ̄ J_µ) ≤ T_µ̄ J_µ + (α(ε + 2αδ)/(1 − α)) v,

where the inequality follows by using Prop. 2.1.3 and Eq. (2.27). Subtract-
ing J* from both sides, we have

    J_µ̄ − J* ≤ T_µ̄ J_µ − J* + (α(ε + 2αδ)/(1 − α)) v.    (2.28)

Also by subtracting J* from both sides of Eq. (2.26), and using the
contraction property

    T J_µ − J* = T J_µ − T J* ≤ α‖J_µ − J*‖ v,

we obtain

    T_µ̄ J_µ − J* ≤ T J_µ − J* + (ε + 2αδ) v ≤ α‖J_µ − J*‖ v + (ε + 2αδ) v.

Combining this relation with Eq. (2.28) yields

    J_µ̄ − J* ≤ α‖J_µ − J*‖ v + (α(ε + 2αδ)/(1 − α)) v + (ε + 2αδ) v = α‖J_µ − J*‖ v + ((ε + 2αδ)/(1 − α)) v,

which is equivalent to the desired relation (2.25). Q.E.D.

Proof of Prop. 2.4.3: Applying Prop. 2.4.4, we have

    ‖J_{µ^{k+1}} − J*‖ ≤ α‖J_{µ^k} − J*‖ + (ε + 2αδ)/(1 − α),

which by taking the lim sup of both sides as k → ∞ yields the desired result.
Q.E.D.

We note that the error bound of Prop. 2.4.3 is tight, as can be shown
with an example from [BeT96], Section 6.2.3. The error bound is com-
parable to the one for approximate VI, derived earlier in Prop. 2.3.2. In
particular, the error ‖J_{µ^k} − J*‖ is asymptotically proportional to 1/(1 − α)^2
and to the approximation error in policy evaluation or value iteration, re-
spectively. This is noteworthy, as it indicates that contrary to the case of
exact implementation, approximate PI need not hold a convergence rate
advantage over approximate VI, despite its greater overhead per iteration.
Note that when δ = ε = 0, Eq. (2.25) yields

    ‖J_{µ^{k+1}} − J*‖ ≤ α‖J_{µ^k} − J*‖.

Thus in the case of an infinite state space and/or control space, exact
PI converges at a geometric rate under the contraction and monotonicity
assumptions of this section. This rate is the same as the rate of convergence
of exact VI. It follows that judging solely from the point of view of rate
of convergence estimates, exact PI holds an advantage over exact VI only
when the number of states is finite. This raises the question of what happens
when the number of states is finite but very large. However, this question
is not very interesting from a practical point of view, since for a very large
number of states, neither VI nor PI can be implemented in practice without
approximations (see the discussion of Section 1.2.4).
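
To connect the bound (2.24) of this section with computation, the following sketch runs the approximate PI iteration (2.23) with exact policy improvement (ε = 0) and a policy evaluation contaminated by noise of sup-norm size at most δ. The noise injection is only a contrived stand-in for function approximation error, and the function name and model arrays P and g are placeholders, not part of the text.

import numpy as np

def approximate_pi(P, g, alpha, delta, num_iter=50, seed=0):
    """Approximate PI (2.23) with ||J_k - J_{mu^k}|| <= delta and eps = 0.

    P[u] is the transition matrix for control u, g[x, u] is the stage cost."""
    rng = np.random.default_rng(seed)
    n, m = g.shape
    mu = np.zeros(n, dtype=int)
    for _ in range(num_iter):
        # Exact evaluation of mu, followed by a bounded perturbation of size delta.
        P_mu = np.array([P[mu[x]][x] for x in range(n)])
        g_mu = g[np.arange(n), mu]
        J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        J_k = J_mu + rng.uniform(-delta, delta, size=n)
        # Exact policy improvement based on the approximate cost J_k.
        Q = g + alpha * np.stack([P[u] @ J_k for u in range(m)], axis=1)
        mu = Q.argmin(axis=1)
    return mu

With δ > 0 the generated policies may oscillate among several policies, whose costs nonetheless satisfy the bound (2.24) with ε = 0 (here the weight function is taken as v ≡ 1).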

2.4.2 Approximate Policy Iteration Where Policies Converge

Generally, the policy sequence {µk } generated by approximate PI may


oscillate between several policies. However, under some circumstances this
sequence may be guaranteed to converge to some µ, in the sense that

µk+1 = µk = µ for some k. (2.29)



An example arises when the policy sequence {µk } is generated by exact PI


applied with a different mapping H̃ in place of H, but the policy evaluation
and policy improvement error bounds of Eq. (2.23) are satisfied. The map-
ping H̃ may for example correspond to an approximation of the original
problem (as in the aggregation methods of Example 1.2.10; see [Ber11c]
and [Ber12a] for further discussion). In this case we can show the following
bound, which is much more favorable than the one of Prop. 2.4.3.

Proposition 2.4.5: (Error Bound for Approximate PI when


Policies Converge) Let the monotonicity and contraction Assump-
tions 2.1.1 and 2.1.2 hold, and assume that the approximate PI algo-
rithm (2.23) terminates with a policy µ that satisfies condition (2.29).
Then we have

    ‖J_µ − J*‖ ≤ (ε + 2αδ)/(1 − α).    (2.30)

Proof: Let J̃ be the cost function obtained by approximate policy evalu-
ation of µ [i.e., J̃ = J_k, where k satisfies the condition (2.29)]. Then we
have

    ‖J̃ − J_µ‖ ≤ δ,    ‖T_µ J̃ − T J̃‖ ≤ ε,    (2.31)

where the latter inequality holds since we have

    ‖T_µ J̃ − T J̃‖ = ‖T_{µ^{k+1}} J_k − T J_k‖ ≤ ε,

cf. Eq. (2.23). Using Eq. (2.31) and the fact J_µ = T_µ J_µ, we have

    ‖T J_µ − J_µ‖ ≤ ‖T J_µ − T J̃‖ + ‖T J̃ − T_µ J̃‖ + ‖T_µ J̃ − J_µ‖
                 = ‖T J_µ − T J̃‖ + ‖T J̃ − T_µ J̃‖ + ‖T_µ J̃ − T_µ J_µ‖
                 ≤ α‖J_µ − J̃‖ + ε + α‖J̃ − J_µ‖                        (2.32)
                 ≤ ε + 2αδ.

Using Prop. 2.1.1(d) with J = J_µ, we obtain the error bound (2.30).
Q.E.D.

The preceding error bound can be extended to the case where two
successive policies generated by the approximate PI algorithm are “not too
different” rather than being identical. In particular, suppose that µ and µ̄
are successive policies, which in addition to

    ‖J̃ − J_µ‖ ≤ δ,    ‖T_µ̄ J̃ − T J̃‖ ≤ ε,

[cf. Eq. (2.23)], also satisfy

    ‖T_µ J̃ − T_µ̄ J̃‖ ≤ ζ,

where ζ is some scalar (instead of µ = µ̄, which is the case where policies
converge exactly). Then we also have

    ‖T J̃ − T_µ J̃‖ ≤ ‖T J̃ − T_µ̄ J̃‖ + ‖T_µ̄ J̃ − T_µ J̃‖ ≤ ε + ζ,

and by replacing ε with ε + ζ and µ̄ with µ in Eq. (2.32), we obtain

    ‖J_µ − J*‖ ≤ (ε + ζ + 2αδ)/(1 − α).

When ζ is small enough to be of the order of max{δ, ε}, this error bound
is comparable to the one for the case where policies converge.

2.5 OPTIMISTIC POLICY ITERATION AND λ-POLICY


ITERATION
In this section, we discuss some variants of the PI algorithm of the preceding
section, where the policy evaluation

Jµk = Tµk Jµk

is approximated by using VI. The most straightforward of these methods is


optimistic PI (also called “modified” PI, see e.g., [Put94]), where a policy
µk is evaluated approximately, using a finite number of VI. Thus, starting
with a function J0 ∈ B(X), we generate sequences {Jk } and {µk } with the
algorithm
    T_{µ^k} J_k = T J_k,    J_{k+1} = T_{µ^k}^{m_k} J_k,    k = 0, 1, . . . ,    (2.33)

where {mk } is a sequence of positive integers (see Fig. 2.5.1, which shows
one iteration of the method where mk = 3). There is no systematic guide-
line for selecting the integers mk . Usually their best values are chosen
empirically, and tend to be considerably larger than 1 (in the case where
mk ≡ 1 the optimistic PI method coincides with the VI method). The
convergence of this method is discussed in Section 2.5.1.
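
A minimal sketch of the iteration (2.33) for a finite discounted MDP follows; the constant choice m_k ≡ 10, the function name, and the model arrays P and g are illustrative assumptions, not prescriptions from the text.

import numpy as np

def optimistic_pi(P, g, alpha, m_k=10, num_iter=100):
    """Optimistic PI (2.33): T_{mu^k} J_k = T J_k, then J_{k+1} = T_{mu^k}^{m_k} J_k.

    P[u] is the transition matrix for control u, g[x, u] is the stage cost."""
    n, m = g.shape
    J = np.zeros(n)                                   # J_0
    for _ in range(num_iter):
        # Policy improvement: mu^k attains the minimum in (T J_k)(x).
        Q = g + alpha * np.stack([P[u] @ J for u in range(m)], axis=1)
        mu = Q.argmin(axis=1)
        P_mu = np.array([P[mu[x]][x] for x in range(n)])
        g_mu = g[np.arange(n), mu]
        # Approximate policy evaluation: m_k applications of T_{mu^k}.
        for _ in range(m_k):
            J = g_mu + alpha * P_mu @ J
    return J, mu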
Variants of optimistic PI include methods with approximations in the
policy evaluation and policy improvement phases (Section 2.5.2), and meth-
ods where the number mk is randomized (Section 2.5.3). An interesting
advantage of the latter methods is that they do not require the monotonic-
ity Assumption 2.1.1 for convergence in problems with a finite number of
policies.
A method that is conceptually similar to the optimistic PI method is
the λ-PI method defined by
    T_{µ^k} J_k = T J_k,    J_{k+1} = T_{µ^k}^{(λ)} J_k,    k = 0, 1, . . . ,    (2.34)

[Figure 2.5.1: one-dimensional plot of T J = min_µ T_µ J and T_{µ^0} J against the 45 degree line, with three applications of T_{µ^0} starting from J_0 yielding J_1 = T_{µ^0}^3 J_0; see the caption below.]

Figure 2.5.1 Illustration of optimistic PI in one dimension. In this example, the


policy µ^0 is evaluated approximately with just three applications of T_{µ^0} to yield
J_1 = T_{µ^0}^3 J_0.

where J_0 is an initial function in B(X), and for any policy µ and scalar
λ ∈ (0, 1), T_µ^{(λ)} is the multistep mapping defined by

    T_µ^{(λ)} J = (1 − λ) Σ_{ℓ=0}^{∞} λ^ℓ T_µ^{ℓ+1} J,    J ∈ B(X),

(cf. Section 1.2.5). To compare optimistic PI and λ-PI, note that they both
involve multiple applications of the VI mapping T_{µ^k}: a fixed number m_k
in the former case, and a geometrically weighted number in the latter case.
In fact, we may view the λ-PI iterate T_{µ^k}^{(λ)} J_k as the expected value of the
optimistic PI iterate T_{µ^k}^{m_k} J_k when m_k is chosen according to a geometric
probability distribution with parameter λ.
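
The geometric-weighting interpretation can be made concrete by truncating the series that defines T_µ^{(λ)}. The sketch below evaluates the truncated sum for a finite-state model; the function name and the truncation level ell_max are illustrative assumptions (the neglected tail has total weight λ^{ell_max}).

import numpy as np

def T_lambda(P_mu, g_mu, J, alpha, lam, ell_max=200):
    """Truncated evaluation of T_mu^{(lambda)} J = (1 - lam) * sum_{l>=0} lam^l * T_mu^{l+1} J,
    where (T_mu J)(x) = g_mu(x) + alpha * (P_mu J)(x)."""
    result = np.zeros_like(J, dtype=float)
    TJ = np.array(J, dtype=float)
    for ell in range(ell_max):
        TJ = g_mu + alpha * P_mu @ TJ          # TJ now equals T_mu^{ell+1} J
        result += (1 - lam) * lam**ell * TJ
    return result                              # omits the tail, of total weight lam**ell_max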
One of the reasons that make λ-PI interesting is its relation with
TD(λ) and other temporal difference methods on one hand, and the prox-
imal algorithm on the other. In particular, in λ-PI a policy evaluation is
performed with a single iteration of an extrapolated proximal algorithm;
cf. the discussion of Section 1.2.5 and Exercise 1.2. Thus implementation

of λ-PI can benefit from the rich methodology that has developed around
temporal difference and proximal methods.
Generally the optimistic and λ-PI methods have similar convergence
properties. In this section, we focus primarily on optimistic PI, and we
discuss briefly λ-PI in Section 2.5.3, where we will prove convergence for a
randomized version. For a convergence proof of λ-PI without randomiza-
tion in discounted stochastic optimal control and stochastic shortest path
problems, see the paper [BeI96] and the book [BeT96] (Section 2.3.1).

2.5.1 Convergence of Optimistic Policy Iteration

We will now focus on the optimistic PI algorithm (2.33). The following two
propositions provide its convergence properties.

Proposition 2.5.1: (Convergence of Optimistic PI) Let the


monotonicity and contraction Assumptions 2.1.1 and 2.1.2 hold, and
let {(J_k, µ^k)} be a sequence generated by the optimistic PI algorithm
(2.33). Then

    lim_{k→∞} ‖J_k − J*‖ = 0,

and if the number of policies is finite, we have J_{µ^k} = J* for all k
greater than some index k̄.

Proposition 2.5.2: Let the monotonicity and contraction Assump-


tions 2.1.1 and 2.1.2 hold, together with Assumption 2.4.1, and let
{(J_k, µ^k)} be a sequence generated by the optimistic PI algorithm
(2.33). Then for every limit point µ of {µ^k}, we have J_µ = J*.

We develop the proofs of the propositions through four lemmas. The


first lemma collects some properties of monotone weighted sup-norm con-
tractions, variants of which we noted earlier and we restate for convenience.

Lemma 2.5.1: Let W : B(X) ↦ B(X) be a mapping that satisfies
the monotonicity assumption

    J ≤ J′  ⇒  W J ≤ W J′,    ∀ J, J′ ∈ B(X),

and the contraction assumption

    ‖W J − W J′‖ ≤ α‖J − J′‖,    ∀ J, J′ ∈ B(X),

for some α ∈ (0, 1).

(a) For all J, J′ ∈ B(X) and scalar c ≥ 0, we have

    J ≥ J′ − c v  ⇒  W J ≥ W J′ − αc v.    (2.35)

(b) For all J ∈ B(X), c ≥ 0, and k = 0, 1, . . ., we have

    J ≥ W J − c v  ⇒  W^k J ≥ J* − (α^k/(1 − α)) c v,    (2.36)

    W J ≥ J − c v  ⇒  J* ≥ W^k J − (α^k/(1 − α)) c v,    (2.37)

where J* is the fixed point of W.

Proof: The proof of part (a) follows the one of Prop. 2.1.4(b), while the
proof of part (b) follows the one of Prop. 2.1.4(c). Q.E.D.

Lemma 2.5.2: Let the monotonicity and contraction Assumptions


2.1.1 and 2.1.2 hold, let J ∈ B(X) and c ≥ 0 satisfy

J ≥ T J − c v,

and let µ ∈ M be such that Tµ J = T J. Then for all k > 0, we have


    T J ≥ T_µ^k J − (α/(1 − α)) c v,    (2.38)

and

    T_µ^k J ≥ T(T_µ^k J) − α^k c v.    (2.39)

Proof: Since J ≥ T J − c v = T_µ J − c v, by using Lemma 2.5.1(a) with
W = T_µ^j and J′ = T_µ J, we have for all j ≥ 1,

    T_µ^j J ≥ T_µ^{j+1} J − α^j c v.    (2.40)

By adding this relation over j = 1, . . . , k − 1, we have

    T J = T_µ J ≥ T_µ^k J − Σ_{j=1}^{k−1} α^j c v = T_µ^k J − ((α − α^k)/(1 − α)) c v ≥ T_µ^k J − (α/(1 − α)) c v,

showing Eq. (2.38). From Eq. (2.40) for j = k, we obtain

    T_µ^k J ≥ T_µ^{k+1} J − α^k c v = T_µ(T_µ^k J) − α^k c v ≥ T(T_µ^k J) − α^k c v,

showing Eq. (2.39). Q.E.D.

The next lemma applies to the optimistic PI algorithm (2.33) and


proves a preliminary bound.

Lemma 2.5.3: Let the monotonicity and contraction Assumptions
2.1.1 and 2.1.2 hold, let {(J_k, µ^k)} be a sequence generated by the
optimistic PI algorithm (2.33), and assume that for some c ≥ 0 we
have
    J_0 ≥ T J_0 − c v.

Then for all k ≥ 0,

    T J_k + (α/(1 − α)) β_k c v ≥ J_{k+1} ≥ T J_{k+1} − β_{k+1} c v,    (2.41)

where β_k is the scalar given by

    β_k = 1 if k = 0,    β_k = α^{m_0+···+m_{k−1}} if k > 0,    (2.42)

with m_j, j = 0, 1, . . ., being the integers used in the algorithm (2.33).

Proof: We prove Eq. (2.41) by induction on k, using Lemma 2.5.2. For


k = 0, using Eq. (2.38) with J = J0 , µ = µ0 , and k = m0 , we have
    T J_0 ≥ J_1 − (α/(1 − α)) c v = J_1 − (α/(1 − α)) β_0 c v,
showing the left-hand side of Eq. (2.41) for k = 0. Also by Eq. (2.39) with
µ = µ0 and k = m0 , we have

    J_1 ≥ T J_1 − α^{m_0} c v = T J_1 − β_1 c v,

showing the right-hand side of Eq. (2.41) for k = 0.


Assuming that Eq. (2.41) holds for k − 1 ≥ 0, we will show that it
holds for k. Indeed, the right-hand side of the induction hypothesis yields

Jk ≥ T Jk − βk c v.

Using Eqs. (2.38) and (2.39) with J = Jk , µ = µk , and k = mk , we obtain


    T J_k ≥ J_{k+1} − (α/(1 − α)) β_k c v,

and

    J_{k+1} ≥ T J_{k+1} − α^{m_k} β_k c v = T J_{k+1} − β_{k+1} c v,
respectively. This completes the induction. Q.E.D.

The next lemma essentially proves the convergence of the optimistic


PI (Prop. 2.5.1) and provides associated error bounds.

Lemma 2.5.4: Let the monotonicity and contraction Assumptions
2.1.1 and 2.1.2 hold, let {(J_k, µ^k)} be a sequence generated by the
optimistic PI algorithm (2.33), and let c ≥ 0 be a scalar such that

    ‖J_0 − T J_0‖ ≤ c.    (2.43)

Then for all k ≥ 0,

    J_k + (α^k/(1 − α)) c v ≥ J_k + (β_k/(1 − α)) c v ≥ J* ≥ J_k − ((k + 1)α^k/(1 − α)) c v,    (2.44)

where β_k is defined by Eq. (2.42).

Proof: Using the relation J_0 ≥ T J_0 − c v [cf. Eq. (2.43)] and Lemma 2.5.3,
we have

    J_k ≥ T J_k − β_k c v,    k = 0, 1, . . . .

Using this relation in Lemma 2.5.1(b) with W = T and k = 0, we obtain

    J_k ≥ J* − (β_k/(1 − α)) c v,

which together with the fact α^k ≥ β_k, shows the left-hand side of Eq. (2.44).
    Using the relation T J_0 ≥ J_0 − c v [cf. Eq. (2.43)] and Lemma 2.5.1(b)
with W = T, we have

    J* ≥ T^k J_0 − (α^k/(1 − α)) c v,    k = 0, 1, . . . .    (2.45)

Using again the relation J_0 ≥ T J_0 − c v in conjunction with Lemma 2.5.3,
we also have

    T J_j ≥ J_{j+1} − (α/(1 − α)) β_j c v,    j = 0, . . . , k − 1.

Applying T^{k−j−1} to both sides of this inequality and using the monotonicity
and contraction properties of T^{k−j−1}, we obtain

    T^{k−j} J_j ≥ T^{k−j−1} J_{j+1} − (α^{k−j}/(1 − α)) β_j c v,    j = 0, . . . , k − 1,

cf. Lemma 2.5.1(a). By adding this relation over j = 0, . . . , k − 1, and using
the fact β_j ≤ α^j, it follows that

    T^k J_0 ≥ J_k − Σ_{j=0}^{k−1} (α^{k−j}/(1 − α)) α^j c v = J_k − (kα^k/(1 − α)) c v.    (2.46)
Finally, by combining Eqs. (2.45) and (2.46), we obtain the right-hand side
of Eq. (2.44). Q.E.D.

Proof of Props. 2.5.1 and 2.5.2: Let c be a scalar satisfying Eq. (2.43).
Then the error bounds (2.44) show that lim_{k→∞} ‖J_k − J*‖ = 0, i.e., the first
part of Prop. 2.5.1. To show the second part (finite termination when the
number of policies is finite), let M̂ be the finite set of nonoptimal policies.
Then there exists ε > 0 such that ‖T_µ̂ J* − T J*‖ > ε for all µ̂ ∈ M̂, which
implies that ‖T_µ̂ J_k − T J_k‖ > ε for all µ̂ ∈ M̂ and k sufficiently large. This
implies that µ^k ∉ M̂ for all k sufficiently large. The proof of Prop. 2.5.2
follows using the compactness and continuity Assumption 2.4.1, and the
convergence argument of Prop. 2.4.2. Q.E.D.

Convergence Rate Issues

Let us consider the convergence rate bounds of Lemma 2.5.4 for optimistic
PI, and write them in the form
    ‖J_0 − T J_0‖ ≤ c  ⇒  J_k − ((k + 1)α^k/(1 − α)) c v ≤ J* ≤ J_k + (α^{m_0+···+m_{k−1}}/(1 − α)) c v.    (2.47)

We may contrast these bounds with the ones for VI, where

    ‖J_0 − T J_0‖ ≤ c  ⇒  T^k J_0 − (α^k/(1 − α)) c v ≤ J* ≤ T^k J_0 + (α^k/(1 − α)) c v    (2.48)

[cf. Prop. 2.1.4(c)].
In comparing the bounds (2.47) and (2.48), we should also take into
account the associated overhead for a single iteration of each method: op-
timistic PI requires at iteration k a single application of T and mk − 1
applications of Tµk (each being less time-consuming than an application of
T ), while VI requires a single application of T . It can then be seen that the
upper bound for optimistic PI is better than the one for VI (same bound
for less overhead), while the lower bound for optimistic PI is worse than the
one for VI (worse bound for more overhead). This suggests that the choice
of the initial condition J0 is important in optimistic PI, and in particular
it is preferable to have J0 ≥ T J0 (implying convergence to J * from above)
rather than J0 ≤ T J0 (implying convergence to J * from below). This is
consistent with the results of other works, which indicate that the conver-
gence properties of the method are fragile when the condition J0 ≥ T J0
does not hold (see [WiB93], [BeT96], [BeY10], [BeY12], [YuB13a]).

2.5.2 Approximate Optimistic Policy Iteration

We will now derive error bounds for the case where the policy evaluation
and policy improvement operations are approximate, similar to the nonop-
timistic PI case of Section 2.4.1. In particular, we consider a method that
generates a sequence of policies {µk } and a corresponding sequence of ap-
proximate cost functions {Jk } satisfying
    ‖J_k − T_{µ^k}^{m_k} J_{k−1}‖ ≤ δ,    ‖T_{µ^{k+1}} J_k − T J_k‖ ≤ ε,    k = 0, 1, . . . ,    (2.49)

[cf. Eq. (2.23)]. For example, we may compute (perhaps approximately,


by simulation) the values (T_{µ^k}^{m_k} J_{k−1})(x) for a subset of states x, and use a
least squares fit of these values to select Jk from some parametric class of
functions.
We will prove the same error bound as for the nonoptimistic case, cf.
Eq. (2.24). However, for this we will need the following condition, which
is stronger than the contraction and monotonicity conditions that we have
been using so far.

Assumption 2.5.1: (Semilinear Monotonic Contraction) For


all J ∈ B(X) and µ ∈ M, the functions Tµ J and T J belong to B(X).
Furthermore, for some α ∈ (0, 1), we have for all J, J′ ∈ B(X), µ ∈ M,
and x ∈ X,

    ((T_µ J′)(x) − (T_µ J)(x)) / v(x) ≤ α sup_{y∈X} (J′(y) − J(y)) / v(y).    (2.50)

This assumption implies both the monotonicity and contraction As-


sumptions 2.1.1 and 2.1.2, as can be easily verified. Moreover the assump-
tion is satisfied in the discounted DP examples of Section 1.2, as well as
the stochastic shortest path problem of Example 1.2.6. It holds if Tµ is a
linear mapping involving a matrix with nonnegative components that has
spectral radius less than 1 (or more generally if Tµ is the minimum or the
maximum of a finite number of such linear mappings).
For any function y ∈ B(X), let us use the notation

    M(y) = sup_{x∈X} y(x)/v(x).

Then the condition (2.50) can be written for all J, J′ ∈ B(X), and µ ∈ M
as
    M(T_µ J − T_µ J′) ≤ αM(J − J′),    (2.51)

and also implies the following multistep versions, for ℓ ≥ 1,

    T_µ^ℓ J − T_µ^ℓ J′ ≤ α^ℓ M(J − J′) v,    M(T_µ^ℓ J − T_µ^ℓ J′) ≤ α^ℓ M(J − J′),    (2.52)

which can be proved by induction using Eq. (2.51). We have the following
proposition.

Proposition 2.5.3: (Error Bound for Optimistic Approximate


PI) Let Assumption 2.5.1 hold, in addition to the monotonicity and
contraction Assumptions 2.1.1 and 2.1.2. Then the sequence {µk }
generated by the optimistic approximate PI algorithm (2.49) satisfies

    lim sup_{k→∞} ‖J_{µ^k} − J*‖ ≤ (ε + 2αδ)/(1 − α)^2.

Proof: Let us fix k ≥ 1, and for simplicity let us assume that m_k ≡ m for
some m, and denote

    J = J_{k−1},    J̄ = J_k,    µ = µ^k,    µ̄ = µ^{k+1},

    s = J_µ − T_µ^m J,    s̄ = J_µ̄ − T_µ̄^m J̄,    t = T_µ^m J − J*,    t̄ = T_µ̄^m J̄ − J*.

We have

    J_µ − J* = J_µ − T_µ^m J + T_µ^m J − J* = s + t.    (2.53)

We will derive recursive relations for s and t, which will also involve the
residual functions

    r = T_µ J − J,    r̄ = T_µ̄ J̄ − J̄.

We first obtain a relation between r and r̄. We have

    r̄ = T_µ̄ J̄ − J̄
      = (T_µ̄ J̄ − T_µ J̄) + (T_µ J̄ − J̄)
      ≤ (T_µ̄ J̄ − T J̄) + (T_µ J̄ − T_µ(T_µ^m J)) + (T_µ^m J − J̄) + (T_µ^m(T_µ J) − T_µ^m J)
      ≤ εv + αM(J̄ − T_µ^m J)v + δv + α^m M(T_µ J − J)v
      ≤ (ε + δ)v + αδv + α^m M(r)v,

where the first inequality follows from T_µ̄ J̄ ≥ T J̄, and the second and third
inequalities follow from Eqs. (2.49) and (2.52). From this relation we have

    M(r̄) ≤ ε + (1 + α)δ + βM(r),

where β = α^m. Taking lim sup as k → ∞ in this relation, we obtain

    lim sup_{k→∞} M(r) ≤ (ε + (1 + α)δ)/(1 − β).    (2.54)

Next we derive a relation between s and r. We have

    s = J_µ − T_µ^m J
      = T_µ^m J_µ − T_µ^m J
      ≤ α^m M(J_µ − J) v
      ≤ (α^m/(1 − α)) M(T_µ J − J) v
      = (α^m/(1 − α)) M(r) v,

where the first inequality follows from Eq. (2.52) and the second inequality
follows by using Prop. 2.1.4(b). Thus we have M(s) ≤ (α^m/(1 − α)) M(r), from
which by taking lim sup of both sides and using Eq. (2.54), we obtain

    lim sup_{k→∞} M(s) ≤ β(ε + (1 + α)δ)/((1 − α)(1 − β)).    (2.55)

Finally we derive a relation between t, t̄, and r. We first note that

    T J̄ − T J* ≤ αM(J̄ − J*)v
              = αM(J̄ − T_µ^m J + T_µ^m J − J*)v
              ≤ αM(J̄ − T_µ^m J)v + αM(T_µ^m J − J*)v
              ≤ αδv + αM(t)v.

Using this relation, and Eqs. (2.49) and (2.52), we have

    t̄ = T_µ̄^m J̄ − J*
      = (T_µ̄^m J̄ − T_µ̄^{m−1} J̄) + · · · + (T_µ̄^2 J̄ − T_µ̄ J̄) + (T_µ̄ J̄ − T J̄) + (T J̄ − T J*)
      ≤ (α^{m−1} + · · · + α)M(T_µ̄ J̄ − J̄)v + εv + αδv + αM(t)v,

so finally

    M(t̄) ≤ ((α − α^m)/(1 − α)) M(r̄) + (ε + αδ) + αM(t).

By taking lim sup of both sides and using Eq. (2.54), it follows that

    lim sup_{k→∞} M(t) ≤ (α − β)(ε + (1 + α)δ)/((1 − α)^2(1 − β)) + (ε + αδ)/(1 − α).    (2.56)

We now combine Eqs. (2.53), (2.55), and (2.56). We obtain

    lim sup_{k→∞} M(J_{µ^k} − J*) ≤ lim sup_{k→∞} M(s) + lim sup_{k→∞} M(t)
        ≤ β(ε + (1 + α)δ)/((1 − α)(1 − β)) + (α − β)(ε + (1 + α)δ)/((1 − α)^2(1 − β)) + (ε + αδ)/(1 − α)
        = (β(1 − α) + (α − β))(ε + (1 + α)δ)/((1 − α)^2(1 − β)) + (ε + αδ)/(1 − α)
        = α(ε + (1 + α)δ)/(1 − α)^2 + (ε + αδ)/(1 − α)
        = (ε + 2αδ)/(1 − α)^2.

This proves the result, since in view of J_{µ^k} ≥ J*, we have M(J_{µ^k} − J*) =
‖J_{µ^k} − J*‖. Q.E.D.

A remarkable fact is that approximate VI, approximate PI, and ap-


proximate optimistic PI have very similar error bounds (cf. Props. 2.3.2,
2.4.3, and 2.5.3). Approximate VI has a slightly better bound, but insignif-
icantly so in practical terms. When approximate PI produces a convergent
sequence of policies, the associated error bound is much better (cf. Prop.
2.4.5). However, special conditions are needed for convergence of policies
in approximate PI. These conditions are fulfilled in some cases, notably
including schemes where aggregation is used for policy evaluation (cf. Sec-
tion 1.2.4). In other cases, including some where the projected equation
is used for policy evaluation, approximate PI (both optimistic and nonop-
timistic) will typically generate a cycle of policies satisfying the bound of
Prop. 2.4.3; see Section 3.6 of the PI survey paper [Ber11c], or Chapter 6
of the book [Ber12a].

2.5.3 Randomized Optimistic Policy Iteration

We will now consider a randomized version of the optimistic PI algorithm


where the number mk of VI iterations in the kth policy evaluation is ran-
dom, while the monotonicity assumption need not hold. We assume, how-
ever, that each policy mapping is a contraction in a suitable space, that
the number of policies is finite, and that mk = 1 with positive probability
(these assumptions can be modified and/or generalized in ways suggested
by the subsequent line of proof). In particular, for each positive integer j,
we have a probability p(j) ≥ 0, where


    p(1) > 0,    Σ_{j=1}^{∞} p(j) = 1.

We consider the algorithm


    T_{µ^k} J_k = T J_k,    J_{k+1} = T_{µ^k}^{m_k} J_k,    k = 0, 1, . . . ,    (2.57)

where mk is chosen randomly according to the distribution p(j),

P (mk = j) = p(j), j = 1, 2, . . . . (2.58)

The selection of mk is independent of previous selections. We will assume


the following.

Assumption 2.5.2: Let ‖·‖ be a norm on some complete space of


real-valued functions over X, denoted F (X), and assume the following.
(a) The set of policies M is finite.
(b) The mappings Tµ , µ ∈ M, and T are contraction mappings from
F (X) into F (X).

The preceding assumption requires that the number of policies is


finite, but does not require any monotonicity condition (cf. Assumption
2.1.1), while its contraction condition (b) is weaker than the contraction
Assumption 2.1.2 since F (X) is a general complete normed space, not nec-
essarily B(X). This flexibility may be useful in algorithms that involve
cost function approximation within a subspace of basis functions. For such
algorithms, however, T does not necessarily have a unique fixed point, as
discussed in Section 1.2.4. By contrast since F (X) is assumed complete,
Assumption 2.5.2 implies that Tµ and T have unique fixed points, which
we denote by Jµ and J * , respectively.
An important preliminary fact (which relies on the finiteness of M)
is given in the following proposition. The proposition implies that near J *
the generated policies µk are “optimal” in the sense that Jµk = J * , so the
algorithm does not tend to cycle. †

Proposition 2.5.4: Let Assumption 2.5.2 hold, and let M∗ be the


subset of all µ ∈ M such that Tµ J * = T J * . Then for all µ ∈ M∗ , we
have J_µ = J*. Moreover, there exists an ε > 0 such that for all J with
‖J − J*‖ < ε we have T_µ J = T J only if µ ∈ M∗.

Proof: If µ ∈ M∗ , we have Tµ J * = T J * = J * . Thus J * is the unique


fixed point Jµ of Tµ , and we have Jµ = J * .

† Note that without monotonicity, J ∗ need not have any formal optimality
properties (cf. the discussion of Section 2.1 and Example 2.1.1).

To prove the second assertion, we argue by contradiction, so we as-
sume that there exist a sequence of scalars {ε_k} and a sequence of policies
{µ^k} such that ε_k ↓ 0 and

    µ^k ∉ M∗,    T_{µ^k} J_k = T J_k,    ‖J_k − J*‖ < ε_k,    ∀ k = 0, 1, . . . .

Since M is finite, we may assume without loss of generality that for some
µ ∉ M∗, we have µ^k = µ for all k, so from the preceding relation we have

    T_µ J_k = T J_k,    ‖J_k − J*‖ < ε_k,    ∀ k = 0, 1, . . . .

Thus ‖J_k − J*‖ → 0, and by the contraction Assumption 2.5.2(b), we have

    ‖T_µ J_k − T_µ J*‖ → 0,    ‖T J_k − T J*‖ → 0.

Since T_µ J_k = T J_k, the limits of {T_µ J_k} and {T J_k} are equal, i.e., T_µ J* =
T J* = J*. Since J_µ is the unique fixed point of T_µ over F(X), it follows
that J_µ = J*, contradicting the earlier hypothesis that µ ∉ M∗. Q.E.D.

The preceding proof illustrates the key idea of the randomized opti-
mistic PI algorithm, which is that for µ ∈ M∗, the mappings T_µ^{m_k} have a
common fixed point that is equal to J*, the fixed point of T. Thus within
a distance ε from J*, the iterates (2.57) aim consistently at J*. Moreover,
because the probability of a VI (an iteration with m_k = 1) is positive, the
algorithm is guaranteed to eventually come within ε from J* through a
sufficiently long sequence of contiguous VI iterations. For this we need the
sequence {Jk } to be bounded, which will be shown as part of the proof of
the following proposition.

Proposition 2.5.5: Let Assumption 2.5.2 hold. Then for any start-
ing point J0 ∈ F (X), a sequence {Jk } generated by the randomized
optimistic PI algorithm (2.57)-(2.58) belongs to F (X) and converges
to J * with probability one.

Proof: We will show that {J_k} is bounded by showing that for all k, we
have

    max_{µ∈M} ‖J_k − J_µ‖ ≤ ρ^k max_{µ∈M} ‖J_0 − J_µ‖ + (1/(1 − ρ)) max_{µ,µ′∈M} ‖J_µ − J_{µ′}‖,    (2.59)

where ρ is a common contraction modulus of T_µ, µ ∈ M, and T. Indeed,
we have for all µ ∈ M

    ‖J_k − J_µ‖ ≤ ‖J_k − J_{µ^{k−1}}‖ + ‖J_{µ^{k−1}} − J_µ‖
              = ‖T_{µ^{k−1}}^{m_{k−1}} J_{k−1} − J_{µ^{k−1}}‖ + ‖J_{µ^{k−1}} − J_µ‖
              ≤ ρ^{m_{k−1}} ‖J_{k−1} − J_{µ^{k−1}}‖ + ‖J_{µ^{k−1}} − J_µ‖
              ≤ ρ^{m_{k−1}} max_{µ∈M} ‖J_{k−1} − J_µ‖ + max_{µ,µ′∈M} ‖J_µ − J_{µ′}‖
              ≤ ρ max_{µ∈M} ‖J_{k−1} − J_µ‖ + max_{µ,µ′∈M} ‖J_µ − J_{µ′}‖,

and finally, for all k,

    max_{µ∈M} ‖J_k − J_µ‖ ≤ ρ max_{µ∈M} ‖J_{k−1} − J_µ‖ + max_{µ,µ′∈M} ‖J_µ − J_{µ′}‖.

From this relation, we obtain Eq. (2.59) by induction.
    Thus in conclusion, we have {J_k} ⊂ D, where D is the bounded set

    D = { J | max_{µ∈M} ‖J − J_µ‖ ≤ max_{µ∈M} ‖J_0 − J_µ‖ + (1/(1 − ρ)) max_{µ,µ′∈M} ‖J_µ − J_{µ′}‖ }.

We use this fact to argue that with enough contiguous value iterations, i.e.,
iterations where mk = 1, Jk can be brought arbitrarily close to J * , and
once this happens, the algorithm operates like the ordinary VI algorithm.
Indeed, each time the iteration J_{k+1} = T J_k is performed (i.e., when
m_k = 1), the distance of the iterate J_k from J* is reduced by a factor ρ, i.e.,
‖J_{k+1} − J*‖ ≤ ρ‖J_k − J*‖. Since {J_k} belongs to the bounded set D, and
our randomization scheme includes the condition p(1) > 0, the algorithm is
guaranteed (with probability one) to eventually execute a sufficient number
of contiguous iterations J_{k+1} = T J_k to enter a sphere

    S_ε = { J ∈ F(X) | ‖J − J*‖ < ε }

of small enough radius ε to guarantee that the generated policy µ^k belongs
to M∗, as per Prop. 2.5.4. Once this happens, all subsequent iterations
reduce the distance ‖J_k − J*‖ by a factor ρ at every iteration, since

    ‖T_µ^m J − J*‖ ≤ ρ‖T_µ^{m−1} J − J*‖ ≤ ρ‖J − J*‖,    ∀ µ ∈ M∗, m ≥ 1, J ∈ S_ε.

Thus once {J_k} enters S_ε, it stays within S_ε and converges to J*. Q.E.D.
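
As an illustration of the randomized scheme (2.57)-(2.58), the sketch below draws m_k from a geometric distribution, which satisfies p(1) > 0; the function name, the parameter p1, the distribution, and the model arrays P and g are illustrative assumptions, not part of the analysis above.

import numpy as np

def randomized_optimistic_pi(P, g, alpha, p1=0.3, num_iter=200, seed=0):
    """Randomized optimistic PI (2.57)-(2.58) for a finite discounted MDP.

    m_k is drawn independently at each iteration with P(m_k = 1) = p1 > 0."""
    rng = np.random.default_rng(seed)
    n, m = g.shape
    J = np.zeros(n)
    for _ in range(num_iter):
        Q = g + alpha * np.stack([P[u] @ J for u in range(m)], axis=1)
        mu = Q.argmin(axis=1)                  # T_{mu^k} J_k = T J_k
        m_k = rng.geometric(p1)                # P(m_k = j) = p1 * (1 - p1)**(j - 1)
        P_mu = np.array([P[mu[x]][x] for x in range(n)])
        g_mu = g[np.arange(n), mu]
        for _ in range(m_k):                   # J_{k+1} = T_{mu^k}^{m_k} J_k
            J = g_mu + alpha * P_mu @ J
    return J, mu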

A Randomized Version of λ-Policy Iteration

We now turn to the λ-PI algorithm. Instead of the nonrandomized version


    T_{µ^k} J_k = T J_k,    J_{k+1} = T_{µ^k}^{(λ)} J_k,    k = 0, 1, . . . ,

cf. Eq. (2.34), we consider a randomized version that involves a fixed prob-
ability p ∈ (0, 1). It has the form
    T_{µ^k} J_k = T J_k,    J_{k+1} = T J_k with probability p,  and  J_{k+1} = T_{µ^k}^{(λ)} J_k with probability 1 − p.    (2.60)

The idea of the algorithm is similar to the one of the randomized


optimistic PI algorithm (2.57)-(2.58). Under the assumptions of Prop.

2.5.5, the sequence {Jk } generated by the randomized λ-PI algorithm (2.60)
belongs to F (X) and converges to J * with probability one. The reason is
that the contraction property of T_µ over F(X) with respect to the norm ‖·‖
implies that T_µ^{(λ)} is well-defined, and also implies that T_µ^{(λ)} is a contraction
ove