Abstract Dynamic Programming
THIRD EDITION
Dimitri P. Bertsekas
Arizona State University
Email: [email protected]
WWW: http://www.athenasc.com
1. Introduction . . . . . . . . . . . . . . . . . . . . p. 1
1.1. Structure of Dynamic Programming Problems . . . . . . . p. 2
1.2. Abstract Dynamic Programming Models . . . . . . . . . . p. 5
1.2.1. Problem Formulation . . . . . . . . . . . . . . . . p. 5
1.2.2. Monotonicity and Contraction Properties . . . . . . . p. 7
1.2.3. Some Examples . . . . . . . . . . . . . . . . . . p. 10
1.2.4. Reinforcement Learning - Projected and Aggregation Bellman Equations . . . p. 24
1.2.5. Reinforcement Learning - Temporal Difference and Proximal Algorithms . . . p. 26
1.3. Reinforcement Learning - Approximation in Value Space . . . p. 29
1.3.1. Approximation in Value Space for Markovian Decision Problems . . . p. 29
1.3.2. Approximation in Value Space and Newton's Method . . . p. 35
1.3.3. Policy Iteration and Newton's Method . . . p. 39
1.3.4. Approximation in Value Space for General Abstract Dynamic Programming . . . p. 41
1.4. Organization of the Book . . . . . . . . . . . . . . . . p. 41
1.5. Notes, Sources, and Exercises . . . . . . . . . . . . . . . p. 45
2. Contractive Models . . . . . . . . . . . . . . . . . p. 53
2.1. Bellman’s Equation and Optimality Conditions . . . . . . . p. 54
2.2. Limited Lookahead Policies . . . . . . . . . . . . . . . p. 61
2.3. Value Iteration . . . . . . . . . . . . . . . . . . . . . p. 66
2.3.1. Approximate Value Iteration . . . . . . . . . . . . . p. 67
2.4. Policy Iteration . . . . . . . . . . . . . . . . . . . . . p. 70
2.4.1. Approximate Policy Iteration . . . . . . . . . . . . p. 73
2.4.2. Approximate Policy Iteration Where Policies Converge . p. 75
2.5. Optimistic Policy Iteration and λ-Policy Iteration . . . . . . p. 77
2.5.1. Convergence of Optimistic Policy Iteration . . . . . . p. 79
2.5.2. Approximate Optimistic Policy Iteration . . . . . . . p. 84
2.5.3. Randomized Optimistic Policy Iteration . . . . . . . . p. 87
References . . . . . . . . . . . . . . . . . . . . . p. 391
Index . . . . . . . . . . . . . . . . . . . . . . . p. 401
Preface of the First Edition
This book aims at a unified and economical development of the core the-
ory and algorithms of total cost sequential decision problems, based on
the strong connections of the subject with fixed point theory. The analy-
sis focuses on the abstract mapping that underlies dynamic programming
(DP for short) and defines the mathematical character of the associated
problem. Our discussion centers on two fundamental properties that this
mapping may have: monotonicity and (weighted sup-norm) contraction. It
turns out that the nature of the analytical and algorithmic DP theory is
determined primarily by the presence or absence of these two properties,
and the rest of the problem’s structure is largely inconsequential.
In this book, with some minor exceptions, we will assume that mono-
tonicity holds. Consequently, we organize our treatment around the con-
traction property, and we focus on four main classes of models:
(a) Contractive models, discussed in Chapter 2, which have the richest
and strongest theory, and are the benchmark against which the the-
ory of other models is compared. Prominent among these models are
discounted stochastic optimal control problems. The development of
these models is quite thorough and includes the analysis of recent ap-
proximation algorithms for large-scale problems (neuro-dynamic pro-
gramming, reinforcement learning).
(b) Semicontractive models, discussed in Chapter 3 and parts of Chap-
ter 4. The term “semicontractive” is used qualitatively here, to refer
to a variety of models where some policies have a regularity/contrac-
tion-like property but others do not. A prominent example is stochas-
tic shortest path problems, where one aims to drive the state of
a Markov chain to a termination state at minimum expected cost.
These models also have a strong theory under certain conditions, of-
ten nearly as strong as those of the contractive models.
(c) Noncontractive models, discussed in Chapter 4, which rely on just
monotonicity. These models are more complex than the preceding
ones and much of the theory of the contractive models generalizes in
weaker form, if at all. For example, in general the associated Bell-
man equation need not have a unique solution, the value iteration
method may work starting with some functions but not with others,
and the policy iteration method may not work at all. Infinite hori-
zon examples of these models are the classical positive and negative
DP problems, first analyzed by Dubins and Savage, Blackwell, and Strauch.
Dimitri P. Bertsekas
Spring 2013
Preface to the Second Edition
The second edition aims primarily to amplify the presentation of the semi-
contractive models of Chapter 3 and Chapter 4, and to supplement it with
a broad spectrum of research results that I obtained and published in jour-
nals and reports since the first edition was written. As a result, the size
of this material more than doubled, and the size of the book increased by
about 40%.
In particular, I have thoroughly rewritten Chapter 3, which deals with
semicontractive models where stationary regular policies are sufficient. I
expanded and streamlined the theoretical framework, and I provided new
analyses of a number of shortest path-type applications (deterministic,
stochastic, affine monotonic, exponential cost, and robust/minimax), as
well as several types of optimal control problems with continuous state
space (including linear-quadratic, regulation, and planning problems).
In Chapter 4, I have extended the notion of regularity to nonstation-
ary policies (Section 4.4), aiming to explore the structure of the solution set
of Bellman’s equation, and the connection of optimality with other struc-
tural properties of optimal control problems. As an application, I have
discussed in Section 4.5 the relation of optimality with classical notions
of stability and controllability in continuous-spaces deterministic optimal
control. In Section 4.6, I have similarly extended the notion of a proper
policy to continuous-spaces stochastic shortest path problems.
I have also revised Chapter 1 a little (mainly with the addition of
Section 1.2.5 on the relation between proximal algorithms and temporal
difference methods), added to Chapter 2 some analysis relating to λ-policy
iteration and randomized policy iteration algorithms (Section 2.5.3), and I
have also added several new exercises (with complete solutions) to Chapters
1-4. Additional material relating to various applications can be found in
some of my journal papers, reports, and video lectures on semicontractive
models, which are posted at my web site.
In addition to the changes in Chapters 1-4, I have also eliminated from
the second edition the analysis that deals with restricted policies (Chap-
ter 5 and Appendix C of the first edition). This analysis is motivated in
part by the complex measurability questions that arise in mathematically
rigorous theories of stochastic optimal control with Borel state and control
spaces. This material is covered in Chapter 6 of the monograph by Bert-
sekas and Shreve [BeS78], and followup research on the subject has been
limited. Thus, I decided to just post Chapter 5 and Appendix C of the first
edition at the book’s web site (40 pages), and omit them from the second
edition. As a result of this choice, the entire book now requires only a
modest mathematical background, essentially a first course in analysis and
in elementary probability.
The range of applications of dynamic programming has grown enor-
mously in the last 25 years, thanks to the use of approximate simulation-
based methods for large and challenging problems. Because approximations
are often tied to special characteristics of specific models, their coverage in
this book is limited to general discussions in Chapter 1 and to error bounds
given in Chapter 2. However, much of the work on approximation methods
so far has focused on finite-state discounted, and relatively simple deter-
ministic and stochastic shortest path problems, for which there is solid and
robust analytical and algorithmic theory (part of Chapters 2 and 3 in this
monograph). As the range of applications becomes broader, I expect that
the level of mathematical understanding projected in this book will become
essential for the development of effective and reliable solution methods. In
particular, much of the new material in this edition deals with infinite-state
and/or complex shortest path type-problems, whose approximate solution
will require new methodologies that transcend the current state of the art.
Dimitri P. Bertsekas
January 2018
Preface to the Third Edition
Dimitri P. Bertsekas
February 2022
1 Introduction
Dynamic programming (DP for short) is the principal method for analysis
of a large and diverse class of sequential decision problems. Examples are
deterministic and stochastic optimal control problems with a continuous
state space, Markov and semi-Markov decision problems with a discrete
state space, minimax problems, and sequential zero-sum games. While the
nature of these problems may vary widely, their underlying structures turn
out to be very similar. In all cases there is an underlying mapping that
depends on an associated controlled dynamic system and corresponding
cost per stage. This mapping, the DP (or Bellman) operator, provides a
compact “mathematical signature” of the problem. It defines the cost func-
tion of policies and the optimal cost function, and it provides a convenient
shorthand notation for algorithmic description and analysis.
More importantly, the structure of the DP operator defines the math-
ematical character of the associated problem. The purpose of this book is to
provide an analysis of this structure, centering on two fundamental prop-
erties: monotonicity and (weighted sup-norm) contraction. It turns out
that the nature of the analytical and algorithmic DP theory is determined
primarily by the presence or absence of one or both of these two properties,
and the rest of the problem’s structure is largely inconsequential.
We first consider a deterministic optimal control problem involving the system
$$x_{k+1} = f(x_k, u_k), \qquad k = 0, 1, \ldots.$$
Here x_k is the state of the system taking values in a set X (the state space),
and uk is the control taking values in a set U (the control space). † At stage
k, there is a cost
$$\alpha^k g(x_k, u_k)$$
incurred when uk is applied at state xk , where α is a scalar in (0, 1] that has
the interpretation of a discount factor when α < 1. The controls are chosen
as a function of the current state, subject to a constraint that depends on
that state. In particular, at state x the control is constrained to take values
in a given set U (x) ⊂ U . Thus we are interested in optimization over the
set of (nonstationary) policies
$$\Pi = \big\{ \{\mu_0, \mu_1, \ldots\} \mid \mu_k \in \mathcal{M},\ k = 0, 1, \ldots \big\},$$
(We use limit superior rather than limit to cover the case where the limit
does not exist.) The optimal cost function is
$$J^*(x) = \inf_{\pi\in\Pi} J_\pi(x), \qquad x \in X.$$
We now note that both Eqs. (1.3) and (1.4) can be stated in terms of
the expression
$$H(x, u, J) = g(x, u) + \alpha J\big(f(x, u)\big), \qquad x \in X,\ u \in U(x).$$
Defining
$$(T_\mu J)(x) = H\big(x, \mu(x), J\big), \qquad x \in X,$$
and
$$(TJ)(x) = \inf_{u \in U(x)} H(x, u, J), \qquad x \in X,$$
Then it can be verified by induction that for all initial states x0 , we have
$$J_{\pi,N}(x_0) = \big(T_{\mu_0} T_{\mu_1}\cdots T_{\mu_{N-1}}\bar J\big)(x_0). \tag{1.6}$$
Here $T_{\mu_0} T_{\mu_1}\cdots T_{\mu_{N-1}}$ is the composition of the mappings $T_{\mu_0}, T_{\mu_1}, \ldots, T_{\mu_{N-1}}$, i.e., for all J,
$$(T_{\mu_0} T_{\mu_1} J)(x) = \big(T_{\mu_0}(T_{\mu_1} J)\big)(x), \qquad x \in X,$$
and more generally
$$(T_{\mu_0} T_{\mu_1}\cdots T_{\mu_{N-1}} J)(x) = \Big(T_{\mu_0}\big(T_{\mu_1}(\cdots(T_{\mu_{N-1}} J))\big)\Big)(x), \qquad x \in X,$$
(our notational conventions are summarized in Appendix A). Thus the
finite horizon cost functions Jπ,N of π can be defined in terms of the map-
pings Tµ [cf. Eq. (1.6)], and so can the infinite horizon cost function Jπ :
$$J_\pi(x) = \limsup_{N\to\infty}\,\big(T_{\mu_0} T_{\mu_1}\cdots T_{\mu_{N-1}}\bar J\big)(x), \qquad x \in X, \tag{1.7}$$
The Bellman equation (1.3) and the optimality condition (1.4), stated in
terms of the mappings Tµ and T , highlight a central theme of this book,
which is that DP theory is intimately connected with the theory of abstract
mappings and their fixed points. Analogs of the Bellman equation, J * =
T J * , optimality conditions, and other results and computational methods
hold for a great variety of DP models, and can be stated compactly as
described above in terms of the corresponding mappings Tµ and T . The
gain from this abstraction is greater generality and mathematical insight,
as well as a more unified, economical, and streamlined analysis.
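To make the abstract viewpoint concrete, the following minimal Python sketch (not from the book) defines H, Tµ, and T for a small made-up deterministic problem with a discount factor; it only illustrates how a cost function J is transformed by these mappings, and how repeated application of T approaches a fixed point.

```python
# A minimal sketch (not from the book) of the abstract mappings H, T_mu, T
# for a small made-up deterministic problem with discount factor ALPHA.
ALPHA = 0.9
X = [0, 1, 2]                        # state space
U = {0: [0, 1], 1: [0, 1], 2: [0]}   # control constraint sets U(x)

def f(x, u):                         # system function: next state
    return (x + u) % 3

def g(x, u):                         # cost per stage
    return 0.0 if x == 2 else 1.0

def H(x, u, J):                      # the mapping underlying DP
    return g(x, u) + ALPHA * J[f(x, u)]

def T_mu(mu, J):                     # (T_mu J)(x) = H(x, mu(x), J)
    return {x: H(x, mu[x], J) for x in X}

def T(J):                            # (T J)(x) = min_{u in U(x)} H(x, u, J)
    return {x: min(H(x, u, J) for u in U[x]) for x in X}

J = {x: 0.0 for x in X}
for _ in range(200):                 # repeated application of T approaches J*
    J = T(J)
mu = {x: min(U[x], key=lambda u: H(x, u, J)) for x in X}   # greedy policy at J
print("J* ~", J, "  greedy policy:", mu)
```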
which we view as the “infinite horizon cost function” of π [cf. Eq. (1.7); we
use lim sup for generality, since we are not assured that the limit exists].
We want to minimize Jπ over π, i.e., to find
$$J^*(x) = \inf_{\pi} J_\pi(x), \qquad x \in X,$$
or equivalently,†
$$J \le J' \ \ \Rightarrow\ \ TJ \le TJ', \qquad\quad J \le J' \ \ \Rightarrow\ \ T_\mu J \le T_\mu J', \quad \forall\ \mu \in \mathcal{M},$$
on B(X). The properties of B(X) and some of the associated fixed point
theory are discussed in Appendix B. In particular, as shown there, B(X)
is a complete normed space, so any mapping from B(X) to B(X) that is a
contraction or an m-stage contraction for some integer m > 1, with respect
to ‖·‖, has a unique fixed point (cf. Props. B.1 and B.2).
In what follows in this section, we describe a few special cases, which indi-
cate the connections of appropriate forms of the mapping H with the most
popular total cost DP models. In all these models the monotonicity As-
sumption 1.2.1 (or some closely related version) holds, but the contraction
Assumption 1.2.2 may not hold, as we will indicate later. Our descriptions
are by necessity brief, and the reader is referred to the relevant textbook
literature for more detailed discussion.
In particular, it can be shown that the limit exists if α < 1 and the expected
value of |g| is uniformly bounded, i.e., for some B > 0,
$$E\big\{\big|g(x, u, w)\big|\big\} \le B, \qquad \forall\ x \in X,\ u \in U(x). \tag{1.12}$$
so that
$$(T_\mu J)(x) = E\Big\{ g\big(x, \mu(x), w\big) + \alpha J\big(f(x, \mu(x), w)\big)\Big\},$$
and
$$(TJ)(x) = \inf_{u\in U(x)} E\Big\{ g(x, u, w) + \alpha J\big(f(x, u, w)\big)\Big\}.$$
In this way, taking also into account the rule ∞−∞ = ∞ (see Appendix A), E{w}
is well-defined as an extended real number if Ω is finite or countably infinite.
and parallels the one given for deterministic optimal control problems [cf. Eq.
(1.3)].
These properties can be expressed and analyzed in an abstract setting
by using just the mappings Tµ and T , both when Tµ and T are contractive
(see Chapter 2), and when they are only monotone and not contractive while
either g ≥ 0 or g ≤ 0 (see Chapter 4). Moreover, under some conditions, it is
possible to analyze these properties in cases where Tµ is contractive for some
but not all µ (see Chapter 3, and Section 4.4).
In the special case of the preceding example where the number of states is
finite, the system equation (1.10) may be defined in terms of the transition
probabilities
$$p_{xy}(u) = \mathrm{Prob}\big(y = f(x, u, w) \mid x\big), \qquad x, y \in X,\ u \in U(x),$$
[cf. Eq. (1.12)] holds (or more simply, when U is a finite set), the mappings Tµ
and T are contraction mappings with respect to the standard (unweighted)
sup-norm. This is a classical model, referred to as discounted finite-state
MDP , which has a favorable theory and has found extensive applications (cf.
[Ber12a], Chapters 1 and 2). The model is additionally important, because it
is often used for computational solution of continuous state space problems
via discretization.
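As an illustration of the discounted finite-state MDP case, here is a hedged Python sketch of the Bellman operator T and value iteration; the transition probabilities and stage costs below are made up for the example and are not from the book.

```python
import numpy as np

# Sketch of the Bellman operator and value iteration for a discounted
# finite-state MDP; transition probabilities p and stage costs g are made up.
alpha = 0.9
# p[u][x][y]: transition probability from x to y under control u
p = np.array([[[0.8, 0.2], [0.3, 0.7]],     # control u = 0
              [[0.5, 0.5], [0.9, 0.1]]])    # control u = 1
g = np.array([[2.0, 1.0],                   # g[u][x]: expected stage cost
              [0.5, 3.0]])

def T(J):
    # (T J)(x) = min_u [ g(x, u) + alpha * sum_y p_xy(u) J(y) ]
    Q = g + alpha * p @ J                   # Q[u, x]
    return Q.min(axis=0)

J = np.zeros(2)
for k in range(200):
    J = T(J)                                # converges geometrically to J*
print("J* ~", J)
```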
where G is some function representing expected cost per stage, and mxy (u)
are nonnegative scalars with
$$\sum_{y\in X} m_{xy}(u) < 1, \qquad \forall\ x \in X,\ u \in U(x).$$
Let us consider the situation where a separate game of the type just
described is played at each stage. The game played at a given stage is repre-
sented by a “state” x that takes values in a finite set X. The state evolves
according to transition probabilities qxy (i, j) where i and j are the moves
selected by the minimizer and the maximizer, respectively (here y represents
the next game to be played after moves i and j are chosen at the game rep-
resented by x). When the state is x, under u ∈ U and v ∈ V , the one-stage
expected payoff is u′A(x)v, where A(x) is the n × m payoff matrix, and the
state transition probabilities are
$$p_{xy}(u, v) = \sum_{i=1}^{n}\sum_{j=1}^{m} u_i\, v_j\, q_{xy}(i, j) = u' Q_{xy}\, v,$$
where Qxy is the n × m matrix that has components qxy (i, j). Payoffs are
discounted by α ∈ (0, 1), and the objectives of the minimizer and maximizer,
roughly speaking, are to minimize and to maximize the total discounted ex-
pected payoff. This requires selections of u and v to strike a balance between
obtaining favorable current stage payoffs and playing favorable games in fu-
ture stages.
We now introduce an abstract DP framework related to the sequential
move selection process just described. We consider the mapping G given by
$$G(x, u, v, J) = u' A(x)\, v + \alpha \sum_{y\in X} p_{xy}(u, v)\, J(y) = u'\Big( A(x) + \alpha \sum_{y\in X} Q_{xy}\, J(y) \Big) v, \tag{1.14}$$
and
$$(TJ)(x) = \min_{u\in U}\ \max_{v\in V}\ G(x, u, v, J).$$
[cf. Eq. (1.14)] is a matrix that is independent of u and v, we may view J*(x) as the value of a static game (which depends on the state x). In particular, from the fundamental minimax equality (1.13), the minimization over u and the maximization over v in this static game can be interchanged. This implies that J* is also the unique fixed point of the mapping
$$(\bar TJ)(x) = \max_{v\in V} \bar H(x, v, J),$$
where
$$\bar H(x, v, J) = \min_{u\in U} G(x, u, v, J),$$
i.e., J* is the fixed point regardless of the order in which minimizer and maximizer select mixed strategies at each stage.
In the preceding development, we have introduced J* as the unique fixed point of the mappings T and T̄. However, J* also has an interpretation
in game theoretic terms. In particular, it can be shown that J ∗ (x) is the value
of a dynamic game, whereby at state x the two opponents choose multistage
(possibly nonstationary) policies that consist of functions of the current state,
and continue to select moves using these policies over an infinite horizon. For
further discussion of this interpretation, we refer to [Ber12a] and to books on
dynamic games such as [FiV96]; see also [PaB99] and [Yu14] for an analysis
of the undiscounted case (α = 1) where there is a termination state, as in
the stochastic shortest path problems of the subsequent Example 1.2.6. An
alternative and more general formulation of sequential zero-sum games, which
allows for an infinite state space, will be given in Chapter 5.
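For the static game that underlies the mapping (1.14), the minimax value min_u max_v u′Mv over mixed strategies can be computed by linear programming. The sketch below (not from the book) uses scipy.optimize.linprog with a made-up payoff matrix M to illustrate this kind of minimax subproblem.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch: value of the static zero-sum game min_u max_v u'Mv over mixed
# strategies; the payoff matrix M is made up. The minimizer picks a
# distribution over rows, the maximizer over columns.
M = np.array([[3.0, -1.0],
              [0.0,  2.0]])
n, m = M.shape

# Variables: (u_1, ..., u_n, t); minimize t subject to (M'u)_j <= t for all j,
# sum(u) = 1, u >= 0.
c = np.zeros(n + 1); c[-1] = 1.0
A_ub = np.hstack([M.T, -np.ones((m, 1))])
b_ub = np.zeros(m)
A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
b_eq = np.array([1.0])
bounds = [(0, None)] * n + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
u_opt, value = res.x[:n], res.x[-1]
print("optimal mixed strategy:", u_opt, "  game value:", value)
```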
The stochastic shortest path (SSP for short) problem is the special case of
the stochastic optimal control Example 1.2.1 where:
(a) There is no discounting (α = 1).
(b) The state space is X = {t, 1, . . . , n} and we are given transition probabilities, denoted by
$$p_{xy}(u) = P\big(x_{k+1} = y \mid x_k = x,\ u_k = u\big), \qquad x, y \in X,\ u \in U(x).$$
To simplify the notation, we have assumed that the cost per stage does not
depend on the successor state, which amounts to using expected cost per
stage in all calculations.
Since the termination state t is cost-free, the cost starting from t is zero
for every policy. Accordingly, for all cost functions, we ignore the component
that corresponds to t, and define
$$H(x, u, J) = g(x, u) + \sum_{y=1}^{n} p_{xy}(u)\, J(y), \qquad x = 1, \ldots, n,\ u \in U(x),\ J \in \Re^n.$$
$$(T_\mu J)(x) = g\big(x, \mu(x)\big) + \sum_{y=1}^{n} p_{xy}\big(\mu(x)\big)\, J(y), \qquad x = 1, \ldots, n,$$
$$(TJ)(x) = \min_{u\in U(x)}\Big[\, g(x, u) + \sum_{y=1}^{n} p_{xy}(u)\, J(y) \,\Big], \qquad x = 1, \ldots, n.$$
Note that the matrix that has components pxy (u), x, y = 1, . . . , n, is sub-
stochastic (some of its row sums may be less than 1) because there may be
a positive transition probability from a state x to the termination state t.
Consequently Tµ may be a contraction for some µ, but not necessarily for all
µ ∈ M.
The SSP problem has been discussed in many sources, including the
books [Pal67], [Der70], [Whi82], [Ber87], [BeT89], [HeL99], [Ber12a], and
[Ber17a], where it is sometimes referred to by earlier names such as “first
passage problem” and “transient programming problem.” In the framework
that is most relevant to our purposes, given in the paper by Bertsekas and
Tsitsiklis [BeT91], there is a classification of stationary policies for SSP into
proper and improper. We say that µ ∈ M is proper if, when using µ, there is positive probability that termination will be reached after at most n stages, regardless of the initial state; i.e., if
$$\rho_\mu = \max_{x=1,\ldots,n} P\{x_n \ne t \mid x_0 = x,\ \mu\} < 1.$$
Otherwise, µ is called improper.
$$T_{\mu^*} J^* = T J^*.$$
In addition, J* and Jµ can be computed by value iteration,
$$J^* = \lim_{k\to\infty} T^k J, \qquad J_\mu = \lim_{k\to\infty} T_\mu^k J,$$
starting with any J ∈ ℜ^n (see [Ber12a], Chapter 3, for a textbook account).
These properties are in analogy with the desirable properties (a)-(c), given at
the end of the preceding subsection in connection with contractive models.
Regarding policy iteration, it works in its strongest form when there are
no improper policies, in which case the mappings Tµ and T are weighted sup-
norm contractions. When there are improper policies, modifications to the
policy iteration method are needed; see [Ber12a], [YuB13a], and also Section
3.6.2, where these modifications will be discussed in an abstract setting.
In Section 3.5.1 we will also consider SSP problems where the strong
SSP conditions (a) and (b) above are not satisfied. Then we will see that
unusual phenomena can occur, including that J ∗ may not be a solution of
Bellman’s equation. Still our line of analysis of Chapter 3 will apply to such
problems.
The special case of the SSP problem where the state transitions are determin-
istic is the classical shortest path problem. Here, we have a graph of n nodes
x = 1, . . . , n, plus a destination t, and an arc length axy for each directed arc
(x, y). At state/node x, a policy µ chooses an outgoing arc from x. Thus the
controls available at x can be identified with the outgoing neighbors of x [the
nodes u such that (x, u) is an arc]. The corresponding mapping H is
$$H(x, u, J) = \begin{cases} a_{xu} + J(u) & \text{if } u \ne t, \\ a_{xt} & \text{if } u = t, \end{cases} \qquad x = 1, \ldots, n.$$
A stationary policy µ defines a graph whose arcs are $\big(x, \mu(x)\big)$, x = 1, . . . , n. The policy µ is proper if and only if this graph is acyclic (it consists of
a tree of directed paths leading from each node to the destination). Thus there
exists a proper policy if and only if each node is connected to the destination
with a directed path. Furthermore, an improper policy has finite cost starting
from every initial state if and only if all the cycles of the corresponding graph
have nonnegative cycle cost. It follows that the favorable analytical and
algorithmic results described for SSP in the preceding example hold if the
given graph is connected and the costs of all its cycles are positive. We will
see later that significant complications result if the cycle costs are allowed to
be zero, even though the shortest path problem is still well posed in the sense
that shortest paths exist if the given graph is connected (see Section 3.1).
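As a concrete illustration, the following sketch applies value iteration with the shortest path mapping H above to a small made-up graph with destination t; when all cycle costs are positive, n iterations yield the shortest distances. This is only a sketch under those assumptions, not the book's code.

```python
import math

# Sketch of value iteration (T J)(x) = min over outgoing arcs (x, y) of
# [a_xy + J(y)] for the deterministic shortest path problem; the graph and
# arc lengths below are made up, with destination node 't'.
arcs = {(1, 2): 1.0, (1, 't'): 10.0, (2, 3): 2.0, (2, 't'): 5.0, (3, 't'): 1.0}
nodes = [1, 2, 3]

def T(J):
    Jnew = {}
    for x in nodes:
        best = math.inf
        for (i, y), length in arcs.items():
            if i == x:
                best = min(best, length + (0.0 if y == 't' else J[y]))
        Jnew[x] = best
    return Jnew

J = {x: 0.0 for x in nodes}
for _ in range(len(nodes)):   # n iterations suffice when all cycle costs are positive
    J = T(J)
print(J)                      # expected: {1: 4.0, 2: 3.0, 3: 1.0}
```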
$$\bar J(x) \equiv 1, \qquad x \in X,$$
so that
$$J_\pi(x) = \limsup_{N\to\infty}\,\big(T_{\mu_0} T_{\mu_1}\cdots T_{\mu_{N-1}}\bar J\big)(x), \qquad x \in X.$$
which by using the iterated expectations formula (see e.g., [BeT08]) proves
the expression (1.17).
An important special case of a multiplicative model is when g has the
form
$$g(x, u, y) = e^{h(x,u,y)}$$
for some one-stage cost function h. We then obtain a finite-state MDP with
an exponential cost function,
$$J_\pi(x_0) = \limsup_{N\to\infty}\, E\Big\{ e^{\,h(x_0,\mu_0(x_0),x_1)+\cdots+h(x_{N-1},\mu_{N-1}(x_{N-1}),x_N)}\Big\},$$
which is often used to introduce risk aversion in the choice of policy through
the convexity of the exponential.
There is also a multiplicative version of the infinite state space stochas-
tic optimal control problem of Example 1.2.1. The mapping H takes the
form
$$H(x, u, J) = E\Big\{ g(x, u, w)\, J\big(f(x, u, w)\big)\Big\},$$
where xk+1 = f (xk , uk , wk ) is the underlying discrete-time dynamic system;
cf. Eq. (1.10).
Multiplicative models and related risk-sensitive models are discussed
extensively in the literature, mostly for the exponential cost case and under
different assumptions than ours; see e.g., [HoM72], [Jac73], [Rot84], [ChS87],
[Whi90], [JBE94], [FlM95], [HeM96], [FeM97], [BoM99], [CoM99], [BoM02],
[BBB08], [Ber16a]. The works of references [DeR79], [Pat01], and [Pat07]
relate to the stochastic shortest path problems of Example 1.2.6, and are the
closest to the semicontractive models discussed in Chapters 3 and 4, based
on the author’s paper [Ber16a]; see the next example and Section 3.5.2.
where pxy (u) is the probability of transition from x to y under u, and g(x, u, y)
is the cost of the transition; see Section 3.5.2 for a detailed derivation. Clearly
Tµ has the affine monotonic form (1.18).
$$\sum_{i=1}^{n} d_{xi} = 1, \qquad \forall\ x \in A.$$
Figure 1.2.2 Illustration of the relation between aggregate and original system states.
$$d_{xi} = \begin{cases} 1 & \text{if } i = i_x, \\ 0 & \text{if } i \ne i_x, \end{cases} \tag{1.19}$$
$$\phi_{jy} = \begin{cases} 1 & \text{if } j = j_y, \\ 0 & \text{if } j \ne j_y. \end{cases}$$
the original state space with S_1 ∪ ··· ∪ S_m = {1, . . . , n}. We envision a network of processors ℓ = 1, . . . , m, each assigned to the computation of a local cost function V_ℓ, defined on the corresponding aggregate state/subset S_ℓ. Processor ℓ also maintains a scalar aggregate cost R_ℓ for its aggregate state, which is a weighted average of the detailed cost values V_ℓ(x) within S_ℓ:
$$R_\ell = \sum_{x\in S_\ell} d_{\ell x}\, V_\ell(x),$$
where d_{ℓx} are given probabilities with d_{ℓx} ≥ 0 and $\sum_{x\in S_\ell} d_{\ell x} = 1$. The aggregate costs R_ℓ are communicated between processors and are used to perform the computation of the local cost functions V_ℓ (we will discuss computation models of this type in Section 2.6).
We denote J = (V1 , . . . , Vm , R1 , . . . , Rm ). We introduce the mapping
H(x, u, J) defined for each of the n states x by
and for each original system state y, we denote by s(y) the index of the subset
to which y belongs [i.e., y ∈ Ss(y) ].
We may view H as an abstract mapping on the space of J, and aim to find its fixed point J* = (V_1^*, . . . , V_m^*, R_1^*, . . . , R_m^*). Then, for ℓ = 1, . . . , m, we may view V_ℓ^* as an approximation to the optimal cost vector of the original MDP starting at states x ∈ S_ℓ, and we may view R_ℓ^* as a form of aggregate cost for S_ℓ. The advantage of this formulation is that it involves significant decomposition and parallelization of the computations among the processors, when performing various DP algorithms. In particular, the computation of W_ℓ(x, u, V_ℓ, R_1, . . . , R_m) depends on just the local vector V_ℓ, whose dimension may be potentially much smaller than n.
A multistep version of the mapping T is defined by
$$T^{(\lambda)} = (1-\lambda)\sum_{\ell=0}^{\infty} \lambda^{\ell}\, T^{\ell+1}$$
for λ ∈ (0, 1), where T^ℓ is the ℓ-fold composition of T with itself.
Here there should be conditions that guarantee the convergence of the
infinite series in the preceding definition. The multistep analog of the
projected Eq. (1.20) is
$$\Phi r = \Pi_\xi\, T^{(\lambda)}(\Phi r).$$
The popular temporal difference methods, such as TD(λ), LSTD(λ), and
LSPE(λ), aim to solve this equation (see the book references on approx-
imate DP, neuro-dynamic programming, and reinforcement learning cited
earlier). The mapping T (λ) also forms the basis for the λ-policy iteration
method to be discussed in Sections 2.5, 3.2.4, and 4.3.3.
The multistep analog of the aggregation Eq. (1.22) is
$$\Phi r = \Phi D\, T^{(\lambda)}(\Phi r),$$
and methods that are similar to the temporal difference methods can be used for its solution. In particular, a multistep method based on the mapping T^{(λ)} is the so-called λ-aggregation method (see [Ber12a], Chapter 6), as well as other forms of aggregation (see [Ber12a], [YuB12]).
In the case where T is a linear mapping of the form
T J = AJ + b,
Equivalently,
$$P^{(c)} J = \Big( \frac{c+1}{c}\, I - A \Big)^{-1} \Big( b + \frac{1}{c}\, J \Big), \tag{1.23}$$
where I is the identity matrix. Then it can be shown (see Exercise 1.2 or
the papers [Ber16b], [Ber18c]) that if
$$c = \frac{\lambda}{1-\lambda},$$
we have
$$T^{(\lambda)} = T\cdot P^{(c)} = P^{(c)}\cdot T.$$
Moreover, the vectors J, P^{(c)}J, and T^{(λ)}J are collinear and satisfy
$$T^{(\lambda)} J = J + \frac{c+1}{c}\,\big(P^{(c)} J - J\big).$$
The preceding formulas show that T (λ) and P (c) are closely related, and
that iterating with T (λ) is “faster” than iterating with P (c) , since the eigen-
values of A are within the unit circle, so that T is a contraction. In addition,
methods such as TD(λ), LSTD(λ), LSPE(λ), and their projected versions,
which are based on T (λ) , can be adapted to be used with P (c) .
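The identity T^{(λ)} = T P^{(c)} with c = λ/(1−λ) can be checked numerically for a linear mapping. The sketch below (made-up A and b, not from the book) compares a truncated series evaluation of T^{(λ)}J with T applied to the proximal iterate P^{(c)}J of Eq. (1.23).

```python
import numpy as np

# Numeric check of the identity T^(lambda) = T P^(c), c = lambda/(1-lambda),
# for a linear mapping T J = A J + b; A and b are made up.
n = 3
rng = np.random.default_rng(1)
A = 0.3 * rng.random((n, n))          # spectral radius well below 1
b = rng.random(n)
lam = 0.6
c = lam / (1.0 - lam)

def T(J):
    return A @ J + b

def T_lambda(J, terms=200):
    # T^(lambda) J = (1 - lam) * sum_{l>=0} lam^l T^{l+1} J  (truncated series)
    out, TJ = np.zeros(n), J.copy()
    for l in range(terms):
        TJ = T(TJ)                    # TJ = T^{l+1} J
        out += (1 - lam) * lam**l * TJ
    return out

def P_c(J):
    # proximal mapping of Eq. (1.23): ((c+1)/c I - A)^{-1} (b + J / c)
    return np.linalg.solve((c + 1) / c * np.eye(n) - A, b + J / c)

J = rng.random(n)
print(np.allclose(T_lambda(J), T(P_c(J))))   # expected: True
```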
A more general form of multistep approach, introduced and studied in the paper [YuB12], replaces T^{(λ)} with a mapping T^{(w)} : ℜ^n → ℜ^n that has components
$$\big(T^{(w)} J\big)(i) = \sum_{\ell=1}^{\infty} w_{i\ell}\,\big(T^{\ell} J\big)(i), \qquad i = 1, \ldots, n,\ J \in \Re^n,$$
where w is a vector sequence whose ith component, (wi1 , wi2 , . . .), is a prob-
ability distribution over the positive integers. Then the multistep analog
of the projected equation (1.20) is
$$\Phi r = \Pi_\xi\, T^{(w)}(\Phi r), \tag{1.24}$$
while the multistep analog of the aggregation equation (1.22) is
$$\Phi r = \Phi D\, T^{(w)}(\Phi r). \tag{1.25}$$
The mapping T^{(λ)} is obtained for w_{iℓ} = (1 − λ)λ^{ℓ−1}, independently of the state i. A more general version, where λ depends on the state i, is obtained for w_{iℓ} = (1 − λ_i)λ_i^{ℓ−1}. The solution of Eqs. (1.24) and (1.25)
by simulation-based methods is discussed in the paper [YuB12]; see also
Exercise 1.3.
Let us also note that there is a connection between projected equa-
tions of the form (1.24) and aggregation equations of the form (1.25). This
connection is based on the use of a seminorm [this is given by the same
expression as the norm ‖·‖_ξ of Eq. (1.21), with some of the components
of ξ allowed to be 0]. In particular, the most prominent cases of aggrega-
tion equations can be viewed as seminorm projected equations because, for
these cases, ΦD is a seminorm projection (see [Ber12a], p. 639, [YuB12],
Section 4). Moreover, they can also be viewed as projected equations where
the projection is oblique (see [Ber12a], Section 7.3.6).
and
$$(TJ)(x) = \inf_{u\in U(x)} E\Big\{ g(x, u, w) + \alpha J\big(f(x, u, w)\big)\Big\}, \qquad \text{for all } x, \tag{1.28}$$
Tµ J = Gµ + Aµ J,
Aµ (γ1 J1 + γ2 J2 ) = γ1 Aµ J1 + γ2 Aµ J2 .
This is true because of the linearity of the expected value operation in Eq.
(1.27). The linearity of Tµ implies another important property: (T J)(x) is
a concave function of J for every x. By this we mean that the set
! "
Cx = (J, ξ) | (T J)(x) ≥ ξ, J ∈ R(X), ξ ∈ & (1.29)
is convex for all x ∈ X, where R(X) is the set of real-valued functions over
the state space X, and & is the set of real numbers. This follows from the
linearity of Tµ , the alternative definition of T given by Eq. (1.26), and the
fact that for a fixed x, the minimum of the linear functions (Tµ J)(x) over
µ ∈ M is concave as a function of J.
We illustrate these properties graphically with an example.
Assume that there are two states 1 and 2, and two controls u and v. Consider
the policy µ that applies control u at state 1 and control v at state 2. Then
the operator Tµ takes the form
$$(T_\mu J)(1) = \sum_{y=1}^{2} p_{1y}(u)\big( g(1, u, y) + \alpha J(y)\big), \tag{1.30}$$
$$(T_\mu J)(2) = \sum_{y=1}^{2} p_{2y}(v)\big( g(2, v, y) + \alpha J(y)\big), \tag{1.31}$$
where pxy (u) and pxy (v) are the probabilities that the next state will be y,
when the current state is x, and the control is u or v, respectively. Clearly,
(Tµ J)(1) and (Tµ J)(2) are linear functions of J. Also the operator T of the
Bellman equation J = T J takes the form
$$(TJ)(1) = \min\left[\ \sum_{y=1}^{2} p_{1y}(u)\big(g(1, u, y) + \alpha J(y)\big),\ \ \sum_{y=1}^{2} p_{1y}(v)\big(g(1, v, y) + \alpha J(y)\big)\right], \tag{1.32}$$
$$(TJ)(2) = \min\left[\ \sum_{y=1}^{2} p_{2y}(u)\big(g(2, u, y) + \alpha J(y)\big),\ \ \sum_{y=1}^{2} p_{2y}(v)\big(g(2, v, y) + \alpha J(y)\big)\right]. \tag{1.33}$$
Thus, (T J)(1) and (T J)(2) are concave and piecewise linear as functions of
the two-dimensional vector J (with two pieces; more generally, as many linear
pieces as the number of controls). This concavity property holds in general
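The concavity of (T J)(x) as the minimum of linear functions of J can be verified numerically. The sketch below (with made-up probabilities and costs for state 1, not the data of the example) checks midpoint concavity of (T J)(1) on random pairs of cost vectors.

```python
import numpy as np

# Illustration (made-up data) that (T J)(1) = min over controls of linear
# functions of J is concave: checking midpoint concavity on random J's.
alpha = 0.9
p = {('u', 1): [0.8, 0.2], ('v', 1): [0.4, 0.6]}    # p_{1y}(u), p_{1y}(v)
g = {('u', 1): [1.0, 4.0], ('v', 1): [2.0, 0.5]}    # g(1, u, y), g(1, v, y)

def TJ_at_state1(J):
    vals = []
    for ctrl in ('u', 'v'):
        probs, costs = p[(ctrl, 1)], g[(ctrl, 1)]
        vals.append(sum(pr * (c + alpha * Jy) for pr, c, Jy in zip(probs, costs, J)))
    return min(vals)                                 # lower envelope of linear pieces

rng = np.random.default_rng(2)
for _ in range(1000):
    J1, J2 = rng.normal(size=2), rng.normal(size=2)
    mid = 0.5 * (J1 + J2)
    # concavity: (T mid)(1) >= ((T J1)(1) + (T J2)(1)) / 2
    assert TJ_at_state1(mid) >= 0.5 * (TJ_at_state1(J1) + TJ_at_state1(J2)) - 1e-9
print("midpoint concavity of (T J)(1) verified on random samples")
```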
Geometrical Interpretations
Let us consider the mapping T for a problem that involves two states, 1 and
2, but an infinite number of controls. In particular, the control space at both
Figure 1.3.2 Geometric interpretation of the linear Bellman operator Tµ and the corresponding Bellman equation. The graph of Tµ is a plane in the space ℜ × ℜ, and when projected on a one-dimensional plane that corresponds to a single state and passes through Jµ, it becomes a line. Then there are three cases:
(a) The line has slope less than 45 degrees, so it intersects the 45-degree line at
a unique point, which is equal to Jµ , the solution of the Bellman equation
J = Tµ J. This is true if Tµ is a contraction mapping, as is the case for
discounted problems with bounded cost per stage.
(b) The line has slope greater than 45 degrees. Then it intersects the 45-degree line at a unique point, which is a solution of the Bellman equation J = Tµ J, but is not equal to Jµ. Then Jµ is not real-valued; we regard such a policy µ as unstable.
(c) The line has slope exactly equal to 45 degrees. This is an exceptional case
where the Bellman equation J = Tµ J has an infinite number of real-valued
solutions or no real-valued solution at all; we will provide examples where
this occurs later.
states is the unit interval, U (1) = U (2) = [0, 1]. Here (T J)(1) and (T J)(2)
are given by
Figure 1.3.3 Geometric interpretation of the Bellman operator T , and the cor-
responding Bellman equation. For a fixed x, the function (T J)(x) can be written
as minµ (Tµ J)(x), so it is concave as a function of J. The optimal cost function
J ∗ satisfies J ∗ = T J ∗ , so it is obtained from the intersection of the graph of T J
and the 45 degree line shown, assuming J ∗ is real-valued.
Note that the graph of T lies below the graph of every operator Tµ, and is in fact obtained as the lower envelope of the graphs of Tµ as µ ranges over the set of policies M. In particular, for any given function J̃ and for every x, the value (T J̃)(x) is obtained by finding a support hyperplane/subgradient of the graph of the concave function (T J)(x) at J̃, as shown in the figure. This support hyperplane is defined by the control µ̃(x) of a policy µ̃ that attains the minimum of (Tµ J̃)(x) over µ:
$$\tilde\mu(x) \in \arg\min_{\mu\in\mathcal{M}}\,(T_\mu \tilde J)(x)$$
(there may be multiple policies attaining this minimum, defining multiple support hyperplanes). This construction also shows how the minimization
$$(T\tilde J)(x) = \min_{\mu\in\mathcal{M}}\,(T_\mu \tilde J)(x)$$
corresponds to a linearization of the mapping T at the point J̃.
Figure 1.3.4 Illustration of the Bellman operator T for states 1 and 2 in Example
1.3.2. The parameter values are g1 = 5, g2 = 3, r11 = 3, r12 = 15, r21 = 9,
r22 = 1, and the discount factor is α = 0.9. The optimal costs are J ∗ (1) = 49.7
and J ∗ (2) = 40.0, and the optimal policy is µ∗ (1) = 0.59 and µ∗ (2) = 0. The
figure also shows the one-dimensional slices of the operators at J(1) = 15 and
J(2) = 30, together with the corresponding 45-degree lines.
The VI algorithm
$$J_{k+1} = T J_k, \qquad k = 0, 1, \ldots,$$
can be interpreted geometrically in terms of the graph of T and the 45-degree line. The VI algorithm for computing the cost function Jµ of a policy µ is
$$J_{k+1} = T_\mu J_k, \qquad k = 0, 1, \ldots,$$
and it can be similarly interpreted, except that the graph of the function
Tµ J is linear. Also we will see shortly that there is a similarly compact
description for the policy iteration algorithm.
The one-step lookahead policy µ̃ obtained from J̃ satisfies
$$T_{\tilde\mu}\,\tilde J = T\tilde J,$$
as in Fig. 1.3.6. This equation implies that the graph of Tµ̃ J just touches the graph of T J at J̃, as shown in the figure. Moreover, for each state x ∈ X, the hyperplane
$$H_{\tilde\mu}(x) = \Big\{ \big(J(x), \xi\big)\ \Big|\ (T_{\tilde\mu} J)(x) \ge \xi \Big\}$$
touches the graph of the concave function (T J)(x) at the point (J̃(x), (T J̃)(x)) and defines a subgradient of (T J)(x) at J̃. Note that the one-step lookahead policy µ̃ need not be unique, since T need not be differentiable.
In conclusion, the equation
$$J = T_{\tilde\mu} J,$$
where µ̃ is a one-step lookahead policy obtained from J̃, i.e.,
$$T_{\tilde\mu}\tilde J = T\tilde J = \min_{\mu} T_\mu \tilde J,$$
is the linearization of the Bellman equation J = T J at the point J̃, and its solution, Jµ̃, can be viewed as the result of a Newton iteration at J̃. In summary, the Newton iterate at J̃ is Jµ̃, the solution of the linearized equation J = Tµ̃ J.†
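The Newton-step interpretation can be checked on a small made-up MDP: computing the one-step lookahead (greedy) policy µ̃ at J̃ and solving the linearized equation J = Tµ̃ J reproduces the cost Jµ̃, i.e., the Newton iterate at J̃. The following is a sketch, not the book's code, and the data are invented.

```python
import numpy as np

# Sketch (made-up MDP data): the Newton iterate at J_tilde is J_mu_tilde,
# the solution of the linearized equation J = T_mu_tilde J, where mu_tilde
# is the one-step lookahead (greedy) policy at J_tilde.
alpha = 0.9
p = np.array([[[0.7, 0.3], [0.2, 0.8]],    # p[u, x, y]
              [[0.4, 0.6], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],
              [1.5, 0.5]])                 # g[u, x]
n = 2

J_tilde = np.array([5.0, 7.0])

# One-step lookahead: mu_tilde(x) attains min_u [ g(x,u) + alpha sum_y p J ]
Q = g + alpha * p @ J_tilde                # Q[u, x]
mu = Q.argmin(axis=0)

# Newton step: solve the linearized (policy) equation J = g_mu + alpha P_mu J
P_mu = np.array([p[mu[x], x] for x in range(n)])
g_mu = np.array([g[mu[x], x] for x in range(n)])
J_newton = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
print("greedy policy:", mu, "  Newton iterate (= J_mu_tilde):", J_newton)
```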
We may also consider approximation in value space with ℓ-step lookahead using J̃. This is the same as approximation in value space with one-step lookahead using the (ℓ − 1)-fold operation of T on J̃, T^{ℓ−1}J̃. Thus it can be interpreted as a Newton step starting from T^{ℓ−1}J̃, the result of ℓ − 1 value iterations applied to J̃. This is illustrated in Fig. 1.3.7.†

† The classical Newton's method for solving a fixed point problem of the form y = T(y), where y is an n-dimensional vector, operates as follows: At the current iterate y_k, we linearize T and find the solution y_{k+1} of the corresponding linear fixed point problem. Assuming T is differentiable, the linearization is obtained as
$$y_{k+1} = T(y_k) + \frac{\partial T(y_k)}{\partial y}\,(y_{k+1} - y_k),$$
and the local convergence rate is quadratic,
$$\|y_{k+1} - y^*\| = O\big(\|y_k - y^*\|^2\big),$$
where ‖·‖ is the Euclidean norm, and holds assuming the Jacobian matrix exists and is Lipschitz continuous (see [Ber16], Section 1.4). There are extensions of Newton's method that are based on solving a linearized system at the current iterate, but relax the differentiability requirement to piecewise differentiability, and/or component concavity, while maintaining the superlinear convergence property of the method.
The structure of the Bellman operators (1.28) and (1.27), with their mono-
tonicity and concavity properties, tends to enhance the convergence and rate of
convergence properties of Newton’s method, even in the absence of differentiabil-
ity, as evidenced by the convergence analysis of PI, and the extensive favorable
experience with rollout, PI, and MPC. In this connection, it is worth noting that
in the case of Markov games, where the concavity property does not hold, the
PI method may oscillate, as shown by Pollatschek and Avi-Itzhak [PoA69], and
needs to be modified to restore its global convergence; see the author’s paper
[Ber21c]. We will discuss abstract versions of game and minimax contexts in Chapter 5.
† There are variants of Newton's method that involve combinations of first order iterative methods, such as the Gauss-Seidel and Jacobi algorithms, and Newton's method; they belong to the general family of Newton-SOR methods (SOR stands for "successive over-relaxation"); see the classic book by Ortega and Rheinboldt [OrR70] (Section 13.4).
(a) Policy evaluation, which computes the cost function Jµ . One possi-
bility is to solve the corresponding Bellman equation
$$J_\mu(x) = E\Big\{ g\big(x, \mu(x), w\big) + \alpha J_\mu\big(f(x, \mu(x), w)\big)\Big\}, \qquad \text{for all } x. \tag{1.34}$$
However, the value Jµ (x) for any x can also be computed by Monte
Carlo simulation, by averaging over many randomly generated tra-
jectories the cost of the policy starting from x. Other possibilities
include the use of specialized simulation-based methods, based on
the projected and aggregation Bellman equations discussed in Sec-
tion 1.2.4, for which there is extensive literature (see e.g., the books
[BeT96], [SuB98], [Ber12a], [Ber19b]).
(b) Policy improvement , which computes the rollout policy µ̃ using the
one-step lookahead minimization
$$\tilde\mu(x) \in \arg\min_{u\in U(x)} E\Big\{ g(x, u, w) + \alpha J_\mu\big(f(x, u, w)\big)\Big\}, \qquad \text{for all } x. \tag{1.35}$$
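A minimal sketch of one policy evaluation/policy improvement cycle is given below for a made-up finite MDP; the evaluation step solves the linear Bellman equation (1.34) exactly (Monte Carlo simulation could be substituted, as noted above), and the improvement step implements the minimization (1.35). This is an illustrative sketch, not the book's code.

```python
import numpy as np

# Sketch of rollout/policy iteration for a made-up finite MDP:
# (a) policy evaluation by solving J_mu = g_mu + alpha P_mu J_mu exactly,
# (b) policy improvement via the one-step lookahead minimization (1.35).
alpha = 0.95
p = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.1, 0.9], [0.8, 0.2]]])   # p[u, x, y]
g = np.array([[2.0, 0.5],
              [1.0, 3.0]])                 # g[u, x]
n = 2
mu = np.array([0, 0])                      # base policy

for it in range(10):
    P_mu = np.array([p[mu[x], x] for x in range(n)])
    g_mu = np.array([g[mu[x], x] for x in range(n)])
    J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)   # policy evaluation
    mu_new = (g + alpha * p @ J_mu).argmin(axis=0)           # policy improvement
    if np.array_equal(mu_new, mu):
        break                                                # PI has converged
    mu = mu_new

print("final policy:", mu, "  cost:", J_mu)
```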
Let us now consider the general case where the mapping Tµ is not assumed
linear for all stationary policies µ ∈ M. In this case we still have the
alternative description of T
$$(TJ)(x) = \min_{\mu\in\mathcal{M}}\,(T_\mu J)(x), \qquad \text{for all } x,$$
but T need not be concave, i.e., for some x ∈ X, the function (T J)(x) may
not be concave as a function of J. We illustrate this fact in Fig. 1.3.9.
The nonlinearity of the mapping Tµ can have profound consequences
on the validity of the PI algorithm and its interpretation in terms of New-
ton’s method. A prominent case where this is so arises in minimax problems
and related two-person zero sum game settings (cf. Example 1.2.5). We will
discuss this case in Chapter 5, where we will introduce modifications to the
PI algorithm that restore its convergence property.
We note, however, that it is possible that the mappings Tµ are non-
linear and convex, but that T has concave and differentiable components
(T J)(x), in which case the Newton step interpretation applies. This occurs
in particular in the important case of zero-sum dynamic games involving a
linear system and a quadratic cost function.
The examples in the preceding sections demonstrate that while the mono-
tonicity assumption is satisfied for most DP models, the contraction as-
sumption may or may not hold. In particular, the contraction assumption
† This was part of Kleinman’s Ph.D. thesis [Kle67] at M.I.T., supervised by
M. Athans. Kleinman gives credit for the one-dimensional version of his results to
Bellman and Kalaba [BeK65]. Note also that the first proposal of the PI method
was given by Bellman in his classic book [Bel57], under the name “approximation
in policy space.”
not with others, and policy iteration may fail altogether. Some of
these issues may be mitigated when additional structure is present,
which we discuss in Sections 4.4-4.6, focusing on noncontractive mod-
els that also have some semicontractive structure, and corresponding
favorable properties.
Examples of DP problems from each of the model categories above,
primarily special cases of the specific DP models discussed in Section 1.2,
are scattered throughout the book. They serve both to illustrate the theory
and its exceptions, and to highlight the beneficial role of additional special
structure.
We finally note some other types of models where there are restric-
tions to the set of policies, i.e., M may be a strict subset of the set of
functions µ : X #→ U with µ(x) ∈ U (x) for all x ∈ X. Such restrictions
may include measurability (needed to establish a mathematically rigorous
probabilistic framework) or special structure that enhances the characteri-
zation of optimal policies and facilitates their computation. These models
were treated in Chapter 5 of the first edition of this book, and also in
Chapter 6 of [BeS78].†
Algorithms
† Chapter 5 of the first edition is accessible from the author’s web site and
the book’s web page, and uses terminology and notation that are consistent with
the present edition.
EXERCISES
This exercise shows how, starting with an abstract mapping, we can obtain multistep mappings with the same fixed points and a stronger contraction modulus. Consider a set of mappings Tµ : B(X) → B(X), µ ∈ M, satisfying the contraction Assumption 1.2.2, let m be a positive integer, and let M^m be the set of m-tuples ν = (µ0, . . . , µm−1), where µk ∈ M, k = 0, 1, . . . , m − 1. For each ν = (µ0, . . . , µm−1) ∈ M^m, define the mapping T_ν by
$$T_\nu J = T_{\mu_0}\cdots T_{\mu_{m-1}} J, \qquad J \in B(X).$$
Show that
$$\|T_\nu J - T_\nu J'\| \le \alpha^m\, \|J - J'\|, \qquad \forall\ J, J' \in B(X), \tag{1.39}$$
and
$$\|\bar TJ - \bar TJ'\| \le \alpha^m\, \|J - J'\|, \qquad \forall\ J, J' \in B(X), \tag{1.40}$$
where T̄ is defined by
$$(\bar TJ)(x) = \inf_{\nu\in\mathcal{M}^m}\,(T_\nu J)(x), \qquad x \in X.$$
and by taking the infimum of both sides over (Tµ0 ··· Tµm−1) ∈ M^m and dividing by v(x), we obtain
$$\frac{(\bar TJ)(x) - (\bar TJ')(x)}{v(x)} \le \alpha^m\, \|J - J'\|, \qquad \forall\ x \in X.$$
Similarly
$$\frac{(\bar TJ')(x) - (\bar TJ)(x)}{v(x)} \le \alpha^m\, \|J - J'\|, \qquad \forall\ x \in X,$$
and by combining the last two relations and taking the supremum over x ∈ X, Eq. (1.40) follows.
The purpose of this exercise is to establish a close connection between the mappings underlying temporal difference and proximal methods (cf. Section 1.2.5).
Consider a linear mapping of the form
T J = AJ + b,
cf. Eq. (1.23) [equivalently, for a given J, P^{(c)}J is the unique vector Y ∈ ℜ^n that solves the equation
$$Y - TY = \frac{1}{c}\,(J - Y),$$
cf. Fig. 1.5.1].
(a) Show that P^{(c)} is given by
$$P^{(c)} = (1-\lambda)\sum_{\ell=0}^{\infty} \lambda^{\ell}\, T^{\ell},$$
Figure 1.5.1. Illustration of the iterates T^{(λ)}J and P^{(c)}J for finding the fixed point J* of a linear mapping T. Given J, we find the proximal iterate Ĵ = P^{(c)}J and then add the amount (1/c)(Ĵ − J) to obtain T^{(λ)}J = T P^{(c)}J. If T is a contraction mapping, T^{(λ)}J is closer to J* than P^{(c)}J.
respectively.
(d) Show that the fixed point property of part (c) yields the following formula for the multistep mapping T^{(λ)}:
$$\theta_i = (1-\lambda)\sum_{\ell=0}^{\infty} \lambda^{\ell}\,\zeta_i^{\ell+1} = \frac{\zeta_i(1-\lambda)}{1-\zeta_i\lambda}, \qquad i = 1, \ldots, n, \tag{1.44}$$
$$\Big(\frac{c+1}{c}\,I - A\Big)^{-1} = \Big(\frac{1}{\lambda}\,I - A\Big)^{-1} = \lambda\,(I - \lambda A)^{-1} = \lambda\sum_{\ell=0}^{\infty}(\lambda A)^{\ell}.$$
Thus, using the equation $\frac{1}{c} = \frac{1-\lambda}{\lambda}$,
$$\begin{aligned} P^{(c)} J &= \Big(\frac{c+1}{c}\,I - A\Big)^{-1}\Big(b + \frac{1}{c}\,J\Big) \\ &= \lambda\sum_{\ell=0}^{\infty}(\lambda A)^{\ell}\Big(b + \frac{1-\lambda}{\lambda}\,J\Big) \\ &= (1-\lambda)\sum_{\ell=0}^{\infty}(\lambda A)^{\ell} J + \lambda\sum_{\ell=0}^{\infty}(\lambda A)^{\ell} b, \end{aligned}$$
which is equal to A^{(λ)}J + b^{(λ)}. The formula $P^{(c)} = (1-\lambda)\sum_{\ell=0}^{\infty}\lambda^{\ell}T^{\ell}$ follows from this expression.
(c) To show that T^{(λ)}J is the fixed point of W_J, we must verify that
$$T^{(\lambda)} J = W_J\big(T^{(\lambda)} J\big),$$
or equivalently that
$$T^{(\lambda)} J = (1-\lambda)\,TJ + \lambda\, T\big(T^{(\lambda)} J\big) = (1-\lambda)\,TJ + \lambda\, T^{(\lambda)}(TJ).$$
where w_ℓ(x) are nonnegative scalars such that for all x ∈ X,
$$\sum_{\ell=1}^{\infty} w_\ell(x) = 1.$$
Show that
$$\frac{\big|\big(T_\mu^{(w)} J\big)(x) - \big(T_\mu^{(w)} J'\big)(x)\big|}{v(x)} \le \sum_{\ell=1}^{\infty} w_\ell(x)\,\alpha^{\ell}\,\|J - J'\|, \qquad \forall\ x \in X,$$
so that T_µ^{(w)} is a contraction with modulus
$$\bar\alpha = \sup_{x\in X}\ \sum_{\ell=1}^{\infty} w_\ell(x)\,\alpha^{\ell} \le \alpha < 1.$$
Moreover, for all µ ∈ M, the mappings T_µ and T_µ^{(w)} have the same fixed point.
showing the contraction property of T_µ^{(w)}.
Let J_µ be the fixed point of T_µ. By using the relation (T_µ^ℓ J_µ)(x) = J_µ(x), we have for all x ∈ X,
$$\big(T_\mu^{(w)} J_\mu\big)(x) = \sum_{\ell=1}^{\infty} w_\ell(x)\,\big(T_\mu^{\ell} J_\mu\big)(x) = \sum_{\ell=1}^{\infty} w_\ell(x)\, J_\mu(x) = J_\mu(x),$$
so J_µ is the fixed point of T_µ^{(w)} [which is unique since T_µ^{(w)} is a contraction].
2 Contractive Models
In this chapter we consider the abstract DP model of Section 1.2 under the
most favorable assumptions: monotonicity and weighted sup-norm contrac-
tion. Important special cases of this model are the discounted problems
with bounded cost per stage (Examples 1.2.1-1.2.5), the stochastic shortest
path problem of Example 1.2.6 in the case where all policies are proper, as
well as other problems involving special structures.
We first provide some basic analytical results and then focus on two
types of algorithms: value iteration and policy iteration. In addition to
exact forms of these algorithms, we discuss combinations and approximate
versions, as well as asynchronous distributed versions.
In this section we recall the abstract DP model of Section 1.2, and derive
some of its basic properties under the monotonicity and contraction as-
sumptions of Section 1.3. We consider a set X of states and a set U of
controls, and for each x ∈ X, a nonempty control constraint set U (x) ⊂ U .
We denote by M the set of all functions µ : X #→ U with µ(x) ∈ U (x)
for all x ∈ X, which we refer to as policies (or “stationary policies,” when
we want to emphasize the distinction from nonstationary policies, to be
discussed later).
We denote by R(X) the set of real-valued functions J : X → ℜ. We have a mapping H : X × U × R(X) → ℜ and for each policy µ ∈ M, we consider the mapping Tµ : R(X) → R(X) defined by
$$(T_\mu J)(x) = H\big(x, \mu(x), J\big), \qquad \forall\ x \in X.$$
We also consider the mapping T defined by
$$(TJ)(x) = \inf_{u\in U(x)} H(x, u, J) = \inf_{\mu\in\mathcal{M}}\,(T_\mu J)(x), \qquad \forall\ x \in X.$$
[We will use frequently the second equality above, which holds because M
can be viewed as the Cartesian product Πx∈X U (x).] We want to find a
function J* ∈ R(X) such that
$$J^*(x) = \inf_{u\in U(x)} H(x, u, J^*), \qquad \forall\ x \in X,$$
i.e., to find a fixed point of T within R(X). We also want to obtain a policy
µ∗ ∈ M such that Tµ∗ J * = T J * .
Let us restate for convenience the contraction and monotonicity as-
sumptions of Section 1.2.2.
$$J \le J' \quad \Rightarrow \quad T^k J \le T^k J', \quad T_\mu^k J \le T_\mu^k J', \qquad \forall\ \mu \in \mathcal{M},$$
$$\|J^* - J\| \le \frac{1}{1-\alpha}\,\|TJ - J\|, \qquad \|J^* - TJ\| \le \frac{\alpha}{1-\alpha}\,\|TJ - J\|.$$
$$\|J_\mu - J\| \le \frac{1}{1-\alpha}\,\|T_\mu J - J\|, \qquad \|J_\mu - T_\mu J\| \le \frac{\alpha}{1-\alpha}\,\|T_\mu J - J\|.$$
Furthermore, for every ε > 0, there exists µ_ε ∈ M such that
$$J_{\mu_\varepsilon} \le J^* + \varepsilon\, v.$$
Proof: We note that the right-hand side of Eq. (2.1) holds by Prop.
2.1.1(e) (see the remark following its proof). Thus inf µ∈M Jµ (x) ≤ J * (x)
for all x ∈ X. To show the reverse inequality as well as the left-hand side
of Eq. (2.1), we note that for all µ ∈ M, we have T J * ≤ Tµ J * , and since
J * = T J * , it follows that J * ≤ Tµ J * . By applying repeatedly Tµ to both
sides of this inequality and by using the monotonicity Assumption 2.1.1,
we obtain J * ≤ Tµk J * for all k > 0. Taking the limit as k → ∞, we see
that J * ≤ Jµ for all µ ∈ M, so that J * (x) ≤ inf µ∈M Jµ (x) for all x ∈ X.
Q.E.D.
Note that without monotonicity, we may have inf µ∈M Jµ (x) < J * (x)
for some x. This is illustrated by the following example.
Example 2.1.1 (Counterexample Without Monotonicity)
with J̄ being some function in B(X), where Tµ0 ··· Tµk J denotes the composition of the mappings Tµ0, . . . , Tµk applied to J, i.e.,
$$T_{\mu_0}\cdots T_{\mu_k} J = T_{\mu_0}\Big(T_{\mu_1}\big(\cdots\big(T_{\mu_{k-1}}(T_{\mu_k} J)\big)\cdots\big)\Big).$$
and the last equality holds by Prop. 2.1.1(b)]. Combining the preceding
relations, we obtain J * (x) = inf π∈Π Jπ (x).
Thus, in DP terms, we may view J * as an optimal cost function over
all policies, including nonstationary ones. At the same time, Prop. 2.1.2
states that stationary policies are sufficient in the sense that the optimal
cost can be attained to within arbitrary accuracy with a stationary policy
[uniformly for all x ∈ X, as Eq. (2.1) shows].
$$J \le J' + c\, v \quad \Rightarrow \quad T_\mu J \le T_\mu J' + \alpha c\, v, \tag{2.2}$$
$$\frac{H(x, u, J + c\, v) - H(x, u, J)}{v(x)} \le \alpha\,\|J + c\, v - J\| = \alpha c.$$
The condition (2.3) implies the desired condition (2.2). Conversely, condition (2.2) for c = 0 yields the monotonicity assumption, while for c = ‖J′ − J‖ it yields the contraction assumption. Q.E.D.
$$TJ \le J + c\, v \quad \Rightarrow \quad J^* \le T^k J + \frac{\alpha^k c}{1-\alpha}\, v,$$
$$J \le TJ + c\, v \quad \Rightarrow \quad T^k J \le J^* + \frac{\alpha^k c}{1-\alpha}\, v.$$
Proof: (a) We show the first relation. Applying Eq. (2.2) with J′ and J replaced by J and TJ, respectively, and taking the infimum over µ ∈ M, we see that if TJ ≤ J + c v, then T²J ≤ TJ + αc v. Proceeding similarly, it follows that
$$T^{\ell} J \le T^{\ell-1} J + \alpha^{\ell-1} c\, v.$$
We now write for every k,
$$T^k J - J = \sum_{\ell=1}^{k}\big(T^{\ell} J - T^{\ell-1} J\big) \le \sum_{\ell=1}^{k} \alpha^{\ell-1} c\, v,$$
TJ ≤ J + cv
$$T^{k+1} J \le T^k J + \alpha^k c\, v.$$
$$\|J_\mu - T\tilde J\| \le \frac{\alpha}{1-\alpha}\,\|T\tilde J - \tilde J\|, \tag{2.5}$$
$$\|J_\mu - J^*\| \le \frac{2\alpha}{1-\alpha}\,\|\tilde J - J^*\|, \tag{2.6}$$
and
$$\|J_\mu - J^*\| \le \frac{2}{1-\alpha}\,\|T\tilde J - \tilde J\|. \tag{2.7}$$
Proof: Equation (2.5) follows from the second relation of Prop. 2.1.1(e) with J = J̃. Also from the first relation of Prop. 2.1.1(e) with J = J*, we have
$$\|J_\mu - J^*\| \le \frac{1}{1-\alpha}\,\|T_\mu J^* - J^*\|.$$
where the second inequality follows from Eqs. (2.5) and (2.8). This proves
Eq. (2.7). Q.E.D.
The bound of Eq. (2.6) is tight, in the sense that the factor 2α/(1 − α) cannot be improved, as the following example shows.
Example 2.2.1
Consider a discounted optimal control problem with two states, 1 and 2, and deterministic transitions. State 2 is absorbing, but at state 1 there are two possible decisions: move to state 2 (policy µ*) or stay at state 1 (policy µ). The cost of each transition is 0 except for the transition from 1 to itself under policy µ, which has cost 2αε, where ε is a positive scalar and α ∈ [0, 1) is the discount factor. The optimal policy µ* is to move from state 1 to state 2, and the optimal cost-to-go function is J*(1) = J*(2) = 0. Consider the vector J̃ with J̃(1) = −ε and J̃(2) = ε, so that
$$\|\tilde J - J^*\| = \varepsilon,$$
as assumed in Eq. (2.6) (cf. Prop. 2.2.1). The policy µ that decides to stay at state 1 is a one-step lookahead policy based on J̃, because
$$2\alpha\varepsilon + \alpha\tilde J(1) = \alpha\varepsilon = 0 + \alpha\tilde J(2).$$
We have
$$J_\mu(1) = \frac{2\alpha\varepsilon}{1-\alpha} = \frac{2\alpha}{1-\alpha}\,\|\tilde J - J^*\|,$$
so the bound of Eq. (2.6) holds with equality.
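The tightness asserted in Example 2.2.1 is easy to check numerically; the short sketch below (with made-up values of α and ε, not from the book) verifies that staying at state 1 is a valid one-step lookahead choice and that Jµ(1) equals the right-hand side of Eq. (2.6).

```python
# Numeric check of Example 2.2.1: the one-step lookahead policy mu that stays
# at state 1 achieves J_mu(1) = 2*alpha*eps/(1-alpha), so bound (2.6) is tight.
alpha, eps = 0.9, 0.1
J_tilde = {1: -eps, 2: eps}

# One-step lookahead at state 1: stay (2*alpha*eps + alpha*J(1)) vs. move (alpha*J(2))
stay = 2 * alpha * eps + alpha * J_tilde[1]
move = 0.0 + alpha * J_tilde[2]
assert stay <= move                          # staying attains the minimum (both equal alpha*eps)

J_mu_1 = 2 * alpha * eps / (1 - alpha)       # cost of staying forever at state 1
bound = 2 * alpha / (1 - alpha) * eps        # right-hand side of Eq. (2.6), ||J_tilde - J*|| = eps
print(abs(J_mu_1 - bound) < 1e-12)           # expected: True
```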
where δ and ε are nonnegative scalars. These scalars are usually unknown,
so the resulting analysis will have a mostly qualitative character.
The case δ > 0 arises when the state space is either infinite or it is
finite but very large. Then instead of calculating (T J)(x) for all states x,
one may do so only for some states and estimate (T J)(x) for the remain-
ing states x by some form of interpolation. Alternatively, one may use
simulation data [e.g., noisy values of (T J)(x) for some or all x] and some
kind of least-squares error fit of (T J)(x) with a function from a suitable
parametric class. The function J̃ thus obtained will satisfy ‖J̃ − TJ‖ ≤ δ
with δ > 0. Note that δ may not be small in this context, and the resulting
performance degradation may be a primary concern.
Cases where ε > 0 may arise when the control space is infinite or finite but large, and the minimization involved in the calculation of (T J)(x) cannot be done exactly. Note, however, that it is possible that
$$\delta > 0, \qquad \varepsilon = 0,$$
$$\|T_\mu J - TJ\| \le \varepsilon$$
for some ε > 0, and then we may set J̃ = Tµ J. In this case we have ε = δ in Eq. (2.9).
In a multistep method with approximations, we are given a posi-
tive integer m and a lookahead function Jm , and we successively compute
(backwards in time) Jm−1 , . . . , J0 and policies µm−1 , . . . , µ0 satisfying
Proof: Using the triangle inequality, Eq. (2.10), and the contraction prop-
erty of T , we have for all k
$$\|J_{m-k} - T^k J_m\| \le \frac{\delta(1-\alpha^k)}{1-\alpha}, \qquad k = 1, \ldots, m. \tag{2.13}$$
From Eq. (2.10), we have ‖J_k − T_{µ_k} J_{k+1}‖ ≤ δ + ε, so for all k
showing that
$$\|J_{m-k} - T_{\mu_{m-k}}\cdots T_{\mu_{m-1}} J_m\| \le \frac{(\delta+\varepsilon)(1-\alpha^k)}{1-\alpha}, \qquad k = 1, \ldots, m. \tag{2.14}$$
Using the fact ‖T_{µ_0} J_1 − TJ_1‖ ≤ ε [cf. Eq. (2.10)], we obtain
where the last inequality follows from Eqs. (2.13) and (2.14) for k = m − 1.
From this relation and the fact that Tµ0 · · · Tµm−1 and T m are con-
tractions with modulus αm , we obtain
We also have using Prop. 2.1.1(e), applied in the context of the multistep
mapping of Example 1.3.1,
$$\|J_\pi - J^*\| \le \frac{1}{1-\alpha^m}\,\|T_{\mu_0}\cdots T_{\mu_{m-1}} J^* - J^*\|.$$
Combining the last two relations, we obtain the desired result. Q.E.D.
Note that for m = 1 and δ = ε = 0, i.e., the case of a one-step lookahead policy µ with lookahead function J1 and no approximation error in the minimization involved in TJ1, Eq. (2.11) yields the bound
$$\|J_\mu - J^*\| \le \frac{2\alpha}{1-\alpha}\,\|J_1 - J^*\|,$$
In this section, we discuss value iteration (VI for short), the algorithm that starts with some J ∈ B(X), and generates TJ, T²J, . . .. Since T is a weighted sup-norm contraction under Assumption 2.1.2, the algorithm converges to J*, and the rate of convergence is governed by
$$\|T^k J - J^*\| \le \alpha^k\,\|J - J^*\|, \qquad k = 0, 1, \ldots.$$
Similarly, for a given policy µ ∈ M, the sequence {T_µ^k J} converges to J_µ at the rate
$$\|T_\mu^k J - J_\mu\| \le \alpha^k\,\|J - J_\mu\|, \qquad k = 0, 1, \ldots.$$
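The geometric convergence rate of VI can be illustrated on a small made-up MDP; the sketch below (not from the book) checks the bound ‖T^kJ − J*‖ ≤ α^k‖J − J*‖ along the iterations, using a numerically computed fixed point.

```python
import numpy as np

# Sketch (made-up MDP) checking the VI convergence rate bound
# ||T^k J - J*|| <= alpha^k ||J - J*|| in the sup-norm.
alpha = 0.8
p = np.array([[[0.6, 0.4], [0.3, 0.7]],
              [[0.9, 0.1], [0.5, 0.5]]])    # p[u, x, y]
g = np.array([[1.0, 2.0],
              [0.0, 4.0]])                  # g[u, x]

def T(J):
    return (g + alpha * p @ J).min(axis=0)

J_star = np.zeros(2)
for _ in range(2000):                        # accurate fixed point of T
    J_star = T(J_star)

J = np.array([10.0, -5.0])
err0 = np.max(np.abs(J - J_star))
for k in range(1, 20):
    J = T(J)
    assert np.max(np.abs(J - J_star)) <= alpha**k * err0 + 1e-8
print("VI error decreases at least geometrically with rate alpha")
```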
$$\|\tilde J - J^*\| \le \gamma, \qquad T_\mu \tilde J = T\tilde J,$$
then
$$\|J_\mu - J^*\| \le \frac{2\alpha\,\gamma}{1-\alpha}. \tag{2.15}$$
$$\|J_{k+1} - TJ_k\| \le \delta, \qquad k = 0, 1, \ldots. \tag{2.17}$$
$$\limsup_{k\to\infty}\,\|J_k - J^*\| \le \frac{\delta}{1-\alpha}, \tag{2.19}$$
$$\limsup_{k\to\infty}\,\|J_{\mu^k} - J^*\| \le \frac{\varepsilon}{1-\alpha} + \frac{2\alpha\delta}{(1-\alpha)^2}. \tag{2.20}$$
Proof: Using the triangle inequality, Eq. (2.17), and the contraction prop-
erty of T , we have
and finally
(1 − αk )δ
*Jk − T k J0 * ≤ , k = 0, 1, . . . . (2.21)
1−α
By taking limit as k → ∞ and by using the fact limk→∞ T k J0 = J * , we
obtain Eq. (2.19).
We also have using the triangle inequality and the contraction prop-
erty of Tµk and T ,
Since for a zero cost per stage and the given deterministic transitions we have TJ_k = (2αr_k, 2αr_k), the preceding minimization is written as
$$r_{k+1} \in \arg\min_{r}\Big[\, \xi_1\,(r - 2\alpha r_k)^2 + \xi_2\,(2r - 2\alpha r_k)^2 \,\Big],$$
$$\lim_{k\to\infty}\,\|J_{\mu^k} - J^*\| = 0,$$
Proof: We have
In the case where the set of policies is infinite, we may assert the
convergence of the sequence of generated policies under some compactness
and continuity conditions. In particular, we will assume that the state
space is finite, X = {1, . . . , n}, and that each control constraint set U (x)
is a compact subset of %m . We will view a cost function J as an element
of %n , and a policy µ as an element of the set U (1) × · · · × U (n) ⊂ %mn ,
which is compact. Then {µk } has at least one limit point µ, which must
be an admissible policy. The following proposition guarantees, under an
additional continuity assumption for H(x, ·, ·), that every limit point µ is
optimal.
we have
\[
H\bigl(x, \mu^k(x), J_{\mu^{k-1}}\bigr) \le H(x, u, J_{\mu^{k-1}}), \qquad x = 1, \ldots, n, \ u \in U(x).
\]
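For a finite-state, finite-control Markovian decision problem, where H(x, u, J) = g(x, u) + α Σ_y p(y | x, u) J(y), exact PI can be sketched in a few lines of Python. The sketch below (not from the original text) uses made-up data and is meant only to show the alternation of exact policy evaluation and policy improvement; with finitely many policies it terminates with an optimal policy.

```python
import numpy as np

# Exact PI for a made-up 3-state, 2-control discounted MDP with
# H(x, u, J) = g(x, u) + alpha * sum_y p(y | x, u) J(y).
alpha = 0.9
g = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.3]])
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],
    [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]],
])
n = g.shape[0]

def H(J):
    """Array with entries H(x, u, J) for all states x and controls u."""
    return g + alpha * np.einsum('uxy,y->xu', P, J)

def evaluate(mu):
    """Exact policy evaluation: solve J_mu = T_mu J_mu as a linear system."""
    P_mu = np.array([P[mu[x], x, :] for x in range(n)])
    g_mu = np.array([g[x, mu[x]] for x in range(n)])
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

mu = np.zeros(n, dtype=int)                  # arbitrary starting policy
for k in range(20):
    J_mu = evaluate(mu)                      # policy evaluation
    mu_new = H(J_mu).argmin(axis=1)          # policy improvement: T_{mu_new} J_mu = T J_mu
    if np.array_equal(mu_new, mu):           # no change: mu is optimal and J_mu = J*
        print(f"PI terminated after {k} iterations: policy {mu}, J* = {J_mu}")
        break
    mu = mu_new
```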
We now consider the PI method where the policy evaluation step and/or
the policy improvement step of the method are implemented through ap-
proximations. This method generates a sequence of policies {µk} and a
corresponding sequence of approximate cost functions {Jk} satisfying
where δ and ε are some scalars, and ‖·‖ denotes the weighted sup-norm (the
one used in the contraction Assumption 2.1.2). The following proposition
provides an error bound for this algorithm.
\[
\| J - J_\mu \| \le \delta, \qquad \| T_{\bar{\mu}} J - T J \| \le \varepsilon,
\]
\[
T_{\bar{\mu}} J_\mu \le J_\mu + (\varepsilon + 2 \alpha \delta)\, v,
\]
\[
J_{\bar{\mu}} = T_{\bar{\mu}} J_{\bar{\mu}} = T_{\bar{\mu}} J_\mu + (T_{\bar{\mu}} J_{\bar{\mu}} - T_{\bar{\mu}} J_\mu)
\le T_{\bar{\mu}} J_\mu + \frac{\alpha (\varepsilon + 2 \alpha \delta)}{1 - \alpha}\, v,
\]
where the inequality follows by using Prop. 2.1.3 and Eq. (2.27). Subtract-
ing J* from both sides, we have
\[
J_{\bar{\mu}} - J^* \le T_{\bar{\mu}} J_\mu - J^* + \frac{\alpha (\varepsilon + 2 \alpha \delta)}{1 - \alpha}\, v. \tag{2.28}
\]
Also by subtracting J* from both sides of Eq. (2.26), and using the
contraction property
\[
T J_\mu - J^* = T J_\mu - T J^* \le \alpha \| J_\mu - J^* \|\, v,
\]
we obtain
\[
\| J_{\mu^{k+1}} - J^* \| \le \alpha \| J_{\mu^k} - J^* \| + \frac{\varepsilon + 2 \alpha \delta}{1 - \alpha},
\]
which by taking the lim sup of both sides as k → ∞ yields the desired result.
Q.E.D.
We note that the error bound of Prop. 2.4.3 is tight, as can be shown
with an example from [BeT96], Section 6.2.3. The error bound is com-
parable to the one for approximate VI, derived earlier in Prop. 2.3.2. In
particular, the error ‖J_{µ^k} − J*‖ is asymptotically proportional to 1/(1 − α)^2
and to the approximation error in policy evaluation or value iteration,
respectively. This is noteworthy, as it indicates that contrary to the case of
exact implementation, approximate PI need not hold a convergence rate
advantage over approximate VI, despite its greater overhead per iteration.
Note that when δ = ε = 0, Eq. (2.25) yields
\[
\| J_{\mu^{k+1}} - J^* \| \le \alpha \| J_{\mu^k} - J^* \|.
\]
Thus in the case of an infinite state space and/or control space, exact
PI converges at a geometric rate under the contraction and monotonicity
assumptions of this section. This rate is the same as the rate of convergence
of exact VI. It follows that judging solely from the point of view of rate
of convergence estimates, exact PI holds an advantage over exact VI only
when the number of states is finite. This raises the question of what happens
when the number of states is finite but very large. However, this question
is not very interesting from a practical point of view, since for a very large
number of states, neither VI nor PI can be implemented in practice without
approximations (see the discussion of Section 1.2.4).
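As a numerical illustration of the asymptotic error bound discussed above (a sketch only, not from the original text), the following fragment runs approximate PI with exact policy improvement (so ε = 0) and policy evaluation corrupted by noise of sup-norm at most δ; the MDP data, the noise model, and v ≡ 1 are assumptions made for this example.

```python
import numpy as np

# Approximate PI on a made-up 3-state, 2-control discounted MDP.
rng = np.random.default_rng(1)
alpha, delta = 0.9, 0.1
g = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.3]])
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],
    [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]],
])
n = g.shape[0]

def H(J):
    return g + alpha * np.einsum('uxy,y->xu', P, J)

def exact_cost(mu):
    P_mu = np.array([P[mu[x], x, :] for x in range(n)])
    g_mu = np.array([g[x, mu[x]] for x in range(n)])
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

J_star = np.zeros(n)
for _ in range(2000):
    J_star = H(J_star).min(axis=1)

mu = np.zeros(n, dtype=int)
for k in range(50):
    J_k = exact_cost(mu) + rng.uniform(-delta, delta, size=n)   # ||J_k - J_{mu^k}|| <= delta
    mu = H(J_k).argmin(axis=1)                                  # exact improvement: eps = 0

err = np.max(np.abs(exact_cost(mu) - J_star))
bound = 2 * alpha * delta / (1 - alpha) ** 2        # (eps + 2*alpha*delta)/(1 - alpha)^2, eps = 0
print(f"||J_mu - J*|| = {err:.4f}   asymptotic bound = {bound:.4f}")
```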
cf. Eq. (2.23). Using Eq. (2.31) and the fact Jµ = Tµ Jµ, we have
\[
\begin{aligned}
\| T J_\mu - J_\mu \| &\le \| T J_\mu - T \tilde{J} \| + \| T \tilde{J} - T_\mu \tilde{J} \| + \| T_\mu \tilde{J} - J_\mu \| \\
&= \| T J_\mu - T \tilde{J} \| + \| T \tilde{J} - T_\mu \tilde{J} \| + \| T_\mu \tilde{J} - T_\mu J_\mu \| \\
&\le \alpha \| J_\mu - \tilde{J} \| + \varepsilon + \alpha \| \tilde{J} - J_\mu \| \\
&\le \varepsilon + 2 \alpha \delta.
\end{aligned}
\tag{2.32}
\]
Using Prop. 2.1.1(d) with J = Jµ, we obtain the error bound (2.30).
Q.E.D.
The preceding error bound can be extended to the case where two
successive policies generated by the approximate PI algorithm are “not too
different” rather than being identical. In particular, suppose that µ and µ̄
are successive policies, which in addition to
\[
\| \tilde{J} - J_\mu \| \le \delta, \qquad \| T_{\bar{\mu}} \tilde{J} - T \tilde{J} \| \le \varepsilon,
\]
also satisfy ‖Tµ J̃ − Tµ̄ J̃‖ ≤ ζ, where ζ is some scalar. Then
\[
\| T \tilde{J} - T_\mu \tilde{J} \| \le \| T \tilde{J} - T_{\bar{\mu}} \tilde{J} \| + \| T_{\bar{\mu}} \tilde{J} - T_\mu \tilde{J} \| \le \varepsilon + \zeta,
\]
and by replacing ε with ε + ζ and µ̄ with µ in Eq. (2.32), we obtain
\[
\| J_\mu - J^* \| \le \frac{\varepsilon + \zeta + 2 \alpha \delta}{1 - \alpha}.
\]
When ζ is small enough to be of the order of max{δ, ε}, this error bound
is comparable to the one for the case where policies converge.
where {mk } is a sequence of positive integers (see Fig. 2.5.1, which shows
one iteration of the method where mk = 3). There is no systematic guide-
line for selecting the integers mk . Usually their best values are chosen
empirically, and tend to be considerably larger than 1 (in the case where
mk ≡ 1 the optimistic PI method coincides with the VI method). The
convergence of this method is discussed in Section 2.5.1.
Variants of optimistic PI include methods with approximations in the
policy evaluation and policy improvement phases (Section 2.5.2), and meth-
ods where the number mk is randomized (Section 2.5.3). An interesting
advantage of the latter methods is that they do not require the monotonic-
ity Assumption 2.1.1 for convergence in problems with a finite number of
policies.
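The following Python sketch (not from the original text) shows one way to implement an optimistic PI iteration for a small made-up discounted MDP, with a constant mk ≡ 3 and a starting function chosen so that J0 ≥ T J0; the data and these particular choices are assumptions made only for the example.

```python
import numpy as np

# Optimistic PI: at iteration k, compute a greedy policy mu_k (one application of T),
# then apply T_{mu_k} to J_k a fixed number m_k of times.  Data is made up.
alpha, m_k = 0.9, 3
g = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.3]])
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],
    [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]],
])

def H(J):
    return g + alpha * np.einsum('uxy,y->xu', P, J)

def T_mu(J, mu):
    """One application of T_mu: (T_mu J)(x) = H(x, mu(x), J)."""
    return np.array([H(J)[x, mu[x]] for x in range(len(J))])

J_star = np.zeros(3)
for _ in range(2000):
    J_star = H(J_star).min(axis=1)

J = np.full(3, 50.0)                 # here J_0 >= T J_0, so the iterates approach J* from above
for k in range(30):
    mu = H(J).argmin(axis=1)         # policy improvement: T_mu J = T J
    for _ in range(m_k):             # m_k value iterations with the policy mu held fixed
        J = T_mu(J, mu)
print(f"||J_k - J*|| after 30 iterations: {np.max(np.abs(J - J_star)):.2e}")
```

Larger values of mk give a more accurate (but more costly) evaluation of the current policy before the next improvement; as noted above, good values are usually found empirically.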
A method that is conceptually similar to the optimistic PI method is
the λ-PI method defined by
\[
T_{\mu^k} J_k = T J_k, \qquad J_{k+1} = T_{\mu^k}^{(\lambda)} J_k, \qquad k = 0, 1, \ldots, \tag{2.34}
\]
[Figure 2.5.1: one iteration of optimistic PI, showing the policy improvement step Tµ0 J0 = T J0 = minµ Tµ J0 followed by approximate policy evaluation through repeated application of Tµ0, in place of exact policy evaluation.]
where J0 is an initial function in B(X), and for any policy µ and scalar
λ ∈ (0, 1), T_µ^{(λ)} is the multistep mapping defined by
\[
T_\mu^{(\lambda)} J = (1 - \lambda) \sum_{\ell = 0}^{\infty} \lambda^\ell\, T_\mu^{\ell + 1} J, \qquad J \in B(X),
\]
(cf. Section 1.2.5). To compare optimistic PI and λ-PI, note that they both
involve multiple applications of the VI mapping Tµk: a fixed number mk
in the former case, and a geometrically weighted number in the latter case.
In fact, we may view the λ-PI iterate T_{µ^k}^{(λ)} J_k as the expected value of the
optimistic PI iterate T_{µ^k}^{m_k} J_k when mk is chosen by a geometric probability
distribution with parameter λ.
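The next sketch (illustrative only, not from the original text; the policy data and the parameters are made up) computes T_µ^{(λ)} J for a fixed policy µ by truncating the series above, and checks numerically its interpretation as the expected value of T_µ^m J when m is chosen by a geometric distribution with P(m = ℓ + 1) = (1 − λ)λ^ℓ.

```python
import numpy as np

# For a fixed policy mu, T_mu is the affine mapping T_mu J = g_mu + alpha * P_mu J.
# All numerical data below is made up for illustration.
alpha, lam = 0.9, 0.7
g_mu = np.array([1.0, 0.5, 2.0])
P_mu = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]])

def T_mu(J, times=1):
    for _ in range(times):
        J = g_mu + alpha * P_mu @ J
    return J

def T_mu_lambda(J, terms=400):
    """(1 - lam) * sum_{l >= 0} lam^l T_mu^{l+1} J, truncated after `terms` terms."""
    total, J_l = np.zeros_like(J), J
    for l in range(terms):
        J_l = T_mu(J_l)                       # J_l = T_mu^{l+1} J
        total += (1 - lam) * lam**l * J_l
    return total

J = np.array([3.0, -1.0, 0.5])
direct = T_mu_lambda(J)

# Monte Carlo check: draw m with P(m = l+1) = (1 - lam) * lam^l and average T_mu^m J.
rng = np.random.default_rng(0)
samples = [T_mu(J, times=rng.geometric(1 - lam)) for _ in range(20000)]
print("T_mu^(lambda) J          :", np.round(direct, 4))
print("E[T_mu^m J], m geometric :", np.round(np.mean(samples, axis=0), 4))  # should roughly agree
```

The truncation level and the Monte Carlo sample size are arbitrary choices for the illustration; the two printed vectors agree up to truncation and sampling error.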
One of the reasons that make λ-PI interesting is its relation with
TD(λ) and other temporal difference methods on one hand, and the prox-
imal algorithm on the other. In particular, in λ-PI a policy evaluation is
performed with a single iteration of an extrapolated proximal algorithm;
cf. the discussion of Section 1.2.5 and Exercise 1.2. Thus implementation
of λ-PI can benefit from the rich methodology that has developed around
temporal difference and proximal methods.
Generally the optimistic and λ-PI methods have similar convergence
properties. In this section, we focus primarily on optimistic PI, and we
discuss briefly λ-PI in Section 2.5.3, where we will prove convergence for a
randomized version. For a convergence proof of λ-PI without randomiza-
tion in discounted stochastic optimal control and stochastic shortest path
problems, see the paper [BeI96] and the book [BeT96] (Section 2.3.1).
We will now focus on the optimistic PI algorithm (2.33). The following two
propositions provide its convergence properties.
\[
J \le J' \ \ \Rightarrow\ \ W J \le W J', \qquad \forall\ J, J' \in B(X),
\]
\[
\| W J - W J' \| \le \alpha \| J - J' \|, \qquad \forall\ J, J' \in B(X),
\]
for some α ∈ (0, 1).
(a) For all J, J' ∈ B(X) and scalar c ≥ 0, we have
\[
J \ge J' - c\, v \ \ \Rightarrow\ \ W J \ge W J' - \alpha c\, v. \tag{2.35}
\]
(b) For all J ∈ B(X), scalar c ≥ 0, and k = 0, 1, . . ., we have
\[
J \ge W J - c\, v \ \ \Rightarrow\ \ W^k J \ge J^* - \frac{\alpha^k}{1 - \alpha}\, c\, v, \tag{2.36}
\]
\[
W J \ge J - c\, v \ \ \Rightarrow\ \ J^* \ge W^k J - \frac{\alpha^k}{1 - \alpha}\, c\, v, \tag{2.37}
\]
where J* is the fixed point of W .
Proof: The proof of part (a) follows the one of Prop. 2.1.4(b), while the
proof of part (b) follows the one of Prop. 2.1.4(c). Q.E.D.
\[
J \ge T J - c\, v,
\]
and
\[
T_\mu^k J \ge T (T_\mu^k J) - \alpha^k c\, v. \tag{2.39}
\]
\[
J_1 \ge T J_1 - \alpha^{m_0} c\, v = T J_1 - \beta_1 c\, v.
\]
\[
J_k \ge T J_k - \beta_k c\, v.
\]
and
\[
J_{k+1} \ge T J_{k+1} - \alpha^{m_k} \beta_k c\, v = T J_{k+1} - \beta_{k+1} c\, v,
\]
respectively. This completes the induction. Q.E.D.
\[
\| J_0 - T J_0 \| \le c. \tag{2.43}
\]
\[
J_k + \frac{\alpha^k}{1 - \alpha}\, c\, v \ \ge\ J_k + \frac{\beta_k}{1 - \alpha}\, c\, v \ \ge\ J^* \ \ge\ J_k - \frac{(k+1) \alpha^k}{1 - \alpha}\, c\, v, \tag{2.44}
\]
Proof: Using the relation J0 ≥ T J0 − c v [cf. Eq. (2.43)] and Lemma 2.5.3,
we have
\[
J_k \ge T J_k - \beta_k c\, v, \qquad k = 0, 1, \ldots.
\]
Using this relation in Lemma 2.5.1(b) with W = T and k = 0, we obtain
\[
J_k \ge J^* - \frac{\beta_k}{1 - \alpha}\, c\, v,
\]
which together with the fact α^k ≥ β_k, shows the left-hand side of Eq.
(2.44).
Using the relation T J0 ≥ J0 − c v [cf. Eq. (2.43)] and Lemma 2.5.1(b)
with W = T , we have
\[
J^* \ge T^k J_0 - \frac{\alpha^k}{1 - \alpha}\, c\, v, \qquad k = 0, 1, \ldots. \tag{2.45}
\]
Using again the relation J0 ≥ T J0 − c v in conjunction with Lemma 2.5.3,
we also have
\[
T J_j \ge J_{j+1} - \frac{\alpha}{1 - \alpha}\, \beta_j c\, v, \qquad j = 0, \ldots, k - 1.
\]
Applying T^{k−j−1} to both sides of this inequality and using the monotonicity
and contraction properties of T^{k−j−1}, we obtain
\[
T^{k-j} J_j \ge T^{k-j-1} J_{j+1} - \frac{\alpha^{k-j}}{1 - \alpha}\, \beta_j c\, v, \qquad j = 0, \ldots, k - 1,
\]
Proof of Props. 2.5.1 and 2.5.2: Let c be a scalar satisfying Eq. (2.43).
Then the error bounds (2.44) show that lim_{k→∞} ‖Jk − J*‖ = 0, i.e., the first
part of Prop. 2.5.1. To show the second part (finite termination when the
number of policies is finite), let M̂ be the finite set of nonoptimal policies.
Then there exists ε > 0 such that ‖Tµ̂ J* − T J*‖ > ε for all µ̂ ∈ M̂, which
implies that ‖Tµ̂ Jk − T Jk‖ > ε for all µ̂ ∈ M̂ and k sufficiently large. This
implies that µ^k ∉ M̂ for all k sufficiently large. The proof of Prop. 2.5.2
follows using the compactness and continuity Assumption 2.4.1, and the
convergence argument of Prop. 2.4.2. Q.E.D.
Let us consider the convergence rate bounds of Lemma 2.5.4 for optimistic
PI, and write them in the form
\[
\| J_0 - T J_0 \| \le c \ \ \Rightarrow\ \ J_k - \frac{(k+1) \alpha^k}{1 - \alpha}\, c\, v \ \le\ J^* \ \le\ J_k + \frac{\alpha^{m_0 + \cdots + m_{k-1}}}{1 - \alpha}\, c\, v. \tag{2.47}
\]
We may contrast these bounds with the ones for VI, where
\[
\| J_0 - T J_0 \| \le c \ \ \Rightarrow\ \ T^k J_0 - \frac{\alpha^k}{1 - \alpha}\, c\, v \ \le\ J^* \ \le\ T^k J_0 + \frac{\alpha^k}{1 - \alpha}\, c\, v \tag{2.48}
\]
[cf. Prop. 2.1.4(c)].
In comparing the bounds (2.47) and (2.48), we should also take into
account the associated overhead for a single iteration of each method: op-
timistic PI requires at iteration k a single application of T and mk − 1
applications of Tµk (each being less time-consuming than an application of
T ), while VI requires a single application of T . It can then be seen that the
upper bound for optimistic PI is better than the one for VI (same bound
for less overhead), while the lower bound for optimistic PI is worse than the
one for VI (worse bound for more overhead). This suggests that the choice
of the initial condition J0 is important in optimistic PI, and in particular
it is preferable to have J0 ≥ T J0 (implying convergence to J * from above)
rather than J0 ≤ T J0 (implying convergence to J * from below). This is
consistent with the results of other works, which indicate that the conver-
gence properties of the method are fragile when the condition J0 ≥ T J0
does not hold (see [WiB93], [BeT96], [BeY10], [BeY12], [YuB13a]).
We will now derive error bounds for the case where the policy evaluation
and policy improvement operations are approximate, similar to the nonop-
timistic PI case of Section 2.4.1. In particular, we consider a method that
generates a sequence of policies {µk} and a corresponding sequence of ap-
proximate cost functions {Jk} satisfying
\[
\| J_k - T_{\mu^k}^{m_k} J_{k-1} \| \le \delta, \qquad \| T_{\mu^{k+1}} J_k - T J_k \| \le \varepsilon, \qquad k = 0, 1, \ldots, \tag{2.49}
\]
\[
M(y) = \sup_{x \in X} \frac{y(x)}{v(x)}.
\]
Then the condition (2.50) can be written for all J, J' ∈ B(X), and µ ∈ M
as
\[
M(T_\mu J - T_\mu J') \le \alpha M(J - J'), \tag{2.51}
\]
which can be proved by induction using Eq. (2.51). We have the following
proposition.
\[
\limsup_{k \to \infty} \| J_{\mu^k} - J^* \| \le \frac{\varepsilon + 2 \alpha \delta}{(1 - \alpha)^2}.
\]
Proof: Let us fix k ≥ 1, and for simplicity let us assume that mk ≡ m for
some m, and denote
\[
J = J_{k-1}, \qquad \bar{J} = J_k, \qquad \mu = \mu^k, \qquad \bar{\mu} = \mu^{k+1},
\]
\[
r = T_\mu J - J, \qquad \bar{r} = T_{\bar{\mu}} \bar{J} - \bar{J}.
\]
\[
\begin{aligned}
\bar{r} = T_{\bar{\mu}} \bar{J} - \bar{J}
&= (T_{\bar{\mu}} \bar{J} - T_\mu \bar{J}) + (T_\mu \bar{J} - \bar{J}) \\
&\le (T_{\bar{\mu}} \bar{J} - T \bar{J}) + \bigl( T_\mu \bar{J} - T_\mu (T_\mu^m J) \bigr) + (T_\mu^m J - \bar{J}) + \bigl( T_\mu^m (T_\mu J) - T_\mu^m J \bigr) \\
&\le \varepsilon v + \alpha M(\bar{J} - T_\mu^m J)\, v + \delta v + \alpha^m M(T_\mu J - J)\, v \\
&\le (\varepsilon + \delta) v + \alpha \delta v + \alpha^m M(r)\, v,
\end{aligned}
\]
where the first inequality follows from Tµ J̄ ≥ T J̄, and the second and third
inequalities follow from Eqs. (2.49) and (2.52). From this relation we have
\[
M(\bar{r}) \le \varepsilon + (1 + \alpha) \delta + \beta M(r),
\]
\[
\limsup_{k \to \infty} M(r) \le \frac{\varepsilon + (1 + \alpha) \delta}{1 - \beta}. \tag{2.54}
\]
\[
\begin{aligned}
s = J_\mu - T_\mu^m J
&= T_\mu^m J_\mu - T_\mu^m J \\
&\le \alpha^m M(J_\mu - J)\, v \\
&\le \frac{\alpha^m}{1 - \alpha} M(T_\mu J - J)\, v \\
&= \frac{\alpha^m}{1 - \alpha} M(r)\, v,
\end{aligned}
\]
where the first inequality follows from Eq. (2.52) and the second inequality
follows by using Prop. 2.1.4(b). Thus we have M(s) ≤ (α^m/(1 − α)) M(r), from
which by taking lim sup of both sides and using Eq. (2.54), we obtain
\[
\limsup_{k \to \infty} M(s) \le \frac{\beta \bigl( \varepsilon + (1 + \alpha) \delta \bigr)}{(1 - \alpha)(1 - \beta)}. \tag{2.55}
\]
\[
\begin{aligned}
T \bar{J} - T J^* &\le \alpha M(\bar{J} - J^*)\, v \\
&= \alpha M(\bar{J} - T_\mu^m J + T_\mu^m J - J^*)\, v \\
&\le \alpha M(\bar{J} - T_\mu^m J)\, v + \alpha M(T_\mu^m J - J^*)\, v \\
&\le \alpha \delta\, v + \alpha M(t)\, v.
\end{aligned}
\]
\[
\begin{aligned}
\bar{t} = T_{\bar{\mu}}^m \bar{J} - J^*
&= (T_{\bar{\mu}}^m \bar{J} - T_{\bar{\mu}}^{m-1} \bar{J}) + \cdots + (T_{\bar{\mu}}^2 \bar{J} - T_{\bar{\mu}} \bar{J}) + (T_{\bar{\mu}} \bar{J} - T \bar{J}) + (T \bar{J} - T J^*) \\
&\le (\alpha^{m-1} + \cdots + \alpha) M(T_{\bar{\mu}} \bar{J} - \bar{J})\, v + \varepsilon v + \alpha \delta v + \alpha M(t)\, v,
\end{aligned}
\]
so finally
\[
M(\bar{t}) \le \frac{\alpha - \alpha^m}{1 - \alpha} M(\bar{r}) + (\varepsilon + \alpha \delta) + \alpha M(t).
\]
By taking lim sup of both sides and using Eq. (2.54), it follows that
\[
\limsup_{k \to \infty} M(t) \le \frac{(\alpha - \beta)\bigl( \varepsilon + (1 + \alpha) \delta \bigr)}{(1 - \alpha)^2 (1 - \beta)} + \frac{\varepsilon + \alpha \delta}{1 - \alpha}. \tag{2.56}
\]
\[
p(1) > 0, \qquad \sum_{j=1}^{\infty} p(j) = 1.
\]
† Note that without monotonicity, J ∗ need not have any formal optimality
properties (cf. the discussion of Section 2.1 and Example 2.1.1).
The preceding proof illustrates the key idea of the randomized opti-
mistic PI algorithm, which is that for µ ∈ M*, the mappings T_µ^{m_k} have a
common fixed point that is equal to J*, the fixed point of T . Thus within
a distance ε from J*, the iterates (2.57) aim consistently at J*. Moreover,
because the probability of a VI (an iteration with mk = 1) is positive, the
algorithm is guaranteed to eventually come within ε of J* through a
sufficiently long sequence of contiguous VI iterations. For this we need the
sequence {Jk} to be bounded, which will be shown as part of the proof of
the following proposition.
Proposition 2.5.5: Let Assumption 2.5.2 hold. Then for any start-
ing point J0 ∈ F (X), a sequence {Jk } generated by the randomized
optimistic PI algorithm (2.57)-(2.58) belongs to F (X) and converges
to J * with probability one.
Proof: We will show that {Jk} is bounded by showing that for all k, we
have
\[
\max_{\mu \in M} \| J_k - J_\mu \| \le \rho^k \max_{\mu \in M} \| J_0 - J_\mu \| + \frac{1}{1 - \rho} \max_{\mu, \mu' \in M} \| J_\mu - J_{\mu'} \|, \tag{2.59}
\]
where ρ is a common contraction modulus of Tµ, µ ∈ M, and T . Indeed,
we have for all µ ∈ M
\[
\begin{aligned}
\| J_k - J_\mu \| &\le \| J_k - J_{\mu^{k-1}} \| + \| J_{\mu^{k-1}} - J_\mu \| \\
&= \| T_{\mu^{k-1}}^{m_{k-1}} J_{k-1} - J_{\mu^{k-1}} \| + \| J_{\mu^{k-1}} - J_\mu \| \\
&\le \rho^{m_{k-1}} \| J_{k-1} - J_{\mu^{k-1}} \| + \| J_{\mu^{k-1}} - J_\mu \| \\
&\le \rho^{m_{k-1}} \max_{\mu \in M} \| J_{k-1} - J_\mu \| + \max_{\mu, \mu' \in M} \| J_\mu - J_{\mu'} \|
\end{aligned}
\]
We use this fact to argue that with enough contiguous value iterations, i.e.,
iterations where mk = 1, Jk can be brought arbitrarily close to J*, and
once this happens, the algorithm operates like the ordinary VI algorithm.
Indeed, each time the iteration Jk+1 = T Jk is performed (i.e., when
mk = 1), the distance of the iterate Jk from J* is reduced by a factor ρ, i.e.,
‖Jk+1 − J*‖ ≤ ρ‖Jk − J*‖. Since {Jk} belongs to the bounded set D, and
our randomization scheme includes the condition p(1) > 0, the algorithm is
guaranteed (with probability one) to eventually execute a sufficient number
of contiguous iterations Jk+1 = T Jk to enter a sphere
\[
S_\varepsilon = \bigl\{ J \in F(X) \mid \| J - J^* \| < \varepsilon \bigr\}
\]
of small enough radius ε to guarantee that the generated policy µk belongs
to M*, as per Prop. 2.5.4. Once this happens, all subsequent iterations
reduce the distance ‖Jk − J*‖ by a factor ρ at every iteration, since
\[
\| T_\mu^m J - J^* \| \le \rho \| T_\mu^{m-1} J - J^* \| \le \rho \| J - J^* \|, \qquad \forall\ \mu \in M^*,\ m \ge 1,\ J \in S_\varepsilon.
\]
Thus once {Jk} enters S_ε, it stays within S_ε and converges to J*. Q.E.D.
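A minimal sketch (not from the original text) of the randomized optimistic PI idea just described: µk is greedy with respect to Jk, and the number of policy evaluation steps mk is drawn from a distribution with p(1) > 0. The MDP data and the geometric choice of the distribution are assumptions made only for this example.

```python
import numpy as np

# Randomized optimistic PI on a made-up 3-state, 2-control discounted MDP.
rng = np.random.default_rng(2)
alpha = 0.9
g = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.3]])
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],
    [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]],
])

def H(J):
    return g + alpha * np.einsum('uxy,y->xu', P, J)

J_star = np.zeros(3)
for _ in range(2000):
    J_star = H(J_star).min(axis=1)

J = np.array([30.0, -20.0, 5.0])
for k in range(60):
    mu = H(J).argmin(axis=1)                 # greedy policy: T_mu J = T J
    m_k = rng.geometric(0.5)                 # p(1) = 0.5 > 0, so pure VI steps occur infinitely often
    for _ in range(m_k):                     # m_k applications of T_mu
        J = np.array([H(J)[x, mu[x]] for x in range(3)])
print(f"||J_k - J*|| after 60 iterations: {np.max(np.abs(J - J_star)):.2e}")
```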
cf. Eq. (2.34), we consider a randomized version that involves a fixed prob-
ability p ∈ (0, 1). It has the form
\[
T_{\mu^k} J_k = T J_k, \qquad
J_{k+1} =
\begin{cases}
T J_k & \text{with probability } p, \\
T_{\mu^k}^{(\lambda)} J_k & \text{with probability } 1 - p.
\end{cases}
\tag{2.60}
\]
2.5.5, the sequence {Jk} generated by the randomized λ-PI algorithm (2.60)
belongs to F(X) and converges to J* with probability one. The reason is
that the contraction property of Tµ over F(X) with respect to the norm ‖·‖
implies that T_µ^{(λ)} is well-defined, and also implies that T_µ^{(λ)} is a contraction
over F(X).
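A corresponding sketch (not from the original text) of the randomized λ-PI iteration (2.60), with T_µ^{(λ)} computed by truncating its defining series; the MDP data and the values of p and λ are assumptions made only for this example.

```python
import numpy as np

# Randomized lambda-PI on a made-up 3-state, 2-control discounted MDP, cf. Eq. (2.60).
rng = np.random.default_rng(3)
alpha, lam, p = 0.9, 0.7, 0.3
g = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.3]])
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],
    [[0.3, 0.3, 0.4], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]],
])

def T(J):
    return (g + alpha * np.einsum('uxy,y->xu', P, J)).min(axis=1)

def T_mu_lambda(J, mu, terms=300):
    """(1 - lam) * sum_{l >= 0} lam^l T_mu^{l+1} J, truncated after `terms` terms."""
    P_mu = np.array([P[mu[x], x, :] for x in range(3)])
    g_mu = np.array([g[x, mu[x]] for x in range(3)])
    total, J_l = np.zeros_like(J), J
    for l in range(terms):
        J_l = g_mu + alpha * P_mu @ J_l
        total += (1 - lam) * lam**l * J_l
    return total

J_star = np.zeros(3)
for _ in range(2000):
    J_star = T(J_star)

J = np.array([30.0, -20.0, 5.0])
for k in range(40):
    mu = (g + alpha * np.einsum('uxy,y->xu', P, J)).argmin(axis=1)   # T_mu J = T J
    J = T(J) if rng.random() < p else T_mu_lambda(J, mu)             # VI step with prob. p
print(f"||J_k - J*|| after 40 iterations: {np.max(np.abs(J - J_star)):.2e}")
```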