PRACTICAL DIFFICULTIES OF DP
• The curse of dimensionality
− Exponential growth of the computational and
storage requirements as the number of state
variables and control variables increases
− Quick explosion of the number of states in
combinatorial problems
− Intractability of imperfect state information
problems
• The curse of modeling
− Mathematical models
− Computer/simulation models
• There may be real-time solution constraints
− A family of problems may need to be addressed; the
data of the particular problem to be solved is given
with little advance notice
− The problem data may change as the system
is controlled – need for on-line replanning
COST-TO-GO FUNCTION APPROXIMATION
• Use a policy computed from the DP equation
where the optimal cost-to-go function Jk+1 is
replaced by an approximation J˜k+1 . (Sometimes
the expected value E{gk } is also replaced by an
approximation.)
• Apply µk (xk ), which attains the minimum in

\[
\min_{u_k \in U_k(x_k)} E\Big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\}
\]
• There are several ways to compute J˜k+1 :
− Off-line approximation: The entire function
J˜k+1 is computed for every k, before the con-
trol process begins.
− On-line approximation: Only the values J˜k+1 (xk+1 )
at the relevant next states xk+1 are com-
puted and used to compute uk just after the
current state xk becomes known.
− Simulation-based methods: These are off-
line and on-line methods that share the com-
mon characteristic that they are based on
Monte-Carlo simulation. Some of these meth-
ods are suitable for problems of very large
size.
CERTAINTY EQUIVALENT CONTROL (CEC)
• Idea: Replace the stochastic problem with a
deterministic problem
• At each time k, the future uncertain quantities
are fixed at some “typical” values
• On-line implementation for a perfect state info
problem. At each time k:
(1) Fix the wi , i ≥ k, at some nominal values w̄i .
Solve the deterministic problem:

\[
\text{minimize} \quad g_N(x_N) + \sum_{i=k}^{N-1} g_i(x_i, u_i, \bar{w}_i)
\]

where xk is known, and
ui ∈ Ui , xi+1 = fi (xi , ui , w̄i ).
(2) Use as control the first element in the opti-
mal control sequence found.
• So we apply µ̄k (xk ) that minimizes

\[
g_k(x_k, u_k, \bar{w}_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, \bar{w}_k)\big)
\]

where J˜k+1 is the optimal cost of the corresponding
deterministic problem.
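A minimal Python sketch of this on-line CEC step, assuming small finite control sets so the deterministic problem can be solved by brute-force enumeration (a real implementation would use deterministic DP or a shortest-path solver); the names f, g, g_N, U, and w_bar mirror fk , gk , gN , Ui , and the nominal values w̄i , and are illustrative:

```python
from itertools import product

def cec_control(k, x_k, N, f, g, g_N, U, w_bar):
    """Certainty equivalent control at time k: fix w_i = w_bar[i] for i >= k,
    solve the resulting deterministic problem, and return the first control."""
    best_cost, best_first_u = float("inf"), None
    # Brute-force enumeration of control sequences (u_k, ..., u_{N-1});
    # adequate only for short horizons and small control sets U(i).
    for seq in product(*(U(i) for i in range(k, N))):
        x, cost = x_k, 0.0
        for i, u in zip(range(k, N), seq):
            cost += g(i, x, u, w_bar[i])
            x = f(i, x, u, w_bar[i])
        cost += g_N(x)
        if cost < best_cost:
            best_cost, best_first_u = cost, seq[0]
    return best_first_u
```

At run time, cec_control is called once per stage as each new state xk is observed, and only the returned first control is applied.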
ALTERNATIVE OFF-LINE IMPLEMENTATION
• Let µd0 (x0 ), . . . , µdN−1 (xN−1 ) be an optimal
controller obtained from the DP algorithm for the
deterministic problem

\[
\text{minimize} \quad g_N(x_N) + \sum_{k=0}^{N-1} g_k\big(x_k, \mu_k(x_k), \bar{w}_k\big)
\]

subject to xk+1 = fk (xk , µk (xk ), w̄k ), µk (xk ) ∈ Uk
• The CEC applies at time k the control input
µdk (xk ).
• In an imperfect info version, xk is replaced by
an estimate xk (Ik ).
[Block diagram: the system xk+1 = fk (xk , uk , wk ) with disturbance wk and the measurement zk = hk (xk , uk−1 , vk ) with noise vk feed an estimator that produces the estimate xk (Ik ); the actuator applies uk = µdk (xk (Ik )), and a delay element supplies uk−1 .]
PARTIALLY STOCHASTIC CEC
• Instead of fixing all future disturbances to their
typical values, fix only some, and treat the rest as
stochastic.
• Important special case: Treat an imperfect
state information problem as one of perfect state
information, using an estimate xk (Ik ) of xk as if
it were exact.
• Multiaccess communication example: Con-
sider controlling the slotted Aloha system (Ex-
ample 5.1.1 in the text) by optimally choosing
the probability of transmission of waiting pack-
ets. This is a hard problem of imperfect state
info, whose perfect state info version is easy.
• Natural partially stochastic CEC:

\[
\tilde{\mu}_k(I_k) = \min\Big\{ 1, \ \frac{1}{x_k(I_k)} \Big\},
\]
where xk (Ik ) is an estimate of the current packet
backlog based on the entire past channel history
of successes, idles, and collisions (which is Ik ).
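A one-function sketch of this rule; how the estimate xk (Ik ) is maintained from the channel history is left unspecified, and the function name is illustrative:

```python
def aloha_transmission_prob(backlog_estimate):
    """Partially stochastic CEC for slotted Aloha: treat the backlog estimate
    x_k(I_k) as if it were the exact backlog and apply the perfect-state-info
    rule min{1, 1/x}."""
    # Clamp the estimate at 1 to avoid division by zero; the rule gives
    # probability 1 for any backlog estimate <= 1 anyway.
    return min(1.0, 1.0 / max(backlog_estimate, 1.0))
```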
GENERAL COST-TO-GO APPROXIMATION
• One-step lookahead (1SL) policy: At each k
and state xk , use the control µk (xk ) that attains
the minimum in

\[
\min_{u_k \in U_k(x_k)} E\Big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\},
\]
where
− J˜N = gN .
− J˜k+1 : approximation to true cost-to-go Jk+1
• Two-step lookahead policy: At each k and
xk , use the control µ̃k (xk ) attaining the minimum
above, where the function J˜k+1 is obtained using
a 1SL approximation (solve a 2-step DP problem).
• If J˜k+1 is readily available and the minimiza-
tion above is not too hard, the 1SL policy is im-
plementable on-line.
• Sometimes one also replaces Uk (xk ) above with
a subset of “most promising controls” Ūk (xk ).
• As the length of lookahead increases, the re-
quired computation quickly explodes.
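A sketch of the on-line 1SL computation for the case of finitely many controls and finitely many disturbance values, so the expectation can be evaluated exactly; here U(x) is Uk (xk ), P_w(k, x, u) returns the disturbance distribution as (value, probability) pairs, and f, g, J_tilde stand for fk , gk , J˜k+1 (illustrative names):

```python
def one_step_lookahead_control(k, x, U, P_w, f, g, J_tilde):
    """Return the control minimizing the expected stage cost plus the
    approximate cost-to-go at the resulting next state."""
    def q_factor(u):
        # Exact expectation over a finite disturbance distribution.
        return sum(p * (g(k, x, u, w) + J_tilde(k + 1, f(k, x, u, w)))
                   for w, p in P_w(k, x, u))
    return min(U(x), key=q_factor)
```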
PERFORMANCE BOUNDS FOR 1SL
• Let J̄k (xk ) be the cost-to-go from (xk , k) of the
1SL policy, based on functions J˜k .
• Assume that for all (xk , k), we have

\[
\hat{J}_k(x_k) \le \tilde{J}_k(x_k), \qquad (*)
\]

where ĴN = gN and for all k,

\[
\hat{J}_k(x_k) = \min_{u_k \in U_k(x_k)} E\Big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\}
\]

[so Ĵk (xk ) is computed along with µk (xk )]. Then

\[
\bar{J}_k(x_k) \le \hat{J}_k(x_k), \qquad \text{for all } (x_k, k).
\]
• Important application: When J˜k is the cost-to-
go of some heuristic policy (then the 1SL policy is
called the rollout policy).
• The bound can be extended to the case where
there is an error δk in the RHS of (*), i.e.,
Ĵk (xk ) ≤ J˜k (xk ) + δk . Then

\[
\bar{J}_k(x_k) \le \tilde{J}_k(x_k) + \delta_k + \cdots + \delta_{N-1}
\]
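The inductive step behind the bound (it mirrors the rollout cost improvement proof given later): assuming J̄k+1 (xk+1 ) ≤ Ĵk+1 (xk+1 ) ≤ J˜k+1 (xk+1 ) for all xk+1 ,

\begin{align*}
\bar{J}_k(x_k)
  &= E\Big\{ g_k\big(x_k, \mu_k(x_k), w_k\big)
        + \bar{J}_{k+1}\big(f_k(x_k, \mu_k(x_k), w_k)\big) \Big\} \\
  &\le E\Big\{ g_k\big(x_k, \mu_k(x_k), w_k\big)
        + \tilde{J}_{k+1}\big(f_k(x_k, \mu_k(x_k), w_k)\big) \Big\}
   = \hat{J}_k(x_k),
\end{align*}

since µk (xk ) attains the minimum defining Ĵk (xk ); combined with (*), this also gives J̄k (xk ) ≤ J˜k (xk ).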
COMPUTATIONAL ASPECTS
• Sometimes nonlinear programming can be used
to calculate the 1SL or the multistep version [par-
ticularly when Uk (xk ) is not a discrete set]. Con-
nection with stochastic programming methods
(see text).
• The choice of the approximating functions J˜k
is critical; they can be computed in a variety of ways.
• Some approaches:
(a) Problem Approximation: Approximate the
optimal cost-to-go with some cost derived
from a related but simpler problem
(b) Parametric Cost-to-Go Approximation: Ap-
proximate the optimal cost-to-go with a func-
tion of a suitable parametric form, whose pa-
rameters are tuned by some heuristic or sys-
tematic scheme (Neuro-Dynamic Program-
ming)
(c) Rollout Approach: Approximate the opti-
mal cost-to-go with the cost of some subop-
timal policy, which is calculated either ana-
lytically or by simulation
PROBLEM APPROXIMATION
• Many (problem-dependent) possibilities
− Replace uncertain quantities by nominal val-
ues, or simplify the calculation of expected
values by limited simulation
− Simplify difficult constraints or dynamics
• Example of enforced decomposition: Route m
vehicles that move over a graph. Each node has a
“value.” The first vehicle that passes through the
node collects its value. Maximize the total collected
value, subject to initial and final time constraints
(plus time windows and other constraints).
• Usually the 1-vehicle version of the problem is
much simpler. This motivates an approximation
obtained by solving single vehicle problems.
• 1SL scheme: At time k and state xk (positions
of the vehicles and set of already collected value
nodes), consider all possible kth moves by the
vehicles, and at each resulting state approximate
the optimal value-to-go by the value collected when
the vehicle routes are optimized one-at-a-time.
PARAMETRIC COST-TO-GO APPROXIMATION
• Use a cost-to-go approximation from a parametric
class J˜(x, r), where x is the current state and
r = (r1 , . . . , rm ) is a vector of “tunable” scalars
(weights).
• By adjusting the weights, one can change the
“shape” of the approximation J˜ so that it is rea-
sonably close to the true optimal cost-to-go func-
tion.
• Two key issues:
− The choice of the parametric class J˜(x, r)
(the approximation architecture).
− Method for tuning the weights (“training”
the architecture).
• Successful application strongly depends on how
these issues are handled, and on insight about the
problem.
• Sometimes a simulation-based algorithm is used,
particularly when there is no mathematical model
of the system.
• We will look in detail at these issues after a few
lectures.
APPROXIMATION ARCHITECTURES
• Divided into linear and nonlinear [i.e., linear or
nonlinear dependence of J˜(x, r) on r].
• Linear architectures are easier to train, but non-
linear ones (e.g., neural networks) are richer.
• Architectures based on feature extraction:

[Diagram: state x → feature extraction mapping → feature vector y → cost approximator with parameter vector r → cost approximation Jˆ(y, r).]
• Ideally, the features will encode much of the
nonlinearity inherent in the cost-to-go function
being approximated, so that the approximation may
be quite accurate without a complicated architecture.
• Sometimes the state space is partitioned, and
“local” features are introduced for each subset of
the partition (they are 0 outside the subset).
• With a well-chosen feature vector y(x), we can
use a linear architecture

\[
\tilde{J}(x, r) = \hat{J}\big(y(x), r\big) = \sum_i r_i \, y_i(x)
\]
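A small Python sketch of such a linear feature-based architecture; the quadratic features and the least-squares tuning below are illustrative choices, not prescribed here:

```python
import numpy as np

def features(x):
    """Illustrative feature vector y(x) for a scalar state x."""
    return np.array([1.0, x, x * x])

def J_tilde(x, r):
    """Linear architecture: J~(x, r) = sum_i r_i * y_i(x)."""
    return float(np.dot(r, features(x)))

def fit_weights(states, targets):
    """Tune r by least squares against sampled target costs (e.g., costs
    obtained by simulating some policy from the sampled states)."""
    Y = np.array([features(x) for x in states])
    r, *_ = np.linalg.lstsq(Y, np.asarray(targets, dtype=float), rcond=None)
    return r
```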
AN EXAMPLE - COMPUTER CHESS
• Programs use a feature-based position evaluator
that assigns a score to each move/position
[Diagram: chess position → feature extraction (material balance, mobility, safety, etc.) → weighting of features → score produced by the position evaluator.]
• Many context-dependent special features.
• Most often the weighting of features is linear
but multistep lookahead is involved.
• Most often the training is done by trial and
error.
ROLLOUT ALGORITHMS
• One-step lookahead policy: At each k and
state xk , use the control µk (xk ) that attains the
minimum in

\[
\min_{u_k \in U_k(x_k)} E\Big\{ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\},
\]
where
− J˜N = gN .
− J˜k+1 : approximation to true cost-to-go Jk+1
• Rollout algorithm: When J˜k is the cost-to-go
of some heuristic policy (called the base policy)
• Cost improvement property (to be shown): The
rollout algorithm achieves no worse (and usually
much better) cost than the base heuristic starting
from the same state.
• Main difficulty: Calculating J˜k (xk ) may be com-
putationally intensive if the cost-to-go of the base
policy cannot be analytically calculated.
− May involve Monte Carlo simulation if the
problem is stochastic.
− Things improve in the deterministic case.
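A sketch of the Monte Carlo evaluation step mentioned in the bullets above for a stochastic problem: the cost-to-go of the base policy at a candidate next state is estimated by averaging simulated trajectories; f, g, g_N, base_policy, and sample_w are assumed given (illustrative names):

```python
def mc_base_policy_cost(k, x, N, f, g, g_N, base_policy, sample_w, n_sims=100):
    """Estimate the base policy's cost-to-go from state x at time k by
    averaging n_sims simulated trajectories."""
    total = 0.0
    for _ in range(n_sims):
        xi, cost = x, 0.0
        for i in range(k, N):
            u = base_policy(i, xi)
            w = sample_w(i, xi, u)      # draw one disturbance sample
            cost += g(i, xi, u, w)
            xi = f(i, xi, u, w)
        total += cost + g_N(xi)
    return total / n_sims
```

The rollout control at (k, xk ) is then obtained by using this estimate in place of J˜k+1 in the one-step lookahead minimization above.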
EXAMPLE: THE QUIZ PROBLEM
• A person is given N questions; answering cor-
rectly question i has probability pi , reward vi .
Quiz terminates at the first incorrect answer.
• Problem: Choose the ordering of questions so
as to maximize the total expected reward.
• Assuming no other constraints, it is optimal to
use the index policy: Answer questions in decreas-
ing order of pi vi /(1 − pi ).
• With minor changes in the problem, the index
policy need not be optimal. Examples:
− A limit (< N ) on the maximum number of
questions that can be answered.
− Time windows, sequence-dependent rewards,
precedence constraints.
• Rollout with the index policy as base policy:
Convenient because at a given state (subset of
questions already answered), the index policy and
its expected reward can be easily calculated.
• Very effective for solving the quiz problem and
important generalizations in scheduling (see Bert-
sekas and Castanon, J. of Heuristics, Vol. 5, 1999).
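A Python sketch of rollout for a constrained variant of the quiz problem, here with a limit M < N on the number of questions that can be answered (the constraint is chosen only for illustration), using the index policy as base heuristic; questions are (p, v) pairs with p < 1:

```python
def index_order(questions):
    """Base heuristic: order questions by the index p*v/(1 - p), decreasing."""
    return sorted(questions, key=lambda q: q[0] * q[1] / (1.0 - q[0]), reverse=True)

def expected_reward(order, limit):
    """Expected reward of answering 'order' (at most 'limit' questions),
    with the quiz ending at the first incorrect answer."""
    total, surv = 0.0, 1.0
    for p, v in order[:limit]:
        surv *= p          # probability of having answered correctly so far
        total += surv * v
    return total

def rollout_quiz(questions, limit):
    """Rollout: at each step pick the next question whose index-policy
    completion gives the largest expected reward."""
    chosen, remaining = [], list(questions)
    while remaining and len(chosen) < limit:
        def completion_value(i):
            rest = index_order(remaining[:i] + remaining[i + 1:])
            return expected_reward(chosen + [remaining[i]] + rest, limit)
        best = max(range(len(remaining)), key=completion_value)
        chosen.append(remaining.pop(best))
    return chosen, expected_reward(chosen, limit)
```

For example, rollout_quiz([(0.9, 1.0), (0.5, 4.0), (0.8, 2.0), (0.3, 10.0)], limit=2) builds a two-question order by comparing index-policy completions; its expected reward is no smaller than that of the plain index order truncated to two questions.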
COST IMPROVEMENT PROPERTY
• Let
J̄k (xk ): Cost-to-go of the rollout policy
Hk (xk ): Cost-to-go of the base policy
• We claim that J̄k (xk ) ≤ Hk (xk ) for all xk , k
• Proof by induction: We have J̄N (xN ) = HN (xN )
for all xN . Assume that

\[
\bar{J}_{k+1}(x_{k+1}) \le H_{k+1}(x_{k+1}), \qquad \forall\ x_{k+1}.
\]

Then, for all xk ,

\[
\begin{aligned}
\bar{J}_k(x_k) &= E\Big\{ g_k\big(x_k, \bar{\mu}_k(x_k), w_k\big) + \bar{J}_{k+1}\big(f_k(x_k, \bar{\mu}_k(x_k), w_k)\big) \Big\} \\
&\le E\Big\{ g_k\big(x_k, \bar{\mu}_k(x_k), w_k\big) + H_{k+1}\big(f_k(x_k, \bar{\mu}_k(x_k), w_k)\big) \Big\} \\
&\le E\Big\{ g_k\big(x_k, \mu_k(x_k), w_k\big) + H_{k+1}\big(f_k(x_k, \mu_k(x_k), w_k)\big) \Big\} \\
&= H_k(x_k)
\end{aligned}
\]

where µ̄k denotes the rollout policy and µk the base policy at time k.
− Induction hypothesis ==> 1st inequality
− Min selection of µ̄k (xk ) ==> 2nd inequality
− Definition of Hk , µk ==> last equality
DISCRETE DETERMINISTIC PROBLEMS
• Any discrete optimization problem (with finite
number of choices/feasible solutions) can be repre-
sented sequentially by breaking down the decision
process into stages.
• A tree/shortest path representation. The leaves
of the tree correspond to the feasible solutions.
• Decisions can be made in stages.
− May complete partial solutions, one stage at
a time.
− May apply rollout with any heuristic that
can complete a partial solution.
− No costly stochastic simulation needed.
• Example: Traveling salesman problem. Find a
minimum cost tour that goes exactly once through
each of N cities.
[Tree of partial tours for cities A, B, C, D with origin node A (node s): the root A branches to AB, AC, AD; these branch to ABC, ABD, ACB, ACD, ADB, ADC; the leaves ABCD, ABDC, ACBD, ACDB, ADBC, ADCB correspond to the complete tours.]
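A Python sketch of rollout for the traveling salesman problem, using a nearest-neighbor completion as the base heuristic (an illustrative choice; any heuristic that can complete a partial tour will do); dist[a][b] is the cost of arc (a, b):

```python
def nn_complete(partial, cities, dist):
    """Base heuristic: extend the partial tour greedily to the nearest
    unvisited city until all cities are included."""
    tour = list(partial)
    remaining = [c for c in cities if c not in tour]
    while remaining:
        nxt = min(remaining, key=lambda c: dist[tour[-1]][c])
        tour.append(nxt)
        remaining.remove(nxt)
    return tour

def tour_cost(tour, dist):
    """Cost of the closed tour (returns to the starting city)."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

def rollout_tsp(cities, dist, start):
    """Rollout: at each stage append the city whose nearest-neighbor
    completion yields the cheapest full tour."""
    tour = [start]
    while len(tour) < len(cities):
        candidates = [c for c in cities if c not in tour]
        best = min(candidates,
                   key=lambda c: tour_cost(nn_complete(tour + [c], cities, dist), dist))
        tour.append(best)
    return tour, tour_cost(tour, dist)
```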
EXAMPLE: THE BREAKTHROUGH PROBLEM
[Figure: a binary tree with N stages rooted at a node labeled "root"; blocked arcs are crossed out and a free root-to-leaf path is drawn with thick lines.]
• Given a binary tree with N stages.
• Each arc is either free or is blocked (crossed out
in the figure).
• Problem: Find a free path from the root to the
leaves (such as the one shown with thick lines).
• Base heuristic (greedy): Follow the right branch
if free; else follow the left branch if free.
• This is a rare rollout instance that admits a
detailed analysis.
• For large N and given prob. of free branch:
the rollout algorithm requires O(N ) times more
computation, but has O(N ) times larger prob. of
finding a free path than the greedy algorithm.
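A small simulation sketch comparing the greedy base heuristic with its rollout on random instances; the arc encoding and the parameters below (depth, arc probability, number of trials) are illustrative:

```python
import random

def make_tree(N, p_free, rng):
    """Arc (d, i, b): from node i at depth d along branch b (0 = left,
    1 = right); each arc is free with probability p_free."""
    return {(d, i, b): rng.random() < p_free
            for d in range(N) for i in range(2 ** d) for b in (0, 1)}

def greedy(tree, N, d=0, i=0):
    """Base heuristic: take the right branch if free, else the left if free;
    return True if a leaf is reached."""
    if d == N:
        return True
    for b in (1, 0):
        if tree[(d, i, b)]:
            return greedy(tree, N, d + 1, 2 * i + b)
    return False

def rollout(tree, N, d=0, i=0):
    """Rollout: move along a free branch from which the greedy heuristic
    breaks through; if neither lookahead succeeds, fall back to greedy's choice."""
    if d == N:
        return True
    free = [b for b in (1, 0) if tree[(d, i, b)]]
    for b in free:
        if greedy(tree, N, d + 1, 2 * i + b):
            return rollout(tree, N, d + 1, 2 * i + b)
    return rollout(tree, N, d + 1, 2 * i + free[0]) if free else False

rng = random.Random(0)
trials = [make_tree(12, 0.5, rng) for _ in range(100)]
print(sum(greedy(t, 12) for t in trials), sum(rollout(t, 12) for t in trials))
```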
DET. EXAMPLE: ONE-DIMENSIONAL WALK
• A person takes either a unit step to the left or
a unit step to the right. Minimize the cost g(i) of
the point i where he will end up after N steps.
[Figure: the points reachable from (0, 0) in N steps, spanning (N, −N) to (N, N), and a plot of the terminal cost g(i) for −N ≤ i ≤ N, with a point ī marked between 0 and N.]
• Base heuristic: Always go to the right. Rollout
finds the rightmost local minimum.
• Base heuristic: Compare “always go to the right”
and “always go to the left,” and choose the best of
the two. Rollout finds a global minimum.
A ROLLOUT ISSUE FOR DISCRETE PROBLEMS
• The base heuristic need not constitute a policy
in the DP sense.
• Reason: Depending on its starting point, the
base heuristic may not apply the same control at
the same state.
• As a result the cost improvement property may
be lost (except if the base heuristic has a property
called sequential consistency; see the text for a
formal definition).
• The cost improvement property can be restored
in two ways:
− The base heuristic has a property called se-
quential improvement (see the text for a for-
mal definition).
− A variant of the rollout algorithm, called for-
tified rollout, is used, which enforces cost
improvement. Roughly speaking, the “best”
solution found so far is maintained, and it
is followed whenever the standard version of
the algorithm tries to follow a “worse” solution
(see the text).
ROLLING HORIZON WITH ROLLOUT
• We can use a rolling horizon approximation in
calculating the cost-to-go of the base heuristic.
• Because the heuristic is suboptimal, the ratio-
nale for a long rolling horizon becomes weaker.
• Example: N -stage stopping problem where the
stopping cost is 0, the continuation cost is either
−ǫ or 1, where 0 < ǫ << 1, and the first state
with continuation cost equal to 1 is state m. Then
the optimal policy is to stop at state m, and the
optimal cost is −mǫ.
[Figure: states 0, 1, 2, . . . , m, . . . , N in a line, with continuation cost −ǫ at the states before m and cost 1 at state m; stopping leads to a separate stopped state.]
• Consider the heuristic that continues at every
state, and the rollout policy that is based on this
heuristic, with a rolling horizon of l ≤ m steps.
• The rollout policy will continue through the first
m − l + 1 stages (states 0, . . . , m − l), thus accumulating
a cost of −(m − l + 1)ǫ. The rollout performance
improves as l becomes shorter!
• Limited vision may work to our advantage!
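A small sketch that reproduces the −(m − l + 1)ǫ figure by simulating the rollout with an l-step rolling horizon; the continuation cost is taken as −eps before state m and 1 from state m on, an illustrative choice consistent with the example:

```python
def stopping_rollout_cost(m, l, eps, N):
    """Cost accumulated by rolling out the 'always continue' base heuristic
    with an l-step rolling horizon on the stopping example."""
    def c(i):                        # continuation cost at state i
        return -eps if i < m else 1.0
    cost, i = 0.0, 0
    while i < N:
        horizon = sum(c(j) for j in range(i, min(i + l, N)))
        if horizon >= 0.0:           # stopping (cost 0) looks at least as good
            break
        cost += c(i)                 # continue: pay c(i) and move on
        i += 1
    return cost

# e.g. stopping_rollout_cost(m=20, l=5, eps=0.01, N=40) gives
# -(20 - 5 + 1) * 0.01 = -0.16 (up to floating-point rounding).
```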