Chapter 6 Approximate Dynamic Programming

The document discusses the practical difficulties of dynamic programming (DP), including the curse of dimensionality and modeling, as well as real-time solution constraints. It explores cost-to-go function approximation methods, certainty equivalent control, and various implementation strategies for DP, including rollout algorithms and parametric cost-to-go approximations. Additionally, it provides examples such as computer chess and the quiz problem to illustrate the concepts and their applications in optimization scenarios.


PRACTICAL DIFFICULTIES OF DP

• The curse of dimensionality


− Exponential growth of the computational and
storage requirements as the number of state
variables and control variables increases
− Quick explosion of the number of states in
combinatorial problems
− Intractability of imperfect state information
problems
• The curse of modeling
− Mathematical models
− Computer/simulation models
• There may be real-time solution constraints
− A family of problems may be addressed. The
data of the problem to be solved is given with
little advance notice
− The problem data may change as the system
is controlled – need for on-line replanning
COST-TO-GO FUNCTION APPROXIMATION

• Use a policy computed from the DP equation, where the optimal cost-to-go
function Jk+1 is replaced by an approximation J̃k+1. (Sometimes E{gk} is also
replaced by an approximation.)
• Apply µk(xk), which attains the minimum in

    min_{uk ∈ Uk(xk)}  E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) }

• There are several ways to compute J̃k+1:
  − Off-line approximation: The entire function J̃k+1 is computed for every k,
    before the control process begins.
  − On-line approximation: Only the values J̃k+1(xk+1) at the relevant next
    states xk+1 are computed, and they are used to compute uk just after the
    current state xk becomes known.
  − Simulation-based methods: These are off-line and on-line methods that
    share the common characteristic that they are based on Monte Carlo
    simulation. Some of these methods are suitable for problems of very
    large size.
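
As an illustration of the on-line computation in the minimization above, here is a minimal Python sketch of a one-step lookahead controller. It assumes a finite control set, a disturbance distribution given as (value, probability) pairs, and user-supplied functions f, g, and J_tilde; all names are illustrative, not from the text.

    def one_step_lookahead(x, k, controls, disturbances, f, g, J_tilde):
        """Return the control attaining
        min over u in U_k(x) of E{ g(x,u,w) + J_tilde(f(x,u,w), k+1) }.

        controls:     iterable of admissible controls U_k(x)
        disturbances: list of (w, prob) pairs for the distribution of w_k
        f, g:         dynamics f(x,u,w) and stage cost g(x,u,w)
        J_tilde:      approximate cost-to-go, called as J_tilde(next_state, k+1)
        """
        best_u, best_q = None, float("inf")
        for u in controls:
            q = sum(p * (g(x, u, w) + J_tilde(f(x, u, w), k + 1))
                    for w, p in disturbances)   # expected stage cost + approximation
            if q < best_q:
                best_u, best_q = u, q
        return best_u, best_q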
CERTAINTY EQUIVALENT CONTROL (CEC)

• Idea: Replace the stochastic problem with a deterministic problem.
• At each time k, the future uncertain quantities
are fixed at some “typical” values
• On-line implementation for a perfect state info
problem. At each time k:
(1) Fix the wi, i ≥ k, at some nominal values w̄i. Solve the deterministic
    problem:

        minimize  gN(xN) + Σ_{i=k}^{N−1} gi(xi, ui, w̄i)

    where xk is known, and

        ui ∈ Ui,   xi+1 = fi(xi, ui, w̄i).

(2) Use as control the first element in the optimal control sequence found.
• So we apply µ̄k(xk) that minimizes

    gk(xk, uk, w̄k) + J̃k+1( fk(xk, uk, w̄k) )

where J̃k+1 is the optimal cost of the corresponding deterministic problem.
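
A minimal sketch of the on-line CEC loop described above. It assumes a finite control set and solves the deterministic tail problem by brute-force enumeration (adequate only for tiny horizons and control sets); the nominal values w_bar and all function names are illustrative, not from the text.

    from itertools import product

    def cec_control(x_k, k, N, controls, f, g, g_N, w_bar):
        """Certainty equivalent control at time k: fix w_i = w_bar[i] for i >= k,
        solve the resulting deterministic problem by enumerating control sequences,
        and return the first control of the best sequence.  Assumes k < N."""
        best_seq, best_cost = None, float("inf")
        for seq in product(controls, repeat=N - k):     # candidate (u_k, ..., u_{N-1})
            x, cost = x_k, 0.0
            for i, u in enumerate(seq, start=k):
                cost += g(x, u, w_bar[i])               # deterministic stage cost
                x = f(x, u, w_bar[i])                   # deterministic dynamics
            cost += g_N(x)                              # terminal cost
            if cost < best_cost:
                best_seq, best_cost = seq, cost
        return best_seq[0]                              # apply only the first control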
ALTERNATIVE OFF-LINE IMPLEMENTATION
• Let {µd0(x0), . . . , µdN−1(xN−1)} be an optimal controller obtained from the
DP algorithm for the deterministic problem

    minimize  gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), w̄k)

    subject to  xk+1 = fk(xk, µk(xk), w̄k),   µk(xk) ∈ Uk

• The CEC applies at time k the control input µdk(xk).
• In an imperfect info version, xk is replaced by an estimate x̄k(Ik).

[Figure: block diagram of CEC with imperfect state information. An estimator
forms x̄k(Ik) from the measurements zk = hk(xk, uk−1, vk); the actuator applies
uk = µdk(x̄k(Ik)) to the system xk+1 = fk(xk, uk, wk).]
PARTIALLY STOCHASTIC CEC

• Instead of fixing all future disturbances to their


typical values, fix only some, and treat the rest as
stochastic.
• Important special case: Treat an imperfect state information problem as one
of perfect state information, using an estimate x̄k(Ik) of xk as if it were
exact.
• Multiaccess communication example: Consider controlling the slotted Aloha
system (Example 5.1.1 in the text) by optimally choosing the probability of
transmission of waiting packets. This is a hard problem of imperfect state
info, whose perfect state info version is easy.
• Natural partially stochastic CEC:

    µ̃k(Ik) = min{ 1, 1/x̄k(Ik) },

where x̄k(Ik) is an estimate of the current packet backlog, based on the entire
past channel history of successes, idles, and collisions (which is Ik).
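
The resulting control rule is a one-liner; a minimal sketch follows (the backlog estimator itself is not shown, and the guard against estimates below 1 is an illustrative choice, not from the text).

    def aloha_cec_transmit_prob(backlog_estimate):
        """Partially stochastic CEC for slotted Aloha: transmit each waiting packet
        with probability min(1, 1/backlog), using the backlog estimate x̄k(Ik) as if
        it were the exact backlog."""
        return min(1.0, 1.0 / max(backlog_estimate, 1.0))   # guard against estimates < 1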
GENERAL COST-TO-GO APPROXIMATION

• One-step lookahead (1SL) policy: At each k and state xk, use the control
µk(xk) that attains the minimum in

    min_{uk ∈ Uk(xk)}  E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },

where
  − J̃N = gN.
  − J̃k+1: approximation to the true cost-to-go Jk+1.
• Two-step lookahead policy: At each k and xk, use the control µ̃k(xk)
attaining the minimum above, where the function J̃k+1 is obtained using a 1SL
approximation (solve a 2-step DP problem).
• If J̃k+1 is readily available and the minimization above is not too hard, the
1SL policy is implementable on-line.
• Sometimes one also replaces Uk(xk) above with a subset of "most promising
controls" Ūk(xk).
• As the length of lookahead increases, the required computation quickly
explodes (a sketch of multistep lookahead follows below).
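
A minimal sketch of multistep lookahead built on the same approximation: the DP minimization is applied recursively for a fixed depth, and the approximation J_tilde is used at the leaves (depth = 1 recovers the 1SL value, depth = 2 the two-step lookahead value). The interface (finite control set, enumerable disturbances) is an illustrative assumption.

    def lookahead_value(x, k, depth, controls, disturbances, f, g, J_tilde):
        """depth-step lookahead value at (x, k): apply the DP minimization for
        `depth` stages, then fall back to J_tilde at the leaves."""
        if depth == 0:
            return J_tilde(x, k)
        return min(
            sum(p * (g(x, u, w) +
                     lookahead_value(f(x, u, w), k + 1, depth - 1,
                                     controls, disturbances, f, g, J_tilde))
                for w, p in disturbances)
            for u in controls)

The two-step lookahead policy applies, at (x, k), the control achieving the minimum in the depth-2 expression; the work grows roughly as (number of controls × number of disturbance values) raised to the lookahead depth, which is the computational explosion noted above.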
PERFORMANCE BOUNDS FOR 1SL

• Let J̄k(xk) be the cost-to-go from (xk, k) of the 1SL policy, based on
functions J̃k.
• Assume that for all (xk, k), we have

    Ĵk(xk) ≤ J̃k(xk),                                   (*)

where ĴN = gN and, for all k,

    Ĵk(xk) = min_{uk ∈ Uk(xk)}  E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) }

[so Ĵk(xk) is computed along with µk(xk)]. Then

    J̄k(xk) ≤ Ĵk(xk),   for all (xk, k).

• Important application: When J̃k is the cost-to-go of some heuristic policy
(then the 1SL policy is called the rollout policy).
• The bound can be extended to the case where there is a δk in the RHS of (*).
Then

    J̄k(xk) ≤ J̃k(xk) + δk + · · · + δN−1
COMPUTATIONAL ASPECTS

• Sometimes nonlinear programming can be used to calculate the 1SL or the
multistep version [particularly when Uk(xk) is not a discrete set]. Connection
with stochastic programming methods (see text).
• The choice of the approximating functions J̃k is critical; they are
calculated in a variety of ways.
• Some approaches:
(a) Problem Approximation: Approximate the optimal cost-to-go with some cost
    derived from a related but simpler problem.
(b) Parametric Cost-to-Go Approximation: Approximate the optimal cost-to-go
    with a function of a suitable parametric form, whose parameters are tuned
    by some heuristic or systematic scheme (Neuro-Dynamic Programming).
(c) Rollout Approach: Approximate the optimal cost-to-go with the cost of some
    suboptimal policy, which is calculated either analytically or by
    simulation.
PROBLEM APPROXIMATION

• Many (problem-dependent) possibilities


  − Replace uncertain quantities by nominal values, or simplify the
    calculation of expected values by limited simulation
  − Simplify difficult constraints or dynamics
• Example of enforced decomposition: Route m
vehicles that move over a graph. Each node has a
“value.” The first vehicle that passes through the
node collects its value. Max the total collected
value, subject to initial and final time constraints
(plus time windows and other constraints).
• Usually the 1-vehicle version of the problem is
much simpler. This motivates an approximation
obtained by solving single vehicle problems.
• 1SL scheme: At time k and state xk (position of vehicles and "collected
value nodes"), consider all possible kth moves by the vehicles, and at the
resulting states approximate the optimal value-to-go by the value collected
when the vehicle routes are optimized one at a time (a sketch follows below).
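
A sketch of this enforced-decomposition approximation, assuming the graph is a dict of adjacency lists, nonnegative node values, and horizons short enough for exhaustive single-vehicle search; the data structures and helper names are illustrative, not from the text.

    def best_single_vehicle_value(graph, values, start, steps, collected):
        """Optimal value one vehicle can still collect in `steps` moves, by exhaustive
        search over its routes (adequate only for short horizons).  `graph` maps a node
        to its neighbors; `collected` is the set of nodes whose value is already gone."""
        if steps == 0:
            return 0.0, collected
        best_val, best_coll = 0.0, collected
        for nxt in graph[start]:
            gain = 0.0 if nxt in collected else values[nxt]
            val, coll = best_single_vehicle_value(graph, values, nxt, steps - 1,
                                                  collected | {nxt})
            if gain + val > best_val:
                best_val, best_coll = gain + val, coll
        return best_val, best_coll

    def decomposed_value_to_go(graph, values, positions, steps, collected):
        """Approximate multi-vehicle value-to-go by optimizing the vehicles one at a
        time: each vehicle is routed over the values left by its predecessors."""
        total = 0.0
        for pos in positions:
            val, collected = best_single_vehicle_value(graph, values, pos, steps, collected)
            total += val
        return total

The 1SL scheme above then scores each candidate joint kth move by evaluating decomposed_value_to_go at the state it leads to.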
PARAMETRIC COST-TO-GO APPROXIMATION

• Use a cost-to-go approximation from a parametric class J̃(x, r), where x is
the current state and r = (r1, . . . , rm) is a vector of "tunable" scalars
(weights).
• By adjusting the weights, one can change the "shape" of the approximation J̃
so that it is reasonably close to the true optimal cost-to-go function.
• Two key issues:
  − The choice of the parametric class J̃(x, r) (the approximation
    architecture).
  − The method for tuning the weights ("training" the architecture).
• Successful application strongly depends on how
these issues are handled, and on insight about the
problem.
• Sometimes a simulation-based algorithm is used,
particularly when there is no mathematical model
of the system.
• We will look in detail at these issues after a few
lectures.
APPROXIMATION ARCHITECTURES

• Divided into linear and nonlinear [i.e., linear or nonlinear dependence of
J̃(x, r) on r].
• Linear architectures are easier to train, but non-
linear ones (e.g., neural networks) are richer.
• Architectures based on feature extraction
[Figure: state x → feature extraction mapping → feature vector y → cost
approximator with parameter vector r → cost approximation Ĵ(y, r).]

• Ideally, the features will encode much of the nonlinearity that is inherent
in the cost-to-go being approximated, and the approximation may be quite
accurate without a complicated architecture.
• Sometimes the state space is partitioned, and
“local” features are introduced for each subset of
the partition (they are 0 outside the subset).
• With a well-chosen feature vector y(x), we can use a linear architecture

    J̃(x, r) = Ĵ( y(x), r ) = Σi ri yi(x)
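
A minimal sketch of such a linear, feature-based architecture, with the weights r tuned by least squares on sampled (state, cost) pairs; the quadratic features and the use of NumPy's lstsq are illustrative choices, not prescribed by the text.

    import numpy as np

    def linear_cost_approx(features, r):
        """J̃(x, r) = sum_i r_i y_i(x), for a precomputed feature vector y(x)."""
        return float(np.dot(features, r))

    def train_weights(feature_map, states, target_costs):
        """Tune r by least squares so that sum_i r_i y_i(x) matches sampled costs."""
        Y = np.array([feature_map(x) for x in states])    # one row of features per state
        r, *_ = np.linalg.lstsq(Y, np.array(target_costs), rcond=None)
        return r

    # Illustrative usage: quadratic features y(x) = (1, x, x^2) on a scalar state.
    feature_map = lambda x: [1.0, x, x * x]
    r = train_weights(feature_map, [0.0, 1.0, 2.0, 3.0], [1.0, 0.5, 1.2, 3.1])
    print(linear_cost_approx(feature_map(1.5), r))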
AN EXAMPLE - COMPUTER CHESS

• Programs use a feature-based position evaluator


that assigns a score to each move/position

[Figure: position evaluator — feature extraction (material balance, mobility,
safety, etc.), followed by a weighting of the features that produces a score
for each move/position.]

• Many context-dependent special features.


• Most often the weighting of features is linear
but multistep lookahead is involved.
• Most often the training is done by trial and
error.
ROLLOUT ALGORITHMS

• One-step lookahead policy: At each k and state xk, use the control µk(xk)
that attains the minimum in

    min_{uk ∈ Uk(xk)}  E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },

where
  − J̃N = gN.
  − J̃k+1: approximation to the true cost-to-go Jk+1.
• Rollout algorithm: When J̃k is the cost-to-go of some heuristic policy
(called the base policy).
• Cost improvement property (to be shown): The rollout algorithm achieves no
worse (and usually much better) cost than the base heuristic starting from the
same state.
• Main difficulty: Calculating J̃k(xk) may be computationally intensive if the
cost-to-go of the base policy cannot be analytically calculated.
  − May involve Monte Carlo simulation if the problem is stochastic (see the
    sketch below).
  − Things improve in the deterministic case.
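
A generic sketch of one rollout step for a stochastic problem, with the base-policy cost-to-go estimated by Monte Carlo simulation; the sampling interface and all function names are illustrative assumptions, not from the text.

    def simulate_base_policy(x, k, N, base_policy, f, g, g_N, sample_w):
        """Cost of one simulated trajectory of the base policy from (x, k)."""
        cost = 0.0
        for i in range(k, N):
            u = base_policy(x, i)
            w = sample_w(i)                    # draw w_i from its distribution
            cost += g(x, u, w)
            x = f(x, u, w)
        return cost + g_N(x)

    def rollout_control(x, k, N, controls, base_policy, f, g, g_N, sample_w, n_sims=100):
        """One rollout step: for each candidate u, average the stage cost plus the
        simulated base-policy cost from the next state, and pick the best candidate."""
        best_u, best_q = None, float("inf")
        for u in controls:
            total = 0.0
            for _ in range(n_sims):
                w = sample_w(k)
                total += g(x, u, w) + simulate_base_policy(
                    f(x, u, w), k + 1, N, base_policy, f, g, g_N, sample_w)
            q = total / n_sims
            if q < best_q:
                best_u, best_q = u, q
        return best_u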
EXAMPLE: THE QUIZ PROBLEM

• A person is given N questions; answering question i correctly has
probability pi and gives reward vi. The quiz terminates at the first incorrect
answer.
• Problem: Choose the ordering of questions so
as to maximize the total expected reward.
• Assuming no other constraints, it is optimal to
use the index policy: Answer questions in decreas-
ing order of pi vi /(1 − pi ).
• With minor changes in the problem, the index
policy need not be optimal. Examples:
− A limit (< N ) on the maximum number of
questions that can be answered.
− Time windows, sequence-dependent rewards,
precedence constraints.
• Rollout with the index policy as base policy: Convenient because at a given
state (subset of questions already answered), the index policy and its
expected reward can be easily calculated (see the sketch below).
• Very effective for solving the quiz problem and important generalizations in
scheduling (see Bertsekas and Castanon, J. of Heuristics, Vol. 5, 1999).
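
A sketch of the rollout step for the budget-limited variant mentioned above (a limit on the number of questions that may still be answered), using the index policy as the base policy; it assumes pi < 1 and uses the closed-form expected reward of an ordering. Names are illustrative.

    def expected_reward(order, p, v):
        """Expected reward of answering questions in the given order:
        sum over j of (prod_{i<j} p_i) * p_j * v_j (the quiz stops at the first miss)."""
        total, survive = 0.0, 1.0
        for q in order:
            total += survive * p[q] * v[q]
            survive *= p[q]
        return total

    def index_order(remaining, p, v):
        """Index policy: answer in decreasing order of p_i v_i / (1 - p_i); needs p_i < 1."""
        return sorted(remaining, key=lambda q: p[q] * v[q] / (1.0 - p[q]), reverse=True)

    def rollout_next_question(remaining, budget, p, v):
        """Rollout step when at most `budget` more questions may be answered: try each
        candidate next question, complete the sequence with the (truncated) index
        policy, and pick the candidate giving the largest expected reward."""
        best_q, best_val = None, -1.0
        for q in remaining:
            completion = index_order([r for r in remaining if r != q], p, v)[:budget - 1]
            val = expected_reward([q] + completion, p, v)
            if val > best_val:
                best_q, best_val = q, val
        return best_q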
COST IMPROVEMENT PROPERTY

• Let

    J̄k(xk): Cost-to-go of the rollout policy µ̄k
    Hk(xk): Cost-to-go of the base policy µk

• We claim that J̄k(xk) ≤ Hk(xk) for all xk, k.
• Proof by induction: We have J̄N(xN) = HN(xN) for all xN. Assume that

    J̄k+1(xk+1) ≤ Hk+1(xk+1),   ∀ xk+1.

Then, for all xk,

    J̄k(xk) = E{ gk(xk, µ̄k(xk), wk) + J̄k+1( fk(xk, µ̄k(xk), wk) ) }
           ≤ E{ gk(xk, µ̄k(xk), wk) + Hk+1( fk(xk, µ̄k(xk), wk) ) }
           ≤ E{ gk(xk, µk(xk), wk) + Hk+1( fk(xk, µk(xk), wk) ) }
           = Hk(xk)

  − Induction hypothesis ==> 1st inequality
  − Min selection of µ̄k(xk) ==> 2nd inequality
  − Definition of Hk, µk ==> last equality
DISCRETE DETERMINISTIC PROBLEMS

• Any discrete optimization problem (with finite


number of choices/feasible solutions) can be repre-
sented sequentially by breaking down the decision
process into stages.
• A tree/shortest path representation. The leaves
of the tree correspond to the feasible solutions.
• Decisions can be made in stages.
− May complete partial solutions, one stage at
a time.
− May apply rollout with any heuristic that
can complete a partial solution.
− No costly stochastic simulation needed.
• Example: Traveling salesman problem. Find a
minimum cost tour that goes exactly once through
each of N cities.
[Figure: tree of partial tours for a 4-city TSP with origin node A: the root A
branches to AB, AC, AD; then to ABC, ABD, ACB, ACD, ADB, ADC; the leaves ABCD,
ABDC, ACBD, ACDB, ADBC, ADCB are the complete tours.]
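
A sketch of rollout for the traveling salesman problem, using a nearest-neighbor completion as the base heuristic (any heuristic that completes a partial tour would do); the distance matrix is assumed indexable as dist[a][b]. Names are illustrative.

    def nearest_neighbor_completion(partial, remaining, dist):
        """Base heuristic: complete a partial tour by repeatedly visiting the nearest
        unvisited city."""
        tour, left = list(partial), set(remaining)
        while left:
            nxt = min(left, key=lambda c: dist[tour[-1]][c])
            tour.append(nxt)
            left.remove(nxt)
        return tour

    def tour_cost(tour, dist):
        """Cost of a closed tour (including the return to the starting city)."""
        return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

    def rollout_tsp(cities, start, dist):
        """Rollout: at each stage, try every next city, complete the tour with the base
        heuristic, and commit to the city whose completion is cheapest."""
        tour, remaining = [start], set(cities) - {start}
        while remaining:
            nxt = min(remaining,
                      key=lambda c: tour_cost(
                          nearest_neighbor_completion(tour + [c], remaining - {c}, dist),
                          dist))
            tour.append(nxt)
            remaining.remove(nxt)
        return tour, tour_cost(tour, dist)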


EXAMPLE: THE BREAKTHROUGH PROBLEM

[Figure: binary tree with root at the top; blocked arcs are crossed out, and a
free root-to-leaf path is shown with thick lines.]

• Given a binary tree with N stages.


• Each arc is either free or is blocked (crossed out
in the figure).
• Problem: Find a free path from the root to the
leaves (such as the one shown with thick lines).
• Base heuristic (greedy): Follow the right branch
if free; else follow the left branch if free.
• This is a rare rollout instance that admits a
detailed analysis.
• For large N and given prob. of free branch:
the rollout algorithm requires O(N ) times more
computation, but has O(N ) times larger prob. of
finding a free path than the greedy algorithm.
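
A small simulation sketch of this problem: the greedy base heuristic of the slide, and the rollout that commits to a branch only after checking whether the greedy heuristic can reach a leaf from it. Building the whole random tree up front is exponential in N, so this is only for small instances; all names are illustrative.

    import random

    def make_tree(N, p_free, seed=0):
        """Random binary tree of depth N: free[(node, b)] says whether the arc from
        `node` to its left (b=0) or right (b=1) child is free.  Nodes are bit tuples."""
        rng, free = random.Random(seed), {}
        def fill(node, depth):
            if depth == N:
                return
            for b in (0, 1):
                free[(node, b)] = rng.random() < p_free
                fill(node + (b,), depth + 1)
        fill((), 0)
        return free

    def greedy(free, node, depth, N):
        """Base heuristic: follow the right branch if free, else the left branch if
        free.  Returns True if it reaches a leaf along free arcs."""
        if depth == N:
            return True
        for b in (1, 0):                                  # right first, then left
            if free[(node, b)]:
                return greedy(free, node + (b,), depth + 1, N)
        return False

    def rollout(free, N):
        """Rollout: among the free branches at each node, commit to one from which the
        greedy heuristic reaches a leaf (if any); otherwise take any free branch."""
        node = ()
        for depth in range(N):
            options = [b for b in (1, 0) if free[(node, b)]]
            if not options:
                return False
            good = [b for b in options if greedy(free, node + (b,), depth + 1, N)]
            node += ((good or options)[0],)
        return True

    free = make_tree(N=12, p_free=0.7)
    print(greedy(free, (), 0, 12), rollout(free, 12))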
DET. EXAMPLE: ONE-DIMENSIONAL WALK

• A person takes either a unit step to the left or


a unit step to the right. Minimize the cost g(i) of
the point i where he will end up after N steps.
[Figure: the walk's trajectories fan out from (0,0) to the final points
(N,−N), . . . , (N,0), . . . , (N,N); below, the final cost g(i) is plotted for
i = −N, . . . , N, with a global minimum at ī and a local minimum near i = N−2.]

• Base heuristic: Always go to the right. Rollout finds the rightmost local
minimum.
• Base heuristic: Compare "always go to the right" and "always go to the
left," and choose the better of the two. Rollout finds a global minimum.
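
A sketch of rollout for the one-dimensional walk with the two base heuristics above; the cost function g in the usage example is an arbitrary illustrative choice whose global minimum lies to the left of a local one.

    def rollout_walk(N, g, base="right"):
        """Rollout for the one-dimensional walk.  base="right": the base heuristic
        always steps right, so its completion cost from (i, k) is g(i + N - k).
        base="best_of_two": it takes the better of always-right and always-left."""
        def heuristic_cost(i, k):
            if base == "right":
                return g(i + N - k)
            return min(g(i + N - k), g(i - (N - k)))

        i = 0
        for k in range(N):
            # Score a left and a right step by the base-heuristic completion cost.
            i = min((i - 1, i + 1), key=lambda j: heuristic_cost(j, k + 1))
        return i, g(i)

    # Illustrative cost with its global minimum to the left of a local minimum.
    g = lambda i: (i + 3) ** 2 * ((i - 4) ** 2 + 1) / 100.0
    print(rollout_walk(8, g))                      # ends at i = 4, a local minimum
    print(rollout_walk(8, g, base="best_of_two"))  # ends at i = -2, the best reachable point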
A ROLLOUT ISSUE FOR DISCRETE PROBLEMS

• The base heuristic need not constitute a policy


in the DP sense.
• Reason: Depending on its starting point, the
base heuristic may not apply the same control at
the same state.
• As a result the cost improvement property may
be lost (except if the base heuristic has a property
called sequential consistency; see the text for a
formal definition).
• The cost improvement property is restored in
two ways:
− The base heuristic has a property called se-
quential improvement (see the text for a for-
mal definition).
  − A variant of the rollout algorithm, called fortified rollout, is used,
    which enforces cost improvement. Roughly speaking, the "best" solution
    found so far is maintained, and it is followed whenever the standard
    version of the algorithm tries to follow a "worse" solution (see the
    text).
ROLLING HORIZON WITH ROLLOUT

• We can use a rolling horizon approximation in


calculating the cost-to-go of the base heuristic.
• Because the heuristic is suboptimal, the rationale for a long rolling
horizon becomes weaker.
• Example: N -stage stopping problem where the
stopping cost is 0, the continuation cost is either
−ǫ or 1, where 0 < ǫ << 1, and the first state
with continuation cost equal to 1 is state m. Then
the optimal policy is to stop at state m, and the
optimal cost is −mǫ.
[Figure: a chain of states 0, 1, 2, . . . , m, . . . , N; the continuation arcs
at states before m have cost −ǫ, the arc at state m has cost 1, and from every
state one may move to the Stopped State at cost 0.]

• Consider the heuristic that continues at every


state, and the rollout policy that is based on this
heuristic, with a rolling horizon of l ≤ m steps.
• It will continue up to the first m − l + 1 stages, thus accumulating a cost
of −(m − l + 1)ǫ. The rollout performance improves as l becomes shorter!
• Limited vision may work to our advantage!
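
A small numerical sketch of the stopping example above: here the rollout continues as long as the l-step cost of the always-continue base heuristic is negative, and stops otherwise. This is one way to model the rolling-horizon comparison; the numbers reproduce the −(m − l + 1)ǫ cost, which improves as l shrinks. Names and the decision rule are illustrative.

    def stopping_rollout_cost(m, N, l, eps=1e-3):
        """Cost of the rollout policy with rolling horizon l on the stopping example:
        continuation cost is -eps at states i < m and 1 at states i >= m; stopping
        costs 0.  The base heuristic always continues."""
        c = lambda i: -eps if i < m else 1.0
        total, i = 0.0, 0
        while i < N:
            window = sum(c(j) for j in range(i, min(i + l, N)))   # l-step heuristic cost
            if window >= 0.0:                                     # stopping looks better
                break
            total += c(i)
            i += 1
        return total

    # For m = 10 the optimal cost is -10*eps; shorter horizons do better here:
    for l in (10, 5, 2):
        print(l, stopping_rollout_cost(m=10, N=20, l=l))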
