Lecture 3:
Markov Decision Processes and Dynamic Programming
Hado van Hasselt
January 30, 2018, UCL
Background
Sutton & Barto 2018, Chapters 3 and 4
Recap
I Reinforcement learning is the science of learning to make decisions
I Agents can learn a policy, value function and/or a model
I The general problem involves taking into account time and consequences
I Decisions affect the reward, the agent state, and environment state
This Lecture
I Last lecture: multiple actions, but only one state—no model
I This lecture:
I Formalize the problem with full sequential structure
I Discuss a first class of solution methods, which assume the true model is given
I These methods are called dynamic programming
I Next lectures: use similar ideas, but use sampling instead of true model
Formalizing the RL interface
I We discuss a mathematical formulation of the agent-environment interaction
I This is called a Markov decision process
I We can then talk clearly about the objective and how to reach it
Introduction to MDPs
I Markov decision processes (MDPs) formally describe an environment
I For now, assume the environment is fully observable:
the current observation contains all relevant information
I Almost all RL problems can be formalized as MDPs, e.g.,
I Optimal control primarily deals with continuous MDPs
I Partially observable problems can be converted into MDPs
I Bandits are MDPs with one state
Markov Property
“The future is independent of the past given the present”
Consider a sequence of random states, $\{S_t\}_{t \in \mathbb{N}}$, indexed by time.
Definition
A state $s$ has the Markov property when, for all states $s' \in \mathcal{S}$ and all rewards $r \in \mathcal{R}$,
$p(R_{t+1} = r, S_{t+1} = s' \mid S_t = s) = p(R_{t+1} = r, S_{t+1} = s' \mid S_1, \ldots, S_{t-1}, S_t = s)$
for all possible histories $S_1, \ldots, S_{t-1}$
I The state captures all relevant information from the history
I Once the state is known, the history may be thrown away
I The state is a sufficient statistic of the past
Markov Decision Process
I A Markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, p, \gamma)$, where
I $\mathcal{S}$ is the set of all possible states
I $\mathcal{A}$ is the set of all possible actions (e.g., motor controls)
I $p(r, s' \mid s, a)$ is the joint probability of a reward $r$ and next state $s'$, given a state $s$ and action $a$
I $\gamma \in [0, 1]$ is a discount factor that trades off later rewards against earlier ones
I $p$ defines the dynamics of the problem
I The rewards and discount together define the goal
I Sometimes it is useful to marginalize out the state transitions or expected reward:
$p(s' \mid s, a) = \sum_r p(s', r \mid s, a) \qquad \mathbb{E}[R \mid s, a] = \sum_r \sum_{s'} r \, p(r, s' \mid s, a)$
Example: cleaning robot
I Consider a robot that cleans cans
I Two states: high battery charge or low battery charge
I Actions: {wait, search} in high, {wait, search, recharge} in low
I Dynamics may be stochastic
I p(St+1 = high | St = high, At = search) = α
I p(St+1 = low | St = high, At = search) = 1 − α
I Reward could be expected number of collected cans (deterministic), or actual
number of collected cans (stochastic)
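To make the example concrete, the dynamics $p(r, s' \mid s, a)$ of a small MDP like this can be written out as a table. Below is a minimal sketch in Python; the probabilities alpha and beta and the reward values are placeholder parameters chosen for illustration, not numbers given in the lecture.

```python
# A minimal tabular encoding of the cleaning-robot MDP.
# alpha, beta and the rewards below are placeholders (assumptions),
# loosely following the can-collecting robot example in Sutton & Barto.
alpha, beta = 0.9, 0.6                       # probabilities of keeping the current charge level
r_search, r_wait, r_rescue = 2.0, 1.0, -3.0  # rewards for searching, waiting, and running flat

# dynamics[(state, action)] = list of (probability, next_state, reward)
dynamics = {
    ("high", "search"):  [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("high", "wait"):    [(1.0, "high", r_wait)],
    ("low", "search"):   [(beta, "low", r_search), (1 - beta, "high", r_rescue)],
    ("low", "wait"):     [(1.0, "low", r_wait)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}
```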
Example: robot MDP
[Figure: transition graph of the cleaning-robot MDP, showing the states, actions, transition probabilities, and rewards]
Return
I Acting in an MDP results in returns $G_t$: the total discounted reward from time step $t$,
$G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
I This is a random variable that depends on the MDP and the policy
I The discount $\gamma \in [0, 1]$ is the present value of future rewards
I The marginal value of receiving reward $R$ after $k + 1$ time steps is $\gamma^k R$
I For $\gamma < 1$, immediate rewards are more important than delayed rewards
I $\gamma$ close to 0 leads to "myopic" evaluation
I $\gamma$ close to 1 leads to "far-sighted" evaluation
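As a small illustration (not part of the slides), the return of a finite reward sequence can be computed by accumulating backwards; the rewards in the usage line are arbitrary.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):      # accumulate from the end: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62
```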
Why discount?
Most Markov reward and decision processes are discounted. Why?
I Mathematically convenient to discount rewards
I Avoids infinite returns in cyclic Markov processes
I Immediate rewards may actually be more valuable (e.g., consider earning interest)
I Animal/human behaviour shows preference for immediate reward
I Sometimes we can use undiscounted Markov reward processes (i.e. γ = 1), e.g. if
all sequences terminate
I Note that reward and discount together determine the goal
Value Function
I The value function $v(s)$ gives the long-term value of state $s$:
$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s, \pi]$
I It can be defined recursively:
$v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, \pi]$
$= \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t \sim \pi(S_t)]$
$= \sum_a \pi(a \mid s) \sum_r \sum_{s'} p(r, s' \mid s, a) \left( r + \gamma v_\pi(s') \right)$
I The final step writes out the expectation explicitly
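The last line of this derivation maps directly onto code. Below is a minimal sketch, assuming the tabular model dynamics[(s, a)] of (probability, next state, reward) triples sketched earlier and a policy policy[s] mapping actions to probabilities; both representations are assumptions for illustration.

```python
def bellman_expectation(s, v, policy, dynamics, gamma):
    """One-step lookahead: v_pi(s) = sum_a pi(a|s) sum_{s',r} p(r,s'|s,a) (r + gamma * v_pi(s'))."""
    total = 0.0
    for a, pi_a in policy[s].items():                   # pi(a | s)
        for prob, s_next, r in dynamics[(s, a)]:        # p(r, s' | s, a)
            total += pi_a * prob * (r + gamma * v[s_next])
    return total
```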
Action values
I We can define state-action values
$q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a, \pi]$
I This implies
$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$
$= \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
$= \sum_r \sum_{s'} p(r, s' \mid s, a) \left( r + \gamma \sum_{a'} \pi(a' \mid s') q_\pi(s', a') \right)$
I Note that
$v_\pi(s) = \sum_a \pi(a \mid s) q_\pi(s, a) = \mathbb{E}[q_\pi(S_t, A_t) \mid S_t = s, \pi], \quad \forall s$
Bellman Equation in Matrix Form
I The Bellman value equation, for a given policy $\pi$, can be expressed using matrices,
$\mathbf{v} = \mathbf{r} + \gamma P^\pi \mathbf{v}$
where
$\mathbf{v}_i = v(s_i)$
$\mathbf{r}_i = \mathbb{E}[R_{t+1} \mid S_t = s_i, A_t \sim \pi(S_t)]$
$P^\pi_{ij} = \sum_a \pi(a \mid s_i) p(s_j \mid s_i, a)$
Bellman Equation in Matrix Form
I The Bellman equation, for a given policy π, can be expressed using matrices,
$\mathbf{v} = \mathbf{r} + \gamma P^\pi \mathbf{v}$
I This is a linear equation that can be solved directly (see the sketch after this list):
$\mathbf{v} = \mathbf{r} + \gamma P^\pi \mathbf{v}$
$(I - \gamma P^\pi) \mathbf{v} = \mathbf{r}$
$\mathbf{v} = (I - \gamma P^\pi)^{-1} \mathbf{r}$
I The computational complexity is $O(|\mathcal{S}|^3)$, so this is only feasible for small problems
I There are iterative methods for larger problems
I Dynamic programming
I Monte-Carlo evaluation
I Temporal-Difference learning
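Here is the small numerical sketch referenced above: the direct linear solve with NumPy, using an arbitrary two-state Markov reward process (the reward vector and transition matrix are made-up numbers).

```python
import numpy as np

gamma = 0.9
# Expected immediate rewards r_i and policy transition matrix P_pi for a toy 2-state problem
r = np.array([1.0, 0.5])
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])

# Solve (I - gamma * P_pi) v = r directly, rather than forming the inverse explicitly
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r)
print(v)   # exact v_pi for this toy problem
```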
Optimal Value Function
Definition
The optimal state-value function $v_*(s)$ is the maximum value function over all policies:
$v_*(s) = \max_\pi v_\pi(s)$
The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies:
$q_*(s, a) = \max_\pi q_\pi(s, a)$
I The optimal value function specifies the best possible performance in the MDP
I An MDP is “solved” when we know the optimal value function
Value Function
I Estimating vπ or qπ is called policy evaluation or, simply, prediction
I Estimating v∗ or q∗ is sometimes called control, because these can be used for
policy optimization
Bellman equations
I There are four main Bellman equations:
$v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t \sim \pi(S_t)]$
$v_*(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$
$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a]$
There can be no policy with a higher value than $v_*(s) = \max_\pi v_\pi(s), \; \forall s$
Bellman equations
I There are equivalences between state and action values
$v_\pi(s) = \sum_a \pi(a \mid s) q_\pi(s, a)$
$v_*(s) = \max_a q_*(s, a)$
Optimal Policy
Define a partial ordering over policies:
$\pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s), \; \forall s$
Theorem
For any Markov decision process,
I There exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \ge \pi, \; \forall \pi$
(There can be more than one such optimal policy.)
I All optimal policies achieve the optimal value function, $v_{\pi_*}(s) = v_*(s)$
I All optimal policies achieve the optimal action-value function, $q_{\pi_*}(s, a) = q_*(s, a)$
Finding an Optimal Policy
An optimal policy can be found by maximising over $q_*(s, a)$:
$\pi_*(s, a) = \begin{cases} 1 & \text{if } a = \operatorname{argmax}_{a \in \mathcal{A}} q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$
I There is always a deterministic optimal policy for any MDP
I If we know q ∗ (s, a), we immediately have the optimal policy
I There can be multiple optimal policies
I If multiple actions maximize q∗ (s, ·), we can also just pick any of these
(including stochastically)
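A minimal sketch of reading off a greedy policy from an action-value table; the dict-of-dicts representation and the numbers are assumptions for illustration.

```python
def greedy_policy(q):
    """Return a deterministic policy pi(s) = argmax_a q(s, a)."""
    return {s: max(q_s, key=q_s.get) for s, q_s in q.items()}

# Hypothetical optimal action values for the two-state robot example
q_star = {"high": {"search": 10.0, "wait": 7.0},
          "low":  {"search": 4.0, "wait": 5.0, "recharge": 6.0}}
print(greedy_policy(q_star))   # {'high': 'search', 'low': 'recharge'}
```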
Solving the Bellman Optimality Equation
I The Bellman optimality equation is non-linear
I Cannot use the same direct matrix solution as for policy evaluation (in general)
I Many iterative solution methods
I Using models / dynamic programming
I Value iteration
I Policy iteration
I Using samples
I Monte Carlo
I Q-learning
I Sarsa
Dynamic Programming
The 1950s were not good years for mathematical research. I felt I had to
shield the Air Force from the fact that I was really doing mathematics. What
title, what name, could I choose? I was interested in planning, in decision
making, in thinking. But planning is not a good word for various reasons. I
decided to use the word ‘programming.’ I wanted to get across the idea that
this was dynamic, this was time-varying—I thought, let’s kill two birds with
one stone. Let’s take a word that has a precise meaning, namely dynamic, in
the classical physical sense. It also is impossible to use the word, dynamic, in
a pejorative sense. Try thinking of some combination that will possibly give it
a pejorative meaning. It’s impossible. Thus, I thought dynamic programming
was a good name. It was something not even a Congressman could object to.
So I used it as an umbrella for my activities.
– Richard Bellman
(slightly paraphrased for conciseness, emphasis mine)
Dynamic programming
Dynamic programming refers to a collection of algorithms that can be used
to compute optimal policies given a perfect model of the environment as a
Markov decision process (MDP).
Sutton & Barto 2018
I We will discuss several dynamic programming methods to solve MDPs
I All such methods consist of two important parts:
policy evaluation and policy improvement
Policy evaluation
I We start by discussing how to estimate
vπ (s) = E [Rt+1 + γvπ (St+1 ) | s, π]
I Idea: turn this equality into an update
I First, initialize v0 , e.g., to zero
I Then, iterate
∀s : vk+1 (s) = E [Rt+1 + γvk (St+1 ) | s, π]
I Note: whenever vk+1 (s) = vk (s), for all s, we must have found vπ
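A sketch of this iteration in tabular form, reusing the assumed dynamics[(s, a)] model and policy[s] representation from the earlier sketches; the stopping threshold is an arbitrary choice.

```python
def policy_evaluation(states, policy, dynamics, gamma, theta=1e-8):
    """Iterate v_{k+1}(s) = E[R_{t+1} + gamma * v_k(S_{t+1}) | s, pi] until convergence."""
    v = {s: 0.0 for s in states}                      # initialize v_0 to zero
    while True:
        delta, v_new = 0.0, {}
        for s in states:
            v_new[s] = sum(pi_a * prob * (r + gamma * v[s_next])
                           for a, pi_a in policy[s].items()
                           for prob, s_next, r in dynamics[(s, a)])
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < theta:                             # values stopped changing: we found v_pi
            return v
```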
Policy evaluation
I Does policy evaluation always converge?
I Answer: yes, under appropriate conditions (e.g., γ < 1)
I Simple proof-sketch (continuing, discounted case):
$\max_s |v_{k+1}(s) - v_\pi(s)| = \max_s \left| \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, \pi] - \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, \pi] \right|$
$= \max_s \left| \mathbb{E}[\gamma v_k(S_{t+1}) - \gamma v_\pi(S_{t+1}) \mid S_t = s, \pi] \right|$
$= \gamma \max_s \left| \mathbb{E}[v_k(S_{t+1}) - v_\pi(S_{t+1}) \mid S_t = s, \pi] \right|$
$\le \gamma \max_s |v_k(s) - v_\pi(s)|$
I This implies $\lim_{k \to \infty} v_k = v_\pi$
I Finite-horizon episodic case is a bit harder, but also works
Policy evaluation
[Figure: iterative policy evaluation on the small gridworld example, showing the value estimates $v_k$ for the uniformly random policy and the corresponding greedy policies]
Policy Improvement
I The example already shows we can use evaluation to then improve our policy
I In fact, just being greedy with respect to the values of the random policy sufficed!
I That is not true in general
I Instead, we can iterate, using
$\forall s: \; \pi_{new}(s) = \operatorname{argmax}_a q_\pi(s, a) = \operatorname{argmax}_a \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$
I Then, evaluate πnew and repeat
I One can show that vπnew (s) ≥ vπ (s), for all s
Policy Improvement
I One can show that vπnew (s) ≥ vπ (s), for all s
I Let’s assume that, at some point, we have vπnew (s) = vπ (s)
I That implies
$v_{\pi_{new}}(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_{\pi_{new}}(S_{t+1}) \mid S_t = s, A_t = a]$
I But that is the Bellman optimality equation!
I Therefore, πnew must either be an improvement (when πnew > π), or be optimal
(when πnew = π)
Policy Iteration
Policy evaluation: estimate $v_\pi$
Policy improvement: generate $\pi' \ge \pi$
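Combining the two steps gives a sketch of policy iteration; it assumes the policy_evaluation helper sketched earlier, a dict actions[s] listing the available actions per state, and the same tabular model representation.

```python
def policy_iteration(states, actions, dynamics, gamma):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    policy = {s: {actions[s][0]: 1.0} for s in states}   # arbitrary initial deterministic policy
    while True:
        v = policy_evaluation(states, policy, dynamics, gamma)
        stable = True
        for s in states:
            # greedy improvement: argmax_a E[R_{t+1} + gamma * v(S_{t+1}) | s, a]
            q = {a: sum(p * (r + gamma * v[s_next]) for p, s_next, r in dynamics[(s, a)])
                 for a in actions[s]}
            best = max(q, key=q.get)
            if policy[s] != {best: 1.0}:
                policy[s] = {best: 1.0}
                stable = False
        if stable:                                       # greedy policy no longer changes: optimal
            return policy, v
```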
Jack’s Car Rental
I States: Two locations, maximum of 20 cars at each
I Actions: Move up to 5 cars overnight (-$2 each)
I Reward: $10 for each available car rented, γ = 0.9
I Transitions: Cars returned and requested randomly
I Poisson distribution: $n$ returns/requests with probability $\frac{\lambda^n}{n!} e^{-\lambda}$ (see the sketch after this list)
I 1st location: average requests = 3, average returns = 3
I 2nd location: average requests = 4, average returns = 2
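For reference, the Poisson probabilities used to model requests and returns can be computed directly; this small sketch is generic and not tied to the lecture's implementation.

```python
import math

def poisson_pmf(n, lam):
    """Probability of n events when the average rate is lam: lam^n / n! * exp(-lam)."""
    return lam ** n / math.factorial(n) * math.exp(-lam)

print(poisson_pmf(3, 3.0))   # probability of exactly 3 rental requests at the 1st location
```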
Jack's Car Rental
[Figure: sequence of policies found by policy iteration, and the final value function]
Policy Iteration
I Does policy evaluation need to converge to v π ?
I Or should we stop when we are ‘close’ ?
(E.g., with a threshold on the change to the values)
I Or simply stop after k iterations of iterative policy evaluation?
I In the small gridworld k = 3 was sufficient to achieve optimal policy
I Why not update policy every iteration — i.e. stop after k = 1?
I This is equivalent to value iteration (up next)
Value Iteration
I We could take the Bellman optimality equation, and turn that into an update
$\forall s: \; v_{k+1}(s) \leftarrow \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a]$
I This is equivalent to policy iteration, with k = 1 step of policy evaluation between
each two (greedy) policy improvement steps
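A sketch of synchronous value iteration over the same assumed tabular model representation, keeping two value tables per sweep (the in-place variant appears later).

```python
def value_iteration(states, actions, dynamics, gamma, theta=1e-8):
    """Iterate v_{k+1}(s) = max_a E[R_{t+1} + gamma * v_k(S_{t+1}) | s, a] until convergence."""
    v = {s: 0.0 for s in states}
    while True:
        v_new = {s: max(sum(p * (r + gamma * v[s_next]) for p, s_next, r in dynamics[(s, a)])
                        for a in actions[s])
                 for s in states}
        if max(abs(v_new[s] - v[s]) for s in states) < theta:
            return v_new
        v = v_new
```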
Example: Shortest Path
[Figure: value iteration on a small shortest-path gridworld with goal state g; the value estimates $V_1$ through $V_7$ converge to minus the number of steps to the goal]
Synchronous Dynamic Programming Algorithms
Problem      Bellman Equation                                          Algorithm
Prediction   Bellman Expectation Equation                              Iterative Policy Evaluation
Control      Bellman Expectation Equation + Greedy Policy Improvement  Policy Iteration
Control      Bellman Optimality Equation                               Value Iteration

I Algorithms are based on the state-value function $v_\pi(s)$ or $v_*(s)$
I Complexity $O(m n^2)$ per iteration, for $m$ actions and $n$ states
I Could also apply to the action-value function $q_\pi(s, a)$ or $q_*(s, a)$
I Complexity $O(m^2 n^2)$ per iteration
Asynchronous Dynamic Programming
I DP methods described so far used synchronous updates (all states in parallel)
I Asynchronous DP backs up states individually, in any order
I Can significantly reduce computation
I Guaranteed to converge if all states continue to be selected
Asynchronous Dynamic Programming
Three simple ideas for asynchronous dynamic programming:
I In-place dynamic programming
I Prioritised sweeping
I Real-time dynamic programming
In-Place Dynamic Programming
I Synchronous value iteration stores two copies of the value function:
for all $s \in \mathcal{S}$: $\; v_{new}(s) \leftarrow \max_a \mathbb{E}[R_{t+1} + \gamma v_{old}(S_{t+1}) \mid S_t = s, A_t = a]$
then $v_{old} \leftarrow v_{new}$
I In-place value iteration only stores one copy of the value function:
for all $s \in \mathcal{S}$: $\; v(s) \leftarrow \max_a \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s, A_t = a]$
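A sketch of the in-place variant: a single value table is overwritten, so later updates in the same sweep already use the newest values (same assumed model representation as before; the fixed number of sweeps is an arbitrary choice).

```python
def in_place_value_iteration(states, actions, dynamics, gamma, sweeps=100):
    """In-place value iteration: one value table, updated one state at a time."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:                 # states updated in place, in any order
            v[s] = max(sum(p * (r + gamma * v[s_next]) for p, s_next, r in dynamics[(s, a)])
                       for a in actions[s])
    return v
```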
Prioritised Sweeping
I Use the magnitude of the Bellman error to guide state selection, e.g.,
$\left| \max_a \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s, A_t = a] - v(s) \right|$
(a rough sketch follows after this list)
I Backup the state with the largest remaining Bellman error
I Update Bellman error of affected states after each backup
I Requires knowledge of reverse dynamics (predecessor states)
I Can be implemented efficiently by maintaining a priority queue
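A rough sketch of prioritised sweeping for value iteration, using a max-priority queue keyed by Bellman error and a precomputed predecessor map; these implementation details are assumptions, not specifics given on the slide.

```python
import heapq

def prioritised_sweeping(states, actions, dynamics, gamma, theta=1e-8, max_backups=10000):
    """Back up the state with (approximately) the largest remaining Bellman error first."""
    v = {s: 0.0 for s in states}
    # Reverse dynamics: which states can transition into each state
    predecessors = {s: set() for s in states}
    for (s, a), outcomes in dynamics.items():
        for _, s_next, _ in outcomes:
            predecessors[s_next].add(s)

    def bellman_error(s):
        target = max(sum(p * (r + gamma * v[s_next]) for p, s_next, r in dynamics[(s, a)])
                     for a in actions[s])
        return abs(target - v[s]), target

    # heapq is a min-heap, so store negated errors to pop the largest error first
    queue = [(-bellman_error(s)[0], s) for s in states]
    heapq.heapify(queue)
    for _ in range(max_backups):
        if not queue:
            break
        _, s = heapq.heappop(queue)
        err, target = bellman_error(s)    # recompute: queued priorities may be stale
        if err < theta:
            continue
        v[s] = target                     # back up this state
        for pred in predecessors[s]:      # this backup changes the Bellman error of predecessors
            heapq.heappush(queue, (-bellman_error(pred)[0], pred))
    return v
```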
Real-Time Dynamic Programming
I Idea: only update states that are relevant to agent
I E.g., if the agent is in state St , update that state value, or states that it expects
to be in soon
Full-Width Backups
I Standard DP uses full-width backups
I For each backup (sync or async):
I Every successor state and action is considered
I Using the true model of transitions and rewards
[Backup diagram: state $s$ with value $v_{k+1}(s)$, branching over actions $a$ and rewards $r$ to successor states $s'$ with values $v_k(s')$]
I DP is effective for medium-sized problems (millions of states)
I For large problems DP suffers from the curse of dimensionality
I The number of states $n = |\mathcal{S}|$ grows exponentially with the number of state variables
I Even one full backup can be too expensive
Approximate Dynamic Programming
I Key idea: Approximate the value function
I Using a function approximator $v_\theta(s)$, with a parameter vector $\theta \in \mathbb{R}^m$
I The estimated value function at iteration $k$ is $v_{\theta_k}$
I Use dynamic programming to compute $v_{\theta_{k+1}}$ from $v_{\theta_k}$
I E.g., 'fitted value iteration' repeats at each iteration $k$:
I Sample states $\tilde{\mathcal{S}} \subseteq \mathcal{S}$ (more simply, let $\tilde{\mathcal{S}} = \mathcal{S}$)
I For each sampled state $s \in \tilde{\mathcal{S}}$, compute a target using the Bellman optimality equation,
$\tilde{v}_k(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_{\theta_k}(S_{t+1}) \mid S_t = s, A_t = a]$
I Find $\theta_{k+1}$ by minimizing the loss (e.g., with one gradient step)
$\sum_{s \in \tilde{\mathcal{S}}} \left( v_\theta(s) - \tilde{v}_k(s) \right)^2$
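A minimal sketch of one fitted value iteration step with a linear value function $v_\theta(s) = \phi(s)^\top \theta$ fitted by least squares; the features and expected_update functions are assumed to be provided and are placeholders for illustration.

```python
import numpy as np

def fitted_value_iteration_step(theta, sampled_states, features, expected_update):
    """One fitted value iteration step with a linear function approximator.

    features(s)           -> NumPy feature vector phi(s)                (assumed given)
    expected_update(s, v) -> max_a E[R + gamma * v(S') | S=s, A=a]      (assumed given, uses the true model)
    """
    def v(s):                                           # current estimate v_{theta_k}
        return features(s) @ theta

    phi = np.stack([features(s) for s in sampled_states])
    targets = np.array([expected_update(s, v) for s in sampled_states])
    # Least-squares fit of theta_{k+1} to the bootstrapped targets
    theta_next, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return theta_next
```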
Bootstrapping in Dynamic Programming
I Dynamic programming improves the estimate of the value at a state using the
estimate of the value function at subsequent states
I This idea is core to RL — it is called bootstrapping
I There is a theoretical danger of divergence when combining
1. Bootstrapping
2. General function approximation
3. Updating values for a state distribution that doesn’t match the transition dynamics
of the MDP
I This theoretical danger is rarely encountered in practice
Example of divergence with dynamic programming
I Tsitsiklis and Van Roy made an example where dynamic programming with linear function approximation can diverge. Consider the two-state example below, where the rewards are all zero, there are no decisions, and there is a single parameter $\theta$ for estimating the values.
[Figure: Tsitsiklis and Van Roy's counterexample to DP policy evaluation with least-squares linear approximation; the first state has estimated value $\theta$ and always transitions to the second state, which has estimated value $2\theta$ and transitions to itself with probability $1 - \epsilon$]
$\theta_{k+1} = \operatorname{argmin}_\theta \sum_{s \in \mathcal{S}} \left( v_\theta(s) - \mathbb{E}[v_{\theta_k}(S_{t+1}) \mid S_t = s] \right)^2$
$= \operatorname{argmin}_\theta \, (\theta - \gamma 2\theta_k)^2 + (2\theta - \gamma(1 - \epsilon) 2\theta_k)^2$
$= \frac{6 - 4\epsilon}{5} \gamma \theta_k$
I What is $\lim_{k \to \infty} \theta_k$ when $\theta_0 = 1$, $\epsilon = \frac{1}{8}$, and $\gamma = 1$?
I This is only a problem when we update the states, e.g., synchronously, without looking at the time an agent would spend in each state
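To see the answer numerically, one can simply iterate the recursion $\theta_{k+1} = \frac{6 - 4\epsilon}{5} \gamma \theta_k$; with $\epsilon = \frac{1}{8}$ and $\gamma = 1$ the multiplier is $1.1 > 1$, so the parameter grows without bound.

```python
theta, epsilon, gamma = 1.0, 1 / 8, 1.0
for k in range(1, 21):
    theta = (6 - 4 * epsilon) / 5 * gamma * theta   # theta_{k+1} = 1.1 * theta_k
    print(k, theta)                                 # diverges: the value estimates blow up
```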
Questions?
The only stupid question is the one you were afraid to ask but never did.
-Rich Sutton
For questions that arise outside of class, please use Moodle