RL - 03 Markov Decision Processes and Dynamic Programming

This lecture introduces Markov Decision Processes (MDPs) and dynamic programming as methods for solving reinforcement learning problems. It formalizes the agent-environment interaction, discusses the Markov property, and outlines the structure of MDPs, including value functions and Bellman equations. The lecture also covers policy evaluation and improvement techniques to find optimal policies within MDPs.


Lecture 3:

Markov Decision Processes and Dynamic Programming

Hado van Hasselt

January 30, 2018, UCL


Background

Sutton & Barto 2018, Chapter 3 + 4


Recap

I Reinforcement learning is the science of learning to make decisions


I Agents can learn a policy, value function and/or a model
I The general problem involves taking into account time and consequences
I Decisions affect the reward, the agent state, and environment state
This Lecture

I Last lecture: multiple actions, but only one state—no model


I This lecture:
I Formalize the problem with full sequential structure
I Discuss first class of solution methods which assume true model is given
I These methods are called dynamic programming
I Next lectures: use similar ideas, but use sampling instead of true model
Formalizing the RL interface

I We discuss a mathematical formulation of the agent-environment interaction


I This is called a Markov decision process
I We can then talk clearly about the objective and how to reach it
Introduction to MDPs

I Markov decision processes (MDPs) formally describe an environment


I For now, assume the environment is fully observable:
the current observation contains all relevant information
I Almost all RL problems can be formalized as MDPs, e.g.,
I Optimal control primarily deals with continuous MDPs
I Partially observable problems can be converted into MDPs
I Bandits are MDPs with one state
Markov Property

“The future is independent of the past given the present”


Consider a sequence of random states, {St }t∈N , indexed by time.
Definition
A state s has the Markov property when, for all states s' ∈ S and all rewards r ∈ R,

p(R_{t+1} = r, S_{t+1} = s' | S_t = s) = p(R_{t+1} = r, S_{t+1} = s' | S_1, . . . , S_{t−1}, S_t = s)

for all possible histories S_1, . . . , S_{t−1}


I The state captures all relevant information from the history
I Once the state is known, the history may be thrown away
I The state is a sufficient statistic of the past
Markov Decision Process

I A Markov decision process is a tuple (S, A, p, γ), where


I S is the set of all possible states
I A is the set of all possible actions (e.g., motor controls)
I p(r, s' | s, a) is the joint probability of a reward r and next state s', given a state s and action a
I γ ∈ [0, 1] is a discount factor that trades off the importance of immediate and later rewards
I p defines the dynamics of the problem
I The rewards and discount together define the goal
I Sometimes it is useful to marginalise, e.g., to obtain the state-transition probabilities or the expected reward:

p(s' | s, a) = Σ_r p(s', r | s, a)    and    E[R | s, a] = Σ_r Σ_{s'} r p(r, s' | s, a)
Example: cleaning robot

I Consider a robot that cleans cans


I Two states: high battery charge or low battery charge
I Actions: {wait, search} in high, {wait, search, recharge} in low
I Dynamics may be stochastic
I p(St+1 = high | St = high, At = search) = α
I p(St+1 = low | St = high, At = search) = 1 − α
I Reward could be expected number of collected cans (deterministic), or actual
number of collected cans (stochastic)
Example: robot MDP

[Transition diagram of the robot MDP omitted; a sketch encoding these dynamics as a table follows below]
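To make the dynamics p(r, s' | s, a) concrete, here is a minimal sketch that encodes a robot of this kind as a table of (probability, next state, reward) triples and computes the marginals from the previous slide. The numerical values (α, β, the reward constants, and the penalty for running the battery flat) are illustrative assumptions, not values given in the lecture.

```python
# Sketch of cleaning-robot dynamics p(r, s' | s, a) as a lookup table.
# All numbers below are illustrative assumptions.
alpha, beta = 0.8, 0.6            # prob. of keeping a high / low charge while searching
r_search, r_wait = 2.0, 1.0       # expected number of cans collected per step

# p[(s, a)] is a list of (probability, next_state, reward) triples
p = {
    ("high", "search"):  [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("high", "wait"):    [(1.0, "high", r_wait)],
    ("low", "search"):   [(beta, "low", r_search), (1 - beta, "high", -3.0)],  # assumed penalty if the battery runs out
    ("low", "wait"):     [(1.0, "low", r_wait)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

# Marginals, as on the earlier slide: p(s' | s, a) and E[R | s, a]
def next_state_probs(s, a):
    probs = {}
    for prob, s2, _ in p[(s, a)]:
        probs[s2] = probs.get(s2, 0.0) + prob
    return probs

def expected_reward(s, a):
    return sum(prob * r for prob, _, r in p[(s, a)])

print(next_state_probs("high", "search"))   # {'high': 0.8, 'low': 0.2}
print(expected_reward("low", "search"))     # 0.6*2.0 + 0.4*(-3.0) = 0.0
```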
Return

I Acting in an MDP results in returns Gt : total discounted reward from time-step t



G_t = R_{t+1} + γ R_{t+2} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}

I This is a random variable that depends on the MDP and the policy (a small numerical example follows after this list)


I The discount γ ∈ [0, 1] is the present value of future rewards
I The marginal value of receiving reward R after k + 1 time-steps is γ^k R
I For γ < 1, immediate rewards are more important than delayed rewards
I γ close to 0 leads to “myopic” evaluation
I γ close to 1 leads to “far-sighted” evaluation
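As a small numerical illustration of the return, the snippet below computes G_t for a short, finite sequence of sampled rewards; the reward values are made up for the example.

```python
# Computing a discounted return G_t = sum_k gamma**k * R_{t+k+1}
# for a finite sequence of sampled rewards (illustrative numbers).
gamma = 0.9
rewards = [1.0, 0.0, 2.0, -1.0]                      # R_{t+1}, R_{t+2}, ...
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)                                             # 1 + 0 + 0.81*2 - 0.729 = 1.891
```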
Why discount?

Most Markov reward and decision processes are discounted. Why?


I Mathematically convenient to discount rewards
I Avoids infinite returns in cyclic Markov processes
I Immediate rewards may actually be more valuable (e.g., consider earning interest)
I Animal/human behaviour shows preference for immediate reward
I Sometimes we can use undiscounted Markov reward processes (i.e. γ = 1), e.g. if
all sequences terminate
I Note that reward and discount together determine the goal
Value Function

I The value function v (s) gives the long-term value of state s

vπ (s) = E [Gt | St = s, π]

I It can be defined recursively:

v_π(s) = E[R_{t+1} + γ G_{t+1} | S_t = s, π]
       = E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t ∼ π(S_t)]
       = Σ_a π(a | s) Σ_r Σ_{s'} p(r, s' | s, a) ( r + γ v_π(s') )

I The final step writes out the expectation explicitly


Action values
I We can define state-action values

qπ (s, a) = E [Gt | St = s, At = a, π]

I This implies

q_π(s, a) = E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a]
          = E[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
          = Σ_r Σ_{s'} p(r, s' | s, a) ( r + γ Σ_{a'} π(a' | s') q_π(s', a') )

I Note that
v_π(s) = Σ_a π(a | s) q_π(s, a) = E[q_π(S_t, A_t) | S_t = s, π], ∀s
Bellman Equation in Matrix Form

I The Bellman value equation, for a given policy π, can be expressed using matrices,

v = r + γ P^π v

where

v_i = v_π(s_i)
r_i = E[R_{t+1} | S_t = s_i, A_t ∼ π(S_t)]
P^π_{ij} = Σ_a π(a | s_i) p(s_j | s_i, a)
Bellman Equation in Matrix Form
I The Bellman equation, for a given policy π, can be expressed using matrices,

v = r + γ P^π v

I This is a linear equation that can be solved directly:

v = r + γ P^π v
(I − γ P^π) v = r
v = (I − γ P^π)^{−1} r

I Computational complexity is O(|S|^3), so this is only feasible for small problems (see the NumPy sketch after this list)


I There are iterative methods for larger problems
I Dynamic programming
I Monte-Carlo evaluation
I Temporal-Difference learning
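As a concrete illustration of the direct solution above, the following sketch solves v = (I − γP^π)^{−1} r for a small Markov reward process; the transition matrix and reward vector are arbitrary illustrative numbers, not an example from the lecture.

```python
import numpy as np

# Direct solution of the Bellman equation v = (I - gamma P^pi)^(-1) r.
gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.7, 0.2],
                 [0.0, 0.3, 0.7]])   # P^pi_ij = sum_a pi(a|s_i) p(s_j | s_i, a)
r_pi = np.array([1.0, 0.0, -1.0])    # r_i = E[R_{t+1} | S_t = s_i, A_t ~ pi(S_t)]

v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)   # O(|S|^3) linear solve
print(v)
```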
Optimal Value Function

Definition
The optimal state-value function v_*(s) is the maximum value function over all policies

v_*(s) = max_π v_π(s)

The optimal action-value function q_*(s, a) is the maximum action-value function over all policies

q_*(s, a) = max_π q_π(s, a)

I The optimal value function specifies the best possible performance in the MDP
I An MDP is “solved” when we know the optimal value function
Value Function

I Estimating vπ or qπ is called policy evaluation or, simply, prediction


I Estimating v∗ or q∗ is sometimes called control, because these can be used for
policy optimization
Bellman equations

I There are four main Bellman equations:

v_π(s) = E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t ∼ π(S_t)]
v_*(s) = max_a E[R_{t+1} + γ v_*(S_{t+1}) | S_t = s, A_t = a]
q_π(s, a) = E[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
q_*(s, a) = E[R_{t+1} + γ max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a]

There can be no policy with a higher value than v_*(s) = max_π v_π(s), ∀s
Bellman equations

I There are equivalences between state and action values


v_π(s) = Σ_a π(a | s) q_π(s, a)
v_*(s) = max_a q_*(s, a)
Optimal Policy

Define a partial ordering over policies


π ≥ π' ⟺ v_π(s) ≥ v_{π'}(s), ∀s

Theorem
For any Markov decision process
I There exists an optimal policy π ∗ that is better than or equal to all other policies,
π ∗ ≥ π, ∀π
(There can be more than one such optimal policy.)

I All optimal policies achieve the optimal value function, v_{π*}(s) = v_*(s)

I All optimal policies achieve the optimal action-value function, q_{π*}(s, a) = q_*(s, a)
Finding an Optimal Policy

An optimal policy can be found by maximising over q_*(s, a):

π_*(s, a) = 1 if a = argmax_{a∈A} q_*(s, a), and 0 otherwise

I There is always a deterministic optimal policy for any MDP


I If we know q ∗ (s, a), we immediately have the optimal policy
I There can be multiple optimal policies
I If multiple actions maximize q∗ (s, ·), we can also just pick any of these
(including stochastically)
Solving the Bellman Optimality Equation

I The Bellman optimality equation is non-linear


I Cannot use the same direct matrix solution as for policy evaluation (in general)
I Many iterative solution methods
I Using models / dynamic programming
I Value iteration
I Policy iteration
I Using samples
I Monte Carlo
I Q-learning
I Sarsa
Dynamic Programming
The 1950s were not good years for mathematical research. I felt I had to
shield the Air Force from the fact that I was really doing mathematics. What
title, what name, could I choose? I was interested in planning, in decision
making, in thinking. But planning is not a good word for various reasons. I
decided to use the word ‘programming.’ I wanted to get across the idea that
this was dynamic, this was time-varying—I thought, let’s kill two birds with
one stone. Let’s take a word that has a precise meaning, namely dynamic, in
the classical physical sense. It also is impossible to use the word, dynamic, in
a pejorative sense. Try thinking of some combination that will possibly give it
a pejorative meaning. It’s impossible. Thus, I thought dynamic programming
was a good name. It was something not even a Congressman could object to.
So I used it as an umbrella for my activities.
– Richard Bellman
(slightly paraphrased for conciseness, emphasis mine)
Dynamic programming

Dynamic programming refers to a collection of algorithms that can be used


to compute optimal policies given a perfect model of the environment as a
Markov decision process (MDP).

Sutton & Barto 2018


I We will discuss several dynamic programming methods to solve MDPs
I All such methods consist of two important parts:
policy evaluation and policy improvement
Policy evaluation

I We start by discussing how to estimate

vπ (s) = E [Rt+1 + γvπ (St+1 ) | s, π]

I Idea: turn this equality into an update


I First, initialize v0 , e.g., to zero
I Then, iterate
∀s : vk+1 (s) = E [Rt+1 + γvk (St+1 ) | s, π]
I Note: whenever v_{k+1}(s) = v_k(s), for all s, we must have found v_π (a small sketch of this iteration follows below)
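A minimal sketch of this iteration for a tabular MDP. The array conventions used here (P[a, s, s'] for transition probabilities, R[a, s] for expected rewards, pi[s, a] for the policy) are assumptions made for illustration, not notation from the lecture.

```python
import numpy as np

# Iterative policy evaluation for a tabular MDP.
# Assumed conventions: P[a, s, s'] = p(s' | s, a), R[a, s] = E[R | s, a], pi[s, a] = pi(a | s).
def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)                    # v_0 initialised to zero
    while True:
        # v_{k+1}(s) = sum_a pi(a|s) [ E[R|s,a] + gamma sum_s' p(s'|s,a) v_k(s') ]
        q = R + gamma * P @ v                 # q_k(s, a), stored with shape (n_actions, n_states)
        v_new = np.einsum('sa,as->s', pi, q)  # expectation over pi(a | s)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```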
Policy evaluation
I Does policy evaluation always converge?
I Answer: yes, under appropriate conditions (e.g., γ < 1)
I Simple proof-sketch (continuing, discounted case):

max_s |v_{k+1}(s) − v_π(s)| = max_s |E[R_{t+1} + γ v_k(S_{t+1}) | S_t = s, π] − E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, π]|
= max_s |E[γ v_k(S_{t+1}) − γ v_π(S_{t+1}) | S_t = s, π]|
= γ max_s |E[v_k(S_{t+1}) − v_π(S_{t+1}) | S_t = s, π]|
≤ γ max_s |v_k(s) − v_π(s)|

I Implies limk→∞ vk = vπ
I Finite-horizon episodic case is a bit harder, but also works
Policy evaluation: example

[Figures: iterative policy evaluation on the small gridworld, showing the value estimates under the random policy after successive sweeps and the corresponding greedy policies]
Policy Improvement

I The example already shows we can use evaluation to then improve our policy
I In fact, just being greedy with respect to the values of the random policy sufficed!
I That is not true in general
I Instead, we can iterate, using

∀s : π_new(s) = argmax_a q_π(s, a)
              = argmax_a E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a]

I Then, evaluate πnew and repeat


I One can show that vπnew (s) ≥ vπ (s), for all s
Policy Improvement

I One can show that vπnew (s) ≥ vπ (s), for all s


I Let’s assume that, at some point, we have vπnew (s) = vπ (s)
I That implies

v_{π_new}(s) = max_a E[R_{t+1} + γ v_{π_new}(S_{t+1}) | S_t = s, A_t = a]

I But that is the Bellman optimality equation!


I Therefore, π_new is either a strict improvement over π (v_{π_new}(s) > v_π(s) for some s), or π_new and π are both already optimal (v_{π_new} = v_π = v_*)
Policy Iteration

Policy evaluation: estimate v_π

Policy improvement: generate π' ≥ π

Repeat until the policy is stable. (A sketch combining these two steps is given below.)
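A sketch of the full loop, alternating exact policy evaluation (the linear solve from the matrix-form slide) with greedy improvement. The P[a, s, s'] / R[a, s] array conventions are the same assumptions as in the earlier sketches.

```python
import numpy as np

# Policy iteration: exact policy evaluation (linear solve) + greedy policy improvement.
def policy_iteration(P, R, gamma):
    n_actions, n_states, _ = P.shape
    pi = np.ones((n_states, n_actions)) / n_actions    # start from the uniformly random policy
    while True:
        # policy evaluation: solve v = r_pi + gamma P^pi v directly (fine for small problems)
        P_pi = np.einsum('sa,ast->st', pi, P)          # P^pi_{ss'} = sum_a pi(a|s) p(s'|s,a)
        r_pi = np.einsum('sa,as->s', pi, R)            # r_pi(s)   = sum_a pi(a|s) E[R|s,a]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # policy improvement: act greedily with respect to q_pi
        q = R + gamma * P @ v                          # shape (n_actions, n_states)
        pi_new = np.eye(n_actions)[np.argmax(q, axis=0)]
        if np.array_equal(pi_new, pi):                 # pi is greedy w.r.t. its own values: optimal
            return pi, v
        pi = pi_new
```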
Jack’s Car Rental

I States: Two locations, maximum of 20 cars at each


I Actions: Move up to 5 cars overnight (-$2 each)
I Reward: $10 for each available car rented, γ = 0.9
I Transitions: Cars returned and requested randomly
I Poisson distribution: n returns/requests with probability (λ^n / n!) e^{−λ}
I 1st location: average requests = 3, average returns = 3
I 2nd location: average requests = 4, average returns = 2
Jack’s Car Rental [results figure omitted]
Policy Iteration

I Does policy evaluation need to converge to v π ?


I Or should we stop when we are ‘close’ ?
(E.g., with a threshold on the change to the values)
I Or simply stop after k iterations of iterative policy evaluation?
I In the small gridworld k = 3 was sufficient to achieve optimal policy
I Why not update policy every iteration — i.e. stop after k = 1?
I This is equivalent to value iteration (up next)
Value Iteration

I We could take the Bellman optimality equation, and turn that into an update

∀s : v_{k+1}(s) ← max_a E[R_{t+1} + γ v_k(S_{t+1}) | S_t = s, A_t = a]

I This is equivalent to policy iteration, with k = 1 step of policy evaluation between each pair of (greedy) policy improvement steps (see the sketch below)
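A minimal value iteration sketch under the same assumed P[a, s, s'] / R[a, s] conventions (and assuming γ < 1 so the loop terminates): it applies the Bellman optimality update until the values stop changing, then extracts a greedy policy.

```python
import numpy as np

# Value iteration for a tabular MDP with assumed arrays P[a, s, s'] and R[a, s].
def value_iteration(P, R, gamma, tol=1e-8):
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # v_{k+1}(s) = max_a E[R_{t+1} + gamma v_k(S_{t+1}) | S_t = s, A_t = a]
        v_new = np.max(R + gamma * P @ v, axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    greedy = np.argmax(R + gamma * P @ v, axis=0)   # a greedy policy w.r.t. the final values
    return v, greedy
```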
Example: Shortest Path
[Figure: value iteration on a 4×4 shortest-path gridworld with the goal g in the top-left corner. Panels show the problem and the value estimates V1 through V7; the values converge to minus the number of steps to the goal from each cell.]
Synchronous Dynamic Programming Algorithms

Problem     Bellman Equation                                          Algorithm
Prediction  Bellman Expectation Equation                              Iterative Policy Evaluation
Control     Bellman Expectation Equation + Greedy Policy Improvement  Policy Iteration
Control     Bellman Optimality Equation                               Value Iteration

I Algorithms are based on the state-value function v_π(s) or v_*(s)

I Complexity O(m n^2) per iteration, for m actions and n states
I Could also apply to the action-value function q_π(s, a) or q_*(s, a)
I Complexity O(m^2 n^2) per iteration
Asynchronous Dynamic Programming

I DP methods described so far used synchronous updates (all states in parallel)


I Asynchronous DP backs up states individually, in any order
I Can significantly reduce computation
I Guaranteed to converge if all states continue to be selected
Asynchronous Dynamic Programming

Three simple ideas for asynchronous dynamic programming:


I In-place dynamic programming
I Prioritised sweeping
I Real-time dynamic programming
In-Place Dynamic Programming

I Synchronous value iteration stores two copies of value function

for all s in S: v_new(s) ← max_a E[R_{t+1} + γ v_old(S_{t+1}) | S_t = s, A_t = a]
v_old ← v_new

I In-place value iteration only stores one copy of value function

for all s in S: v(s) ← max_a E[R_{t+1} + γ v(S_{t+1}) | S_t = s, A_t = a]
Prioritised Sweeping

I Use magnitude of Bellman error to guide state selection, e.g.

| max_a E[R_{t+1} + γ v(S_{t+1}) | S_t = s, A_t = a] − v(s) |

I Backup the state with the largest remaining Bellman error


I Update Bellman error of affected states after each backup
I Requires knowledge of reverse dynamics (predecessor states)
I Can be implemented efficiently by maintaining a priority queue (see the sketch below)
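A simplified tabular sketch of this idea, using Python's heapq as the priority queue. It assumes the same P[a, s, s'] / R[a, s] arrays as the earlier sketches and is not a full implementation (stale queue entries are tolerated rather than removed).

```python
import heapq
import numpy as np

# Prioritised sweeping for tabular value iteration (assumed arrays P[a, s, s'], R[a, s]).
def prioritised_sweeping(P, R, gamma, n_backups=1000):
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    # predecessors[s2] = states s that can transition into s2 under some action (reverse dynamics)
    predecessors = [set(np.nonzero(P[:, :, s2].sum(axis=0))[0]) for s2 in range(n_states)]

    def bellman_error(s):
        # |max_a E[R + gamma v(S') | s, a] - v(s)|
        return abs(np.max(R[:, s] + gamma * P[:, s, :] @ v) - v[s])

    # max-heap on the Bellman error (heapq is a min-heap, so store negated errors)
    queue = [(-bellman_error(s), s) for s in range(n_states)]
    heapq.heapify(queue)
    for _ in range(n_backups):
        if not queue:
            break
        neg_err, s = heapq.heappop(queue)
        if -neg_err < 1e-10:                               # largest remaining error is (near) zero
            break
        v[s] = np.max(R[:, s] + gamma * P[:, s, :] @ v)    # full backup of the selected state
        for s_pred in predecessors[s]:                     # re-prioritise affected predecessor states
            heapq.heappush(queue, (-bellman_error(s_pred), s_pred))
    return v
```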
Real-Time Dynamic Programming

I Idea: only update states that are relevant to agent


I E.g., if the agent is in state St , update that state value, or states that it expects
to be in soon
Full-Width Backups

I Standard DP uses full-width backups


I For each backup (sync or async)
I Every successor state and action is considered
I Using the true model of transitions and the reward function
I DP is effective for medium-sized problems (millions of states)
I For large problems DP suffers from the curse of dimensionality
I Number of states n = |S| grows exponentially with the number of state variables
[Backup diagram omitted: v_{k+1}(s) is computed from v_k(s') over all actions a, rewards r, and successor states s']
I Even one full backup can be too expensive
Approximate Dynamic Programming
I Key idea: Approximate the value function
I Using a function approximator v_θ(s), with a parameter vector θ ∈ R^m
I The estimated value function at iteration k is vθk
I Use dynamic programming to compute vθk+1 from vθk
I e.g. ‘fitted value iteration’ repeats at each iteration k,
I Sample states S̃ ⊆ S (more simply, let S̃ = S)
I For each sampled state s ∈ S̃, compute target using Bellman optimality equation,

ṽ_k(s) = max_a E[R_{t+1} + γ v_{θ_k}(S_{t+1}) | S_t = s, A_t = a]

I Find θ_{k+1} by minimising the loss (e.g., with one gradient step; see the sketch below)


Σ_{s∈S̃} (v_θ(s) − ṽ_k(s))^2
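A minimal fitted value iteration sketch with a linear value function v_θ(s) = φ(s)·θ, taking S̃ = S. The feature matrix, learning rate, and number of iterations are illustrative assumptions.

```python
import numpy as np

# Fitted value iteration with a linear approximator v_theta(s) = phi(s) . theta.
# P[a, s, s'] and R[a, s] as in the earlier sketches; phi is an (n_states, m) feature matrix.
def fitted_value_iteration(P, R, phi, gamma, n_iterations=100, lr=0.1):
    n_states = phi.shape[0]
    theta = np.zeros(phi.shape[1])
    for _ in range(n_iterations):
        v = phi @ theta                                # current estimates v_{theta_k}(s)
        targets = np.max(R + gamma * P @ v, axis=0)    # Bellman optimality targets ~v_k(s)
        # one gradient step on sum_s (v_theta(s) - ~v_k(s))^2, evaluated at theta_k
        grad = 2 * phi.T @ (v - targets) / n_states
        theta = theta - lr * grad
    return theta
```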
Bootstrapping in Dynamic Programming

I Dynamic programming improves the estimate of the value at a state using the
estimate of the value function at subsequent states
I This idea is core to RL — it is called bootstrapping
I There is a theoretical danger of divergence when combining
1. Bootstrapping
2. General function approximation
3. Updating values for a state distribution that doesn’t match the transition dynamics
of the MDP
I This theoretical danger is rarely encountered in practice
Example of divergence with dynamic programming

I Tsitsiklis and Van Roy made an example where dynamic programming with linear function approximation can diverge. Consider the two-state example below, where the rewards are all zero, there are no decisions, and there is a single parameter θ for estimating the value.

[Figure: Tsitsiklis and Van Roy's counterexample to DP policy evaluation with least-squares function approximation. The first state has feature 1 (estimate θ) and always transitions to the second state, which has feature 2 (estimate 2θ) and terminates with probability ε, otherwise staying put.]

θ_{k+1} = argmin_θ Σ_{s∈S} (v_θ(s) − E[γ v_{θ_k}(S_{t+1}) | S_t = s])^2
        = argmin_θ (θ − γ 2θ_k)^2 + (2θ − γ (1 − ε) 2θ_k)^2
        = ((6 − 4ε) / 5) γ θ_k

I What is lim_{k→∞} θ_k when θ_0 = 1, ε = 1/8, and γ = 1? (The short numerical check below iterates this update.)
I This is only a problem when we update the states, e.g., synchronously, without looking at the time an agent would spend in each state
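A short numerical check of the question above: iterating θ_{k+1} = ((6 − 4ε)/5) γ θ_k with ε = 1/8 and γ = 1 multiplies θ by 1.1 each step, so the parameter diverges even though the true values are all zero.

```python
# Iterate theta_{k+1} = ((6 - 4*eps) / 5) * gamma * theta_k from the counterexample.
eps, gamma, theta = 1 / 8, 1.0, 1.0
for k in range(50):
    theta = (6 - 4 * eps) / 5 * gamma * theta
print(theta)   # about 117 (= 1.1**50): diverging, although the true values are all 0
```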
Questions?

The only stupid question is the one you were afraid to ask but never did.
-Rich Sutton

For questions that arise outside of class, please use Moodle
