Lecture 3:
Markov Decision Processes and Dynamic Programming
Hado van Hasselt
January 30, 2018, UCL
Background
Sutton & Barto 2018, Chapters 3 and 4
Recap
I Reinforcement learning is the science of learning to make decisions
I Agents can learn a policy, value function and/or a model
I The general problem involves taking into account time and consequences
I Decisions affect the reward, the agent state, and environment state
This Lecture
I Last lecture: multiple actions, but only one state—no model
I This lecture:
I Formalize the problem with full sequential structure
I Discuss a first class of solution methods, which assume the true model is given
I These methods are called dynamic programming
I Next lectures: use similar ideas, but use sampling instead of true model
Formalizing the RL interface
I We discuss a mathematical formulation of the agent-environment interaction
I This is called a Markov decision process
I We can then talk clearly about the objective and how to reach it
Introduction to MDPs
I Markov decision processes (MDPs) formally describe an environment
I For now, assume the environment is fully observable:
the current observation contains all relevant information
I Almost all RL problems can be formalized as MDPs, e.g.,
I Optimal control primarily deals with continuous MDPs
I Partially observable problems can be converted into MDPs
I Bandits are MDPs with one state
Markov Property
“The future is independent of the past given the present”
Consider a sequence of random states, $\{S_t\}_{t \in \mathbb{N}}$, indexed by time.
Definition
A state $s$ has the Markov property when, for all states $s' \in \mathcal{S}$ and all rewards $r \in \mathcal{R}$,
$p(R_{t+1} = r, S_{t+1} = s' \mid S_t = s) = p(R_{t+1} = r, S_{t+1} = s' \mid S_1, \ldots, S_{t-1}, S_t = s)$
for all possible histories $S_1, \ldots, S_{t-1}$
I The state captures all relevant information from the history
I Once the state is known, the history may be thrown away
I The state is a sufficient statistic of the past
Markov Decision Process
I A Markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, p, \gamma)$, where
I $\mathcal{S}$ is the set of all possible states
I $\mathcal{A}$ is the set of all possible actions (e.g., motor controls)
I $p(r, s' \mid s, a)$ is the joint probability of a reward $r$ and next state $s'$, given a state $s$ and action $a$
I $\gamma \in [0, 1]$ is a discount factor that trades off later rewards against earlier ones
I $p$ defines the dynamics of the problem
I The rewards and discount together define the goal
I Sometimes it is useful to marginalize out the state transitions or expected reward:
$p(s' \mid s, a) = \sum_r p(s', r \mid s, a) \qquad \mathbb{E}[R \mid s, a] = \sum_r \sum_{s'} r \, p(r, s' \mid s, a)$
Example: cleaning robot
I Consider a robot that cleans cans
I Two states: high battery charge or low battery charge
I Actions: {wait, search} in high, {wait, search, recharge} in low
I Dynamics may be stochastic
I p(St+1 = high | St = high, At = search) = α
I p(St+1 = low | St = high, At = search) = 1 − α
I Reward could be expected number of collected cans (deterministic), or actual
number of collected cans (stochastic)
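To make the example concrete, the dynamics $p(r, s' \mid s, a)$ of a small MDP like this can be written out as a table. Below is a minimal sketch in Python; the probabilities alpha and beta and the reward values are placeholder parameters chosen for illustration, not numbers given in the lecture.

```python
# A minimal tabular encoding of the cleaning-robot MDP.
# alpha, beta and the rewards below are placeholders (assumptions),
# loosely following the can-collecting robot example in Sutton & Barto.
alpha, beta = 0.9, 0.6                       # probabilities of keeping the current charge level
r_search, r_wait, r_rescue = 2.0, 1.0, -3.0  # rewards for searching, waiting, and running flat

# dynamics[(state, action)] = list of (probability, next_state, reward)
dynamics = {
    ("high", "search"):  [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("high", "wait"):    [(1.0, "high", r_wait)],
    ("low", "search"):   [(beta, "low", r_search), (1 - beta, "high", r_rescue)],
    ("low", "wait"):     [(1.0, "low", r_wait)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}
```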
Example: robot MDP
[Figure: transition graph of the cleaning-robot MDP, showing the states, actions, transition probabilities, and rewards]
Return
I Acting in an MDP results in returns $G_t$: the total discounted reward from time step $t$,
$G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
I This is a random variable that depends on the MDP and the policy
I The discount $\gamma \in [0, 1]$ is the present value of future rewards
I The marginal value of receiving reward $R$ after $k + 1$ time steps is $\gamma^k R$
I For $\gamma < 1$, immediate rewards are more important than delayed rewards
I $\gamma$ close to 0 leads to "myopic" evaluation
I $\gamma$ close to 1 leads to "far-sighted" evaluation
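As a small illustration (not part of the slides), the return of a finite reward sequence can be computed by accumulating backwards; the rewards in the usage line are arbitrary.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):      # accumulate from the end: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62
```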
Why discount?
Most Markov reward and decision processes are discounted. Why?
I Mathematically convenient to discount rewards
I Avoids infinite returns in cyclic Markov processes
I Immediate rewards may actually be more valuable (e.g., consider earning interest)
I Animal/human behaviour shows preference for immediate reward
I Sometimes we can use undiscounted Markov reward processes (i.e. γ = 1), e.g. if
all sequences terminate
I Note that reward and discount together determine the goal
Value Function
I The value function $v(s)$ gives the long-term value of state $s$:
$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s, \pi]$
I It can be defined recursively:
$v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, \pi]$
$= \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t \sim \pi(S_t)]$
$= \sum_a \pi(a \mid s) \sum_r \sum_{s'} p(r, s' \mid s, a) \left( r + \gamma v_\pi(s') \right)$
I The final step writes out the expectation explicitly
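The last line of this derivation maps directly onto code. Below is a minimal sketch, assuming the tabular model dynamics[(s, a)] of (probability, next state, reward) triples sketched earlier and a policy policy[s] mapping actions to probabilities; both representations are assumptions for illustration.

```python
def bellman_expectation(s, v, policy, dynamics, gamma):
    """One-step lookahead: v_pi(s) = sum_a pi(a|s) sum_{s',r} p(r,s'|s,a) (r + gamma * v_pi(s'))."""
    total = 0.0
    for a, pi_a in policy[s].items():                   # pi(a | s)
        for prob, s_next, r in dynamics[(s, a)]:        # p(r, s' | s, a)
            total += pi_a * prob * (r + gamma * v[s_next])
    return total
```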
Action values
I We can define state-action values
$q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a, \pi]$
I This implies
$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$
$= \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
$= \sum_r \sum_{s'} p(r, s' \mid s, a) \left( r + \gamma \sum_{a'} \pi(a' \mid s') q_\pi(s', a') \right)$
I Note that
$v_\pi(s) = \sum_a \pi(a \mid s) q_\pi(s, a) = \mathbb{E}[q_\pi(S_t, A_t) \mid S_t = s, \pi], \quad \forall s$
Bellman Equation in Matrix Form
I The Bellman value equation, for a given policy $\pi$, can be expressed using matrices,
$\mathbf{v} = \mathbf{r} + \gamma P^\pi \mathbf{v}$
where
$\mathbf{v}_i = v(s_i)$
$\mathbf{r}_i = \mathbb{E}[R_{t+1} \mid S_t = s_i, A_t \sim \pi(S_t)]$
$P^\pi_{ij} = \sum_a \pi(a \mid s_i) p(s_j \mid s_i, a)$
Bellman Equation in Matrix Form
I The Bellman equation, for a given policy π, can be expressed using matrices,
$\mathbf{v} = \mathbf{r} + \gamma P^\pi \mathbf{v}$
I This is a linear equation that can be solved directly (see the sketch after this list):
$\mathbf{v} = \mathbf{r} + \gamma P^\pi \mathbf{v}$
$(I - \gamma P^\pi) \mathbf{v} = \mathbf{r}$
$\mathbf{v} = (I - \gamma P^\pi)^{-1} \mathbf{r}$
I The computational complexity is $O(|\mathcal{S}|^3)$, so this is only feasible for small problems
I There are iterative methods for larger problems
I Dynamic programming
I Monte-Carlo evaluation
I Temporal-Difference learning
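Here is the small numerical sketch referenced above: the direct linear solve with NumPy, using an arbitrary two-state Markov reward process (the reward vector and transition matrix are made-up numbers).

```python
import numpy as np

gamma = 0.9
# Expected immediate rewards r_i and policy transition matrix P_pi for a toy 2-state problem
r = np.array([1.0, 0.5])
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])

# Solve (I - gamma * P_pi) v = r directly, rather than forming the inverse explicitly
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r)
print(v)   # exact v_pi for this toy problem
```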
Optimal Value Function
Definition
The optimal state-value function $v_*(s)$ is the maximum value function over all policies:
$v_*(s) = \max_\pi v_\pi(s)$
The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies:
$q_*(s, a) = \max_\pi q_\pi(s, a)$
I The optimal value function specifies the best possible performance in the MDP
I An MDP is “solved” when we know the optimal value function
Value Function
I Estimating vπ or qπ is called policy evaluation or, simply, prediction
I Estimating v∗ or q∗ is sometimes called control, because these can be used for
policy optimization
Bellman equations
I There are four main Bellman equations:
$v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t \sim \pi(S_t)]$
$v_*(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$
$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a]$
There can be no policy with a higher value than $v_*(s) = \max_\pi v_\pi(s), \; \forall s$
Bellman equations
I There are equivalences between state and action values
$v_\pi(s) = \sum_a \pi(a \mid s) q_\pi(s, a)$
$v_*(s) = \max_a q_*(s, a)$
Optimal Policy
Define a partial ordering over policies:
$\pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s), \; \forall s$
Theorem
For any Markov decision process,
I There exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \ge \pi, \; \forall \pi$
(There can be more than one such optimal policy.)
I All optimal policies achieve the optimal value function, $v_{\pi_*}(s) = v_*(s)$
I All optimal policies achieve the optimal action-value function, $q_{\pi_*}(s, a) = q_*(s, a)$
Finding an Optimal Policy
An optimal policy can be found by maximising over $q_*(s, a)$:
$\pi_*(s, a) = \begin{cases} 1 & \text{if } a = \operatorname{argmax}_{a \in \mathcal{A}} q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$
I There is always a deterministic optimal policy for any MDP
I If we know q ∗ (s, a), we immediately have the optimal policy
I There can be multiple optimal policies
I If multiple actions maximize q∗ (s, ·), we can also just pick any of these
(including stochastically)
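A minimal sketch of reading off a greedy policy from an action-value table; the dict-of-dicts representation and the numbers are assumptions for illustration.

```python
def greedy_policy(q):
    """Return a deterministic policy pi(s) = argmax_a q(s, a)."""
    return {s: max(q_s, key=q_s.get) for s, q_s in q.items()}

# Hypothetical optimal action values for the two-state robot example
q_star = {"high": {"search": 10.0, "wait": 7.0},
          "low":  {"search": 4.0, "wait": 5.0, "recharge": 6.0}}
print(greedy_policy(q_star))   # {'high': 'search', 'low': 'recharge'}
```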
Solving the Bellman Optimality Equation
I The Bellman optimality equation is non-linear
I Cannot use the same direct matrix solution as for policy evaluation (in general)
I Many iterative solution methods
I Using models / dynamic programming
I Value iteration
I Policy iteration
I Using samples
I Monte Carlo
I Q-learning
I Sarsa
Dynamic Programming
The 1950s were not good years for mathematical research. I felt I had to
shield the Air Force from the fact that I was really doing mathematics. What
title, what name, could I choose? I was interested in planning, in decision
making, in thinking. But planning is not a good word for various reasons. I
decided to use the word ‘programming.’ I wanted to get across the idea that
this was dynamic, this was time-varying—I thought, let’s kill two birds with
one stone. Let’s take a word that has a precise meaning, namely dynamic, in
the classical physical sense. It also is impossible to use the word, dynamic, in
a pejorative sense. Try thinking of some combination that will possibly give it
a pejorative meaning. It’s impossible. Thus, I thought dynamic programming
was a good name. It was something not even a Congressman could object to.
So I used it as an umbrella for my activities.
– Richard Bellman
(slightly paraphrased for conciseness, emphasis mine)
Dynamic programming
Dynamic programming refers to a collection of algorithms that can be used
to compute optimal policies given a perfect model of the environment as a
Markov decision process (MDP).
Sutton & Barto 2018
I We will discuss several dynamic programming methods to solve MDPs
I All such methods consist of two important parts:
policy evaluation and policy improvement
Policy evaluation
I We start by discussing how to estimate
vπ (s) = E [Rt+1 + γvπ (St+1 ) | s, π]
I Idea: turn this equality into an update
I First, initialize v0 , e.g., to zero
I Then, iterate
∀s : vk+1 (s) = E [Rt+1 + γvk (St+1 ) | s, π]
I Note: whenever vk+1 (s) = vk (s), for all s, we must have found vπ
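A sketch of this iteration in tabular form, reusing the assumed dynamics[(s, a)] model and policy[s] representation from the earlier sketches; the stopping threshold is an arbitrary choice.

```python
def policy_evaluation(states, policy, dynamics, gamma, theta=1e-8):
    """Iterate v_{k+1}(s) = E[R_{t+1} + gamma * v_k(S_{t+1}) | s, pi] until convergence."""
    v = {s: 0.0 for s in states}                      # initialize v_0 to zero
    while True:
        delta, v_new = 0.0, {}
        for s in states:
            v_new[s] = sum(pi_a * prob * (r + gamma * v[s_next])
                           for a, pi_a in policy[s].items()
                           for prob, s_next, r in dynamics[(s, a)])
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < theta:                             # values stopped changing: we found v_pi
            return v
```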
Policy evaluation
I Does policy evaluation always converge?
I Answer: yes, under appropriate conditions (e.g., γ < 1)
I Simple proof-sketch (continuing, discounted case):
$\max_s |v_{k+1}(s) - v_\pi(s)| = \max_s \left| \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, \pi] - \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, \pi] \right|$
$= \max_s \left| \mathbb{E}[\gamma v_k(S_{t+1}) - \gamma v_\pi(S_{t+1}) \mid S_t = s, \pi] \right|$
$= \gamma \max_s \left| \mathbb{E}[v_k(S_{t+1}) - v_\pi(S_{t+1}) \mid S_t = s, \pi] \right|$
$\le \gamma \max_s |v_k(s) - v_\pi(s)|$
I This implies $\lim_{k \to \infty} v_k = v_\pi$
I Finite-horizon episodic case is a bit harder, but also works
Policy evaluation
[Figure: iterative policy evaluation on the small gridworld example, showing the value estimates $v_k$ for the uniformly random policy and the corresponding greedy policies]
Policy Improvement
I The example already shows we can use evaluation to then improve our policy
I In fact, just being greedy with respect to the values of the random policy sufficed!
I That is not true in general
I Instead, we can iterate, using
$\forall s: \; \pi_{new}(s) = \operatorname{argmax}_a q_\pi(s, a) = \operatorname{argmax}_a \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$
I Then, evaluate πnew and repeat
I One can show that vπnew (s) ≥ vπ (s), for all s
Policy Improvement
I One can show that vπnew (s) ≥ vπ (s), for all s
I Let’s assume that, at some point, we have vπnew (s) = vπ (s)
I That implies
$v_{\pi_{new}}(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_{\pi_{new}}(S_{t+1}) \mid S_t = s, A_t = a]$
I But that is the Bellman optimality equation!
I Therefore, πnew must either be an improvement (when πnew > π), or be optimal
(when πnew = π)
Policy Iteration
Policy evaluation: estimate $v_\pi$
Policy improvement: generate $\pi' \ge \pi$
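Combining the two steps gives a sketch of policy iteration; it assumes the policy_evaluation helper sketched earlier, a dict actions[s] listing the available actions per state, and the same tabular model representation.

```python
def policy_iteration(states, actions, dynamics, gamma):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    policy = {s: {actions[s][0]: 1.0} for s in states}   # arbitrary initial deterministic policy
    while True:
        v = policy_evaluation(states, policy, dynamics, gamma)
        stable = True
        for s in states:
            # greedy improvement: argmax_a E[R_{t+1} + gamma * v(S_{t+1}) | s, a]
            q = {a: sum(p * (r + gamma * v[s_next]) for p, s_next, r in dynamics[(s, a)])
                 for a in actions[s]}
            best = max(q, key=q.get)
            if policy[s] != {best: 1.0}:
                policy[s] = {best: 1.0}
                stable = False
        if stable:                                       # greedy policy no longer changes: optimal
            return policy, v
```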
Jack’s Car Rental
I States: Two locations, maximum of 20 cars at each
I Actions: Move up to 5 cars overnight (-$2 each)
I Reward: $10 for each available car rented, γ = 0.9
I Transitions: Cars returned and requested randomly
I Poisson distribution: $n$ returns/requests with probability $\frac{\lambda^n}{n!} e^{-\lambda}$ (see the sketch after this list)
I 1st location: average requests = 3, average returns = 3
I 2nd location: average requests = 4, average returns = 2
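For reference, the Poisson probabilities used to model requests and returns can be computed directly; this small sketch is generic and not tied to the lecture's implementation.

```python
import math

def poisson_pmf(n, lam):
    """Probability of n events when the average rate is lam: lam^n / n! * exp(-lam)."""
    return lam ** n / math.factorial(n) * math.exp(-lam)

print(poisson_pmf(3, 3.0))   # probability of exactly 3 rental requests at the 1st location
```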
Jack's Car Rental
[Figure: sequence of policies found by policy iteration, and the final value function]
Policy Iteration
I Does policy evaluation need to converge to v π ?
I Or should we stop when we are ‘close’ ?
(E.g., with a threshold on the change to the values)
I Or simply stop after k iterations of iterative policy evaluation?
I In the small gridworld k = 3 was sufficient to achieve optimal policy
I Why not update policy every iteration — i.e. stop after k = 1?
I This is equivalent to value iteration (up next)
Value Iteration
I We could take the Bellman optimality equation, and turn that into an update
$\forall s: \; v_{k+1}(s) \leftarrow \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a]$
I This is equivalent to policy iteration, with k = 1 step of policy evaluation between
each two (greedy) policy improvement steps
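A sketch of synchronous value iteration over the same assumed tabular model representation, keeping two value tables per sweep (the in-place variant appears later).

```python
def value_iteration(states, actions, dynamics, gamma, theta=1e-8):
    """Iterate v_{k+1}(s) = max_a E[R_{t+1} + gamma * v_k(S_{t+1}) | s, a] until convergence."""
    v = {s: 0.0 for s in states}
    while True:
        v_new = {s: max(sum(p * (r + gamma * v[s_next]) for p, s_next, r in dynamics[(s, a)])
                        for a in actions[s])
                 for s in states}
        if max(abs(v_new[s] - v[s]) for s in states) < theta:
            return v_new
        v = v_new
```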
Example: Shortest Path
[Figure: value iteration on a small shortest-path gridworld with goal state g; the value estimates $V_1$ through $V_7$ converge to minus the number of steps to the goal]
Synchronous Dynamic Programming Algorithms
Problem      Bellman Equation                                          Algorithm
Prediction   Bellman Expectation Equation                              Iterative Policy Evaluation
Control      Bellman Expectation Equation + Greedy Policy Improvement  Policy Iteration
Control      Bellman Optimality Equation                               Value Iteration

I Algorithms are based on the state-value function $v_\pi(s)$ or $v_*(s)$
I Complexity $O(m n^2)$ per iteration, for $m$ actions and $n$ states
I Could also apply to the action-value function $q_\pi(s, a)$ or $q_*(s, a)$
I Complexity $O(m^2 n^2)$ per iteration
Asynchronous Dynamic Programming
I DP methods described so far used synchronous updates (all states in parallel)
I Asynchronous DP backs up states individually, in any order
I Can significantly reduce computation
I Guaranteed to converge if all states continue to be selected
Asynchronous Dynamic Programming
Three simple ideas for asynchronous dynamic programming:
I In-place dynamic programming
I Prioritised sweeping
I Real-time dynamic programming
In-Place Dynamic Programming
I Synchronous value iteration stores two copies of the value function:
for all $s \in \mathcal{S}$: $\; v_{new}(s) \leftarrow \max_a \mathbb{E}[R_{t+1} + \gamma v_{old}(S_{t+1}) \mid S_t = s, A_t = a]$
then $v_{old} \leftarrow v_{new}$
I In-place value iteration only stores one copy of the value function:
for all $s \in \mathcal{S}$: $\; v(s) \leftarrow \max_a \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s, A_t = a]$
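A sketch of the in-place variant: a single value table is overwritten, so later updates in the same sweep already use the newest values (same assumed model representation as before; the fixed number of sweeps is an arbitrary choice).

```python
def in_place_value_iteration(states, actions, dynamics, gamma, sweeps=100):
    """In-place value iteration: one value table, updated one state at a time."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:                 # states updated in place, in any order
            v[s] = max(sum(p * (r + gamma * v[s_next]) for p, s_next, r in dynamics[(s, a)])
                       for a in actions[s])
    return v
```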
Prioritised Sweeping
I Use the magnitude of the Bellman error to guide state selection, e.g.,
$\left| \max_a \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s, A_t = a] - v(s) \right|$
(a rough sketch follows after this list)
I Backup the state with the largest remaining Bellman error
I Update Bellman error of affected states after each backup
I Requires knowledge of reverse dynamics (predecessor states)
I Can be implemented efficiently by maintaining a priority queue
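A rough sketch of prioritised sweeping for value iteration, using a max-priority queue keyed by Bellman error and a precomputed predecessor map; these implementation details are assumptions, not specifics given on the slide.

```python
import heapq

def prioritised_sweeping(states, actions, dynamics, gamma, theta=1e-8, max_backups=10000):
    """Back up the state with (approximately) the largest remaining Bellman error first."""
    v = {s: 0.0 for s in states}
    # Reverse dynamics: which states can transition into each state
    predecessors = {s: set() for s in states}
    for (s, a), outcomes in dynamics.items():
        for _, s_next, _ in outcomes:
            predecessors[s_next].add(s)

    def bellman_error(s):
        target = max(sum(p * (r + gamma * v[s_next]) for p, s_next, r in dynamics[(s, a)])
                     for a in actions[s])
        return abs(target - v[s]), target

    # heapq is a min-heap, so store negated errors to pop the largest error first
    queue = [(-bellman_error(s)[0], s) for s in states]
    heapq.heapify(queue)
    for _ in range(max_backups):
        if not queue:
            break
        _, s = heapq.heappop(queue)
        err, target = bellman_error(s)    # recompute: queued priorities may be stale
        if err < theta:
            continue
        v[s] = target                     # back up this state
        for pred in predecessors[s]:      # this backup changes the Bellman error of predecessors
            heapq.heappush(queue, (-bellman_error(pred)[0], pred))
    return v
```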
Real-Time Dynamic Programming
I Idea: only update states that are relevant to agent
I E.g., if the agent is in state St , update that state value, or states that it expects
to be in soon
Full-Width Backups
I Standard DP uses full-width backups
I For each backup (sync or async):
I Every successor state and action is considered
I Using the true model of transitions and rewards
[Backup diagram: state $s$ with value $v_{k+1}(s)$, branching over actions $a$ and rewards $r$ to successor states $s'$ with values $v_k(s')$]
I DP is effective for medium-sized problems (millions of states)
I For large problems DP suffers from the curse of dimensionality
I The number of states $n = |\mathcal{S}|$ grows exponentially with the number of state variables
I Even one full backup can be too expensive
Approximate Dynamic Programming
I Key idea: Approximate the value function
I Using a function approximator $v_\theta(s)$, with a parameter vector $\theta \in \mathbb{R}^m$
I The estimated value function at iteration $k$ is $v_{\theta_k}$
I Use dynamic programming to compute $v_{\theta_{k+1}}$ from $v_{\theta_k}$
I E.g., 'fitted value iteration' repeats at each iteration $k$:
I Sample states $\tilde{\mathcal{S}} \subseteq \mathcal{S}$ (more simply, let $\tilde{\mathcal{S}} = \mathcal{S}$)
I For each sampled state $s \in \tilde{\mathcal{S}}$, compute a target using the Bellman optimality equation,
$\tilde{v}_k(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_{\theta_k}(S_{t+1}) \mid S_t = s, A_t = a]$
I Find $\theta_{k+1}$ by minimizing the loss (e.g., with one gradient step)
$\sum_{s \in \tilde{\mathcal{S}}} \left( v_\theta(s) - \tilde{v}_k(s) \right)^2$
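A minimal sketch of one fitted value iteration step with a linear value function $v_\theta(s) = \phi(s)^\top \theta$ fitted by least squares; the features and expected_update functions are assumed to be provided and are placeholders for illustration.

```python
import numpy as np

def fitted_value_iteration_step(theta, sampled_states, features, expected_update):
    """One fitted value iteration step with a linear function approximator.

    features(s)           -> NumPy feature vector phi(s)                (assumed given)
    expected_update(s, v) -> max_a E[R + gamma * v(S') | S=s, A=a]      (assumed given, uses the true model)
    """
    def v(s):                                           # current estimate v_{theta_k}
        return features(s) @ theta

    phi = np.stack([features(s) for s in sampled_states])
    targets = np.array([expected_update(s, v) for s in sampled_states])
    # Least-squares fit of theta_{k+1} to the bootstrapped targets
    theta_next, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return theta_next
```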
Bootstrapping in Dynamic Programming
I Dynamic programming improves the estimate of the value at a state using the
estimate of the value function at subsequent states
I This idea is core to RL — it is called bootstrapping
I There is a theoretical danger of divergence when combining
1. Bootstrapping
2. General function approximation
3. Updating values for a state distribution that doesn’t match the transition dynamics
of the MDP
I This theoretical danger is rarely encountered in practice
Example of divergence with dynamic programming
I Tsitsiklis and Van Roy made an example where dynamic programming with linear function approximation can diverge. Consider the two-state example below, where the rewards are all zero, there are no decisions, and there is a single parameter $\theta$ for estimating the values.
[Figure: Tsitsiklis and Van Roy's counterexample to DP policy evaluation with least-squares linear approximation; the first state has estimated value $\theta$ and always transitions to the second state, which has estimated value $2\theta$ and transitions to itself with probability $1 - \epsilon$]
$\theta_{k+1} = \operatorname{argmin}_\theta \sum_{s \in \mathcal{S}} \left( v_\theta(s) - \mathbb{E}[v_{\theta_k}(S_{t+1}) \mid S_t = s] \right)^2$
$= \operatorname{argmin}_\theta \, (\theta - \gamma 2\theta_k)^2 + (2\theta - \gamma(1 - \epsilon) 2\theta_k)^2$
$= \frac{6 - 4\epsilon}{5} \gamma \theta_k$
I What is $\lim_{k \to \infty} \theta_k$ when $\theta_0 = 1$, $\epsilon = \frac{1}{8}$, and $\gamma = 1$?
I This is only a problem when we update the states, e.g., synchronously, without looking at the time an agent would spend in each state
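To see the answer numerically, one can simply iterate the recursion $\theta_{k+1} = \frac{6 - 4\epsilon}{5} \gamma \theta_k$; with $\epsilon = \frac{1}{8}$ and $\gamma = 1$ the multiplier is $1.1 > 1$, so the parameter grows without bound.

```python
theta, epsilon, gamma = 1.0, 1 / 8, 1.0
for k in range(1, 21):
    theta = (6 - 4 * epsilon) / 5 * gamma * theta   # theta_{k+1} = 1.1 * theta_k
    print(k, theta)                                 # diverges: the value estimates blow up
```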
Questions?
The only stupid question is the one you were afraid to ask but never did.
-Rich Sutton
For questions that arise outside of class, please use Moodle