Lecture 7: Temporal-Difference Learning

Shiyu Zhao

Department of Artificial Intelligence


Westlake University
Outline

Algorithms/Methods

[Course map figure: Chapter 1: Basic Concepts; Chapter 2: Bellman Equation; Chapter 3: Bellman Optimality Equation (fundamental tools); Chapter 4: Value Iteration & Policy Iteration (with model); Chapter 5: Monte Carlo Methods (without model); Chapter 6: Stochastic Approximation; Chapter 7: Temporal-Difference Methods; Chapter 8: Value Function Methods (from tabular representation to function representation); Chapter 9: Policy Gradient Methods (policy-based); Chapter 10: Actor-Critic Methods (value-based plus policy-based).]
Shiyu Zhao 1 / 55
Introduction

• This lecture introduces temporal-difference (TD) learning, which is one of


the most well-known methods in reinforcement learning (RL).
• Monte Carlo (MC) learning was the first model-free method we studied; TD learning is
the second. TD has some advantages compared to MC.
• We will see how the stochastic approximation methods studied in the last
lecture are useful.

Shiyu Zhao 2 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 3 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 4 / 55
Motivating example: stochastic algorithms

We next consider some stochastic problems and show how to use the RM
algorithm to solve them.
First, revisit the mean estimation problem: calculate

w = E[X]

based on some iid samples {x} of X. We studied it in the last lecture.


• By writing g(w) = w − E[X], we can reformulate the problem to a
root-finding problem
g(w) = 0

• Since we can only obtain samples {x} of X, the noisy observation is

  g̃(w, η) ≜ w − x = (w − E[X]) + (E[X] − x) = g(w) + η

• According to the last lecture, we know the RM algorithm for solving g(w) = 0 is

  wk+1 = wk − αk g̃(wk, ηk) = wk − αk(wk − xk)
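
A minimal Python sketch of this RM iteration for mean estimation; the sampling distribution and the step size αk = 1/k are illustrative choices, not from the lecture.

```python
# Minimal sketch (not from the lecture): the RM iteration
# w_{k+1} = w_k - a_k (w_k - x_k) for mean estimation, with a_k = 1/k.
import random

random.seed(0)
w = 0.0                         # initial guess of E[X]
for k in range(1, 10001):
    x = random.gauss(5.0, 2.0)  # i.i.d. sample of X (hypothetical distribution)
    alpha = 1.0 / k             # satisfies sum(a_k) = inf, sum(a_k^2) < inf
    w -= alpha * (w - x)        # drive w toward the root of g(w) = w - E[X]
print(w)                        # close to the true mean 5.0
```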

Shiyu Zhao 5 / 55
Motivating example: stochastic algorithms

Second, consider a little more complex problem. That is to estimate the


mean of a function v(X),

w = E[v(X)],

based on some iid random samples {x} of X.

• To solve this problem, we define

  g(w) = w − E[v(X)]
  g̃(w, η) ≜ w − v(x) = (w − E[v(X)]) + (E[v(X)] − v(x)) = g(w) + η.

• Then, the problem becomes a root-finding problem: g(w) = 0. The corresponding RM algorithm is

  wk+1 = wk − αk g̃(wk, ηk) = wk − αk[wk − v(xk)]

Shiyu Zhao 6 / 55
Motivating example: stochastic algorithms

Third, consider an even more complex problem: calculate

w = E[R + γv(X)],

where R, X are random variables, γ is a constant, and v(·) is a function.

• Suppose we can obtain samples {x} and {r} of X and R. We define

  g(w) = w − E[R + γv(X)],
  g̃(w, η) ≜ w − [r + γv(x)]
          = (w − E[R + γv(X)]) + (E[R + γv(X)] − [r + γv(x)])
          = g(w) + η.

• Then, the problem becomes a root-finding problem: g(w) = 0. The corresponding RM algorithm is

  wk+1 = wk − αk g̃(wk, ηk) = wk − αk[wk − (rk + γv(xk))]

This algorithm looks like TD algorithms as shown later.
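
A minimal sketch of this RM iteration in Python, assuming hypothetical distributions for R and X and an arbitrary fixed function v(·):

```python
# Minimal sketch (hypothetical distributions and v): the RM iteration
# w_{k+1} = w_k - a_k [w_k - (r_k + gamma * v(x_k))] for w = E[R + gamma*v(X)].
import random

random.seed(0)
gamma = 0.9
v = lambda x: x * x             # some fixed, known function v(.)
w = 0.0
for k in range(1, 10001):
    r = random.gauss(1.0, 0.5)  # sample of R
    x = random.gauss(0.0, 1.0)  # sample of X
    alpha = 1.0 / k
    w -= alpha * (w - (r + gamma * v(x)))
print(w)                        # close to E[R] + gamma*E[X^2] = 1 + 0.9 = 1.9
```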


Shiyu Zhao 7 / 55
Motivating example: stochastic algorithms

Quick summary:
• The above three examples become more and more complex.
• They can all be solved by the RM algorithm.
• We will see that the TD algorithms have similar expressions.

Shiyu Zhao 8 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 9 / 55
TD learning of state values – Algorithm description

Problem statement:
• Given policy π, the aim is to estimate the state values {vπ (s)}s∈S under π.
• Experience samples: (s0 , r1 , s1 , . . . , st , rt+1 , st+1 , . . . ) or {(st , rt+1 , st+1 )}t
generated by π.
Important notation (the quantities on the left are estimates of the true state values on the right):

v(s) −→ vπ(s)
v(st) −→ vπ(st)
vt(st) −→ vπ(st)

Shiyu Zhao 10 / 55
TD learning of state values – Algorithm description

The TD learning algorithm is

  vt+1(st) = vt(st) − αt(st)[vt(st) − (rt+1 + γvt(st+1))],   (1)
  vt+1(s) = vt(s),  ∀s ≠ st,   (2)

where t = 0, 1, 2, . . . .

Here, vt (st ) is the estimated state value of vπ (st ); αt (st ) is the learning rate of
st at time t.
• At time t, only the value of the visited state st is updated whereas the values
of the unvisited states s 6= st remain unchanged.
• The update in (2) will be omitted when the context is clear.
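
A minimal Python sketch of update (1)-(2), assuming a hypothetical environment interface (env.reset() returns a state; env.step(a) returns the next state, reward, and a termination flag) and a given policy pi; it is an illustration, not the book's code.

```python
# Minimal sketch (hypothetical env interface): TD(0) evaluation of a given policy pi.
import collections

def td0_policy_evaluation(env, pi, gamma=0.9, alpha=0.1, num_episodes=1000):
    v = collections.defaultdict(float)           # v(s), initial guess 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = pi(s)                             # action from the given policy
            s_next, r, done = env.step(a)         # sample (r_{t+1}, s_{t+1})
            td_target = r + gamma * v[s_next]     # r_{t+1} + gamma * v_t(s_{t+1})
            v[s] -= alpha * (v[s] - td_target)    # update only the visited state
            s = s_next
    return v
```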

Shiyu Zhao 11 / 55
TD learning of state values – Algorithm properties

The TD algorithm can be annotated as

  vt+1(st) = vt(st) − αt(st)[vt(st) − (rt+1 + γvt(st+1))],   (3)

where vt+1(st) is the new estimate, vt(st) is the current estimate, rt+1 + γvt(st+1) is the TD target v̄t, and the whole bracketed term is the TD error δt.

Here,

  v̄t ≜ rt+1 + γvt(st+1)

is called the TD target, and

  δt ≜ vt(st) − [rt+1 + γvt(st+1)] = vt(st) − v̄t

is called the TD error.

Observation: The new estimate vt+1(st) is a combination of the current estimate vt(st) and the TD error.

Shiyu Zhao 12 / 55
TD learning of state values – Algorithm properties

First, why is v̄t called the TD target?

That is because the algorithm drives v(st) towards v̄t. To see that,

  vt+1(st) = vt(st) − αt(st)[vt(st) − v̄t]
  ⟹ vt+1(st) − v̄t = [vt(st) − v̄t] − αt(st)[vt(st) − v̄t]
  ⟹ vt+1(st) − v̄t = [1 − αt(st)][vt(st) − v̄t]
  ⟹ |vt+1(st) − v̄t| = |1 − αt(st)| |vt(st) − v̄t|

Since αt(st) is a small positive number, we have

  0 < 1 − αt(st) < 1

Therefore,

  |vt+1(st) − v̄t| ≤ |vt(st) − v̄t|

which means v(st) is driven towards v̄t!
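
A quick numeric check of this contraction with illustrative values (α = 0.1, vt(st) = 0, v̄t = 1; these numbers are not from the lecture):

```python
# Illustrative numbers only: one TD step shrinks the gap to the TD target.
v_t, v_bar, alpha = 0.0, 1.0, 0.1
v_next = v_t - alpha * (v_t - v_bar)          # = 0.1
print(abs(v_next - v_bar), abs(v_t - v_bar))  # 0.9 < 1.0
```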


Shiyu Zhao 13 / 55
TD learning of state values – Algorithm properties

Second, what is the interpretation of the TD error?

δt = vt (st ) − [rt+1 + γvt (st+1 )]

• It reflects the difference between two time steps.


• It reflects the difference between vt and vπ. To see that, denote

  δπ,t ≜ vπ(st) − [rt+1 + γvπ(st+1)]

Note that

  E[δπ,t | St = st] = vπ(st) − E[Rt+1 + γvπ(St+1) | St = st] = 0.

- If vt = vπ, then δt should be zero (in the expectation sense).
- Hence, if δt is not zero, then vt is not equal to vπ.
• The TD error can be interpreted as innovation, which means new
information obtained from the experience (st , rt+1 , st+1 ).

Shiyu Zhao 14 / 55
TD learning of state values – Algorithm properties

Other properties:
• The TD algorithm in (3) only estimates the state value of a given policy.
• It does not estimate the action values.
• It does not search for optimal policies.
• This algorithm will be extended to estimate action values and then search for
optimal policies later in this lecture.
• The TD algorithm in (3) is fundamental for understanding more complex TD
algorithms.

Shiyu Zhao 15 / 55
TD learning of state values – The idea of the algorithm

Q: What does this TD algorithm do mathematically?


A: It is a model-free algorithm for solving the Bellman equation of a given
policy π.
• Chapter 2 has introduced the model-based algorithm for solving the Bellman
equation: closed-form solution + iterative algorithm.

Shiyu Zhao 16 / 55
TD learning of state values – The idea of the algorithm

First, a new expression of the Bellman equation.

The definition of the state value of π is

  vπ(s) = E[R + γG | S = s],  s ∈ S,   (4)

where G is the discounted return. Since

  E[G | S = s] = ∑a π(a|s) ∑s′ p(s′|s, a) vπ(s′) = E[vπ(S′) | S = s],

where S′ is the next state, we can rewrite (4) as

  vπ(s) = E[R + γvπ(S′) | S = s],  s ∈ S.   (5)

Equation (5) is another expression of the Bellman equation. It is sometimes called the Bellman expectation equation, an important tool to design and analyze TD algorithms.

Shiyu Zhao 17 / 55
TD learning of state values – The idea of the algorithm

Second, solve the Bellman equation in (5) using the RM algorithm.

In particular, by defining

  g(v(s)) = v(s) − E[R + γvπ(S′) | s],

we can rewrite (5) as

  g(v(s)) = 0.

Since we can only obtain the samples r and s′ of R and S′, the noisy observation we have is

  g̃(v(s)) = v(s) − [r + γvπ(s′)]
           = (v(s) − E[R + γvπ(S′) | s]) + (E[R + γvπ(S′) | s] − [r + γvπ(s′)])
           = g(v(s)) + η.

Shiyu Zhao 18 / 55
TD learning of state values – The idea of the algorithm

Therefore, the RM algorithm for solving g(v(s)) = 0 is

  vk+1(s) = vk(s) − αk g̃(vk(s))
          = vk(s) − αk[vk(s) − (rk + γvπ(s′k))],   k = 1, 2, 3, . . .   (6)

where vk(s) is the estimate of vπ(s) at the kth step; rk, s′k are the samples of R, S′ obtained at the kth step.

The RM algorithm in (6) looks very similar to the TD algorithm. However,


there are two differences.
• Difference 1: The RM algorithm requires {(s, rk, s′k)} for k = 1, 2, 3, . . . .
- Modification: {(s, rk, s′k)} is changed to {(st, rt+1, st+1)} so that the algorithm can utilize the sequential samples in an episode.
• Difference 2: The RM algorithm requires the true value vπ(s′k).
- Modification: vπ(s′k) is replaced by the estimate vt(st+1).
With the above modifications, the RM algorithm becomes exactly the TD algorithm.
Shiyu Zhao 19 / 55
TD learning of state values – Algorithm convergence

Theorem (Convergence of TD Learning)

By the TD algorithm (1), vt(s) converges with probability 1 to vπ(s) for all s ∈ S as t → ∞ if ∑t αt(s) = ∞ and ∑t αt²(s) < ∞ for all s ∈ S.

The proof of the theorem can be found in my book.

Remarks:
• This theorem says that the state values can be found by the TD algorithm for a given policy π.
• The conditions ∑t αt(s) = ∞ and ∑t αt²(s) < ∞ must hold for all s ∈ S.
- For the condition ∑t αt(s) = ∞: at time step t, if s = st, then αt(s) > 0; if s ≠ st, then αt(s) = 0. As a result, ∑t αt(s) = ∞ requires that every state be visited an infinite (or sufficiently large) number of times.
- For the condition ∑t αt²(s) < ∞: in practice, the learning rate α is often selected as a small constant. In this case, the condition ∑t αt²(s) < ∞ is no longer valid. When α is constant, it can still be shown that the algorithm converges in the expectation sense.
Shiyu Zhao 20 / 55
TD learning of state values – Algorithm properties

While TD learning and MC learning are both model-free, what are the
advantages and disadvantages of TD learning compared to MC learning?

• Online vs offline: TD learning is online. It can update the state/action values immediately after receiving a reward. MC learning is offline. It has to wait until an episode has been completely collected.

• Continuing vs episodic tasks: Since TD learning is online, it can handle both episodic and continuing tasks. Since MC learning is offline, it can only handle episodic tasks that have terminal states.

Table: Comparison between TD learning and MC learning.

Shiyu Zhao 21 / 55
TD learning of state values – Algorithm properties

While TD learning and MC learning are both model-free, what are the
advantages and disadvantages of TD learning compared to MC learning?

• Bootstrapping vs non-bootstrapping: TD bootstraps because the update of a value relies on the previous estimate of this value; hence, it requires initial guesses. MC is non-bootstrapping: it can directly estimate state/action values without any initial guess.

• Estimation variance: TD has lower estimation variance than MC because it involves fewer random variables. For instance, Sarsa requires only Rt+1, St+1, At+1. To estimate qπ(st, at), MC needs samples of Rt+1 + γRt+2 + γ²Rt+3 + . . . ; if the length of each episode is L, there are |A|^L possible episodes.

Table: Comparison between TD learning and MC learning (continued).

Shiyu Zhao 22 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 23 / 55
TD learning of action values – Sarsa

• The TD algorithm introduced in the last section can only estimate state
values.
• Next, we introduce Sarsa, an algorithm that can directly estimate action
values.
• We will also see how to use Sarsa to find optimal policies.

Shiyu Zhao 24 / 55
Sarsa – Algorithm

First, our aim is to estimate the action values of a given policy π.

Suppose we have some experience {(st , at , rt+1 , st+1 , at+1 )}t .


We can use the following Sarsa algorithm to estimate the action values:

  qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γqt(st+1, at+1))],
  qt+1(s, a) = qt(s, a),  ∀(s, a) ≠ (st, at),

where t = 0, 1, 2, . . . .
• qt (st , at ) is an estimate of qπ (st , at );
• αt (st , at ) is the learning rate depending on st , at .

Shiyu Zhao 25 / 55
Sarsa – Algorithm

• Why is this algorithm called Sarsa? That is because each step of the
algorithm involves (st , at , rt+1 , st+1 , at+1 ). Sarsa is the abbreviation of
state-action-reward-state-action.
• What is the relationship between Sarsa and the previous TD learning
algorithm? We can obtain Sarsa by replacing the state value estimate v(s)
in the TD algorithm with the action value estimate q(s, a). As a result,
Sarsa is an action-value version of the TD algorithm.
• What does the Sarsa algorithm do mathematically? The expression of
Sarsa suggests that it is a stochastic approximation algorithm solving the
following equation:

  qπ(s, a) = E[R + γqπ(S′, A′) | s, a],  ∀s, a.

This is another expression of the Bellman equation expressed in terms of


action values. The proof is given in my book.

Shiyu Zhao 26 / 55
Sarsa – Algorithm

Theorem (Convergence of Sarsa learning)

By the Sarsa algorithm, qt(s, a) converges with probability 1 to the action value qπ(s, a) as t → ∞ for all (s, a) if ∑t αt(s, a) = ∞ and ∑t αt²(s, a) < ∞ for all (s, a).

Remarks:
• This theorem says that the action values can be found by Sarsa for a given policy π.

Shiyu Zhao 27 / 55
Sarsa – Implementation

The ultimate goal of RL is to find optimal policies.


To do that, we can combine Sarsa with a policy improvement step.
The combined algorithm is also called Sarsa.

Pseudocode: Policy searching by Sarsa

For each episode, do
    Generate a0 at s0 following π0(s0)
    If st (t = 0, 1, 2, . . . ) is not the target state, do
        Collect an experience sample (rt+1, st+1, at+1) given (st, at): generate rt+1, st+1 by interacting with the environment; generate at+1 following πt(st+1).
        Update q-value for (st, at):
            qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γqt(st+1, at+1))]
        Update policy for st:
            πt+1(a|st) = 1 − (ε/|A(st)|)(|A(st)| − 1) if a = arg maxa qt+1(st, a)
            πt+1(a|st) = ε/|A(st)| otherwise
        st ← st+1, at ← at+1
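
A minimal Python sketch of this pseudocode, assuming a hypothetical environment interface (env.reset(), env.step(a) returning the next state, reward, and a termination flag) and a finite action list; the ε-greedy choice below assigns the same probabilities as the policy update above.

```python
# Minimal sketch (hypothetical env interface): Sarsa with epsilon-greedy improvement.
import random
from collections import defaultdict

def sarsa(env, actions, gamma=0.9, alpha=0.1, eps=0.1, num_episodes=500):
    q = defaultdict(float)                               # q(s, a), initial guess 0

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)                # explore uniformly
        return max(actions, key=lambda a: q[(s, a)])     # exploit current q

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)                  # on-policy next action
            target = r + gamma * q[(s_next, a_next)]     # Sarsa TD target
            q[(s, a)] -= alpha * (q[(s, a)] - target)    # Sarsa update
            s, a = s_next, a_next
    return q
```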
Shiyu Zhao 28 / 55
Sarsa – Implementation

Remarks about this algorithm:


• The policy of st is updated immediately after q(st , at ) is updated. This is
based on the idea of generalized policy iteration.
• The policy is ε-greedy instead of greedy so as to balance exploitation and
exploration.
Be clear about the core idea and complication:
• The core idea is simple: that is to use an algorithm to solve the Bellman
equation of a given policy.
• The complication emerges when we try to find optimal policies and work
efficiently.

Shiyu Zhao 29 / 55
Sarsa – Examples

Task description:
• The task is to find a good path from a specific starting state to the target
state.
- This task is different from all the previous tasks, where we need to find the optimal policy for every state!
- Each episode starts from the top-left state and ends in the target state.
- In the future, pay attention to what the task is.
• rtarget = 0, rforbidden = rboundary = −10, and rother = −1. The learning rate is α = 0.1 and the value of ε is 0.1.

Shiyu Zhao 30 / 55
Sarsa – Examples

Results:
• The left figures above show the final policy obtained by Sarsa.
• Not all states have the optimal policy.
• The right figures show the total reward and length of every episode.
• The metric of total reward per episode will be frequently used.
[Figure: left, the final policy obtained by Sarsa in the 5×5 grid world; right, the total reward and the episode length of each episode (x-axis: episode index, 0 to 500).]

Shiyu Zhao 31 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 32 / 55
TD learning of action values: n-step Sarsa

n-step Sarsa can unify Sarsa and Monte Carlo learning

The definition of action value is

qπ (s, a) = E[Gt |St = s, At = a].

The discounted return Gt can be written in different forms as

  Sarsa ←− Gt(1) = Rt+1 + γqπ(St+1, At+1),
           Gt(2) = Rt+1 + γRt+2 + γ²qπ(St+2, At+2),
           ...
  n-step Sarsa ←− Gt(n) = Rt+1 + γRt+2 + · · · + γⁿqπ(St+n, At+n),
           ...
  MC ←− Gt(∞) = Rt+1 + γRt+2 + γ²Rt+3 + . . .

It should be noted that Gt = Gt(1) = Gt(2) = Gt(n) = Gt(∞), where the superscripts merely indicate the different decomposition structures of Gt.
Shiyu Zhao 33 / 55
TD learning of action values: n-step Sarsa

• Sarsa aims to solve

  qπ(s, a) = E[Gt(1) | s, a] = E[Rt+1 + γqπ(St+1, At+1) | s, a].

• MC learning aims to solve

  qπ(s, a) = E[Gt(∞) | s, a] = E[Rt+1 + γRt+2 + γ²Rt+3 + . . . | s, a].

• An intermediate algorithm called n-step Sarsa aims to solve

  qπ(s, a) = E[Gt(n) | s, a] = E[Rt+1 + γRt+2 + · · · + γⁿqπ(St+n, At+n) | s, a].

• The algorithm of n-step Sarsa is

  qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γrt+2 + · · · + γⁿqt(st+n, at+n))].

- n-step Sarsa becomes the (one-step) Sarsa algorithm when n = 1.
- n-step Sarsa becomes the MC learning algorithm when n = ∞.

Shiyu Zhao 34 / 55
TD learning of action values: n-step Sarsa

• Data: n-step Sarsa needs (st, at, rt+1, st+1, at+1, . . . , rt+n, st+n, at+n).
• Since (rt+n, st+n, at+n) has not been collected at time t, we are not able to implement n-step Sarsa at step t. We need to wait until time t + n to update the q-value of (st, at), as sketched in code after this list:

  qt+n(st, at) = qt+n−1(st, at) − αt+n−1(st, at)[qt+n−1(st, at) − (rt+1 + γrt+2 + · · · + γⁿqt+n−1(st+n, at+n))]

• Since n-step Sarsa includes Sarsa and MC learning as two extreme cases, its
performance is a blend of Sarsa and MC learning:
- If n is large, its performance is close to MC learning and hence has a large
variance but a small bias.
- If n is small, its performance is close to Sarsa and hence has a relatively
large bias due to the initial guess and relatively low variance.
• Finally, n-step Sarsa is also for policy evaluation. It can be combined with
the policy improvement step to search for optimal policies.
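
A minimal sketch of the delayed update at time t + n, assuming the trajectory has been stored as a list of hypothetical (s, a, r_next) tuples and q is a dict keyed by (s, a):

```python
# Minimal sketch: n-step Sarsa target and update applied once traj[t+n] is available.
def n_step_sarsa_update(q, traj, t, n, gamma=0.9, alpha=0.1):
    """traj is a list of (s, a, r_next) tuples; traj[t + n] must already exist."""
    target = 0.0
    for i in range(n):                        # r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n}
        target += (gamma ** i) * traj[t + i][2]
    s_n, a_n, _ = traj[t + n]
    target += (gamma ** n) * q[(s_n, a_n)]    # bootstrap from q(s_{t+n}, a_{t+n})
    s_t, a_t, _ = traj[t]
    q[(s_t, a_t)] -= alpha * (q[(s_t, a_t)] - target)
    return q
```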

Shiyu Zhao 35 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 36 / 55
TD learning of optimal action values: Q-learning

• Next, we introduce Q-learning, one of the most widely used RL algorithms.


• Sarsa can estimate the action values of a given policy. It must be combined
with a policy improvement step to find optimal policies.
• Q-learning can directly estimate optimal action values and hence optimal
policies.

Shiyu Zhao 37 / 55
Q-learning – Algorithm

The Q-learning algorithm is

  qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ maxa∈A qt(st+1, a))],
  qt+1(s, a) = qt(s, a),  ∀(s, a) ≠ (st, at),

Q-learning is very similar to Sarsa. They are different only in terms of the TD
target:
• The TD target in Q-learning is rt+1 + γ maxa∈A qt (st+1 , a)
• The TD target in Sarsa is rt+1 + γqt (st+1 , at+1 )
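
A minimal sketch that isolates this one-line difference, assuming a dict-like q keyed by (s, a) and a list of actions:

```python
# Minimal sketch: Sarsa and Q-learning differ only in how the TD target is formed.
def sarsa_target(q, r, s_next, a_next, gamma):
    return r + gamma * q[(s_next, a_next)]                     # uses the sampled a_{t+1}

def q_learning_target(q, r, s_next, actions, gamma):
    return r + gamma * max(q[(s_next, a)] for a in actions)    # maximizes over actions
```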

Shiyu Zhao 38 / 55
Q-learning – Algorithm

What does Q-learning do mathematically?


It aims to solve

  q(s, a) = E[Rt+1 + γ maxa q(St+1, a) | St = s, At = a],  ∀s, a.

This is the Bellman optimality equation expressed in terms of action values.


See the proof in my book.

As a result, Q-learning can directly estimate the optimal action values instead
of action values of a given policy.

Shiyu Zhao 39 / 55
Off-policy vs on-policy

Before further studying Q-learning, we first introduce two important concepts:


on-policy learning and off-policy learning.

There exist two policies in a TD learning task:


• The behavior policy is used to generate experience samples.
• The target policy is constantly updated toward an optimal policy.

On-policy vs off-policy:
• When the behavior policy is the same as the target policy, the learning is called on-policy.
• When the two policies are different, the learning is called off-policy.

Shiyu Zhao 40 / 55
Off-policy vs on-policy

Advantages of off-policy learning:

• It can search for optimal policies based on the experience samples generated
by any other policies.
• Example: The behavior policy is exploratory so that we can generate
episodes visiting every state-action pair sufficiently many times.

[Figure: (a) Exploratory behavior policy; (b) Generated episode.]

Shiyu Zhao 41 / 55
Off-policy vs on-policy

How to judge if a TD algorithm is on-policy or off-policy?


• First, check what math problem the algorithm aims to solve.
• Second, check what experience samples the algorithm requires.
It deserves special attention because it may be confusing to beginners.

Shiyu Zhao 42 / 55
Off-policy vs on-policy

• Sarsa aims to evaluate a given policy π by solving

  qπ(s, a) = E[R + γqπ(S′, A′) | s, a],  ∀s, a,

where R ∼ p(R|s, a), S′ ∼ p(S′|s, a), A′ ∼ π(A′|S′).

• MC aims to evaluate a given policy π by solving

qπ (s, a) = E [Rt+1 + γRt+2 + . . . |St = s, At = a] , ∀s, a.

where the samples are generated by π.

Both Sarsa and MC are on-policy.


• π is the behavior policy because we need the experience samples generated
by π to estimate the action values of π.
• π is also the target policy because it is updated continuously so that it
approaches the optimal policy.

Shiyu Zhao 43 / 55
Off-policy vs on-policy

Q-learning is off-policy.

• First, Q-learning aims to solve the Bellman optimality equation


  q(s, a) = E[Rt+1 + γ maxa q(St+1, a) | St = s, At = a],  ∀s, a.

• Second, the algorithm is

  qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ maxa∈A qt(st+1, a))],

which requires (st, at, rt+1, st+1).

• The behavior policy is the one for generating at in st . It can be any policy.

Shiyu Zhao 44 / 55
Q-learning – Implementation

Since Q-learning is off-policy, it can be implemented in an either off-policy or


on-policy fashion.

Pseudocode: Policy searching by Q-learning (on-policy version)

For each episode, do
    If st (t = 0, 1, 2, . . . ) is not the target state, do
        Collect the experience sample (at, rt+1, st+1) given st: generate at following πt(st); generate rt+1, st+1 by interacting with the environment.
        Update q-value for (st, at):
            qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ maxa qt(st+1, a))]
        Update policy for st:
            πt+1(a|st) = 1 − (ε/|A(st)|)(|A(st)| − 1) if a = arg maxa qt+1(st, a)
            πt+1(a|st) = ε/|A(st)| otherwise

See the book for more detailed pseudocode.

Shiyu Zhao 45 / 55
Q-learning – Algorithm

Pseudocode: Optimal policy search by Q-learning (off-policy version)

Goal: Learn an optimal target policy πT for all states from the experience samples generated by πb.
For each episode {s0, a0, r1, s1, a1, r2, . . . } generated by πb, do
    For each step t = 0, 1, 2, . . . of the episode, do
        Update q-value for (st, at):
            qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ maxa qt(st+1, a))]
        Update target policy for st:
            πT,t+1(a|st) = 1 if a = arg maxa qt+1(st, a)
            πT,t+1(a|st) = 0 otherwise

See the book for more detailed pseudocode.
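
A minimal Python sketch of this off-policy procedure, assuming episodes generated by some behavior policy πb are provided as lists of hypothetical (s, a, r_next) tuples; the greedy target policy is recovered from the learned q-values.

```python
# Minimal sketch: off-policy Q-learning from pre-collected behavior-policy episodes.
from collections import defaultdict

def q_learning_off_policy(episodes, actions, gamma=0.9, alpha=0.1):
    """episodes: iterable of trajectories [(s0, a0, r1), (s1, a1, r2), ...],
    each generated by some behavior policy pi_b; the last transition is
    assumed to enter a terminal state (bootstrap value 0)."""
    q = defaultdict(float)
    for episode in episodes:
        for t, (s, a, r) in enumerate(episode):
            if t + 1 < len(episode):
                s_next = episode[t + 1][0]
                bootstrap = max(q[(s_next, b)] for b in actions)
            else:
                bootstrap = 0.0                        # terminal next state
            target = r + gamma * bootstrap             # Q-learning TD target
            q[(s, a)] -= alpha * (q[(s, a)] - target)  # Q-learning update
    # greedy target policy: pi_T(s) = argmax_a q(s, a)
    policy = lambda s: max(actions, key=lambda a: q[(s, a)])
    return q, policy
```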

Shiyu Zhao 46 / 55
Q-learning – Examples

Task description:
• The task in these examples is to find an optimal policy for all the states.
• The reward setting is rboundary = rforbidden = −1, and rtarget = 1. The
discount rate is γ = 0.9. The learning rate is α = 0.1.
Ground truth: an optimal policy and the corresponding optimal state values.
[Figure: (a) Optimal policy in the 5×5 grid world; (b) Optimal state values:
  5.8   5.6   6.2   6.5   5.8
  6.5   7.2   8.0   7.2   6.5
  7.2   8.0   10.0  8.0   7.2
  8.0   10.0  10.0  10.0  8.0
  7.2   9.0   10.0  9.0   8.1 ]

Shiyu Zhao 47 / 55
Q-learning – Examples

The behavior policy and the generated experience (10⁵ steps):

[Figure: (a) Behavior policy; (b) Generated episode.]

The policy found by off-policy Q-learning:

[Figure: (a) Estimated policy; (b) State value error versus step in the episode.]


Shiyu Zhao 48 / 55
Q-learning – Examples

The importance of exploration: episodes of 10⁵ steps

If the policy is not sufficiently exploratory, the samples are not good.

[Figure: (a) Behavior policy with ε = 0.5; (b) Generated episode; (c) Q-learning result (state value error versus step in the episode).]

Shiyu Zhao 49 / 55
Q-learning – Examples

[Figure: (a) Behavior policy with ε = 0.1; (b) Generated episode; (c) Q-learning result (state value error versus step in the episode).]

[Figure: (a) Behavior policy with ε = 0.1; (b) Generated episode; (c) Q-learning result (state value error versus step in the episode).]

Shiyu Zhao 50 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 51 / 55
A unified point of view

All the algorithms we introduced in this lecture can be expressed in a unified


expression:

qt+1 (st , at ) = qt (st , at ) − αt (st , at )[qt (st , at ) − q̄t ]

where q̄t is the TD target.

Different TD algorithms have different q̄t .

Algorithm: Expression of q̄t
Sarsa: q̄t = rt+1 + γqt(st+1, at+1)
n-step Sarsa: q̄t = rt+1 + γrt+2 + · · · + γⁿqt(st+n, at+n)
Q-learning: q̄t = rt+1 + γ maxa qt(st+1, a)
Monte Carlo: q̄t = rt+1 + γrt+2 + . . .

Remark: The MC method can also be expressed in this unified expression by


setting αt (st , at ) = 1. In particular, the expression is qt+1 (st , at ) = q̄t .
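
A minimal sketch of this unified view, computing q̄t for each algorithm from hypothetical inputs (q is a dict keyed by (s, a); rewards, states, and actions_taken index the trajectory from time t):

```python
# Minimal sketch: the unified update q_{t+1} = q_t - alpha*(q_t - q_bar_t);
# only the TD target q_bar_t differs across the algorithms in the table above.
def td_target(algorithm, q, rewards, states, actions_taken, all_actions, gamma, n=1):
    if algorithm == "sarsa":
        return rewards[0] + gamma * q[(states[1], actions_taken[1])]
    if algorithm == "q_learning":
        return rewards[0] + gamma * max(q[(states[1], a)] for a in all_actions)
    if algorithm == "n_step_sarsa":
        g = sum((gamma ** i) * rewards[i] for i in range(n))
        return g + (gamma ** n) * q[(states[n], actions_taken[n])]
    if algorithm == "monte_carlo":
        return sum((gamma ** i) * r for i, r in enumerate(rewards))
    raise ValueError(algorithm)
```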

Shiyu Zhao 52 / 55
A unified point of view

All the TD algorithms can be viewed as stochastic approximation algorithms


solving the Bellman equation or Bellman optimality equation:

Algorithm: Equation to solve
Sarsa (BE): qπ(s, a) = E[Rt+1 + γqπ(St+1, At+1) | St = s, At = a]
n-step Sarsa (BE): qπ(s, a) = E[Rt+1 + γRt+2 + · · · + γⁿqπ(St+n, At+n) | St = s, At = a]
Q-learning (BOE): q(s, a) = E[Rt+1 + γ maxa q(St+1, a) | St = s, At = a]
Monte Carlo (BE): qπ(s, a) = E[Rt+1 + γRt+2 + . . . | St = s, At = a]

Shiyu Zhao 53 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 54 / 55
Summary

• Introduced various TD learning algorithms


• Their expressions, math interpretations, implementation, relationship,
examples
• Unified point of view

Shiyu Zhao 55 / 55
