Lecture 7: Temporal-Difference Learning

Shiyu Zhao

Department of Artificial Intelligence


Westlake University
Outline

Algorithms/Methods

[Course map figure: Chapter 1: Basic Concepts; Chapter 2: Bellman Equation; Chapter 3: Bellman Optimality Equation (fundamental tools); Chapter 4: Value Iteration & Policy Iteration (with model); Chapter 5: Monte Carlo Methods (without model); Chapter 6: Stochastic Approximation; Chapter 7: Temporal-Difference Methods; Chapter 8: Value Function Methods (from tabular representation to function representation); Chapter 9: Policy Gradient Methods (policy-based); Chapter 10: Actor-Critic Methods (value-based plus policy-based).]
Shiyu Zhao 1 / 55
Introduction

• This lecture introduces temporal-difference (TD) learning, which is one of


the most well-known methods in reinforcement learning (RL).
• Monte Carlo (MC) learning was the first model-free method we studied; TD learning is
the second. TD has some advantages compared to MC.
• We will see how the stochastic approximation methods studied in the last
lecture are useful.

Shiyu Zhao 2 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 3 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 4 / 55
Motivating example: stochastic algorithms

We next consider some stochastic problems and show how to use the RM
algorithm to solve them.
First, revisit the mean estimation problem: calculate

w = E[X]

based on some iid samples {x} of X. We studied it in the last lecture.


• By writing g(w) = w − E[X], we can reformulate the problem to a
root-finding problem
g(w) = 0

• Since we can only obtain samples {x} of X, the noisy observation is

  g̃(w, η) ≜ w − x = (w − E[X]) + (E[X] − x) = g(w) + η

• According to the last lecture, we know the RM algorithm for solving g(w) = 0 is

  wk+1 = wk − αk g̃(wk, ηk) = wk − αk(wk − xk)
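
A minimal Python sketch of this RM iteration for mean estimation; the sampling distribution and the step size αk = 1/k are illustrative choices, not from the lecture.

```python
# Minimal sketch (not from the lecture): the RM iteration
# w_{k+1} = w_k - a_k (w_k - x_k) for mean estimation, with a_k = 1/k.
import random

random.seed(0)
w = 0.0                         # initial guess of E[X]
for k in range(1, 10001):
    x = random.gauss(5.0, 2.0)  # i.i.d. sample of X (hypothetical distribution)
    alpha = 1.0 / k             # satisfies sum(a_k) = inf, sum(a_k^2) < inf
    w -= alpha * (w - x)        # drive w toward the root of g(w) = w - E[X]
print(w)                        # close to the true mean 5.0
```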

Shiyu Zhao 5 / 55
Motivating example: stochastic algorithms

Second, consider a little more complex problem. That is to estimate the


mean of a function v(X),

w = E[v(X)],

based on some iid random samples {x} of X.

• To solve this problem, we define

  g(w) = w − E[v(X)]
  g̃(w, η) ≜ w − v(x) = (w − E[v(X)]) + (E[v(X)] − v(x)) = g(w) + η.

• Then, the problem becomes a root-finding problem: g(w) = 0. The corresponding RM algorithm is

  wk+1 = wk − αk g̃(wk, ηk) = wk − αk[wk − v(xk)]

Shiyu Zhao 6 / 55
Motivating example: stochastic algorithms

Third, consider an even more complex problem: calculate

w = E[R + γv(X)],

where R, X are random variables, γ is a constant, and v(·) is a function.

• Suppose we can obtain samples {x} and {r} of X and R. We define

  g(w) = w − E[R + γv(X)],
  g̃(w, η) ≜ w − [r + γv(x)]
          = (w − E[R + γv(X)]) + (E[R + γv(X)] − [r + γv(x)])
          = g(w) + η.

• Then, the problem becomes a root-finding problem: g(w) = 0. The corresponding RM algorithm is

  wk+1 = wk − αk g̃(wk, ηk) = wk − αk[wk − (rk + γv(xk))]

This algorithm looks like TD algorithms as shown later.
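
A minimal sketch of this RM iteration in Python, assuming hypothetical distributions for R and X and an arbitrary fixed function v(·):

```python
# Minimal sketch (hypothetical distributions and v): the RM iteration
# w_{k+1} = w_k - a_k [w_k - (r_k + gamma * v(x_k))] for w = E[R + gamma*v(X)].
import random

random.seed(0)
gamma = 0.9
v = lambda x: x * x             # some fixed, known function v(.)
w = 0.0
for k in range(1, 10001):
    r = random.gauss(1.0, 0.5)  # sample of R
    x = random.gauss(0.0, 1.0)  # sample of X
    alpha = 1.0 / k
    w -= alpha * (w - (r + gamma * v(x)))
print(w)                        # close to E[R] + gamma*E[X^2] = 1 + 0.9 = 1.9
```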


Shiyu Zhao 7 / 55
Motivating example: stochastic algorithms

Quick summary:
• The above three examples become more and more complex.
• They can all be solved by the RM algorithm.
• We will see that the TD algorithms have similar expressions.

Shiyu Zhao 8 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 9 / 55
TD learning of state values – Algorithm description

Problem statement:
• Given policy π, the aim is to estimate the state values {vπ (s)}s∈S under π.
• Experience samples: (s0 , r1 , s1 , . . . , st , rt+1 , st+1 , . . . ) or {(st , rt+1 , st+1 )}t
generated by π.
Important notation (the quantities on the left are estimates of the true state values on the right):

v(s) −→ vπ(s)
v(st) −→ vπ(st)
vt(st) −→ vπ(st)

Shiyu Zhao 10 / 55
TD learning of state values – Algorithm description

The TD learning algorithm is

  vt+1(st) = vt(st) − αt(st)[vt(st) − (rt+1 + γvt(st+1))],   (1)
  vt+1(s) = vt(s),  ∀s ≠ st,   (2)

where t = 0, 1, 2, . . . .

Here, vt (st ) is the estimated state value of vπ (st ); αt (st ) is the learning rate of
st at time t.
• At time t, only the value of the visited state st is updated whereas the values
of the unvisited states s 6= st remain unchanged.
• The update in (2) will be omitted when the context is clear.
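
A minimal Python sketch of update (1)-(2), assuming a hypothetical environment interface (env.reset() returns a state; env.step(a) returns the next state, reward, and a termination flag) and a given policy pi; it is an illustration, not the book's code.

```python
# Minimal sketch (hypothetical env interface): TD(0) evaluation of a given policy pi.
import collections

def td0_policy_evaluation(env, pi, gamma=0.9, alpha=0.1, num_episodes=1000):
    v = collections.defaultdict(float)           # v(s), initial guess 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = pi(s)                             # action from the given policy
            s_next, r, done = env.step(a)         # sample (r_{t+1}, s_{t+1})
            td_target = r + gamma * v[s_next]     # r_{t+1} + gamma * v_t(s_{t+1})
            v[s] -= alpha * (v[s] - td_target)    # update only the visited state
            s = s_next
    return v
```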

Shiyu Zhao 11 / 55
TD learning of state values – Algorithm properties

The TD algorithm can be annotated as

  vt+1(st) = vt(st) − αt(st)[vt(st) − (rt+1 + γvt(st+1))],   (3)

where vt+1(st) is the new estimate, vt(st) is the current estimate, rt+1 + γvt(st+1) is the TD target v̄t, and the whole bracketed term is the TD error δt.

Here,

  v̄t ≜ rt+1 + γvt(st+1)

is called the TD target, and

  δt ≜ vt(st) − [rt+1 + γvt(st+1)] = vt(st) − v̄t

is called the TD error.

Observation: The new estimate vt+1(st) is a combination of the current estimate vt(st) and the TD error.

Shiyu Zhao 12 / 55
TD learning of state values – Algorithm properties

First, why is v̄t called the TD target?

That is because the algorithm drives v(st) towards v̄t. To see that,

  vt+1(st) = vt(st) − αt(st)[vt(st) − v̄t]
  ⟹ vt+1(st) − v̄t = [vt(st) − v̄t] − αt(st)[vt(st) − v̄t]
  ⟹ vt+1(st) − v̄t = [1 − αt(st)][vt(st) − v̄t]
  ⟹ |vt+1(st) − v̄t| = |1 − αt(st)| |vt(st) − v̄t|

Since αt(st) is a small positive number, we have

  0 < 1 − αt(st) < 1

Therefore,

  |vt+1(st) − v̄t| ≤ |vt(st) − v̄t|

which means v(st) is driven towards v̄t!
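
A quick numeric check of this contraction with illustrative values (α = 0.1, vt(st) = 0, v̄t = 1; these numbers are not from the lecture):

```python
# Illustrative numbers only: one TD step shrinks the gap to the TD target.
v_t, v_bar, alpha = 0.0, 1.0, 0.1
v_next = v_t - alpha * (v_t - v_bar)          # = 0.1
print(abs(v_next - v_bar), abs(v_t - v_bar))  # 0.9 < 1.0
```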


Shiyu Zhao 13 / 55
TD learning of state values – Algorithm properties

Second, what is the interpretation of the TD error?

δt = vt (st ) − [rt+1 + γvt (st+1 )]

• It reflects the difference between two time steps.


• It reflects the difference between vt and vπ. To see that, denote

  δπ,t ≜ vπ(st) − [rt+1 + γvπ(st+1)]

Note that

  E[δπ,t | St = st] = vπ(st) − E[Rt+1 + γvπ(St+1) | St = st] = 0.

- If vt = vπ, then δt should be zero (in the expectation sense).
- Hence, if δt is not zero, then vt is not equal to vπ.
• The TD error can be interpreted as innovation, which means new
information obtained from the experience (st , rt+1 , st+1 ).

Shiyu Zhao 14 / 55
TD learning of state values – Algorithm properties

Other properties:
• The TD algorithm in (3) only estimates the state value of a given policy.
• It does not estimate the action values.
• It does not search for optimal policies.
• This algorithm will be extended to estimate action values and then search for
optimal policies later in this lecture.
• The TD algorithm in (3) is fundamental for understanding more complex TD
algorithms.

Shiyu Zhao 15 / 55
TD learning of state values – The idea of the algorithm

Q: What does this TD algorithm do mathematically?


A: It is a model-free algorithm for solving the Bellman equation of a given
policy π.
• Chapter 2 has introduced the model-based algorithm for solving the Bellman
equation: closed-form solution + iterative algorithm.

Shiyu Zhao 16 / 55
TD learning of state values – The idea of the algorithm

First, a new expression of the Bellman equation.

The definition of the state value of π is

  vπ(s) = E[R + γG | S = s],  s ∈ S,   (4)

where G is the discounted return. Since

  E[G | S = s] = ∑a π(a|s) ∑s′ p(s′|s, a) vπ(s′) = E[vπ(S′) | S = s],

where S′ is the next state, we can rewrite (4) as

  vπ(s) = E[R + γvπ(S′) | S = s],  s ∈ S.   (5)

Equation (5) is another expression of the Bellman equation. It is sometimes called the Bellman expectation equation, an important tool to design and analyze TD algorithms.

Shiyu Zhao 17 / 55
TD learning of state values – The idea of the algorithm

Second, solve the Bellman equation in (5) using the RM algorithm.

In particular, by defining

  g(v(s)) = v(s) − E[R + γvπ(S′) | s],

we can rewrite (5) as

  g(v(s)) = 0.

Since we can only obtain the samples r and s′ of R and S′, the noisy observation we have is

  g̃(v(s)) = v(s) − [r + γvπ(s′)]
           = (v(s) − E[R + γvπ(S′) | s]) + (E[R + γvπ(S′) | s] − [r + γvπ(s′)])
           = g(v(s)) + η.

Shiyu Zhao 18 / 55
TD learning of state values – The idea of the algorithm

Therefore, the RM algorithm for solving g(v(s)) = 0 is

  vk+1(s) = vk(s) − αk g̃(vk(s))
          = vk(s) − αk[vk(s) − (rk + γvπ(s′k))],   k = 1, 2, 3, . . .   (6)

where vk(s) is the estimate of vπ(s) at the kth step; rk, s′k are the samples of R, S′ obtained at the kth step.

The RM algorithm in (6) looks very similar to the TD algorithm. However,


there are two differences.
• Difference 1: The RM algorithm requires {(s, rk, s′k)} for k = 1, 2, 3, . . . .
- Modification: {(s, rk, s′k)} is changed to {(st, rt+1, st+1)} so that the algorithm can utilize the sequential samples in an episode.
• Difference 2: The RM algorithm requires the true value vπ(s′k).
- Modification: vπ(s′k) is replaced by the estimate vt(st+1).
With the above modifications, the RM algorithm becomes exactly the TD algorithm.
Shiyu Zhao 19 / 55
TD learning of state values – Algorithm convergence

Theorem (Convergence of TD Learning)

By the TD algorithm (1), vt(s) converges with probability 1 to vπ(s) for all s ∈ S as t → ∞ if ∑t αt(s) = ∞ and ∑t αt²(s) < ∞ for all s ∈ S.

The proof of the theorem can be found in my book.

Remarks:
• This theorem says that the state values can be found by the TD algorithm for a given policy π.
• The conditions ∑t αt(s) = ∞ and ∑t αt²(s) < ∞ must hold for all s ∈ S.
- For the condition ∑t αt(s) = ∞: at time step t, if s = st, then αt(s) > 0; if s ≠ st, then αt(s) = 0. As a result, ∑t αt(s) = ∞ requires that every state be visited an infinite (or sufficiently large) number of times.
- For the condition ∑t αt²(s) < ∞: in practice, the learning rate α is often selected as a small constant. In this case, the condition ∑t αt²(s) < ∞ is no longer valid. When α is constant, it can still be shown that the algorithm converges in the expectation sense.
Shiyu Zhao 20 / 55
TD learning of state values – Algorithm properties

While TD learning and MC learning are both model-free, what are the
advantages and disadvantages of TD learning compared to MC learning?

• Online vs offline: TD learning is online. It can update the state/action values immediately after receiving a reward. MC learning is offline. It has to wait until an episode has been completely collected.

• Continuing vs episodic tasks: Since TD learning is online, it can handle both episodic and continuing tasks. Since MC learning is offline, it can only handle episodic tasks that have terminal states.

Table: Comparison between TD learning and MC learning.

Shiyu Zhao 21 / 55
TD learning of state values – Algorithm properties

While TD learning and MC learning are both model-free, what are the
advantages and disadvantages of TD learning compared to MC learning?

• Bootstrapping vs non-bootstrapping: TD bootstraps because the update of a value relies on the previous estimate of this value; hence, it requires initial guesses. MC is non-bootstrapping: it can directly estimate state/action values without any initial guess.

• Estimation variance: TD has lower estimation variance than MC because it involves fewer random variables. For instance, Sarsa requires only Rt+1, St+1, At+1. To estimate qπ(st, at), MC needs samples of Rt+1 + γRt+2 + γ²Rt+3 + . . . ; if the length of each episode is L, there are |A|^L possible episodes.

Table: Comparison between TD learning and MC learning (continued).

Shiyu Zhao 22 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 23 / 55
TD learning of action values – Sarsa

• The TD algorithm introduced in the last section can only estimate state
values.
• Next, we introduce Sarsa, an algorithm that can directly estimate action
values.
• We will also see how to use Sarsa to find optimal policies.

Shiyu Zhao 24 / 55
Sarsa – Algorithm

First, our aim is to estimate the action values of a given policy π.

Suppose we have some experience {(st , at , rt+1 , st+1 , at+1 )}t .


We can use the following Sarsa algorithm to estimate the action values:

  qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γqt(st+1, at+1))],
  qt+1(s, a) = qt(s, a),  ∀(s, a) ≠ (st, at),

where t = 0, 1, 2, . . . .
• qt (st , at ) is an estimate of qπ (st , at );
• αt (st , at ) is the learning rate depending on st , at .

Shiyu Zhao 25 / 55
Sarsa – Algorithm

• Why is this algorithm called Sarsa? That is because each step of the
algorithm involves (st , at , rt+1 , st+1 , at+1 ). Sarsa is the abbreviation of
state-action-reward-state-action.
• What is the relationship between Sarsa and the previous TD learning
algorithm? We can obtain Sarsa by replacing the state value estimate v(s)
in the TD algorithm with the action value estimate q(s, a). As a result,
Sarsa is an action-value version of the TD algorithm.
• What does the Sarsa algorithm do mathematically? The expression of
Sarsa suggests that it is a stochastic approximation algorithm solving the
following equation:

  qπ(s, a) = E[R + γqπ(S′, A′) | s, a],  ∀s, a.

This is another expression of the Bellman equation expressed in terms of


action values. The proof is given in my book.

Shiyu Zhao 26 / 55
Sarsa – Algorithm

Theorem (Convergence of Sarsa learning)

By the Sarsa algorithm, qt(s, a) converges with probability 1 to the action value qπ(s, a) as t → ∞ for all (s, a) if ∑t αt(s, a) = ∞ and ∑t αt²(s, a) < ∞ for all (s, a).

Remarks:
• This theorem says that the action values can be found by Sarsa for a given policy π.

Shiyu Zhao 27 / 55
Sarsa – Implementation

The ultimate goal of RL is to find optimal policies.


To do that, we can combine Sarsa with a policy improvement step.
The combined algorithm is also called Sarsa.

Pseudocode: Policy searching by Sarsa

For each episode, do
    Generate a0 at s0 following π0(s0)
    If st (t = 0, 1, 2, . . . ) is not the target state, do
        Collect an experience sample (rt+1, st+1, at+1) given (st, at): generate rt+1, st+1 by interacting with the environment; generate at+1 following πt(st+1).
        Update q-value for (st, at):
            qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γqt(st+1, at+1))]
        Update policy for st:
            πt+1(a|st) = 1 − (ε/|A(st)|)(|A(st)| − 1) if a = arg maxa qt+1(st, a)
            πt+1(a|st) = ε/|A(st)| otherwise
        st ← st+1, at ← at+1
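
A minimal Python sketch of this pseudocode, assuming a hypothetical environment interface (env.reset(), env.step(a) returning the next state, reward, and a termination flag) and a finite action list; the ε-greedy choice below assigns the same probabilities as the policy update above.

```python
# Minimal sketch (hypothetical env interface): Sarsa with epsilon-greedy improvement.
import random
from collections import defaultdict

def sarsa(env, actions, gamma=0.9, alpha=0.1, eps=0.1, num_episodes=500):
    q = defaultdict(float)                               # q(s, a), initial guess 0

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)                # explore uniformly
        return max(actions, key=lambda a: q[(s, a)])     # exploit current q

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)                  # on-policy next action
            target = r + gamma * q[(s_next, a_next)]     # Sarsa TD target
            q[(s, a)] -= alpha * (q[(s, a)] - target)    # Sarsa update
            s, a = s_next, a_next
    return q
```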
Shiyu Zhao 28 / 55
Sarsa – Implementation

Remarks about this algorithm:


• The policy of st is updated immediately after q(st , at ) is updated. This is
based on the idea of generalized policy iteration.
• The policy is ε-greedy instead of greedy so as to balance exploitation and
exploration.
Be clear about the core idea and complication:
• The core idea is simple: that is to use an algorithm to solve the Bellman
equation of a given policy.
• The complication emerges when we try to find optimal policies and work
efficiently.

Shiyu Zhao 29 / 55
Sarsa – Examples

Task description:
• The task is to find a good path from a specific starting state to the target
state.
- This task is different from all the previous tasks, where we need to find the optimal policy for every state!
- Each episode starts from the top-left state and ends in the target state.
- In the future, pay attention to what the task is.
• rtarget = 0, rforbidden = rboundary = −10, and rother = −1. The learning rate is α = 0.1 and the value of ε is 0.1.

Shiyu Zhao 30 / 55
Sarsa – Examples

Results:
• The left figures above show the final policy obtained by Sarsa.
• Not all states have the optimal policy.
• The right figures show the total reward and length of every episode.
• The metric of total reward per episode will be frequently used.
[Figure: left, the final policy obtained by Sarsa in the 5×5 grid world; right, the total reward and the episode length of each episode (x-axis: episode index, 0 to 500).]

Shiyu Zhao 31 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 32 / 55
TD learning of action values: n-step Sarsa

n-step Sarsa can unify Sarsa and Monte Carlo learning

The definition of action value is

qπ (s, a) = E[Gt |St = s, At = a].

The discounted return Gt can be written in different forms as

  Sarsa ←− Gt(1) = Rt+1 + γqπ(St+1, At+1),
           Gt(2) = Rt+1 + γRt+2 + γ²qπ(St+2, At+2),
           ...
  n-step Sarsa ←− Gt(n) = Rt+1 + γRt+2 + · · · + γⁿqπ(St+n, At+n),
           ...
  MC ←− Gt(∞) = Rt+1 + γRt+2 + γ²Rt+3 + . . .

It should be noted that Gt = Gt(1) = Gt(2) = Gt(n) = Gt(∞), where the superscripts merely indicate the different decomposition structures of Gt.
Shiyu Zhao 33 / 55
TD learning of action values: n-step Sarsa

• Sarsa aims to solve

  qπ(s, a) = E[Gt(1) | s, a] = E[Rt+1 + γqπ(St+1, At+1) | s, a].

• MC learning aims to solve

  qπ(s, a) = E[Gt(∞) | s, a] = E[Rt+1 + γRt+2 + γ²Rt+3 + . . . | s, a].

• An intermediate algorithm called n-step Sarsa aims to solve

  qπ(s, a) = E[Gt(n) | s, a] = E[Rt+1 + γRt+2 + · · · + γⁿqπ(St+n, At+n) | s, a].

• The algorithm of n-step Sarsa is

  qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γrt+2 + · · · + γⁿqt(st+n, at+n))].

- n-step Sarsa becomes the (one-step) Sarsa algorithm when n = 1.
- n-step Sarsa becomes the MC learning algorithm when n = ∞.

Shiyu Zhao 34 / 55
TD learning of action values: n-step Sarsa

• Data: n-step Sarsa needs (st, at, rt+1, st+1, at+1, . . . , rt+n, st+n, at+n).
• Since (rt+n, st+n, at+n) has not been collected at time t, we are not able to implement n-step Sarsa at step t. We need to wait until time t + n to update the q-value of (st, at), as sketched in code after this list:

  qt+n(st, at) = qt+n−1(st, at) − αt+n−1(st, at)[qt+n−1(st, at) − (rt+1 + γrt+2 + · · · + γⁿqt+n−1(st+n, at+n))]

• Since n-step Sarsa includes Sarsa and MC learning as two extreme cases, its
performance is a blend of Sarsa and MC learning:
- If n is large, its performance is close to MC learning and hence has a large
variance but a small bias.
- If n is small, its performance is close to Sarsa and hence has a relatively
large bias due to the initial guess and relatively low variance.
• Finally, n-step Sarsa is also for policy evaluation. It can be combined with
the policy improvement step to search for optimal policies.
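
A minimal sketch of the delayed update at time t + n, assuming the trajectory has been stored as a list of hypothetical (s, a, r_next) tuples and q is a dict keyed by (s, a):

```python
# Minimal sketch: n-step Sarsa target and update applied once traj[t+n] is available.
def n_step_sarsa_update(q, traj, t, n, gamma=0.9, alpha=0.1):
    """traj is a list of (s, a, r_next) tuples; traj[t + n] must already exist."""
    target = 0.0
    for i in range(n):                        # r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n}
        target += (gamma ** i) * traj[t + i][2]
    s_n, a_n, _ = traj[t + n]
    target += (gamma ** n) * q[(s_n, a_n)]    # bootstrap from q(s_{t+n}, a_{t+n})
    s_t, a_t, _ = traj[t]
    q[(s_t, a_t)] -= alpha * (q[(s_t, a_t)] - target)
    return q
```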

Shiyu Zhao 35 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 36 / 55
TD learning of optimal action values: Q-learning

• Next, we introduce Q-learning, one of the most widely used RL algorithms.


• Sarsa can estimate the action values of a given policy. It must be combined
with a policy improvement step to find optimal policies.
• Q-learning can directly estimate optimal action values and hence optimal
policies.

Shiyu Zhao 37 / 55
Q-learning – Algorithm

The Q-learning algorithm is

  qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ maxa∈A qt(st+1, a))],
  qt+1(s, a) = qt(s, a),  ∀(s, a) ≠ (st, at),

Q-learning is very similar to Sarsa. They are different only in terms of the TD
target:
• The TD target in Q-learning is rt+1 + γ maxa∈A qt (st+1 , a)
• The TD target in Sarsa is rt+1 + γqt (st+1 , at+1 )
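
A minimal sketch that isolates this one-line difference, assuming a dict-like q keyed by (s, a) and a list of actions:

```python
# Minimal sketch: Sarsa and Q-learning differ only in how the TD target is formed.
def sarsa_target(q, r, s_next, a_next, gamma):
    return r + gamma * q[(s_next, a_next)]                     # uses the sampled a_{t+1}

def q_learning_target(q, r, s_next, actions, gamma):
    return r + gamma * max(q[(s_next, a)] for a in actions)    # maximizes over actions
```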

Shiyu Zhao 38 / 55
Q-learning – Algorithm

What does Q-learning do mathematically?


It aims to solve

  q(s, a) = E[Rt+1 + γ maxa q(St+1, a) | St = s, At = a],  ∀s, a.

This is the Bellman optimality equation expressed in terms of action values.


See the proof in my book.

As a result, Q-learning can directly estimate the optimal action values instead
of action values of a given policy.

Shiyu Zhao 39 / 55
Off-policy vs on-policy

Before further studying Q-learning, we first introduce two important concepts:


on-policy learning and off-policy learning.

There exist two policies in a TD learning task:


• The behavior policy is used to generate experience samples.
• The target policy is constantly updated toward an optimal policy.

On-policy vs off-policy:
• When the behavior policy is the same as the target policy, the learning is called on-policy.
• When the two policies are different, the learning is called off-policy.

Shiyu Zhao 40 / 55
Off-policy vs on-policy

Advantages of off-policy learning:

• It can search for optimal policies based on the experience samples generated
by any other policies.
• Example: The behavior policy is exploratory so that we can generate
episodes visiting every state-action pair sufficiently many times.

[Figure: (a) Exploratory behavior policy; (b) Generated episode.]

Shiyu Zhao 41 / 55
Off-policy vs on-policy

How to judge if a TD algorithm is on-policy or off-policy?


• First, check what math problem the algorithm aims to solve.
• Second, check what experience samples the algorithm requires.
It deserves special attention because it may be confusing to beginners.

Shiyu Zhao 42 / 55
Off-policy vs on-policy

• Sarsa aims to evaluate a given policy π by solving

  qπ(s, a) = E[R + γqπ(S′, A′) | s, a],  ∀s, a,

where R ∼ p(R|s, a), S′ ∼ p(S′|s, a), A′ ∼ π(A′|S′).

• MC aims to evaluate a given policy π by solving

qπ (s, a) = E [Rt+1 + γRt+2 + . . . |St = s, At = a] , ∀s, a.

where the samples are generated by π.

Both Sarsa and MC are on-policy.


• π is the behavior policy because we need the experience samples generated
by π to estimate the action values of π.
• π is also the target policy because it is updated continuously so that it
approaches the optimal policy.

Shiyu Zhao 43 / 55
Off-policy vs on-policy

Q-learning is off-policy.

• First, Q-learning aims to solve the Bellman optimality equation


  q(s, a) = E[Rt+1 + γ maxa q(St+1, a) | St = s, At = a],  ∀s, a.

• Second, the algorithm is

  qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ maxa∈A qt(st+1, a))],

which requires (st, at, rt+1, st+1).

• The behavior policy is the one for generating at in st . It can be any policy.

Shiyu Zhao 44 / 55
Q-learning – Implementation

Since Q-learning is off-policy, it can be implemented in an either off-policy or


on-policy fashion.

Pseudocode: Policy searching by Q-learning (on-policy version)

For each episode, do
    If st (t = 0, 1, 2, . . . ) is not the target state, do
        Collect the experience sample (at, rt+1, st+1) given st: generate at following πt(st); generate rt+1, st+1 by interacting with the environment.
        Update q-value for (st, at):
            qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ maxa qt(st+1, a))]
        Update policy for st:
            πt+1(a|st) = 1 − (ε/|A(st)|)(|A(st)| − 1) if a = arg maxa qt+1(st, a)
            πt+1(a|st) = ε/|A(st)| otherwise

See the book for more detailed pseudocode.

Shiyu Zhao 45 / 55
Q-learning – Algorithm

Pseudocode: Optimal policy search by Q-learning (off-policy version)

Goal: Learn an optimal target policy πT for all states from the experience samples generated by πb.
For each episode {s0, a0, r1, s1, a1, r2, . . . } generated by πb, do
    For each step t = 0, 1, 2, . . . of the episode, do
        Update q-value for (st, at):
            qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ maxa qt(st+1, a))]
        Update target policy for st:
            πT,t+1(a|st) = 1 if a = arg maxa qt+1(st, a)
            πT,t+1(a|st) = 0 otherwise

See the book for more detailed pseudocode.
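
A minimal Python sketch of this off-policy procedure, assuming episodes generated by some behavior policy πb are provided as lists of hypothetical (s, a, r_next) tuples; the greedy target policy is recovered from the learned q-values.

```python
# Minimal sketch: off-policy Q-learning from pre-collected behavior-policy episodes.
from collections import defaultdict

def q_learning_off_policy(episodes, actions, gamma=0.9, alpha=0.1):
    """episodes: iterable of trajectories [(s0, a0, r1), (s1, a1, r2), ...],
    each generated by some behavior policy pi_b; the last transition is
    assumed to enter a terminal state (bootstrap value 0)."""
    q = defaultdict(float)
    for episode in episodes:
        for t, (s, a, r) in enumerate(episode):
            if t + 1 < len(episode):
                s_next = episode[t + 1][0]
                bootstrap = max(q[(s_next, b)] for b in actions)
            else:
                bootstrap = 0.0                        # terminal next state
            target = r + gamma * bootstrap             # Q-learning TD target
            q[(s, a)] -= alpha * (q[(s, a)] - target)  # Q-learning update
    # greedy target policy: pi_T(s) = argmax_a q(s, a)
    policy = lambda s: max(actions, key=lambda a: q[(s, a)])
    return q, policy
```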

Shiyu Zhao 46 / 55
Q-learning – Examples

Task description:
• The task in these examples is to find an optimal policy for all the states.
• The reward setting is rboundary = rforbidden = −1, and rtarget = 1. The
discount rate is γ = 0.9. The learning rate is α = 0.1.
Ground truth: an optimal policy and the corresponding optimal state values.
[Figure: (a) Optimal policy in the 5×5 grid world; (b) Optimal state values:
  5.8   5.6   6.2   6.5   5.8
  6.5   7.2   8.0   7.2   6.5
  7.2   8.0   10.0  8.0   7.2
  8.0   10.0  10.0  10.0  8.0
  7.2   9.0   10.0  9.0   8.1 ]

Shiyu Zhao 47 / 55
Q-learning – Examples

The behavior policy and the generated experience (10⁵ steps):

[Figure: (a) Behavior policy; (b) Generated episode.]

The policy found by off-policy Q-learning:

[Figure: (a) Estimated policy; (b) State value error versus step in the episode.]


Shiyu Zhao 48 / 55
Q-learning – Examples

The importance of exploration: episodes of 10⁵ steps

If the policy is not sufficiently exploratory, the samples are not good.

[Figure: (a) Behavior policy with ε = 0.5; (b) Generated episode; (c) Q-learning result (state value error versus step in the episode).]

Shiyu Zhao 49 / 55
Q-learning – Examples

[Figure: (a) Behavior policy with ε = 0.1; (b) Generated episode; (c) Q-learning result (state value error versus step in the episode).]

[Figure: (a) Behavior policy with ε = 0.1; (b) Generated episode; (c) Q-learning result (state value error versus step in the episode).]

Shiyu Zhao 50 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 51 / 55
A unified point of view

All the algorithms we introduced in this lecture can be expressed in a unified


expression:

qt+1 (st , at ) = qt (st , at ) − αt (st , at )[qt (st , at ) − q̄t ]

where q̄t is the TD target.

Different TD algorithms have different q̄t .

Algorithm: Expression of q̄t
Sarsa: q̄t = rt+1 + γqt(st+1, at+1)
n-step Sarsa: q̄t = rt+1 + γrt+2 + · · · + γⁿqt(st+n, at+n)
Q-learning: q̄t = rt+1 + γ maxa qt(st+1, a)
Monte Carlo: q̄t = rt+1 + γrt+2 + . . .

Remark: The MC method can also be expressed in this unified expression by


setting αt (st , at ) = 1. In particular, the expression is qt+1 (st , at ) = q̄t .
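
A minimal sketch of this unified view, computing q̄t for each algorithm from hypothetical inputs (q is a dict keyed by (s, a); rewards, states, and actions_taken index the trajectory from time t):

```python
# Minimal sketch: the unified update q_{t+1} = q_t - alpha*(q_t - q_bar_t);
# only the TD target q_bar_t differs across the algorithms in the table above.
def td_target(algorithm, q, rewards, states, actions_taken, all_actions, gamma, n=1):
    if algorithm == "sarsa":
        return rewards[0] + gamma * q[(states[1], actions_taken[1])]
    if algorithm == "q_learning":
        return rewards[0] + gamma * max(q[(states[1], a)] for a in all_actions)
    if algorithm == "n_step_sarsa":
        g = sum((gamma ** i) * rewards[i] for i in range(n))
        return g + (gamma ** n) * q[(states[n], actions_taken[n])]
    if algorithm == "monte_carlo":
        return sum((gamma ** i) * r for i, r in enumerate(rewards))
    raise ValueError(algorithm)
```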

Shiyu Zhao 52 / 55
A unified point of view

All the TD algorithms can be viewed as stochastic approximation algorithms


solving the Bellman equation or Bellman optimality equation:

Algorithm: Equation to solve
Sarsa (BE): qπ(s, a) = E[Rt+1 + γqπ(St+1, At+1) | St = s, At = a]
n-step Sarsa (BE): qπ(s, a) = E[Rt+1 + γRt+2 + · · · + γⁿqπ(St+n, At+n) | St = s, At = a]
Q-learning (BOE): q(s, a) = E[Rt+1 + γ maxa q(St+1, a) | St = s, At = a]
Monte Carlo (BE): qπ(s, a) = E[Rt+1 + γRt+2 + . . . | St = s, At = a]

Shiyu Zhao 53 / 55
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: n-step Sarsa

5 TD learning of optimal action values: Q-learning

6 A unified point of view

7 Summary

Shiyu Zhao 54 / 55
Summary

• Introduced various TD learning algorithms


• Their expressions, math interpretations, implementation, relationship,
examples
• Unified point of view

Shiyu Zhao 55 / 55
