Lecture 7: Temporal-Difference Learning
Shiyu Zhao
[Course map: fundamental tools (Chapter 1: Basic Concepts, Chapter 2: Bellman Equation, Chapter 3: Bellman Optimality Equation) and algorithms/methods, moving from tabular representation (Chapter 7: Temporal-Difference Methods) to function representation (Chapter 8: Value Function Methods).]
Introduction
Outline
1 Motivating examples
2 TD learning of state values
3 TD learning of action values: Sarsa
4 TD learning of action values: n-step Sarsa
5 TD learning of optimal action values: Q-learning
6 A unified point of view
7 Summary
Motivating example: stochastic algorithms
We next consider some stochastic problems and show how to use the Robbins-Monro (RM)
algorithm to solve them.
First, revisit the mean estimation problem: calculate
w = E[X]
based on some i.i.d. samples {x} of X.
• Write it as a root-finding problem: g(w) = w − E[X] = 0.
• Given a sample x of X, the noisy observation is g̃(w, η) = w − x = g(w) + η.
• The RM algorithm solving it is wk+1 = wk − αk g̃(wk, ηk) = wk − αk (wk − xk).
Motivating example: stochastic algorithms
Second, consider the problem of estimating the mean of a function:
w = E[v(X)].
• Define g(w) = w − E[v(X)] = 0.
• Given a sample x of X, the noisy observation is
g̃(w, η) = w − v(x) = (w − E[v(X)]) + (E[v(X)] − v(x)) = g(w) + η.
Motivating example: stochastic algorithms
Third, consider the more complex problem of calculating
w = E[R + γv(X)],
where R and X are random variables, γ is a constant, and v(·) is a known function.
Quick summary:
• The above three examples become more and more complex.
• They can all be solved by the RM algorithm (a minimal sketch for the mean estimation case follows this list).
• We will see that the TD algorithms have similar expressions.
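A minimal Python sketch of the RM iteration for the mean estimation example; the sample distribution and the step size αk = 1/k are assumptions chosen for illustration:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=1000)  # i.i.d. samples of X (assumed distribution)

w = 0.0                                  # initial guess of E[X]
for k, x in enumerate(samples, start=1):
    alpha = 1.0 / k                      # step size: sum(alpha_k) = inf, sum(alpha_k^2) < inf
    w = w - alpha * (w - x)              # RM update: w_{k+1} = w_k - alpha_k * g~(w_k, eta_k)

print(w, samples.mean())                 # with alpha_k = 1/k, the RM estimate equals the sample mean

With αk = 1/k, the update reduces exactly to the incremental computation of the sample mean, which is why the two printed values coincide.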
TD learning of state values – Algorithm description
Problem statement:
• Given policy π, the aim is to estimate the state values {vπ (s)}s∈S under π.
• Experience samples: (s0 , r1 , s1 , . . . , st , rt+1 , st+1 , . . . ) or {(st , rt+1 , st+1 )}t
generated by π.
Important notation:
• v(s) denotes an estimate of vπ(s);
• v(st) denotes an estimate of vπ(st) at the visited state st;
• vt(st) denotes the estimate of vπ(st) at time step t.
TD learning of state values – Algorithm description
The TD learning algorithm for estimating the state values of a given policy π is

vt+1(st) = vt(st) − αt(st)[vt(st) − (rt+1 + γvt(st+1))],   (1)
vt+1(s) = vt(s), for all s ≠ st,   (2)

where t = 0, 1, 2, . . . . Here, vt(st) is the estimated state value of vπ(st); αt(st) is the learning rate of
st at time t.
• At time t, only the value of the visited state st is updated whereas the values
of the unvisited states s ≠ st remain unchanged.
• The update in (2) will be omitted when the context is clear.
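A minimal tabular Python sketch of this update; the environment interface (reset/step), the policy representation, and the hyperparameters are assumptions chosen for illustration:

import numpy as np

def td0_state_values(env, policy, gamma=0.9, alpha=0.1, num_episodes=500):
    """Estimate v_pi for a fixed policy by one-step TD (assumed env API: reset() -> s, step(a) -> (s', r, done))."""
    v = np.zeros(env.num_states)                      # initial guesses of the state values
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                             # action sampled by the given policy pi
            s_next, r, done = env.step(a)
            td_target = r + gamma * v[s_next]         # r_{t+1} + gamma * v_t(s_{t+1})
            v[s] = v[s] - alpha * (v[s] - td_target)  # update only the visited state s_t, as in (1)-(2)
            s = s_next
    return v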
TD learning of state values – Algorithm properties
The TD algorithm (1) can be rewritten as

vt+1(st) = vt(st) − αt(st)[vt(st) − v̄t],   (3)

where
v̄t = rt+1 + γvt(st+1)
is called the TD target, and δt = vt(st) − v̄t = vt(st) − (rt+1 + γvt(st+1)) is called the TD error.
TD learning of state values – Algorithm properties
Note that, when the estimated values equal the true state values vπ, the TD error is zero in
expectation:
E[δπ,t | St = st] = vπ(st) − E[Rt+1 + γvπ(St+1) | St = st] = 0,
where δπ,t = vπ(st) − (Rt+1 + γvπ(St+1)). This holds because vπ satisfies the Bellman equation
vπ(st) = E[Rt+1 + γvπ(St+1) | St = st].
TD learning of state values – Algorithm properties
Other properties:
• The TD algorithm in (3) only estimates the state value of a given policy.
• It does not estimate the action values.
• It does not search for optimal policies.
• This algorithm will be extended to estimate action values and then search for
optimal policies later in this lecture.
• The TD algorithm in (3) is fundamental for understanding more complex TD
algorithms.
TD learning of state values – The idea of the algorithm
The TD algorithm can be viewed as a stochastic approximation (RM) algorithm for solving the
Bellman equation vπ(s) = E[R + γvπ(S′) | S = s]. Applying the RM algorithm to this equation gives

vk+1(s) = vk(s) − αk [vk(s) − (rk + γvπ(s′k))],   k = 1, 2, 3, . . .

where vk(s) is the estimate of vπ(s) at the kth step; rk, s′k are the samples of R, S′ obtained at the
kth step. Since vπ(s′k) is unknown, it is replaced by its current estimate, which yields the TD
algorithm (1).
By the TD algorithm (1), vt(s) converges with probability 1 to vπ(s) for all s ∈ S as
t → ∞, provided that Σt αt(s) = ∞ and Σt αt²(s) < ∞ for all s ∈ S.
Remark: in practice, the learning rate is often chosen as a small constant α. In this case, the
condition Σt αt²(s) < ∞ is no longer satisfied. When α is constant, it can still be shown that the
algorithm converges in the sense of expectation.
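For instance, the classic choice αk(s) = 1/k, where k counts the number of visits to s, satisfies both
conditions, since Σk 1/k = ∞ while Σk 1/k² = π²/6 < ∞.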
TD learning of state values – Algorithm properties
While TD learning and MC learning are both model-free, what are the
advantages and disadvantages of TD learning compared to MC learning?
TD learning of action values – Sarsa
• The TD algorithm introduced in the last section can only estimate state
values.
• Next, we introduce Sarsa, an algorithm that can directly estimate action
values.
• We will also see how to use Sarsa to find optimal policies.
Sarsa – Algorithm
Given experience samples {(st, at, rt+1, st+1, at+1)}t generated by a policy π, the Sarsa algorithm is

qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γqt(st+1, at+1))],
qt+1(s, a) = qt(s, a), for all (s, a) ≠ (st, at),

where t = 0, 1, 2, . . . .
• qt(st, at) is an estimate of qπ(st, at);
• αt(st, at) is the learning rate depending on st, at.
Sarsa – Algorithm
• Why is this algorithm called Sarsa? That is because each step of the
algorithm involves (st , at , rt+1 , st+1 , at+1 ). Sarsa is the abbreviation of
state-action-reward-state-action.
• What is the relationship between Sarsa and the previous TD learning
algorithm? We can obtain Sarsa by replacing the state value estimate v(s)
in the TD algorithm with the action value estimate q(s, a). As a result,
Sarsa is an action-value version of the TD algorithm.
• What does the Sarsa algorithm do mathematically? The expression of
Sarsa suggests that it is a stochastic approximation algorithm solving the
following equation:
qπ(s, a) = E[R + γqπ(S′, A′) | s, a], for all (s, a),
which is the Bellman equation expressed in terms of action values.
Sarsa – Algorithm
Convergence: for a given policy π, qt(s, a) converges with probability 1 to qπ(s, a) for all (s, a) as
t → ∞, provided that Σt αt(s, a) = ∞ and Σt αt²(s, a) < ∞ for all (s, a).
Remarks:
• This theorem says that the action values can be found by Sarsa for a given
policy π.
Sarsa – Implementation
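The implementation combines the Sarsa value update with an ε-greedy policy update at the visited state. A minimal tabular Python sketch; the environment interface, the state/action encoding, and the hyperparameters are assumptions chosen for illustration:

import numpy as np

def sarsa(env, gamma=0.9, alpha=0.1, epsilon=0.1, num_episodes=500):
    """Sarsa with epsilon-greedy improvement (assumed env API: reset() -> s, step(a) -> (s', r, done))."""
    q = np.zeros((env.num_states, env.num_actions))

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(env.num_actions)     # explore
        return int(np.argmax(q[s]))                       # exploit the current estimate

    for _ in range(num_episodes):
        s = env.reset()                                   # each episode starts from the given starting state
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)               # next action from the same policy (on-policy)
            td_target = r + gamma * q[s_next, a_next]     # r_{t+1} + gamma * q_t(s_{t+1}, a_{t+1})
            q[s, a] = q[s, a] - alpha * (q[s, a] - td_target)   # update only the visited (s_t, a_t)
            s, a = s_next, a_next
    return q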
Sarsa – Examples
Task description:
• The task is to find a good path from a specific starting state to the target
state.
- This task is different from all the previous tasks where we need to find out
the optimal policy for every state!
- Each episode starts from the top-left state and ends at the target state.
- In the future, pay attention to what the task is.
• rtarget = 0, rforbidden = rboundary = −10, and rother = −1. The learning
rate is α = 0.1 and the value of ε is 0.1.
Sarsa – Examples
Results:
[Figure: the final policy obtained by Sarsa in the grid world (left) and the total reward and length of
every episode plotted against the episode index (right).]
• The left figures show the final policy obtained by Sarsa.
• Not all states have the optimal policy.
• The right figures show the total reward and length of every episode.
• The metric of total reward per episode will be frequently used.
TD learning of action values: n-step Sarsa
Sarsa and MC learning are two extreme cases of writing the action value in terms of a discounted return:

qπ(s, a) = E[Gt^(1) | s, a] = E[Rt+1 + γqπ(St+1, At+1) | s, a],           (Sarsa)
qπ(s, a) = E[Gt^(∞) | s, a] = E[Rt+1 + γRt+2 + γ²Rt+3 + · · · | s, a].    (MC)

In general, the n-step return is
Gt^(n) = Rt+1 + γRt+2 + · · · + γ^(n−1)Rt+n + γ^n qπ(St+n, At+n),
and n-step Sarsa is the algorithm solving qπ(s, a) = E[Gt^(n) | s, a].
TD learning of action values: n-step Sarsa
• Data: n-step Sarsa needs (st , at , rt+1 , st+1 , at+1 , . . . , rt+n , st+n , at+n ).
• Since (rt+n , st+n , at+n ) has not been collected at time t, we are not able to
implement n-step Sarsa at step t. We need to wait until time t + n to
update the q-value of (st , at ):
qt+n(st, at) = qt+n−1(st, at) − αt+n−1(st, at)[qt+n−1(st, at) − (rt+1 + γrt+2 + · · · + γ^n qt+n−1(st+n, at+n))]
(a Python sketch of this update is given after this list).
• Since n-step Sarsa includes Sarsa and MC learning as two extreme cases, its
performance is a blend of Sarsa and MC learning:
- If n is large, its performance is close to MC learning and hence has a large
variance but a small bias.
- If n is small, its performance is close to Sarsa and hence has a relatively
large bias due to the initial guess and relatively low variance.
• Finally, n-step Sarsa is also for policy evaluation. It can be combined with
the policy improvement step to search for optimal policies.
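A minimal Python sketch of the n-step Sarsa update; for simplicity it processes a recorded episode (an online implementation would instead update each (st, at) as soon as the (t+n)-th sample arrives), and the episode format and hyperparameters are assumptions chosen for illustration:

import numpy as np

def n_step_sarsa_update(q, episode, n=3, gamma=0.9, alpha=0.1):
    """Apply n-step Sarsa updates along one recorded episode.
    episode: list of (s_t, a_t, r_{t+1}) tuples in time order (assumed recording format)."""
    T = len(episode)
    for t in range(T):
        # n-step target: r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * q(s_{t+n}, a_{t+n})
        target = 0.0
        for i in range(min(n, T - t)):
            target += (gamma ** i) * episode[t + i][2]
        if t + n < T:                                    # bootstrap only if (s_{t+n}, a_{t+n}) was reached
            s_n, a_n, _ = episode[t + n]
            target += (gamma ** n) * q[s_n, a_n]
        s_t, a_t, _ = episode[t]
        q[s_t, a_t] = q[s_t, a_t] - alpha * (q[s_t, a_t] - target)
    return q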
TD learning of optimal action values: Q-learning
Q-learning – Algorithm
The Q-learning algorithm is

qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ maxa∈A qt(st+1, a))],
qt+1(s, a) = qt(s, a), for all (s, a) ≠ (st, at).

Q-learning is very similar to Sarsa. They are different only in terms of the TD
target:
• The TD target in Q-learning is rt+1 + γ maxa∈A qt(st+1, a).
• The TD target in Sarsa is rt+1 + γqt(st+1, at+1).
Q-learning – Algorithm
As a result, Q-learning can directly estimate the optimal action values instead
of action values of a given policy.
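In particular, Q-learning can be viewed as a stochastic approximation algorithm for solving
q(s, a) = E[Rt+1 + γ maxa q(St+1, a) | St = s, At = a], for all (s, a),
which is the Bellman optimality equation expressed in terms of action values.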
Off-policy vs on-policy
There exist two policies in a TD learning task:
• The behavior policy is used to generate experience samples.
• The target policy is the policy being constantly updated toward an optimal policy.
On-policy vs off-policy:
• When the behavior policy is the same as the target policy, such kind of
learning is called on-policy.
• When they are different, the learning is called off-policy.
Off-policy vs on-policy
The advantage of off-policy learning:
• It can search for optimal policies based on the experience samples generated
by any other policy.
• Example: the behavior policy can be made exploratory so that the generated
episodes visit every state-action pair sufficiently many times.
Off-policy vs on-policy
Q-learning is off-policy:
• The behavior policy, which generates at at st, can be any policy.
• The target policy is the greedy policy with respect to the current estimate qt, as in the pseudocode below.
Q-learning – Algorithm
Goal: Learn an optimal target policy πT for all states from the experience samples
generated by πb .
For each episode {s0 , a0 , r1 , s1 , a1 , r2 , . . . } generated by πb , do
For each step t = 0, 1, 2, . . . of the episode, do
Update q-value for (st, at):
qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − (rt+1 + γ maxa qt(st+1, a))]
Update target policy for st:
πT,t+1(a|st) = 1 if a = arg maxa qt+1(st, a)
πT,t+1(a|st) = 0 otherwise
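A minimal off-policy Python sketch corresponding to this pseudocode; the behavior policy, the environment interface, and the hyperparameters are assumptions chosen for illustration:

import numpy as np

def q_learning(env, behavior_policy, gamma=0.9, alpha=0.1, num_episodes=500):
    """Off-policy Q-learning (assumed env API: reset() -> s, step(a) -> (s', r, done))."""
    q = np.zeros((env.num_states, env.num_actions))
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = behavior_policy(s)                        # actions come from the behavior policy pi_b
            s_next, r, done = env.step(a)
            td_target = r + gamma * np.max(q[s_next])     # r_{t+1} + gamma * max_a q_t(s_{t+1}, a)
            q[s, a] = q[s, a] - alpha * (q[s, a] - td_target)
            s = s_next
    greedy_policy = np.argmax(q, axis=1)                  # target policy pi_T: greedy w.r.t. the learned q
    return q, greedy_policy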
Q-learning – Examples
Task description:
• The task in these examples is to find an optimal policy for all the states.
• The reward setting is rboundary = rforbidden = −1, and rtarget = 1. The
discount rate is γ = 0.9. The learning rate is α = 0.1.
Ground truth: an optimal policy and the corresponding optimal state values.
[Figure: the ground-truth optimal policy (left) and the corresponding optimal state values (right) in the
5 × 5 grid world.]
Q-learning – Examples
[Figure: (a) Behavior policy (ε = 0.5), (b) Generated episode, (c) Q-learning result; the state value
error is plotted against the step index of the episode (×10^4).]
Q-learning – Examples
[Figure: additional Q-learning results; the state value error is plotted against the step index of the
episode (×10^4).]
A unified point of view
All of the TD algorithms in this lecture can be expressed in the unified form

qt+1(st, at) = qt(st, at) − αt(st, at)[qt(st, at) − q̄t],

where q̄t is the TD target:
• Sarsa: q̄t = rt+1 + γqt(st+1, at+1)
• n-step Sarsa: q̄t = rt+1 + γrt+2 + · · · + γ^n qt(st+n, at+n)
• Q-learning: q̄t = rt+1 + γ maxa qt(st+1, a)
• MC learning: q̄t = rt+1 + γrt+2 + γ²rt+3 + · · ·
The algorithms differ only in their TD targets, and hence in the equations they solve.
Summary
• TD learning of state values: the basic TD algorithm for evaluating a given policy.
• Sarsa: TD learning of action values; combined with policy improvement, it can find optimal policies.
• n-step Sarsa: a unification of Sarsa and MC learning.
• Q-learning: TD learning of optimal action values; it is off-policy.
• A unified point of view: all these algorithms differ only in their TD targets.