Lecture 8: Value Function Approximation

Shiyu Zhao
Outline

[Course map: fundamental tools (Chapter 1: basic concepts; Chapter 2: Bellman equation; Chapter 3: Bellman optimality equation); algorithms/methods (Chapter 4: value iteration and policy iteration, with model; Chapter 5: Monte Carlo methods, without model; Chapter 6: stochastic approximation; Chapter 7: temporal-difference methods); Chapter 8: value function approximation, moving from tabular representation to function representation; Chapter 9: policy gradient methods (policy-based); Chapter 10: actor-critic methods (policy-based plus value-based).]
Outline

1 Motivating examples: from table to function

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis (optional)

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Motivating examples: from table to function

So far in this book, state and action values are represented by tables.

• For example, state value:

State:            s1        s2        ···   sn
True value:       vπ(s1)    vπ(s2)    ···   vπ(sn)
Estimated value:  v̂(s1)     v̂(s2)     ···   v̂(sn)

• For example, action value:

        a1           a2           a3           a4           a5
s1   qπ(s1, a1)   qπ(s1, a2)   qπ(s1, a3)   qπ(s1, a4)   qπ(s1, a5)
...  (rows s2 through s8)
s9   qπ(s9, a1)   qπ(s9, a2)   qπ(s9, a3)   qπ(s9, a4)   qπ(s9, a5)

• Advantage: intuitive and easy to analyze


• Disadvantage: difficult to handle large or continuous state or action spaces.
Two aspects: 1) storage; 2) generalization ability
Motivating examples: from table to function

Consider an example:
• Suppose there are finite states s1 , . . . , sn .
• Their state values are vπ (s1 ), . . . , vπ (sn ), where π is a given policy.
• Suppose n is very large and we hope to use a simple curve to approximate
these dots to save storage.

Motivating examples: from table to function

For example, we can use a simple straight line to fit the dots:

    v̂(s) = as + b

[Figure: the dots at s1, s2, s3, s4, ..., sn fitted by the straight line v̂(s).]

Suppose the equation of the straight line is

    v̂(s, w) = as + b = [s, 1][a, b]ᵀ = φᵀ(s)w.

w is the parameter vector; φ(s) the feature vector of s; v̂(s, w) is linear in w.
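As a quick illustration, consider the following minimal sketch (Python/NumPy; the numbers a = 0.5 and b = −4 are made up purely for illustration) of how one parameter vector w produces a value estimate for every state through φ(s) = [s, 1]ᵀ:

    import numpy as np

    def phi(s):
        # feature vector of state s for the straight-line model: phi(s) = [s, 1]^T
        return np.array([float(s), 1.0])

    def v_hat(s, w):
        # approximated state value: v_hat(s, w) = phi(s)^T w
        return phi(s) @ w

    w = np.array([0.5, -4.0])         # w = [a, b]^T, illustrative values only
    for s in range(1, 6):
        print(s, float(v_hat(s, w)))  # value estimates for states s1, ..., s5

Storing the two numbers in w is enough to produce an estimate for every state index, which is exactly the storage benefit discussed below.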

Motivating examples: from table to function

Difference between the tabular and function methods:


Difference 1: How to retrieve the value of a state
• When the values are represented by a table, we can directly read the value in
the table.
• When the values are represented by a function, we need to input the state
index s into the function and calculate the function value.

[Diagram: the state s is fed into the parameterized function (with parameter w), which outputs v̂(s, w).]

For example, s → φ(s) → v̂(s, w) = φᵀ(s)w.


Benefit: storage. We do not need to store |S| state values. We only need to
store a lower-dimensional w.

Motivating examples: from table to function

Difference between the tabular and function methods:


Difference 2: How to update the value of a state
• When the values are represented by a table, we can directly rewrite the value
in the table.
• When the values are represented by a function, we must update w to change
the values indirectly.
How to update w to find optimal state values will be addressed in detail later.

Motivating examples: from table to function

Difference between the tabular and function methods:


Difference 2: How to update the value of a state

[Figure: (a) Tabular method: updating v̂(s3) changes only that entry. (b) Function approximation method: updating w for s3 changes the whole fitted curve, including the values at neighboring states.]

Benefit: generalization ability. When we update v̂(s) by changing w, the values of the neighboring states are also changed.
Motivating examples: from table to function

The benefits are not free. They come with a cost: the state values cannot be represented accurately. This is why this method is called approximation.
We can fit the points more precisely using high-order curves.
 
    v̂(s, w) = as² + bs + c = [s², s, 1][a, b, c]ᵀ = φᵀ(s)w,

where φ(s) = [s², s, 1]ᵀ and w = [a, b, c]ᵀ.

In this case,
• The dimensions of w and φ(s) increase; the values may be fitted more
accurately.
• Although v̂(s, w) is nonlinear in s, it is linear in w. The nonlinearity is
contained in φ(s).

Motivating examples: from table to function

Quick summary:
• Idea: Approximate the state and action values using parameterized
functions: v̂(s, w) ≈ vπ (s) where w ∈ Rm is the parameter vector.
• Key difference: How to retrieve and change the value of v(s)
• Advantages:
1) Storage: The dimension of w may be much smaller than |S|.
2) Generalization: When a state s is visited, the parameter w is updated
so that the values of some other unvisited states can also be updated.

Objective function

Let us introduce the problem in a more formal way:


• Let vπ (s) and v̂(s, w) be the true state value and a function for
approximation.
• Our goal is to find an optimal w so that v̂(s, w) can best approximate vπ (s)
for every s.
• This is a policy evaluation problem. Later we will extend to policy
improvement.
To find the optimal w, we need two steps.
• The first step is to define an objective function.
• The second step is to derive algorithms optimizing the objective function.

Objective function

The objective function is

J(w) = E[(vπ (S) − v̂(S, w))2 ].

• Our goal is to find the best w that can minimize J(w).


• The expectation is with respect to the random variable S ∈ S. What is the
probability distribution of S?
• This is often confusing because we have not discussed the probability
distribution of states so far in this book.
• There are several ways to define the probability distribution of S.

Objective function

The first way is to use a uniform distribution.


• That is, treat all the states as equally important by setting the
probability of each state to 1/|S|.
• In this case, the objective function becomes

    J(w) = E[(vπ(S) − v̂(S, w))²] = (1/|S|) Σ_{s∈S} (vπ(s) − v̂(s, w))².

• Drawback:
• The states may not be equally important. For example, some states may
be rarely visited by a policy. Hence, this way does not consider the real
dynamics of the Markov process under the given policy.

Objective function

The second way is to use the stationary distribution.


• Stationary distribution is an important concept that will be frequently used in
this course. In short, it describes the long-run behavior of a Markov process.
• Let {dπ(s)}_{s∈S} denote the stationary distribution of the Markov process
under policy π. By definition, dπ(s) ≥ 0 and Σ_{s∈S} dπ(s) = 1.
• The objective function can be rewritten as

    J(w) = E[(vπ(S) − v̂(S, w))²] = Σ_{s∈S} dπ(s)(vπ(s) − v̂(s, w))².

This function is a weighted squared error.


• Since more frequently visited states have higher values of dπ (s), their weights
in the objective function are also higher than those rarely visited states.

Objective function – Stationary distribution

More explanation about stationary distribution:


• Distribution: Distribution of the state
• Stationary: Long-run behavior
• Summary: after the agent runs a long time following a policy, the probability
that the agent is at any state can be described by this distribution.
Remarks:
• Stationary distribution is also called steady-state distribution, or limiting
distribution.
• It is critical for understanding the value function approximation method.
• It is also important for the policy gradient method in the next lecture.

Objective function - Stationary distribution

Illustrative example:
• Given a policy shown in the figure.
• Let nπ (s) denote the number of times that s has been visited in a very long
episode generated by π.
• Then, dπ(s) can be approximated by

    dπ(s) ≈ nπ(s) / Σ_{s′∈S} nπ(s′).
[Figure: a small grid world and the percentage of times each state s1, s2, s3, s4 is visited over 1000 steps; the percentages converge to constant values. Caption: Long-run behavior of an ε-greedy policy with ε = 0.5.]
Objective function - Stationary distribution

The converged values can be predicted because they are the entries of dπ :

    dπᵀ = dπᵀ Pπ

For this example, we have

    Pπ = [ 0.3  0.1  0.6  0
           0.1  0.3  0    0.6
           0.1  0    0.3  0.6
           0    0.1  0.1  0.8 ].

It can be calculated that the left eigenvector for the eigenvalue of one is

    dπ = [0.0345, 0.1084, 0.1330, 0.7241]ᵀ.

A comprehensive introduction can be found in the book.
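As a sanity check, the following sketch (Python/NumPy, using the Pπ given above) computes dπ as the left eigenvector of Pπ associated with the eigenvalue 1, and compares it with an empirical visit-frequency estimate obtained by simulating the chain:

    import numpy as np

    P = np.array([[0.3, 0.1, 0.6, 0.0],
                  [0.1, 0.3, 0.0, 0.6],
                  [0.1, 0.0, 0.3, 0.6],
                  [0.0, 0.1, 0.1, 0.8]])       # P_pi from the example above

    # left eigenvector: d^T = d^T P  <=>  P^T d = d
    eigvals, eigvecs = np.linalg.eig(P.T)
    d = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    d = d / d.sum()
    print(d)                                   # approx [0.0345, 0.1084, 0.1330, 0.7241]

    # empirical estimate: long-run visit frequencies of a simulated chain
    rng = np.random.default_rng(0)
    s, counts = 0, np.zeros(4)
    for _ in range(100_000):
        counts[s] += 1
        s = rng.choice(4, p=P[s])
    print(counts / counts.sum())               # close to d, matching the converged figure above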

Optimization algorithms

Now that we have the objective function, the next step is to optimize it.
• To minimize the objective function J(w), we can use the gradient-descent
algorithm:

wk+1 = wk − αk ∇w J(wk )

The true gradient is

∇w J(w) = ∇w E[(vπ (S) − v̂(S, w))2 ]


= E[∇w (vπ (S) − v̂(S, w))2 ]
= 2E[(vπ (S) − v̂(S, w))(−∇w v̂(S, w))]
= −2E[(vπ (S) − v̂(S, w))∇w v̂(S, w)]

The true gradient above involves the calculation of an expectation.

Optimization algorithms

We can use the stochastic gradient to replace the true gradient:

wt+1 = wt + αt (vπ (st ) − v̂(st , wt ))∇w v̂(st , wt ),

where st is a sample of S. Here, 2αt is merged to αt .


• The samples are expected to follow the stationary distribution. In practice,
they may not.
• This algorithm is not implementable because it requires the true state value
vπ , which is the unknown to be estimated.
• We can replace vπ (st ) with an approximation so that the algorithm is
implementable.

Optimization algorithms

In particular,
• First, Monte Carlo learning with function approximation
Let gt be the discounted return starting from st in the episode. Then, gt can
be used to approximate vπ (st ). The algorithm becomes

wt+1 = wt + αt (gt − v̂(st , wt ))∇w v̂(st , wt ).

• Second, TD learning with function approximation


By the spirit of TD learning, rt+1 + γv̂(st+1 , wt ) can be viewed as an
approximation of vπ (st ). Then, the algorithm becomes

wt+1 = wt + αt [rt+1 + γv̂(st+1 , wt ) − v̂(st , wt )] ∇w v̂(st , wt ).

Optimization algorithms

Pseudocode: TD learning with function approximation

Initialization: A function v̂(s, w) that is differentiable in w. Initial parameter


w0 .
Aim: Approximate the true state values of a given policy π.

For each episode generated following the policy π, do


For each step (st , rt+1 , st+1 ), do
In the general case,
wt+1 = wt + αt [rt+1 + γv̂(st+1 , wt ) − v̂(st , wt )] ∇w v̂(st , wt )
In the linear case,
wt+1 = wt + αt [rt+1 + γφᵀ(st+1)wt − φᵀ(st)wt] φ(st)

This algorithm can only estimate the state values of a given policy, but it is important for understanding the other algorithms introduced later.
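The following sketch (Python/NumPy; the environment interface env.reset/env.step and the policy sampler are assumptions made only for illustration, and terminal-state handling is simplified) shows what this pseudocode looks like as a program:

    import numpy as np

    def td_linear(env, policy, phi, dim, alpha=0.0005, gamma=0.9, num_episodes=500):
        # Estimate v_pi with a linear approximator v_hat(s, w) = phi(s)^T w.
        # Assumed interfaces: env.reset() -> s0, env.step(a) -> (s_next, r, done),
        # policy(s) -> action sampled from pi(.|s), phi(s) -> feature vector of length dim.
        w = np.zeros(dim)
        for _ in range(num_episodes):
            s, done = env.reset(), False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                # TD target bootstraps from the current approximation at the next state
                td_error = r + gamma * phi(s_next) @ w - phi(s) @ w
                w += alpha * td_error * phi(s)   # gradient of v_hat(s, w) w.r.t. w is phi(s)
                s = s_next
        return w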

Selection of function approximators

An important question that has not been answered: How to select the function
v̂(s, w)?
• The first approach, which was widely used before, is to use a linear function

v̂(s, w) = φT (s)w

Here, φ(s) is the feature vector, which can be a polynomial basis, a Fourier
basis, etc. (see my book for details). We have seen this in the motivating
example and will see it again in the illustrative examples later.
• The second approach, which is widely used nowadays, is to use a neural
network as a nonlinear function approximator.
The input of the NN is the state, the output is v̂(s, w), and the network
parameter is w.

Linear function approximation

In the linear case where v̂(s, w) = φT (s)w, we have

∇w v̂(s, w) = φ(s).

Substituting the gradient into the TD algorithm

wt+1 = wt + αt [rt+1 + γv̂(st+1 , wt ) − v̂(st , wt )] ∇w v̂(st , wt )

yields

    wt+1 = wt + αt [rt+1 + γφᵀ(st+1)wt − φᵀ(st)wt] φ(st),

which is the algorithm of TD learning with linear function approximation. It is called TD-Linear for short in this course.

Linear function approximation

• Disadvantages of linear function approximation:


• Difficult to select appropriate feature vectors.
• Advantages of linear function approximation:
• The theoretical properties of the TD algorithm in the linear case can be
much better understood than in the nonlinear case.
• Linear function approximation is still powerful in the sense that the tabular
representation is merely a special case of linear function approximation.

Linear function approximation

We next show that the tabular representation is a special case of linear function
approximation.
In this way, the tabular and function representations are unified!
• First, consider the special feature vector for state s:

φ(s) = es ∈ R|S| ,

where es is a vector with the sth entry as 1 and the others as 0.


• In this case,
v̂(s, w) = eᵀs w = w(s),

where w(s) is the sth entry of w.

Linear function approximation

Recall that the TD-Linear algorithm is

    wt+1 = wt + αt [rt+1 + γφᵀ(st+1)wt − φᵀ(st)wt] φ(st).

• When φ(st) = est, the above algorithm becomes

    wt+1 = wt + αt (rt+1 + γwt(st+1) − wt(st)) est.

This is a vector equation that merely updates the st-th entry of wt.
• Multiplying eᵀst on both sides of the equation gives

    wt+1(st) = wt(st) + αt (rt+1 + γwt(st+1) − wt(st)),

which is exactly the tabular TD algorithm.
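To see this equivalence numerically, the short sketch below (Python/NumPy; the sample transition is made up for illustration) applies the TD-Linear update with one-hot features and shows that only the st-th entry of w changes, exactly as in the tabular TD update:

    import numpy as np

    n_states, alpha, gamma = 5, 0.1, 0.9
    w = np.zeros(n_states)            # with one-hot features, w(s) is the tabular estimate of v(s)

    def phi(s):
        e = np.zeros(n_states)
        e[s] = 1.0                    # phi(s) = e_s
        return e

    s_t, r, s_next = 2, 1.0, 3        # one illustrative sample (s_t, r_{t+1}, s_{t+1})
    td_error = r + gamma * phi(s_next) @ w - phi(s_t) @ w
    w += alpha * td_error * phi(s_t)  # only w[s_t] is modified
    print(w)                          # [0, 0, 0.1, 0, 0]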

Illustrative examples

Consider a 5x5 grid-world example:


[Figure: the 5x5 grid world used in this example.]

• Given a policy: π(a|s) = 0.2 for any s, a


• Our aim is to estimate the state values of this policy (policy evaluation
problem).
• There are 25 state values in total. We next show that we can use fewer than
25 parameters to approximate these state values.
• Set rforbidden = rboundary = −1, rtarget = 1, and γ = 0.9.

Illustrative examples

Ground truth:
• The true state values (also visualized as a 3D surface in the original slide):

            col 1  col 2  col 3  col 4  col 5
    row 1:  -3.8   -3.8   -3.6   -3.1   -3.2
    row 2:  -3.8   -3.8   -3.8   -3.1   -2.9
    row 3:  -3.6   -3.9   -3.4   -3.2   -2.9
    row 4:  -3.9   -3.6   -3.4   -2.9   -3.2
    row 5:  -4.5   -4.2   -3.4   -3.4   -3.5

Experience samples:
• 500 episodes were generated following the given policy.
• Each episode has 500 steps and starts from a randomly selected state-action
pair following a uniform distribution.

Illustrative examples

For comparison, here are the results given by the tabular TD algorithm (called TD-Table for short):
[Figure: state value error (RMSE) versus episode index for TD-Table with α = 0.005; the error decreases over 500 episodes.]
Illustrative examples

We next show the results by the TD-Linear algorithm.


Feature vector selection:
 
    φ(s) = [1, x, y]ᵀ ∈ R³,

where (x, y) is the position of state s in the grid. In this case, the approximated state value is

    v̂(s, w) = φᵀ(s)w = [1, x, y][w1, w2, w3]ᵀ = w1 + w2·x + w3·y.

Notably, φ(s) can also be defined as φ(s) = [x, y, 1]T , where the order of the
elements does not matter.
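A small sketch of this feature construction (Python/NumPy; the row-major indexing convention for (x, y) is an assumption made for illustration):

    import numpy as np

    def phi_linear(s, n=5):
        # phi(s) = [1, x, y]^T, where (x, y) is the position of state s in an n-by-n grid
        x, y = divmod(s, n)           # hypothetical convention: s = x * n + y
        return np.array([1.0, x, y])

    def phi_quadratic(s, n=5):
        # phi(s) = [1, x, y, x^2, y^2, xy]^T, the 6-dimensional feature vector used below
        x, y = divmod(s, n)
        return np.array([1.0, x, y, x * x, y * y, x * y])

    w = np.zeros(3)                   # one weight per feature, not one per state
    print(phi_linear(12), float(phi_linear(12) @ w))   # features and value estimate of one state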

Illustrative examples

Results by the TD-Linear algorithm:

[Figure: state value error (RMSE) versus episode index for TD-Linear with α = 0.0005, over 500 episodes.]

• The trend is right, but there are errors due to limited approximation ability!
• We are trying to use a plane to approximate a non-plane surface!

Illustrative examples

To enhance the approximation ability, we can use high-order feature vectors


and hence more parameters.
• For example, we can consider

    φ(s) = [1, x, y, x², y², xy]ᵀ ∈ R⁶.

In this case,

    v̂(s, w) = φᵀ(s)w = w1 + w2·x + w3·y + w4·x² + w5·y² + w6·xy,

which corresponds to a quadratic surface.


• We can further increase the dimension of the feature vector:

    φ(s) = [1, x, y, x², y², xy, x³, y³, x²y, xy²]ᵀ ∈ R¹⁰.

Illustrative examples

Results by the TD-Linear algorithm with higher-order feature vectors:


[Figure: state value error (RMSE) versus episode index for TD-Linear with α = 0.0005 and φ(s) ∈ R⁶.]

[Figure: state value error (RMSE) versus episode index for TD-Linear with α = 0.0005 and φ(s) ∈ R¹⁰.]

More examples and types of features are given in the book.


Summary of the story

Up to now, we have finished the story of TD learning with value function approximation.
1) This story started from the objective function:

J(w) = E[(vπ (S) − v̂(S, w))2 ].

The objective function suggests that it is a policy evaluation problem.


2) The gradient-descent algorithm is

wt+1 = wt + αt (vπ(st) − v̂(st, wt)) ∇w v̂(st, wt).

3) The true value function vπ(st), which is unknown, is replaced by an approximation, leading to the algorithm:

wt+1 = wt + αt [rt+1 + γv̂(st+1 , wt ) − v̂(st , wt )] ∇w v̂(st , wt ).

Although this story is helpful for understanding the basic idea, it is not mathematically rigorous.
Theoretical analysis (optional)

• The algorithm

wt+1 = wt + αt [rt+1 + γv̂(st+1 , wt ) − v̂(st , wt )] ∇w v̂(st , wt )

does not minimize the following objective function:

J(w) = E[(vπ (S) − v̂(S, w))2 ]

Theoretical analysis (optional)

Different objective functions:


• Objective function 1: True value error

    JE(w) = E[(vπ(S) − v̂(S, w))²] = ‖v̂(w) − vπ‖²_D

• Objective function 2: Bellman error

    JBE(w) = ‖v̂(w) − (rπ + γPπ v̂(w))‖²_D = ‖v̂(w) − Tπ(v̂(w))‖²_D,

where Tπ(x) := rπ + γPπ x.

Theoretical analysis (optional)

• Objective function 3: Projected Bellman error

    JPBE(w) = ‖v̂(w) − M Tπ(v̂(w))‖²_D,

where M is a projection matrix.

The TD-Linear algorithm minimizes the projected Bellman error.


Details can be found in the book.
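For reference, a standard concrete form of this projection in the linear case (not spelled out on this slide, but consistent with the notation above) is

    M = Φ(ΦᵀDΦ)⁻¹ΦᵀD,

where Φ is the matrix whose rows are the feature vectors φᵀ(s) and D = diag(dπ); this M is the orthogonal projection onto the space of representable value functions with respect to the norm ‖·‖_D used above.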

Sarsa with function approximation

So far, we have merely considered the problem of state value estimation; that is, we hope that v̂ ≈ vπ.

To search for optimal policies, we need to estimate action values.

The Sarsa algorithm with value function approximation is

    wt+1 = wt + αt [rt+1 + γq̂(st+1, at+1, wt) − q̂(st, at, wt)] ∇w q̂(st, at, wt).

This is the same as the algorithm we introduced previously in this lecture, except that v̂ is replaced by q̂.

Sarsa with function approximation

To search for optimal policies, we can combine policy evaluation and policy
improvement.

Pseudocode: Sarsa with function approximation

Aim: Search for a policy that can lead the agent to the target from an initial state-action pair (s0, a0).

For each episode, do


If the current st is not the target state, do
Take action at following πt (st ), generate rt+1 , st+1 , and then take action at+1
following πt (st+1 )
Value update (parameter update):
    wt+1 = wt + αt [rt+1 + γq̂(st+1, at+1, wt) − q̂(st, at, wt)] ∇w q̂(st, at, wt)
Policy update:
    πt+1(a|st) = 1 − (ε/|A(s)|)(|A(s)| − 1)  if a = arg max_{a∈A(st)} q̂(st, a, wt+1)
    πt+1(a|st) = ε/|A(s)|                    otherwise
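A compact sketch of this loop (Python/NumPy; the environment interface and the state-action feature map phi_sa are assumptions made for illustration, and terminal-state handling is simplified):

    import numpy as np

    def sarsa_fa(env, phi_sa, dim, n_actions, alpha=0.001, gamma=0.9, eps=0.1,
                 num_episodes=500):
        # Sarsa with a linear approximator q_hat(s, a, w) = phi_sa(s, a)^T w.
        # Assumed interfaces: env.reset() -> s0, env.step(a) -> (s_next, r, done).
        rng = np.random.default_rng(0)
        w = np.zeros(dim)

        def q(s, a):
            return phi_sa(s, a) @ w

        def eps_greedy(s):
            # equivalent to the epsilon-greedy policy update in the pseudocode above
            if rng.random() < eps:
                return int(rng.integers(n_actions))
            return int(np.argmax([q(s, a) for a in range(n_actions)]))

        for _ in range(num_episodes):
            s, done = env.reset(), False
            a = eps_greedy(s)
            while not done:
                s_next, r, done = env.step(a)
                a_next = eps_greedy(s_next)
                delta = r + gamma * q(s_next, a_next) - q(s, a)   # Sarsa TD error
                w += alpha * delta * phi_sa(s, a)                 # gradient of q_hat is phi_sa(s, a)
                s, a = s_next, a_next
        return w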

Sarsa with function approximation

Illustrative example:
• Sarsa with linear function approximation.
• γ = 0.9, ε = 0.1, rboundary = rforbidden = −10, rtarget = 1, α = 0.001.

[Figure: the final policy obtained on the 5x5 grid world, together with the total reward and the episode length of every episode; the total reward increases and the episode length decreases as training proceeds over 500 episodes.]

For details, please see the book.

Q-learning with function approximation

Similar to Sarsa, tabular Q-learning can also be extended to the case of value
function approximation.

The q-value update rule is


    wt+1 = wt + αt [rt+1 + γ max_{a∈A(st+1)} q̂(st+1, a, wt) − q̂(st, at, wt)] ∇w q̂(st, at, wt),

which is the same as Sarsa except that q̂(st+1, at+1, wt) is replaced by max_{a∈A(st+1)} q̂(st+1, a, wt).

Q-learning with function approximation

Pseudocode: Q-learning with function approximation (on-policy version)

Initialization: Initial parameter vector w0 . Initial policy π0 . Small ε > 0.


Aim: Search for a good policy that can lead the agent to the target from an initial state-action pair (s0, a0).

For each episode, do


If the current st is not the target state, do
Take action at following πt (st ), and generate rt+1 , st+1
Value update (parameter update):
    wt+1 = wt + αt [rt+1 + γ max_{a∈A(st+1)} q̂(st+1, a, wt) − q̂(st, at, wt)] ∇w q̂(st, at, wt)
Policy update:
    πt+1(a|st) = 1 − (ε/|A(s)|)(|A(s)| − 1)  if a = arg max_{a∈A(st)} q̂(st, a, wt+1)
    πt+1(a|st) = ε/|A(s)|                    otherwise
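Relative to the Sarsa sketch above, only the value update changes; a minimal version of this step (Python/NumPy, with the feature map phi_sa again an assumed interface) is:

    import numpy as np

    def q_learning_update(w, phi_sa, s, a, r, s_next, n_actions, alpha=0.001, gamma=0.9):
        # one Q-learning parameter update with q_hat(s, a, w) = phi_sa(s, a)^T w
        q_next = max(phi_sa(s_next, b) @ w for b in range(n_actions))   # max over next actions
        delta = r + gamma * q_next - phi_sa(s, a) @ w
        return w + alpha * delta * phi_sa(s, a)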

Q-learning with function approximation

Illustrative example:
• Q-learning with linear function approximation.
• γ = 0.9, ε = 0.1, rboundary = rforbidden = −10, rtarget = 1, α = 0.001.

[Figure: the final policy obtained on the 5x5 grid world, together with the total reward and the episode length of every episode over 500 episodes.]
Deep Q-learning

Deep Q-learning or deep Q-network (DQN):


• One of the earliest and most successful algorithms that introduce deep neural
networks into RL.
• The role of neural networks is to be a nonlinear function approximator.
• It is different from the following algorithm,

    wt+1 = wt + αt [rt+1 + γ max_{a∈A(st+1)} q̂(st+1, a, wt) − q̂(st, at, wt)] ∇w q̂(st, at, wt),

because of the way the network is trained.

Deep Q-learning

Deep Q-learning aims to minimize the objective function (loss function)

    J(w) = E[(R + γ max_{a∈A(S′)} q̂(S′, a, w) − q̂(S, A, w))²],

where (S, A, R, S′) are random variables.


• This is actually the Bellman optimality error. That is because

    q(s, a) = E[Rt+1 + γ max_{a∈A(St+1)} q(St+1, a) | St = s, At = a],  for all s, a.

The value of R + γ max_{a∈A(S′)} q̂(S′, a, w) − q̂(S, A, w) should be zero in the expectation sense.

Deep Q-learning

How to minimize the objective function? Gradient-descent!


• How to calculate the gradient of the objective function? Tricky!
• That is because, in this objective function,

    J(w) = E[(R + γ max_{a∈A(S′)} q̂(S′, a, w) − q̂(S, A, w))²],

the parameter w not only appears in q̂(S, A, w) but also in

    y := R + γ max_{a∈A(S′)} q̂(S′, a, w).

• Note that

    ∇w y ≠ γ max_{a∈A(S′)} ∇w q̂(S′, a, w).

• To solve this problem, we can assume that w in y is fixed (at least for a
while) when we calculate the gradient.

Deep Q-learning

To do that, we can introduce two networks.


• One is a main network representing q̂(s, a, w)
• The other is a target network q̂(s, a, wT ).
The objective function in this case degenerates to

    J = E[(R + γ max_{a∈A(S′)} q̂(S′, a, wT) − q̂(S, A, w))²],

where wT is the target network parameter.

Deep Q-learning

When wT is fixed, the gradient of J can be easily obtained (up to a constant coefficient) as

    ∇w J = E[(R + γ max_{a∈A(S′)} q̂(S′, a, wT) − q̂(S, A, w)) ∇w q̂(S, A, w)].

• The basic idea of deep Q-learning is to use the gradient-descent algorithm to


minimize the objective function.
• However, such an optimization process involves some important techniques
that deserve special attention.

Deep Q-learning - Two networks

First technique:
• Two networks, a main network and a target network.
Why is it used?
• The mathematical reason has been explained when we calculate the gradient.
Implementation details:
• Let w and wT denote the parameters of the main and target networks,
respectively. They are set to be the same initially.
• In every iteration, we draw a mini-batch of samples {(s, a, r, s0 )} from the
replay buffer (will be explained later).
• The inputs of the networks are the state s and action a. The target output is
yT := r + γ max_{a∈A(s′)} q̂(s′, a, wT). Then, we directly minimize the TD error,
also called the loss function, (yT − q̂(s, a, w))² over the mini-batch {(s, a, yT)}.

Deep Q-learning - Experience replay

Another technique:
• Experience replay
Question: What is experience replay?
Answer:
• After we have collected some experience samples, we do NOT use these
samples in the order they were collected.
• Instead, we store them in a set called the replay buffer, B := {(s, a, r, s′)}.
• Every time we train the neural network, we can draw a mini-batch of random
samples from the replay buffer.
• The drawing of samples, called experience replay, should follow a uniform
distribution (why? because we have no prior knowledge to favor some samples).
Can we use the stationary distribution here as before? No, since no policy is given.

Deep Q-learning - Experience replay

Question: Why is experience replay necessary in deep Q-learning? Why must the replay follow a uniform distribution?
Answer: The answers lie in the objective function

    J = E[(R + γ max_{a∈A(S′)} q̂(S′, a, w) − q̂(S, A, w))²].

• R ∼ p(R|S, A), S′ ∼ p(S′|S, A): R and S′ are determined by the system
model.
• (S, A) ∼ d: here (S, A) is treated as a single random variable (an index pair).
• The distribution of the state-action pair (S, A) is assumed to be uniform
(can we assume other distributions, such as the stationary distribution?).

Deep Q-learning - Experience replay

Answer (continued):
• However, the samples are not uniformly collected because they are generated
consecutively by certain policies.
• To break the correlation between consecutive samples, we can use the
experience replay technique by uniformly drawing samples from the replay
buffer.
• This is the mathematical reason why experience replay is necessary and why
the experience replay must be uniform.

Deep Q-learning - Experience replay

Revisit the tabular case:


• Question: Why doesn't tabular Q-learning require experience replay?
• Answer: Because it does not require any distribution of S or A.
• Question: Why does Deep Q-learning involve distributions?
• Answer: Because we need to define a scalar objective function
J(w) = E[·], where the expectation is over all (S, A).
• The tabular case aims to solve a set of equations for all (s, a) (Bellman
optimality equation), whereas the deep case aims to optimize a scalar
objective function.
• Question: Can we use experience replay in tabular Q-learning?
• Answer: Yes, we can, and doing so is more sample-efficient (why?).

Deep Q-learning

Pseudocode: Deep Q-learning (off-policy version)

Aim: Learn an optimal target network to approximate the optimal action values from the
experience samples generated by a behavior policy πb .

Store the experience samples generated by πb in a replay buffer B = {(s, a, r, s′)}

For each iteration, do
    Uniformly draw a mini-batch of samples from B
    For each sample (s, a, r, s′), calculate the target value yT = r + γ max_{a∈A(s′)} q̂(s′, a, wT), where wT is the parameter of the target network
    Update the main network to minimize (yT − q̂(s, a, w))² over the mini-batch {(s, a, yT)}
    Set wT = w every C iterations

Remarks:
• Why is there no policy update?
• Why do we not use the policy update equation that we derived earlier?
• The network input and output here are different from those in the DQN paper.
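To make the two techniques concrete, below is a minimal sketch of the training loop (Python with PyTorch; the network size, the env interface, and all hyperparameter values are assumptions for illustration and do not reproduce the settings of the DQN paper; states are assumed to be feature vectors of length state_dim, and terminal-state handling is omitted):

    import random
    from collections import deque

    import torch
    import torch.nn as nn

    def build_net(state_dim, n_actions):
        # a small network: input is the state, outputs are q_hat(s, a, w) for every action a
        return nn.Sequential(nn.Linear(state_dim, 100), nn.ReLU(), nn.Linear(100, n_actions))

    def deep_q_learning(env, state_dim, n_actions, gamma=0.9, lr=1e-3,
                        batch_size=32, sync_every=100, num_steps=10_000):
        main_net = build_net(state_dim, n_actions)         # q_hat(s, a, w)
        target_net = build_net(state_dim, n_actions)       # q_hat(s, a, w_T)
        target_net.load_state_dict(main_net.state_dict())  # w_T = w initially
        optimizer = torch.optim.Adam(main_net.parameters(), lr=lr)
        buffer = deque(maxlen=10_000)                      # replay buffer B

        s = env.reset()                                    # assumed interface
        for step in range(num_steps):
            a = random.randrange(n_actions)                # exploratory behavior policy (uniform stand-in)
            s_next, r, done = env.step(a)
            buffer.append((s, a, r, s_next))
            s = env.reset() if done else s_next

            if len(buffer) >= batch_size:
                batch = random.sample(buffer, batch_size)  # uniform experience replay
                states, actions, rewards, next_states = map(
                    lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch))
                with torch.no_grad():                      # target uses w_T, treated as fixed
                    y_T = rewards + gamma * target_net(next_states).max(dim=1).values
                q = main_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(q, y_T)      # (y_T - q_hat(s, a, w))^2 over the mini-batch
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            if step % sync_every == 0:
                target_net.load_state_dict(main_net.state_dict())   # set w_T = w every C iterations
        return main_net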

Deep Q-learning

Illustrative example:
• This example aims to learn optimal action values for every state-action pair.
• Once the optimal action values are obtained, the optimal greedy policy can
be obtained immediately.

Deep Q-learning

Setup:
• One single episode is used to train the network.
• This episode is generated by an exploratory behavior policy shown in Figure
(a).
• The episode has only 1,000 steps! Tabular Q-learning requires 100,000
steps.
• A shallow neural network with one single hidden layer is used as a nonlinear
approximator of q̂(s, a, w). The hidden layer has 100 neurons.
See details in the book.

Deep Q-learning

[Figure: the behavior policy, the episode of 1,000 steps, and the obtained policy on the 5x5 grid world.]

[Figure: the TD error (loss function) converges to zero, and the state value estimation error (RMSE) converges to zero, over 1,000 iterations.]
Deep Q-learning

What if we use only a single episode of 100 steps? Then the data are insufficient.


[Figure: the behavior policy, the episode of 100 steps, and the final policy on the 5x5 grid world.]

[Figure: the TD error (loss function) still converges to zero, but the state value estimation error (RMSE) does not converge to zero.]
Summary

This lecture introduces the method of value function approximation.


• First, understand the basic idea.
• Second, understand the basic algorithms.
