Lecture 8: Value Function Approximation
Shiyu Zhao

[Course map: Chapter 1 (Basic Concepts), Chapter 2 (Bellman Equation), and Chapter 3 (Bellman Optimality Equation) provide the fundamental tools; Chapter 7 (Temporal-Difference Methods) provides the algorithms that Chapter 8 (Value Function Approximation) extends from the tabular representation to the function representation.]
Outline
1 Motivating examples: from table to function
2 Algorithm for state value estimation
  • Objective function
  • Optimization algorithms
  • Selection of function approximators
  • Illustrative examples
  • Summary of the story
  • Theoretical analysis (optional)
3 Sarsa with function approximation
4 Q-learning with function approximation
5 Deep Q-learning
6 Summary
Motivating examples: from table to function
So far in this book, state and action values are represented by tables.
State            s1         s2         ···    sn
True value       vπ(s1)     vπ(s2)     ···    vπ(sn)
Estimated value  v̂(s1)      v̂(s2)      ···    v̂(sn)

        a1            a2            a3            a4            a5
s1      qπ(s1, a1)    qπ(s1, a2)    qπ(s1, a3)    qπ(s1, a4)    qπ(s1, a5)
⋮       ⋮             ⋮             ⋮             ⋮             ⋮
s9      qπ(s9, a1)    qπ(s9, a2)    qπ(s9, a3)    qπ(s9, a4)    qπ(s9, a5)
Consider an example:
• Suppose there are finitely many states s1, . . . , sn.
• Their state values are vπ(s1), . . . , vπ(sn), where π is a given policy.
• Suppose n is very large. We hope to use a simple curve to approximate these values (viewed as points in a plot of vπ(s) versus s) in order to save storage.
Motivating examples: from table to function
For example, we can use a simple straight line to fit the dots:
      v̂(s) = as + b.
[Figure: the state values plotted as dots over s1, s2, . . . , sn, fitted by a straight line.]
Motivating examples: from table to function
s v̂(s, w)
w
function
Shiyu Zhao 7 / 71
Motivating examples: from table to function
Shiyu Zhao 8 / 71
Motivating examples: from table to function
[Figure: in the tabular case, updating v̂(s3) changes only that single point; in the function case, updating w based on s3 moves the whole curve, so the values of nearby states are adjusted as well.]
The benefits are not free. They come with a cost: the state values can no longer be represented exactly. That is why this method is called approximation.
We can fit the points more precisely using higher-order curves. For example,
      v̂(s, w) = as² + bs + c = [s², s, 1] [a, b, c]T = φT(s)w,
where φ(s) = [s², s, 1]T and w = [a, b, c]T.
In this case,
• The dimensions of w and φ(s) increase, and the values may be fitted more accurately.
• Although v̂(s, w) is nonlinear in s, it is linear in w. The nonlinearity is contained in φ(s). A small curve-fitting sketch is given below.
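As a concrete illustration, the following sketch (Python/numpy; the state values here are made-up placeholders, not taken from this example) fits the feature model φ(s) = [s², s, 1]T to n state values by least squares, storing only three parameters instead of n values.

```python
import numpy as np

# Hypothetical example: n states encoded as s = 1, ..., n, with placeholder
# "true" state values that we want to compress into 3 parameters.
n = 50
s = np.arange(1, n + 1)
v_pi = np.sin(s / 10.0) + 0.1 * np.random.randn(n)   # made-up values

# Feature matrix Phi with rows phi(s)^T = [s^2, s, 1]
Phi = np.stack([s**2, s, np.ones(n)], axis=1)

# Least-squares fit: w minimizes sum_s (phi(s)^T w - v_pi(s))^2
w, *_ = np.linalg.lstsq(Phi, v_pi, rcond=None)

v_hat = Phi @ w          # approximated values, generated from only 3 numbers
print("parameters w:", w)
print("max error:", np.max(np.abs(v_hat - v_pi)))
```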
Motivating examples: from table to function
Quick summary:
• Idea: Approximate the state and action values using parameterized
functions: v̂(s, w) ≈ vπ (s) where w ∈ Rm is the parameter vector.
• Key difference: how the value v(s) is retrieved and changed in the two representations (see the sketch after this list).
• Advantages:
1) Storage: The dimension of w may be much smaller than |S|.
2) Generalization: When a state s is visited, the parameter w is updated
so that the values of some other unvisited states can also be updated.
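The sketch below contrasts the two representations on the "retrieve and change" point; the feature map and the numbers are assumptions for illustration only.

```python
import numpy as np

# Tabular representation: one entry per state.
v_table = np.zeros(100)
v_table[3] = 0.7                      # changing v(s3) touches only that entry

# Function representation: values are generated by a parameterized function.
w = np.zeros(3)
def phi(s):                            # assumed feature map [s^2, s, 1]
    return np.array([s**2, s, 1.0])

def v_hat(s, w):
    return phi(s) @ w                  # retrieve: evaluate the function

# "Change v(s3)": nudge w so that v_hat(3, w) moves toward a target value.
target, alpha = 0.7, 0.01
w += alpha * (target - v_hat(3, w)) * phi(3)   # also changes v_hat at other states
```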
Objective function
Let vπ(s) denote the true state value and v̂(s, w) its approximation. The objective function is the expected squared estimation error
      J(w) = E[(vπ(S) − v̂(S, w))²],
where the expectation is with respect to a probability distribution of the state S. The first way to choose this distribution is the uniform distribution, which treats all the states as equally important and assigns probability 1/|S| to each state.
• Drawback: the states may not be equally important. For example, some states may be rarely visited by a policy. Hence, the uniform distribution does not reflect the real dynamics of the Markov process under the given policy.
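To make the role of the state distribution concrete, the following sketch (numpy; all inputs are hypothetical) evaluates the weighted squared error Σ_s d(s)(vπ(s) − v̂(s, w))² for an arbitrary state distribution d, which reduces to the uniform-weighting objective when d(s) = 1/|S|.

```python
import numpy as np

def weighted_objective(v_pi, Phi, w, d):
    """J(w) = sum_s d(s) * (v_pi(s) - phi(s)^T w)^2 for a state distribution d.

    v_pi: true values, shape (n,); Phi: feature matrix, shape (n, m);
    w: parameters, shape (m,); d: state probabilities summing to 1, shape (n,).
    """
    errors = v_pi - Phi @ w
    return float(np.sum(d * errors ** 2))

# Uniform weighting treats every state alike, e.g. d = np.full(n, 1.0 / n);
# a visit-frequency weighting emphasizes the states the policy reaches often.
```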
Objective function – Stationary distribution
The second way is to use the stationary distribution dπ of the Markov process under the policy π. It describes the long-run fraction of time that the agent spends in each state, so frequently visited states receive larger weights in J(w).
Objective function - Stationary distribution
Illustrative example:
• Given a policy shown in the figure.
• Let nπ (s) denote the number of times that s has been visited in a very long
episode generated by π.
• Then, dπ(s) can be approximated by
      dπ(s) ≈ nπ(s) / Σ_{s′∈S} nπ(s′).
[Figure: a grid-world example and the visit frequencies nπ(s)/Σ_{s′} nπ(s′) of the states over 1,000 steps; the frequencies converge to constant values as the step index grows.]
Objective function - Stationary distribution
The converged values can be predicted because they are the entries of dπ, which satisfies
      dπT = dπT Pπ.
It can be calculated that the left eigenvector for the eigenvalue of one is
      dπ = [0.0345, 0.1084, 0.1330, 0.7241]T.
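A quick way to compute dπ numerically is to take the left eigenvector of Pπ for the eigenvalue 1. The sketch below uses a made-up 4-state transition matrix, not the one behind the figure above.

```python
import numpy as np

# Made-up 4-state transition matrix P_pi under a fixed policy (rows sum to 1).
P = np.array([
    [0.3, 0.3, 0.3, 0.1],
    [0.1, 0.3, 0.3, 0.3],
    [0.1, 0.2, 0.3, 0.4],
    [0.1, 0.1, 0.1, 0.7],
])

# d_pi^T = d_pi^T P  <=>  d_pi is a left eigenvector of P for eigenvalue 1,
# i.e., an ordinary (right) eigenvector of P^T.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
d = np.real(eigvecs[:, idx])
d = d / d.sum()                       # normalize to a probability vector
print("stationary distribution d_pi:", d)

# Cross-check: simulating the chain and counting visit frequencies, as in
# the slide, approaches the same vector in the long run.
```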
Optimization algorithms
Now that we have the objective function, the next step is to optimize it.
• To minimize the objective function J(w), we can use the gradient-descent algorithm:
      wk+1 = wk − αk ∇w J(wk).
• The true gradient involves an expectation over the states, so in practice we use stochastic gradient descent, replacing the true gradient with a sample-based one:
      wt+1 = wt + αt (vπ(st) − v̂(st, wt)) ∇w v̂(st, wt).
• This update is not directly implementable because it requires the true value vπ(st), which is unknown and must be replaced by an estimate.
Optimization algorithms
In particular,
• First, Monte Carlo learning with function approximation.
  Let gt be the discounted return starting from st in the episode. Then, gt can be used to approximate vπ(st), and the algorithm becomes
      wt+1 = wt + αt (gt − v̂(st, wt)) ∇w v̂(st, wt).
• Second, TD learning with function approximation.
  Replacing vπ(st) with the TD target rt+1 + γ v̂(st+1, wt) gives
      wt+1 = wt + αt [rt+1 + γ v̂(st+1, wt) − v̂(st, wt)] ∇w v̂(st, wt).
  This algorithm can only estimate the state values of a given policy, but it is important for understanding the other algorithms introduced later.
Selection of function approximators
An important question that has not been answered: How to select the function
v̂(s, w)?
• The first approach, which was widely used in the past, is to use a linear function
      v̂(s, w) = φT(s)w.
  Here, φ(s) is the feature vector, which can be a polynomial basis, a Fourier basis, and so on (see my book for details). We have seen this in the motivating example and will see it again in the illustrative examples later.
• The second approach, which is widely used nowadays, is to use a neural network as a nonlinear function approximator. The input of the network is the state, the output is v̂(s, w), and the network parameter is w.
Linear function approximation
Consider the linear case, where v̂(s, w) = φT(s)w and hence
      ∇w v̂(s, w) = φ(s).
Substituting this gradient into the TD algorithm above yields the TD-Linear algorithm:
      wt+1 = wt + αt [rt+1 + γ φT(st+1)wt − φT(st)wt] φ(st).
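A minimal sketch of the TD-Linear algorithm is given below; the environment interface (env.reset, env.step), the policy, and the feature map phi are assumed placeholders rather than code from the book, and termination handling is kept simple.

```python
import numpy as np

def td_linear(env, policy, phi, dim, gamma=0.9, alpha=0.0005, num_episodes=500):
    """TD-Linear: estimate v_pi(s) ~= phi(s)^T w for a given policy.

    Assumed interfaces: env.reset() -> s, env.step(a) -> (s_next, r, done),
    policy(s) -> a, phi(s) -> feature vector of length `dim`.
    """
    w = np.zeros(dim)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD error with linear approximation: r + gamma*phi(s')^T w - phi(s)^T w
            td_error = r + gamma * phi(s_next) @ w - phi(s) @ w
            # Gradient of v_hat(s, w) = phi(s)^T w is simply phi(s).
            w += alpha * td_error * phi(s)
            s = s_next
    return w
```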
Linear function approximation
We next show that the tabular representation is a special case of linear function approximation. In this way, the tabular and function representations are unified!
• First, consider the special feature vector for state s:
      φ(s) = es ∈ R|S|,
  where es is the vector whose s-th entry is 1 and all other entries are 0.
• Then, v̂(s, w) = φT(s)w = eTs w = w(s), where w(s) denotes the s-th entry of w. That is, the parameter vector simply stores one value per state, which is exactly the tabular representation.
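The following toy check shows that one TD-Linear update with the one-hot feature es touches only the s-th entry of w, exactly as a tabular TD update would (the transition and reward are made up).

```python
import numpy as np

num_states = 5
def phi_onehot(s):                    # e_s: one-hot feature vector
    e = np.zeros(num_states)
    e[s] = 1.0
    return e

w = np.zeros(num_states)              # with one-hot features, w IS the value table
gamma, alpha = 0.9, 0.1

# One TD-Linear update for a (hypothetical) transition s=2 -> s'=3 with reward r=1:
s, s_next, r = 2, 3, 1.0
td_error = r + gamma * phi_onehot(s_next) @ w - phi_onehot(s) @ w
w += alpha * td_error * phi_onehot(s)
# Only w[2] changed, by alpha * (r + gamma*w[3] - w[2]): the tabular TD update.
```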
Illustrative examples
Ground truth:
• The true state values and their 3D visualization. [Figure: the true state values in the 5×5 grid world, shown numerically and as a 3D surface.]
Experience samples:
• 500 episodes were generated following the given policy.
• Each episode has 500 steps and starts from a randomly selected state-action
pair following a uniform distribution.
Illustrative examples
In the linear case, we first select the feature vector
      φ(s) = [1, x, y]T ∈ R³,
where (x, y) is the position of state s in the grid. Then, v̂(s, w) = φT(s)w = w1 + w2 x + w3 y, which is a plane over the grid.
Notably, φ(s) can also be defined as φ(s) = [x, y, 1]T; the order of the elements does not matter.
Illustrative examples
[Figure: the state values estimated by TD-Linear (α = 0.0005) and the corresponding estimation error versus the episode index, over 500 episodes; the error decreases but does not vanish.]
• The trend is right, but there are errors due to limited approximation ability!
• We are trying to use a plane to approximate a non-plane surface!
Illustrative examples
To improve the approximation ability, we can use higher-order feature vectors, for example,
      φ(s) = [1, x, y, x², y², xy]T ∈ R⁶.
In this case,
      v̂(s, w) = φT(s)w = w1 + w2 x + w3 y + w4 x² + w5 y² + w6 xy.
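For reference, the two feature constructions could be coded as below; the convention that (x, y) ranges over 1..5 is an assumption for illustration.

```python
import numpy as np

# Feature vectors for a 5x5 grid world, where each state s is identified
# with its position (x, y) (assumed convention: x is the row, y the column).
def phi_linear(x, y):
    return np.array([1.0, x, y])                      # v_hat = w1 + w2*x + w3*y

def phi_quadratic(x, y):
    return np.array([1.0, x, y, x**2, y**2, x*y])     # adds curvature terms

# Stack the features of all 25 states into matrices (one row per state).
states = [(x, y) for x in range(1, 6) for y in range(1, 6)]
Phi_lin  = np.array([phi_linear(x, y) for x, y in states])    # shape (25, 3)
Phi_quad = np.array([phi_quadratic(x, y) for x, y in states]) # shape (25, 6)
```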
Illustrative examples
[Figure: the state values estimated by TD-Linear with the quadratic features and the corresponding estimation error versus the episode index, over 500 episodes; the approximation error is reduced compared with the linear features but still does not vanish.]
Summary of the story
• Objective: find w that minimizes J(w) = E[(vπ(S) − v̂(S, w))²].
• Method: gradient descent, with the true gradient replaced by a stochastic one and the unknown vπ(st) replaced by an approximation: the discounted return gt (Monte Carlo) or the TD target rt+1 + γ v̂(st+1, wt) (TD learning).
• In the linear case v̂(s, w) = φT(s)w, this gives the TD-Linear algorithm.
Theoretical analysis (optional)
• The algorithm analyzed in this part is the TD-Linear algorithm introduced above. Its convergence can be established rigorously; see the book for the detailed analysis.
Sarsa with function approximation
The action value is now a parameterized function q̂(s, a, w). The Sarsa algorithm with function approximation updates w by
      wt+1 = wt + αt [rt+1 + γ q̂(st+1, at+1, wt) − q̂(st, at, wt)] ∇w q̂(st, at, wt).
Sarsa with function approximation
To search for optimal policies, we can combine this policy evaluation step with a policy improvement step (for example, an ε-greedy update of the policy), as sketched in the code below.
Aim: search for a policy that can lead the agent to the target from an initial state-action pair (s0, a0).
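A minimal sketch of Sarsa with linear function approximation and ε-greedy exploration is shown below; the environment interface, the state-action feature map phi(s, a), and the termination handling are simplifying assumptions, not the book's exact pseudocode.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action: explore with probability epsilon, otherwise be greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def sarsa_linear(env, phi, num_actions, dim, gamma=0.9, epsilon=0.1,
                 alpha=0.001, num_episodes=500, seed=0):
    """Sarsa with q_hat(s, a, w) = phi(s, a)^T w (interfaces are assumed)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    q = lambda state, action: phi(state, action) @ w
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy([q(s, b) for b in range(num_actions)], epsilon, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy([q(s_next, b) for b in range(num_actions)],
                                    epsilon, rng)
            # Sarsa bootstraps on the action actually chosen at the next state.
            td_error = r + gamma * q(s_next, a_next) - q(s, a)
            w += alpha * td_error * phi(s, a)
            s, a = s_next, a_next
    return w
```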
Sarsa with function approximation
Illustrative example:
• Sarsa with linear function approximation.
• γ = 0.9, ε = 0.1, rboundary = rforbidden = −10, rtarget = 1, α = 0.001.
[Figure: the policy learned in the 5×5 grid world, together with the total reward and the episode length of each of the 500 episodes.]
Q-learning with function approximation
Similar to Sarsa, tabular Q-learning can also be extended to the case of value function approximation. The update rule is
      wt+1 = wt + αt [rt+1 + γ max_{a∈A(st+1)} q̂(st+1, a, wt) − q̂(st, at, wt)] ∇w q̂(st, at, wt),
which is the same as Sarsa except that the TD target uses the maximum action value at the next state (see the sketch below).
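Compared with the Sarsa sketch above, only the TD target changes. A hypothetical one-step update with linear function approximation could look like this:

```python
import numpy as np

def q_learning_update(w, phi, s, a, r, s_next, num_actions,
                      gamma=0.9, alpha=0.001):
    """One Q-learning step with q_hat(s, a, w) = phi(s, a)^T w (phi is assumed)."""
    q = lambda state, action: phi(state, action) @ w
    # Q-learning bootstraps on the greedy (max) action at the next state.
    td_target = r + gamma * max(q(s_next, b) for b in range(num_actions))
    td_error = td_target - q(s, a)
    return w + alpha * td_error * phi(s, a)
```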
Q-learning with function approximation
Illustrative example:
• Q-learning with linear function approximation.
• γ = 0.9, ε = 0.1, rboundary = rforbidden = −10, rtarget = 1, α = 0.001.
[Figure: the policy learned in the 5×5 grid world, together with the total reward and the episode length of each of the 500 episodes.]
Deep Q-learning
Deep Q-learning (or deep Q-network, DQN) uses a deep neural network q̂(s, a, w) to approximate the optimal action values. It aims to minimize the loss function
      J(w) = E[(R + γ max_{a∈A(S′)} q̂(S′, a, w) − q̂(S, A, w))²],
where (S, A, R, S′) are random variables. The quantity y := R + γ max_{a∈A(S′)} q̂(S′, a, w) plays the role of the target.
Deep Q-learning
• Note that
      ∇w y ≠ γ max_{a∈A(S′)} ∇w q̂(S′, a, w).
• To solve this problem, we can assume that the w in y is fixed (at least for a while) when we calculate the gradient.
Deep Q-learning - Two networks
First technique:
• Two networks, a main network and a target network.
Why is it used?
• The mathematical reason was explained above: when calculating the gradient, the parameter w that appears in the target y is treated as fixed for a while.
Implementation details:
• Let w and wT denote the parameters of the main and target networks,
respectively. They are set to be the same initially.
• In every iteration, we draw a mini-batch of samples {(s, a, r, s′)} from the replay buffer (explained later).
• The inputs of the networks include the state s and the action a. The target output is yT = r + γ max_{a∈A(s′)} q̂(s′, a, wT). Then, we directly minimize the TD error, also called the loss function, (yT − q̂(s, a, w))² over the mini-batch {(s, a, yT)}.
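A minimal PyTorch sketch of one training iteration with a main network and a target network is given below; the network architecture, batch format, and hyperparameters are illustrative assumptions rather than the exact setup of these slides.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """q_hat(s, a, w): input is a (state, action) feature vector, output a scalar."""
    def __init__(self, in_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

def dqn_step(main_net, target_net, optimizer, batch, gamma=0.9):
    """One gradient step on the loss (y_T - q_hat(s, a, w))^2 over a mini-batch.

    `batch` is assumed to provide, for each sample, the (s, a) input features,
    the reward, and the (s', a') input features for every candidate action a'.
    """
    sa, r, next_sa_all = batch          # shapes: (B, d), (B,), (B, |A|, d)
    with torch.no_grad():               # target network: no gradient through y_T
        q_next = target_net(next_sa_all)            # (B, |A|)
        y = r + gamma * q_next.max(dim=1).values    # (B,)
    loss = ((y - main_net(sa)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example setup: optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3).
# Periodically synchronize the target network with the main network:
# target_net.load_state_dict(main_net.state_dict())
```

Reading the target network only inside torch.no_grad() implements the assumption above that the w inside y is kept fixed while the gradient is computed; wT is then synchronized with w from time to time, as in the comment above.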
Deep Q-learning - Experience replay
Another technique:
• Experience replay
Question: What is experience replay?
Answer:
• After we have collected some experience samples, we do NOT use them in the order they were collected.
• Instead, we store them in a set, called the replay buffer, B = {(s, a, r, s′)}.
• Every time we train the neural network, we draw a mini-batch of random samples from the replay buffer.
• The drawing of samples, called experience replay, should follow a uniform distribution (why? because we have no prior knowledge about which samples are more important). Can we use the stationary distribution here as before? No, since no policy is given.
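A replay buffer is essentially a bounded container with uniform random sampling; here is a minimal sketch (the capacity and batch size are arbitrary choices).

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next) tuples and samples mini-batches uniformly."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # old samples are dropped when full

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the correlation between consecutive
        # samples collected within the same episode.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```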
Deep Q-learning - Experience replay
Recall the objective function J(w) = E[(R + γ max_{a∈A(S′)} q̂(S′, a, w) − q̂(S, A, w))²], where the random variables are distributed as follows.
• R ∼ p(R|S, A) and S′ ∼ p(S′|S, A): R and S′ are determined by the system model.
• (S, A) ∼ d: the pair (S, A) is treated as a single random variable.
• The distribution d of the state-action pair (S, A) is assumed to be uniform (can we assume other distributions, such as the stationary distribution?).
Deep Q-learning - Experience replay
Answer (continued):
• However, the samples are not collected uniformly: they are generated consecutively by certain policies, so consecutive samples are correlated.
• To break the correlation between consecutive samples, we can use the experience replay technique, which draws samples uniformly from the replay buffer.
• This is the mathematical reason why experience replay is necessary and why the replay must be uniform.
Deep Q-learning
Aim: learn an optimal target network to approximate the optimal action values from the experience samples generated by a behavior policy πb.
Remarks:
• Why is there no policy update? Because the algorithm is off-policy: it only needs to learn the optimal action values, from which the optimal greedy policy can be obtained directly.
• The network input and output here are different from those in the original DQN paper, where the input is the state and the output is the value of every action.
Deep Q-learning
Illustrative example:
• This example aims to learn optimal action values for every state-action pair.
• Once the optimal action values are obtained, the optimal greedy policy can
be obtained immediately.
Deep Q-learning
Setup:
• A single episode is used to train the network.
• This episode is generated by an exploratory behavior policy, shown in Figure (a).
• The episode has only 1,000 steps, whereas tabular Q-learning requires 100,000 steps.
• A shallow neural network with a single hidden layer of 100 neurons is used as a nonlinear approximator of q̂(s, a, w).
See the book for details.
Deep Q-learning
[Figure: results in the 5×5 grid world; the TD error and the state-value estimation error are plotted against the iteration index (0 to 1,000).]
The TD error converges to zero. The state estimation error converges to zero.
Deep Q-learning
[Figure: results for a second setting in the 5×5 grid world; the TD error and the state-value estimation error are plotted against the iteration index (0 to 1,000).]
The TD error converges to zero, but the state estimation error does not converge to zero.
Summary
• From the tabular representation to the function representation: v̂(s, w) ≈ vπ(s).
• The objective function J(w) = E[(vπ(S) − v̂(S, w))²] and its minimization by stochastic gradient methods, in particular the TD-Linear algorithm.
• Sarsa and Q-learning with function approximation.
• Deep Q-learning: a neural network approximator, a target network, and experience replay.