Lecture 8: Value Function Approximation
Shiyu Zhao

[Course map: Chapter 1 (Basic Concepts), Chapter 2 (Bellman Equation), and Chapter 3 (Bellman Optimality Equation) provide the fundamental tools; Chapter 7 (Temporal-Difference Methods) provides the algorithms that Chapter 8 (Value Function Approximation) extends from the tabular representation to the function representation.]
Outline
1 Motivating examples: from table to function
2 Algorithm for state value estimation
  • Objective function
  • Optimization algorithms
  • Selection of function approximators
  • Illustrative examples
  • Summary of the story
  • Theoretical analysis (optional)
3 Sarsa with function approximation
4 Q-learning with function approximation
5 Deep Q-learning
6 Summary
Motivating examples: from table to function
So far in this book, state and action values are represented by tables.
State            s1         s2         ···    sn
True value       vπ(s1)     vπ(s2)     ···    vπ(sn)
Estimated value  v̂(s1)      v̂(s2)      ···    v̂(sn)

        a1            a2            a3            a4            a5
s1      qπ(s1, a1)    qπ(s1, a2)    qπ(s1, a3)    qπ(s1, a4)    qπ(s1, a5)
⋮       ⋮             ⋮             ⋮             ⋮             ⋮
s9      qπ(s9, a1)    qπ(s9, a2)    qπ(s9, a3)    qπ(s9, a4)    qπ(s9, a5)
Consider an example:
• Suppose there are finitely many states s1, . . . , sn.
• Their state values are vπ(s1), . . . , vπ(sn), where π is a given policy.
• Suppose n is very large. We hope to use a simple curve to approximate these values (viewed as points in a plot of vπ(s) versus s) in order to save storage.
Motivating examples: from table to function
For example, we can use a simple straight line to fit the dots:
      v̂(s) = as + b.
[Figure: the state values plotted as dots over s1, s2, . . . , sn, fitted by a straight line.]
Motivating examples: from table to function
s v̂(s, w)
w
function
Shiyu Zhao 7 / 71
Motivating examples: from table to function
Shiyu Zhao 8 / 71
Motivating examples: from table to function
[Figure: in the tabular case, updating v̂(s3) changes only that single point; in the function case, updating w based on s3 moves the whole curve, so the values of nearby states are adjusted as well.]
The benefits are not free. They come with a cost: the state values can no longer be represented exactly. That is why this method is called approximation.
We can fit the points more precisely using higher-order curves. For example,
      v̂(s, w) = as² + bs + c = [s², s, 1] [a, b, c]T = φT(s)w,
where φ(s) = [s², s, 1]T and w = [a, b, c]T.
In this case,
• The dimensions of w and φ(s) increase, and the values may be fitted more accurately.
• Although v̂(s, w) is nonlinear in s, it is linear in w. The nonlinearity is contained in φ(s). A small curve-fitting sketch is given below.
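As a concrete illustration, the following sketch (Python/numpy; the state values here are made-up placeholders, not taken from this example) fits the feature model φ(s) = [s², s, 1]T to n state values by least squares, storing only three parameters instead of n values.

```python
import numpy as np

# Hypothetical example: n states encoded as s = 1, ..., n, with placeholder
# "true" state values that we want to compress into 3 parameters.
n = 50
s = np.arange(1, n + 1)
v_pi = np.sin(s / 10.0) + 0.1 * np.random.randn(n)   # made-up values

# Feature matrix Phi with rows phi(s)^T = [s^2, s, 1]
Phi = np.stack([s**2, s, np.ones(n)], axis=1)

# Least-squares fit: w minimizes sum_s (phi(s)^T w - v_pi(s))^2
w, *_ = np.linalg.lstsq(Phi, v_pi, rcond=None)

v_hat = Phi @ w          # approximated values, generated from only 3 numbers
print("parameters w:", w)
print("max error:", np.max(np.abs(v_hat - v_pi)))
```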
Motivating examples: from table to function
Quick summary:
• Idea: Approximate the state and action values using parameterized
functions: v̂(s, w) ≈ vπ (s) where w ∈ Rm is the parameter vector.
• Key difference: how the value v(s) is retrieved and changed in the two representations (see the sketch after this list).
• Advantages:
1) Storage: The dimension of w may be much smaller than |S|.
2) Generalization: When a state s is visited, the parameter w is updated
so that the values of some other unvisited states can also be updated.
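The sketch below contrasts the two representations on the "retrieve and change" point; the feature map and the numbers are assumptions for illustration only.

```python
import numpy as np

# Tabular representation: one entry per state.
v_table = np.zeros(100)
v_table[3] = 0.7                      # changing v(s3) touches only that entry

# Function representation: values are generated by a parameterized function.
w = np.zeros(3)
def phi(s):                            # assumed feature map [s^2, s, 1]
    return np.array([s**2, s, 1.0])

def v_hat(s, w):
    return phi(s) @ w                  # retrieve: evaluate the function

# "Change v(s3)": nudge w so that v_hat(3, w) moves toward a target value.
target, alpha = 0.7, 0.01
w += alpha * (target - v_hat(3, w)) * phi(3)   # also changes v_hat at other states
```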
Objective function
Let vπ(s) denote the true state value and v̂(s, w) its approximation. The objective function is the expected squared estimation error
      J(w) = E[(vπ(S) − v̂(S, w))²],
where the expectation is with respect to a probability distribution of the state S. The first way to choose this distribution is the uniform distribution, which treats all the states as equally important and assigns probability 1/|S| to each state.
• Drawback: the states may not be equally important. For example, some states may be rarely visited by a policy. Hence, the uniform distribution does not reflect the real dynamics of the Markov process under the given policy.
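To make the role of the state distribution concrete, the following sketch (numpy; all inputs are hypothetical) evaluates the weighted squared error Σ_s d(s)(vπ(s) − v̂(s, w))² for an arbitrary state distribution d, which reduces to the uniform-weighting objective when d(s) = 1/|S|.

```python
import numpy as np

def weighted_objective(v_pi, Phi, w, d):
    """J(w) = sum_s d(s) * (v_pi(s) - phi(s)^T w)^2 for a state distribution d.

    v_pi: true values, shape (n,); Phi: feature matrix, shape (n, m);
    w: parameters, shape (m,); d: state probabilities summing to 1, shape (n,).
    """
    errors = v_pi - Phi @ w
    return float(np.sum(d * errors ** 2))

# Uniform weighting treats every state alike, e.g. d = np.full(n, 1.0 / n);
# a visit-frequency weighting emphasizes the states the policy reaches often.
```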
Objective function – Stationary distribution
The second way is to use the stationary distribution dπ of the Markov process under the policy π. It describes the long-run fraction of time that the agent spends in each state, so frequently visited states receive larger weights in J(w).
Objective function - Stationary distribution
Illustrative example:
• Given a policy shown in the figure.
• Let nπ (s) denote the number of times that s has been visited in a very long
episode generated by π.
• Then, dπ(s) can be approximated by
      dπ(s) ≈ nπ(s) / Σ_{s′∈S} nπ(s′).
[Figure: a grid-world example and the visit frequencies nπ(s)/Σ_{s′} nπ(s′) of the states over 1,000 steps; the frequencies converge to constant values as the step index grows.]
Objective function - Stationary distribution
The converged values can be predicted because they are the entries of dπ, which satisfies
      dπT = dπT Pπ.
It can be calculated that the left eigenvector for the eigenvalue of one is
      dπ = [0.0345, 0.1084, 0.1330, 0.7241]T.
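A quick way to compute dπ numerically is to take the left eigenvector of Pπ for the eigenvalue 1. The sketch below uses a made-up 4-state transition matrix, not the one behind the figure above.

```python
import numpy as np

# Made-up 4-state transition matrix P_pi under a fixed policy (rows sum to 1).
P = np.array([
    [0.3, 0.3, 0.3, 0.1],
    [0.1, 0.3, 0.3, 0.3],
    [0.1, 0.2, 0.3, 0.4],
    [0.1, 0.1, 0.1, 0.7],
])

# d_pi^T = d_pi^T P  <=>  d_pi is a left eigenvector of P for eigenvalue 1,
# i.e., an ordinary (right) eigenvector of P^T.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
d = np.real(eigvecs[:, idx])
d = d / d.sum()                       # normalize to a probability vector
print("stationary distribution d_pi:", d)

# Cross-check: simulating the chain and counting visit frequencies, as in
# the slide, approaches the same vector in the long run.
```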
Optimization algorithms
Now that we have the objective function, the next step is to optimize it.
• To minimize the objective function J(w), we can use the gradient-descent algorithm:
      wk+1 = wk − αk ∇w J(wk).
• The true gradient involves an expectation over the states, so in practice we use stochastic gradient descent, replacing the true gradient with a sample-based one:
      wt+1 = wt + αt (vπ(st) − v̂(st, wt)) ∇w v̂(st, wt).
• This update is not directly implementable because it requires the true value vπ(st), which is unknown and must be replaced by an estimate.
Optimization algorithms
In particular,
• First, Monte Carlo learning with function approximation.
  Let gt be the discounted return starting from st in the episode. Then, gt can be used to approximate vπ(st), and the algorithm becomes
      wt+1 = wt + αt (gt − v̂(st, wt)) ∇w v̂(st, wt).
• Second, TD learning with function approximation.
  Replacing vπ(st) with the TD target rt+1 + γ v̂(st+1, wt) gives
      wt+1 = wt + αt [rt+1 + γ v̂(st+1, wt) − v̂(st, wt)] ∇w v̂(st, wt).
  This algorithm can only estimate the state values of a given policy, but it is important for understanding the other algorithms introduced later.
Selection of function approximators
An important question that has not been answered: How to select the function
v̂(s, w)?
• The first approach, which was widely used in the past, is to use a linear function
      v̂(s, w) = φT(s)w.
  Here, φ(s) is the feature vector, which can be a polynomial basis, a Fourier basis, and so on (see my book for details). We have seen this in the motivating example and will see it again in the illustrative examples later.
• The second approach, which is widely used nowadays, is to use a neural network as a nonlinear function approximator. The input of the network is the state, the output is v̂(s, w), and the network parameter is w.
Linear function approximation
Consider the linear case, where v̂(s, w) = φT(s)w and hence
      ∇w v̂(s, w) = φ(s).
Substituting this gradient into the TD algorithm above yields the TD-Linear algorithm:
      wt+1 = wt + αt [rt+1 + γ φT(st+1)wt − φT(st)wt] φ(st).
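A minimal sketch of the TD-Linear algorithm is given below; the environment interface (env.reset, env.step), the policy, and the feature map phi are assumed placeholders rather than code from the book, and termination handling is kept simple.

```python
import numpy as np

def td_linear(env, policy, phi, dim, gamma=0.9, alpha=0.0005, num_episodes=500):
    """TD-Linear: estimate v_pi(s) ~= phi(s)^T w for a given policy.

    Assumed interfaces: env.reset() -> s, env.step(a) -> (s_next, r, done),
    policy(s) -> a, phi(s) -> feature vector of length `dim`.
    """
    w = np.zeros(dim)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD error with linear approximation: r + gamma*phi(s')^T w - phi(s)^T w
            td_error = r + gamma * phi(s_next) @ w - phi(s) @ w
            # Gradient of v_hat(s, w) = phi(s)^T w is simply phi(s).
            w += alpha * td_error * phi(s)
            s = s_next
    return w
```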
Linear function approximation
We next show that the tabular representation is a special case of linear function approximation. In this way, the tabular and function representations are unified!
• First, consider the special feature vector for state s:
      φ(s) = es ∈ R|S|,
  where es is the vector whose s-th entry is 1 and all other entries are 0.
• Then, v̂(s, w) = φT(s)w = eTs w = w(s), where w(s) denotes the s-th entry of w. That is, the parameter vector simply stores one value per state, which is exactly the tabular representation.
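The following toy check shows that one TD-Linear update with the one-hot feature es touches only the s-th entry of w, exactly as a tabular TD update would (the transition and reward are made up).

```python
import numpy as np

num_states = 5
def phi_onehot(s):                    # e_s: one-hot feature vector
    e = np.zeros(num_states)
    e[s] = 1.0
    return e

w = np.zeros(num_states)              # with one-hot features, w IS the value table
gamma, alpha = 0.9, 0.1

# One TD-Linear update for a (hypothetical) transition s=2 -> s'=3 with reward r=1:
s, s_next, r = 2, 3, 1.0
td_error = r + gamma * phi_onehot(s_next) @ w - phi_onehot(s) @ w
w += alpha * td_error * phi_onehot(s)
# Only w[2] changed, by alpha * (r + gamma*w[3] - w[2]): the tabular TD update.
```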
Illustrative examples
Ground truth:
• The true state values and their 3D visualization. [Figure: the true state values in the 5×5 grid world, shown numerically and as a 3D surface.]
Experience samples:
• 500 episodes were generated following the given policy.
• Each episode has 500 steps and starts from a randomly selected state-action
pair following a uniform distribution.
Illustrative examples
In the linear case, we first select the feature vector
      φ(s) = [1, x, y]T ∈ R³,
where (x, y) is the position of state s in the grid. Then, v̂(s, w) = φT(s)w = w1 + w2 x + w3 y, which is a plane over the grid.
Notably, φ(s) can also be defined as φ(s) = [x, y, 1]T; the order of the elements does not matter.
Illustrative examples
[Figure: the state values estimated by TD-Linear (α = 0.0005) and the corresponding estimation error versus the episode index, over 500 episodes; the error decreases but does not vanish.]
• The trend is right, but there are errors due to limited approximation ability!
• We are trying to use a plane to approximate a non-plane surface!
Illustrative examples
To improve the approximation ability, we can use higher-order feature vectors, for example,
      φ(s) = [1, x, y, x², y², xy]T ∈ R⁶.
In this case,
      v̂(s, w) = φT(s)w = w1 + w2 x + w3 y + w4 x² + w5 y² + w6 xy.
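For reference, the two feature constructions could be coded as below; the convention that (x, y) ranges over 1..5 is an assumption for illustration.

```python
import numpy as np

# Feature vectors for a 5x5 grid world, where each state s is identified
# with its position (x, y) (assumed convention: x is the row, y the column).
def phi_linear(x, y):
    return np.array([1.0, x, y])                      # v_hat = w1 + w2*x + w3*y

def phi_quadratic(x, y):
    return np.array([1.0, x, y, x**2, y**2, x*y])     # adds curvature terms

# Stack the features of all 25 states into matrices (one row per state).
states = [(x, y) for x in range(1, 6) for y in range(1, 6)]
Phi_lin  = np.array([phi_linear(x, y) for x, y in states])    # shape (25, 3)
Phi_quad = np.array([phi_quadratic(x, y) for x, y in states]) # shape (25, 6)
```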
Illustrative examples
[Figure: the state values estimated by TD-Linear with the quadratic features and the corresponding estimation error versus the episode index, over 500 episodes; the approximation error is reduced compared with the linear features but still does not vanish.]
Summary of the story
• Objective: find w that minimizes J(w) = E[(vπ(S) − v̂(S, w))²].
• Method: gradient descent, with the true gradient replaced by a stochastic one and the unknown vπ(st) replaced by an approximation: the discounted return gt (Monte Carlo) or the TD target rt+1 + γ v̂(st+1, wt) (TD learning).
• In the linear case v̂(s, w) = φT(s)w, this gives the TD-Linear algorithm.
Theoretical analysis (optional)
• The algorithm analyzed in this part is the TD-Linear algorithm introduced above. Its convergence can be established rigorously; see the book for the detailed analysis.
Sarsa with function approximation
The action value is now a parameterized function q̂(s, a, w). The Sarsa algorithm with function approximation updates w by
      wt+1 = wt + αt [rt+1 + γ q̂(st+1, at+1, wt) − q̂(st, at, wt)] ∇w q̂(st, at, wt).
Sarsa with function approximation
To search for optimal policies, we can combine this policy evaluation step with a policy improvement step (for example, an ε-greedy update of the policy), as sketched in the code below.
Aim: search for a policy that can lead the agent to the target from an initial state-action pair (s0, a0).
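A minimal sketch of Sarsa with linear function approximation and ε-greedy exploration is shown below; the environment interface, the state-action feature map phi(s, a), and the termination handling are simplifying assumptions, not the book's exact pseudocode.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action: explore with probability epsilon, otherwise be greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def sarsa_linear(env, phi, num_actions, dim, gamma=0.9, epsilon=0.1,
                 alpha=0.001, num_episodes=500, seed=0):
    """Sarsa with q_hat(s, a, w) = phi(s, a)^T w (interfaces are assumed)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    q = lambda state, action: phi(state, action) @ w
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy([q(s, b) for b in range(num_actions)], epsilon, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy([q(s_next, b) for b in range(num_actions)],
                                    epsilon, rng)
            # Sarsa bootstraps on the action actually chosen at the next state.
            td_error = r + gamma * q(s_next, a_next) - q(s, a)
            w += alpha * td_error * phi(s, a)
            s, a = s_next, a_next
    return w
```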
Sarsa with function approximation
Illustrative example:
• Sarsa with linear function approximation.
• γ = 0.9, ε = 0.1, rboundary = rforbidden = −10, rtarget = 1, α = 0.001.
[Figure: the policy learned in the 5×5 grid world, together with the total reward and the episode length of each of the 500 episodes.]
Q-learning with function approximation
Similar to Sarsa, tabular Q-learning can also be extended to the case of value function approximation. The update rule is
      wt+1 = wt + αt [rt+1 + γ max_{a∈A(st+1)} q̂(st+1, a, wt) − q̂(st, at, wt)] ∇w q̂(st, at, wt),
which is the same as Sarsa except that the TD target uses the maximum action value at the next state (see the sketch below).
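Compared with the Sarsa sketch above, only the TD target changes. A hypothetical one-step update with linear function approximation could look like this:

```python
import numpy as np

def q_learning_update(w, phi, s, a, r, s_next, num_actions,
                      gamma=0.9, alpha=0.001):
    """One Q-learning step with q_hat(s, a, w) = phi(s, a)^T w (phi is assumed)."""
    q = lambda state, action: phi(state, action) @ w
    # Q-learning bootstraps on the greedy (max) action at the next state.
    td_target = r + gamma * max(q(s_next, b) for b in range(num_actions))
    td_error = td_target - q(s, a)
    return w + alpha * td_error * phi(s, a)
```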
Q-learning with function approximation
Illustrative example:
• Q-learning with linear function approximation.
• γ = 0.9, ε = 0.1, rboundary = rforbidden = −10, rtarget = 1, α = 0.001.
[Figure: the policy learned in the 5×5 grid world, together with the total reward and the episode length of each of the 500 episodes.]
Deep Q-learning
Deep Q-learning (or deep Q-network, DQN) uses a deep neural network q̂(s, a, w) to approximate the optimal action values. It aims to minimize the loss function
      J(w) = E[(R + γ max_{a∈A(S′)} q̂(S′, a, w) − q̂(S, A, w))²],
where (S, A, R, S′) are random variables. The quantity y := R + γ max_{a∈A(S′)} q̂(S′, a, w) plays the role of the target.
Deep Q-learning
• Note that
      ∇w y ≠ γ max_{a∈A(S′)} ∇w q̂(S′, a, w).
• To solve this problem, we can assume that the w in y is fixed (at least for a while) when we calculate the gradient.
Deep Q-learning - Two networks
First technique:
• Two networks, a main network and a target network.
Why is it used?
• The mathematical reason was explained above: when calculating the gradient, the parameter w that appears in the target y is treated as fixed for a while.
Implementation details:
• Let w and wT denote the parameters of the main and target networks,
respectively. They are set to be the same initially.
• In every iteration, we draw a mini-batch of samples {(s, a, r, s′)} from the replay buffer (explained later).
• The inputs of the networks include the state s and the action a. The target output is yT = r + γ max_{a∈A(s′)} q̂(s′, a, wT). Then, we directly minimize the TD error, also called the loss function, (yT − q̂(s, a, w))² over the mini-batch {(s, a, yT)}.
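A minimal PyTorch sketch of one training iteration with a main network and a target network is given below; the network architecture, batch format, and hyperparameters are illustrative assumptions rather than the exact setup of these slides.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """q_hat(s, a, w): input is a (state, action) feature vector, output a scalar."""
    def __init__(self, in_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

def dqn_step(main_net, target_net, optimizer, batch, gamma=0.9):
    """One gradient step on the loss (y_T - q_hat(s, a, w))^2 over a mini-batch.

    `batch` is assumed to provide, for each sample, the (s, a) input features,
    the reward, and the (s', a') input features for every candidate action a'.
    """
    sa, r, next_sa_all = batch          # shapes: (B, d), (B,), (B, |A|, d)
    with torch.no_grad():               # target network: no gradient through y_T
        q_next = target_net(next_sa_all)            # (B, |A|)
        y = r + gamma * q_next.max(dim=1).values    # (B,)
    loss = ((y - main_net(sa)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example setup: optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3).
# Periodically synchronize the target network with the main network:
# target_net.load_state_dict(main_net.state_dict())
```

Reading the target network only inside torch.no_grad() implements the assumption above that the w inside y is kept fixed while the gradient is computed; wT is then synchronized with w from time to time, as in the comment above.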
Deep Q-learning - Experience replay
Another technique:
• Experience replay
Question: What is experience replay?
Answer:
• After we have collected some experience samples, we do NOT use them in the order they were collected.
• Instead, we store them in a set, called the replay buffer, B = {(s, a, r, s′)}.
• Every time we train the neural network, we draw a mini-batch of random samples from the replay buffer.
• The drawing of samples, called experience replay, should follow a uniform distribution (why? because we have no prior knowledge about which samples are more important). Can we use the stationary distribution here as before? No, since no policy is given.
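A replay buffer is essentially a bounded container with uniform random sampling; here is a minimal sketch (the capacity and batch size are arbitrary choices).

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next) tuples and samples mini-batches uniformly."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # old samples are dropped when full

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the correlation between consecutive
        # samples collected within the same episode.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```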
Deep Q-learning - Experience replay
Recall the objective function J(w) = E[(R + γ max_{a∈A(S′)} q̂(S′, a, w) − q̂(S, A, w))²], where the random variables are distributed as follows.
• R ∼ p(R|S, A) and S′ ∼ p(S′|S, A): R and S′ are determined by the system model.
• (S, A) ∼ d: the pair (S, A) is treated as a single random variable.
• The distribution d of the state-action pair (S, A) is assumed to be uniform (can we assume other distributions, such as the stationary distribution?).
Deep Q-learning - Experience replay
Answer (continued):
• However, the samples are not collected uniformly: they are generated consecutively by certain policies, so consecutive samples are correlated.
• To break the correlation between consecutive samples, we can use the experience replay technique, which draws samples uniformly from the replay buffer.
• This is the mathematical reason why experience replay is necessary and why the replay must be uniform.
Deep Q-learning
Aim: learn an optimal target network to approximate the optimal action values from the experience samples generated by a behavior policy πb.
Remarks:
• Why is there no policy update? Because the algorithm is off-policy: it only needs to learn the optimal action values, from which the optimal greedy policy can be obtained directly.
• The network input and output here are different from those in the original DQN paper, where the input is the state and the output is the value of every action.
Deep Q-learning
Illustrative example:
• This example aims to learn optimal action values for every state-action pair.
• Once the optimal action values are obtained, the optimal greedy policy can
be obtained immediately.
Deep Q-learning
Setup:
• A single episode is used to train the network.
• This episode is generated by an exploratory behavior policy, shown in Figure (a).
• The episode has only 1,000 steps, whereas tabular Q-learning requires 100,000 steps.
• A shallow neural network with a single hidden layer of 100 neurons is used as a nonlinear approximator of q̂(s, a, w).
See the book for details.
Deep Q-learning
[Figure: results in the 5×5 grid world; the TD error and the state-value estimation error are plotted against the iteration index (0 to 1,000).]
The TD error converges to zero. The state estimation error converges to zero.
Deep Q-learning
[Figure: results for a second setting in the 5×5 grid world; the TD error and the state-value estimation error are plotted against the iteration index (0 to 1,000).]
The TD error converges to zero, but the state estimation error does not converge to zero.
Summary
• From the tabular representation to the function representation: v̂(s, w) ≈ vπ(s).
• The objective function J(w) = E[(vπ(S) − v̂(S, w))²] and its minimization by stochastic gradient methods, in particular the TD-Linear algorithm.
• Sarsa and Q-learning with function approximation.
• Deep Q-learning: a neural network approximator, a target network, and experience replay.