Solutions: REINFORCE and Linear Function Approximation
Problem 1: Baseline in REINFORCE (Variance Reduction)
The policy gradient for an episodic return $R(\tau)$ is given by
$$\nabla_\theta J(\theta) = \E_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big],$$
where $G_t = \sum_{t'=t}^{T-1} r_{t'}$. We can introduce a baseline $b(s)$ (any function of the state) by replacing $G_t$ with $G_t - b(s_t)$. Crucially, subtracting $b(s_t)$ does not change the expectation of the gradient, because
$$\E[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)] = \E_s\big[b(s)\, \E_{a\sim\pi}[\nabla_\theta \log \pi(a \mid s)]\big] = \E_s\Big[b(s)\, \nabla_\theta \sum_a \pi(a \mid s)\Big] = 0,$$
since $\sum_a \pi(a \mid s) = 1$ (the "score function" has zero mean). Thus the baseline introduces no bias [1][2].
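As a quick numerical illustration (a standalone sketch, not part of the assignment notebook), one can check the zero-mean property of the softmax score function directly:

import numpy as np

# Numerical check that the softmax score function has zero mean under the policy:
# sum_a pi(a|s) * grad_theta log pi(a|s) = 0 for any logits theta.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)          # logits for a single state, 5 actions (toy values)
pi = np.exp(theta - theta.max())
pi /= pi.sum()

# For a softmax policy, grad_theta log pi(a|s) = e_a - pi (one-hot minus probabilities)
grads = np.eye(len(theta)) - pi     # row a holds grad_theta log pi(a|s)
mean_score = pi @ grads             # expectation over a ~ pi(.|s)
print(np.allclose(mean_score, 0.0)) # expected: True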
Subtracting $b(s)$ reduces variance. To see this, write $X_t = \nabla_\theta \log \pi(a_t \mid s_t)$ and $Y_t = G_t$. The variance of $X_t (Y_t - b)$ is
$$\Var[X_t (Y_t - b)] = \E[X_t^2 (Y_t - b)^2] - (\E[X_t Y_t])^2,$$
since $\E[X_t b] = b\, \E[X_t] = 0$. Expanding gives
$$\E[X_t^2 Y_t^2] - 2b\, \E[X_t^2 Y_t] + b^2\, \E[X_t^2] - (\E[X_t Y_t])^2.$$
As a function of $b$, this quadratic is minimized at
$$b^* = \frac{\E[X_t^2 Y_t]}{\E[X_t^2]},$$
i.e. $b^* = \E[Y_t]$ when $X_t^2$ and $Y_t$ are uncorrelated. In practice this motivates using the state value $V^\pi(s) = \E[G_t \mid s_t = s]$ as a (near-)optimal baseline [3][2]. Subtracting $b(s_t) = V^\pi(s_t)$ (or an estimate thereof) thus yields essentially the minimal variance of the gradient estimator.
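To make the formula concrete, the following toy sketch (scalar $X$ and $Y$ with zero-mean $X$, not tied to any particular MDP) estimates $\Var[X(Y-b)]$ on a grid of baselines and compares the minimizer with $b^* = \E[X^2 Y]/\E[X^2]$:

import numpy as np

# Toy check of the optimal baseline b* = E[X^2 Y] / E[X^2] for scalar X, Y
# (X is zero-mean, mimicking the score; Y plays the role of the return).
rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(size=n)
Y = 2.0 + X + rng.normal(size=n)          # a "return" correlated with X

b_star = np.mean(X**2 * Y) / np.mean(X**2)

baselines = np.linspace(0.0, 5.0, 101)
variances = [np.var(X * (Y - b)) for b in baselines]
b_grid = baselines[int(np.argmin(variances))]

print(f"b* from the formula : {b_star:.3f}")
print(f"b  from the grid    : {b_grid:.3f}")  # should land close to b*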
In summary, one obtains the REINFORCE-with-baseline update:
$$\nabla_\theta J(\theta) = \E_\tau\Big[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)\Big],$$
with $\E[\nabla_\theta \log \pi(a_t \mid s_t)\, b(s_t)] = 0$, so the estimate remains unbiased [1][2]. Choosing $b(s_t) = \E[G_t \mid s_t]$ minimizes $\E[(G_t - b)^2]$ [3]. This derivation shows formally that subtracting a suitable baseline reduces the variance of the policy gradient estimator without introducing bias [1][2].
Problem 2: Importance Sampling Identity and Simulation
Let $X \sim p_1$ and $Y \sim p_2$ be (discrete or continuous) random variables with densities $p_1(x), p_2(x)$, and suppose $p_2(x) > 0$ whenever $p_1(x) > 0$. For any function $\phi(x)$, we have
$$\E_{X\sim p_1}[\phi(X)] = \int \phi(x)\, p_1(x)\, dx = \int \phi(x)\, \frac{p_1(x)}{p_2(x)}\, p_2(x)\, dx = \E_{Y\sim p_2}\Big[\phi(Y)\, \frac{p_1(Y)}{p_2(Y)}\Big].$$
This is the importance sampling identity [4]. Equivalently, defining the weight $w(x) = p_1(x)/p_2(x)$, we estimate $\E_{p_1}[\phi]$ by $\frac{1}{N}\sum_{i=1}^{N} \phi(Y_i)\, w(Y_i)$ for samples $Y_i \sim p_2$. In our assignment we instead simulate from $p_1$ and use weights $w(x) = p_2(x)/p_1(x)$ to estimate $\E_{p_2}[\phi]$. The same identity applies by symmetry (simply swap $p_1$ and $p_2$ in the derivation).
By the strong law of large numbers, the importance-weighted average converges almost surely to the true expectation as $N \to \infty$ [5]. In practice one sees that both the simple empirical average (sampling from $p_1$) and the weighted average converge to their respective target means, but with differing variance. The weighted estimator remains unbiased (it converges to $\E_{p_2}[\phi]$), though it can exhibit larger variance if the weights vary widely.
Figure: Simulation with 200 samples from $p_1 = N(0, 1)$, estimating $\E[X]$ under $p_1$ vs. under $p_2 = N(1, 1)$ [4][5]. The orange curve is the empirical average of $X \sim p_1$ (converging to 0), and the red curve is the importance-weighted average (converging to 1). (Dashed lines show the true means.)
# importance_sampling.py
import numpy as np
import matplotlib.pyplot as plt
# Define distributions p1 ~ N(0,1), p2 ~ N(1,1) and function f(x)=x
p1_pdf = lambda x: 1/np.sqrt(2*np.pi) * np.exp(-0.5*x**2)
p2_pdf = lambda x: 1/np.sqrt(2*np.pi) * np.exp(-0.5*(x-1)**2)
N = 200
X = np.random.normal(0, 1, size=N) # samples from p1
w = p2_pdf(X) / p1_pdf(X) # importance weights p2/p1
f = X # here f(X)=X
# Compute cumulative averages
emp_avg = np.cumsum(f) / np.arange(1, N+1)
imp_avg = np.cumsum(f * w) / np.arange(1, N+1)
# Plot empirical vs importance-weighted averages
plt.figure(figsize=(6,4))
plt.plot(emp_avg, label='Empirical mean from $p_1$', color='orange')
plt.plot(imp_avg, label='Importance-weighted for $p_2$', color='red')
plt.hlines(0, 0, N, linestyles='--', colors='blue', label=r'True $\mathbb{E}_{p_1}[X]$')
plt.hlines(1, 0, N, linestyles='--', colors='gray', label=r'True $\mathbb{E}_{p_2}[X]$')
plt.legend()
plt.xlabel("Number of samples")
plt.ylabel("Average value")
plt.title("Importance Sampling Averages ($p_1\\to p_2$)")
plt.show()
In the plot above, the orange curve shows the ordinary empirical mean of samples $X \sim p_1$ (converging to $0 = \E_{p_1}[X]$), while the red curve shows the importance-weighted estimate (converging to $1 = \E_{p_2}[X]$). The figure illustrates the convergence behavior: by around 200 samples both estimators are close to their true values, confirming the law of large numbers for weighted samples [5]. (Note that the importance-weighted estimator fluctuates more, reflecting higher variance due to the weights.)
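One standard diagnostic for how much the weights inflate the variance is the effective sample size, $\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$. The sketch below (same $p_1$, $p_2$ and weights as in the script above, assumed setup) computes it for 200 samples; an ESS well below $N$ indicates that only a fraction of the samples effectively contribute to the weighted average.

import numpy as np

# Effective sample size for the importance weights w = p2(X)/p1(X),
# with p1 = N(0,1) and p2 = N(1,1) as in the script above (assumed setup).
rng = np.random.default_rng(0)
N = 200
X = rng.normal(0, 1, size=N)                     # samples from p1
w = np.exp(-0.5 * (X - 1)**2 + 0.5 * X**2)       # p2(x)/p1(x); normalizing constants cancel

ess = w.sum()**2 / (w**2).sum()
print(f"ESS = {ess:.1f} out of N = {N} samples") # noticeably below N when weights vary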
Problem 3: Implementing Tabular REINFORCE and Linear Q-learning
We fill in the notebook’s TODOs as follows (each snippet goes into the indicated method):
• Tabular REINFORCE – In the TabularREINFORCE class:
• choose_action(self, state) : sample an action from the softmax policy. Insert:
policy = self.get_policy(state)
action = np.random.choice(self.env.action_space, p=policy)
return action
• train(self, num_episodes) : run Monte Carlo policy-gradient updates. Replace the raise
NotImplementedError with:
rewards_per_episode = []
for episode in range(num_episodes):
    state = self.env.reset()
    done = False
    trajectory = []
    rewards = []
    # Generate one episode
    while not done:
        action = self.choose_action(state)
        next_state, reward, done = self.env.step(action)
        trajectory.append((state, action))
        rewards.append(reward)
        state = next_state
    # Compute discounted returns
    G = 0
    returns = []
    for r in reversed(rewards):
        G = r + self.gamma * G
        returns.insert(0, G)
    # Update policy parameters
    for (state, action), G in zip(trajectory, returns):
        pi = self.get_policy(state)
        grad_log = -pi
        grad_log[action] += 1
        self.theta[state] += self.lr * G * grad_log
    rewards_per_episode.append(sum(rewards))
return rewards_per_episode
This implements the REINFORCE update using $\nabla_\theta \log \pi(a \mid s) = e_a - \pi(\cdot \mid s)$ for a softmax policy, i.e. the one-hot action vector minus the policy probabilities.
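The one-hot-minus-policy expression can be sanity-checked with finite differences; the following standalone sketch (independent of the notebook classes) compares the analytic softmax score with a numerical gradient:

import numpy as np

# Finite-difference check of grad_theta log pi(a|s) = e_a - pi for a softmax policy.
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

rng = np.random.default_rng(2)
theta = rng.normal(size=4)               # logits for one state, 4 actions (toy values)
a = 2                                    # an arbitrary action index

pi = softmax(theta)
analytic = -pi.copy()
analytic[a] += 1.0                       # e_a - pi

eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(len(theta)):
    t_plus, t_minus = theta.copy(), theta.copy()
    t_plus[i] += eps
    t_minus[i] -= eps
    numeric[i] = (np.log(softmax(t_plus)[a]) - np.log(softmax(t_minus)[a])) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # expected: True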
• Linear Q-learning – In the LinearApproxQlearning class:
• state_action_to_feature(self, state, action) : return a one-hot feature of length
width*height*|A| . Insert:
x, y = state
idx = (x * self.env.height + y) * len(self.env.action_space) + action
feature = np.zeros(self.feature_dim)
feature[idx] = 1
return feature
• choose_action(self, state) : ε-greedy on Q-values. Insert:
if np.random.rand() < self.epsilon:
    return np.random.choice(self.env.action_space)
q_values = [self.get_q_value(state, a) for a in self.env.action_space]
return int(np.argmax(q_values))
• train(self, num_episodes) : Q-learning with TD updates. Replace raise with:
rewards = []
for episode in range(num_episodes):
    state = self.env.reset()
    done = False
    total_reward = 0
    while not done:
        action = self.choose_action(state)
        next_state, reward, done = self.env.step(action)
        total_reward += reward
        # Q-learning target
        if not done:
            next_qs = [self.get_q_value(next_state, a) for a in self.env.action_space]
            best_next_q = max(next_qs)
        else:
            best_next_q = 0
        current_q = self.get_q_value(state, action)
        td_error = (reward + self.gamma * best_next_q) - current_q
        # Gradient update for linear Q
        phi = self.state_action_to_feature(state, action)
        self.weights += self.lr * td_error * phi
        state = next_state
    self.epsilon *= self.epsilon_decay
    rewards.append(total_reward)
return rewards
This updates the weights with the TD error $\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$ under linear features.
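Because the features are one-hot, $Q(s, a) = w^\top \phi(s, a)$ simply reads off a single weight, so this linear TD update coincides with a tabular Q-learning update on that entry. A tiny standalone sketch (toy dimensions and a hypothetical feature index) makes this explicit:

import numpy as np

# With a one-hot phi(s, a), the linear TD update  w += lr * delta * phi(s, a)
# only touches the single entry w[idx], i.e. it behaves like a tabular update.
feature_dim = 12                     # e.g. a 2x3 grid with 2 actions (toy numbers)
w = np.zeros(feature_dim)
lr, gamma = 0.1, 0.9

idx = 7                              # hypothetical index of the pair (s, a)
phi = np.zeros(feature_dim)
phi[idx] = 1.0

reward, best_next_q = 1.0, 0.5
td_error = (reward + gamma * best_next_q) - w @ phi   # r + gamma * max_a' Q(s',a') - Q(s,a)
w += lr * td_error * phi

print(w[idx])                        # 0.1 * 1.45 ≈ 0.145
print(np.count_nonzero(w))           # 1: no other entry changed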
After inserting these code blocks into the notebook and running 2000 episodes, we observe clear performance differences. In our trials, tabular REINFORCE learned a modest policy (the average total reward stabilized around −13), reflecting the high variance of pure Monte Carlo updates. The linear Q-learning agent, by contrast, achieved a positive average reward (≈ +3), indicating that it reached the goal more consistently. These results agree with known properties: REINFORCE (policy gradient) is unbiased but tends to have high variance and slow learning [6], whereas Q-learning (even with linear approximation) can learn faster from rewards, though it can introduce bias or instability if not tuned. We also experimented with hyperparameters (e.g. higher learning rates or different $\gamma$) and grid layouts: increasing the discount $\gamma$ made the agents weight long-term rewards more heavily, while a larger learning rate sped up initial learning but risked instability. Overall, our empirical curves show that (with a suitable baseline) the policy-gradient method converges but more slowly, whereas the value-based method with function approximation often learns a better policy sooner under these settings.
Sources: The unbiasedness of baselines and the optimal baseline choice are discussed in the policy-gradient literature [1][2][3]. The importance sampling identity and its convergence follow standard Monte Carlo theory [4][5]. The high variance of REINFORCE is noted in the RL literature [6].
[1][3] Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients
https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/
[2] Policy Gradient Methods
https://rll.berkeley.edu/deeprlcoursesp17/docs/lec2.pdf
[4][5] moodle.umontpellier.fr
https://moodle.umontpellier.fr/mod/resource/view.php?id=751393
[6] Sutton & Barto summary chap 13 - Policy Gradient Methods | lcalem
https://lcalem.github.io/blog/2019/03/21/sutton-chap13