Policy Gradients
CS 285
Instructor: Sergey Levine
UC Berkeley
The goal of reinforcement learning
we’ll come back to partially observed later
The goal of reinforcement learning
infinite horizon case
finite horizon case
Evaluating the objective
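For reference, a sketch of the objective and its Monte Carlo estimate in the standard notation of the course (θ: policy parameters, τ: a trajectory obtained by running π_θ, N: number of sampled rollouts):

J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[ \sum_t r(\mathbf{s}_t, \mathbf{a}_t) \right] \approx \frac{1}{N} \sum_{i=1}^N \sum_t r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t})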
Direct policy differentiation
a convenient identity
Direct policy differentiation
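The "convenient identity" is the log-derivative (likelihood ratio) trick; a sketch of how it yields the policy gradient, using the fact that the initial-state and transition terms of \log p_\theta(\tau) do not depend on θ:

p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau)

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\big[ \nabla_\theta \log p_\theta(\tau)\, r(\tau) \big] = E_{\tau \sim p_\theta(\tau)}\left[ \left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) \right) \left( \sum_{t=1}^T r(\mathbf{s}_t, \mathbf{a}_t) \right) \right]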
Evaluating the policy gradient
[Diagram: the RL anatomy loop: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]
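A minimal NumPy sketch of evaluating this gradient from sampled rollouts (REINFORCE). The helper grad_log_prob is hypothetical: it is assumed to return ∇_θ log π_θ(a|s) as a flat array of length num_params.

import numpy as np

def reinforce_gradient(grad_log_prob, trajectories, num_params):
    # trajectories: list of N rollouts, each a list of (state, action, reward) tuples
    # grad_log_prob(s, a): hypothetical helper returning grad_theta log pi_theta(a|s)
    grad = np.zeros(num_params)
    for traj in trajectories:
        total_reward = sum(r for (_, _, r) in traj)     # r(tau): total reward of the rollout
        for (s, a, _) in traj:
            grad += grad_log_prob(s, a) * total_reward  # grad log pi * r(tau)
    return grad / len(trajectories)                     # average over the N sampled rollouts

This is the same quantity that the automatic-differentiation pseudocode later in the lecture computes, without forming per-sample gradients explicitly.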
Understanding Policy Gradients
Evaluating the policy gradient
Comparison to maximum likelihood
[figure labels: training data; supervised learning]
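For comparison, a sketch of the two gradients side by side: the policy gradient is the maximum likelihood gradient with each sampled trajectory re-weighted by its total reward.

\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t})

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t}) \right) \left( \sum_{t=1}^T r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right)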
Example: Gaussian policies
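A sketch of this example for a Gaussian policy with mean given by a network f_θ and a fixed covariance Σ (up to transposes and constants):

\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) = \mathcal{N}\big(f_\theta(\mathbf{s}_t),\, \Sigma\big)

\log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) = -\frac{1}{2} \big(f_\theta(\mathbf{s}_t) - \mathbf{a}_t\big)^\top \Sigma^{-1} \big(f_\theta(\mathbf{s}_t) - \mathbf{a}_t\big) + \text{const}

\nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) = -\Sigma^{-1} \big(f_\theta(\mathbf{s}_t) - \mathbf{a}_t\big) \frac{d f_\theta}{d\theta}

These per-timestep terms are what get summed and weighted by the trajectory reward in the estimator above.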
What did we just do?
good stuff is made more likely
bad stuff is made less likely
simply formalizes the notion of “trial and error”!
Partial observability
What is wrong with the policy gradient?
high variance
Review
• Evaluating the RL objective
  • Generate samples
• Evaluating the policy gradient
  • Log-gradient trick
  • Generate samples
• Understanding the policy gradient
  • Formalization of trial-and-error
• Partial observability
  • Works just fine
• What is wrong with policy gradient?
Reducing Variance
Reducing variance
“reward to go”
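A sketch of the causality-based estimator, where each log-probability gradient is weighted only by rewards from that time step onward (the "reward to go" \hat{Q}_{i,t}):

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t}) \left( \sum_{t'=t}^{T} r(\mathbf{s}_{i,t'}, \mathbf{a}_{i,t'}) \right)

And a small Python helper for computing it (a backward cumulative sum; discounting omitted):

def reward_to_go(rewards):
    # rewards: list of per-step rewards for one trajectory
    # returns rtg with rtg[t] = sum of rewards from step t to the end
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg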
a convenient identity
Baselines
but… are we allowed to do that??
subtracting a baseline is unbiased in expectation!
average reward is not the best baseline, but it’s pretty good!
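A sketch of the baselined estimator and of why subtracting a constant baseline leaves the gradient unbiased in expectation:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log p_\theta(\tau_i) \big[ r(\tau_i) - b \big], \qquad b = \frac{1}{N} \sum_{i=1}^N r(\tau_i)

E_{\tau \sim p_\theta(\tau)}\big[ \nabla_\theta \log p_\theta(\tau)\, b \big] = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, b \, d\tau = \int \nabla_\theta p_\theta(\tau)\, b \, d\tau = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0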
Analyzing variance
This is just expected reward, but weighted by gradient magnitudes!
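A sketch of the optimal-baseline derivation: writing g(\tau) = \nabla_\theta \log p_\theta(\tau), minimizing the variance of g(\tau)\,(r(\tau) - b) with respect to b gives, per gradient dimension,

b = \frac{E\big[ g(\tau)^2\, r(\tau) \big]}{E\big[ g(\tau)^2 \big]}

i.e. the expected reward weighted by squared gradient magnitudes, which is what the note above refers to.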
Review
• The high variance of policy gradient
• Exploiting causality
  • Future doesn’t affect the past
• Baselines
  • Unbiased!
• Analyzing variance
  • Can derive optimal baselines
Off-Policy Policy Gradients
Policy gradient is on-policy
• Neural networks change only a little bit with each gradient step
• On-policy learning can be extremely inefficient!
Off-policy learning & importance sampling
importance sampling
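A sketch of the importance sampling identity and its application to the RL objective, with \bar{p}(\tau) denoting the trajectory distribution of the behavior (old) policy:

E_{x \sim p(x)}[f(x)] = \int p(x)\, f(x)\, dx = \int q(x)\, \frac{p(x)}{q(x)}\, f(x)\, dx = E_{x \sim q(x)}\!\left[ \frac{p(x)}{q(x)}\, f(x) \right]

J(\theta) = E_{\tau \sim \bar{p}(\tau)}\!\left[ \frac{p_\theta(\tau)}{\bar{p}(\tau)}\, r(\tau) \right]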
Deriving the policy gradient with IS
a convenient identity
The off-policy policy gradient
if we ignore this, we get a policy iteration algorithm (more on this in a later lecture)
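A sketch of why the trajectory importance weight only involves policy ratios (the initial-state and dynamics terms cancel), and why it can scale badly with T:

\frac{p_\theta(\tau)}{\bar{p}(\tau)} = \frac{p(\mathbf{s}_1) \prod_{t=1}^T \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)}{p(\mathbf{s}_1) \prod_{t=1}^T \bar{\pi}(\mathbf{a}_t \mid \mathbf{s}_t)\, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)} = \prod_{t=1}^T \frac{\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)}{\bar{\pi}(\mathbf{a}_t \mid \mathbf{s}_t)}

A product of T ratios that are each not equal to one can become exponentially large or small as T grows, which is the variance problem noted in the review below.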
A first-order approximation for IS (preview)
We’ll see why this is reasonable later in the course!
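A sketch of the approximation being previewed: keep only the per-timestep action ratio and drop the ratio of state marginals p_\theta(\mathbf{s}_{i,t}) / \bar{p}(\mathbf{s}_{i,t}):

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \frac{\pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t})}{\bar{\pi}(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t})}\, \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t})\, \hat{Q}_{i,t}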
Implementing Policy Gradients
Policy gradient with automatic differentiation
Pseudocode example (with discrete actions):
Maximum likelihood:
# Given:
# actions - (N*T) x Da tensor of (one-hot) actions
# states - (N*T) x Ds tensor of states
# Build the graph:
logits = policy.predictions(states) # This should return (N*T) x Da tensor of action logits
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
loss = tf.reduce_mean(negative_likelihoods)
gradients = tf.gradients(loss, variables) # variables: list of policy parameters (e.g. tf.trainable_variables())
Policy gradient with automatic differentiation
Pseudocode example (with discrete actions):
Policy gradient:
# Given:
# actions - (N*T) x Da tensor of (one-hot) actions
# states - (N*T) x Ds tensor of states
# q_values - (N*T) x 1 tensor of estimated state-action values ("reward to go")
# Build the graph:
logits = policy.predictions(states) # This should return (N*T) x Da tensor of action logits
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
weighted_negative_likelihoods = tf.multiply(negative_likelihoods, tf.squeeze(q_values, axis=1)) # squeeze to (N*T,) so shapes match
loss = tf.reduce_mean(weighted_negative_likelihoods)
gradients = tf.gradients(loss, variables) # variables: list of policy parameters (e.g. tf.trainable_variables())
Policy gradient in practice
• Remember that the gradient has high variance
• This isn’t the same as supervised learning!
• Gradients will be really noisy!
• Consider using much larger batches
• Tweaking learning rates is very hard
• Adaptive step size rules like ADAM can be OK-ish
• We’ll learn about policy gradient-specific learning rate adjustment methods later!
Review
• Policy gradient is on-policy
• Can derive off-policy variant
  • Use importance sampling
  • Exponential scaling in T
  • Can ignore state portion (approximation)
• Can implement with automatic differentiation – need to know what to backpropagate
• Practical considerations: batch size, learning rates, optimizers
Advanced Policy Gradients
What else is wrong with the policy gradient?
(image from Peters & Schaal 2008)
Essentially the same problem as this:
Covariant/natural policy gradient
Covariant/natural policy gradient
(figure from Peters & Schaal 2008)
see Schulman, L., Moritz, Jordan, Abbeel (2015) Trust region policy optimization
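A sketch of the covariant (natural) gradient step: precondition the vanilla gradient with the inverse Fisher information matrix, which makes the update invariant to how the policy is parameterized:

\theta \leftarrow \theta + \alpha\, \mathbf{F}^{-1} \nabla_\theta J(\theta), \qquad \mathbf{F} = E_{\pi_\theta}\big[ \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s})\, \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s})^\top \big]

In practice F is estimated from samples, and trust region methods (TRPO) choose the step size automatically from a KL-divergence constraint.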
Advanced policy gradient topics
• What more is there?
• Next time: introduce value functions and Q-functions
• Later in the class: more on natural gradient and automatic step size
adjustment
Example: policy gradient with importance sampling
• Incorporate example demonstrations using importance sampling
• Neural network policies
Levine, Koltun ’13
Example: trust region policy optimization
• Natural gradient with automatic step adjustment
• Discrete and continuous actions
• Code available (see Duan et al. ’16)
Schulman, Levine, Moritz, Jordan, Abbeel ’15
Policy gradients suggested readings
• Classic papers
  • Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning: introduces REINFORCE algorithm
  • Baxter & Bartlett (2001). Infinite-horizon policy-gradient estimation: temporally decomposed policy gradient (not the first paper on this! see actor-critic section later)
  • Peters & Schaal (2008). Reinforcement learning of motor skills with policy gradients: very accessible overview of optimal baselines and natural gradient
• Deep reinforcement learning policy gradient papers
  • Levine & Koltun (2013). Guided policy search: deep RL with importance sampled policy gradient (unrelated to later discussion of guided policy search)
  • Schulman, L., Moritz, Jordan, Abbeel (2015). Trust region policy optimization: deep RL with natural policy gradient and adaptive step size
  • Schulman, Wolski, Dhariwal, Radford, Klimov (2017). Proximal policy optimization algorithms: deep RL with importance sampled policy gradient