Lec 5 Policy Gradients

The document discusses policy gradient methods in reinforcement learning. It covers: 1) Evaluating the policy gradient using samples generated by running the policy to estimate returns and improve the policy. 2) Addressing high variance in the policy gradient by exploiting causality with baselines and analyzing variance. 3) Deriving an off-policy policy gradient using importance sampling and discussing practical considerations for implementing policy gradients.


Policy Gradients

CS 285
Instructor: Sergey Levine
UC Berkeley
The goal of reinforcement learning
(we'll come back to partially observed later)

finite horizon case and infinite horizon case:
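The slide's equations do not survive text extraction; as a sketch in standard notation, with trajectory distribution $p_\theta(\tau) = p(\mathbf{s}_1)\prod_{t=1}^{T}\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\,p(\mathbf{s}_{t+1}\mid \mathbf{s}_t, \mathbf{a}_t)$:

\[
\theta^\star = \arg\max_\theta \; E_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t)\right] \quad \text{(finite horizon)}
\]
\[
\theta^\star = \arg\max_\theta \; E_{(\mathbf{s}, \mathbf{a}) \sim p_\theta(\mathbf{s}, \mathbf{a})}\big[ r(\mathbf{s}, \mathbf{a}) \big] \quad \text{(infinite horizon, stationary state-action distribution)}
\]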


Evaluating the objective
Direct policy differentiation: a convenient identity
Evaluating the policy gradient
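A sketch of the standard derivation (reconstructed, not extracted from the slide): write $J(\theta) = E_{\tau\sim p_\theta(\tau)}[r(\tau)]$ with $r(\tau) = \sum_t r(\mathbf{s}_t, \mathbf{a}_t)$. The convenient identity is
\[
p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau),
\]
which gives
\[
\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big]
= E_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\right) r(\tau)\right],
\]
since the initial-state and transition terms in $\log p_\theta(\tau)$ do not depend on $\theta$. The sample estimate (REINFORCE) is
\[
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t})\right)\left(\sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t})\right).
\]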

[RL loop diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]
Understanding Policy Gradients
Evaluating the policy gradient
Comparison to maximum likelihood

[Figure: the policy gradient compared with maximum likelihood on the same training data, as in supervised learning]
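For reference, a sketch of the contrast: on the same sampled trajectories, the maximum likelihood gradient would be
\[
\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t}),
\]
so the policy gradient is the maximum likelihood gradient with each trajectory weighted by its total reward.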
Example: Gaussian policies
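A sketch for a Gaussian policy $\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) = \mathcal{N}\big(f_{\mathrm{NN}}(\mathbf{s}_t);\, \Sigma\big)$, assuming for simplicity a fixed covariance $\Sigma$:
\[
\log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) = -\frac{1}{2}\big(f(\mathbf{s}_t) - \mathbf{a}_t\big)^{\top} \Sigma^{-1} \big(f(\mathbf{s}_t) - \mathbf{a}_t\big) + \mathrm{const},
\]
\[
\nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) = \left(\frac{df}{d\theta}\right)^{\!\top} \Sigma^{-1} \big(\mathbf{a}_t - f(\mathbf{s}_t)\big),
\]
i.e. a weighted regression error backpropagated through the network $f$.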
What did we just do?

• good stuff is made more likely
• bad stuff is made less likely

This simply formalizes the notion of "trial and error"!
Partial observability
What is wrong with the policy gradient?

high variance
Review
• Evaluating the RL objective
  • Generate samples
• Evaluating the policy gradient
  • Log-gradient trick
  • Generate samples
• Understanding the policy gradient
  • Formalization of trial-and-error
• Partial observability
  • Works just fine
• What is wrong with policy gradient?

[RL loop diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]
Reducing Variance
Reducing variance

"reward to go": exploiting causality, the policy at time t cannot affect rewards received at earlier timesteps
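A sketch of the resulting estimator: each log-probability term only needs to be weighted by the rewards that come after it,
\[
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t})\, \hat{Q}_{i,t},
\qquad
\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(\mathbf{s}_{i,t'}, \mathbf{a}_{i,t'}),
\]
which removes terms from the sum and thereby lowers variance.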
Baselines

but… are we allowed to do that??

subtracting a baseline is unbiased in expectation!

average reward is not the best baseline, but it’s pretty good!
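A sketch of why a baseline is allowed: for any constant $b$,
\[
E\big[\nabla_\theta \log p_\theta(\tau)\, b\big] = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, b \; d\tau = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0,
\]
so $\nabla_\theta J(\theta) = E\big[\nabla_\theta \log p_\theta(\tau)\,\big(r(\tau) - b\big)\big]$ remains unbiased, e.g. with the average reward $b = \frac{1}{N}\sum_{i} r(\tau_i)$.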
Analyzing variance

The optimal baseline is just expected reward, but weighted by gradient magnitudes!
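A sketch of the analysis, per element of the gradient: writing $g(\tau) = \nabla_\theta \log p_\theta(\tau)$, the variance of the estimator is
\[
\mathrm{Var} = E\big[g(\tau)^2 \big(r(\tau) - b\big)^2\big] - E\big[g(\tau)\big(r(\tau) - b\big)\big]^2,
\]
and since the second term does not depend on $b$ (baselines are unbiased), setting $d\,\mathrm{Var}/db = 0$ gives
\[
b^\star = \frac{E\big[g(\tau)^2\, r(\tau)\big]}{E\big[g(\tau)^2\big]}.
\]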
Review
• The high variance of policy gradient
• Exploiting causality
  • Future doesn't affect the past
• Baselines
  • Unbiased!
• Analyzing variance
  • Can derive optimal baselines

[RL loop diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]
Off-Policy Policy Gradients
Policy gradient is on-policy

• Neural networks change only a little bit with each gradient step
• On-policy learning can be extremely inefficient!
Off-policy learning & importance sampling
importance sampling
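The importance sampling identity (reconstructed):
\[
E_{x \sim p(x)}[f(x)] = \int p(x)\, f(x)\, dx = \int q(x)\, \frac{p(x)}{q(x)}\, f(x)\, dx = E_{x \sim q(x)}\!\left[\frac{p(x)}{q(x)}\, f(x)\right],
\]
valid whenever $q(x) > 0$ wherever $p(x) f(x) \neq 0$.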
Deriving the policy gradient with IS
a convenient identity
The off-policy policy gradient
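A sketch, in the lecture's notation: with samples drawn from an old policy $\pi_\theta$ and a new parameter vector $\theta'$,
\[
\nabla_{\theta'} J(\theta') = E_{\tau \sim p_\theta(\tau)}\!\left[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\, \nabla_{\theta'} \log p_{\theta'}(\tau)\, r(\tau)\right],
\qquad
\frac{p_{\theta'}(\tau)}{p_\theta(\tau)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(\mathbf{a}_t \mid \mathbf{s}_t)}{\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)},
\]
where the product of probability ratios can scale exponentially badly in $T$.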

if we ignore this (part of the importance weight), we get a policy iteration algorithm (more on this in a later lecture)
A first-order approximation for IS (preview)
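A sketch of the approximation being previewed: replace the trajectory-level product of ratios with a single per-timestep ratio,
\[
\nabla_{\theta'} J(\theta') \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \frac{\pi_{\theta'}(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t})}{\pi_\theta(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t})}\, \nabla_{\theta'} \log \pi_{\theta'}(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t})\, \hat{Q}_{i,t}.
\]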

We'll see why this is reasonable later in the course!
Implementing Policy Gradients
Policy gradient with automatic differentiation
Pseudocode example (with discrete actions):

Maximum likelihood:
# Given:
# actions - (N*T) x Da tensor of (one-hot) actions
# states - (N*T) x Ds tensor of states
# Build the graph:
logits = policy.predictions(states) # This should return (N*T) x Da tensor of action logits
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
loss = tf.reduce_mean(negative_likelihoods)
gradients = tf.gradients(loss, variables) # gradient of the mean negative log-likelihood w.r.t. the policy variables
Policy gradient with automatic differentiation
Pseudocode example (with discrete actions):

Policy gradient:
# Given:
# actions - (N*T) x Da tensor of (one-hot) actions
# states - (N*T) x Ds tensor of states
# q_values - (N*T) x 1 tensor of estimated state-action values ("reward to go")
# Build the graph:
logits = policy.predictions(states) # This should return (N*T) x Da tensor of action logits
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
# negative_likelihoods has shape (N*T,), so flatten q_values to match before weighting
weighted_negative_likelihoods = tf.multiply(negative_likelihoods, tf.squeeze(q_values, axis=1))
loss = tf.reduce_mean(weighted_negative_likelihoods)
gradients = tf.gradients(loss, variables)
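Note that gradient descent on this surrogate loss follows the negative of the policy gradient estimator above, so minimizing it increases expected return; the surrogate's value itself is not meaningful, only its gradient is. (policy.predictions is schematic pseudocode from the slide, not a real TensorFlow API.)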
Policy gradient in practice
• Remember that the gradient has high variance
  • This isn't the same as supervised learning!
  • Gradients will be really noisy!
  • Consider using much larger batches
• Tweaking learning rates is very hard
  • Adaptive step size rules like ADAM can be OK-ish
  • We'll learn about policy gradient-specific learning rate adjustment methods later!
Review
• Policy gradient is on-policy
• Can derive off-policy variant
  • Use importance sampling
  • Exponential scaling in T
  • Can ignore state portion (approximation)
• Can implement with automatic differentiation: need to know what to backpropagate
• Practical considerations: batch size, learning rates, optimizers

[RL loop diagram: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]
Advanced Policy Gradients
What else is wrong with the policy gradient?

(image from Peters & Schaal 2008)

Essentially the same problem as this: [figure omitted]
Covariant/natural policy gradient
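A sketch of the idea: precondition the gradient with the Fisher information matrix, so steps are uniform in distribution space rather than parameter space,
\[
\theta \leftarrow \theta + \alpha\, \mathbf{F}^{-1} \nabla_\theta J(\theta),
\qquad
\mathbf{F} = E_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s})\, \nabla_\theta \log \pi_\theta(\mathbf{a} \mid \mathbf{s})^{\top}\big].
\]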

(figure from Peters & Schaal 2008)

see Schulman, Levine, Moritz, Jordan, Abbeel (2015), Trust region policy optimization
Advanced policy gradient topics

• What more is there?
• Next time: introduce value functions and Q-functions
• Later in the class: more on natural gradient and automatic step size adjustment
Example: policy gradient with importance sampling

• Incorporate example demonstrations using importance sampling
• Neural network policies

Levine, Koltun '13
Example: trust region policy optimization
• Natural gradient with automatic step adjustment
• Discrete and continuous actions
• Code available (see Duan et al. '16)

Schulman, Levine, Moritz, Jordan, Abbeel '15
Policy gradients suggested readings
• Classic papers
  • Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning: introduces the REINFORCE algorithm
  • Baxter & Bartlett (2001). Infinite-horizon policy-gradient estimation: temporally decomposed policy gradient (not the first paper on this! see actor-critic section later)
  • Peters & Schaal (2008). Reinforcement learning of motor skills with policy gradients: very accessible overview of optimal baselines and natural gradient
• Deep reinforcement learning policy gradient papers
  • Levine & Koltun (2013). Guided policy search: deep RL with importance sampled policy gradient (unrelated to later discussion of guided policy search)
  • Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust region policy optimization: deep RL with natural policy gradient and adaptive step size
  • Schulman, Wolski, Dhariwal, Radford, Klimov (2017). Proximal policy optimization algorithms: deep RL with importance sampled policy gradient
