Policy Gradient Methods
February 13, 2017
Policy Optimization Problems

maximize_π E_π[expression], where the expression is one of:

- Fixed-horizon episodic: Σ_{t=0}^{T−1} r_t
- Average-cost: lim_{T→∞} (1/T) Σ_{t=0}^{T−1} r_t
- Infinite-horizon discounted: Σ_{t=0}^{∞} γ^t r_t
- Variable-length undiscounted: Σ_{t=0}^{T_terminal−1} r_t
- Infinite-horizon undiscounted: Σ_{t=0}^{∞} r_t
Episodic Setting
s0 ∼ µ(s0)
a0 ∼ π(a0 | s0)
s1, r0 ∼ P(s1, r0 | s0, a0)
a1 ∼ π(a1 | s1)
s2, r1 ∼ P(s2, r1 | s1, a1)
. . .
aT−1 ∼ π(aT−1 | sT−1)
sT, rT−1 ∼ P(sT, rT−1 | sT−1, aT−1)
Objective:
maximize η(π), where
η(π) = E [r0 + r1 + · · · + rT −1 | π]
Episodic Setting

[Figure: agent-environment loop. The agent's policy π selects actions a0, . . . , aT−1; the environment P returns states s1, . . . , sT (with s0 ∼ µ0) and rewards r0, . . . , rT−1.]

Objective:
maximize η(π), where
η(π) = E [r0 + r1 + · · · + rT−1 | π]
Parameterized Policies
- A family of policies indexed by parameter vector θ ∈ R^d
- Deterministic: a = π(s, θ)
- Stochastic: π(a | s, θ)
- Analogous to classification or regression with input s, output a
- Discrete action space: network outputs a vector of probabilities
- Continuous action space: network outputs the mean and diagonal covariance of a Gaussian (see the sketch below)
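Below is a minimal sketch (not from the slides) of these two output parameterizations in PyTorch. The class names, hidden sizes, and the choice of a state-independent log-std parameter for the Gaussian head are illustrative assumptions, not the lecture's prescription.

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Discrete actions: the network outputs one (unnormalized) log-probability per action."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

class GaussianPolicy(nn.Module):
    """Continuous actions: the network outputs the mean; a learned, state-independent
    parameter gives the diagonal log standard deviation (one common simplification)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, s):
        return torch.distributions.Normal(self.mean_net(s), self.log_std.exp())

# Usage: dist = policy(torch.as_tensor(s, dtype=torch.float32))
#        a = dist.sample(); logp = dist.log_prob(a)
```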
Policy Gradient Methods: Overview
Problem:
maximize E[R | πθ]

Intuitions: collect a bunch of trajectories, and ...
1. Make the good trajectories more probable¹
2. Make the good actions more probable
3. Push the actions towards good actions (DPG², SVG³)

¹ R. J. Williams. "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning (1992); R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. "Policy gradient methods for reinforcement learning with function approximation". NIPS. MIT Press, 2000.
² D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. "Deterministic Policy Gradient Algorithms". ICML. 2014.
³ N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. "Learning Continuous Control Policies by Stochastic Value Gradients". arXiv preprint arXiv:1510.09142 (2015).
Score Function Gradient Estimator
- Consider an expectation E_{x∼p(x | θ)}[f(x)]. We want to compute its gradient wrt θ:

    ∇θ E_x[f(x)] = ∇θ ∫ dx p(x | θ) f(x)
                 = ∫ dx ∇θ p(x | θ) f(x)
                 = ∫ dx p(x | θ) (∇θ p(x | θ) / p(x | θ)) f(x)
                 = ∫ dx p(x | θ) ∇θ log p(x | θ) f(x)
                 = E_x[f(x) ∇θ log p(x | θ)]

- The last expression gives us an unbiased gradient estimator: just sample x_i ∼ p(x | θ) and compute ĝ_i = f(x_i) ∇θ log p(x_i | θ) (see the numerical check below).
- We need to be able to compute and differentiate the density p(x | θ) wrt θ.
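As a quick numerical check (my own toy example, not from the slides), take x ∼ N(θ, 1) and f(x) = x², so that E_x[f(x)] = θ² + 1 and the true gradient is 2θ; the score is ∇θ log p(x | θ) = x − θ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, size=1_000_000)   # samples x_i ~ p(x | theta) = N(theta, 1)
g_hat = np.mean(x**2 * (x - theta))          # average of f(x_i) * grad_theta log p(x_i | theta)
print(g_hat, 2 * theta)                      # estimate is close to the true gradient 2*theta = 3.0
```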
Derivation via Importance Sampling
Alternative derivation using importance sampling⁴:

    E_{x∼θ}[f(x)] = E_{x∼θ_old}[ (p(x | θ) / p(x | θ_old)) f(x) ]

    ∇θ E_{x∼θ}[f(x)] = E_{x∼θ_old}[ (∇θ p(x | θ) / p(x | θ_old)) f(x) ]

    ∇θ E_{x∼θ}[f(x)] |_{θ=θ_old} = E_{x∼θ_old}[ (∇θ p(x | θ)|_{θ=θ_old} / p(x | θ_old)) f(x) ]
                                 = E_{x∼θ_old}[ ∇θ log p(x | θ)|_{θ=θ_old} f(x) ]

⁴ T. Jie and P. Abbeel. "On a connection between importance sampling and the likelihood ratio policy gradient". Advances in Neural Information Processing Systems. 2010, pp. 1000–1008.
Score Function Gradient Estimator: Intuition
ĝ_i = f(x_i) ∇θ log p(x_i | θ)

- Let's say that f(x) measures how good the sample x is.
- Moving in the direction ĝ_i pushes up the logprob of the sample, in proportion to how good it is.
- Valid even if f(x) is discontinuous or unknown, or the sample space (containing x) is a discrete set.
Score Function Gradient Estimator for Policies
- Now the random variable x is a whole trajectory τ = (s0, a0, r0, s1, a1, r1, . . . , sT−1, aT−1, rT−1, sT):

    ∇θ Eτ[R(τ)] = Eτ[∇θ log p(τ | θ) R(τ)]

- Just need to write out p(τ | θ):

    p(τ | θ) = µ(s0) Π_{t=0}^{T−1} [π(a_t | s_t, θ) P(s_{t+1}, r_t | s_t, a_t)]

    log p(τ | θ) = log µ(s0) + Σ_{t=0}^{T−1} [log π(a_t | s_t, θ) + log P(s_{t+1}, r_t | s_t, a_t)]

    ∇θ log p(τ | θ) = Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ)

    ∇θ Eτ[R] = Eτ[ R Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) ]

- Interpretation: using good trajectories (high R) as supervised examples in classification / regression (see the sketch below).
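A minimal sketch of this whole-trajectory estimator, assuming a policy like the CategoricalPolicy sketched earlier and lists obs, acts, rews recorded from a single rollout (all names here are illustrative):

```python
import torch

def trajectory_gradient_loss(policy, obs, acts, rews):
    """Scalar whose gradient (via autodiff) is the negative of
    ghat = R(tau) * sum_t grad_theta log pi(a_t | s_t, theta)."""
    obs = torch.as_tensor(obs, dtype=torch.float32)
    acts = torch.as_tensor(acts)
    R = float(sum(rews))                       # total reward R(tau), treated as a constant
    logp = policy(obs).log_prob(acts).sum()    # sum_t log pi(a_t | s_t, theta)
    return -(R * logp)                         # minimizing this ascends the policy gradient
```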
Policy Gradient: Use Temporal Structure
- Previous slide:

    ∇θ Eτ[R] = Eτ[ (Σ_{t=0}^{T−1} r_t) (Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ)) ]

- We can repeat the same argument to derive the gradient estimator for a single reward term r_{t′}:

    ∇θ E[r_{t′}] = E[ r_{t′} Σ_{t=0}^{t′} ∇θ log π(a_t | s_t, θ) ]

- Summing this formula over t′, we obtain

    ∇θ E[R] = E[ Σ_{t′=0}^{T−1} r_{t′} Σ_{t=0}^{t′} ∇θ log π(a_t | s_t, θ) ]
            = E[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) Σ_{t′=t}^{T−1} r_{t′} ]

  (a short helper for computing the inner "reward-to-go" sums follows below)
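A small helper (my naming, not from the slides) that computes the reward-to-go sums Σ_{t′=t}^{T−1} r_{t′} for every t of one trajectory:

```python
import numpy as np

def rewards_to_go(rews):
    """out[t] = r_t + r_{t+1} + ... + r_{T-1}, computed with one backward pass."""
    out = np.zeros(len(rews))
    running = 0.0
    for t in reversed(range(len(rews))):
        running += rews[t]
        out[t] = running
    return out

# rewards_to_go([1.0, 0.0, 2.0])  ->  array([3., 2., 2.])
```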
Policy Gradient: Introduce Baseline
- Further reduce variance by introducing a baseline b(s):

    ∇θ Eτ[R] = Eτ[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) (Σ_{t′=t}^{T−1} r_{t′} − b(s_t)) ]

- For any choice of b, the gradient estimator is unbiased.
- A near-optimal choice is the expected return,

    b(s_t) ≈ E[r_t + r_{t+1} + r_{t+2} + · · · + r_{T−1}]

- Interpretation: increase the logprob of action a_t proportionally to how much the return Σ_{t′=t}^{T−1} r_{t′} is better than expected.
Baseline—Derivation
    Eτ[∇θ log π(a_t | s_t, θ) b(s_t)]
      = E_{s_{0:t}, a_{0:(t−1)}}[ E_{s_{(t+1):T}, a_{t:(T−1)}}[∇θ log π(a_t | s_t, θ) b(s_t)] ]    (break up expectation)
      = E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) E_{s_{(t+1):T}, a_{t:(T−1)}}[∇θ log π(a_t | s_t, θ)] ]    (pull baseline term out)
      = E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) E_{a_t}[∇θ log π(a_t | s_t, θ)] ]                         (remove irrelevant vars.)
      = E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) · 0 ]

The last equality holds because 0 = ∇θ E_{a_t∼π(· | s_t)}[1] = E_{a_t∼π(· | s_t)}[∇θ log π_θ(a_t | s_t)] (checked numerically below).
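A quick numerical check (my own example) of the identity in the last step, for a Gaussian policy π(a | s) = N(θ, 1), where ∇θ log π(a | s) = a − θ:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.7
a = rng.normal(theta, 1.0, size=1_000_000)  # actions a ~ pi(. | s)
print(np.mean(a - theta))                   # ~0: the score has zero mean, so b(s) adds no bias
```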
Discounts for Variance Reduction
- Introduce a discount factor γ, which ignores delayed effects between actions and rewards:

    ∇θ Eτ[R] ≈ Eτ[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) (Σ_{t′=t}^{T−1} γ^{t′−t} r_{t′} − b(s_t)) ]

- Now we want b(s_t) ≈ E[r_t + γ r_{t+1} + γ² r_{t+2} + · · · + γ^{T−1−t} r_{T−1}]
“Vanilla” Policy Gradient Algorithm
Initialize policy parameter θ, baseline b
for iteration = 1, 2, . . . do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute
        the return R_t = Σ_{t′=t}^{T−1} γ^{t′−t} r_{t′}, and
        the advantage estimate Â_t = R_t − b(s_t).
    Re-fit the baseline, by minimizing ‖b(s_t) − R_t‖²,
        summed over all trajectories and timesteps.
    Update the policy, using a policy gradient estimate ĝ,
        which is a sum of terms ∇θ log π(a_t | s_t, θ) Â_t.
        (Plug ĝ into SGD or Adam.)
end for
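A rough numpy sketch of the per-timestep quantities in the pseudocode above, i.e. the discounted returns R_t and advantages Â_t = R_t − b(s_t) for one trajectory. Trajectory collection, the baseline model, and the policy update are assumed to live elsewhere; the function name is mine.

```python
import numpy as np

def returns_and_advantages(rews, baseline_values, gamma=0.99):
    """R_t = sum_{t'>=t} gamma^(t'-t) * r_{t'}  and  A_t = R_t - b(s_t)."""
    R = np.zeros(len(rews))
    running = 0.0
    for t in reversed(range(len(rews))):
        running = rews[t] + gamma * running
        R[t] = running
    return R, R - np.asarray(baseline_values)

# The baseline is then re-fit by regressing b(s_t) toward R_t, and ghat is
# assembled from terms grad_theta log pi(a_t | s_t, theta) * A_t.
```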
Practical Implementation with Autodiff
- The usual formula Σ_t ∇θ log π(a_t | s_t; θ) Â_t is inefficient; we want to batch data.
- Define a "surrogate" function using data from the current batch (sketched below):

    L(θ) = Σ_t log π(a_t | s_t; θ) Â_t

- Then the policy gradient estimator is ĝ = ∇θ L(θ).
- Can also include the value function fit error:

    L(θ) = Σ_t ( log π(a_t | s_t; θ) Â_t − ‖V(s_t) − R̂_t‖² )
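A sketch of this surrogate in PyTorch, assuming a policy that returns a torch.distributions object (as in the earlier policy sketch) and a separate value network with output shape [batch, 1]. The batch tensors obs, acts, adv, ret and the coefficient vf_coef are illustrative; adv and ret are treated as constants, i.e. not differentiated through.

```python
import torch

def surrogate_loss(policy, value_fn, obs, acts, adv, ret, vf_coef=1.0):
    dist = policy(obs)
    pg_term = (dist.log_prob(acts) * adv).sum()               # sum_t log pi(a_t | s_t; theta) * A_t
    vf_term = ((value_fn(obs).squeeze(-1) - ret) ** 2).sum()  # value-function fit error
    return -(pg_term - vf_coef * vf_term)                     # negate: optimizers minimize
```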
Value Functions
    Q^{π,γ}(s, a) = E_π[r_0 + γ r_1 + γ² r_2 + · · · | s_0 = s, a_0 = a]
        Called the Q-function or state-action-value function

    V^{π,γ}(s) = E_π[r_0 + γ r_1 + γ² r_2 + · · · | s_0 = s]
               = E_{a∼π}[Q^{π,γ}(s, a)]
        Called the state-value function

    A^{π,γ}(s, a) = Q^{π,γ}(s, a) − V^{π,γ}(s)
        Called the advantage function
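A tiny worked example (mine, not from the slides): a one-step problem with two actions whose rewards are 1 and 0, under a uniform policy π.

```python
# Q(s, a1) = 1, Q(s, a2) = 0 under any policy, since the episode ends after one step.
Q = {"a1": 1.0, "a2": 0.0}
V = 0.5 * Q["a1"] + 0.5 * Q["a2"]        # V(s) = E_{a~pi}[Q(s, a)] = 0.5
A = {a: q - V for a, q in Q.items()}     # advantages: {'a1': +0.5, 'a2': -0.5}
print(V, A)
```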
Policy Gradient Formulas with Value Functions
- Recall:

    ∇θ Eτ[R] = Eτ[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) (Σ_{t′=t}^{T−1} r_{t′} − b(s_t)) ]
             ≈ Eτ[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) (Σ_{t′=t}^{T−1} γ^{t′−t} r_{t′} − b(s_t)) ]

- Using value functions:

    ∇θ Eτ[R] = Eτ[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) Q^π(s_t, a_t) ]
             = Eτ[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) A^π(s_t, a_t) ]
             ≈ Eτ[ Σ_{t=0}^{T−1} ∇θ log π(a_t | s_t, θ) A^{π,γ}(s_t, a_t) ]

- Can plug in an "advantage estimator" Â for A^{π,γ}
- Advantage estimators have the form Return − V(s)
Value Functions in the Future
- The baseline accounts for and removes the effect of past actions
- Can also use the value function to estimate future rewards:

    R̂_t^{(1)} = r_t + γ V(s_{t+1})                        cut off at one timestep
    R̂_t^{(2)} = r_t + γ r_{t+1} + γ² V(s_{t+2})           cut off at two timesteps
    . . .
    R̂_t^{(∞)} = r_t + γ r_{t+1} + γ² r_{t+2} + · · ·      ∞ timesteps (no V)
Value Functions in the Future
- Subtracting out baselines, we get advantage estimators:

    Â_t^{(1)} = r_t + γ V(s_{t+1}) − V(s_t)
    Â_t^{(2)} = r_t + γ r_{t+1} + γ² V(s_{t+2}) − V(s_t)
    . . .
    Â_t^{(∞)} = r_t + γ r_{t+1} + γ² r_{t+2} + · · · − V(s_t)

- Â_t^{(1)} has low variance but high bias; Â_t^{(∞)} has high variance but low bias.
- Using an intermediate k (say, 20) gives an intermediate amount of bias and variance (see the sketch below).
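A sketch of the k-step estimator Â_t^{(k)} described above, given per-timestep rewards r_0, …, r_{T−1} and value estimates V(s_0), …, V(s_T) (the names and array conventions are my own; use V(s_T) = 0 for a terminal state):

```python
import numpy as np

def k_step_advantage(rews, values, t, k, gamma=0.99):
    """A_t^(k) = r_t + gamma*r_{t+1} + ... + gamma^(k-1)*r_{t+k-1}
                 + gamma^k * V(s_{t+k}) - V(s_t)."""
    T = len(rews)
    k = min(k, T - t)                                     # truncate at the end of the trajectory
    ret = sum(gamma**i * rews[t + i] for i in range(k))   # the k reward terms
    return ret + gamma**k * values[t + k] - values[t]     # bootstrap, then subtract the baseline
```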
Discounts: Connection to MPC
- MPC:

    maximize_a Q^{*,T}(s, a) ≈ maximize_a Q^{*,γ}(s, a)

- Discounted policy gradient:

    E_{a∼π}[Q^{π,γ}(s, a) ∇θ log π(a | s; θ)] = 0 when a ∈ arg max_a Q^{π,γ}(s, a)
Application: Robot Locomotion
Finite-Horizon Methods: Advantage Actor-Critic
- A2C / A3C uses this fixed-horizon advantage estimator. (NOTE: the "asynchronous" part is only for speed; it doesn't improve performance)
- Pseudocode (a short sketch of the R̂_t computation follows the citation below):

    for iteration = 1, 2, . . . do
        Agent acts for T timesteps (e.g., T = 20)
        For each timestep t, compute
            R̂_t = r_t + γ r_{t+1} + · · · + γ^{T−1−t} r_{T−1} + γ^{T−t} V(s_T)
            Â_t = R̂_t − V(s_t)
        R̂_t is the target value in a regression problem; Â_t is the estimated advantage
        Compute the loss gradient g = ∇θ Σ_{t=1}^{T} [ −log π_θ(a_t | s_t) Â_t + c (V(s_t) − R̂_t)² ]
        g is plugged into a stochastic gradient descent variant, e.g., Adam.
    end for
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. “Asynchronous methods for deep reinforcement learning”. (2016)
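A small numpy sketch (mine) of the T-step bootstrapped targets R̂_t used above, given the T rewards of one segment and the value estimate V(s_T) at the state where the segment ends:

```python
import numpy as np

def bootstrapped_returns(rews, v_last, gamma=0.99):
    """R_t = r_t + gamma*r_{t+1} + ... + gamma^(T-1-t)*r_{T-1} + gamma^(T-t)*V(s_T)."""
    R = np.zeros(len(rews))
    running = v_last                       # start from the bootstrap value V(s_T)
    for t in reversed(range(len(rews))):
        running = rews[t] + gamma * running
        R[t] = running
    return R

# Advantages are then A_t = R_t - V(s_t); the loss combines -log pi(a_t | s_t) * A_t
# with the value regression term c * (V(s_t) - R_t)^2, as in the pseudocode.
```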
A3C Video
A3C Results
[Figure: learning curves (score vs. training time in hours, 0-14) on five Atari games: Beamrider, Breakout, Pong, Q*bert, and Space Invaders, comparing DQN, 1-step Q, 1-step SARSA, n-step Q, and A3C.]
Further Reading
- A nice intuitive explanation of policy gradients: [Link]
- R. J. Williams. "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning (1992); R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. "Policy gradient methods for reinforcement learning with function approximation". NIPS. MIT Press, 2000
- My thesis has a decent self-contained introduction to policy gradient methods: [Link]
- A3C paper: V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. "Asynchronous methods for deep reinforcement learning". (2016)