Policy Gradients

CS60077: Reinforcement Learning

Abir Das

IIT Kharagpur

Oct 28, 30, Nov 05, 2021


Agenda

§ Get started with policy gradient methods.
§ Get familiar with the naive REINFORCE algorithm and its advantages
and disadvantages.
§ Get familiar with different variance reduction techniques.
§ Actor-Critic methods.


Resources

§ Deep Reinforcement Learning by Sergey Levine [Link]


§ OpenAI Spinning Up [Link]


Reinforcement Learning Setting

Figure credit: [SB]

Figure credit: [Sergey Levine, UC Berkeley]

§ In the middle is the ‘policy network’ which can directly learn a
parameterized policy πθ (a|s) (sometimes denoted as π(a|s; θ)) and
provides the probability distribution over all actions given the state s
and parameterized by θ.
§ To distinguish it from the parameter vector w in value function
approximator v̂(s; w), the notation θ is used.


Reinforcement Learning Setting

Figure credit: [Sergey Levine, UC Berkeley]

§ The goal in an RL problem is to maximize the total reward “in expectation”
over the long run.
§ A trajectory τ is defined as,

\tau = (s_1, a_1, s_2, a_2, s_3, a_3, \cdots)

§ The probability of a trajectory is given by the joint probability of the
state-action pairs,

p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t) \qquad (1)

Reinforcement Learning Setting


§ Proof of the above relation,

p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
= p(s_{T+1} \mid s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\, p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
= p(s_{T+1} \mid s_T, a_T)\, p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
= p(s_{T+1} \mid s_T, a_T)\, p(a_T \mid s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\, p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
= p(s_{T+1} \mid s_T, a_T)\, \pi_\theta(a_T \mid s_T)\, \boxed{p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)} \qquad (2)

§ The boxed part of the equation is very similar to the left hand side. So,
using the same argument repeatedly, we get,

p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
= p(s_{T+1} \mid s_T, a_T)\, \pi_\theta(a_T \mid s_T)\, p(s_T \mid s_{T-1}, a_{T-1})\, \pi_\theta(a_{T-1} \mid s_{T-1})\, p(s_{T-1}, s_{T-2}, a_{T-2}, \cdots, s_1, a_1)
= p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t) \qquad (3)
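The factorization in eqns. (1)-(3) can be checked mechanically. A minimal sketch (not from the slides; the tabular MDP, its numbers and the helper name are made up for illustration) that accumulates log pθ(τ) exactly as the product decomposes:

```python
import numpy as np

def trajectory_log_prob(p1, P, pi, states, actions):
    """log p_theta(tau) = log p(s1) + sum_t [log p(s_{t+1}|s_t,a_t) + log pi_theta(a_t|s_t)].

    p1:      (S,)      initial state distribution p(s1)
    P:       (S, A, S) transition probabilities p(s'|s,a)
    pi:      (S, A)    policy pi_theta(a|s)
    states:  [s_1, ..., s_T, s_{T+1}]  visited states
    actions: [a_1, ..., a_T]           actions taken
    """
    logp = np.log(p1[states[0]])
    for t in range(len(actions)):
        s, a, s_next = states[t], actions[t], states[t + 1]
        logp += np.log(pi[s, a]) + np.log(P[s, a, s_next])
    return logp

# Tiny made-up example: 2 states, 2 actions.
p1 = np.array([0.8, 0.2])
P  = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
print(trajectory_log_prob(p1, P, pi, states=[0, 1, 1], actions=[1, 0]))
```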

The Goal of Reinforcement Learning

Figure credit: [Sergey Levine, UC Berkeley]

§ We will sometimes denote this probability as pθ (τ ), i.e.,

p_\theta(\tau) = p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)

§ The goal can be written as,

\theta^* = \arg\max_{\theta}\; \underbrace{E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big]}_{J(\theta)}

§ Note that, for the time being, we are not considering discount. We
will come back to that.


Evaluating the Objective

§ We will see how we can optimize this objective - the expected value
of the total reward under the trajectory distribution induced by the
policy πθ.
§ But before that, let us see how we can evaluate the objective in a
model-free setting.

J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big] \approx \frac{1}{N} \sum_i \sum_t r(s_{i,t}, a_{i,t}) \qquad (4)

Figure credit: [Sergey Levine, UC Berkeley]
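A minimal sketch of this sample-based evaluation (eqn. (4)): roll out N trajectories with the current policy and average the summed rewards. The env.reset / env.step / policy.sample interface is assumed for illustration and is not part of the slides.

```python
def estimate_J(env, policy, N=100, T=200):
    """J(theta) ~= (1/N) * sum_i sum_t r(s_{i,t}, a_{i,t})  -- Monte-Carlo evaluation."""
    total = 0.0
    for _ in range(N):                 # N sampled trajectories
        s = env.reset()
        for _ in range(T):             # at most T steps per trajectory
            a = policy.sample(s)       # a ~ pi_theta(.|s)
            s, r, done = env.step(a)   # assumed environment interface
            total += r
            if done:
                break
    return total / N
```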



Maximizing the Objective


§ Now that we have seen how to evaluate the objective, the next step is
to maximize it.
§ Compute the gradient and take steps in the direction of the gradient.
 
\theta^* = \arg\max_{\theta}\; \underbrace{E_{\tau \sim p_\theta(\tau)}\Big[\overbrace{\textstyle\sum_t r(s_t, a_t)}^{r(\tau)}\Big]}_{J(\theta)}

J(\theta) = E_{\tau \sim p_\theta(\tau)}[r(\tau)] = \int p_\theta(\tau)\, r(\tau)\, d\tau

\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau

§ How do we compute this complicated-looking gradient? The
log-derivative trick comes to our rescue.

Log Derivative Trick

\nabla_\theta \log p_\theta(\tau) = \frac{\partial \log p_\theta(\tau)}{\partial p_\theta(\tau)} \nabla_\theta p_\theta(\tau) = \frac{1}{p_\theta(\tau)} \nabla_\theta p_\theta(\tau)
\implies \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) \qquad (5)

§ So using eqn. (5) we get the gradient of the objective as,


\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau
= E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, r(\tau)] \qquad (6)

§ Remember that

J(\theta) = E_{\tau \sim p_\theta(\tau)}[r(\tau)] = \int p_\theta(\tau)\, r(\tau)\, d\tau
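As a sanity check of eqn. (6) outside the RL setting, the sketch below (an illustrative example, not from the slides) estimates ∇θ E_{x∼N(θ,1)}[x²] with the score-function form E[∇θ log pθ(x) r(x)] and compares it with the analytic answer 2θ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, size=200_000)   # x ~ p_theta = N(theta, 1)

# For a unit-variance Gaussian, grad_theta log p_theta(x) = (x - theta).
score = x - theta
grad_est = np.mean(score * x**2)           # E[ grad_theta log p_theta(x) * r(x) ],  r(x) = x^2

print(grad_est, 2 * theta)                 # E[x^2] = theta^2 + 1, so the true gradient is 2*theta
```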


Log Derivative Trick


§ Till now we have the following,

\theta^* = \arg\max_\theta J(\theta); \qquad J(\theta) = E_{\tau \sim p_\theta(\tau)}[r(\tau)]
\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, r(\tau)]

§ We have also seen,

p_\theta(\tau) = p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)

§ Taking log on both sides,

\log p_\theta(\tau) = \log p(s_1) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t)

§ Taking ∇θ on both sides, the first two terms do not depend on θ and vanish, so

\nabla_\theta \log p_\theta(\tau) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)

Log Derivative Trick


§ Thus,

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, r(\tau)]
= E_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=1}^{T} r(s_t, a_t)\Big)\Big]

§ So, to get an estimate of the gradient, we take sample trajectories and
average the product of (i) the sum of the gradients of the log-policy
terms and (ii) the sum of rewards along each trajectory,

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\Big)

§ And the last bit is to update θ along the gradient direction.


θ ← θ + α∇θ J(θ) (7)


Fitting in Generic RL Pipeline

 
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\Big)

\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)

REINFORCE Algorithm
1 Sample trajectories {τi } from πθ (at |st ) (run the policy)
2 \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\Big)
3 \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)
4 Repeat

Figure credit: [Sergey Levine, UC Berkeley]
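A minimal sketch of the REINFORCE loop above, on a made-up two-state MDP with a tabular softmax policy (the environment, its numbers, the horizon, batch size and learning rate are all illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (numbers are arbitrary).
P = np.array([[[0.9, 0.1], [0.1, 0.9]],      # p(s'|s=0, a)
              [[0.8, 0.2], [0.2, 0.8]]])     # p(s'|s=1, a)
R = np.array([[1.0, 0.0],                    # r(s=0, a)
              [0.0, 2.0]])                   # r(s=1, a)
S, A, T, N, alpha = 2, 2, 10, 50, 0.05

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros((S, A))                     # tabular policy parameters

for it in range(200):
    grad = np.zeros_like(theta)
    for _ in range(N):                       # step 1: sample N trajectories
        s, glogp, ret = 0, np.zeros_like(theta), 0.0
        for _ in range(T):
            pi = softmax(theta[s])
            a = rng.choice(A, p=pi)
            glogp[s] += np.eye(A)[a] - pi    # grad of log softmax w.r.t. theta[s, :]
            ret += R[s, a]
            s = rng.choice(S, p=P[s, a])
        grad += glogp * ret                  # step 2: (sum_t grad log pi) * (sum_t r)
    theta += alpha * grad / N                # step 3: theta <- theta + alpha * grad J

print("learned policy:", softmax(theta[0]), softmax(theta[1]))
```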


Taking a Closer Look


 
§ Ignore the reward sum for a moment and look only at the gradient factor,

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})

§ What is given by log πθ (ai,t |si,t )? - It is the log of the probability of
action ai,t at state si,t under the distribution parameterized by θ.
§ This gives the likelihood, i.e., how likely we are to see ai,t as the
action if our policy is defined by the current θ that we have.
§ Computing the gradient and taking a step along the direction of the
gradient changes θ in such a way that the likelihood of the action
ai,t increases.
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\Big)

§ Now consider the case when the gradient term is multiplied by
\sum_{t=1}^{T} r(s_{i,t}, a_{i,t}), as in the full estimate above.
§ Actions on trajectories with high rewards are made more likely.

Taking a Closer Look

§ Good stuff is made more likely.


§ Bad stuff is made less likely.
§ This formalizes ‘trial and error’ learning.

Figure credit: [Sergey Levine, UC Berkeley]


Bias and Variance in Estimation

§ One way to work with values we do not know is to estimate them by


experimenting repeatedly.
§ Monte-Carlo methods provide the estimate of the true value and we
have used Monte-Carlo methods to estimate the value functions and
to estimate the gradient of the expected return.
§ The estimator is a function of the data, which themselves are random
variables. So the estimated value is itself random: if you conduct the
experiment multiple times, in general, the estimator will provide
different values.
§ An estimator is good if,
  ▸ on average, the estimated values are close to the true value across
    different trials - (bias)
  ▸ the estimates do not vary much from trial to trial - (variance)


Unbiased Estimators
§ An unbiased estimator is one that yields the true value of the
variable being estimated on average. With θ denoting the true value
and θ̂ denoting the estimated value, an unbiased estimator is one
with E[θ̂] = θ.
§ Naturally, bias is defined as,
b = E[θ̂] − θ

§ Let us consider estimating a constant value (say, the temperature of this
room) with some sensors which are not perfect. Consider the
observations

x[n] = \theta + w[n], \quad n = 0, 1, \cdots, N-1, \qquad \text{where } w[n] \text{ is WGN with variance } \sigma^2.

§ A reasonable estimator is the average value of x[n], i.e.,

\hat{\theta} = \frac{1}{N} \sum_{n=0}^{N-1} x[n]

Estimator Bias
§ The sample mean estimator is unbiased:

E[\hat{\theta}] = E\Big[\frac{1}{N} \sum_{n=0}^{N-1} x[n]\Big] = \frac{1}{N} \sum_{n=0}^{N-1} E[x[n]]
= \frac{1}{N} \sum_{n=0}^{N-1} E[\theta + w[n]] = \frac{1}{N} \sum_{n=0}^{N-1} \big(E[\theta] + E[w[n]]\big)
= \frac{1}{N} \sum_{n=0}^{N-1} (\theta + 0) = \theta

§ Let us see what happens with a modified estimator of x[n], i.e.,

\check{\theta} = \frac{1}{2N} \sum_{n=0}^{N-1} x[n]

§ It is easy to see that E[θ̌] = θ/2.
§ So the bias is b = E[θ̌] − θ = −θ/2.
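A quick simulation of the two estimators (the true value θ, the noise level and the sample sizes below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, N, trials = 20.0, 2.0, 50, 10_000          # true value, WGN std, samples, repeats

x = theta + sigma * rng.standard_normal((trials, N))     # x[n] = theta + w[n]
theta_hat = x.mean(axis=1)                                # (1/N)  * sum_n x[n]
theta_chk = x.sum(axis=1) / (2 * N)                       # (1/2N) * sum_n x[n]

print("bias of theta_hat:", theta_hat.mean() - theta)     # ~ 0
print("bias of theta_chk:", theta_chk.mean() - theta)     # ~ -theta/2
print("var  of theta_hat:", theta_hat.var(), " sigma^2/N =", sigma**2 / N)
```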


Estimator Variance

§ That an estimator is unbiased does not necessarily mean that it is a


good estimator. It is reasonable to check by repeating the experiment
how the results differ in successive trials.
§ Thus the variance of the estimate is another measure of goodness of
the estimator. And the aim will be to see how small we can make
var(θ̂).
§ Let us take the following 3 estimators for θ and see the variances of
all these.
\hat{\theta}_a = 0: \quad E(\hat{\theta}_a) = 0, \quad \mathrm{var}(\hat{\theta}_a) = 0

\hat{\theta}_b = x[0]: \quad E(\hat{\theta}_b) = E(x[0]) = E(\theta + w[0]) = \theta + 0 = \theta, \quad \mathrm{var}(\hat{\theta}_b) = \mathrm{var}(x[0]) = \sigma^2

\hat{\theta}_c = \frac{1}{N} \sum_{n=0}^{N-1} x[n]: \quad E(\hat{\theta}_c) = \theta \ \text{(already seen)}, \quad \mathrm{var}(\hat{\theta}_c) = E[(\hat{\theta}_c - E[\hat{\theta}_c])^2] \ \text{(continued on next slide)}


Estimator Variance
\mathrm{var}(\hat{\theta}_c) = E[(\hat{\theta}_c - E[\hat{\theta}_c])^2] = E\Big[\Big(\frac{1}{N} \sum_{n=0}^{N-1} x[n] - E[\hat{\theta}_c]\Big)^2\Big] \qquad (8)
= E\Big[\Big(\frac{1}{N} \sum_{n=0}^{N-1} (\theta + w[n]) - \theta\Big)^2\Big] = E\Big[\Big(\frac{1}{N} \sum_{n=0}^{N-1} w[n]\Big)^2\Big] = \frac{1}{N^2} E\Big[\Big(\sum_{n=0}^{N-1} w[n]\Big)^2\Big]

§ Now,

\mathrm{var}\Big(\sum_{n=0}^{N-1} w[n]\Big) = E\Big[\Big(\sum_{n=0}^{N-1} w[n] - \overbrace{E\Big[\sum_{n=0}^{N-1} w[n]\Big]}^{0}\Big)^2\Big] = E\Big[\Big(\sum_{n=0}^{N-1} w[n]\Big)^2\Big]

§ Using the above in eqn. (8),

\mathrm{var}(\hat{\theta}_c) = \frac{1}{N^2}\, \mathrm{var}\Big(\sum_{n=0}^{N-1} w[n]\Big) = \frac{1}{N^2} \sum_{n=0}^{N-1} \mathrm{var}(w[n]) \quad (\text{WGN})
= \frac{N \sigma^2}{N^2} = \frac{\sigma^2}{N}

Estimator Mean Square Error


§ The mean of the square error of estimation is,

\mathrm{mse}(\hat{\theta}) = E\big[(\hat{\theta} - \theta)^2\big] = E\big[(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2\big]
= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big] + E\big[(E[\hat{\theta}] - \theta)^2\big] + 2\, E\big[(\hat{\theta} - E[\hat{\theta}])(E[\hat{\theta}] - \theta)\big]
= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big] + (E[\hat{\theta}] - \theta)^2 + 2\, (E[\hat{\theta}] - \theta)\, E\big[\hat{\theta} - E[\hat{\theta}]\big]
\quad (\text{why?}) - (\text{Hint: What is random here?})
= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big] + (E[\hat{\theta}] - \theta)^2 + 2\, (E[\hat{\theta}] - \theta) \underbrace{\big(E[\hat{\theta}] - E[\hat{\theta}]\big)}_{0}
= \mathrm{var}(\hat{\theta}) + \mathrm{bias}^2(\hat{\theta})

§ So the mean square error in estimation is composed of errors due to
the variance of the estimator as well as the bias.
§ Recall MC evaluation,

G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T \quad \text{and} \quad v_\pi(s) = E[G_t \mid S_t = s]
\hat{v}_\pi(s) = \frac{1}{N} \sum_{i=1}^{N} G_t^{(i)}(S_t = s)

§ So v̂π (s) is an unbiased estimator, but with variance (inversely
proportional to the number of samples N).
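The decomposition mse = var + bias² is easy to verify numerically, e.g., with the biased estimator θ̌ from the previous slides (the numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, N, trials = 20.0, 2.0, 50, 200_000

x = theta + sigma * rng.standard_normal((trials, N))   # repeated experiments, x[n] = theta + w[n]
est = x.sum(axis=1) / (2 * N)                          # the biased estimator from the last slides

mse  = np.mean((est - theta) ** 2)
var  = est.var()
bias = est.mean() - theta
print(mse, var + bias ** 2)                            # the two numbers agree (up to sampling noise)
```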

Bias and Variance of MC and TD

§ One key contribution to the variance in MC evaluation comes from the
randomness at each timestep.
§ This is not the case in TD, as Gt is estimated by bootstrapping,
Ĝt = Rt+1 + γ V̂ (St+1 )

§ This makes the estimator suffer less from variance, as randomness
comes from only the one random step taken. The rest is deterministic.
§ But this introduces bias. The estimate always has the deterministic
additive component γ V̂ (St+1 ).
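A toy illustration of this trade-off (a made-up deterministic chain MRP with noisy rewards and γ = 1; the slightly wrong V̂(S₁) is an assumption chosen to expose the bias):

```python
import numpy as np

rng = np.random.default_rng(2)
K, sigma, trials = 10, 1.0, 100_000          # chain length, reward noise, repeats

# Chain MRP: s_0 -> s_1 -> ... -> s_K (terminal); each step gives a reward ~ N(1, sigma^2).
rewards = 1.0 + sigma * rng.standard_normal((trials, K))

true_V0   = K * 1.0                          # true value of s_0
mc_target = rewards.sum(axis=1)              # MC target: G_0, the sum of all K noisy rewards
V1_hat    = (K - 1) + 0.5                    # a fixed, slightly wrong estimate of V(s_1)
td_target = rewards[:, 0] + V1_hat           # TD target: R_1 + V_hat(s_1)

print("MC: bias", mc_target.mean() - true_V0, " var", mc_target.var())  # ~0 bias, var ~ K*sigma^2
print("TD: bias", td_target.mean() - true_V0, " var", td_target.var())  # ~0.5 bias, var ~ sigma^2
```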


Reducing Variance in Policy Gradient Estimate

§ We have seen,

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=1}^{T} r(s_t, a_t)\Big)\Big]

§ There is a lot of randomness inside each trajectory.
§ We can derive versions of this formula that eliminate terms to reduce
variance.
§ Let us apply the log derivative trick (\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)) to
compute the gradient for a single reward term,

\nabla_\theta E_\tau[r(s_t, a_t)] = E_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t'=1}^{t} \nabla_\theta \log \pi_\theta(a_{t'} \mid s_{t'})\Big)\, r(s_t, a_t)\Big] \qquad (9)

§ Note that the sum goes only up to t. Why? - The reward at timestep t
depends only on actions taken at t' ≤ t. - Causality


Reducing Variance in Policy Gradient Estimate

§ Summing over time, we get (with some reordering of the sums in the
last step),

\nabla_\theta E_\tau[r(\tau)] = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} r(s_t, a_t) \sum_{t'=1}^{t} \nabla_\theta \log \pi_\theta(a_{t'} \mid s_{t'})\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} r(s_{t'}, a_{t'})\Big] \qquad (10)

§ With less randomness inside each trajectory the variance is less, but
what about bias?


Reducing Variance in Policy Gradient Estimate


\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=1}^{T} r(s_t, a_t)\Big)\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t'=1}^{T} r(s_{t'}, a_{t'})\Big)\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \sum_{t'=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(s_{t'}, a_{t'})\Big]
= \sum_{t=1}^{T} \sum_{t'=1}^{T} E_{\tau \sim p_\theta(\tau)}\big[\overbrace{\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(s_{t'}, a_{t'})}^{f(t, t')}\big] \qquad (11)

§ Let us consider the term,

E_{\tau \sim p_\theta(\tau)}[f(t, t')] = E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(s_{t'}, a_{t'})\big] \qquad (12)

§ We will show that for the case of t' < t (the reward coming before the
action is performed) the above term is zero.

Reducing Variance in Policy Gradient Estimate


E_{\tau \sim p_\theta(\tau)}[f(t, t')] = \int p(\tau)\, f(t, t')\, d\tau
= \int p(s_1, a_1, \cdots, s_t, a_t, \cdots, s_{t'}, a_{t'}, \cdots)\, f(t, t')\, d(s_1, a_1, \cdots, s_t, a_t, \cdots, s_{t'}, a_{t'}, \cdots)
= \int p(s_t, a_t, s_{t'}, a_{t'})\, f(t, t')\, d(s_t, a_t, s_{t'}, a_{t'}) \qquad (13)

§ The above comes from the property below,

\int_X \int_Y f(X)\, P(X, Y)\, dY\, dX = \int_X \int_Y f(X)\, P(X)\, P(Y \mid X)\, dY\, dX
= \int_X f(X)\, P(X) \underbrace{\int_Y P(Y \mid X)\, dY}_{1}\, dX
= \int_X f(X)\, P(X)\, dX \qquad (14)

§ Taking X = {st , at , st′ , at′ } and Y as the rest.



Reducing Variance in Policy Gradient Estimate


§ Till now we have,

E_{\tau \sim p_\theta(\tau)}[f(t, t')] = \int p(s_t, a_t, s_{t'}, a_{t'})\, f(t, t')\, d(s_t, a_t, s_{t'}, a_{t'}) \qquad (15)

§ We will now use a variation of iterated expectation,

E_{A,B}[f(A, B)] = \int P(A, B)\, f(A, B)\, dB\, dA
= \int P(B \mid A)\, P(A)\, f(A, B)\, dB\, dA
= \int P(A) \Big[\int P(B \mid A)\, f(A, B)\, dB\Big] dA
= \int P(A)\, E_B[f(A, B) \mid A]\, dA
= E_A\big[E_B[f(A, B) \mid A]\big]

§ Taking A = (st′ , at′ ) and B = (st , at ), eqn. (15) can be written as,

E_{\tau \sim p_\theta(\tau)}[f(t, t')] = E_{s_{t'}, a_{t'}}\big[E_{s_t, a_t}[f(t, t') \mid s_{t'}, a_{t'}]\big] \qquad (16)


Reducing Variance in Policy Gradient Estimate


§ Putting the value of f(t, t') back in eqn. (16), we get,

E_{\tau \sim p_\theta(\tau)}[f(t, t')] = E_{s_{t'}, a_{t'}}\big[E_{s_t, a_t}[f(t, t') \mid s_{t'}, a_{t'}]\big] \qquad (17)
= E_{s_{t'}, a_{t'}}\big[E_{s_t, a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(s_{t'}, a_{t'}) \mid s_{t'}, a_{t'}]\big]
= E_{s_{t'}, a_{t'}}\big[r(s_{t'}, a_{t'})\, E_{s_t, a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid s_{t'}, a_{t'}]\big]

§ Let us take a closer look at the inner expectation,

E_{s_t, a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid s_{t'}, a_{t'}] = \int P(s_t, a_t \mid s_{t'}, a_{t'})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, d(a_t, s_t) \qquad (18)

§ Now, let us consider the timestep t to be greater than t', i.e., the action
occurs after the reward. In such a case, P(st , at |st′ , at′ ) can be
broken down into P(at |st )P(st |st′ , at′ ). Thus eqn. (18) becomes,

E_{s_t, a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid s_{t'}, a_{t'}] = \int \int P(a_t \mid s_t)\, P(s_t \mid s_{t'}, a_{t'})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, da_t\, ds_t
= \int P(s_t \mid s_{t'}, a_{t'}) \Big[\int P(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, da_t\Big] ds_t
= E_{s_t}\Big[E_{a_t}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid s_t\big] \,\Big|\, s_{t'}, a_{t'}\Big] \qquad (19)

Reducing Variance in Policy Gradient Estimate


§ Now we will use a neat trick known as the ‘Expected Grad-Log-
Probability’ (EGLP) lemma, which says E[∇θ log pθ (x)] = 0:

E_{x \sim p_\theta(x)}[\nabla_\theta \log p_\theta(x)] = \int p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, dx = \int p_\theta(x)\, \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}\, dx
= \int \nabla_\theta p_\theta(x)\, dx = \nabla_\theta \int p_\theta(x)\, dx = \nabla_\theta 1 = 0

§ Thus the inner expectation in eqn. (19) is 0. This, in turn, means
eqns. (17), (16) and (15) are all 0.
§ That is, E_{τ∼pθ(τ)}[f(t, t′)] = 0 for t > t′.

§ Now, for t ≤ t′, P(st , at |st′ , at′ ) cannot be broken down into
P(at |st )P(st |st′ , at′ ), as the past state (st ) would get conditioned on
future states and actions (st′ , at′ ), violating the Markov property.
§ So, E_{τ∼pθ(τ)}[f(t, t′)] ≠ 0 for t ≤ t′.
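The EGLP lemma is also easy to verify by sampling; the sketch below (illustrative, not from the slides) does it for a softmax-parameterized categorical distribution, for which ∇θ log pθ(x) = onehot(x) − p.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = np.array([0.2, -1.0, 0.5])                 # logits of a categorical p_theta
p = np.exp(theta) / np.exp(theta).sum()

x = rng.choice(len(p), size=500_000, p=p)          # x ~ p_theta(x)
score = np.eye(len(p))[x] - p                      # grad_theta log p_theta(x) for each sample
print(score.mean(axis=0))                          # ~ [0, 0, 0]  (EGLP lemma)
```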
 


Reducing Variance in Policy Gradient Estimate


§ So we began with,

\nabla_\theta J(\theta) = \sum_{t=1}^{T} \sum_{t'=1}^{T} E_{\tau \sim p_\theta(\tau)}[f(t, t')] \qquad (20)

and have shown that

E_{\tau \sim p_\theta(\tau)}[f(t, t')] \begin{cases} = 0 & \text{if } t' < t \\ \neq 0 & \text{if } t' \geq t \end{cases}

§ So, the gradient of the total expected return can be written as,

\nabla_\theta J(\theta) = \sum_{t=1}^{T} \sum_{t'=t}^{T} E_{\tau \sim p_\theta(\tau)}[f(t, t')] = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \sum_{t'=t}^{T} f(t, t')\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} r(s_{t'}, a_{t'})\Big] \qquad (21)

§ This is the ‘reward to go’ formulation we saw in eqn. (10), and it has
less variance. But it also equals the gradient of the total expected
reward, which is unbiased. So this is an unbiased, lower-variance
estimator of the policy gradient.
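A minimal sketch of the ‘reward to go’ estimator of eqn. (21) next to the naive estimator of eqn. (6), for a single trajectory (array shapes and the toy numbers are illustrative assumptions):

```python
import numpy as np

def rewards_to_go(r):
    """[r_1, ..., r_T] -> [sum_{t'>=1} r_t', sum_{t'>=2} r_t', ..., r_T]."""
    return np.cumsum(r[::-1])[::-1]

def pg_estimates(glogpi, r):
    """glogpi: (T, d) per-step grad_theta log pi_theta(a_t|s_t); r: (T,) per-step rewards.
    Returns the naive (eqn. (6)) and reward-to-go (eqn. (21)) estimates for one trajectory;
    averaging either over sampled trajectories gives an estimate of grad_theta J(theta)."""
    naive = glogpi.sum(axis=0) * r.sum()                        # (sum_t glogpi_t) * (sum_t r_t)
    togo  = (glogpi * rewards_to_go(r)[:, None]).sum(axis=0)    # sum_t glogpi_t * sum_{t'>=t} r_t'
    return naive, togo

# toy numbers, for illustration only
print(pg_estimates(np.array([[0.1], [-0.2], [0.3]]), np.array([1.0, 0.0, 2.0])))
```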

Baselines

§ Good stuff is made more likely.


§ Bad stuff is made less likely.
§ What if all have high reward?

Figure credit: [Sergey Levine, UC Berkeley]

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, r(\tau)] = E_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=1}^{T} r(s_t, a_t)\Big)\Big]

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big]

§ Will it remain unbiased?
§ Only if E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, b] = b\, E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)] = 0.
§ And E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)] = 0 by the EGLP Lemma.


Baselines
§ So subtracting a constant baseline keeps the estimate unbiased.
§ A reasonable choice of baseline is the average reward across the
sampled trajectories, b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i).
§ What about variance?

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big]

\mathrm{var} = E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big)^2\Big] - \Big(E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big]\Big)^2
= E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big)^2\Big] - \Big(E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big]\Big)^2

\frac{\partial\, \mathrm{var}}{\partial b} = \frac{\partial}{\partial b}\, E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big)^2\Big] - 0
= \frac{\partial}{\partial b}\, E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 \big(r^2(\tau) - 2\, r(\tau)\, b + b^2\big)\Big]

Baselines

\frac{\partial\, \mathrm{var}}{\partial b} = \frac{\partial}{\partial b}\, E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 \big(r^2(\tau) - 2\, r(\tau)\, b + b^2\big)\Big]
= 0 - 2\, E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 r(\tau)\Big] + 2\, b\, E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2\Big]

§ For minimum variance,

\frac{\partial\, \mathrm{var}}{\partial b} = 0
\implies - E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 r(\tau)\Big] + b\, E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2\Big] = 0
\implies b = \frac{E_{\tau \sim p_\theta(\tau)}\big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 r(\tau)\big]}{E_{\tau \sim p_\theta(\tau)}\big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2\big]}
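A small sketch of a baseline-subtracted gradient estimate, offering either the average-return baseline or the variance-minimizing baseline above (applying the scalar formula per parameter dimension is my assumption; inputs are per-trajectory quantities):

```python
import numpy as np

def pg_with_baseline(glogp, returns, baseline="mean"):
    """glogp: (N, d) grad_theta log p_theta(tau_i);  returns: (N,) r(tau_i).
    Subtracting a constant baseline keeps the estimate unbiased (EGLP lemma)."""
    if baseline == "mean":
        b = returns.mean()                       # b = (1/N) sum_i r(tau_i)
    else:
        g2 = glogp ** 2                          # (grad log p)^2, per parameter dimension
        b = (g2 * returns[:, None]).mean(axis=0) / g2.mean(axis=0)   # E[g^2 r] / E[g^2]
    return (glogp * (returns[:, None] - b)).mean(axis=0)
```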


Advantage Function

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \underbrace{\sum_{t'=t}^{T} r(s_{t'}, a_{t'})}_{\hat{Q}^\theta(s_t, a_t)}\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{Q}^\theta(s_t, a_t)\Big]

§ It would be good to have the true value of Q to be used in the


equation.
§ But that is not available to us.
§ Other alternatives are to estimate this value using methods that we
have seen earlier - MC evaluation, Bootstrapped evaluation (TD),
using function approximation for these.


Advantage Function

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big(Q^\theta(s_t, a_t) - E_{a_t}\big[Q^\theta(s_t, a_t)\big]\Big)\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big(Q^\theta(s_t, a_t) - V^\theta(s_t)\Big)\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\theta(s_t, a_t)\Big]

§ We can also use a baseline version of this.


§ This is called the ‘Advantage function’.
§ A(st , at ) can be approximated following the methods we used earlier
(single sample backup or bootstrapping)


Advantage Function

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\theta(s_t, a_t)\Big]
\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, A^\theta(s_{i,t}, a_{i,t})

§ Qθ (st , at ) ≈ r(st , at ) + V θ (st+1 )


§ Aθ (st , at ) ≈ r(st , at ) + V θ (st+1 ) − V θ (st )
§ So we can use a neural network which learns to produce V (s)
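A minimal one-step actor-critic update along these lines, sketched in PyTorch (network sizes, learning rate, the discount factor (which the slides defer) and the batch/done-masking interface are illustrative assumptions, not from the slides): the critic learns V(s), the advantage is estimated as r + γV(s′) − V(s), and the actor ascends ∇θ log πθ(a|s)·A.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))  # logits of pi_theta
value  = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))          # V(s)
opt = torch.optim.Adam(list(policy.parameters()) + list(value.parameters()), lr=3e-4)

def actor_critic_update(s, a, r, s_next, done):
    """s, s_next: (B, obs_dim) float tensors; a: (B,) long; r, done: (B,) float tensors."""
    v_s    = value(s).squeeze(-1)
    v_next = value(s_next).squeeze(-1).detach()
    adv    = r + gamma * (1.0 - done) * v_next - v_s          # A(s,a) ~ r + gamma*V(s') - V(s)
    logpi   = torch.log_softmax(policy(s), dim=-1)
    logpi_a = logpi.gather(1, a.unsqueeze(1)).squeeze(1)      # log pi_theta(a|s)
    actor_loss  = -(logpi_a * adv.detach()).mean()            # gradient: -E[grad log pi * A]
    critic_loss = adv.pow(2).mean()                           # fit V(s) to the bootstrapped target
    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()
```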


Actor-Critic

Figure credit: [Sergey Levine, UC Berkeley]
