Policy Gradients

CS60077: Reinforcement Learning

Abir Das

IIT Kharagpur

Oct 28, 30, Nov 05, 2021


Agenda

§ Get started with policy gradient methods.
§ Get familiar with the naive REINFORCE algorithm and its advantages
and disadvantages.
§ Get familiar with different variance reduction techniques.
§ Actor-Critic methods.


Resources

§ Deep Reinforcement Learning by Sergey Levine [Link]


§ OpenAI Spinning Up [Link]


Reinforcement Learning Setting

Figure credit: [SB]

Figure credit: [Sergey Levine, UC Berkeley]

§ In the middle is the ‘policy network’ which can directly learn a
parameterized policy πθ (a|s) (sometimes denoted as π(a|s; θ)) and
provides the probability distribution over all actions given the state s
and parameterized by θ.
§ To distinguish it from the parameter vector w in value function
approximator v̂(s; w), the notation θ is used.


Reinforcement Learning Setting

Figure credit: [Sergey Levine, UC Berkeley]

§ The goal in an RL problem is to maximize the total reward “in expectation”
over the long run.
§ A trajectory τ is defined as,

\tau = (s_1, a_1, s_2, a_2, s_3, a_3, \cdots)

§ The probability of a trajectory is given by the joint probability of the
state-action pairs,

p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t) \qquad (1)

Reinforcement Learning Setting


§ Proof of the above relation,

p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
= p(s_{T+1} \mid s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\, p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
= p(s_{T+1} \mid s_T, a_T)\, p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
= p(s_{T+1} \mid s_T, a_T)\, p(a_T \mid s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\, p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
= p(s_{T+1} \mid s_T, a_T)\, \pi_\theta(a_T \mid s_T)\, \boxed{p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)} \qquad (2)

§ The boxed part of the equation is very similar to the left hand side. So,
using the same argument repeatedly, we get,

p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
= p(s_{T+1} \mid s_T, a_T)\, \pi_\theta(a_T \mid s_T)\, p(s_T \mid s_{T-1}, a_{T-1})\, \pi_\theta(a_{T-1} \mid s_{T-1})\, p(s_{T-1}, s_{T-2}, a_{T-2}, \cdots, s_1, a_1)
= p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t) \qquad (3)
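The factorization in eqns. (1)-(3) can be checked mechanically. A minimal sketch (not from the slides; the tabular MDP, its numbers and the helper name are made up for illustration) that accumulates log pθ(τ) exactly as the product decomposes:

```python
import numpy as np

def trajectory_log_prob(p1, P, pi, states, actions):
    """log p_theta(tau) = log p(s1) + sum_t [log p(s_{t+1}|s_t,a_t) + log pi_theta(a_t|s_t)].

    p1:      (S,)      initial state distribution p(s1)
    P:       (S, A, S) transition probabilities p(s'|s,a)
    pi:      (S, A)    policy pi_theta(a|s)
    states:  [s_1, ..., s_T, s_{T+1}]  visited states
    actions: [a_1, ..., a_T]           actions taken
    """
    logp = np.log(p1[states[0]])
    for t in range(len(actions)):
        s, a, s_next = states[t], actions[t], states[t + 1]
        logp += np.log(pi[s, a]) + np.log(P[s, a, s_next])
    return logp

# Tiny made-up example: 2 states, 2 actions.
p1 = np.array([0.8, 0.2])
P  = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
print(trajectory_log_prob(p1, P, pi, states=[0, 1, 1], actions=[1, 0]))
```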

The Goal of Reinforcement Learning

Figure credit: [Sergey Levine, UC Berkeley]

§ We will sometimes denote this probability as pθ (τ ), i.e.,

p_\theta(\tau) = p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)

§ The goal can be written as,

\theta^* = \arg\max_{\theta}\; \underbrace{E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big]}_{J(\theta)}

§ Note that, for the time being, we are not considering discount. We
will come back to that.


Evaluating the Objective

§ We will see how we can optimize this objective - the expected value
of the total reward under the trajectory distribution induced by the
policy πθ.
§ But before that, let us see how we can evaluate the objective in a
model-free setting.

J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big] \approx \frac{1}{N} \sum_i \sum_t r(s_{i,t}, a_{i,t}) \qquad (4)

Figure credit: [Sergey Levine, UC Berkeley]
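A minimal sketch of this sample-based evaluation (eqn. (4)): roll out N trajectories with the current policy and average the summed rewards. The env.reset / env.step / policy.sample interface is assumed for illustration and is not part of the slides.

```python
def estimate_J(env, policy, N=100, T=200):
    """J(theta) ~= (1/N) * sum_i sum_t r(s_{i,t}, a_{i,t})  -- Monte-Carlo evaluation."""
    total = 0.0
    for _ in range(N):                 # N sampled trajectories
        s = env.reset()
        for _ in range(T):             # at most T steps per trajectory
            a = policy.sample(s)       # a ~ pi_theta(.|s)
            s, r, done = env.step(a)   # assumed environment interface
            total += r
            if done:
                break
    return total / N
```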



Maximizing the Objective


§ Now that we have seen how to evaluate the objective, the next step is
to maximize it.
§ Compute the gradient and take steps in the direction of the gradient.
 
\theta^* = \arg\max_{\theta}\; \underbrace{E_{\tau \sim p_\theta(\tau)}\Big[\overbrace{\textstyle\sum_t r(s_t, a_t)}^{r(\tau)}\Big]}_{J(\theta)}

J(\theta) = E_{\tau \sim p_\theta(\tau)}[r(\tau)] = \int p_\theta(\tau)\, r(\tau)\, d\tau

\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau

§ How do we compute this complicated-looking gradient? The
log-derivative trick comes to our rescue.

Log Derivative Trick

\nabla_\theta \log p_\theta(\tau) = \frac{\partial \log p_\theta(\tau)}{\partial p_\theta(\tau)} \nabla_\theta p_\theta(\tau) = \frac{1}{p_\theta(\tau)} \nabla_\theta p_\theta(\tau)
\implies \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) \qquad (5)

§ So using eqn. (5) we get the gradient of the objective as,


\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau
= E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, r(\tau)] \qquad (6)

§ Remember that

J(\theta) = E_{\tau \sim p_\theta(\tau)}[r(\tau)] = \int p_\theta(\tau)\, r(\tau)\, d\tau
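As a sanity check of eqn. (6) outside the RL setting, the sketch below (an illustrative example, not from the slides) estimates ∇θ E_{x∼N(θ,1)}[x²] with the score-function form E[∇θ log pθ(x) r(x)] and compares it with the analytic answer 2θ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, size=200_000)   # x ~ p_theta = N(theta, 1)

# For a unit-variance Gaussian, grad_theta log p_theta(x) = (x - theta).
score = x - theta
grad_est = np.mean(score * x**2)           # E[ grad_theta log p_theta(x) * r(x) ],  r(x) = x^2

print(grad_est, 2 * theta)                 # E[x^2] = theta^2 + 1, so the true gradient is 2*theta
```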


Log Derivative Trick


§ Till now we have the following,

\theta^* = \arg\max_\theta J(\theta); \qquad J(\theta) = E_{\tau \sim p_\theta(\tau)}[r(\tau)]
\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, r(\tau)]

§ We have also seen,

p_\theta(\tau) = p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)

§ Taking log on both sides,

\log p_\theta(\tau) = \log p(s_1) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t)

§ Taking ∇θ on both sides, the first two terms do not depend on θ and vanish, so

\nabla_\theta \log p_\theta(\tau) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)

Log Derivative Trick


§ Thus,

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, r(\tau)]
= E_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=1}^{T} r(s_t, a_t)\Big)\Big]

§ So, to get an estimate of the gradient, we take sample trajectories and
average the product of (i) the sum of the gradients of the log-policy
terms and (ii) the sum of rewards along each trajectory,

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\Big)

§ And the last bit is to update θ along the gradient direction.


θ ← θ + α∇θ J(θ) (7)


Fitting in Generic RL Pipeline

 
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\Big)

\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)

REINFORCE Algorithm
1 Sample trajectories {τi } from πθ (at |st ) (run the policy)
2 \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\Big)
3 \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)
4 Repeat

Figure credit: [Sergey Levine, UC Berkeley]
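A minimal sketch of the REINFORCE loop above, on a made-up two-state MDP with a tabular softmax policy (the environment, its numbers, the horizon, batch size and learning rate are all illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (numbers are arbitrary).
P = np.array([[[0.9, 0.1], [0.1, 0.9]],      # p(s'|s=0, a)
              [[0.8, 0.2], [0.2, 0.8]]])     # p(s'|s=1, a)
R = np.array([[1.0, 0.0],                    # r(s=0, a)
              [0.0, 2.0]])                   # r(s=1, a)
S, A, T, N, alpha = 2, 2, 10, 50, 0.05

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros((S, A))                     # tabular policy parameters

for it in range(200):
    grad = np.zeros_like(theta)
    for _ in range(N):                       # step 1: sample N trajectories
        s, glogp, ret = 0, np.zeros_like(theta), 0.0
        for _ in range(T):
            pi = softmax(theta[s])
            a = rng.choice(A, p=pi)
            glogp[s] += np.eye(A)[a] - pi    # grad of log softmax w.r.t. theta[s, :]
            ret += R[s, a]
            s = rng.choice(S, p=P[s, a])
        grad += glogp * ret                  # step 2: (sum_t grad log pi) * (sum_t r)
    theta += alpha * grad / N                # step 3: theta <- theta + alpha * grad J

print("learned policy:", softmax(theta[0]), softmax(theta[1]))
```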


Taking a Closer Look


 
§ Ignore the reward sum for a moment and look only at the gradient factor,

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})

§ What is given by log πθ (ai,t |si,t )? - It is the log of the probability of
action ai,t at state si,t under the distribution parameterized by θ.
§ This gives the likelihood, i.e., how likely we are to see ai,t as the
action if our policy is defined by the current θ that we have.
§ Computing the gradient and taking a step along the direction of the
gradient changes θ in such a way that the likelihood of the action
ai,t increases.
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\Big)

§ Now consider the case when the gradient term is multiplied by
\sum_{t=1}^{T} r(s_{i,t}, a_{i,t}), as in the full estimate above.
§ Actions on trajectories with high rewards are made more likely.

Taking a Closer Look

§ Good stuff is made more likely.


§ Bad stuff is made less likely.
§ This formalizes ‘trial and error’ learning.

Figure credit: [Sergey Levine, UC Berkeley]


Bias and Variance in Estimation

§ One way to work with values we do not know is to estimate them by


experimenting repeatedly.
§ Monte-Carlo methods provide the estimate of the true value and we
have used Monte-Carlo methods to estimate the value functions and
to estimate the gradient of the expected return.
§ The estimator is a function of the data, which themselves are random
variables. So the estimated value is itself random: if you conduct the
experiment multiple times, in general, the estimator will provide
different values.
§ An estimator is good if,
  ▸ on average, the estimated values are close to the true value across
    different trials - (bias)
  ▸ the estimates do not vary much from trial to trial - (variance)


Unbiased Estimators
§ An unbiased estimator is one that yields the true value of the
variable being estimated on average. With θ denoting the true value
and θ̂ denoting the estimated value, an unbiased estimator is one
with E[θ̂] = θ.
§ Naturally, bias is defined as,
b = E[θ̂] − θ

§ Let us consider estimating a constant value (say, the temperature of this
room) with some sensors which are not perfect. Consider the
observations

x[n] = \theta + w[n], \quad n = 0, 1, \cdots, N-1, \qquad \text{where } w[n] \text{ is WGN with variance } \sigma^2.

§ A reasonable estimator is the average value of x[n], i.e.,

\hat{\theta} = \frac{1}{N} \sum_{n=0}^{N-1} x[n]

Estimator Bias
§ The sample mean estimator is unbiased:

E[\hat{\theta}] = E\Big[\frac{1}{N} \sum_{n=0}^{N-1} x[n]\Big] = \frac{1}{N} \sum_{n=0}^{N-1} E[x[n]]
= \frac{1}{N} \sum_{n=0}^{N-1} E[\theta + w[n]] = \frac{1}{N} \sum_{n=0}^{N-1} \big(E[\theta] + E[w[n]]\big)
= \frac{1}{N} \sum_{n=0}^{N-1} (\theta + 0) = \theta

§ Let us see what happens with a modified estimator of x[n], i.e.,

\check{\theta} = \frac{1}{2N} \sum_{n=0}^{N-1} x[n]

§ It is easy to see that E[θ̌] = θ/2.
§ So the bias is b = E[θ̌] − θ = −θ/2.
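A quick simulation of the two estimators (the true value θ, the noise level and the sample sizes below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, N, trials = 20.0, 2.0, 50, 10_000          # true value, WGN std, samples, repeats

x = theta + sigma * rng.standard_normal((trials, N))     # x[n] = theta + w[n]
theta_hat = x.mean(axis=1)                                # (1/N)  * sum_n x[n]
theta_chk = x.sum(axis=1) / (2 * N)                       # (1/2N) * sum_n x[n]

print("bias of theta_hat:", theta_hat.mean() - theta)     # ~ 0
print("bias of theta_chk:", theta_chk.mean() - theta)     # ~ -theta/2
print("var  of theta_hat:", theta_hat.var(), " sigma^2/N =", sigma**2 / N)
```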


Estimator Variance

§ That an estimator is unbiased does not necessarily mean that it is a


good estimator. It is reasonable to check by repeating the experiment
how the results differ in successive trials.
§ Thus the variance of the estimate is another measure of goodness of
the estimator. And the aim will be to see how small we can make
var(θ̂).
§ Let us take the following 3 estimators for θ and see the variances of
all these.
\hat{\theta}_a = 0: \quad E(\hat{\theta}_a) = 0, \quad \mathrm{var}(\hat{\theta}_a) = 0

\hat{\theta}_b = x[0]: \quad E(\hat{\theta}_b) = E(x[0]) = E(\theta + w[0]) = \theta + 0 = \theta, \quad \mathrm{var}(\hat{\theta}_b) = \mathrm{var}(x[0]) = \sigma^2

\hat{\theta}_c = \frac{1}{N} \sum_{n=0}^{N-1} x[n]: \quad E(\hat{\theta}_c) = \theta \ \text{(already seen)}, \quad \mathrm{var}(\hat{\theta}_c) = E[(\hat{\theta}_c - E[\hat{\theta}_c])^2] \ \text{(continued on next slide)}


Estimator Variance
\mathrm{var}(\hat{\theta}_c) = E[(\hat{\theta}_c - E[\hat{\theta}_c])^2] = E\Big[\Big(\frac{1}{N} \sum_{n=0}^{N-1} x[n] - E[\hat{\theta}_c]\Big)^2\Big] \qquad (8)
= E\Big[\Big(\frac{1}{N} \sum_{n=0}^{N-1} (\theta + w[n]) - \theta\Big)^2\Big] = E\Big[\Big(\frac{1}{N} \sum_{n=0}^{N-1} w[n]\Big)^2\Big] = \frac{1}{N^2} E\Big[\Big(\sum_{n=0}^{N-1} w[n]\Big)^2\Big]

§ Now,

\mathrm{var}\Big(\sum_{n=0}^{N-1} w[n]\Big) = E\Big[\Big(\sum_{n=0}^{N-1} w[n] - \overbrace{E\Big[\sum_{n=0}^{N-1} w[n]\Big]}^{0}\Big)^2\Big] = E\Big[\Big(\sum_{n=0}^{N-1} w[n]\Big)^2\Big]

§ Using the above in eqn. (8),

\mathrm{var}(\hat{\theta}_c) = \frac{1}{N^2}\, \mathrm{var}\Big(\sum_{n=0}^{N-1} w[n]\Big) = \frac{1}{N^2} \sum_{n=0}^{N-1} \mathrm{var}(w[n]) \quad (\text{WGN})
= \frac{N \sigma^2}{N^2} = \frac{\sigma^2}{N}

Estimator Mean Square Error


§ The mean of the square error of estimation is,

\mathrm{mse}(\hat{\theta}) = E\big[(\hat{\theta} - \theta)^2\big] = E\big[(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2\big]
= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big] + E\big[(E[\hat{\theta}] - \theta)^2\big] + 2\, E\big[(\hat{\theta} - E[\hat{\theta}])(E[\hat{\theta}] - \theta)\big]
= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big] + (E[\hat{\theta}] - \theta)^2 + 2\, (E[\hat{\theta}] - \theta)\, E\big[\hat{\theta} - E[\hat{\theta}]\big]
\quad (\text{why?}) - (\text{Hint: What is random here?})
= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big] + (E[\hat{\theta}] - \theta)^2 + 2\, (E[\hat{\theta}] - \theta) \underbrace{\big(E[\hat{\theta}] - E[\hat{\theta}]\big)}_{0}
= \mathrm{var}(\hat{\theta}) + \mathrm{bias}^2(\hat{\theta})

§ So the mean square error in estimation is composed of errors due to
the variance of the estimator as well as the bias.
§ Recall MC evaluation,

G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T \quad \text{and} \quad v_\pi(s) = E[G_t \mid S_t = s]
\hat{v}_\pi(s) = \frac{1}{N} \sum_{i=1}^{N} G_t^{(i)}(S_t = s)

§ So v̂π (s) is an unbiased estimator, but with variance (inversely
proportional to the number of samples N).
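The decomposition mse = var + bias² is easy to verify numerically, e.g., with the biased estimator θ̌ from the previous slides (the numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, N, trials = 20.0, 2.0, 50, 200_000

x = theta + sigma * rng.standard_normal((trials, N))   # repeated experiments, x[n] = theta + w[n]
est = x.sum(axis=1) / (2 * N)                          # the biased estimator from the last slides

mse  = np.mean((est - theta) ** 2)
var  = est.var()
bias = est.mean() - theta
print(mse, var + bias ** 2)                            # the two numbers agree (up to sampling noise)
```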

Bias and Variance of MC and TD

§ One key contribution to the variance in MC evaluation comes from the
randomness at each timestep.
§ This is not the case in TD, as Gt is estimated by bootstrapping,
Ĝt = Rt+1 + γ V̂ (St+1 )

§ This makes the estimator suffer less from variance, as randomness
comes from only the one random step taken. The rest is deterministic.
§ But this introduces bias. The estimate always has the deterministic
additive component γ V̂ (St+1 ).
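A toy illustration of this trade-off (a made-up deterministic chain MRP with noisy rewards and γ = 1; the slightly wrong V̂(S₁) is an assumption chosen to expose the bias):

```python
import numpy as np

rng = np.random.default_rng(2)
K, sigma, trials = 10, 1.0, 100_000          # chain length, reward noise, repeats

# Chain MRP: s_0 -> s_1 -> ... -> s_K (terminal); each step gives a reward ~ N(1, sigma^2).
rewards = 1.0 + sigma * rng.standard_normal((trials, K))

true_V0   = K * 1.0                          # true value of s_0
mc_target = rewards.sum(axis=1)              # MC target: G_0, the sum of all K noisy rewards
V1_hat    = (K - 1) + 0.5                    # a fixed, slightly wrong estimate of V(s_1)
td_target = rewards[:, 0] + V1_hat           # TD target: R_1 + V_hat(s_1)

print("MC: bias", mc_target.mean() - true_V0, " var", mc_target.var())  # ~0 bias, var ~ K*sigma^2
print("TD: bias", td_target.mean() - true_V0, " var", td_target.var())  # ~0.5 bias, var ~ sigma^2
```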


Reducing Variance in Policy Gradient Estimate

§ We have seen,

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=1}^{T} r(s_t, a_t)\Big)\Big]

§ There is a lot of randomness inside each trajectory.
§ We can derive versions of this formula that eliminate terms to reduce
variance.
§ Let us apply the log derivative trick (\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)) to
compute the gradient for a single reward term,

\nabla_\theta E_\tau[r(s_t, a_t)] = E_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t'=1}^{t} \nabla_\theta \log \pi_\theta(a_{t'} \mid s_{t'})\Big)\, r(s_t, a_t)\Big] \qquad (9)

§ Note that the sum goes only up to t. Why? - The reward at timestep t
depends only on actions taken at t' ≤ t. - Causality


Reducing Variance in Policy Gradient Estimate

§ Summing over time, we get (with some reordering of the sums in the
last step),

\nabla_\theta E_\tau[r(\tau)] = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} r(s_t, a_t) \sum_{t'=1}^{t} \nabla_\theta \log \pi_\theta(a_{t'} \mid s_{t'})\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} r(s_{t'}, a_{t'})\Big] \qquad (10)

§ With less randomness inside each trajectory the variance is less, but
what about bias?


Reducing Variance in Policy Gradient Estimate


\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=1}^{T} r(s_t, a_t)\Big)\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t'=1}^{T} r(s_{t'}, a_{t'})\Big)\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \sum_{t'=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(s_{t'}, a_{t'})\Big]
= \sum_{t=1}^{T} \sum_{t'=1}^{T} E_{\tau \sim p_\theta(\tau)}\big[\overbrace{\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(s_{t'}, a_{t'})}^{f(t, t')}\big] \qquad (11)

§ Let us consider the term,

E_{\tau \sim p_\theta(\tau)}[f(t, t')] = E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(s_{t'}, a_{t'})\big] \qquad (12)

§ We will show that for the case of t' < t (the reward coming before the
action is performed) the above term is zero.

Reducing Variance in Policy Gradient Estimate


E_{\tau \sim p_\theta(\tau)}[f(t, t')] = \int p(\tau)\, f(t, t')\, d\tau
= \int p(s_1, a_1, \cdots, s_t, a_t, \cdots, s_{t'}, a_{t'}, \cdots)\, f(t, t')\, d(s_1, a_1, \cdots, s_t, a_t, \cdots, s_{t'}, a_{t'}, \cdots)
= \int p(s_t, a_t, s_{t'}, a_{t'})\, f(t, t')\, d(s_t, a_t, s_{t'}, a_{t'}) \qquad (13)

§ The above comes from the property below,

\int_X \int_Y f(X)\, P(X, Y)\, dY\, dX = \int_X \int_Y f(X)\, P(X)\, P(Y \mid X)\, dY\, dX
= \int_X f(X)\, P(X) \underbrace{\int_Y P(Y \mid X)\, dY}_{1}\, dX
= \int_X f(X)\, P(X)\, dX \qquad (14)

§ Taking X = {st , at , st′ , at′ } and Y as the rest.



Reducing Variance in Policy Gradient Estimate


§ Till now we have,

E_{\tau \sim p_\theta(\tau)}[f(t, t')] = \int p(s_t, a_t, s_{t'}, a_{t'})\, f(t, t')\, d(s_t, a_t, s_{t'}, a_{t'}) \qquad (15)

§ We will now use a variation of iterated expectation,

E_{A,B}[f(A, B)] = \int P(A, B)\, f(A, B)\, dB\, dA
= \int P(B \mid A)\, P(A)\, f(A, B)\, dB\, dA
= \int P(A) \Big[\int P(B \mid A)\, f(A, B)\, dB\Big] dA
= \int P(A)\, E_B[f(A, B) \mid A]\, dA
= E_A\big[E_B[f(A, B) \mid A]\big]

§ Taking A = (st′ , at′ ) and B = (st , at ), eqn. (15) can be written as,

E_{\tau \sim p_\theta(\tau)}[f(t, t')] = E_{s_{t'}, a_{t'}}\big[E_{s_t, a_t}[f(t, t') \mid s_{t'}, a_{t'}]\big] \qquad (16)


Reducing Variance in Policy Gradient Estimate


§ Putting the value of f(t, t') back in eqn. (16), we get,

E_{\tau \sim p_\theta(\tau)}[f(t, t')] = E_{s_{t'}, a_{t'}}\big[E_{s_t, a_t}[f(t, t') \mid s_{t'}, a_{t'}]\big] \qquad (17)
= E_{s_{t'}, a_{t'}}\big[E_{s_t, a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(s_{t'}, a_{t'}) \mid s_{t'}, a_{t'}]\big]
= E_{s_{t'}, a_{t'}}\big[r(s_{t'}, a_{t'})\, E_{s_t, a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid s_{t'}, a_{t'}]\big]

§ Let us take a closer look at the inner expectation,

E_{s_t, a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid s_{t'}, a_{t'}] = \int P(s_t, a_t \mid s_{t'}, a_{t'})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, d(a_t, s_t) \qquad (18)

§ Now, let us consider the timestep t to be greater than t', i.e., the action
occurs after the reward. In such a case, P(st , at |st′ , at′ ) can be
broken down into P(at |st )P(st |st′ , at′ ). Thus eqn. (18) becomes,

E_{s_t, a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid s_{t'}, a_{t'}] = \int \int P(a_t \mid s_t)\, P(s_t \mid s_{t'}, a_{t'})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, da_t\, ds_t
= \int P(s_t \mid s_{t'}, a_{t'}) \Big[\int P(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, da_t\Big] ds_t
= E_{s_t}\Big[E_{a_t}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid s_t\big] \,\Big|\, s_{t'}, a_{t'}\Big] \qquad (19)

Reducing Variance in Policy Gradient Estimate


§ Now we will use a neat trick known as the ‘Expected Grad-Log-
Probability’ (EGLP) lemma, which says E[∇θ log pθ (x)] = 0:

E_{x \sim p_\theta(x)}[\nabla_\theta \log p_\theta(x)] = \int p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, dx = \int p_\theta(x)\, \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}\, dx
= \int \nabla_\theta p_\theta(x)\, dx = \nabla_\theta \int p_\theta(x)\, dx = \nabla_\theta 1 = 0

§ Thus the inner expectation in eqn. (19) is 0. This, in turn, means
eqns. (17), (16) and (15) are all 0.
§ That is, E_{τ∼pθ(τ)}[f(t, t′)] = 0 for t > t′.

§ Now, for t ≤ t′, P(st , at |st′ , at′ ) cannot be broken down into
P(at |st )P(st |st′ , at′ ), as the past state (st ) would get conditioned on
future states and actions (st′ , at′ ), violating the Markov property.
§ So, E_{τ∼pθ(τ)}[f(t, t′)] ≠ 0 for t ≤ t′.
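The EGLP lemma is also easy to verify by sampling; the sketch below (illustrative, not from the slides) does it for a softmax-parameterized categorical distribution, for which ∇θ log pθ(x) = onehot(x) − p.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = np.array([0.2, -1.0, 0.5])                 # logits of a categorical p_theta
p = np.exp(theta) / np.exp(theta).sum()

x = rng.choice(len(p), size=500_000, p=p)          # x ~ p_theta(x)
score = np.eye(len(p))[x] - p                      # grad_theta log p_theta(x) for each sample
print(score.mean(axis=0))                          # ~ [0, 0, 0]  (EGLP lemma)
```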
 


Reducing Variance in Policy Gradient Estimate


§ So we began with,

\nabla_\theta J(\theta) = \sum_{t=1}^{T} \sum_{t'=1}^{T} E_{\tau \sim p_\theta(\tau)}[f(t, t')] \qquad (20)

and have shown that

E_{\tau \sim p_\theta(\tau)}[f(t, t')] \begin{cases} = 0 & \text{if } t' < t \\ \neq 0 & \text{if } t' \geq t \end{cases}

§ So, the gradient of the total expected return can be written as,

\nabla_\theta J(\theta) = \sum_{t=1}^{T} \sum_{t'=t}^{T} E_{\tau \sim p_\theta(\tau)}[f(t, t')] = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \sum_{t'=t}^{T} f(t, t')\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} r(s_{t'}, a_{t'})\Big] \qquad (21)

§ This is the ‘reward to go’ formulation we saw in eqn. (10), and it has
less variance. But it also equals the gradient of the total expected
reward, which is unbiased. So this is an unbiased, lower-variance
estimator of the policy gradient.
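A minimal sketch of the ‘reward to go’ estimator of eqn. (21) next to the naive estimator of eqn. (6), for a single trajectory (array shapes and the toy numbers are illustrative assumptions):

```python
import numpy as np

def rewards_to_go(r):
    """[r_1, ..., r_T] -> [sum_{t'>=1} r_t', sum_{t'>=2} r_t', ..., r_T]."""
    return np.cumsum(r[::-1])[::-1]

def pg_estimates(glogpi, r):
    """glogpi: (T, d) per-step grad_theta log pi_theta(a_t|s_t); r: (T,) per-step rewards.
    Returns the naive (eqn. (6)) and reward-to-go (eqn. (21)) estimates for one trajectory;
    averaging either over sampled trajectories gives an estimate of grad_theta J(theta)."""
    naive = glogpi.sum(axis=0) * r.sum()                        # (sum_t glogpi_t) * (sum_t r_t)
    togo  = (glogpi * rewards_to_go(r)[:, None]).sum(axis=0)    # sum_t glogpi_t * sum_{t'>=t} r_t'
    return naive, togo

# toy numbers, for illustration only
print(pg_estimates(np.array([[0.1], [-0.2], [0.3]]), np.array([1.0, 0.0, 2.0])))
```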

Baselines

§ Good stuff is made more likely.


§ Bad stuff is made less likely.
§ What if all have high reward?

Figure credit: [Sergey Levine, UC Berkeley]

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, r(\tau)] = E_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=1}^{T} r(s_t, a_t)\Big)\Big]

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big]

§ Will it remain unbiased?
§ Only if E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, b] = b\, E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)] = 0.
§ And E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)] = 0 by the EGLP Lemma.


Baselines
§ So subtracting a constant baseline keeps the estimate unbiased.
§ A reasonable choice of baseline is the average reward across the
sampled trajectories, b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i).
§ What about variance?

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big]

\mathrm{var} = E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big)^2\Big] - \Big(E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big]\Big)^2
= E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big)^2\Big] - \Big(E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big]\Big)^2

\frac{\partial\, \mathrm{var}}{\partial b} = \frac{\partial}{\partial b}\, E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big)^2\Big] - 0
= \frac{\partial}{\partial b}\, E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 \big(r^2(\tau) - 2\, r(\tau)\, b + b^2\big)\Big]

Baselines

\frac{\partial\, \mathrm{var}}{\partial b} = \frac{\partial}{\partial b}\, E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 \big(r^2(\tau) - 2\, r(\tau)\, b + b^2\big)\Big]
= 0 - 2\, E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 r(\tau)\Big] + 2\, b\, E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2\Big]

§ For minimum variance,

\frac{\partial\, \mathrm{var}}{\partial b} = 0
\implies - E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 r(\tau)\Big] + b\, E_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2\Big] = 0
\implies b = \frac{E_{\tau \sim p_\theta(\tau)}\big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 r(\tau)\big]}{E_{\tau \sim p_\theta(\tau)}\big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2\big]}
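A small sketch of a baseline-subtracted gradient estimate, offering either the average-return baseline or the variance-minimizing baseline above (applying the scalar formula per parameter dimension is my assumption; inputs are per-trajectory quantities):

```python
import numpy as np

def pg_with_baseline(glogp, returns, baseline="mean"):
    """glogp: (N, d) grad_theta log p_theta(tau_i);  returns: (N,) r(tau_i).
    Subtracting a constant baseline keeps the estimate unbiased (EGLP lemma)."""
    if baseline == "mean":
        b = returns.mean()                       # b = (1/N) sum_i r(tau_i)
    else:
        g2 = glogp ** 2                          # (grad log p)^2, per parameter dimension
        b = (g2 * returns[:, None]).mean(axis=0) / g2.mean(axis=0)   # E[g^2 r] / E[g^2]
    return (glogp * (returns[:, None] - b)).mean(axis=0)
```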


Advantage Function

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \underbrace{\sum_{t'=t}^{T} r(s_{t'}, a_{t'})}_{\hat{Q}^\theta(s_t, a_t)}\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{Q}^\theta(s_t, a_t)\Big]

§ It would be good to have the true value of Q to be used in the


equation.
§ But that is not available to us.
§ Other alternatives are to estimate this value using methods that we
have seen earlier - MC evaluation, Bootstrapped evaluation (TD),
using function approximation for these.


Advantage Function

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big(Q^\theta(s_t, a_t) - E_{a_t}\big[Q^\theta(s_t, a_t)\big]\Big)\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big(Q^\theta(s_t, a_t) - V^\theta(s_t)\Big)\Big]
= E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\theta(s_t, a_t)\Big]

§ We can also use a baseline version of this.


§ This is called the ‘Advantage function’.
§ A(st , at ) can be approximated following the methods we used earlier
(single sample backup or bootstrapping)


Advantage Function

\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\theta(s_t, a_t)\Big]
\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, A^\theta(s_{i,t}, a_{i,t})

§ Qθ (st , at ) ≈ r(st , at ) + V θ (st+1 )


§ Aθ (st , at ) ≈ r(st , at ) + V θ (st+1 ) − V θ (st )
§ So we can use a neural network which learns to produce V (s)
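A minimal one-step actor-critic update along these lines, sketched in PyTorch (network sizes, learning rate, the discount factor (which the slides defer) and the batch/done-masking interface are illustrative assumptions, not from the slides): the critic learns V(s), the advantage is estimated as r + γV(s′) − V(s), and the actor ascends ∇θ log πθ(a|s)·A.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))  # logits of pi_theta
value  = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))          # V(s)
opt = torch.optim.Adam(list(policy.parameters()) + list(value.parameters()), lr=3e-4)

def actor_critic_update(s, a, r, s_next, done):
    """s, s_next: (B, obs_dim) float tensors; a: (B,) long; r, done: (B,) float tensors."""
    v_s    = value(s).squeeze(-1)
    v_next = value(s_next).squeeze(-1).detach()
    adv    = r + gamma * (1.0 - done) * v_next - v_s          # A(s,a) ~ r + gamma*V(s') - V(s)
    logpi   = torch.log_softmax(policy(s), dim=-1)
    logpi_a = logpi.gather(1, a.unsqueeze(1)).squeeze(1)      # log pi_theta(a|s)
    actor_loss  = -(logpi_a * adv.detach()).mean()            # gradient: -E[grad log pi * A]
    critic_loss = adv.pow(2).mean()                           # fit V(s) to the bootstrapped target
    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()
```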


Actor-Critic

Figure credit: [Sergey Levine, UC Berkeley]
