08 PG Methods
Abir Das
IIT Kharagpur
Agenda
Resources
$$
\begin{aligned}
p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
&= p(s_{T+1} \mid s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\, p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) \\
&= p(s_{T+1} \mid s_T, a_T)\, p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) \\
&= p(s_{T+1} \mid s_T, a_T)\, p(a_T \mid s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\, p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) \\
&= p(s_{T+1} \mid s_T, a_T)\, \pi_\theta(a_T \mid s_T)\, \boxed{p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)} \qquad (2)
\end{aligned}
$$
§ The boxed part of the equation is very similar to the left hand side. So, applying the same argument repeatedly, we get,
$$
\begin{aligned}
p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
&= p(s_{T+1} \mid s_T, a_T)\, \pi_\theta(a_T \mid s_T)\, p(s_T \mid s_{T-1}, a_{T-1})\, \pi_\theta(a_{T-1} \mid s_{T-1})\, p(s_{T-1}, s_{T-2}, a_{T-2}, \cdots, s_1, a_1) \\
&= p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t) \qquad (3)
\end{aligned}
$$
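Equation (3) says the trajectory log-probability decomposes into a sum of initial-state, transition, and policy log-terms. Below is a minimal sketch of computing $\log p_\theta(\tau)$ under this factorization, assuming a small tabular MDP with a hypothetical initial distribution `p1`, transition table `P[s, a, s']`, and policy table `pi_theta[s, a]` (none of these appear in the slides).

```python
import numpy as np

def trajectory_log_prob(p1, P, pi_theta, states, actions):
    """log p_theta(tau) = log p(s_1) + sum_t [ log pi_theta(a_t|s_t) + log p(s_{t+1}|s_t,a_t) ].

    p1       : (S,)    initial-state distribution p(s_1)
    P        : (S,A,S) transition probabilities p(s'|s,a)
    pi_theta : (S,A)   policy probabilities pi_theta(a|s)
    states   : [s_1, ..., s_{T+1}],  actions : [a_1, ..., a_T]
    """
    logp = np.log(p1[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        logp += np.log(pi_theta[s, a]) + np.log(P[s, a, s_next])
    return logp
```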
§ Note that, for the time being, we are not considering discounting. We will come back to that.
§ We will see how we can optimize this objective: the expected value of the total reward under the trajectory distribution induced by the policy $\pi_\theta$.
§ But before that, let us see how we can evaluate the objective in a model-free setting.
$$
J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big] \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t} r(s_{i,t}, a_{i,t}) \qquad (4)
$$
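Equation (4) says the objective can be evaluated model-free: roll out the current policy $N$ times and average the summed rewards. A minimal sketch, assuming a Gymnasium-style environment `env` and a hypothetical `policy(s)` function that samples an action (neither appears in the slides):

```python
def estimate_J(env, policy, N=100, T=200):
    """Monte Carlo estimate of J(theta) = E[ sum_t r(s_t, a_t) ]."""
    returns = []
    for _ in range(N):
        s, _ = env.reset()
        total = 0.0
        for _ in range(T):
            a = policy(s)                                   # a_t ~ pi_theta(.|s_t)
            s, r, terminated, truncated, _ = env.step(a)
            total += r                                      # undiscounted sum of rewards
            if terminated or truncated:
                break
        returns.append(total)
    return sum(returns) / N                                 # (1/N) sum_i sum_t r(s_it, a_it)
```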
§ Writing $r(\tau) = \sum_t r(s_t, a_t)$ for the total reward of a trajectory $\tau$,
$$
J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[r(\tau)] = \int p_\theta(\tau)\, r(\tau)\, d\tau
$$
$$
\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau
$$
§ The log-derivative trick,
$$
\nabla_\theta \log p_\theta(\tau) = \frac{\partial \log p_\theta(\tau)}{\partial p_\theta(\tau)}\, \nabla_\theta p_\theta(\tau) = \frac{1}{p_\theta(\tau)}\, \nabla_\theta p_\theta(\tau)
\;\Longrightarrow\; \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) \qquad (5)
$$
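Identity (5) holds for any density that is differentiable in $\theta$; a tiny numerical check, assuming (purely for illustration) a one-parameter Gaussian $p_\theta(\tau) = \mathcal{N}(\tau; \theta, 1)$ and finite differences:

```python
import numpy as np

def p(tau, theta):
    """Gaussian density N(tau; theta, 1), standing in for p_theta(tau)."""
    return np.exp(-0.5 * (tau - theta) ** 2) / np.sqrt(2 * np.pi)

theta, tau, eps = 0.3, 1.7, 1e-6
grad_p = (p(tau, theta + eps) - p(tau, theta - eps)) / (2 * eps)   # grad_theta p_theta(tau)
grad_log_p = (np.log(p(tau, theta + eps)) - np.log(p(tau, theta - eps))) / (2 * eps)

# eqn (5): grad p_theta(tau) = p_theta(tau) * grad log p_theta(tau)
assert np.isclose(grad_p, p(tau, theta) * grad_log_p)
```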
Agenda Introduction REINFORCE Bias/Variance
Abir Das (IIT Kharagpur) CS60077 Oct 28, 30, Nov 05, 2021 12 / 38
Agenda Introduction REINFORCE Bias/Variance
Abir Das (IIT Kharagpur) CS60077 Oct 28, 30, Nov 05, 2021 12 / 38
Agenda Introduction REINFORCE Bias/Variance
Abir Das (IIT Kharagpur) CS60077 Oct 28, 30, Nov 05, 2021 12 / 38
Agenda Introduction REINFORCE Bias/Variance
Abir Das (IIT Kharagpur) CS60077 Oct 28, 30, Nov 05, 2021 12 / 38
Agenda Introduction REINFORCE Bias/Variance
Abir Das (IIT Kharagpur) CS60077 Oct 28, 30, Nov 05, 2021 12 / 38
Agenda Introduction REINFORCE Bias/Variance
Abir Das (IIT Kharagpur) CS60077 Oct 28, 30, Nov 05, 2021 13 / 38
Agenda Introduction REINFORCE Bias/Variance
§ So, to get the estimate of the gradient we take samples and average
not only the sum of rewards but also average the sum of the gradients
of the policy values.
N
" T T
#
1 X X X
∇θ J(θ) ≈ ∇θ log πθ (ai,t |si,t ) r(si,t , ai,t )
N i=1 t=1 t=1
$$
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)
$$
REINFORCE Algorithm
1. Sample trajectories $\{\tau_i\}$ from $\pi_\theta(a_t \mid s_t)$ (run the policy)
2. $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big]\Big[\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\Big]$
3. $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$
4. Repeat

Figure credit: [Sergey Levine, UC Berkeley]
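The four steps translate almost line-for-line into code. A sketch of one REINFORCE update, assuming a discrete-action Gymnasium-style `env`, a PyTorch `policy_net` mapping a state to action logits, and an `optimizer` over its parameters (all hypothetical, not from the slides); gradient ascent is implemented as descent on the negated surrogate objective.

```python
import torch

def reinforce_step(env, policy_net, optimizer, N=10, T=200):
    """One REINFORCE update: sample N trajectories, build a surrogate loss whose
    gradient equals -(1/N) sum_i [sum_t grad log pi(a_it|s_it)] [sum_t r(s_it, a_it)]."""
    loss = 0.0
    for _ in range(N):                                    # step 1: run the policy
        s, _ = env.reset()
        log_probs, rewards = [], []
        for _ in range(T):
            logits = policy_net(torch.as_tensor(s, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            a = dist.sample()
            log_probs.append(dist.log_prob(a))
            s, r, terminated, truncated, _ = env.step(a.item())
            rewards.append(r)
            if terminated or truncated:
                break
        # step 2: this trajectory's contribution to the gradient estimate
        loss = loss - torch.stack(log_probs).sum() * sum(rewards) / N
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                      # step 3: theta <- theta + alpha grad J
    # step 4: call reinforce_step again
```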
§ Now consider the case when the log-policy gradient is getting multiplied by $\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})$.
§ Those actions with high rewards are made more likely.
Unbiased Estimators
§ An unbiased estimator is one that yields the true value of the variable being estimated on average. With $\theta$ denoting the true value and $\hat{\theta}$ denoting the estimated value, an unbiased estimator is one with $\mathbb{E}[\hat{\theta}] = \theta$.
§ Naturally, bias is defined as $b = \mathbb{E}[\hat{\theta}] - \theta$.
Estimator Bias
§ The sample mean estimator is unbiased. (Here the observations are $x[n] = \theta + w[n]$ with zero-mean noise $w[n]$, and $\hat{\theta} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$.)
$$
\begin{aligned}
\mathbb{E}[\hat{\theta}] &= \mathbb{E}\Big[\frac{1}{N}\sum_{n=0}^{N-1} x[n]\Big] = \frac{1}{N}\sum_{n=0}^{N-1}\mathbb{E}\big[x[n]\big] \\
&= \frac{1}{N}\sum_{n=0}^{N-1}\mathbb{E}\big[\theta + w[n]\big] = \frac{1}{N}\sum_{n=0}^{N-1}\big(\mathbb{E}[\theta] + \mathbb{E}[w[n]]\big) \\
&= \frac{1}{N}\sum_{n=0}^{N-1}\big(\theta + 0\big) = \theta
\end{aligned}
$$
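A quick numerical illustration of the calculation above, assuming Gaussian noise for concreteness (the derivation only needs $\mathbb{E}[w[n]] = 0$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, N, trials = 2.5, 10, 100_000

# sample-mean estimate theta_hat, repeated over many independent experiments
estimates = (theta_true + rng.normal(0.0, 1.0, size=(trials, N))).mean(axis=1)

print(estimates.mean())        # ~ 2.5, so E[theta_hat] = theta and bias b ~ 0
```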
Estimator Variance
$$
\begin{aligned}
\mathrm{var}(\hat{\theta}_c) &= \mathbb{E}\big[(\hat{\theta}_c - \mathbb{E}[\hat{\theta}_c])^2\big]
= \mathbb{E}\Big[\Big(\frac{1}{N}\sum_{n=0}^{N-1} x[n] - \mathbb{E}[\hat{\theta}_c]\Big)^2\Big] \qquad (8) \\
&= \mathbb{E}\Big[\Big(\frac{1}{N}\sum_{n=0}^{N-1} \big(\theta + w[n]\big) - \theta\Big)^2\Big]
= \mathbb{E}\Big[\Big(\frac{1}{N}\sum_{n=0}^{N-1} w[n]\Big)^2\Big]
= \frac{1}{N^2}\,\mathbb{E}\Big[\Big(\sum_{n=0}^{N-1} w[n]\Big)^2\Big]
\end{aligned}
$$
§ Now,
$$
\mathrm{var}\Big[\sum_{n=0}^{N-1} w[n]\Big]
= \mathbb{E}\bigg[\Big(\sum_{n=0}^{N-1} w[n] - \mathbb{E}\Big[\sum_{n=0}^{N-1} w[n]\Big]\Big)^2\bigg]
= \mathbb{E}\bigg[\Big(\sum_{n=0}^{N-1} w[n] - \underbrace{\mathbb{E}\Big[\sum_{n=0}^{N-1} w[n]\Big]}_{0}\Big)^2\bigg]
= \mathbb{E}\bigg[\Big(\sum_{n=0}^{N-1} w[n]\Big)^2\bigg]
$$
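The slide content that continues this derivation is not in the extracted text; for uncorrelated noise samples with variance $\sigma^2$, the cross-terms vanish, giving $\mathbb{E}\big[(\sum_n w[n])^2\big] = N\sigma^2$ and hence $\mathrm{var}(\hat{\theta}_c) = \sigma^2/N$ (a standard completion, stated here as an assumption about where the slides go). The sketch below checks the $1/N$ scaling numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, sigma, trials = 2.5, 1.0, 100_000

for N in (1, 10, 100):
    estimates = (theta_true + rng.normal(0.0, sigma, size=(trials, N))).mean(axis=1)
    print(N, estimates.var(), sigma**2 / N)   # empirical variance tracks sigma^2 / N
```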
§ With less randomness inside each trajectory the variance is less, but
what about bias?
§ We will show that for the case of $t' < t$ (the reward comes before the action is performed), the above term is zero.
$$
= \int p(s_1, a_1, \cdots, s_t, a_t, \cdots, s_{t'}, a_{t'}, \cdots)\, f(t, t')
$$
§ The above comes from the property below.
$$
\begin{aligned}
\int_X \int_Y f(X)\, P(X, Y)\, dY\, dX
&= \int_X \int_Y f(X)\, P(X)\, P(Y \mid X)\, dY\, dX \\
&= \int_X f(X)\, P(X) \underbrace{\int_Y P(Y \mid X)\, dY}_{=\,1}\, dX \\
&= \int_X f(X)\, P(X)\, dX \qquad (14)
\end{aligned}
$$
§ Now, let us consider the timestep $t$ to be greater than $t'$, i.e., the action occurs after the reward. In such a case, $P(s_t, a_t \mid s_{t'}, a_{t'})$ can be broken down into $P(a_t \mid s_t)\, P(s_t \mid s_{t'}, a_{t'})$. Thus eqn. (18) becomes,
$$
\begin{aligned}
\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \,\big|\, s_{t'}, a_{t'}\big]
&= \int_{s_t} \int_{a_t} P(a_t \mid s_t)\, P(s_t \mid s_{t'}, a_{t'})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, da_t\, ds_t \\
&= \int_{s_t} P(s_t \mid s_{t'}, a_{t'}) \int_{a_t} P(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, da_t\, ds_t \\
&= \mathbb{E}_{s_t}\Big[\, \mathbb{E}_{a_t}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid s_t\big] \,\Big|\, s_{t'}, a_{t'}\Big] \qquad (19)
\end{aligned}
$$
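The step that presumably follows on the next slide (shown here only for completeness) is that the inner expectation in (19) is the expected score of the policy, which is zero:
$$
\mathbb{E}_{a_t}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid s_t\big]
= \int \pi_\theta(a_t \mid s_t)\, \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, da_t
= \nabla_\theta \int \pi_\theta(a_t \mid s_t)\, da_t
= \nabla_\theta\, 1 = 0,
$$
so every term with $t' < t$ vanishes and past rewards can be dropped from the estimator without introducing bias.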
Baselines
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big]
= \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t=1}^{T} r(s_t, a_t)\Big]
$$
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big]
$$
§ So subtracting a constant baseline keeps the estimate unbiased.
§ A reasonable choice of baseline is the average reward across the $N$ trajectories, $b = \frac{1}{N}\sum_{i=1}^{N} r(\tau_i)$.
§ What about variance?
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big]
$$
$$
\begin{aligned}
\mathrm{var} &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big)^2\Big]
- \Big(\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big]\Big)^2 \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\, [r(\tau) - b]\big)^2\Big]
- \Big(\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big]\Big)^2
\end{aligned}
$$
$$
\frac{\partial\, \mathrm{var}}{\partial b}
= \frac{\partial}{\partial b}\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 \big[r(\tau) - b\big]^2\Big] - 0
= \frac{\partial}{\partial b}\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 \big(r^2(\tau) - 2\, r(\tau)\, b + b^2\big)\Big]
$$
$$
\frac{\partial\, \mathrm{var}}{\partial b}
= 0 - 2\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2\, r(\tau)\Big]
+ 2\, b\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\big(\nabla_\theta \log p_\theta(\tau)\big)^2\Big]
$$
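Setting this derivative to zero gives the variance-minimizing baseline $b^{*} = \mathbb{E}\big[(\nabla_\theta \log p_\theta(\tau))^2\, r(\tau)\big] \big/ \mathbb{E}\big[(\nabla_\theta \log p_\theta(\tau))^2\big]$, i.e., a reward average weighted by squared gradient magnitude. A sketch of estimating it from samples, applied element-wise per parameter dimension (an assumption on top of the scalar derivation above), with hypothetical inputs `grad_log_probs` and `returns`:

```python
import numpy as np

def optimal_baseline(grad_log_probs, returns):
    """b* = E[(grad log p)^2 r] / E[(grad log p)^2], estimated per parameter dimension.

    grad_log_probs : (N, D) array, row i holds grad_theta log p_theta(tau_i)
    returns        : (N,)   array, entry i holds r(tau_i)
    """
    g2 = grad_log_probs ** 2
    return (g2 * returns[:, None]).mean(axis=0) / g2.mean(axis=0)
```

In practice the simple average return $b = \frac{1}{N}\sum_i r(\tau_i)$ from above is usually close enough and commonly used.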
Advantage Function
$$
\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \underbrace{\sum_{t'=t}^{T} r(s_{t'}, a_{t'})}_{\hat{Q}^{\theta}(s_t, a_t)}\Big] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{Q}^{\theta}(s_t, a_t)\Big]
\end{aligned}
$$
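The reward-to-go $\hat{Q}^{\theta}(s_t, a_t) = \sum_{t' \ge t} r(s_{t'}, a_{t'})$ can be computed for a sampled trajectory in a single backward pass; a small sketch (undiscounted, matching the slides):

```python
def rewards_to_go(rewards):
    """q_hat[t] = r[t] + r[t+1] + ... + r[T-1], computed backwards in O(T)."""
    q_hat = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        q_hat[t] = running
    return q_hat

# e.g. rewards_to_go([1.0, 0.0, 2.0]) -> [3.0, 2.0, 2.0]
```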
$$
\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Big(Q^{\theta}(s_t, a_t) - \mathbb{E}_{a_t}\big[Q^{\theta}(s_t, a_t)\big]\Big)\Big] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(Q^{\theta}(s_t, a_t) - V^{\theta}(s_t)\big)\Big] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\theta}(s_t, a_t)\Big]
\end{aligned}
$$
$$
\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\theta}(s_t, a_t)\Big] \\
&\approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, A^{\theta}(s_{i,t}, a_{i,t})
\end{aligned}
$$
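With a learned state-value network standing in for $V^{\theta}(s_t)$, the estimator above becomes the advantage-based policy gradient that leads into actor-critic methods. A sketch of the per-trajectory surrogate loss, assuming PyTorch with hypothetical `policy_net` (action logits) and `value_net` (scalar state value), neither of which appears in the slides:

```python
import torch

def advantage_pg_loss(policy_net, value_net, states, actions, rewards):
    """Surrogate loss whose gradient is  -sum_t grad log pi(a_t|s_t) * A(s_t, a_t),
    with A(s_t, a_t) = reward-to-go - V(s_t) as the advantage estimate."""
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    # undiscounted reward-to-go: q_hat[t] = sum_{t' >= t} r[t']
    q_hat = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
    advantages = q_hat - value_net(states).squeeze(-1).detach()   # A = Q_hat - V(s)
    log_probs = torch.distributions.Categorical(logits=policy_net(states)).log_prob(actions)
    return -(log_probs * advantages).sum()   # minimizing this ascends on J(theta)
```

The value network itself would be fit by regression onto the reward-to-go targets, which is where the critic of the next slide comes in.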
Actor-Critic