An Introduction to Policy Search Methods
Thomas Furmston
January 23, 2017
Markov Decision Processes
Markov decision processes (MDPs) are the standard model
for optimal control in a fully observable environment.
Successful applications include,
robotics,
board games, such as chess, backgammon & Go,
computer games, such as Tetris & Atari 2600 video games,
traffic management, elevator scheduling & helicopter flight control.
Markov Decision Processes - Notation
A Markov decision process is described by the tuple
$(S, A, D, P, R)$, in which,
$S$ - state space (finite set)
$A$ - action space (finite set)
$D$ - initial state distribution
$P$ - transition dynamics, which is a set of conditional
distributions over the state space, $\{P(\cdot \mid s, a)\}_{(s, a) \in S \times A}$
$R$ - reward function, which is a function $R(\cdot, \cdot) : S \times A \to \mathbb{R}$
Markov Decision Processes - Notation
Given an MDP we then have a policy, $\pi$.
This is a set of conditional distributions over the action space,
$\{\pi(\cdot \mid s)\}_{s \in S}$, which is used to determine which action to take
given the current state of the environment.
The policy can be optimised in order to maximise an objective.
Markov Decision Processes - Sampling
1: Sample initial state : $s_1 \sim D(\cdot)$
2: Sample initial action : $a_1 \sim \pi(\cdot \mid s = s_1)$
3: for $t = 1$ to $H$ do
4:   Obtain reward : $r_t = R(s_t, a_t)$
5:   Sample next state : $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action : $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1})$
7: end for
Algorithm 1: Pseudocode for sampling a trajectory from a
Markov decision process.
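As a rough illustration of Algorithm 1, the Python sketch below samples a trajectory from a small toy tabular MDP. The arrays D, P, R and the uniform policy pi are illustrative stand-ins chosen here, not objects defined in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, H = 3, 2, 10

D = np.array([0.5, 0.3, 0.2])                                      # initial state distribution D
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a] = P(. | s, a)
R = rng.standard_normal((n_states, n_actions))                     # reward function R(s, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)               # a uniform policy pi(a | s)

s = rng.choice(n_states, p=D)                                      # sample initial state
a = rng.choice(n_actions, p=pi[s])                                 # sample initial action
trajectory = []
for t in range(H):
    r = R[s, a]                                                    # obtain reward
    trajectory.append((s, a, r))
    s = rng.choice(n_states, p=P[s, a])                            # sample next state
    a = rng.choice(n_actions, p=pi[s])                             # sample next action
```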
Tetris - An Example
Total Expected Reward
We shall consider the total expected reward over an infinite
horizon with discounted rewards.
Discount factor - $\gamma \in [0, 1)$.
The objective function takes the form,
$$ U(\pi) := \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t}\big[ \gamma^{t-1} R(s_t, a_t); \pi \big], \qquad (1) $$
in which $p_t$ is the occupancy marginal at time $t$ given $\pi$.
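A simple way to read (1) is as a Monte Carlo average: sample trajectories, compute their discounted returns, and average. A minimal sketch, assuming a sample_trajectory helper that returns a list of (state, action, reward) tuples (e.g. the sampling loop above):

```python
def discounted_return(trajectory, gamma):
    """Sum of gamma^(t-1) * R(s_t, a_t) along one trajectory of (s, a, r) tuples."""
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))

def estimate_objective(sample_trajectory, gamma=0.95, n_samples=1000):
    """Monte Carlo estimate of U(pi) by averaging discounted returns."""
    returns = [discounted_return(sample_trajectory(), gamma) for _ in range(n_samples)]
    return sum(returns) / n_samples
```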
Value Functions & Dynamic Programming
Value functions are a core concept in Markov decision processes.
The state value function is given by,
$$ V^{\pi}(s) := \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t}\big[ \gamma^{t-1} R(s_t, a_t) \mid s_1 = s; \pi \big], $$
which satisfies the fixed point equation, known as the Bellman
equation,
$$ V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[ R(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[ V^{\pi}(s') \big] \Big]. $$
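Since, for $\gamma \in [0, 1)$, $V^{\pi}$ is the unique fixed point of the Bellman equation, it can be computed on a small tabular MDP by iterating that equation. A minimal sketch, assuming the array layout of the earlier toy MDP (P of shape (S, A, S), R and pi of shape (S, A)):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterate V(s) <- E_{a~pi(.|s)}[R(s, a) + gamma E_{s'~P(.|s,a)}[V(s')]] to convergence."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V           # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        V_new = (pi * Q).sum(axis=1)    # average over actions under pi
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```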
Value Functions & Dynamic Programming
The state-action value function is given by,
$$ Q^{\pi}(s, a) := \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t}\big[ \gamma^{t-1} R(s_t, a_t) \mid s_1 = s, a_1 = a; \pi \big], $$
which can also be written in terms of the state value function,
$$ Q^{\pi}(s, a) = R(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[ V^{\pi}(s') \big]. $$
The global optimum of (1) can be found through dynamic programming.
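For concreteness, one classical dynamic programming scheme is value iteration; the sketch below (an illustration on the same toy tabular arrays, not something from the slides) returns the optimal value function together with a greedy optimal policy.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V(s) <- max_a [R(s, a) + gamma E_{s'~P(.|s,a)}[V(s')]]; return V* and a greedy policy."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # greedy deterministic policy
        V = V_new
```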
Dynamic Programming
Dynamic programming is infeasible for many real-world problems
of interest.
As a result, most research has focused on obtaining approximate
or locally optimal solutions, including,
approximate dynamic programming methods,
tree search methods,
local trajectory-optimization techniques, e.g., differential dynamic programming,
policy search methods.
Policy Search Methods
Policy search methods are typically specialized applications of
techniques from numerical optimization.
As such, the policy is given some differentiable parametric form,
denoted $\pi(a \mid s; w)$, or $\pi_w$, with policy parameters $w \in W \subseteq \mathbb{R}^n$,
$n \in \mathbb{N}$.
For example,
$$ \pi(a \mid s; w) = \frac{\exp\big( w^{\top} \phi(a, s) \big)}{\sum_{a' \in A} \exp\big( w^{\top} \phi(a', s) \big)}, \qquad (2) $$
in which $\phi : A \times S \to \mathbb{R}^n$ is a feature mapping.
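A small sketch of the parametrisation in (2): the policy is a softmax over the linear scores $w^{\top} \phi(a, s)$. The feature map phi is assumed to be any callable returning a length-n vector; it is a placeholder for problem-specific features.

```python
import numpy as np

def softmax_policy(w, phi, s, actions):
    """Return the probability vector pi(. | s; w) over the given actions."""
    logits = np.array([w @ phi(a, s) for a in actions])
    logits -= logits.max()                 # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```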
Policy Search Methods
We overload notation and write the objective function directly in
terms of the parameter vector, i.e.,
$$ U(w) := U(\pi_w), \quad \forall\, w \in W. \qquad (3) $$
Similarly, for all $w \in W$, we have,
$$ V(s; w) := V^{\pi_w}(s), \quad \forall\, s \in S, $$
$$ Q(s, a; w) := Q^{\pi_w}(s, a), \quad \forall\, (s, a) \in S \times A, $$
$$ p_t(s, a; w) := p_t^{\pi_w}(s, a), \quad \forall\, (s, a) \in S \times A. $$
Policy Search Methods
Local information, such as the gradient of the objective function,
is used to update the policy in an incremental manner until
convergence to a local optimum.
Benefits include,
General convergence guarantees.
Good anytime performance.
Only necessary to approximate a low-dimensional projection of
the value function.
Easily extendible to models for partially observable
environments, such as finite state controllers.
Policy Gradient Theorem
Theorem (Policy Gradient Theorem [1])
Given a Markov decision process with objective (1), then for any
$w \in W$, the gradient of (3) takes the form,
$$ \nabla_w U(w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w) \, Q(s, a; w) \, \nabla_w \log \pi(a \mid s; w), $$
in which,
$$ p_{\gamma}(s, a; w) = \sum_{t=1}^{\infty} \gamma^{t-1} p_t(s, a; w). $$
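One common sample-based approximation of this gradient, a REINFORCE-style estimator that is not part of the theorem itself, replaces $Q(s_t, a_t; w)$ by the sampled discounted return from time $t$ and the discounted occupancy by a sampled trajectory. The grad_log_pi(s, a) helper, returning the score vector $\nabla_w \log \pi(a \mid s; w)$, is assumed:

```python
def policy_gradient_estimate(trajectory, grad_log_pi, gamma):
    """Estimate grad_w U(w) from one trajectory of (s, a, r) tuples."""
    rewards = [r for (_, _, r) in trajectory]
    grad = 0.0
    for t, (s, a, _) in enumerate(trajectory):
        # Discounted return from time t: a sample-based stand-in for Q(s_t, a_t; w).
        G_t = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
        grad = grad + gamma ** t * G_t * grad_log_pi(s, a)
    return grad
```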
An On-line Policy Search Method - Version 1
1: Sample initial state : $s_1 \sim D(\cdot)$
2: Sample initial action : $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward : $r_t = R(s_t, a_t)$
5:   Sample next state : $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action : $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Calculate state-action value : $Q(s_t, a_t; w_t)$
8:   Update policy :
     $w_{t+1} = w_t + \alpha_t \, Q(s_t, a_t; w_t) \, \nabla_w \log \pi(a_t \mid s_t; w) \big|_{w = w_t}$
9: end for
Algorithm 2: Pseudocode for an on-line policy search method.
Compatible Function Approximation
Definition (Compatible Function Approximation [1])
Let $f_w : S \times A \to \mathbb{R}$ be a function approximator to $Q_w$, which is
parametrised by $v \in \mathbb{R}^n$.
$f_w$ is said to be compatible with respect to a policy
parametrisation if,
$f_w$ is linear in $v$, i.e. $f_w(s, a; v) = v^{\top} \psi(s, a; w)$,
$\nabla_v f_w(s, a; v) = \nabla_w \log \pi(a \mid s; w)$.
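To make the definition concrete, a compatible critic for the softmax policy (2) can be built by taking $f_w(s, a; v) = v^{\top} \psi(s, a; w)$ with $\psi(s, a; w) = \nabla_w \log \pi(a \mid s; w)$; both conditions then hold by construction. A minimal sketch, reusing the assumed feature map phi and action set from the earlier softmax sketch:

```python
import numpy as np

def score(w, phi, s, a, actions):
    """psi(s, a; w) = grad_w log pi(a|s; w) = phi(a, s) - E_{a'~pi(.|s;w)}[phi(a', s)]."""
    feats = np.array([phi(a_prime, s) for a_prime in actions])
    logits = feats @ w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return phi(a, s) - probs @ feats

def compatible_critic(v, w, phi, s, a, actions):
    """f_w(s, a; v) = v^T psi(s, a; w): linear in v, with grad_v f = grad_w log pi."""
    return v @ score(w, phi, s, a, actions)
```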
Compatible Function Approximation
Theorem (Policy Gradient Theorem with Compatible
Function Approximation [1])
If $f_w$ is a function approximator that is compatible w.r.t. the
given policy parametrisation, and,
$$ v^{*} = \underset{v \in \mathbb{R}^n}{\arg\min}\; T(v; w), \qquad (4) $$
with,
$$ T(v; w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w) \big( Q(s, a; w) - f_w(s, a; v) \big)^2, \qquad (5) $$
then,
$$ \nabla_w U(w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w) \, f_w(s, a; v^{*}) \, \nabla_w \log \pi(a \mid s; w). $$
An On-line Policy Search Method - Version 2
1: Sample initial state : $s_1 \sim D(\cdot)$
2: Sample initial action : $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward : $r_t = R(s_t, a_t)$
5:   Sample next state : $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action : $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Optimise function approximation :
     $v_t = \underset{v \in \mathbb{R}^n}{\arg\min}\; T(v; w_t)$
8:   Update policy :
     $w_{t+1} = w_t + \alpha_t \, f_{w_t}(s_t, a_t; v_t) \, \nabla_w \log \pi(a_t \mid s_t; w) \big|_{w = w_t}$
9: end for
Algorithm 3: Pseudocode for an on-line policy search method.
Actor-Critic Methods
However, performing the optimisation,
$$ v^{*} = \underset{v \in \mathbb{R}^n}{\arg\min}\; T(v; w), $$
at every time-step will generally be prohibitively expensive.
Also, $w_{t+1} \approx w_t$ implies that $v_{t+1} \approx v_t$, which suggests that we
update the function approximation parameters in an incremental
manner.
These observations give rise to actor-critic methods.
Actor-Critic Methods
In these methods we iteratively optimise the policy parameters
and the function approximation parameters at the same time.
For example, at each iteration we could have,
$$ w_{t+1} = w_t + \alpha_t \, f_{w_t}(s_t, a_t; v_t) \, \nabla_w \log \pi(a_t \mid s_t; w) \big|_{w = w_t}, $$
$$ v_{t+1} = v_t + \beta_t \, g(w_t), $$
in which
$g(w_t)$ is a step direction in the function approximation
parameter space (algorithm dependent).
$\{\beta_t\}_{t=1}^{\infty}$ is a step-size sequence for the function approximation
parameters.
Actor-Critic Methods
Different types of critic can be considered, for example a
batch-based solution of the least squares problem (5).
A popular approach in the literature is to use temporal difference
learning [2]. We follow the approach of [2] and consider a linear
compatible critic learnt through TD(0).
In this case the critic update at the $t$-th iteration takes the form,
$$ \delta_t = R(s_t, a_t) + \gamma \, f_{w_t}(s_{t+1}, a_{t+1}; v_t) - f_{w_t}(s_t, a_t; v_t), $$
$$ v_{t+1} = v_t + \beta_t \, \delta_t \, \psi(s_t, a_t; w_t). $$
An On-line Actor-Critic Algorithm
1: Sample initial state : $s_1 \sim D(\cdot)$
2: Sample initial action : $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward : $r_t = R(s_t, a_t)$
5:   Sample next state : $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action : $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Update critic :
     $\delta_t = R(s_t, a_t) + \gamma \, f_{w_t}(s_{t+1}, a_{t+1}; v_t) - f_{w_t}(s_t, a_t; v_t)$
     $v_{t+1} = v_t + \beta_t \, \delta_t \, \psi(s_t, a_t; w_t)$
8:   Update policy :
     $w_{t+1} = w_t + \alpha_t \, f_{w_t}(s_t, a_t; v_t) \, \nabla_w \log \pi(a_t \mid s_t; w) \big|_{w = w_t}$
9: end for
Algorithm 4: Pseudocode for a TD(0) actor-critic algorithm.
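Below is a hedged end-to-end sketch of Algorithm 4 on a toy tabular MDP with the softmax policy of (2) and the compatible critic $f(s, a; v) = v^{\top} \psi(s, a; w)$. All arrays and step-size constants are illustrative choices rather than values from the talk, and the discount $\gamma$ appears in the TD error as above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_feat = 3, 2, 4
D = np.full(n_states, 1.0 / n_states)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transition dynamics
R = rng.standard_normal((n_states, n_actions))                     # reward function
Phi = rng.standard_normal((n_states, n_actions, n_feat))           # policy features phi(a, s)
gamma = 0.95
w = np.zeros(n_feat)                                               # policy (actor) parameters
v = np.zeros(n_feat)                                               # critic parameters

def pi(s, w):
    """Softmax policy pi(. | s; w) as in (2)."""
    logits = Phi[s] @ w
    p = np.exp(logits - logits.max())
    return p / p.sum()

def psi(s, a, w):
    """Score vector psi(s, a; w) = grad_w log pi(a|s; w)."""
    return Phi[s, a] - pi(s, w) @ Phi[s]

s = rng.choice(n_states, p=D)
a = rng.choice(n_actions, p=pi(s, w))
for t in range(1, 20001):
    alpha, beta = 0.05 / t ** 0.8, 0.5 / t ** 0.6                  # policy updated more slowly than critic
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = rng.choice(n_actions, p=pi(s_next, w))
    score = psi(s, a, w)
    f_sa = v @ score                                               # compatible critic f_{w_t}(s_t, a_t; v_t)
    f_next = v @ psi(s_next, a_next, w)
    delta = r + gamma * f_next - f_sa                              # TD(0) error
    w = w + alpha * f_sa * score                                   # actor update
    v = v + beta * delta * score                                   # critic update
    s, a = s_next, a_next
```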
Actor-Critic Methods
To prove convergence we need the two step-size sequences to
satisfy the following criteria,
Robbins-Monro conditions,
$$ \alpha_t > 0 \;\; \forall t, \qquad \sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty, $$
$$ \beta_t > 0 \;\; \forall t, \qquad \sum_{t=1}^{\infty} \beta_t = \infty, \qquad \sum_{t=1}^{\infty} \beta_t^2 < \infty. $$
Policy parameters updated at a slower rate than function
approximation parameters,
$$ \lim_{t \to \infty} \frac{\alpha_t}{\beta_t} = 0. $$
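As an illustrative choice (not from the slides), sequences of the form $c / t^p$ with $1/2 < p \le 1$ satisfy the Robbins-Monro conditions, and giving the policy a larger exponent than the critic makes $\alpha_t / \beta_t \to 0$:

```python
def alpha(t):
    """Policy (actor) step size: sum alpha_t = inf, sum alpha_t^2 < inf."""
    return 0.05 / t ** 0.8

def beta(t):
    """Critic step size: also Robbins-Monro, but decaying more slowly, so alpha_t / beta_t -> 0."""
    return 0.5 / t ** 0.6
```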
Natural Gradient Ascent
Steepest gradient ascent often gives poor results in practice,
e.g., due to poor scaling of the objective function.
As a result, alternative optimisation techniques are often
considered.
A popular alternative is natural gradient ascent, which was
introduced into the policy search literature in the work of [2].
Natural Policy Gradients
In natural gradient ascent the parameter update takes the form,
$$ w^{\text{new}} = w + G^{-1}(w) \, \nabla_w U(w), $$
in which $G(w)$ is the Fisher information matrix of the policy
distribution, averaged over the state distribution, i.e.,
$$ G(w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w) \, \nabla_w \log \pi(a \mid s; w) \, \nabla_w \log \pi(a \mid s; w)^{\top}. $$
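A hedged sketch of a sample-based natural gradient step: $G(w)$ is estimated as the average outer product of score vectors over sampled state-action pairs and then used to precondition an ordinary gradient estimate. The helpers grad_log_pi and grad_U are assumed (for instance, the earlier score and policy-gradient sketches), and the small ridge term and step size are numerical conveniences added here.

```python
import numpy as np

def natural_gradient_step(w, samples, grad_log_pi, grad_U, step_size=0.1, ridge=1e-6):
    """samples: iterable of (s, a) pairs drawn (approximately) from the discounted occupancy."""
    scores = np.array([grad_log_pi(s, a) for (s, a) in samples])
    G = scores.T @ scores / len(scores)            # estimate of the Fisher information matrix G(w)
    G = G + ridge * np.eye(G.shape[0])             # keep the estimate invertible
    return w + step_size * np.linalg.solve(G, grad_U(w))
```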
Natural Actor-Critic
Theorem (Natural Policy Gradients with Compatible
Function Approximation [2])
Suppose that $f_w$ is a linear function approximator that is
compatible w.r.t. the given policy parametrisation.
If $v^{*} \in \mathbb{R}^n$ are the optimal critic parameters, i.e., $v^{*}$ minimises (5),
then,
$$ v^{*} = G^{-1}(w) \, \nabla_w U(w). $$
In other words, the natural gradient is given by the optimal critic
parameters.
An On-line Natural Actor-Critic Algorithm
1: Sample initial state : $s_1 \sim D(\cdot)$
2: Sample initial action : $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward : $r_t = R(s_t, a_t)$
5:   Sample next state : $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action : $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Update critic :
     $\delta_t = R(s_t, a_t) + \gamma \, f_{w_t}(s_{t+1}, a_{t+1}; v_t) - f_{w_t}(s_t, a_t; v_t)$
     $v_{t+1} = v_t + \beta_t \, \delta_t \, \psi(s_t, a_t; w_t)$
8:   Update policy :
     $w_{t+1} = w_t + \alpha_t \, v_t$
9: end for
Algorithm 5: Pseudocode for a TD(0) natural actor-critic algorithm.
Tetris
As an example of policy gradients in action we consider the
Tetris domain.
We consider the parametrisation in (2).
For a given state-action pair we consider the following features,
each evaluated on the board that results from taking the given
action in the given state,
number of holes in the board,
column heights in the board,
difference in column heights,
maximum column height.
Total of 21 features.
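A rough sketch of these board features, computed from a binary board array (nonzero = occupied). The exact definitions used in the experiments may differ, but for a standard 10-column board this version yields 1 + 10 + 9 + 1 = 21 features, matching the count above.

```python
import numpy as np

def tetris_features(board):
    """board: array of shape (rows, cols), row 0 at the top, nonzero = occupied cell."""
    rows, _ = board.shape
    filled = board > 0
    # Column height: distance from the highest occupied cell to the floor (0 if column is empty).
    heights = np.where(filled.any(axis=0), rows - filled.argmax(axis=0), 0)
    # Hole: an empty cell with at least one occupied cell above it in the same column.
    holes = ((filled.cumsum(axis=0) > 0) & ~filled).sum()
    return np.concatenate((
        [holes],                   # number of holes in the board
        heights,                   # column heights
        np.abs(np.diff(heights)),  # differences between adjacent column heights
        [heights.max()],           # maximum column height
    ))
```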
Tetris
Bibliography - Policy Search Methods I
R. Sutton, D. McAllester, S. Singh, and Y. Mansour.
Policy gradient methods for reinforcement learning with
function approximation.
NIPS, 13, 2000.
S. Kakade.
A natural policy gradient.
NIPS, 14, 2002.
T. Furmston, G. Lever, and D. Barber.
Approximate Newton methods for policy search in Markov
decision processes.
Journal of Machine Learning Research, 17:1-51, 2016.
Bibliography - Actor-Critic Methods I
V. Konda and J. Tsitsiklis.
Actor-critic algorithms.
NIPS, 11:1008-1014, 1999.
S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee.
Natural actor-critic algorithms.
Automatica, 45:2471-2482, 2009.
Bibliography - Policy Search Methods & Neural
Networks I
N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and
Y. Tassa.
Learning continuous control policies by stochastic value
gradients.
NIPS, 27:2926-2934, 2015.
T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra.
Continuous control with deep reinforcement learning.
ICLR, 4, 2016.
J. Schulman, S. Levine, P. Abbeel, M. Jordan, and
P. Moritz.
Trust region policy optimization.
ICML, 32:1889-1897, 2015.
Bibliography - Misc I
D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and
M. Riedmiller.
Deterministic policy gradient algorithms.
ICML, 31:387-395, 2014.
R. Sutton.
Learning to predict by the method of temporal differences.
Machine Learning, 3:9-44, 1988.