An Introduction to Policy Search Methods
Thomas Furmston
January 23, 2017
Markov Decision Processes
Markov decision processes (MDPs) are the standard model
for optimal control in a fully observable environment.
Successful applications include,
robotics,
board games, such as chess, backgammon & Go,
computer games, such as Tetris & Atari 2600 video games,
traffic management, elevator scheduling & helicopter flight control.
Markov Decision Processes - Notation
A Markov decision process is described by the tuple
$(S, A, D, P, R)$, in which,
$S$ - state space (finite set)
$A$ - action space (finite set)
$D$ - initial state distribution
$P$ - transition dynamics, which is a set of conditional
distributions over the state space, $\{P(\cdot \mid s, a)\}_{(s, a) \in S \times A}$
$R$ - reward function, which is a function $R(\cdot, \cdot) : S \times A \to \mathbb{R}$
Markov Decision Processes - Notation
Given an MDP we then have a policy, $\pi$.
This is a set of conditional distributions over the action space,
$\{\pi(\cdot \mid s)\}_{s \in S}$, which is used to determine which action to take
given the current state of the environment.
The policy can be optimised in order to maximise an objective.
Markov Decision Processes - Sampling
1: Sample initial state : $s_1 \sim D(\cdot)$
2: Sample initial action : $a_1 \sim \pi(\cdot \mid s = s_1)$
3: for $t = 1$ to $H$ do
4:   Obtain reward : $r_t = R(s_t, a_t)$
5:   Sample next state : $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action : $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1})$
7: end for
Algorithm 1: Pseudocode for sampling a trajectory from a
Markov decision process.
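As a rough illustration of Algorithm 1, the Python sketch below samples a trajectory from a small toy tabular MDP. The arrays D, P, R and the uniform policy pi are illustrative stand-ins chosen here, not objects defined in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, H = 3, 2, 10

D = np.array([0.5, 0.3, 0.2])                                      # initial state distribution D
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a] = P(. | s, a)
R = rng.standard_normal((n_states, n_actions))                     # reward function R(s, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)               # a uniform policy pi(a | s)

s = rng.choice(n_states, p=D)                                      # sample initial state
a = rng.choice(n_actions, p=pi[s])                                 # sample initial action
trajectory = []
for t in range(H):
    r = R[s, a]                                                    # obtain reward
    trajectory.append((s, a, r))
    s = rng.choice(n_states, p=P[s, a])                            # sample next state
    a = rng.choice(n_actions, p=pi[s])                             # sample next action
```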
Tetris - An Example
Total Expected Reward
We shall consider the total expected reward over an infinite
horizon with discounted rewards.
Discount factor - $\gamma \in [0, 1)$.
The objective function takes the form,
$$ U(\pi) := \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t}\big[ \gamma^{t-1} R(s_t, a_t); \pi \big], \qquad (1) $$
in which $p_t$ is the occupancy marginal at time $t$ given $\pi$.
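A simple way to read (1) is as a Monte Carlo average: sample trajectories, compute their discounted returns, and average. A minimal sketch, assuming a sample_trajectory helper that returns a list of (state, action, reward) tuples (e.g. the sampling loop above):

```python
def discounted_return(trajectory, gamma):
    """Sum of gamma^(t-1) * R(s_t, a_t) along one trajectory of (s, a, r) tuples."""
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))

def estimate_objective(sample_trajectory, gamma=0.95, n_samples=1000):
    """Monte Carlo estimate of U(pi) by averaging discounted returns."""
    returns = [discounted_return(sample_trajectory(), gamma) for _ in range(n_samples)]
    return sum(returns) / n_samples
```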
Value Functions & Dynamic Programming
Value functions are a core concept in Markov decision processes.
The state value function is given by,
$$ V^{\pi}(s) := \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t}\big[ \gamma^{t-1} R(s_t, a_t) \mid s_1 = s; \pi \big], $$
which satisfies the fixed point equation, known as the Bellman
equation,
$$ V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[ R(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[ V^{\pi}(s') \big] \Big]. $$
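Since, for $\gamma \in [0, 1)$, $V^{\pi}$ is the unique fixed point of the Bellman equation, it can be computed on a small tabular MDP by iterating that equation. A minimal sketch, assuming the array layout of the earlier toy MDP (P of shape (S, A, S), R and pi of shape (S, A)):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterate V(s) <- E_{a~pi(.|s)}[R(s, a) + gamma E_{s'~P(.|s,a)}[V(s')]] to convergence."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V           # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        V_new = (pi * Q).sum(axis=1)    # average over actions under pi
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```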
Value Functions & Dynamic Programming
The state-action value function is given by,
$$ Q^{\pi}(s, a) := \sum_{t=1}^{\infty} \mathbb{E}_{s_t, a_t \sim p_t}\big[ \gamma^{t-1} R(s_t, a_t) \mid s_1 = s, a_1 = a; \pi \big], $$
which can also be written in terms of the state value function,
$$ Q^{\pi}(s, a) = R(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[ V^{\pi}(s') \big]. $$
The global optimum of (1) can be found through dynamic programming.
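For concreteness, one classical dynamic programming scheme is value iteration; the sketch below (an illustration on the same toy tabular arrays, not something from the slides) returns the optimal value function together with a greedy optimal policy.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V(s) <- max_a [R(s, a) + gamma E_{s'~P(.|s,a)}[V(s')]]; return V* and a greedy policy."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # greedy deterministic policy
        V = V_new
```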
Dynamic Programming
Dynamic programming is infeasible for many real-world problems
of interest.
As a result, most research has focused on obtaining approximate
or locally optimal solutions, including,
approximate dynamic programming methods,
tree search methods,
local trajectory-optimization techniques, e.g., differential dynamic programming,
policy search methods.
Policy Search Methods
Policy search methods are typically specialized applications of
techniques from numerical optimization.
As such, the policy is given some differentiable parametric form,
denoted $\pi(a \mid s; w)$, or $\pi_w$, with policy parameters $w \in W \subseteq \mathbb{R}^n$,
$n \in \mathbb{N}$.
For example,
$$ \pi(a \mid s; w) = \frac{\exp\big( w^{\top} \phi(a, s) \big)}{\sum_{a' \in A} \exp\big( w^{\top} \phi(a', s) \big)}, \qquad (2) $$
in which $\phi : A \times S \to \mathbb{R}^n$ is a feature mapping.
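A small sketch of the parametrisation in (2): the policy is a softmax over the linear scores $w^{\top} \phi(a, s)$. The feature map phi is assumed to be any callable returning a length-n vector; it is a placeholder for problem-specific features.

```python
import numpy as np

def softmax_policy(w, phi, s, actions):
    """Return the probability vector pi(. | s; w) over the given actions."""
    logits = np.array([w @ phi(a, s) for a in actions])
    logits -= logits.max()                 # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```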
Policy Search Methods
We overload notation and write the objective function directly in
terms of the parameter vector, i.e.,
$$ U(w) := U(\pi_w), \quad \forall\, w \in W. \qquad (3) $$
Similarly, for all $w \in W$, we have,
$$ V(s; w) := V^{\pi_w}(s), \quad \forall\, s \in S, $$
$$ Q(s, a; w) := Q^{\pi_w}(s, a), \quad \forall\, (s, a) \in S \times A, $$
$$ p_t(s, a; w) := p_t^{\pi_w}(s, a), \quad \forall\, (s, a) \in S \times A. $$
Policy Search Methods
Local information, such as the gradient of the objective function,
is used to update the policy in an incremental manner until
convergence to a local optimum.
Benefits include,
General convergence guarantees.
Good anytime performance.
Only necessary to approximate a low-dimensional projection of
the value function.
Easily extendible to models for partially observable
environments, such as finite state controllers.
Policy Gradient Theorem
Theorem (Policy Gradient Theorem [1])
Given a Markov decision process with objective (1), then for any
$w \in W$, the gradient of (3) takes the form,
$$ \nabla_w U(w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w) \, Q(s, a; w) \, \nabla_w \log \pi(a \mid s; w), $$
in which,
$$ p_{\gamma}(s, a; w) = \sum_{t=1}^{\infty} \gamma^{t-1} p_t(s, a; w). $$
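One common sample-based approximation of this gradient, a REINFORCE-style estimator that is not part of the theorem itself, replaces $Q(s_t, a_t; w)$ by the sampled discounted return from time $t$ and the discounted occupancy by a sampled trajectory. The grad_log_pi(s, a) helper, returning the score vector $\nabla_w \log \pi(a \mid s; w)$, is assumed:

```python
def policy_gradient_estimate(trajectory, grad_log_pi, gamma):
    """Estimate grad_w U(w) from one trajectory of (s, a, r) tuples."""
    rewards = [r for (_, _, r) in trajectory]
    grad = 0.0
    for t, (s, a, _) in enumerate(trajectory):
        # Discounted return from time t: a sample-based stand-in for Q(s_t, a_t; w).
        G_t = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
        grad = grad + gamma ** t * G_t * grad_log_pi(s, a)
    return grad
```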
An On-line Policy Search Method - Version 1
1: Sample initial state : $s_1 \sim D(\cdot)$
2: Sample initial action : $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward : $r_t = R(s_t, a_t)$
5:   Sample next state : $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action : $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Calculate state-action value : $Q(s_t, a_t; w_t)$
8:   Update policy :
     $w_{t+1} = w_t + \alpha_t \, Q(s_t, a_t; w_t) \, \nabla_w \log \pi(a_t \mid s_t; w) \big|_{w = w_t}$
9: end for
Algorithm 2: Pseudocode for an on-line policy search method.
Compatible Function Approximation
Definition (Compatible Function Approximation [1])
Let $f_w : S \times A \to \mathbb{R}$ be a function approximator to $Q_w$, which is
parametrised by $v \in \mathbb{R}^n$.
$f_w$ is said to be compatible with respect to a policy
parametrisation if,
$f_w$ is linear in $v$, i.e. $f_w(s, a; v) = v^{\top} \psi(s, a; w)$,
$\nabla_v f_w(s, a; v) = \nabla_w \log \pi(a \mid s; w)$.
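To make the definition concrete, a compatible critic for the softmax policy (2) can be built by taking $f_w(s, a; v) = v^{\top} \psi(s, a; w)$ with $\psi(s, a; w) = \nabla_w \log \pi(a \mid s; w)$; both conditions then hold by construction. A minimal sketch, reusing the assumed feature map phi and action set from the earlier softmax sketch:

```python
import numpy as np

def score(w, phi, s, a, actions):
    """psi(s, a; w) = grad_w log pi(a|s; w) = phi(a, s) - E_{a'~pi(.|s;w)}[phi(a', s)]."""
    feats = np.array([phi(a_prime, s) for a_prime in actions])
    logits = feats @ w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return phi(a, s) - probs @ feats

def compatible_critic(v, w, phi, s, a, actions):
    """f_w(s, a; v) = v^T psi(s, a; w): linear in v, with grad_v f = grad_w log pi."""
    return v @ score(w, phi, s, a, actions)
```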
Compatible Function Approximation
Theorem (Policy Gradient Theorem with Compatible
Function Approximation [1])
If $f_w$ is a function approximator that is compatible w.r.t. the
given policy parametrisation, and,
$$ v^{*} = \underset{v \in \mathbb{R}^n}{\arg\min}\; T(v; w), \qquad (4) $$
with,
$$ T(v; w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w) \big( Q(s, a; w) - f_w(s, a; v) \big)^2, \qquad (5) $$
then,
$$ \nabla_w U(w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w) \, f_w(s, a; v^{*}) \, \nabla_w \log \pi(a \mid s; w). $$
An On-line Policy Search Method - Version 2
1: Sample initial state : $s_1 \sim D(\cdot)$
2: Sample initial action : $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward : $r_t = R(s_t, a_t)$
5:   Sample next state : $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action : $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Optimise function approximation :
     $v_t = \underset{v \in \mathbb{R}^n}{\arg\min}\; T(v; w_t)$
8:   Update policy :
     $w_{t+1} = w_t + \alpha_t \, f_{w_t}(s_t, a_t; v_t) \, \nabla_w \log \pi(a_t \mid s_t; w) \big|_{w = w_t}$
9: end for
Algorithm 3: Pseudocode for an on-line policy search method.
Actor-Critic Methods
However, performing the optimisation,
$$ v^{*} = \underset{v \in \mathbb{R}^n}{\arg\min}\; T(v; w), $$
at every time-step will generally be prohibitively expensive.
Also, $w_{t+1} \approx w_t$ implies that $v_{t+1} \approx v_t$, which suggests that we
update the function approximation parameters in an incremental
manner.
These observations give rise to actor-critic methods.
Actor-Critic Methods
In these methods we iteratively optimise the policy parameters
and the function approximation parameters at the same time.
For example, at each iteration we could have,
$$ w_{t+1} = w_t + \alpha_t \, f_{w_t}(s_t, a_t; v_t) \, \nabla_w \log \pi(a_t \mid s_t; w) \big|_{w = w_t}, $$
$$ v_{t+1} = v_t + \beta_t \, g(w_t), $$
in which
$g(w_t)$ is a step direction in the function approximation
parameter space (algorithm dependent).
$\{\beta_t\}_{t=1}^{\infty}$ is a step-size sequence for the function approximation
parameters.
Actor-Critic Methods
Different types of critic can be considered, for example a
batch-based solution of the least squares problem (5).
A popular approach in the literature is to use temporal difference
learning [2]. We follow the approach of [2] and consider a linear
compatible critic learnt through TD(0).
In this case the critic update at the $t$-th iteration takes the form,
$$ \delta_t = R(s_t, a_t) + \gamma \, f_{w_t}(s_{t+1}, a_{t+1}; v_t) - f_{w_t}(s_t, a_t; v_t), $$
$$ v_{t+1} = v_t + \beta_t \, \delta_t \, \psi(s_t, a_t; w_t). $$
An On-line Actor-Critic Algorithm
1: Sample initial state : $s_1 \sim D(\cdot)$
2: Sample initial action : $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward : $r_t = R(s_t, a_t)$
5:   Sample next state : $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action : $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Update critic :
     $\delta_t = R(s_t, a_t) + \gamma \, f_{w_t}(s_{t+1}, a_{t+1}; v_t) - f_{w_t}(s_t, a_t; v_t)$
     $v_{t+1} = v_t + \beta_t \, \delta_t \, \psi(s_t, a_t; w_t)$
8:   Update policy :
     $w_{t+1} = w_t + \alpha_t \, f_{w_t}(s_t, a_t; v_t) \, \nabla_w \log \pi(a_t \mid s_t; w) \big|_{w = w_t}$
9: end for
Algorithm 4: Pseudocode for a TD(0) actor-critic algorithm.
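Below is a hedged end-to-end sketch of Algorithm 4 on a toy tabular MDP with the softmax policy of (2) and the compatible critic $f(s, a; v) = v^{\top} \psi(s, a; w)$. All arrays and step-size constants are illustrative choices rather than values from the talk, and the discount $\gamma$ appears in the TD error as above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_feat = 3, 2, 4
D = np.full(n_states, 1.0 / n_states)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transition dynamics
R = rng.standard_normal((n_states, n_actions))                     # reward function
Phi = rng.standard_normal((n_states, n_actions, n_feat))           # policy features phi(a, s)
gamma = 0.95
w = np.zeros(n_feat)                                               # policy (actor) parameters
v = np.zeros(n_feat)                                               # critic parameters

def pi(s, w):
    """Softmax policy pi(. | s; w) as in (2)."""
    logits = Phi[s] @ w
    p = np.exp(logits - logits.max())
    return p / p.sum()

def psi(s, a, w):
    """Score vector psi(s, a; w) = grad_w log pi(a|s; w)."""
    return Phi[s, a] - pi(s, w) @ Phi[s]

s = rng.choice(n_states, p=D)
a = rng.choice(n_actions, p=pi(s, w))
for t in range(1, 20001):
    alpha, beta = 0.05 / t ** 0.8, 0.5 / t ** 0.6                  # policy updated more slowly than critic
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = rng.choice(n_actions, p=pi(s_next, w))
    score = psi(s, a, w)
    f_sa = v @ score                                               # compatible critic f_{w_t}(s_t, a_t; v_t)
    f_next = v @ psi(s_next, a_next, w)
    delta = r + gamma * f_next - f_sa                              # TD(0) error
    w = w + alpha * f_sa * score                                   # actor update
    v = v + beta * delta * score                                   # critic update
    s, a = s_next, a_next
```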
Actor-Critic Methods
To prove convergence we need the two step-size sequences to
satisfy the following criteria,
Robbins-Monro conditions,
$$ \alpha_t > 0 \;\; \forall t, \qquad \sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty, $$
$$ \beta_t > 0 \;\; \forall t, \qquad \sum_{t=1}^{\infty} \beta_t = \infty, \qquad \sum_{t=1}^{\infty} \beta_t^2 < \infty. $$
Policy parameters updated at a slower rate than function
approximation parameters,
$$ \lim_{t \to \infty} \frac{\alpha_t}{\beta_t} = 0. $$
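As an illustrative choice (not from the slides), sequences of the form $c / t^p$ with $1/2 < p \le 1$ satisfy the Robbins-Monro conditions, and giving the policy a larger exponent than the critic makes $\alpha_t / \beta_t \to 0$:

```python
def alpha(t):
    """Policy (actor) step size: sum alpha_t = inf, sum alpha_t^2 < inf."""
    return 0.05 / t ** 0.8

def beta(t):
    """Critic step size: also Robbins-Monro, but decaying more slowly, so alpha_t / beta_t -> 0."""
    return 0.5 / t ** 0.6
```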
Natural Gradient Ascent
Steepest gradient ascent often gives poor results in practice,
e.g., due to poor scaling of the objective function.
As a result, alternative optimisation techniques are often
considered.
A popular alternative is natural gradient ascent, which was
introduced into the policy search literature in the work of [2].
Natural Policy Gradients
In natural gradient ascent the parameter update takes the form,
$$ w^{\text{new}} = w + G^{-1}(w) \, \nabla_w U(w), $$
in which $G(w)$ is the Fisher information matrix of the policy
distribution, averaged over the state distribution, i.e.,
$$ G(w) = \sum_{s \in S} \sum_{a \in A} p_{\gamma}(s, a; w) \, \nabla_w \log \pi(a \mid s; w) \, \nabla_w \log \pi(a \mid s; w)^{\top}. $$
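A hedged sketch of a sample-based natural gradient step: $G(w)$ is estimated as the average outer product of score vectors over sampled state-action pairs and then used to precondition an ordinary gradient estimate. The helpers grad_log_pi and grad_U are assumed (for instance, the earlier score and policy-gradient sketches), and the small ridge term and step size are numerical conveniences added here.

```python
import numpy as np

def natural_gradient_step(w, samples, grad_log_pi, grad_U, step_size=0.1, ridge=1e-6):
    """samples: iterable of (s, a) pairs drawn (approximately) from the discounted occupancy."""
    scores = np.array([grad_log_pi(s, a) for (s, a) in samples])
    G = scores.T @ scores / len(scores)            # estimate of the Fisher information matrix G(w)
    G = G + ridge * np.eye(G.shape[0])             # keep the estimate invertible
    return w + step_size * np.linalg.solve(G, grad_U(w))
```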
Natural Actor-Critic
Theorem (Natural Policy Gradients with Compatible
Function Approximation [2])
Suppose that $f_w$ is a linear function approximator that is
compatible w.r.t. the given policy parametrisation.
If $v^{*} \in \mathbb{R}^n$ are the optimal critic parameters, i.e., $v^{*}$ minimises (5),
then,
$$ v^{*} = G^{-1}(w) \, \nabla_w U(w). $$
In other words, the natural gradient is given by the optimal critic
parameters.
An On-line Natural Actor-Critic Algorithm
1: Sample initial state : $s_1 \sim D(\cdot)$
2: Sample initial action : $a_1 \sim \pi(\cdot \mid s = s_1; w_1)$
3: for $t = 1$ to $\infty$ do
4:   Obtain reward : $r_t = R(s_t, a_t)$
5:   Sample next state : $s_{t+1} \sim P(\cdot \mid s = s_t, a = a_t)$
6:   Sample next action : $a_{t+1} \sim \pi(\cdot \mid s = s_{t+1}; w_t)$
7:   Update critic :
     $\delta_t = R(s_t, a_t) + \gamma \, f_{w_t}(s_{t+1}, a_{t+1}; v_t) - f_{w_t}(s_t, a_t; v_t)$
     $v_{t+1} = v_t + \beta_t \, \delta_t \, \psi(s_t, a_t; w_t)$
8:   Update policy :
     $w_{t+1} = w_t + \alpha_t \, v_t$
9: end for
Algorithm 5: Pseudocode for a TD(0) natural actor-critic algorithm.
Tetris
As an example of policy gradients in action we consider the
Tetris domain.
We consider the parametrisation in (2).
For a given state-action pair we consider the following features,
each evaluated on the board that results from taking the given
action in the given state,
number of holes in the board,
column heights in the board,
difference in column heights,
maximum column height.
Total of 21 features.
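A rough sketch of these board features, computed from a binary board array (nonzero = occupied). The exact definitions used in the experiments may differ, but for a standard 10-column board this version yields 1 + 10 + 9 + 1 = 21 features, matching the count above.

```python
import numpy as np

def tetris_features(board):
    """board: array of shape (rows, cols), row 0 at the top, nonzero = occupied cell."""
    rows, _ = board.shape
    filled = board > 0
    # Column height: distance from the highest occupied cell to the floor (0 if column is empty).
    heights = np.where(filled.any(axis=0), rows - filled.argmax(axis=0), 0)
    # Hole: an empty cell with at least one occupied cell above it in the same column.
    holes = ((filled.cumsum(axis=0) > 0) & ~filled).sum()
    return np.concatenate((
        [holes],                   # number of holes in the board
        heights,                   # column heights
        np.abs(np.diff(heights)),  # differences between adjacent column heights
        [heights.max()],           # maximum column height
    ))
```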
Tetris
Bibliography - Policy Search Methods I
R. Sutton, D. McAllester, S. Singh, and Y. Mansour.
Policy gradient methods for reinforcement learning with
function approximation.
NIPS, 13, 2000.
S. Kakade.
A natural policy gradient.
NIPS, 14, 2002.
T. Furmston, G. Lever, and D. Barber.
Approximate Newton methods for policy search in Markov
decision processes.
Journal of Machine Learning Research, 17:1-51, 2016.
Bibliography - Actor-Critic Methods I
V. Konda and J. Tsitsiklis.
Actor-critic algorithms.
NIPS, 11:1008-1014, 1999.
S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee.
Natural actor-critic algorithms.
Automatica, 45:2471-2482, 2009.
Bibliography - Policy Search Methods & Neural
Networks I
N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and
Y. Tassa.
Learning continuous control policies by stochastic value
gradients.
NIPS, 27:2926-2934, 2015.
T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra.
Continuous control with deep reinforcement learning.
ICLR, 4, 2016.
J. Schulman, S. Levine, P. Abbeel, M. Jordan, and
P. Moritz.
Trust region policy optimization.
ICML, 32:1889-1897, 2015.
Bibliography - Misc I
D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and
M. Riedmiller.
Deterministic policy gradient algorithms.
ICML, 31:387-395, 2014.
R. Sutton.
Learning to predict by the method of temporal differences.
Machine Learning, 3:9-44, 1988.