Supplementary Material to ICML04 Submission

Apprenticeship Learning via Inverse Reinforcement Learning
Abstract

In this document, we give more elaborate proofs for the theorems in the submitted paper. We also have a section on a different interpretation of the projection algorithm.

A. Proofs of Theoretical Results

Due to space constraints, in (Abbeel & Ng, 2004) we gave a full proof only for the case of µ̂_E ∈ M. Here we give proofs for the more general case, i.e., we do not necessarily have µ̂_E ∈ M. First, however, we give a more extensive proof of Lemma 3 in (Abbeel & Ng, 2004), which was proved very densely there.

A.1. Extended Proof of Lemma 3

Figure 1 may be helpful for conveying geometric intuition.

[Figure 1. Progress in one iteration step.]

Proof of Lemma 3

Proof. For simplicity of notation, we let the origin of our coordinate system coincide with µ̄^(i). Then

\[
\frac{(\tilde{\mu}^{(i+1)} - \hat{\mu}_E) \cdot (\tilde{\mu}^{(i+1)} - \hat{\mu}_E)}{\hat{\mu}_E \cdot \hat{\mu}_E} \tag{1}
\]
\[
= \frac{\mu^{(i+1)} \cdot \mu^{(i+1)} - \frac{(\mu^{(i+1)} \cdot \hat{\mu}_E)^2}{\hat{\mu}_E \cdot \hat{\mu}_E}}{\mu^{(i+1)} \cdot \mu^{(i+1)}} \tag{2}
\]
\[
\le \frac{\mu^{(i+1)} \cdot \mu^{(i+1)} - 2\hat{\mu}_E \cdot \mu^{(i+1)} + \hat{\mu}_E \cdot \hat{\mu}_E}{\mu^{(i+1)} \cdot \mu^{(i+1)}} \tag{3}
\]
\[
= \frac{(\mu^{(i+1)} - \hat{\mu}_E) \cdot (\mu^{(i+1)} - \hat{\mu}_E)}{(\mu^{(i+1)} - \hat{\mu}_E) \cdot (\mu^{(i+1)} - \hat{\mu}_E) + \hat{\mu}_E \cdot \hat{\mu}_E + 2(\mu^{(i+1)} - \hat{\mu}_E) \cdot \hat{\mu}_E} \tag{4}
\]
\[
\le \frac{(\mu^{(i+1)} - \hat{\mu}_E) \cdot (\mu^{(i+1)} - \hat{\mu}_E)}{(\mu^{(i+1)} - \hat{\mu}_E) \cdot (\mu^{(i+1)} - \hat{\mu}_E) + \hat{\mu}_E \cdot \hat{\mu}_E} \tag{5}
\]
\[
\le \frac{k^2/(1-\gamma)^2}{k^2/(1-\gamma)^2 + \hat{\mu}_E \cdot \hat{\mu}_E} , \tag{6}
\]

where we used in order:

1. The definition of \(\tilde{\mu}^{(i+1)} = \frac{\mu^{(i+1)} \cdot \hat{\mu}_E}{\mu^{(i+1)} \cdot \mu^{(i+1)}}\, \mu^{(i+1)}\), which gives for the numerator
\[
(\tilde{\mu}^{(i+1)} - \hat{\mu}_E) \cdot (\tilde{\mu}^{(i+1)} - \hat{\mu}_E)
= \Big(\tfrac{\mu^{(i+1)} \cdot \hat{\mu}_E}{\mu^{(i+1)} \cdot \mu^{(i+1)}}\, \mu^{(i+1)} - \hat{\mu}_E\Big) \cdot \Big(\tfrac{\mu^{(i+1)} \cdot \hat{\mu}_E}{\mu^{(i+1)} \cdot \mu^{(i+1)}}\, \mu^{(i+1)} - \hat{\mu}_E\Big)
= \tfrac{(\mu^{(i+1)} \cdot \hat{\mu}_E)^2}{(\mu^{(i+1)} \cdot \mu^{(i+1)})^2}\, \mu^{(i+1)} \cdot \mu^{(i+1)} - 2\, \tfrac{\mu^{(i+1)} \cdot \hat{\mu}_E}{\mu^{(i+1)} \cdot \mu^{(i+1)}}\, \mu^{(i+1)} \cdot \hat{\mu}_E + \hat{\mu}_E \cdot \hat{\mu}_E
= -\tfrac{(\mu^{(i+1)} \cdot \hat{\mu}_E)^2}{\mu^{(i+1)} \cdot \mu^{(i+1)}} + \hat{\mu}_E \cdot \hat{\mu}_E .
\]
Using this expression for the numerator, and multiplying numerator and denominator by \(\frac{\mu^{(i+1)} \cdot \mu^{(i+1)}}{\hat{\mu}_E \cdot \hat{\mu}_E}\), gives Equation (2).

2. The following inequalities are easily seen to be true:
\[
(\mu^{(i+1)} \cdot \hat{\mu}_E - \hat{\mu}_E \cdot \hat{\mu}_E)^2 \ge 0
\]
\[
(\mu^{(i+1)} \cdot \hat{\mu}_E)^2 - 2(\mu^{(i+1)} \cdot \hat{\mu}_E)(\hat{\mu}_E \cdot \hat{\mu}_E) + (\hat{\mu}_E \cdot \hat{\mu}_E)^2 \ge 0
\]
\[
-(\mu^{(i+1)} \cdot \hat{\mu}_E)^2 \le -2(\mu^{(i+1)} \cdot \hat{\mu}_E)(\hat{\mu}_E \cdot \hat{\mu}_E) + (\hat{\mu}_E \cdot \hat{\mu}_E)^2
\]
\[
-\frac{(\mu^{(i+1)} \cdot \hat{\mu}_E)^2}{\hat{\mu}_E \cdot \hat{\mu}_E} \le -2(\mu^{(i+1)} \cdot \hat{\mu}_E) + \hat{\mu}_E \cdot \hat{\mu}_E .
\]
We used the last of these inequalities.

3. For the numerator, we just used (µ^(i+1) − µ̂_E)·(µ^(i+1) − µ̂_E) = µ^(i+1)·µ^(i+1) − 2µ̂_E·µ^(i+1) + µ̂_E·µ̂_E. We rewrote the denominator as follows: µ^(i+1)·µ^(i+1) = (µ^(i+1) − µ̂_E + µ̂_E)·(µ^(i+1) − µ̂_E + µ̂_E) = (µ^(i+1) − µ̂_E)·(µ^(i+1) − µ̂_E) + (µ̂_E·µ̂_E) + 2(µ^(i+1) − µ̂_E)·µ̂_E.

4. Since π^(i+1) = arg max_π µ̂_E·µ(π) (recall the origin is at µ̄^(i) for notational convenience), we have µ̂_E·µ^(i+1) = µ̂_E·µ(π^(i+1)) ≥ µ̂_E·µ̂_E, which implies 2(µ^(i+1) − µ̂_E)·µ̂_E ≥ 0, which in turn implies the inequality in Equation (5).

5. All points considered lie in M = [0, 1/(1−γ)]^k, so their norms are bounded by k/(1−γ).

This proves the one-step improvement equation of the Lemma.

It remains to prove that µ̃^(i+1) = λµ^(i+1) + (1−λ)µ̄^(i) for some λ ∈ [0, 1]. It is easily seen from the definition of µ̃^(i+1) that for λ = (µ̂_E·µ^(i+1))/(µ^(i+1)·µ^(i+1)) we have µ̃^(i+1) = λµ^(i+1) + (1−λ)µ̄^(i). (Recall we still have µ̄^(i) as our origin to simplify notation.) Since π^(i+1) = arg max_π µ̂_E·µ(π), we have µ̂_E·µ^(i+1) = µ̂_E·µ(π^(i+1)) ≥ µ̂_E·µ̂_E ≥ 0, which implies λ ≥ 0. Now to prove λ ≤ 1, we start with the Cauchy-Schwarz inequality

\[
(\mu^{(i+1)} \cdot \hat{\mu}_E)^2 \le (\mu^{(i+1)} \cdot \mu^{(i+1)})(\hat{\mu}_E \cdot \hat{\mu}_E) , \tag{7}
\]

which combined with µ̂_E·µ̂_E ≤ µ^(i+1)·µ̂_E gives

\[
(\mu^{(i+1)} \cdot \hat{\mu}_E)^2 \le (\mu^{(i+1)} \cdot \mu^{(i+1)})(\hat{\mu}_E \cdot \mu^{(i+1)}) , \tag{8}
\]

from which we immediately get

\[
\lambda = \frac{\hat{\mu}_E \cdot \mu^{(i+1)}}{\mu^{(i+1)} \cdot \mu^{(i+1)}} \le 1 . \tag{9}
\]

□
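As a quick numeric illustration of the step analyzed above, the following minimal sketch (assuming NumPy; the sampled points are made-up stand-ins for µ̄^(i), µ^(i+1) and µ̂_E, with µ̂_E placed among the candidate feature expectations so that µ̂_E ∈ M as in Lemma 3) computes λ and µ̃^(i+1) and checks the one-step improvement bound of Equation (6).

```python
# Minimal numeric sketch of the one-step improvement bound (Equations (1)-(6)).
# Assumes NumPy; the sampled points are made-up stand-ins for feature expectations.
import numpy as np

rng = np.random.default_rng(0)
k, gamma = 5, 0.9
side = 1.0 / (1.0 - gamma)          # feature expectations live in [0, 1/(1-gamma)]^k

for _ in range(1000):
    mu_bar = rng.uniform(0, side, k)            # current point mu_bar^(i)
    mu_hat_E = rng.uniform(0, side, k)          # expert estimate (assumed to lie in M here)
    candidates = np.vstack([rng.uniform(0, side, (20, k)), mu_hat_E])
    # RL step: mu^(i+1) maximizes (mu_hat_E - mu_bar) . mu over the candidates
    mu_next = candidates[np.argmax(candidates @ (mu_hat_E - mu_bar))]

    # Shift the origin to mu_bar, as in the proof
    e, m = mu_hat_E - mu_bar, mu_next - mu_bar
    lam = (e @ m) / (m @ m)                     # lambda from Equation (9)
    mu_tilde = lam * m                          # point mu_tilde^(i+1) in shifted coordinates
    assert -1e-9 <= lam <= 1 + 1e-9

    lhs = ((mu_tilde - e) @ (mu_tilde - e)) / (e @ e)                    # Equation (1)
    bound = (k / (1 - gamma)) ** 2 / ((k / (1 - gamma)) ** 2 + e @ e)    # Equation (6)
    assert lhs <= bound + 1e-9
```
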


A.2. More General Convergence Theorem

We first review some definitions from the proofs section in (Abbeel & Ng, 2004). Given a set of policies Π, we define M = M(Π) = Co{µ(π) : π ∈ Π} to be the convex hull of the set of feature expectations attained by policies π ∈ Π. Hence, given any vector of feature expectations µ̃ ∈ M, there is a set of policies π_1, ..., π_n ∈ Π and mixture weights {λ_i}_{i=1}^n (λ_i ≥ 0, Σ_{i=1}^n λ_i = 1) such that µ̃ = Σ_{i=1}^n λ_i µ(π_i). Thus, given any point µ̃ ∈ M, by mixing together policies in Π we can obtain a new policy whose feature expectations are exactly µ̃. (Here, mixture policies are as defined in Section 2 of the paper.)

We also define M^(i) = Co{µ(π^(j)) : j = 0, ..., i} to be the convex hull of the set of feature expectations of the policies found after iterations 0, ..., i of our algorithm.

As mentioned previously, µ̂_E is a noisy estimate of µ_E. Thus, it may not be a valid feature expectation vector for any policy; i.e., we do not necessarily have µ̂_E ∈ M. So rather than proving convergence to µ_E, we will instead consider a small ball of radius ρ centered around µ̂_E that intersects M, and prove convergence to this ball.

Lemmas 4-6 will establish properties of a single iteration of the algorithm that will be useful for proving the main convergence theorem. In reading the proofs, Figure 2 may be helpful for conveying geometric intuition.

[Figure 2. Triangle characterizing the improvement for one iteration. µ̄^(i) ∈ M^(i); w^(i+1) = α(µ̂_E − µ̄^(i)); µ^(i+1) = µ(π^(i+1)), with π^(i+1) the optimal policy for the given MDP\R and R(s) = w^(i+1)·φ(s). µ̂_E is the estimate of the expert's feature expectations. µ_{ρ,i} is the point in the ρ-ball around µ̂_E closest to µ̄^(i). µ̃^(i+1) is the projection of µ_{ρ,i} onto the line through µ^(i+1) and µ̄^(i). A, B, C denote the three angles of the triangle; a, b, c denote the vectors of its sides, i.e. a = µ_{ρ,i} − µ^(i+1), b = µ̄^(i) − µ_{ρ,i}, c = µ̄^(i) − µ^(i+1).]

Lemma 4. Let µ̄^(i), µ̂_E ∈ R^k and ρ ∈ R with ||µ̂_E − µ̄^(i)||_2 ≥ ρ be given. Then the two following optimization problems

\[
\min_{\mu : \|\hat{\mu}_E - \mu\|_2 \le \rho} \|\mu - \bar{\mu}^{(i)}\|_2 \tag{10}
\]
\[
\min_{\mu : \|\hat{\mu}_E - \mu\|_2 \le \rho} \mu \cdot (\hat{\mu}_E - \bar{\mu}^{(i)}) \tag{11}
\]

have the same minimizing argument, which is given by

\[
\mu_{\rho,i} = \frac{\rho}{\|\hat{\mu}_E - \bar{\mu}^{(i)}\|_2}\, \bar{\mu}^{(i)} + \frac{\|\hat{\mu}_E - \bar{\mu}^{(i)}\|_2 - \rho}{\|\hat{\mu}_E - \bar{\mu}^{(i)}\|_2}\, \hat{\mu}_E . \tag{12}
\]

We also have that

\[
\exists \alpha > 0 \ \text{such that} \ \bar{\mu}^{(i)} - \mu_{\rho,i} = \alpha (\bar{\mu}^{(i)} - \hat{\mu}_E) . \tag{13}
\]

Proof. This can be verified by solving each of the problems, which can be done by forming the Lagrangian, taking derivatives and setting them to zero. The derivation is trivial but quite long, and is thus omitted. Equation (13) follows immediately from Equation (12). □

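The closed form (12) is easy to check numerically. The sketch below (assuming NumPy; all inputs are made-up examples) computes µ_{ρ,i} and compares it against random feasible points of problems (10) and (11).

```python
# Numeric sanity check of the closed-form minimizer in Equation (12).
# Assumes NumPy; mu_bar, mu_hat_E and rho are arbitrary made-up inputs.
import numpy as np

rng = np.random.default_rng(1)

def mu_rho(mu_bar, mu_hat_E, rho):
    """Closed-form minimizer of (10) and (11): the point of the rho-ball
    around mu_hat_E that is closest to mu_bar (Equation (12))."""
    d = np.linalg.norm(mu_hat_E - mu_bar)
    assert d >= rho
    return (rho / d) * mu_bar + ((d - rho) / d) * mu_hat_E

k, rho = 4, 0.3
mu_bar, mu_hat_E = rng.normal(size=k), rng.normal(size=k) + 2.0
m_star = mu_rho(mu_bar, mu_hat_E, rho)

# Random points of the feasible set {mu : ||mu_hat_E - mu||_2 <= rho}
u = rng.normal(size=(5000, k))
feasible = mu_hat_E + rho * rng.uniform(0, 1, (5000, 1)) ** (1 / k) * u / np.linalg.norm(u, axis=1, keepdims=True)

# m_star should attain the minimum of both objectives (up to sampling noise)
assert np.linalg.norm(m_star - mu_bar) <= np.linalg.norm(feasible - mu_bar, axis=1).min() + 1e-6
assert m_star @ (mu_hat_E - mu_bar) <= (feasible @ (mu_hat_E - mu_bar)).min() + 1e-6
```
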
Lemma 5. Let there be given an MDP\R, features φ : S → [0, 1]^k, a set of policies Π, µ̄^(i) ∈ M, and ρ ∈ R. Suppose ||µ̂_E − µ̄^(i)||_2 ≥ ρ, and that there exists some µ̄_E ∈ M such that ||µ̂_E − µ̄_E||_2 ≤ ρ. Let π^(i+1) be the optimal policy for the MDP\R with reward R(s) = (µ̂_E − µ̄^(i))·φ(s), and µ^(i+1) = µ(π^(i+1)). Further, let µ_{ρ,i} = arg min_{µ : ||µ̂_E − µ||_2 ≤ ρ} ||µ − µ̄^(i)||_2. Then we have that

\[
(\mu^{(i+1)} - \mu_{\rho,i}) \cdot (\bar{\mu}^{(i)} - \mu_{\rho,i}) \le 0 .
\]

Proof. Since µ^(i+1) = µ(π^(i+1)), and π^(i+1) is the optimal policy for the MDP\R with reward R(s) = (µ̂_E − µ̄^(i))·φ(s), we have

\[
\mu^{(i+1)} = \arg\max_{\mu \in M} (\hat{\mu}_E - \bar{\mu}^{(i)}) \cdot \mu .
\]

This implies

\[
(\hat{\mu}_E - \bar{\mu}^{(i)}) \cdot \mu^{(i+1)} \ge (\hat{\mu}_E - \bar{\mu}^{(i)}) \cdot \bar{\mu}_E \tag{14}
\]

since µ̄_E ∈ M. Using the equivalent definition of µ_{ρ,i} as given by Equation (11) in Lemma 4 and ||µ̄_E − µ̂_E||_2 ≤ ρ, we have (µ̂_E − µ̄^(i))·µ̄_E ≥ (µ̂_E − µ̄^(i))·µ_{ρ,i}, which combined with Equation (14) gives

\[
(\hat{\mu}_E - \bar{\mu}^{(i)}) \cdot \mu^{(i+1)} \ge (\hat{\mu}_E - \bar{\mu}^{(i)}) \cdot \mu_{\rho,i} .
\]

Simple manipulation gives

\[
(\bar{\mu}^{(i)} - \hat{\mu}_E) \cdot (\mu^{(i+1)} - \mu_{\rho,i}) \le 0 .
\]

By using Equation (13) of Lemma 4 we get the desired result

\[
(\bar{\mu}^{(i)} - \mu_{\rho,i}) \cdot (\mu^{(i+1)} - \mu_{\rho,i}) \le 0 .
\]

□

Lemma 6. Let there be given an MDP\R, features φ : S → [0, 1]^k, a set of policies Π, µ̄^(i) ∈ M, and ρ ∈ R. Suppose ||µ̂_E − µ̄^(i)||_2 ≥ ρ and there is some µ̄_E ∈ M such that ||µ̂_E − µ̄_E||_2 ≤ ρ. Let π^(i+1) be the optimal policy for the MDP\R with reward R(s) = (µ̂_E − µ̄^(i))·φ(s), and µ^(i+1) = µ(π^(i+1)). Further, let µ̃^(i+1) be the orthogonal projection of µ_{ρ,i} = arg min_{µ : ||µ̂_E − µ||_2 ≤ ρ} ||µ − µ̄^(i)||_2 onto the line passing through the points µ^(i+1) and µ̄^(i). We then have that

\[
\frac{\|\tilde{\mu}^{(i+1)} - \mu_{\rho,i}\|_2}{\|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2} \le \frac{\sqrt{k}}{\sqrt{k + (1-\gamma)^2 \|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2^2}} .
\]

Proof. Consider the triangle formed by the three points µ̄^(i), µ^(i+1) and µ_{ρ,i}, and name the sides and angles as in Figure 2. (Note that a, b, and c are vectors; see the figure caption.) Then we have

\[
\frac{\|\tilde{\mu}^{(i+1)} - \mu_{\rho,i}\|_2}{\|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2} = \frac{h}{\|b\|_2} \tag{15}
\]
\[
= \sin A \tag{16}
\]
\[
= \frac{\|a\|_2}{\|c\|_2} \sin C \tag{17}
\]
\[
\le \frac{\|a\|_2}{\sqrt{\|a\|_2^2 + \|b\|_2^2}} , \tag{18}
\]

where we used in order: the definition of b; the sine rule for right triangles; the sine rule for triangles [1]; and sin C ≤ 1 together with ||c||_2^2 = c^T c = (a+b)^T (a+b) = a^T a + b^T b + 2 a^T b ≥ ||a||_2^2 + ||b||_2^2, where the inequality follows from a^T b = −(µ^(i+1) − µ_{ρ,i})·(µ̄^(i) − µ_{ρ,i}) ≥ 0, which follows directly from Lemma 5. By taking the first derivative, we easily verify that ||a||_2 / √(||a||_2^2 + ||b||_2^2) is strictly increasing as a function of ||a||_2, and thus

\[
\frac{\|a\|_2}{\sqrt{\|a\|_2^2 + \|b\|_2^2}} \le \frac{\sqrt{k}}{\sqrt{k + (1-\gamma)^2 \|b\|_2^2}} ,
\]

since √k/(1−γ) is an upper bound on the distance between two points in M. This upper bound on the distance follows from φ ∈ [0, 1]^k, which implies µ ∈ (1/(1−γ)) [0, 1]^k. Combining this inequality with Equation (18) and b = µ̄^(i) − µ_{ρ,i} (by definition) gives the lemma. □

[1] For any triangle with sides a, b, c and opposite angles A, B, C, the sine rule for triangles states that sin A / ||a|| = sin B / ||b|| = sin C / ||c||.

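The contraction factor of Lemma 6 can be checked numerically. The sketch below (assuming NumPy; the sampled triangles are synthetic, not produced by an actual MDP solver) draws triples from [0, 1/(1−γ)]^k, keeps those satisfying the obtuse-angle condition guaranteed by Lemma 5, projects µ_{ρ,i} onto the line through µ̄^(i) and µ^(i+1), and compares the resulting ratio against the bound.

```python
# Numeric check of the bound in Lemma 6 on synthetic triangles.
# Assumes NumPy; points are sampled directly instead of coming from an MDP solver.
import numpy as np

rng = np.random.default_rng(2)
k, gamma = 6, 0.95
side = 1.0 / (1.0 - gamma)

checked = 0
while checked < 1000:
    mu_bar, mu_next, mu_rho = rng.uniform(0, side, (3, k))
    # Obtuse-angle condition at mu_rho, as guaranteed by Lemma 5
    if (mu_next - mu_rho) @ (mu_bar - mu_rho) > 0:
        continue
    # Orthogonal projection of mu_rho onto the line through mu_bar and mu_next
    d = mu_next - mu_bar
    mu_tilde = mu_bar + ((mu_rho - mu_bar) @ d) / (d @ d) * d
    b = np.linalg.norm(mu_bar - mu_rho)
    ratio = np.linalg.norm(mu_tilde - mu_rho) / b
    bound = np.sqrt(k) / np.sqrt(k + (1 - gamma) ** 2 * b ** 2)
    assert ratio <= bound + 1e-9
    checked += 1
```
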
The above lemma describes how, starting from feature expectations µ̄^(i) ∈ M, we can pick w = µ̂_E − µ̄^(i) such that the feature expectations µ^(i+1) of the corresponding optimal policy π^(i+1) and µ̄^(i) have a mixture µ̃^(i+1) satisfying ||µ̃^(i+1) − µ_{ρ,i}||_2 ≤ c ||µ̄^(i) − µ_{ρ,i}||_2, with c < 1. Convergence of each version of the algorithm is proved below, by showing how in each iteration we achieve the improvement from the lemma.

For simplicity, in Section 3 of the paper we gave the algorithm assuming we are given the exact feature expectations µ_E. In the general case, the algorithm will use an estimate µ̂_E instead, and convergence will be to a ball of radius ρ around µ̂_E, for any ρ ≥ min_{µ∈M} ||µ − µ̂_E||_2. So now we let

\[
t^{(i)}_{mm} = \min_{\nu \in M^{(i)},\ \mu : \|\hat{\mu}_E - \mu\|_2 \le \rho} \|\nu - \mu\|_2
\]

for the max-margin version, and

\[
t^{(i)}_{proj} = \min_{\nu \in Co\{\bar{\mu}^{(i)}, \mu^{(i+1)}\},\ \mu : \|\hat{\mu}_E - \mu\|_2 \le \rho} \|\nu - \mu\|_2
\]

for the projection version. Note that t^(i)_proj ≥ t^(i)_mm, because t^(i)_proj is defined using a minimization over a smaller domain (Co{µ̄^(i), µ^(i+1)} ⊆ M^(i+1)). Here, as before, we use Co Z to denote the convex hull of the set Z. Using our new definition of t^(i), we can now state a more general version of the convergence theorem; the theorem in the paper can be seen to be the special case where µ̂_E = µ_E, so that we can choose ρ = 0 and the old and new definitions of t^(i) coincide.

Theorem 1. Let an MDP\R, features φ : S → [0, 1]^k, any ε > 0 and any ρ ≥ min_{µ∈M} ||µ − µ̂_E||_2 be given. Then the apprenticeship learning algorithm (both max-margin and projection versions) will terminate with t^(i) ≤ ε after at most

\[
n = O\!\left( \frac{k}{(1-\gamma)^2 \varepsilon^2} \log \frac{k}{(1-\gamma)\varepsilon} \right) \tag{19}
\]

iterations.

Proof. We first show for both versions that we have geometric convergence at a rate √k / √(k + (1−γ)²ε²), and then use this to compute the required number of iterations.

In the max-margin version, the computation of w^(i) at every iteration is easily seen to be equivalent to setting w^(i) = µ̂_E − µ̄^(i), for µ̄^(i) = arg min_{µ ∈ M^(i)} ||µ̂_E − µ||_2. Obviously µ̄^(i) ∈ M^(i), as is required to apply Lemma 6 later. If we define µ_{ρ,i} = arg min_{µ : ||µ̂_E − µ||_2 ≤ ρ} ||µ − µ̄^(i)||_2 as in Lemma 6, then we see that

\[
t^{(i)}_{mm} = \|\mu_{\rho,i} - \bar{\mu}^{(i)}\|_2 . \tag{20}
\]

Define µ̃^(i+1) as in Lemma 6 and observe that µ̃^(i+1) ∈ M^(i+1); using the definition of t^(i+1)_mm we have

\[
t^{(i+1)}_{mm} \le \|\tilde{\mu}^{(i+1)} - \mu_{\rho,i}\|_2 . \tag{21}
\]

Combining Equations (20) and (21) gives

\[
\frac{t^{(i+1)}_{mm}}{t^{(i)}_{mm}} \le \frac{\|\tilde{\mu}^{(i+1)} - \mu_{\rho,i}\|_2}{\|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2} .
\]

Applying Lemma 6 gives

\[
\frac{t^{(i+1)}_{mm}}{t^{(i)}_{mm}} \le \frac{\sqrt{k}}{\sqrt{k + (1-\gamma)^2 \|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2^2}} .
\]

For the projection version, at every iteration we have µ̄^(i) = µ̃^(i) = arg min_{µ ∈ Aff{µ̄^(i−1), µ^(i)}} ||µ̂_E − µ||_2 = arg min_{µ ∈ Co{µ̄^(i−1), µ^(i)}} ||µ̂_E − µ||_2, so obviously µ̄^(i) ∈ M^(i), as is required to apply Lemma 6 later. Here, Aff Z denotes the affine hull of Z; specifically, if Z is a set of two points, then this is the line through these two points. [2] The first equality holds because in step 2 of our algorithm we compute µ̄^(i−1) exactly like we defined µ̃^(i−1). The second equality corresponds to the definition of orthogonal projection onto a line (and thus corresponds to our definition of µ̃^(i)); the last equality follows because of Lemma 5. [3] If we define µ_{ρ,i} = arg min_{µ : ||µ̂_E − µ||_2 ≤ ρ} ||µ − µ̄^(i)||_2 as in Lemma 6, then we see that

\[
t^{(i)}_{proj} = \|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2 . \tag{22}
\]

Define µ̃^(i+1) as in Lemma 6, and observe that µ̃^(i+1) ∈ M^(i+1); using the definition of t^(i+1)_proj we have

\[
t^{(i+1)}_{proj} \le \|\tilde{\mu}^{(i+1)} - \mu_{\rho,i}\|_2 . \tag{23}
\]

Combining Equations (22) and (23) gives

\[
\frac{t^{(i+1)}_{proj}}{t^{(i)}_{proj}} \le \frac{\|\tilde{\mu}^{(i+1)} - \mu_{\rho,i}\|_2}{\|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2} .
\]

Applying Lemma 6 gives

\[
\frac{t^{(i+1)}_{proj}}{t^{(i)}_{proj}} \le \frac{\sqrt{k}}{\sqrt{k + (1-\gamma)^2 \|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2^2}} .
\]

So in both cases we get the same guarantee for improvement in every iteration. Over the iterations this results in

\[
t^{(i)} \le \left( \frac{\sqrt{k}}{\sqrt{(1-\gamma)^2 \varepsilon^2 + k}} \right)^{\!i} t^{(0)} \le \left( \frac{\sqrt{k}}{\sqrt{(1-\gamma)^2 \varepsilon^2 + k}} \right)^{\!i} \frac{\sqrt{k}}{1-\gamma} ,
\]

where the last inequality follows from M ⊆ (1/(1−γ)) [0, 1]^k, so that √k/(1−γ) is an upper bound on the distance between any two points in M. So we have t^(i) ≤ ε as soon as

\[
\left( \frac{\sqrt{k}}{\sqrt{(1-\gamma)^2 \varepsilon^2 + k}} \right)^{\!i} \frac{\sqrt{k}}{1-\gamma} \le \varepsilon ,
\]

which is equivalent to

\[
i \ge \log\!\left( \frac{\sqrt{k}}{(1-\gamma)\varepsilon} \right) \Big/ \log\!\left( \frac{\sqrt{(1-\gamma)^2 \varepsilon^2 + k}}{\sqrt{k}} \right) = O\!\left( \frac{k}{(1-\gamma)^2 \varepsilon^2} \log \frac{k}{(1-\gamma)\varepsilon} \right) .
\]

□

[2] More formally, Aff Z = { Σ_i λ_i z_i : z_i ∈ Z, Σ_i λ_i = 1, λ_i ∈ R }.

[3] More precisely, Lemma 5 implies that the three points µ^(i+1), µ_{ρ,i}, µ̄^(i) form an obtuse angle at µ_{ρ,i}, which implies that the orthogonal projection of µ_{ρ,i} onto Aff{µ^(i+1), µ̄^(i)} falls into Co{µ^(i+1), µ̄^(i)}.

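The explicit (non-asymptotic) iteration count from the last display is easy to evaluate. A small sketch, assuming example values for k, γ and ε:

```python
# Smallest i satisfying the last display of the proof of Theorem 1.
# Assumes example values for k, gamma and epsilon.
import math

def iteration_bound(k, gamma, eps):
    rate_log = math.log(math.sqrt((1 - gamma) ** 2 * eps ** 2 + k) / math.sqrt(k))
    return math.ceil(math.log(math.sqrt(k) / ((1 - gamma) * eps)) / rate_log)

print(iteration_bound(k=4, gamma=0.9, eps=0.5))   # grows like k/((1-gamma)^2 eps^2) log(...)
```
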
Note that in practice, convergence might be much faster than predicted by the upper bound in the theorem, since we have an improvement by at least a factor of √k / √(k + (1−γ)² ||µ̄^(i) − µ_{ρ,i}||_2²) in every iteration, which is typically much better than the bound √k / √(k + (1−γ)²ε²) used in the proof.

Remark (using approximate RL algorithms). Sometimes, in each iteration of the algorithm we will be able to solve the MDP only approximately. It is straightforward to generalize our result to this setting. Assume on each iteration we can obtain an approximately optimal policy π such that ||µ(π) − µ(π*)||_2 ≤ ε_1, with π* the optimal policy for that iteration. Then for any ρ ≥ min_{µ∈M} ||µ − µ̂_E||_2 and any ε > 0 our algorithm will converge to a policy π̃ such that ||µ̂_E − µ(π̃)||_2 ≤ ρ + ε_1 + ε after at most the number of iterations given in Equation (19) of Theorem 1. This result is easily proved by changing the definition of the ρ-ball around µ̂_E to a (ρ + ε_1)-ball.

A.3. Sample Complexity

We now consider the number of samples from the expert that is required. In the paper we assumed µ̂_E ∈ M. Here we do not assume this, which leads to slightly different constant factors. The differences between this proof and the proof in the paper are the following: we take into account the possibility µ̂_E ∉ M, and the proof is a little less dense.

Theorem 2. Let an MDP\R, features φ : S → [0, 1]^k, and any ε > 0, δ > 0 be given. Suppose the apprenticeship learning algorithm (either max-margin or projection version) is run using an estimate µ̂_E for µ_E obtained by m Monte Carlo samples. In order to ensure that with probability at least 1 − δ the algorithm terminates after at most a number of iterations n given by Equation (19), and outputs a policy π̃ so that for any true reward R*(s) = w*^T φ(s) (||w*||_1 ≤ 1) [4] we have

\[
E\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\Big|\, \tilde{\pi}\Big] \ge E\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\Big|\, \pi_E\Big] - \varepsilon , \tag{24}
\]

it suffices that

\[
m \ge \frac{9k}{2(\varepsilon(1-\gamma))^2} \log \frac{2k}{\delta} .
\]

[4] The rationale for the 1-norm constraint on w and the ∞-norm constraint on φ is that these are dual norms, and thus we have w·φ(s) ≤ ||w||_1 ||φ(s)||_∞ (Hölder's inequality). So it corresponds to constraining the maximal reward max_s R(s) ≤ 1. Taking any two dual norms for w and φ would imply max_s R(s) ≤ 1. Taking 2-norm constraints on w and φ results in exactly the same theorem.

Proof. Recall φ ∈ [0, 1]^k, so µ ∈ [0, 1/(1−γ)]^k. Let µ_i denote the i'th component of µ; then applying the Chernoff bound to the m-sample estimate (1−γ)µ̂_i of (1−γ)µ_i ∈ [0, 1] gives

\[
P\big((1-\gamma)|\mu_i - \hat{\mu}_i| > \tau\big) \le 2 \exp(-2\tau^2 m) . \tag{25}
\]

Using Equation (25) for all components i and the union bound gives us

\[
P\big(\exists i \in \{1, \ldots, k\} .\ (1-\gamma)|\mu_i - \hat{\mu}_i| > \tau\big) \le 2k \exp(-2\tau^2 m) . \tag{26}
\]

Now we subtract both sides of Equation (26) from 1, to find that

\[
P\big(\neg \exists i \in \{1, \ldots, k\} .\ (1-\gamma)|\mu_i - \hat{\mu}_i| > \tau\big) \tag{27}
\]
\[
= P\big((1-\gamma)\|\mu_E - \hat{\mu}_E\|_\infty \le \tau\big) \tag{28}
\]
\[
\ge 1 - 2k \exp(-2\tau^2 m) . \tag{29}
\]

Substituting τ = ε(1−γ)/(3√k) into Equation (29) gives

\[
P\Big( \|\mu_E - \hat{\mu}_E\|_\infty \le \frac{\varepsilon}{3\sqrt{k}} \Big) \ge 1 - 2k \exp\Big(\!-2\Big(\frac{\varepsilon(1-\gamma)}{3\sqrt{k}}\Big)^2 m\Big) , \tag{30}
\]

where k is the dimension of the feature vectors φ and feature expectations µ, and m is the number of sample trajectories used for the estimate µ̂_E. So if we take m ≥ (9k / (2(ε(1−γ))²)) log(2k/δ), then with probability at least 1 − δ we have that

\[
\|\mu_E - \hat{\mu}_E\|_\infty \le \frac{\varepsilon}{3\sqrt{k}} .
\]

Using ||·||_2 ≤ √k ||·||_∞ for k-dimensional space, we get

\[
\|\mu_E - \hat{\mu}_E\|_2 \le \frac{\varepsilon}{3} . \tag{31}
\]

Since µ_E ∈ M, Theorem 1 together with Equation (31) guarantees convergence to the ball of radius ρ = ε/3 around µ̂_E. After sufficiently many iterations of the algorithm, as specified in Equation (19), we have t^(i) ≤ ε/3, and thus the algorithm will return a policy π̃ such that µ̃ = µ(π̃) satisfies

\[
\|\tilde{\mu} - \hat{\mu}_E\|_2 \le t^{(i)} + \rho \le \frac{2\varepsilon}{3} . \tag{32}
\]

Now (keeping in mind that ||·||_2 ≤ ||·||_1, so that ||w||_1 ≤ 1 implies ||w||_2 ≤ 1) we can easily prove the result:

\[
\Big| E\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s^{(t)}) \,\Big|\, \tilde{\pi}\Big] - E\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s^{(t)}) \,\Big|\, \pi_E\Big] \Big| \tag{33}
\]
\[
= |(w^*)^T (\tilde{\mu} - \mu_E)|
\]
\[
= |(w^*)^T (\tilde{\mu} - \hat{\mu}_E + \hat{\mu}_E - \mu_E)|
\]
\[
\le |(w^*)^T (\tilde{\mu} - \hat{\mu}_E)| + |(w^*)^T (\hat{\mu}_E - \mu_E)|
\]
\[
\le \|w^*\|_2 \|\tilde{\mu} - \hat{\mu}_E\|_2 + \|w^*\|_2 \|\hat{\mu}_E - \mu_E\|_2
\]
\[
\le 1 \cdot \Big(\frac{2\varepsilon}{3} + \frac{\varepsilon}{3}\Big) \quad \text{w.p. } 1-\delta \ \text{ for } m \ge \frac{9k}{2(\varepsilon(1-\gamma))^2} \log \frac{2k}{\delta}
\]
\[
= \varepsilon \quad \text{w.p. } 1-\delta \ \text{ for } m \ge \frac{9k}{2(\varepsilon(1-\gamma))^2} \log \frac{2k}{\delta} ,
\]

where we used in order: the definitions of w and µ; adding and subtracting the same terms; the triangle inequality; Hölder's inequality for p = q = 2 [5]; Equations (31) and (32); and simplification. The last line directly implies the theorem. □

[5] Hölder's inequality states that for any p ≥ 1, q ≥ 1 with 1/p + 1/q = 1 and any x, y in some vector space, we have x·y ≤ ||x||_p ||y||_q.

Note that in case the underlying reward function R* does not lie exactly in the span of the basis functions, we have graceful degradation of performance. Let R*(s) = w*·φ(s) + f(s); then it is easy to see from Equation (33) in the proof above that the performance degradation is bounded by ||f||_∞ / (1−γ).

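The sample-size bound of Theorem 2 is straightforward to evaluate, and the concentration step behind it can be checked empirically. A minimal sketch, assuming NumPy and made-up values of k, ε, δ and γ (the Bernoulli component below is only a stand-in for one scaled coordinate (1−γ)µ_i):

```python
# Number of expert trajectories sufficient for Theorem 2, plus an empirical
# check of the concentration step (25) on a single synthetic component.
# Assumes NumPy; k, eps, delta, gamma and the Bernoulli mean are made up.
import math
import numpy as np

def samples_needed(k, eps, delta, gamma):
    return math.ceil(9 * k / (2 * (eps * (1 - gamma)) ** 2) * math.log(2 * k / delta))

k, eps, delta, gamma = 8, 0.1, 0.05, 0.9
m = samples_needed(k, eps, delta, gamma)
print("m >=", m)

# Empirical check of P((1-gamma)|mu_i - muhat_i| > tau) <= 2 exp(-2 tau^2 m)
rng = np.random.default_rng(3)
tau = eps * (1 - gamma) / (3 * math.sqrt(k))
p_true = 0.3                                   # stand-in for (1-gamma) * mu_i
estimates = rng.binomial(m, p_true, size=2000) / m
freq = np.mean(np.abs(estimates - p_true) > tau)
print(freq, "<=", 2 * math.exp(-2 * tau ** 2 * m))
```
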
B. Alternative Interpretation of the Projection Algorithm

It is well known that MDPs can be solved via linear programming, more specifically by solving the Bellman LP

\[
\min_V \ e^T V \tag{34}
\]
\[
\text{s.t.} \quad \forall s, a \quad V(s) \ge R(s) + \gamma \sum_z P(z|s,a) V(z) ,
\]

where e is an |S|-dimensional vector of all ones. Although this LP is generally too big to solve exactly, there has been recent work on using this LP formulation to get approximate solutions (de Farias & Van Roy, 2003; Guestrin et al., 2003). The dual of this LP is

\[
\max_\lambda \ \sum_{s,a} \lambda(s,a) R(s) \tag{35}
\]
\[
\text{s.t.} \quad \forall s \quad \sum_a \lambda(s,a) - \gamma \sum_{z,a} P(s|z,a) \lambda(z,a) = 1 .
\]

The entry λ(s,a) represents the expected frequency of the state-action pair (s,a); the constraints ensure that λ is consistent with the transition probabilities of the MDP. So we can write the feature expectations of the policy specified by λ explicitly as a function of λ:

\[
(\mu(\lambda))_k = \sum_{s,a} \lambda(s,a) \phi_k(s) . \tag{36}
\]

We can explicitly formulate the problem of matching the expert's feature expectations µ_E as a QP:

\[
\min_\lambda \ \sum_k \Big( \mu_{E,k} - \sum_{s,a} \lambda(s,a) \phi_k(s) \Big)^2 \tag{37}
\]
\[
\text{s.t.} \quad \forall s \quad \sum_a \lambda(s,a) - \gamma \sum_{z,a} P(s|z,a) \lambda(z,a) = 1 .
\]
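For a small MDP, the dual LP (35) (together with the nonnegativity constraint λ ≥ 0 that the dual carries) can be solved directly, and Equation (36) then yields the feature expectations of the corresponding policy. A minimal sketch, assuming NumPy and SciPy and a made-up random MDP:

```python
# Solve the dual Bellman LP (35) for a small random MDP and recover the
# feature expectations via Equation (36). Assumes NumPy and SciPy; the MDP
# (transitions P, reward R, features phi) is made up for illustration.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
S, A, k, gamma = 5, 3, 4, 0.9

P = rng.uniform(size=(S, S, A))          # P[z_next, z, a] = P(z_next | z, a)
P /= P.sum(axis=0, keepdims=True)
phi = rng.uniform(size=(S, k))           # features phi(s) in [0, 1]^k
R = rng.uniform(size=S)                  # some fixed reward R(s)

# Variables lambda(s, a), flattened with index s * A + a.
c = -np.repeat(R, A)                     # maximize sum_{s,a} lambda(s,a) R(s)
A_eq = np.zeros((S, S * A))
for s in range(S):
    for z in range(S):
        for a in range(A):
            A_eq[s, z * A + a] = (1.0 if z == s else 0.0) - gamma * P[s, z, a]
b_eq = np.ones(S)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
lam = res.x.reshape(S, A)                # expected state-action frequencies
mu = phi.T @ lam.sum(axis=1)             # feature expectations, Equation (36)
print(mu)
```

Choosing R(s) = (µ_E − µ^(i))·φ(s) in this LP is exactly the linearized step discussed next.
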

In practice, the above QP is typically too large to solve exactly, just like the Bellman LP and its dual. We will now derive an algorithm to solve the above QP, assuming we have access to a reinforcement learning algorithm, i.e., we assume we can get a solution to the Bellman LP (and/or its dual). The idea is to linearize the objective of (37) at the point of its current iterate, and then use the RL algorithm to solve the corresponding LP. Then the algorithm does a line search between the current iterate's point and the solution to the LP (which is feasible, but not necessarily optimal for the QP). Then we linearize around the obtained point and iterate the above steps. [6] The linearized objective at a point λ^(i) with corresponding feature expectations µ^(i) is easily computed to be

\[
\sum_{s,a} \lambda(s,a) \sum_k \big( \mu_{E,k} - \mu^{(i)}_k \big) \phi_k(s) . \tag{38}
\]

So the corresponding LP is the dual of a Bellman LP, with reward R(s) = Σ_k (µ_{E,k} − µ^(i)_k) φ_k(s) = (µ_E − µ^(i))·φ(s). It is easily seen that the above algorithm corresponds to the projection algorithm: the projection corresponds to the line search, and choosing a reward weight vector w and finding the respective optimal policy corresponds to linearizing and solving the obtained LP. A sketch of the resulting loop is given below.

[6] Note that our algorithm is an instantiation of the so-called Frank-Wolfe algorithm (Censor & Zenios, 1997).

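The following sketch shows one way the loop could look. It is a hypothetical illustration, not the authors' implementation: solve_mdp(w) stands for any RL routine that returns the feature expectations of an optimal policy for reward R(s) = w·φ(s) (for instance the LP of the previous sketch), and mu_E_hat is the estimated expert feature expectations.

```python
# Hypothetical sketch of the projection (Frank-Wolfe-style) loop described above.
# solve_mdp(w) is an assumed RL oracle returning mu(pi) for reward weights w;
# mu_E_hat is the (estimated) expert feature expectations.
import numpy as np

def projection_algorithm(mu_E_hat, solve_mdp, n_iters=50):
    mu_bar = solve_mdp(mu_E_hat)              # initialize with some policy's feature expectations
    for _ in range(n_iters):
        w = mu_E_hat - mu_bar                 # linearization: reward weights for this iteration
        mu_next = solve_mdp(w)                # RL step: optimal feature expectations for R(s) = w . phi(s)
        d = mu_next - mu_bar
        if d @ d < 1e-12:
            break
        # Line search: orthogonal projection of mu_E_hat onto the segment [mu_bar, mu_next]
        lam = np.clip((mu_E_hat - mu_bar) @ d / (d @ d), 0.0, 1.0)
        mu_bar = mu_bar + lam * d
    return mu_bar
```
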

References

Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. Proc. ICML.

Censor, Y., & Zenios, S. (1997). Parallel optimization: Theory, algorithms, and applications. Oxford University Press.

de Farias, D. P., & Van Roy, B. (2003). The linear programming approach to approximate dynamic programming. Operations Research, 51(6).

Guestrin, C., Koller, D., Parr, R., & Venkataraman, S. (2003). Efficient solution algorithms for factored MDPs. JAIR, 19, 399-468.