Supplementary Material to ICML04 Submission

Apprenticeship Learning via Inverse Reinforcement Learning
Abstract

In this document, we give more elaborate proofs for the theorems in the submitted paper. We also have a section on a different interpretation of the projection algorithm.

A. Proofs of Theoretical Results

Due to space constraints, in (Abbeel & Ng, 2004) we gave a full proof only for the case of µ̂_E ∈ M. Here we give proofs for the more general case, i.e., we do not necessarily have µ̂_E ∈ M. First, however, we give a more extensive proof of Lemma 3 in (Abbeel & Ng, 2004), which was proved very densely there.

A.1. Extended Proof of Lemma 3

Figure 1 may be helpful for conveying geometric intuition.

[Figure 1. Progress in one iteration step.]

Proof of Lemma 3

Proof. For simplicity of notation, we let the origin of our coordinate system coincide with µ̄^(i). Then

\[
\frac{(\tilde{\mu}^{(i+1)} - \hat{\mu}_E) \cdot (\tilde{\mu}^{(i+1)} - \hat{\mu}_E)}{\hat{\mu}_E \cdot \hat{\mu}_E} \tag{1}
\]
\[
= \frac{\mu^{(i+1)} \cdot \mu^{(i+1)} - \frac{(\mu^{(i+1)} \cdot \hat{\mu}_E)^2}{\hat{\mu}_E \cdot \hat{\mu}_E}}{\mu^{(i+1)} \cdot \mu^{(i+1)}} \tag{2}
\]
\[
\le \frac{\mu^{(i+1)} \cdot \mu^{(i+1)} - 2\hat{\mu}_E \cdot \mu^{(i+1)} + \hat{\mu}_E \cdot \hat{\mu}_E}{\mu^{(i+1)} \cdot \mu^{(i+1)}} \tag{3}
\]
\[
= \frac{(\mu^{(i+1)} - \hat{\mu}_E) \cdot (\mu^{(i+1)} - \hat{\mu}_E)}{(\mu^{(i+1)} - \hat{\mu}_E) \cdot (\mu^{(i+1)} - \hat{\mu}_E) + \hat{\mu}_E \cdot \hat{\mu}_E + 2(\mu^{(i+1)} - \hat{\mu}_E) \cdot \hat{\mu}_E} \tag{4}
\]
\[
\le \frac{(\mu^{(i+1)} - \hat{\mu}_E) \cdot (\mu^{(i+1)} - \hat{\mu}_E)}{(\mu^{(i+1)} - \hat{\mu}_E) \cdot (\mu^{(i+1)} - \hat{\mu}_E) + \hat{\mu}_E \cdot \hat{\mu}_E} \tag{5}
\]
\[
\le \frac{k^2/(1-\gamma)^2}{k^2/(1-\gamma)^2 + \hat{\mu}_E \cdot \hat{\mu}_E} , \tag{6}
\]

where we used in order:

1. The definition of \(\tilde{\mu}^{(i+1)} = \frac{\mu^{(i+1)} \cdot \hat{\mu}_E}{\mu^{(i+1)} \cdot \mu^{(i+1)}}\, \mu^{(i+1)}\), which gives for the numerator
\[
(\tilde{\mu}^{(i+1)} - \hat{\mu}_E) \cdot (\tilde{\mu}^{(i+1)} - \hat{\mu}_E)
= \Big(\tfrac{\mu^{(i+1)} \cdot \hat{\mu}_E}{\mu^{(i+1)} \cdot \mu^{(i+1)}}\, \mu^{(i+1)} - \hat{\mu}_E\Big) \cdot \Big(\tfrac{\mu^{(i+1)} \cdot \hat{\mu}_E}{\mu^{(i+1)} \cdot \mu^{(i+1)}}\, \mu^{(i+1)} - \hat{\mu}_E\Big)
= \tfrac{(\mu^{(i+1)} \cdot \hat{\mu}_E)^2}{(\mu^{(i+1)} \cdot \mu^{(i+1)})^2}\, \mu^{(i+1)} \cdot \mu^{(i+1)} - 2\, \tfrac{\mu^{(i+1)} \cdot \hat{\mu}_E}{\mu^{(i+1)} \cdot \mu^{(i+1)}}\, \mu^{(i+1)} \cdot \hat{\mu}_E + \hat{\mu}_E \cdot \hat{\mu}_E
= -\tfrac{(\mu^{(i+1)} \cdot \hat{\mu}_E)^2}{\mu^{(i+1)} \cdot \mu^{(i+1)}} + \hat{\mu}_E \cdot \hat{\mu}_E .
\]
Using this expression for the numerator, and multiplying numerator and denominator by \(\frac{\mu^{(i+1)} \cdot \mu^{(i+1)}}{\hat{\mu}_E \cdot \hat{\mu}_E}\), gives Equation (2).

2. The following inequalities are easily seen to be true:
\[
(\mu^{(i+1)} \cdot \hat{\mu}_E - \hat{\mu}_E \cdot \hat{\mu}_E)^2 \ge 0
\]
\[
(\mu^{(i+1)} \cdot \hat{\mu}_E)^2 - 2(\mu^{(i+1)} \cdot \hat{\mu}_E)(\hat{\mu}_E \cdot \hat{\mu}_E) + (\hat{\mu}_E \cdot \hat{\mu}_E)^2 \ge 0
\]
\[
-(\mu^{(i+1)} \cdot \hat{\mu}_E)^2 \le -2(\mu^{(i+1)} \cdot \hat{\mu}_E)(\hat{\mu}_E \cdot \hat{\mu}_E) + (\hat{\mu}_E \cdot \hat{\mu}_E)^2
\]
\[
-\frac{(\mu^{(i+1)} \cdot \hat{\mu}_E)^2}{\hat{\mu}_E \cdot \hat{\mu}_E} \le -2(\mu^{(i+1)} \cdot \hat{\mu}_E) + \hat{\mu}_E \cdot \hat{\mu}_E .
\]
We used the last of these inequalities.

3. For the numerator, we just used (µ^(i+1) − µ̂_E)·(µ^(i+1) − µ̂_E) = µ^(i+1)·µ^(i+1) − 2µ̂_E·µ^(i+1) + µ̂_E·µ̂_E. We rewrote the denominator as follows: µ^(i+1)·µ^(i+1) = (µ^(i+1) − µ̂_E + µ̂_E)·(µ^(i+1) − µ̂_E + µ̂_E) = (µ^(i+1) − µ̂_E)·(µ^(i+1) − µ̂_E) + (µ̂_E·µ̂_E) + 2(µ^(i+1) − µ̂_E)·µ̂_E.

4. Since π^(i+1) = arg max_π µ̂_E·µ(π) (recall the origin is at µ̄^(i) for notational convenience), we have µ̂_E·µ^(i+1) = µ̂_E·µ(π^(i+1)) ≥ µ̂_E·µ̂_E, which implies 2(µ^(i+1) − µ̂_E)·µ̂_E ≥ 0, which in turn implies the inequality in Equation (5).

5. All points considered lie in M = [0, 1/(1−γ)]^k, so their norms are bounded by k/(1−γ).

This proves the one-step improvement equation of the Lemma.

It remains to prove that µ̃^(i+1) = λµ^(i+1) + (1−λ)µ̄^(i) for some λ ∈ [0, 1]. It is easily seen from the definition of µ̃^(i+1) that for λ = (µ̂_E·µ^(i+1))/(µ^(i+1)·µ^(i+1)) we have µ̃^(i+1) = λµ^(i+1) + (1−λ)µ̄^(i). (Recall we still have µ̄^(i) as our origin to simplify notation.) Since π^(i+1) = arg max_π µ̂_E·µ(π), we have µ̂_E·µ^(i+1) = µ̂_E·µ(π^(i+1)) ≥ µ̂_E·µ̂_E ≥ 0, which implies λ ≥ 0. Now to prove λ ≤ 1, we start with the Cauchy-Schwarz inequality

\[
(\mu^{(i+1)} \cdot \hat{\mu}_E)^2 \le (\mu^{(i+1)} \cdot \mu^{(i+1)})(\hat{\mu}_E \cdot \hat{\mu}_E) , \tag{7}
\]

which combined with µ̂_E·µ̂_E ≤ µ^(i+1)·µ̂_E gives

\[
(\mu^{(i+1)} \cdot \hat{\mu}_E)^2 \le (\mu^{(i+1)} \cdot \mu^{(i+1)})(\hat{\mu}_E \cdot \mu^{(i+1)}) , \tag{8}
\]

from which we immediately get

\[
\lambda = \frac{\hat{\mu}_E \cdot \mu^{(i+1)}}{\mu^{(i+1)} \cdot \mu^{(i+1)}} \le 1 . \tag{9}
\]

□
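As a quick numeric illustration of the step analyzed above, the following minimal sketch (assuming NumPy; the sampled points are made-up stand-ins for µ̄^(i), µ^(i+1) and µ̂_E, with µ̂_E placed among the candidate feature expectations so that µ̂_E ∈ M as in Lemma 3) computes λ and µ̃^(i+1) and checks the one-step improvement bound of Equation (6).

```python
# Minimal numeric sketch of the one-step improvement bound (Equations (1)-(6)).
# Assumes NumPy; the sampled points are made-up stand-ins for feature expectations.
import numpy as np

rng = np.random.default_rng(0)
k, gamma = 5, 0.9
side = 1.0 / (1.0 - gamma)          # feature expectations live in [0, 1/(1-gamma)]^k

for _ in range(1000):
    mu_bar = rng.uniform(0, side, k)            # current point mu_bar^(i)
    mu_hat_E = rng.uniform(0, side, k)          # expert estimate (assumed to lie in M here)
    candidates = np.vstack([rng.uniform(0, side, (20, k)), mu_hat_E])
    # RL step: mu^(i+1) maximizes (mu_hat_E - mu_bar) . mu over the candidates
    mu_next = candidates[np.argmax(candidates @ (mu_hat_E - mu_bar))]

    # Shift the origin to mu_bar, as in the proof
    e, m = mu_hat_E - mu_bar, mu_next - mu_bar
    lam = (e @ m) / (m @ m)                     # lambda from Equation (9)
    mu_tilde = lam * m                          # point mu_tilde^(i+1) in shifted coordinates
    assert -1e-9 <= lam <= 1 + 1e-9

    lhs = ((mu_tilde - e) @ (mu_tilde - e)) / (e @ e)                    # Equation (1)
    bound = (k / (1 - gamma)) ** 2 / ((k / (1 - gamma)) ** 2 + e @ e)    # Equation (6)
    assert lhs <= bound + 1e-9
```
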


A.2. More General Convergence Theorem

We first review some definitions from the proofs section in (Abbeel & Ng, 2004). Given a set of policies Π, we define M = M(Π) = Co{µ(π) : π ∈ Π} to be the convex hull of the set of feature expectations attained by policies π ∈ Π. Hence, given any vector of feature expectations µ̃ ∈ M, there is a set of policies π_1, ..., π_n ∈ Π and mixture weights {λ_i}_{i=1}^n (λ_i ≥ 0, Σ_{i=1}^n λ_i = 1) such that µ̃ = Σ_{i=1}^n λ_i µ(π_i). Thus, given any point µ̃ ∈ M, by mixing together policies in Π we can obtain a new policy whose feature expectations are exactly µ̃. (Here, mixture policies are as defined in Section 2 of the paper.)

We also define M^(i) = Co{µ(π^(j)) : j = 0, ..., i} to be the convex hull of the set of feature expectations of the policies found after iterations 0, ..., i of our algorithm.

As mentioned previously, µ̂_E is a noisy estimate of µ_E. Thus, it may not be a valid feature expectation vector for any policy; i.e., we do not necessarily have µ̂_E ∈ M. So rather than proving convergence to µ_E, we will instead consider a small ball of radius ρ centered around µ̂_E that intersects M, and prove convergence to this ball.

Lemmas 4-6 will establish properties of a single iteration of the algorithm that will be useful for proving the main convergence theorem. In reading the proofs, Figure 2 may be helpful for conveying geometric intuition.

[Figure 2. Triangle characterizing the improvement for one iteration. µ̄^(i) ∈ M^(i); w^(i+1) = α(µ̂_E − µ̄^(i)); µ^(i+1) = µ(π^(i+1)), with π^(i+1) the optimal policy for the given MDP\R and R(s) = w^(i+1)·φ(s). µ̂_E is the estimate of the expert's feature expectations. µ_{ρ,i} is the point in the ρ-ball around µ̂_E closest to µ̄^(i). µ̃^(i+1) is the projection of µ_{ρ,i} onto the line through µ^(i+1) and µ̄^(i). A, B, C denote the three angles of the triangle; a, b, c denote the vectors of its sides, i.e. a = µ_{ρ,i} − µ^(i+1), b = µ̄^(i) − µ_{ρ,i}, c = µ̄^(i) − µ^(i+1).]

Lemma 4. Let µ̄^(i), µ̂_E ∈ R^k and ρ ∈ R with ||µ̂_E − µ̄^(i)||_2 ≥ ρ be given. Then the two following optimization problems

\[
\min_{\mu : \|\hat{\mu}_E - \mu\|_2 \le \rho} \|\mu - \bar{\mu}^{(i)}\|_2 \tag{10}
\]
\[
\min_{\mu : \|\hat{\mu}_E - \mu\|_2 \le \rho} \mu \cdot (\hat{\mu}_E - \bar{\mu}^{(i)}) \tag{11}
\]

have the same minimizing argument, which is given by

\[
\mu_{\rho,i} = \frac{\rho}{\|\hat{\mu}_E - \bar{\mu}^{(i)}\|_2}\, \bar{\mu}^{(i)} + \frac{\|\hat{\mu}_E - \bar{\mu}^{(i)}\|_2 - \rho}{\|\hat{\mu}_E - \bar{\mu}^{(i)}\|_2}\, \hat{\mu}_E . \tag{12}
\]

We also have that

\[
\exists \alpha > 0 \ \text{such that} \ \bar{\mu}^{(i)} - \mu_{\rho,i} = \alpha (\bar{\mu}^{(i)} - \hat{\mu}_E) . \tag{13}
\]

Proof. This can be verified by solving each of the problems, which can be done by forming the Lagrangian, taking derivatives and setting them to zero. The derivation is trivial but quite long, and is thus omitted. Equation (13) follows immediately from Equation (12). □

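The closed form (12) is easy to check numerically. The sketch below (assuming NumPy; all inputs are made-up examples) computes µ_{ρ,i} and compares it against random feasible points of problems (10) and (11).

```python
# Numeric sanity check of the closed-form minimizer in Equation (12).
# Assumes NumPy; mu_bar, mu_hat_E and rho are arbitrary made-up inputs.
import numpy as np

rng = np.random.default_rng(1)

def mu_rho(mu_bar, mu_hat_E, rho):
    """Closed-form minimizer of (10) and (11): the point of the rho-ball
    around mu_hat_E that is closest to mu_bar (Equation (12))."""
    d = np.linalg.norm(mu_hat_E - mu_bar)
    assert d >= rho
    return (rho / d) * mu_bar + ((d - rho) / d) * mu_hat_E

k, rho = 4, 0.3
mu_bar, mu_hat_E = rng.normal(size=k), rng.normal(size=k) + 2.0
m_star = mu_rho(mu_bar, mu_hat_E, rho)

# Random points of the feasible set {mu : ||mu_hat_E - mu||_2 <= rho}
u = rng.normal(size=(5000, k))
feasible = mu_hat_E + rho * rng.uniform(0, 1, (5000, 1)) ** (1 / k) * u / np.linalg.norm(u, axis=1, keepdims=True)

# m_star should attain the minimum of both objectives (up to sampling noise)
assert np.linalg.norm(m_star - mu_bar) <= np.linalg.norm(feasible - mu_bar, axis=1).min() + 1e-6
assert m_star @ (mu_hat_E - mu_bar) <= (feasible @ (mu_hat_E - mu_bar)).min() + 1e-6
```
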
Lemma 5. Let there be given an MDP\R, features φ : S → [0, 1]^k, a set of policies Π, µ̄^(i) ∈ M, and ρ ∈ R. Suppose ||µ̂_E − µ̄^(i)||_2 ≥ ρ, and that there exists some µ̄_E ∈ M such that ||µ̂_E − µ̄_E||_2 ≤ ρ. Let π^(i+1) be the optimal policy for the MDP\R with reward R(s) = (µ̂_E − µ̄^(i))·φ(s), and µ^(i+1) = µ(π^(i+1)). Further, let µ_{ρ,i} = arg min_{µ : ||µ̂_E − µ||_2 ≤ ρ} ||µ − µ̄^(i)||_2. Then we have that

\[
(\mu^{(i+1)} - \mu_{\rho,i}) \cdot (\bar{\mu}^{(i)} - \mu_{\rho,i}) \le 0 .
\]

Proof. Since µ^(i+1) = µ(π^(i+1)), and π^(i+1) is the optimal policy for the MDP\R with reward R(s) = (µ̂_E − µ̄^(i))·φ(s), we have

\[
\mu^{(i+1)} = \arg\max_{\mu \in M} (\hat{\mu}_E - \bar{\mu}^{(i)}) \cdot \mu .
\]

This implies

\[
(\hat{\mu}_E - \bar{\mu}^{(i)}) \cdot \mu^{(i+1)} \ge (\hat{\mu}_E - \bar{\mu}^{(i)}) \cdot \bar{\mu}_E \tag{14}
\]

since µ̄_E ∈ M. Using the equivalent definition of µ_{ρ,i} as given by Equation (11) in Lemma 4 and ||µ̄_E − µ̂_E||_2 ≤ ρ, we have (µ̂_E − µ̄^(i))·µ̄_E ≥ (µ̂_E − µ̄^(i))·µ_{ρ,i}, which combined with Equation (14) gives

\[
(\hat{\mu}_E - \bar{\mu}^{(i)}) \cdot \mu^{(i+1)} \ge (\hat{\mu}_E - \bar{\mu}^{(i)}) \cdot \mu_{\rho,i} .
\]

Simple manipulation gives

\[
(\bar{\mu}^{(i)} - \hat{\mu}_E) \cdot (\mu^{(i+1)} - \mu_{\rho,i}) \le 0 .
\]

By using Equation (13) of Lemma 4 we get the desired result

\[
(\bar{\mu}^{(i)} - \mu_{\rho,i}) \cdot (\mu^{(i+1)} - \mu_{\rho,i}) \le 0 .
\]

□

Lemma 6. Let there be given an MDP\R, features φ : S → [0, 1]^k, a set of policies Π, µ̄^(i) ∈ M, and ρ ∈ R. Suppose ||µ̂_E − µ̄^(i)||_2 ≥ ρ and there is some µ̄_E ∈ M such that ||µ̂_E − µ̄_E||_2 ≤ ρ. Let π^(i+1) be the optimal policy for the MDP\R with reward R(s) = (µ̂_E − µ̄^(i))·φ(s), and µ^(i+1) = µ(π^(i+1)). Further, let µ̃^(i+1) be the orthogonal projection of µ_{ρ,i} = arg min_{µ : ||µ̂_E − µ||_2 ≤ ρ} ||µ − µ̄^(i)||_2 onto the line passing through the points µ^(i+1) and µ̄^(i). We then have that

\[
\frac{\|\tilde{\mu}^{(i+1)} - \mu_{\rho,i}\|_2}{\|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2} \le \frac{\sqrt{k}}{\sqrt{k + (1-\gamma)^2 \|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2^2}} .
\]

Proof. Consider the triangle formed by the three points µ̄^(i), µ^(i+1) and µ_{ρ,i}, and name the sides and angles as in Figure 2. (Note that a, b, and c are vectors; see the figure caption.) Then we have

\[
\frac{\|\tilde{\mu}^{(i+1)} - \mu_{\rho,i}\|_2}{\|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2} = \frac{h}{\|b\|_2} \tag{15}
\]
\[
= \sin A \tag{16}
\]
\[
= \frac{\|a\|_2}{\|c\|_2} \sin C \tag{17}
\]
\[
\le \frac{\|a\|_2}{\sqrt{\|a\|_2^2 + \|b\|_2^2}} , \tag{18}
\]

where we used in order: the definition of b; the sine rule for right triangles; the sine rule for triangles [1]; and sin C ≤ 1 together with ||c||_2^2 = c^T c = (a+b)^T (a+b) = a^T a + b^T b + 2 a^T b ≥ ||a||_2^2 + ||b||_2^2, where the inequality follows from a^T b = −(µ^(i+1) − µ_{ρ,i})·(µ̄^(i) − µ_{ρ,i}) ≥ 0, which follows directly from Lemma 5. By taking the first derivative, we easily verify that ||a||_2 / √(||a||_2^2 + ||b||_2^2) is strictly increasing as a function of ||a||_2, and thus

\[
\frac{\|a\|_2}{\sqrt{\|a\|_2^2 + \|b\|_2^2}} \le \frac{\sqrt{k}}{\sqrt{k + (1-\gamma)^2 \|b\|_2^2}} ,
\]

since √k/(1−γ) is an upper bound on the distance between two points in M. This upper bound on the distance follows from φ ∈ [0, 1]^k, which implies µ ∈ (1/(1−γ)) [0, 1]^k. Combining this inequality with Equation (18) and b = µ̄^(i) − µ_{ρ,i} (by definition) gives the lemma. □

[1] For any triangle with sides a, b, c and opposite angles A, B, C, the sine rule for triangles states that sin A / ||a|| = sin B / ||b|| = sin C / ||c||.

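The contraction factor of Lemma 6 can be checked numerically. The sketch below (assuming NumPy; the sampled triangles are synthetic, not produced by an actual MDP solver) draws triples from [0, 1/(1−γ)]^k, keeps those satisfying the obtuse-angle condition guaranteed by Lemma 5, projects µ_{ρ,i} onto the line through µ̄^(i) and µ^(i+1), and compares the resulting ratio against the bound.

```python
# Numeric check of the bound in Lemma 6 on synthetic triangles.
# Assumes NumPy; points are sampled directly instead of coming from an MDP solver.
import numpy as np

rng = np.random.default_rng(2)
k, gamma = 6, 0.95
side = 1.0 / (1.0 - gamma)

checked = 0
while checked < 1000:
    mu_bar, mu_next, mu_rho = rng.uniform(0, side, (3, k))
    # Obtuse-angle condition at mu_rho, as guaranteed by Lemma 5
    if (mu_next - mu_rho) @ (mu_bar - mu_rho) > 0:
        continue
    # Orthogonal projection of mu_rho onto the line through mu_bar and mu_next
    d = mu_next - mu_bar
    mu_tilde = mu_bar + ((mu_rho - mu_bar) @ d) / (d @ d) * d
    b = np.linalg.norm(mu_bar - mu_rho)
    ratio = np.linalg.norm(mu_tilde - mu_rho) / b
    bound = np.sqrt(k) / np.sqrt(k + (1 - gamma) ** 2 * b ** 2)
    assert ratio <= bound + 1e-9
    checked += 1
```
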
The above lemma describes how, starting from feature expectations µ̄^(i) ∈ M, we can pick w = µ̂_E − µ̄^(i) such that the feature expectations µ^(i+1) of the corresponding optimal policy π^(i+1) and µ̄^(i) have a mixture µ̃^(i+1) satisfying ||µ̃^(i+1) − µ_{ρ,i}||_2 ≤ c ||µ̄^(i) − µ_{ρ,i}||_2, with c < 1. Convergence of each version of the algorithm is proved below, by showing how in each iteration we achieve the improvement from the lemma.

For simplicity, in Section 3 of the paper we gave the algorithm assuming we are given the exact feature expectations µ_E. In the general case, the algorithm will use an estimate µ̂_E instead, and convergence will be to a ball of radius ρ around µ̂_E, for any ρ ≥ min_{µ∈M} ||µ − µ̂_E||_2. So now we let

\[
t^{(i)}_{mm} = \min_{\nu \in M^{(i)},\ \mu : \|\hat{\mu}_E - \mu\|_2 \le \rho} \|\nu - \mu\|_2
\]

for the max-margin version, and

\[
t^{(i)}_{proj} = \min_{\nu \in Co\{\bar{\mu}^{(i)}, \mu^{(i+1)}\},\ \mu : \|\hat{\mu}_E - \mu\|_2 \le \rho} \|\nu - \mu\|_2
\]

for the projection version. Note that t^(i)_proj ≥ t^(i)_mm, because t^(i)_proj is defined using a minimization over a smaller domain (Co{µ̄^(i), µ^(i+1)} ⊆ M^(i+1)). Here, as before, we use Co Z to denote the convex hull of the set Z. Using our new definition of t^(i), we can now state a more general version of the convergence theorem; the theorem in the paper can be seen to be the special case where µ̂_E = µ_E, so that we can choose ρ = 0 and the old and new definitions of t^(i) coincide.

Theorem 1. Let an MDP\R, features φ : S → [0, 1]^k, any ε > 0 and any ρ ≥ min_{µ∈M} ||µ − µ̂_E||_2 be given. Then the apprenticeship learning algorithm (both max-margin and projection versions) will terminate with t^(i) ≤ ε after at most

\[
n = O\!\left( \frac{k}{(1-\gamma)^2 \varepsilon^2} \log \frac{k}{(1-\gamma)\varepsilon} \right) \tag{19}
\]

iterations.

Proof. We first show for both versions that we have geometric convergence at a rate √k / √(k + (1−γ)²ε²), and then use this to compute the required number of iterations.

In the max-margin version, the computation of w^(i) at every iteration is easily seen to be equivalent to setting w^(i) = µ̂_E − µ̄^(i), for µ̄^(i) = arg min_{µ ∈ M^(i)} ||µ̂_E − µ||_2. Obviously µ̄^(i) ∈ M^(i), as is required to apply Lemma 6 later. If we define µ_{ρ,i} = arg min_{µ : ||µ̂_E − µ||_2 ≤ ρ} ||µ − µ̄^(i)||_2 as in Lemma 6, then we see that

\[
t^{(i)}_{mm} = \|\mu_{\rho,i} - \bar{\mu}^{(i)}\|_2 . \tag{20}
\]

Define µ̃^(i+1) as in Lemma 6 and observe that µ̃^(i+1) ∈ M^(i+1); using the definition of t^(i+1)_mm we have

\[
t^{(i+1)}_{mm} \le \|\tilde{\mu}^{(i+1)} - \mu_{\rho,i}\|_2 . \tag{21}
\]

Combining Equations (20) and (21) gives

\[
\frac{t^{(i+1)}_{mm}}{t^{(i)}_{mm}} \le \frac{\|\tilde{\mu}^{(i+1)} - \mu_{\rho,i}\|_2}{\|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2} .
\]

Applying Lemma 6 gives

\[
\frac{t^{(i+1)}_{mm}}{t^{(i)}_{mm}} \le \frac{\sqrt{k}}{\sqrt{k + (1-\gamma)^2 \|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2^2}} .
\]

For the projection version, at every iteration we have µ̄^(i) = µ̃^(i) = arg min_{µ ∈ Aff{µ̄^(i−1), µ^(i)}} ||µ̂_E − µ||_2 = arg min_{µ ∈ Co{µ̄^(i−1), µ^(i)}} ||µ̂_E − µ||_2, so obviously µ̄^(i) ∈ M^(i), as is required to apply Lemma 6 later. Here, Aff Z denotes the affine hull of Z; specifically, if Z is a set of two points, then this is the line through these two points. [2] The first equality holds because in step 2 of our algorithm we compute µ̄^(i−1) exactly like we defined µ̃^(i−1). The second equality corresponds to the definition of orthogonal projection onto a line (and thus corresponds to our definition of µ̃^(i)); the last equality follows because of Lemma 5. [3] If we define µ_{ρ,i} = arg min_{µ : ||µ̂_E − µ||_2 ≤ ρ} ||µ − µ̄^(i)||_2 as in Lemma 6, then we see that

\[
t^{(i)}_{proj} = \|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2 . \tag{22}
\]

Define µ̃^(i+1) as in Lemma 6, and observe that µ̃^(i+1) ∈ M^(i+1); using the definition of t^(i+1)_proj we have

\[
t^{(i+1)}_{proj} \le \|\tilde{\mu}^{(i+1)} - \mu_{\rho,i}\|_2 . \tag{23}
\]

Combining Equations (22) and (23) gives

\[
\frac{t^{(i+1)}_{proj}}{t^{(i)}_{proj}} \le \frac{\|\tilde{\mu}^{(i+1)} - \mu_{\rho,i}\|_2}{\|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2} .
\]

Applying Lemma 6 gives

\[
\frac{t^{(i+1)}_{proj}}{t^{(i)}_{proj}} \le \frac{\sqrt{k}}{\sqrt{k + (1-\gamma)^2 \|\bar{\mu}^{(i)} - \mu_{\rho,i}\|_2^2}} .
\]

So in both cases we get the same guarantee for improvement in every iteration. Over the iterations this results in

\[
t^{(i)} \le \left( \frac{\sqrt{k}}{\sqrt{(1-\gamma)^2 \varepsilon^2 + k}} \right)^{\!i} t^{(0)} \le \left( \frac{\sqrt{k}}{\sqrt{(1-\gamma)^2 \varepsilon^2 + k}} \right)^{\!i} \frac{\sqrt{k}}{1-\gamma} ,
\]

where the last inequality follows from M ⊆ (1/(1−γ)) [0, 1]^k, so that √k/(1−γ) is an upper bound on the distance between any two points in M. So we have t^(i) ≤ ε as soon as

\[
\left( \frac{\sqrt{k}}{\sqrt{(1-\gamma)^2 \varepsilon^2 + k}} \right)^{\!i} \frac{\sqrt{k}}{1-\gamma} \le \varepsilon ,
\]

which is equivalent to

\[
i \ge \log\!\left( \frac{\sqrt{k}}{(1-\gamma)\varepsilon} \right) \Big/ \log\!\left( \frac{\sqrt{(1-\gamma)^2 \varepsilon^2 + k}}{\sqrt{k}} \right) = O\!\left( \frac{k}{(1-\gamma)^2 \varepsilon^2} \log \frac{k}{(1-\gamma)\varepsilon} \right) .
\]

□

[2] More formally, Aff Z = { Σ_i λ_i z_i : z_i ∈ Z, Σ_i λ_i = 1, λ_i ∈ R }.

[3] More precisely, Lemma 5 implies that the three points µ^(i+1), µ_{ρ,i}, µ̄^(i) form an obtuse angle at µ_{ρ,i}, which implies that the orthogonal projection of µ_{ρ,i} onto Aff{µ^(i+1), µ̄^(i)} falls into Co{µ^(i+1), µ̄^(i)}.

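The explicit (non-asymptotic) iteration count from the last display is easy to evaluate. A small sketch, assuming example values for k, γ and ε:

```python
# Smallest i satisfying the last display of the proof of Theorem 1.
# Assumes example values for k, gamma and epsilon.
import math

def iteration_bound(k, gamma, eps):
    rate_log = math.log(math.sqrt((1 - gamma) ** 2 * eps ** 2 + k) / math.sqrt(k))
    return math.ceil(math.log(math.sqrt(k) / ((1 - gamma) * eps)) / rate_log)

print(iteration_bound(k=4, gamma=0.9, eps=0.5))   # grows like k/((1-gamma)^2 eps^2) log(...)
```
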
Note that in practice, convergence might be much faster than predicted by the upper bound in the theorem, since we have an improvement by at least a factor of √k / √(k + (1−γ)² ||µ̄^(i) − µ_{ρ,i}||_2²) in every iteration, which is typically much better than the bound √k / √(k + (1−γ)²ε²) used in the proof.

Remark (using approximate RL algorithms). Sometimes, in each iteration of the algorithm we will be able to solve the MDP only approximately. It is straightforward to generalize our result to this setting. Assume on each iteration we can obtain an approximately optimal policy π such that ||µ(π) − µ(π*)||_2 ≤ ε_1, with π* the optimal policy for that iteration. Then for any ρ ≥ min_{µ∈M} ||µ − µ̂_E||_2 and any ε > 0 our algorithm will converge to a policy π̃ such that ||µ̂_E − µ(π̃)||_2 ≤ ρ + ε_1 + ε after at most the number of iterations given in Equation (19) of Theorem 1. This result is easily proved by changing the definition of the ρ-ball around µ̂_E to a (ρ + ε_1)-ball.

A.3. Sample Complexity

We now consider the number of samples from the expert that is required. In the paper we assumed µ̂_E ∈ M. Here we do not assume this, which leads to slightly different constant factors. The differences between this proof and the proof in the paper are the following: we take into account the possibility µ̂_E ∉ M, and the proof is a little less dense.

Theorem 2. Let an MDP\R, features φ : S → [0, 1]^k, and any ε > 0, δ > 0 be given. Suppose the apprenticeship learning algorithm (either max-margin or projection version) is run using an estimate µ̂_E for µ_E obtained by m Monte Carlo samples. In order to ensure that with probability at least 1 − δ the algorithm terminates after at most a number of iterations n given by Equation (19), and outputs a policy π̃ so that for any true reward R*(s) = w*^T φ(s) (||w*||_1 ≤ 1) [4] we have

\[
E\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\Big|\, \tilde{\pi}\Big] \ge E\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\Big|\, \pi_E\Big] - \varepsilon , \tag{24}
\]

it suffices that

\[
m \ge \frac{9k}{2(\varepsilon(1-\gamma))^2} \log \frac{2k}{\delta} .
\]

[4] The rationale for the 1-norm constraint on w and the ∞-norm constraint on φ is that these are dual norms, and thus we have w·φ(s) ≤ ||w||_1 ||φ(s)||_∞ (Hölder's inequality). So it corresponds to constraining the maximal reward max_s R(s) ≤ 1. Taking any two dual norms for w and φ would imply max_s R(s) ≤ 1. Taking 2-norm constraints on w and φ results in exactly the same theorem.

Proof. Recall φ ∈ [0, 1]^k, so µ ∈ [0, 1/(1−γ)]^k. Let µ_i denote the i'th component of µ; then applying the Chernoff bound to the m-sample estimate (1−γ)µ̂_i of (1−γ)µ_i ∈ [0, 1] gives

\[
P\big((1-\gamma)|\mu_i - \hat{\mu}_i| > \tau\big) \le 2 \exp(-2\tau^2 m) . \tag{25}
\]

Using Equation (25) for all components i and the union bound gives us

\[
P\big(\exists i \in \{1, \ldots, k\} .\ (1-\gamma)|\mu_i - \hat{\mu}_i| > \tau\big) \le 2k \exp(-2\tau^2 m) . \tag{26}
\]

Now we subtract both sides of Equation (26) from 1, to find that

\[
P\big(\neg \exists i \in \{1, \ldots, k\} .\ (1-\gamma)|\mu_i - \hat{\mu}_i| > \tau\big) \tag{27}
\]
\[
= P\big((1-\gamma)\|\mu_E - \hat{\mu}_E\|_\infty \le \tau\big) \tag{28}
\]
\[
\ge 1 - 2k \exp(-2\tau^2 m) . \tag{29}
\]

Substituting τ = ε(1−γ)/(3√k) into Equation (29) gives

\[
P\Big( \|\mu_E - \hat{\mu}_E\|_\infty \le \frac{\varepsilon}{3\sqrt{k}} \Big) \ge 1 - 2k \exp\Big(\!-2\Big(\frac{\varepsilon(1-\gamma)}{3\sqrt{k}}\Big)^2 m\Big) , \tag{30}
\]

where k is the dimension of the feature vectors φ and feature expectations µ, and m is the number of sample trajectories used for the estimate µ̂_E. So if we take m ≥ (9k / (2(ε(1−γ))²)) log(2k/δ), then with probability at least 1 − δ we have that

\[
\|\mu_E - \hat{\mu}_E\|_\infty \le \frac{\varepsilon}{3\sqrt{k}} .
\]

Using ||·||_2 ≤ √k ||·||_∞ for k-dimensional space, we get

\[
\|\mu_E - \hat{\mu}_E\|_2 \le \frac{\varepsilon}{3} . \tag{31}
\]

Since µ_E ∈ M, Theorem 1 together with Equation (31) guarantees convergence to the ball of radius ρ = ε/3 around µ̂_E. After sufficiently many iterations of the algorithm, as specified in Equation (19), we have t^(i) ≤ ε/3, and thus the algorithm will return a policy π̃ such that µ̃ = µ(π̃) satisfies

\[
\|\tilde{\mu} - \hat{\mu}_E\|_2 \le t^{(i)} + \rho \le \frac{2\varepsilon}{3} . \tag{32}
\]

Now (keeping in mind that ||·||_2 ≤ ||·||_1, so that ||w||_1 ≤ 1 implies ||w||_2 ≤ 1) we can easily prove the result:

\[
\Big| E\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s^{(t)}) \,\Big|\, \tilde{\pi}\Big] - E\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s^{(t)}) \,\Big|\, \pi_E\Big] \Big| \tag{33}
\]
\[
= |(w^*)^T (\tilde{\mu} - \mu_E)|
\]
\[
= |(w^*)^T (\tilde{\mu} - \hat{\mu}_E + \hat{\mu}_E - \mu_E)|
\]
\[
\le |(w^*)^T (\tilde{\mu} - \hat{\mu}_E)| + |(w^*)^T (\hat{\mu}_E - \mu_E)|
\]
\[
\le \|w^*\|_2 \|\tilde{\mu} - \hat{\mu}_E\|_2 + \|w^*\|_2 \|\hat{\mu}_E - \mu_E\|_2
\]
\[
\le 1 \cdot \Big(\frac{2\varepsilon}{3} + \frac{\varepsilon}{3}\Big) \quad \text{w.p. } 1-\delta \ \text{ for } m \ge \frac{9k}{2(\varepsilon(1-\gamma))^2} \log \frac{2k}{\delta}
\]
\[
= \varepsilon \quad \text{w.p. } 1-\delta \ \text{ for } m \ge \frac{9k}{2(\varepsilon(1-\gamma))^2} \log \frac{2k}{\delta} ,
\]

where we used in order: the definitions of w and µ; adding and subtracting the same terms; the triangle inequality; Hölder's inequality for p = q = 2 [5]; Equations (31) and (32); and simplification. The last line directly implies the theorem. □

[5] Hölder's inequality states that for any p ≥ 1, q ≥ 1 with 1/p + 1/q = 1 and any x, y in some vector space, we have x·y ≤ ||x||_p ||y||_q.

Note that in case the underlying reward function R* does not lie exactly in the span of the basis functions, we have graceful degradation of performance. Let R*(s) = w*·φ(s) + f(s); then it is easy to see from Equation (33) in the proof above that the performance degradation is bounded by ||f||_∞ / (1−γ).

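The sample-size bound of Theorem 2 is straightforward to evaluate, and the concentration step behind it can be checked empirically. A minimal sketch, assuming NumPy and made-up values of k, ε, δ and γ (the Bernoulli component below is only a stand-in for one scaled coordinate (1−γ)µ_i):

```python
# Number of expert trajectories sufficient for Theorem 2, plus an empirical
# check of the concentration step (25) on a single synthetic component.
# Assumes NumPy; k, eps, delta, gamma and the Bernoulli mean are made up.
import math
import numpy as np

def samples_needed(k, eps, delta, gamma):
    return math.ceil(9 * k / (2 * (eps * (1 - gamma)) ** 2) * math.log(2 * k / delta))

k, eps, delta, gamma = 8, 0.1, 0.05, 0.9
m = samples_needed(k, eps, delta, gamma)
print("m >=", m)

# Empirical check of P((1-gamma)|mu_i - muhat_i| > tau) <= 2 exp(-2 tau^2 m)
rng = np.random.default_rng(3)
tau = eps * (1 - gamma) / (3 * math.sqrt(k))
p_true = 0.3                                   # stand-in for (1-gamma) * mu_i
estimates = rng.binomial(m, p_true, size=2000) / m
freq = np.mean(np.abs(estimates - p_true) > tau)
print(freq, "<=", 2 * math.exp(-2 * tau ** 2 * m))
```
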
B. Alternative Interpretation of the Projection Algorithm

It is well known that MDPs can be solved via linear programming, more specifically by solving the Bellman LP

\[
\min_V \ e^T V \tag{34}
\]
\[
\text{s.t.} \quad \forall s, a \quad V(s) \ge R(s) + \gamma \sum_z P(z|s,a) V(z) ,
\]

where e is an |S|-dimensional vector of all ones. Although this LP is generally too big to solve exactly, there has been recent work on using this LP formulation to get approximate solutions (de Farias & Van Roy, 2003; Guestrin et al., 2003). The dual of this LP is

\[
\max_\lambda \ \sum_{s,a} \lambda(s,a) R(s) \tag{35}
\]
\[
\text{s.t.} \quad \forall s \quad \sum_a \lambda(s,a) - \gamma \sum_{z,a} P(s|z,a) \lambda(z,a) = 1 .
\]

The entry λ(s,a) represents the expected frequency of the state-action pair (s,a); the constraints ensure that λ is consistent with the transition probabilities of the MDP. So we can write the feature expectations of the policy specified by λ explicitly as a function of λ:

\[
(\mu(\lambda))_k = \sum_{s,a} \lambda(s,a) \phi_k(s) . \tag{36}
\]

We can explicitly formulate the problem of matching the expert's feature expectations µ_E as a QP:

\[
\min_\lambda \ \sum_k \Big( \mu_{E,k} - \sum_{s,a} \lambda(s,a) \phi_k(s) \Big)^2 \tag{37}
\]
\[
\text{s.t.} \quad \forall s \quad \sum_a \lambda(s,a) - \gamma \sum_{z,a} P(s|z,a) \lambda(z,a) = 1 .
\]
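For a small MDP, the dual LP (35) (together with the nonnegativity constraint λ ≥ 0 that the dual carries) can be solved directly, and Equation (36) then yields the feature expectations of the corresponding policy. A minimal sketch, assuming NumPy and SciPy and a made-up random MDP:

```python
# Solve the dual Bellman LP (35) for a small random MDP and recover the
# feature expectations via Equation (36). Assumes NumPy and SciPy; the MDP
# (transitions P, reward R, features phi) is made up for illustration.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
S, A, k, gamma = 5, 3, 4, 0.9

P = rng.uniform(size=(S, S, A))          # P[z_next, z, a] = P(z_next | z, a)
P /= P.sum(axis=0, keepdims=True)
phi = rng.uniform(size=(S, k))           # features phi(s) in [0, 1]^k
R = rng.uniform(size=S)                  # some fixed reward R(s)

# Variables lambda(s, a), flattened with index s * A + a.
c = -np.repeat(R, A)                     # maximize sum_{s,a} lambda(s,a) R(s)
A_eq = np.zeros((S, S * A))
for s in range(S):
    for z in range(S):
        for a in range(A):
            A_eq[s, z * A + a] = (1.0 if z == s else 0.0) - gamma * P[s, z, a]
b_eq = np.ones(S)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
lam = res.x.reshape(S, A)                # expected state-action frequencies
mu = phi.T @ lam.sum(axis=1)             # feature expectations, Equation (36)
print(mu)
```

Choosing R(s) = (µ_E − µ^(i))·φ(s) in this LP is exactly the linearized step discussed next.
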

In practice, the above QP is typically too large to solve exactly, just like the Bellman LP and its dual. We will now derive an algorithm to solve the above QP, assuming we have access to a reinforcement learning algorithm, i.e., we assume we can get a solution to the Bellman LP (and/or its dual). The idea is to linearize the objective of (37) at the point of its current iterate, and then use the RL algorithm to solve the corresponding LP. Then the algorithm does a line search between the current iterate's point and the solution to the LP (which is feasible, but not necessarily optimal for the QP). Then we linearize around the obtained point and iterate the above steps. [6] The linearized objective at a point λ^(i) with corresponding feature expectations µ^(i) is easily computed to be

\[
\sum_{s,a} \lambda(s,a) \sum_k \big( \mu_{E,k} - \mu^{(i)}_k \big) \phi_k(s) . \tag{38}
\]

So the corresponding LP is the dual of a Bellman LP, with reward R(s) = Σ_k (µ_{E,k} − µ^(i)_k) φ_k(s) = (µ_E − µ^(i))·φ(s). It is easily seen that the above algorithm corresponds to the projection algorithm: the projection corresponds to the line search, and choosing a reward weight vector w and finding the respective optimal policy corresponds to linearizing and solving the obtained LP. A sketch of the resulting loop is given below.

[6] Note that our algorithm is an instantiation of the so-called Frank-Wolfe algorithm (Censor & Zenios, 1997).

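The following sketch shows one way the loop could look. It is a hypothetical illustration, not the authors' implementation: solve_mdp(w) stands for any RL routine that returns the feature expectations of an optimal policy for reward R(s) = w·φ(s) (for instance the LP of the previous sketch), and mu_E_hat is the estimated expert feature expectations.

```python
# Hypothetical sketch of the projection (Frank-Wolfe-style) loop described above.
# solve_mdp(w) is an assumed RL oracle returning mu(pi) for reward weights w;
# mu_E_hat is the (estimated) expert feature expectations.
import numpy as np

def projection_algorithm(mu_E_hat, solve_mdp, n_iters=50):
    mu_bar = solve_mdp(mu_E_hat)              # initialize with some policy's feature expectations
    for _ in range(n_iters):
        w = mu_E_hat - mu_bar                 # linearization: reward weights for this iteration
        mu_next = solve_mdp(w)                # RL step: optimal feature expectations for R(s) = w . phi(s)
        d = mu_next - mu_bar
        if d @ d < 1e-12:
            break
        # Line search: orthogonal projection of mu_E_hat onto the segment [mu_bar, mu_next]
        lam = np.clip((mu_E_hat - mu_bar) @ d / (d @ d), 0.0, 1.0)
        mu_bar = mu_bar + lam * d
    return mu_bar
```
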

References

Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. Proc. ICML.

Censor, Y., & Zenios, S. (1997). Parallel optimization: Theory, algorithms, and applications. Oxford University Press.

de Farias, D. P., & Van Roy, B. (2003). The linear programming approach to approximate dynamic programming. Operations Research, 51(6).

Guestrin, C., Koller, D., Parr, R., & Venkataraman, S. (2003). Efficient solution algorithms for factored MDPs. JAIR, 19, 399-468.