Deep Inverse Optimal Control Via Policy Optimization
Figure 1. Overview of guided cost learning: human demonstrations Ddemo and samples Dtraj, obtained by running the current policy qi on the robot, are accumulated into Dsamp and used to update the cost.
our method must be able to handle unknown dynamics and high-dimensional systems. To that end, we propose a cost learning algorithm based on policy optimization with local linear models, building on prior work in reinforcement learning (Levine & Abbeel, 2014). In this approach, as illustrated in Figure 1, the cost function is learned in the inner loop of a policy search procedure, using samples collected for policy improvement to also update the cost function. The cost learning method itself is a nonlinear generalization of maximum entropy IOC (Ziebart et al., 2008), with samples used to approximate the partition function. In contrast to previous work that optimizes the policy in the inner loop of cost learning, our approach instead updates the cost in the inner loop of policy search, making it practical and efficient. One of the benefits of this approach is that we can couple learning the cost with learning the policy for that cost. For tasks that are too complex to acquire a good global cost function from a small number of demonstrations, our method can still recover effective behaviors by running our policy learning method and retaining the learned policy. We elaborate on this further in Section 4.4.

The main contribution of our work is an algorithm that learns nonlinear cost functions from user demonstrations, at the same time as learning a policy to perform the task. Since the policy optimization "guides" the cost toward good regions of the space, we call this method guided cost learning. Unlike prior methods, our algorithm can handle complex, nonlinear cost function representations and high-dimensional unknown dynamics, and can be used on real physical systems with a modest number of samples. Our evaluation demonstrates the performance of our method on a set of simulated benchmark tasks, showing that it outperforms previous methods. We also evaluate our method on two real-world tasks learned directly from human demonstrations. These tasks require using torque control and vision to perform a variety of robotic manipulation behaviors, without any hand-specified cost features.

2. Related Work

One of the basic challenges in inverse optimal control (IOC), also known as inverse reinforcement learning (IRL), is that finding a cost or reward under which a set of demonstrations is near-optimal is underdefined. Many different costs can explain a given set of demonstrations. Prior work has tackled this issue using maximum margin formulations (Abbeel & Ng, 2004; Ratliff et al., 2006), as well as probabilistic models that explain suboptimal behavior as noise (Ramachandran & Amir, 2007; Ziebart et al., 2008). We take the latter approach in this work, building on the maximum entropy IOC model (Ziebart, 2010). Although the probabilistic model mitigates some of the difficulties with IOC, there is still a great deal of ambiguity, and an important component of many prior methods is the inclusion of detailed features created using domain knowledge, which can be linearly combined into a cost, including: indicators for successful completion of the task for a robotic ball-in-cup game (Boularias et al., 2011), learning table tennis with features that include distance of the ball to the opponent's elbow (Muelling et al., 2014), providing the goal position as a known constraint for robotic grasping (Doerr et al., 2015), and learning highway driving with indicators for collision and driving on the grass (Abbeel & Ng, 2004). While these features allow the user to impose structure on the cost, they substantially increase the engineering burden. Several methods have proposed to learn nonlinear costs using Gaussian processes (Levine et al., 2011) and boosting (Ratliff et al., 2007; 2009), but even these methods generally operate on features rather than raw states. We instead use rich, expressive function approximators, in the form of neural networks, to learn cost functions directly on raw state representations. While neural network cost representations have previously been proposed in the literature (Wulfmeier et al., 2015), they have only been applied to small, synthetic domains. Previous work has also suggested simple regularization methods for cost functions, based on minimizing ℓ1 or ℓ2 norms of the parameter vector (Ziebart, 2010; Kalakrishnan et al., 2013) or by using unlabeled trajectories (Audiffren et al., 2015). When using expressive function approximators in complex real-world tasks, we must design substantially more powerful regularization techniques to mitigate the underspecified nature of the problem, which we introduce in Section 5.

Another challenge in IOC is that, in order to determine the quality of a given cost function, we must solve some variant of the forward control problem to obtain the corresponding policy, and then compare this policy to the demonstrated actions. Most early IRL algorithms required solving an MDP in the inner loop of an iterative optimization (Ng et al., 2000; Abbeel & Ng, 2004; Ziebart et al., 2008). This requires perfect knowledge of the system dynamics and access to an efficient offline solver, neither of which is available in, for instance, complex robotic control domains. Several works have proposed to relax this requirement, for example by learning a value function instead of a cost (Todorov, 2006), solving an approximate local control problem (Levine & Koltun, 2012; Dragan & Srinivasa, 2012), generating a discrete graph of states (Byravan et al., 2015; Monfort et al., 2015), or only obtaining an optimal trajectory rather than a policy (Ratliff et al., 2006; 2009). However, these methods still require knowledge of the system dynamics. Given the size and complexity of the problems addressed in this work, solving the optimal control problem even approximately in the inner loop of the cost optimization is impractical. We show that good cost functions can be learned by instead learning the cost in the inner loop of a policy optimization.
Our inverse optimal control algorithm is most closely related to other previous sample-based methods based on the principle of maximum entropy, including relative entropy IRL (Boularias et al., 2011) and path integral IRL (Kalakrishnan et al., 2013), which can also handle unknown dynamics. However, unlike these prior methods, we adapt the sampling distribution using policy optimization. We demonstrate in our experiments that this adaptation is crucial for obtaining good results on complex robotic platforms, particularly when using complex, nonlinear cost functions.

To summarize, our proposed method is the first to combine several desirable features into a single, effective algorithm: it can handle unknown dynamics, which is crucial for real-world robotic tasks, it can deal with high-dimensional, complex systems, as in the case of real torque-controlled robotic arms, and it can learn complex, expressive cost functions, such as multilayer neural networks, which removes the requirement for meticulous hand-engineering of cost features. While some prior methods have shown good results with unknown dynamics on real robots (Boularias et al., 2011; Kalakrishnan et al., 2013) and some have proposed using nonlinear cost functions (Ratliff et al., 2006; 2009; Levine et al., 2011), to our knowledge no prior method has been demonstrated that can provide all of these benefits in the context of complex real-world tasks.

3. Preliminaries and Overview

We build on the probabilistic maximum entropy inverse optimal control framework (Ziebart et al., 2008). The demonstrated behavior is assumed to be the result of an expert acting stochastically and near-optimally with respect to an unknown cost function. Specifically, the model assumes that the expert samples the demonstrated trajectories {τi} from the distribution

p(τ) = (1/Z) exp(−cθ(τ)),        (1)

where τ = {x1, u1, . . . , xT, uT} is a trajectory sample, cθ(τ) = Σ_t cθ(xt, ut) is an unknown cost function parameterized by θ, and xt and ut are the state and action at time step t. Under this model, the expert is most likely to act optimally, and can generate suboptimal trajectories with a probability that decreases exponentially as the trajectories become more costly. The partition function Z is difficult to compute for large or continuous domains, and presents the main computational challenge in maximum entropy IOC. The first applications of this model computed Z exactly with dynamic programming (Ziebart et al., 2008). However, this is only practical in small, discrete domains. More recent methods have proposed to estimate Z by using the Laplace approximation (Levine & Koltun, 2012), value function approximation (Huang & Kitani, 2014), and samples (Boularias et al., 2011). As discussed in Section 4.1, we take the sample-based approach in this work, because it allows us to perform inverse optimal control without a known model of the system dynamics. This is especially important in robotic manipulation domains, where the robot might interact with a variety of objects with unknown physical properties.
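To make the model in Equation (1) and the sample-based treatment of Z concrete, the following toy sketch (not taken from the paper; the one-dimensional trajectories, quadratic cost, and Gaussian background distribution are all illustrative assumptions) evaluates the maximum entropy distribution over a handful of candidate trajectories and estimates the partition function by importance sampling from a background distribution q.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D "trajectories": T-step state sequences; the cost is squared distance to a goal.
T, goal = 5, 1.0
def traj_cost(traj):
    # c_theta(tau) = sum_t c_theta(x_t); here a simple quadratic state cost.
    return np.sum((traj - goal) ** 2)

# Exact MaxEnt probabilities on a small discrete set of candidate trajectories.
candidates = [np.full(T, v) for v in (-1.0, 0.0, 0.5, 1.0)]
costs = np.array([traj_cost(tau) for tau in candidates])
p = np.exp(-costs) / np.sum(np.exp(-costs))   # Equation (1): p(tau) = exp(-c(tau)) / Z
print("exact p(tau):", np.round(p, 3))        # lower-cost trajectories are exponentially preferred

# In continuous spaces Z is an integral; estimate it with samples from a background q(tau).
# Here each sampled "trajectory" is a constant sequence parameterized by a scalar drawn from q = N(0, 1).
M = 10000
samples = rng.normal(0.0, 1.0, size=M)
sample_costs = np.array([traj_cost(np.full(T, s)) for s in samples])
q_density = (1.0 / np.sqrt(2 * np.pi)) * np.exp(-0.5 * samples ** 2)
Z_est = np.mean(np.exp(-sample_costs) / q_density)   # importance-sampled partition function
print("importance-sampled Z estimate:", Z_est)
```

The same importance-sampling idea, applied to trajectories generated on the real system, underlies the objective derived in Section 4.1.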
To represent the cost function cθ(xt, ut), IOC or IRL methods typically use a linear combination of hand-crafted features, given by cθ(xt, ut) = θ^T f(xt, ut) (Abbeel & Ng, 2004). This representation is difficult to apply to more complex domains, especially when the cost must be computed from raw sensory input. In this work, we explore the use of high-dimensional, expressive function approximators for representing cθ(xt, ut). As we discuss in Section 6, we use neural networks that operate directly on the robot's state, though other parameterizations could also be used with our method. Complex representations are generally considered to be poorly suited for IOC, since learning costs that associate the right element of the state with the goal of the task is already quite difficult even with simple linear representations. However, as we discuss in our evaluation, we found that such representations could be learned effectively by adaptively generating samples as part of a policy optimization procedure, as discussed in Section 4.2.

4. Guided Cost Learning

In this section, we describe the guided cost learning algorithm, which combines sample-based maximum entropy IOC with forward reinforcement learning using time-varying linear models. The central idea behind this method is to adapt the sampling distribution to match the maximum entropy cost distribution p(τ) = (1/Z) exp(−cθ(τ)), by directly optimizing a trajectory distribution with respect to the current cost cθ(τ) using a sample-efficient reinforcement learning algorithm. Samples generated on the physical system are used both to improve the policy and more accurately estimate the partition function Z. In this way, the reinforcement learning step acts to "guide" the sampling distribution toward regions where the samples are more useful for estimating the partition function. We will first describe how the IOC objective in Equation (1) can be estimated with samples, and then describe how reinforcement learning can adapt the sampling distribution.

4.1. Sample-Based Inverse Optimal Control

In the sample-based approach to maximum entropy IOC, the partition function Z = ∫ exp(−cθ(τ)) dτ is estimated with samples from a background distribution q(τ). Prior sample-based IOC methods use a linear representation of the cost function, which simplifies the corresponding cost learning problem (Boularias et al., 2011; Kalakrishnan et al., 2013). In this section, we instead derive a sample-based approximation for the IOC objective for a general nonlinear parameterization of the cost function.
The negative log-likelihood corresponding to the IOC model in Equation (1) is given by:

LIOC(θ) = (1/N) Σ_{τi ∈ Ddemo} cθ(τi) + log Z
        ≈ (1/N) Σ_{τi ∈ Ddemo} cθ(τi) + log (1/M) Σ_{τj ∈ Dsamp} exp(−cθ(τj)) / q(τj),

where Ddemo denotes the set of N demonstrated trajectories, Dsamp the set of M background samples, and q denotes the background distribution from which the trajectories τj were sampled. Prior methods have chosen q to be uniform (Boularias et al., 2011) or to lie in the vicinity of the demonstrations (Kalakrishnan et al., 2013). To compute the gradients of this objective with respect to the cost parameters θ, let wj = exp(−cθ(τj)) / q(τj) and Z = Σ_j wj. The gradient is then given by:

dLIOC/dθ = (1/N) Σ_{τi ∈ Ddemo} dcθ(τi)/dθ − (1/Z) Σ_{τj ∈ Dsamp} wj dcθ(τj)/dθ.

When the cost is represented by a neural network or some other function approximator, this gradient can be computed efficiently by backpropagating −wj/Z for each trajectory τj ∈ Dsamp and 1/N for each trajectory τi ∈ Ddemo.
4.2. Adaptive Sampling via Policy Optimization

The choice of background sample distribution q(τ) for estimating the objective LIOC is critical for successfully applying the sample-based IOC algorithm. The optimal importance sampling distribution for estimating the partition function ∫ exp(−cθ(τ)) dτ is q(τ) ∝ |exp(−cθ(τ))| = exp(−cθ(τ)). Designing a single background distribution q(τ) is therefore quite difficult when the cost cθ is unknown. Instead, we can adaptively refine q(τ) to generate more samples in those regions of the trajectory space that are good according to the current cost function cθ(τ). To this end, we interleave the IOC optimization, which attempts to find the cost function that maximizes the likelihood of the demonstrations, with a policy optimization procedure, which improves the trajectory distribution q(τ) with respect to the current cost.

Since one of the main advantages of the sample-based IOC approach is the ability to handle unknown dynamics, we must also choose a policy optimization procedure that can handle unknown dynamics. To this end, we adapt the method presented by Levine & Abbeel (2014), which performs policy optimization under unknown dynamics by iteratively fitting time-varying linear dynamics to samples from the current trajectory distribution q(τ), updating the trajectory distribution using a modified LQR backward pass, and generating more samples for the next iteration. The trajectory distributions generated by this method are Gaussian, and each iteration of the policy optimization procedure satisfies a KL-divergence constraint of the form DKL(q(τ) ‖ q̂(τ)) ≤ ε, which prevents the policy from changing too rapidly (Bagnell & Schneider, 2003; Peters et al., 2010; Rawlik & Vijayakumar, 2013). This has the additional benefit of not overfitting to poor initial estimates of the cost function. With a small modification, we can use this algorithm to optimize a maximum entropy version of the objective, given by min_q Eq[cθ(τ)] − H(τ), as discussed in prior work (Levine & Abbeel, 2014). This variant of the algorithm allows us to recover the trajectory distribution q(τ) ∝ exp(−cθ(τ)) at convergence (Ziebart, 2010), a good distribution for sampling. For completeness, this policy optimization procedure is summarized in Appendix A.

Our sample-based IOC algorithm with adaptive sampling is summarized in Algorithm 1. We call this method guided cost learning because policy optimization is used to guide sampling toward regions with lower cost. The algorithm consists of taking successive policy optimization steps, each of which generates samples Dtraj from the latest trajectory distribution qk(τ). After sampling, the cost function is updated using all samples collected thus far for the purpose of policy optimization. No additional background samples are required for this method. This procedure returns both a learned cost function cθ(xt, ut) and a trajectory distribution q(τ), which corresponds to a time-varying linear-Gaussian controller q(ut | xt). This controller can be used to execute the learned behavior.

Algorithm 1 Guided cost learning
1: Initialize qk(τ) as either a random initial controller or from demonstrations
2: for iteration i = 1 to I do
3:   Generate samples Dtraj from qk(τ)
4:   Append samples: Dsamp ← Dsamp ∪ Dtraj
5:   Use Dsamp to update cost cθ using Algorithm 2
6:   Update qk(τ) using Dtraj and the method from (Levine & Abbeel, 2014) to obtain qk+1(τ)
7: end for
8: return optimized cost parameters θ and trajectory distribution q(τ)
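A minimal sketch of this outer loop in code is given below. The callables rollout, update_cost, and improve_controller are hypothetical stand-ins for sampling on the physical system, the nonlinear IOC update of Algorithm 2, and the policy optimization method of Levine & Abbeel (2014), respectively.

```python
def guided_cost_learning(demos, init_controller, rollout, update_cost, improve_controller,
                         num_iterations=10, samples_per_iter=5):
    """Schematic outer loop of Algorithm 1 (guided cost learning).

    rollout(controller)                            -> one trajectory sampled on the system
    update_cost(demos, d_samp)                     -> updated cost (Algorithm 2)
    improve_controller(controller, d_traj, cost)   -> next controller q_{k+1}
    """
    controller, cost, d_samp = init_controller, None, []
    for _ in range(num_iterations):
        d_traj = [rollout(controller) for _ in range(samples_per_iter)]   # D_traj ~ q_k
        d_samp.extend(d_traj)                                             # D_samp <- D_samp U D_traj
        cost = update_cost(demos, d_samp)                                 # cost update in the inner loop
        controller = improve_controller(controller, d_traj, cost)         # policy step (Levine & Abbeel, 2014)
    return cost, controller
```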
4.3. Cost Optimization and Importance Weights

The IOC objective can be optimized using standard nonlinear optimization methods and the gradient dLIOC/dθ. Stochastic gradient methods are often preferred for high-dimensional function approximators, such as neural networks. Such methods are straightforward to apply to objectives that factorize over the training samples, but the partition function does not factorize trivially in this way.
Nonetheless, we found that our objective could still be optimized with stochastic gradient methods by sampling a subset of the demonstrations and background samples at each iteration. When the number of samples in the batch is small, we found it necessary to add the sampled demonstrations to the background sample set as well; without adding the demonstrations to the sample set, the objective can become unbounded, and frequently does in practice. The stochastic optimization procedure is presented in Algorithm 2, and is straightforward to implement with most neural network libraries based on backpropagation.

Estimating the partition function requires us to use importance sampling. Although prior work has suggested dropping the importance weights (Kalakrishnan et al., 2013; Aghasadeghi & Bretl, 2011), we show in Appendix B that this produces an inconsistent likelihood estimate and fails to recover good cost functions. Since our samples are drawn from multiple distributions, we compute a fusion distribution to evaluate the importance weights. Specifically, if we have samples from k distributions q1(τ), . . . , qk(τ), we can construct a consistent estimator of the expectation of a function f(τ) under a uniform distribution as

E[f(τ)] ≈ (1/M) Σ_{τj} [ (1/k) Σ_κ qκ(τj) ]⁻¹ f(τj).

Accordingly, the importance weights are zj = [ (1/k) Σ_κ qκ(τj) ]⁻¹, and the objective is now:

LIOC(θ) = (1/N) Σ_{τi ∈ Ddemo} cθ(τi) + log (1/M) Σ_{τj ∈ Dsamp} zj exp(−cθ(τj)).
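Assuming the log-density of every sampling distribution can be evaluated at every sample, the fusion weights zj and the resulting objective can be computed as in the following sketch (an illustrative reimplementation, not the authors' code):

```python
import numpy as np

def fusion_importance_weights(log_qs):
    """Importance weights under the fusion distribution (Section 4.3).

    log_qs -- array of shape (k, M) with log q_kappa(tau_j) for each of the k
              sampling distributions evaluated at each of the M samples.
    Returns z_j = [ (1/k) * sum_kappa q_kappa(tau_j) ]^(-1) for each sample.
    """
    k = log_qs.shape[0]
    # Log of the fusion density (1/k) * sum_kappa q_kappa(tau_j), computed stably.
    log_fusion = np.logaddexp.reduce(log_qs, axis=0) - np.log(k)
    return np.exp(-log_fusion)

def ioc_objective_with_fusion(demo_costs, sample_costs, log_qs):
    """L_IOC(theta) = (1/N) sum_i c(tau_i) + log (1/M) sum_j z_j exp(-c(tau_j))."""
    z = fusion_importance_weights(log_qs)
    M = sample_costs.shape[0]
    log_partition = np.logaddexp.reduce(np.log(z) - sample_costs) - np.log(M)
    return demo_costs.mean() + log_partition
```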
The distributions qκ underlying background samples are obtained from the controller at iteration k. We must also append the demonstrations to the samples in Algorithm 2, yet the distribution that generated the demonstrations is unknown. To estimate it, we assume the demonstrations come from a single Gaussian trajectory distribution and compute its empirical mean and variance. We found this approximation sufficiently accurate for estimating the importance weights of the demonstrations, as shown in Appendix B.

Algorithm 2 Nonlinear IOC with stochastic gradients
1: for iteration k = 1 to K do
2:   Sample demonstration batch D̂demo ⊂ Ddemo
3:   Sample background batch D̂samp ⊂ Dsamp
4:   Append demonstration batch to background batch: D̂samp ← D̂demo ∪ D̂samp
5:   Estimate dLIOC/dθ(θ) using D̂demo and D̂samp
6:   Update parameters θ using gradient dLIOC/dθ(θ)
7: end for
8: return optimized cost parameters θ
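For concreteness, the minibatch procedure of Algorithm 2 might be implemented along the following lines; ioc_grad is a hypothetical callable that returns the gradient dLIOC/dθ estimated on a batch (for example, built from the weights derived in Sections 4.1 and 4.3), and plain gradient descent stands in for whatever optimizer is preferred.

```python
import numpy as np

def stochastic_ioc(theta, demos, samples, ioc_grad,
                   num_iterations=1000, batch_size=16, step_size=1e-3, seed=0):
    """Nonlinear IOC with stochastic gradients (Algorithm 2), schematically."""
    rng = np.random.default_rng(seed)
    for _ in range(num_iterations):
        # Sample a demonstration batch and a background batch.
        demo_batch = [demos[i] for i in rng.integers(len(demos), size=batch_size)]
        samp_batch = [samples[i] for i in rng.integers(len(samples), size=batch_size)]

        # Append the demonstration batch to the background batch; without this,
        # the objective can become unbounded for small batches.
        samp_batch = demo_batch + samp_batch

        # Estimate dL_IOC/dtheta on the batch and take a gradient step.
        grad = ioc_grad(theta, demo_batch, samp_batch)
        theta = theta - step_size * grad
    return theta
```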
4.4. Learning Costs and Controllers

In contrast to many previous IOC and IRL methods, our approach can be used to learn a cost while simultaneously optimizing the policy for a new instance of the task not in the demos, such as a new position of a target cup for a pouring task, as shown in our experiments. Since the algorithm produces both a cost function cθ(xt, ut) and a controller q(ut | xt) that optimizes this cost on the new task instance, we can directly use this controller to execute the desired behavior. In this way, the method actually learns a policy from demonstration, using the additional knowledge that the demonstrations are near-optimal under some unknown cost function, similar to recent work on IOC by direct loss minimization (Doerr et al., 2015). The learned cost function cθ(xt, ut) can often also be used to optimize new policies for new instances of the task without additional cost learning. However, we found that on the most challenging tasks we tested, running policy learning with IOC in the loop for each new task instance typically succeeded more frequently than running IOC once and reusing the learned cost. We hypothesize that this is because training the policy on a new instance of the task provides the algorithm with additional information about task variation, thus producing a better cost function and reducing overfitting. The intuition behind this hypothesis is that the demonstrations only cover a small portion of the degrees of variation in the task. Observing samples from a new task instance provides the algorithm with a better idea of the particular factors that distinguish successful task executions from failures.

5. Representation and Regularization

We parametrize our cost functions as neural networks, expanding their expressive power and enabling IOC to be applied to the state of a robotic system directly, without hand-designed features. Our experiments in Section 6.2 confirm that an affine cost function is not expressive enough to learn some behaviors. Neural network parametrizations are particularly useful for learning visual representations on raw image pixels. In our experiments, we make use of the unsupervised visual feature learning method developed by Finn et al. (2016) to learn cost functions that depend on visual input. Learning cost functions on raw pixels is an interesting direction for future work, which we discuss in Section 7.

While the expressive power of nonlinear cost functions provides a range of benefits, it introduces significant model complexity to an already underspecified IOC objective. To mitigate this challenge, we propose two regularization methods for IOC. Prior methods regularize the IOC objective by penalizing the ℓ1 or ℓ2 norm of the cost parameters θ (Ziebart, 2010; Kalakrishnan et al., 2013). For high-dimensional nonlinear cost functions, this regularizer is often insufficient, since different entries in the parameter vector can have drastically different effects on the cost. We use two regularization terms. The first term encourages the cost of demo and sample trajectories to change locally at a constant rate (lcr), by penalizing the second time derivative:
glcr(τ) = Σ_{xt ∈ τ} [ (cθ(xt+1) − cθ(xt)) − (cθ(xt) − cθ(xt−1)) ]²

This term reduces high-frequency variation that is often symptomatic of overfitting, making the cost easier to reoptimize. Although sharp changes in the cost slope are sometimes preferred, we found that temporally slow-changing costs were able to adequately capture all of the behaviors in our experiments.
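The lcr term is straightforward to compute from the per-timestep costs along a trajectory; a minimal sketch, assuming the costs cθ(xt) are available as an array, is:

```python
import numpy as np

def lcr_regularizer(costs):
    """g_lcr(tau): penalize the discrete second time derivative of the cost along a
    trajectory, encouraging locally constant-rate (slowly varying) costs.

    costs -- array of shape (T,) holding c_theta(x_t) for t = 1..T.
    """
    second_diff = costs[2:] - 2.0 * costs[1:-1] + costs[:-2]   # (c_{t+1}-c_t) - (c_t - c_{t-1})
    return float(np.sum(second_diff ** 2))
```

In practice a weighted sum of this term over the demo and sample trajectories would be added to the IOC objective.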
The second regularizer is more specific to one-shot episodic tasks, and it encourages the cost of the states of a demo trajectory to decrease strictly monotonically in time, reflecting the fact that demonstrations for such tasks typically make steady progress toward the goal on some (potentially nonlinear) manifold. While this assumption does not always hold perfectly, we again found that this type of regularizer improved performance on the tasks in our evaluation. We show a detailed comparison with regard to both regularizers in Appendix E.

Figure 2. Comparison to prior work on simulated 2D navigation, reaching, and peg insertion tasks. (Plots: performance versus number of samples; legend: PIIRL, RelEnt, and ours, each with demo and random initialization, plus true cost and uniform.) Reported performance is averaged over 4 runs of IOC on 4 different initial conditions. For peg insertion, the depth of the hole is 0.1m, marked as a dashed line. Distances larger than this amount failed to insert the peg.

6. Experimental Evaluation

We evaluated our sampling-based IOC algorithm on a set of robotic control tasks, both in simulation and on a real robotic platform. Each of the experiments involves complex second-order dynamics with force or torque control and no manually designed cost function features, with the raw state provided as input to the learned cost function.

We also tested the consistency of our algorithm on a toy point mass example for which the ground truth distribution is known. These experiments, discussed fully in Appendix B, show that using a maximum entropy version of the policy optimization objective (see Section 4.2) and using importance weights are both necessary for recovering the true distribution.

6.1. Simulated Comparisons

In this section, we provide simulated comparisons between guided cost learning and prior sample-based methods. We focus on task performance and sample complexity, and also perform comparisons across two different sampling distribution initializations and regularizations (in Appendix E).

To compare guided cost learning to prior methods, we ran experiments on three simulated tasks of varying difficulty, all using the MuJoCo physics simulator (Todorov et al., 2012). The first task is 2D navigation around obstacles, modeled on the task by Levine & Koltun (2012). This task has simple, linear dynamics and a low-dimensional state space, but a complex cost function, which we visualize in Figure 2. The second task involves a 3-link arm reaching towards a goal location in 2D, in the presence of physical obstacles. The third, most challenging, task is 3D peg insertion with a 7 DOF arm. This task is significantly more difficult than tasks evaluated in prior IOC work, as it involves complex contact dynamics between the peg and the table and high-dimensional, continuous state and action spaces. The arm is controlled by selecting torques at the joint motors at 20 Hz. More details on the experimental setup are provided in Appendix D.

In addition to the expert demonstrations, prior methods require a set of "suboptimal" samples for estimating the partition function. We obtain these samples in one of two ways: by using a baseline random controller that randomly explores around the initial state (random), and by fitting a linear-Gaussian controller to the demonstrations (demo). The latter initialization typically produces a motion that tracks the average demonstration with variance proportional to the variation between demonstrated motions.

Between 20 and 32 demonstrations were generated from a policy learned using the method of Levine & Abbeel (2014), with a ground truth cost function determined by the agent's pose relative to the goal.
We found that for the more precise peg insertion task, a relatively complex ground truth cost function was needed to afford the necessary degree of precision. We used a cost function of the form wd² + v log(d² + α), where d is the distance between the two tips of the peg and their target positions, and v and α are constants. Note that the affine cost is incapable of exactly representing this function. We generated demonstration trajectories under several different starting conditions. For 2D navigation, we varied the initial position of the agent, and for peg insertion, we varied the position of the peg hole. We then evaluated the performance of our method and prior sample-based methods (Kalakrishnan et al., 2013; Boularias et al., 2011) on each task from four arbitrarily-chosen test states. We chose these prior methods because, to our knowledge, they are the only methods which can handle unknown dynamics.

We used a neural network cost function with two hidden layers with 24–52 units and rectifying nonlinearities of the form max(z, 0), followed by linear connections to a set of features yt, which had a size of 20 for the 2D navigation task and 100 for the other two tasks. The cost is then given by

cθ(xt, ut) = ‖Ayt + b‖² + wu‖ut‖²,        (2)

with a fixed torque weight wu and the parameters consisting of A, b, and the network weights. These cost functions range from about 3,000 parameters for the 2D navigation task to 16,000 parameters for peg insertion. For further details, see Appendix C. Although the prior methods learn only linear cost functions, we can extend them to the nonlinear setting following the derivation in Section 4.1.
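As an illustration of the parameterization in Equation (2), a forward pass could look like the following sketch; the hidden width, feature dimension, torque weight, and input dimensions here are hypothetical placeholders rather than the values used in the paper.

```python
import numpy as np

class NeuralNetCost:
    """Two-hidden-layer cost of the form c(x, u) = ||A y + b||^2 + w_u ||u||^2,
    where y is the feature output of a ReLU network applied to the raw state x."""

    def __init__(self, dim_x, dim_feat=20, hidden=40, torque_weight=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.1 * rng.standard_normal((hidden, dim_x))
        self.W2 = 0.1 * rng.standard_normal((hidden, hidden))
        self.W3 = 0.1 * rng.standard_normal((dim_feat, hidden))   # linear map to features y_t
        self.A = 0.1 * rng.standard_normal((dim_feat, dim_feat))
        self.b = np.zeros(dim_feat)
        self.wu = torque_weight

    def __call__(self, x, u):
        h1 = np.maximum(self.W1 @ x, 0.0)   # rectifying nonlinearity max(z, 0)
        h2 = np.maximum(self.W2 @ h1, 0.0)
        y = self.W3 @ h2                    # features y_t
        return float(np.sum((self.A @ y + self.b) ** 2) + self.wu * np.sum(u ** 2))

# Hypothetical usage: a 26-dimensional state and 7-dimensional torque command.
cost = NeuralNetCost(dim_x=26)
c = cost(np.zeros(26), np.zeros(7))
```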
Figure 2 illustrates the tasks and shows results for each method after different numbers of samples from the test condition. In our method, five samples were used at each iteration of policy optimization, while for the prior methods, the number of samples corresponds to the number of "suboptimal" samples provided for cost learning. For the prior methods, additional samples were used to optimize the learned cost. The results indicate that our method is generally capable of learning tasks that are more complex than the prior methods, and is able to effectively handle complex, high-dimensional neural network cost functions. In particular, adding more samples for the prior methods generally does not improve their performance, because all of the samples are drawn from the same distribution.

6.2. Real-World Robotic Control

We also evaluated our method on a set of real robotic manipulation tasks using the PR2 robot, with comparisons to relative entropy IRL, which we found to be the better of the two prior methods in our simulated experiments. We chose two robotic manipulation tasks which involve complex dynamics and interactions with delicate objects, for which it is challenging to write down a cost function by hand. For all methods, we used a two-layer neural network cost parametrization and the regularization objective described in Section 5, and compared to an affine cost function on one task to evaluate the importance of nonlinear cost representations. The affine cost followed the form of Equation (2) but with yt equal to the input xt; note that a cost function that is quadratic in the state is linear in the coefficients of the monomials, and therefore corresponds to a linear parameterization. For both tasks, between 25 and 30 human demonstrations were provided via kinesthetic teaching, and each IOC algorithm was initialized by automatically fitting a controller to the demonstrations that roughly tracked the average trajectory. Full details on both tasks are in Appendix D, and summaries are below.

Figure 3. Dish placement and pouring tasks (images of the human demo, initial pose, and final pose for each task). The robot learned to place the plate gently into the correct slot, and to pour almonds, localizing the target cup using unsupervised visual features. A video of the learned controllers can be found at http://rll.berkeley.edu/gcl

In the first task, illustrated in Figure 3, the robot must gently place a grasped plate into a specific slot of a dish rack. The state space consists of the joint angles, the pose of the gripper relative to the target pose, and the time derivatives of each; the actions correspond to torques on the robot's motors; and the input to the cost function is the pose and velocity of the gripper relative to the target position. Note that we do not provide the robot with an existing trajectory tracking controller or any manually-designed policy representation beyond linear-Gaussian controllers, in contrast to prior methods that use trajectory following (Kalakrishnan et al., 2013) or dynamic movement primitives with features (Boularias et al., 2011). Our attempt to design a hand-crafted cost function for inserting the plate into the dish rack produced a fast but overly aggressive behavior that cracked one of the plates during learning.

The second task, also shown in Figure 3, consisted of pouring almonds from one cup to another. In order to succeed, the robot must keep the cup upright until reaching the target cup, then rotate the cup so that the almonds are poured. Instead of including the position of the target cup in the state space, we train autoencoder features from camera images captured from the demonstrations and add a pruned feature point representation and its time derivative to the state, as proposed by Finn et al. (2016). The input to the cost function includes these visual features, as well as the pose and velocity of the gripper. Note that the position of the target cup can only be obtained from the visual features, so the algorithm must learn to use them in the cost function in order to succeed at the task.
Table 1. Performance of guided cost learning (GCL) and relative entropy (RelEnt) IRL on placing a dish into a rack and pouring almonds into a cup. Sample counts are for IOC, omitting those for optimizing the learned cost. An affine cost is insufficient for representing the pouring task, motivating the use of a neural network (NN) cost. The pouring task with a neural network cost is evaluated for two positions of the target cup; average performance is reported.

                       RelEnt IRL    GCL q(ut|xt)    GCL reopt.
dish (NN)
  success rate         0%            100%            100%
  # samples            100           90              90
pouring (NN)
  success rate         10%           84.7%           34%
  # samples            150,150       75,130          75,130
pouring (affine)
  success rate         0%            0%              –
  # samples            150           120             –

The results, presented in Table 1, show that our algorithm successfully learned both tasks. The prior relative entropy IRL algorithm could not acquire a suitable cost function, due to the complexity of this domain. On the pouring task, where we also evaluated a simpler affine cost function, we found that only the neural network representation could recover a successful behavior, illustrating the need for rich and expressive function approximators when learning cost functions directly on raw state representations. (We did attempt to learn costs directly on image pixels, but found that the problem was too underdetermined to succeed; better image-specific regularization is likely required for this.)

The results in Table 1 also evaluate the generalizability of the cost function learned by our method and prior work. On the dish rack task, we can use the learned cost to optimize new policies for different target dish positions successfully, while the prior method does not produce a generalizable cost function. On the harder pouring task, we found that the learned cost succeeded less often on new positions. However, as discussed in Section 4.4, our method produces both a policy q(ut|xt) and a cost function cθ when trained on a novel instance of the task, and although the learned cost functions for this task were worse, the learned policy succeeded on the test positions when optimized with IOC in the inner loop using our algorithm. This indicates an interesting property of our approach: although the learned cost function is local in nature due to the choice of sampling distribution, the learned policy tends to succeed even when the cost function is too local to produce good results in very different situations. An interesting avenue for future work is to further explore the implications of this property, and to improve the generalizability of the learned cost by successively training policies on different novel instances of the task until enough global training data is available to produce a cost function that is a good fit to the demonstrations in previously unseen parts of the state space.

7. Discussion and Future Work

We presented an inverse optimal control algorithm that can learn complex, nonlinear cost representations, such as neural networks, and can be applied to high-dimensional systems with unknown dynamics. Our algorithm uses a sample-based approximation of the maximum entropy IOC objective, with samples generated from a policy learning algorithm based on local linear models (Levine & Abbeel, 2014). To our knowledge, this approach is the first to combine the benefits of sample-based IOC under unknown dynamics with nonlinear cost representations that directly use the raw state of the system, without the need for manual feature engineering. This allows us to apply our method to a variety of real-world robotic manipulation tasks. Our evaluation demonstrates that our method outperforms prior IOC algorithms on a set of simulated benchmarks, and achieves good results on several real-world tasks.

Our evaluation shows that our approach can learn good cost functions for a variety of simulated tasks. For complex robotic motion skills, the learned cost functions tend to explain the demonstrations only locally. This makes them difficult to reoptimize from scratch for new conditions. It should be noted that this challenge is not unique to our method: in our comparisons, no prior sample-based method was able to learn good global costs for these tasks. However, since our method interleaves cost optimization with policy learning, it still recovers successful policies for these tasks. For this reason, we can still learn from demonstration simply by retaining the learned policy, and discarding the cost function. This allows us to tackle substantially more challenging tasks that involve direct torque control of real robotic systems with feedback from vision.

To incorporate vision into our experiments, we used unsupervised learning to acquire a vision-based state representation, following prior work (Finn et al., 2016). An exciting avenue for future work is to extend our approach to learn cost functions directly from natural images. The principal challenge for this extension is to avoid overfitting when using substantially larger and more expressive networks. Our current regularization techniques mitigate overfitting to a high degree, but visual inputs tend to vary dramatically between demonstrations and on-policy samples, particularly when the demonstrations are provided by a human via kinesthetic teaching. One promising avenue for mitigating these challenges is to introduce regularization methods developed for domain adaptation in computer vision (Tzeng et al., 2015), to encode the prior knowledge that demonstrations have similar visual features to samples.
References

Abbeel, P. and Ng, A. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2004.

Aghasadeghi, N. and Bretl, T. Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals. In International Conference on Intelligent Robots and Systems (IROS), 2011.

Audiffren, J., Valko, M., Lazaric, A., and Ghavamzadeh, M. Maximum entropy semi-supervised inverse reinforcement learning. In International Joint Conference on Artificial Intelligence (IJCAI), July 2015.

Bagnell, J. A. and Schneider, J. Covariant policy search. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.

Boularias, A., Kober, J., and Peters, J. Relative entropy inverse reinforcement learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Byravan, A., Monfort, M., Ziebart, B., Boots, B., and Fox, D. Graph-based inverse optimal control for robot manipulation. In International Joint Conference on Artificial Intelligence (IJCAI), 2015.

Doerr, A., Ratliff, N., Bohg, J., Toussaint, M., and Schaal, S. Direct loss minimization inverse optimal control. In Proceedings of Robotics: Science and Systems (R:SS), Rome, Italy, July 2015.

Dragan, A. and Srinivasa, S. Formalizing assistive teleoperation. In Proceedings of Robotics: Science and Systems (R:SS), Sydney, Australia, July 2012.

Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In International Conference on Robotics and Automation (ICRA), 2016.

Huang, D. and Kitani, K. Action-reaction: Forecasting the dynamics of human interaction. In European Conference on Computer Vision (ECCV), 2014.

Kalakrishnan, M., Pastor, P., Righetti, L., and Schaal, S. Learning objective functions for manipulation. In International Conference on Robotics and Automation (ICRA), 2013.

Levine, S., Popovic, Z., and Koltun, V. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems (NIPS), 2011.

Levine, S., Wagener, N., and Abbeel, P. Learning contact-rich manipulation skills with guided policy search. In International Conference on Robotics and Automation (ICRA), 2015.

Monfort, M., Lake, B. M., Ziebart, B., Lucey, P., and Tenenbaum, J. Softstar: Heuristic-guided probabilistic inference. In Advances in Neural Information Processing Systems (NIPS), pp. 2746–2754, 2015.

Muelling, K., Boularias, A., Mohler, B., Schölkopf, B., and Peters, J. Learning strategies in table tennis using inverse reinforcement learning. Biological Cybernetics, 108(5), 2014.

Ng, A., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), 1999.

Ng, A., Russell, S., et al. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2000.

Peters, J., Mülling, K., and Altün, Y. Relative entropy policy search. In AAAI Conference on Artificial Intelligence, 2010.

Ramachandran, D. and Amir, E. Bayesian inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, volume 51, 2007.

Ratliff, N., Bagnell, J. A., and Zinkevich, M. A. Maximum margin planning. In International Conference on Machine Learning (ICML), 2006.

Ratliff, N., Bradley, D., Bagnell, J. A., and Chestnutt, J. Boosting structured prediction for imitation learning. 2007.

Ratliff, N., Silver, D., and Bagnell, J. A. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1), 2009.