Deep Inverse Optimal Control Via Policy Optimization
Figure 1. Overview of guided cost learning: human demonstrations Ddemo and samples Dtraj, obtained by running the current policy qi on the robot, are accumulated into Dsamp and used to update the cost.
our method must be able to handle unknown dynamics and high-dimensional systems. To that end, we propose a cost learning algorithm based on policy optimization with local linear models, building on prior work in reinforcement learning (Levine & Abbeel, 2014). In this approach, as illustrated in Figure 1, the cost function is learned in the inner loop of a policy search procedure, using samples collected for policy improvement to also update the cost function. The cost learning method itself is a nonlinear generalization of maximum entropy IOC (Ziebart et al., 2008), with samples used to approximate the partition function. In contrast to previous work that optimizes the policy in the inner loop of cost learning, our approach instead updates the cost in the inner loop of policy search, making it practical and efficient. One of the benefits of this approach is that we can couple learning the cost with learning the policy for that cost. For tasks that are too complex to acquire a good global cost function from a small number of demonstrations, our method can still recover effective behaviors by running our policy learning method and retaining the learned policy. We elaborate on this further in Section 4.4.

The main contribution of our work is an algorithm that learns nonlinear cost functions from user demonstrations, at the same time as learning a policy to perform the task. Since the policy optimization "guides" the cost toward good regions of the space, we call this method guided cost learning. Unlike prior methods, our algorithm can handle complex, nonlinear cost function representations and high-dimensional unknown dynamics, and can be used on real physical systems with a modest number of samples. Our evaluation demonstrates the performance of our method on a set of simulated benchmark tasks, showing that it outperforms previous methods. We also evaluate our method on two real-world tasks learned directly from human demonstrations. These tasks require using torque control and vision to perform a variety of robotic manipulation behaviors, without any hand-specified cost features.

2. Related Work

One of the basic challenges in inverse optimal control (IOC), also known as inverse reinforcement learning (IRL), is that finding a cost or reward under which a set of demonstrations is near-optimal is underdefined. Many different costs can explain a given set of demonstrations. Prior work has tackled this issue using maximum margin formulations (Abbeel & Ng, 2004; Ratliff et al., 2006), as well as probabilistic models that explain suboptimal behavior as noise (Ramachandran & Amir, 2007; Ziebart et al., 2008). We take the latter approach in this work, building on the maximum entropy IOC model (Ziebart, 2010). Although the probabilistic model mitigates some of the difficulties with IOC, there is still a great deal of ambiguity, and an important component of many prior methods is the inclusion of detailed features created using domain knowledge, which can be linearly combined into a cost, including: indicators for successful completion of the task for a robotic ball-in-cup game (Boularias et al., 2011), learning table tennis with features that include distance of the ball to the opponent's elbow (Muelling et al., 2014), providing the goal position as a known constraint for robotic grasping (Doerr et al., 2015), and learning highway driving with indicators for collision and driving on the grass (Abbeel & Ng, 2004). While these features allow the user to impose structure on the cost, they substantially increase the engineering burden. Several methods have proposed to learn nonlinear costs using Gaussian processes (Levine et al., 2011) and boosting (Ratliff et al., 2007; 2009), but even these methods generally operate on features rather than raw states. We instead use rich, expressive function approximators, in the form of neural networks, to learn cost functions directly on raw state representations. While neural network cost representations have previously been proposed in the literature (Wulfmeier et al., 2015), they have only been applied to small, synthetic domains. Previous work has also suggested simple regularization methods for cost functions, based on minimizing ℓ1 or ℓ2 norms of the parameter vector (Ziebart, 2010; Kalakrishnan et al., 2013) or by using unlabeled trajectories (Audiffren et al., 2015). When using expressive function approximators in complex real-world tasks, we must design substantially more powerful regularization techniques to mitigate the underspecified nature of the problem, which we introduce in Section 5.

Another challenge in IOC is that, in order to determine the quality of a given cost function, we must solve some variant of the forward control problem to obtain the corresponding policy, and then compare this policy to the demonstrated actions. Most early IRL algorithms required solving an MDP in the inner loop of an iterative optimization (Ng et al., 2000; Abbeel & Ng, 2004; Ziebart et al., 2008). This requires perfect knowledge of the system dynamics and access to an efficient offline solver, neither of which is available in, for instance, complex robotic control domains. Several works have proposed to relax this requirement, for example by learning a value function instead of a cost (Todorov, 2006), solving an approximate local control problem (Levine & Koltun, 2012; Dragan & Srinivasa, 2012), generating a discrete graph of states (Byravan et al., 2015; Monfort et al., 2015), or only obtaining an optimal trajectory rather than a policy (Ratliff et al., 2006; 2009). However, these methods still require knowledge of the system dynamics. Given the size and complexity of the problems addressed in this work, solving the optimal control problem even approximately in the inner loop of the cost optimization is impractical. We show that good cost functions can be learned by instead learning the cost in the inner loop of a policy optimization.
Our inverse optimal control algorithm is most closely related to other previous sample-based methods based on the principle of maximum entropy, including relative entropy IRL (Boularias et al., 2011) and path integral IRL (Kalakrishnan et al., 2013), which can also handle unknown dynamics. However, unlike these prior methods, we adapt the sampling distribution using policy optimization. We demonstrate in our experiments that this adaptation is crucial for obtaining good results on complex robotic platforms, particularly when using complex, nonlinear cost functions.

To summarize, our proposed method is the first to combine several desirable features into a single, effective algorithm: it can handle unknown dynamics, which is crucial for real-world robotic tasks, it can deal with high-dimensional, complex systems, as in the case of real torque-controlled robotic arms, and it can learn complex, expressive cost functions, such as multilayer neural networks, which removes the requirement for meticulous hand-engineering of cost features. While some prior methods have shown good results with unknown dynamics on real robots (Boularias et al., 2011; Kalakrishnan et al., 2013) and some have proposed using nonlinear cost functions (Ratliff et al., 2006; 2009; Levine et al., 2011), to our knowledge no prior method has been demonstrated that can provide all of these benefits in the context of complex real-world tasks.

3. Preliminaries and Overview

We build on the probabilistic maximum entropy inverse optimal control framework (Ziebart et al., 2008). The demonstrated behavior is assumed to be the result of an expert acting stochastically and near-optimally with respect to an unknown cost function. Specifically, the model assumes that the expert samples the demonstrated trajectories {τi} from the distribution

p(τ) = (1/Z) exp(−cθ(τ)),        (1)

where τ = {x1, u1, . . . , xT, uT} is a trajectory sample, cθ(τ) = Σ_t cθ(xt, ut) is an unknown cost function parameterized by θ, and xt and ut are the state and action at time step t. Under this model, the expert is most likely to act optimally, and can generate suboptimal trajectories with a probability that decreases exponentially as the trajectories become more costly. The partition function Z is difficult to compute for large or continuous domains, and presents the main computational challenge in maximum entropy IOC. The first applications of this model computed Z exactly with dynamic programming (Ziebart et al., 2008). However, this is only practical in small, discrete domains. More recent methods have proposed to estimate Z by using the Laplace approximation (Levine & Koltun, 2012), value function approximation (Huang & Kitani, 2014), and samples (Boularias et al., 2011). As discussed in Section 4.1, we take the sample-based approach in this work, because it allows us to perform inverse optimal control without a known model of the system dynamics. This is especially important in robotic manipulation domains, where the robot might interact with a variety of objects with unknown physical properties.
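To make the model in Equation (1) and the sample-based treatment of Z concrete, the following toy sketch (not taken from the paper; the one-dimensional trajectories, quadratic cost, and Gaussian background distribution are all illustrative assumptions) evaluates the maximum entropy distribution over a handful of candidate trajectories and estimates the partition function by importance sampling from a background distribution q.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D "trajectories": T-step state sequences; the cost is squared distance to a goal.
T, goal = 5, 1.0
def traj_cost(traj):
    # c_theta(tau) = sum_t c_theta(x_t); here a simple quadratic state cost.
    return np.sum((traj - goal) ** 2)

# Exact MaxEnt probabilities on a small discrete set of candidate trajectories.
candidates = [np.full(T, v) for v in (-1.0, 0.0, 0.5, 1.0)]
costs = np.array([traj_cost(tau) for tau in candidates])
p = np.exp(-costs) / np.sum(np.exp(-costs))   # Equation (1): p(tau) = exp(-c(tau)) / Z
print("exact p(tau):", np.round(p, 3))        # lower-cost trajectories are exponentially preferred

# In continuous spaces Z is an integral; estimate it with samples from a background q(tau).
# Here each sampled "trajectory" is a constant sequence parameterized by a scalar drawn from q = N(0, 1).
M = 10000
samples = rng.normal(0.0, 1.0, size=M)
sample_costs = np.array([traj_cost(np.full(T, s)) for s in samples])
q_density = (1.0 / np.sqrt(2 * np.pi)) * np.exp(-0.5 * samples ** 2)
Z_est = np.mean(np.exp(-sample_costs) / q_density)   # importance-sampled partition function
print("importance-sampled Z estimate:", Z_est)
```

The same importance-sampling idea, applied to trajectories generated on the real system, underlies the objective derived in Section 4.1.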
To represent the cost function cθ(xt, ut), IOC or IRL methods typically use a linear combination of hand-crafted features, given by cθ(xt, ut) = θ^T f(xt, ut) (Abbeel & Ng, 2004). This representation is difficult to apply to more complex domains, especially when the cost must be computed from raw sensory input. In this work, we explore the use of high-dimensional, expressive function approximators for representing cθ(xt, ut). As we discuss in Section 6, we use neural networks that operate directly on the robot's state, though other parameterizations could also be used with our method. Complex representations are generally considered to be poorly suited for IOC, since learning costs that associate the right element of the state with the goal of the task is already quite difficult even with simple linear representations. However, as we discuss in our evaluation, we found that such representations could be learned effectively by adaptively generating samples as part of a policy optimization procedure, as discussed in Section 4.2.

4. Guided Cost Learning

In this section, we describe the guided cost learning algorithm, which combines sample-based maximum entropy IOC with forward reinforcement learning using time-varying linear models. The central idea behind this method is to adapt the sampling distribution to match the maximum entropy cost distribution p(τ) = (1/Z) exp(−cθ(τ)), by directly optimizing a trajectory distribution with respect to the current cost cθ(τ) using a sample-efficient reinforcement learning algorithm. Samples generated on the physical system are used both to improve the policy and more accurately estimate the partition function Z. In this way, the reinforcement learning step acts to "guide" the sampling distribution toward regions where the samples are more useful for estimating the partition function. We will first describe how the IOC objective in Equation (1) can be estimated with samples, and then describe how reinforcement learning can adapt the sampling distribution.

4.1. Sample-Based Inverse Optimal Control

In the sample-based approach to maximum entropy IOC, the partition function Z = ∫ exp(−cθ(τ)) dτ is estimated with samples from a background distribution q(τ). Prior sample-based IOC methods use a linear representation of the cost function, which simplifies the corresponding cost learning problem (Boularias et al., 2011; Kalakrishnan et al., 2013). In this section, we instead derive a sample-based approximation for the IOC objective for a general nonlinear parameterization of the cost function.
The negative log-likelihood corresponding to the IOC model in Equation (1) is given by:

LIOC(θ) = (1/N) Σ_{τi ∈ Ddemo} cθ(τi) + log Z
        ≈ (1/N) Σ_{τi ∈ Ddemo} cθ(τi) + log (1/M) Σ_{τj ∈ Dsamp} exp(−cθ(τj)) / q(τj),

where Ddemo denotes the set of N demonstrated trajectories, Dsamp the set of M background samples, and q denotes the background distribution from which the trajectories τj were sampled. Prior methods have chosen q to be uniform (Boularias et al., 2011) or to lie in the vicinity of the demonstrations (Kalakrishnan et al., 2013). To compute the gradients of this objective with respect to the cost parameters θ, let wj = exp(−cθ(τj)) / q(τj) and Z = Σ_j wj. The gradient is then given by:

dLIOC/dθ = (1/N) Σ_{τi ∈ Ddemo} dcθ(τi)/dθ − (1/Z) Σ_{τj ∈ Dsamp} wj dcθ(τj)/dθ.

When the cost is represented by a neural network or some other function approximator, this gradient can be computed efficiently by backpropagating −wj/Z for each trajectory τj ∈ Dsamp and 1/N for each trajectory τi ∈ Ddemo.
4.2. Adaptive Sampling via Policy Optimization

The choice of background sample distribution q(τ) for estimating the objective LIOC is critical for successfully applying the sample-based IOC algorithm. The optimal importance sampling distribution for estimating the partition function ∫ exp(−cθ(τ)) dτ is q(τ) ∝ |exp(−cθ(τ))| = exp(−cθ(τ)). Designing a single background distribution q(τ) is therefore quite difficult when the cost cθ is unknown. Instead, we can adaptively refine q(τ) to generate more samples in those regions of the trajectory space that are good according to the current cost function cθ(τ). To this end, we interleave the IOC optimization, which attempts to find the cost function that maximizes the likelihood of the demonstrations, with a policy optimization procedure, which improves the trajectory distribution q(τ) with respect to the current cost.

Since one of the main advantages of the sample-based IOC approach is the ability to handle unknown dynamics, we must also choose a policy optimization procedure that can handle unknown dynamics. To this end, we adapt the method presented by Levine & Abbeel (2014), which performs policy optimization under unknown dynamics by iteratively fitting time-varying linear dynamics to samples from the current trajectory distribution q(τ), updating the trajectory distribution using a modified LQR backward pass, and generating more samples for the next iteration. The trajectory distributions generated by this method are Gaussian, and each iteration of the policy optimization procedure satisfies a KL-divergence constraint of the form DKL(q(τ) ‖ q̂(τ)) ≤ ε, which prevents the policy from changing too rapidly (Bagnell & Schneider, 2003; Peters et al., 2010; Rawlik & Vijayakumar, 2013). This has the additional benefit of not overfitting to poor initial estimates of the cost function. With a small modification, we can use this algorithm to optimize a maximum entropy version of the objective, given by min_q Eq[cθ(τ)] − H(τ), as discussed in prior work (Levine & Abbeel, 2014). This variant of the algorithm allows us to recover the trajectory distribution q(τ) ∝ exp(−cθ(τ)) at convergence (Ziebart, 2010), a good distribution for sampling. For completeness, this policy optimization procedure is summarized in Appendix A.

Our sample-based IOC algorithm with adaptive sampling is summarized in Algorithm 1. We call this method guided cost learning because policy optimization is used to guide sampling toward regions with lower cost. The algorithm consists of taking successive policy optimization steps, each of which generates samples Dtraj from the latest trajectory distribution qk(τ). After sampling, the cost function is updated using all samples collected thus far for the purpose of policy optimization. No additional background samples are required for this method. This procedure returns both a learned cost function cθ(xt, ut) and a trajectory distribution q(τ), which corresponds to a time-varying linear-Gaussian controller q(ut | xt). This controller can be used to execute the learned behavior.

Algorithm 1 Guided cost learning
1: Initialize qk(τ) as either a random initial controller or from demonstrations
2: for iteration i = 1 to I do
3:   Generate samples Dtraj from qk(τ)
4:   Append samples: Dsamp ← Dsamp ∪ Dtraj
5:   Use Dsamp to update cost cθ using Algorithm 2
6:   Update qk(τ) using Dtraj and the method from (Levine & Abbeel, 2014) to obtain qk+1(τ)
7: end for
8: return optimized cost parameters θ and trajectory distribution q(τ)
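A minimal sketch of this outer loop in code is given below. The callables rollout, update_cost, and improve_controller are hypothetical stand-ins for sampling on the physical system, the nonlinear IOC update of Algorithm 2, and the policy optimization method of Levine & Abbeel (2014), respectively.

```python
def guided_cost_learning(demos, init_controller, rollout, update_cost, improve_controller,
                         num_iterations=10, samples_per_iter=5):
    """Schematic outer loop of Algorithm 1 (guided cost learning).

    rollout(controller)                            -> one trajectory sampled on the system
    update_cost(demos, d_samp)                     -> updated cost (Algorithm 2)
    improve_controller(controller, d_traj, cost)   -> next controller q_{k+1}
    """
    controller, cost, d_samp = init_controller, None, []
    for _ in range(num_iterations):
        d_traj = [rollout(controller) for _ in range(samples_per_iter)]   # D_traj ~ q_k
        d_samp.extend(d_traj)                                             # D_samp <- D_samp U D_traj
        cost = update_cost(demos, d_samp)                                 # cost update in the inner loop
        controller = improve_controller(controller, d_traj, cost)         # policy step (Levine & Abbeel, 2014)
    return cost, controller
```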
4.3. Cost Optimization and Importance Weights

The IOC objective can be optimized using standard nonlinear optimization methods and the gradient dLIOC/dθ. Stochastic gradient methods are often preferred for high-dimensional function approximators, such as neural networks. Such methods are straightforward to apply to objectives that factorize over the training samples, but the partition function does not factorize trivially in this way.
Nonetheless, we found that our objective could still be optimized with stochastic gradient methods by sampling a subset of the demonstrations and background samples at each iteration. When the number of samples in the batch is small, we found it necessary to add the sampled demonstrations to the background sample set as well; without adding the demonstrations to the sample set, the objective can become unbounded, and frequently does in practice. The stochastic optimization procedure is presented in Algorithm 2, and is straightforward to implement with most neural network libraries based on backpropagation.

Estimating the partition function requires us to use importance sampling. Although prior work has suggested dropping the importance weights (Kalakrishnan et al., 2013; Aghasadeghi & Bretl, 2011), we show in Appendix B that this produces an inconsistent likelihood estimate and fails to recover good cost functions. Since our samples are drawn from multiple distributions, we compute a fusion distribution to evaluate the importance weights. Specifically, if we have samples from k distributions q1(τ), . . . , qk(τ), we can construct a consistent estimator of the expectation of a function f(τ) under a uniform distribution as

E[f(τ)] ≈ (1/M) Σ_{τj} [ (1/k) Σ_κ qκ(τj) ]⁻¹ f(τj).

Accordingly, the importance weights are zj = [ (1/k) Σ_κ qκ(τj) ]⁻¹, and the objective is now:

LIOC(θ) = (1/N) Σ_{τi ∈ Ddemo} cθ(τi) + log (1/M) Σ_{τj ∈ Dsamp} zj exp(−cθ(τj)).
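Assuming the log-density of every sampling distribution can be evaluated at every sample, the fusion weights zj and the resulting objective can be computed as in the following sketch (an illustrative reimplementation, not the authors' code):

```python
import numpy as np

def fusion_importance_weights(log_qs):
    """Importance weights under the fusion distribution (Section 4.3).

    log_qs -- array of shape (k, M) with log q_kappa(tau_j) for each of the k
              sampling distributions evaluated at each of the M samples.
    Returns z_j = [ (1/k) * sum_kappa q_kappa(tau_j) ]^(-1) for each sample.
    """
    k = log_qs.shape[0]
    # Log of the fusion density (1/k) * sum_kappa q_kappa(tau_j), computed stably.
    log_fusion = np.logaddexp.reduce(log_qs, axis=0) - np.log(k)
    return np.exp(-log_fusion)

def ioc_objective_with_fusion(demo_costs, sample_costs, log_qs):
    """L_IOC(theta) = (1/N) sum_i c(tau_i) + log (1/M) sum_j z_j exp(-c(tau_j))."""
    z = fusion_importance_weights(log_qs)
    M = sample_costs.shape[0]
    log_partition = np.logaddexp.reduce(np.log(z) - sample_costs) - np.log(M)
    return demo_costs.mean() + log_partition
```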
The distributions qκ underlying background samples are obtained from the controller at iteration k. We must also append the demonstrations to the samples in Algorithm 2, yet the distribution that generated the demonstrations is unknown. To estimate it, we assume the demonstrations come from a single Gaussian trajectory distribution and compute its empirical mean and variance. We found this approximation sufficiently accurate for estimating the importance weights of the demonstrations, as shown in Appendix B.

Algorithm 2 Nonlinear IOC with stochastic gradients
1: for iteration k = 1 to K do
2:   Sample demonstration batch D̂demo ⊂ Ddemo
3:   Sample background batch D̂samp ⊂ Dsamp
4:   Append demonstration batch to background batch: D̂samp ← D̂demo ∪ D̂samp
5:   Estimate dLIOC/dθ(θ) using D̂demo and D̂samp
6:   Update parameters θ using gradient dLIOC/dθ(θ)
7: end for
8: return optimized cost parameters θ
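For concreteness, the minibatch procedure of Algorithm 2 might be implemented along the following lines; ioc_grad is a hypothetical callable that returns the gradient dLIOC/dθ estimated on a batch (for example, built from the weights derived in Sections 4.1 and 4.3), and plain gradient descent stands in for whatever optimizer is preferred.

```python
import numpy as np

def stochastic_ioc(theta, demos, samples, ioc_grad,
                   num_iterations=1000, batch_size=16, step_size=1e-3, seed=0):
    """Nonlinear IOC with stochastic gradients (Algorithm 2), schematically."""
    rng = np.random.default_rng(seed)
    for _ in range(num_iterations):
        # Sample a demonstration batch and a background batch.
        demo_batch = [demos[i] for i in rng.integers(len(demos), size=batch_size)]
        samp_batch = [samples[i] for i in rng.integers(len(samples), size=batch_size)]

        # Append the demonstration batch to the background batch; without this,
        # the objective can become unbounded for small batches.
        samp_batch = demo_batch + samp_batch

        # Estimate dL_IOC/dtheta on the batch and take a gradient step.
        grad = ioc_grad(theta, demo_batch, samp_batch)
        theta = theta - step_size * grad
    return theta
```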
4.4. Learning Costs and Controllers

In contrast to many previous IOC and IRL methods, our approach can be used to learn a cost while simultaneously optimizing the policy for a new instance of the task not in the demos, such as a new position of a target cup for a pouring task, as shown in our experiments. Since the algorithm produces both a cost function cθ(xt, ut) and a controller q(ut | xt) that optimizes this cost on the new task instance, we can directly use this controller to execute the desired behavior. In this way, the method actually learns a policy from demonstration, using the additional knowledge that the demonstrations are near-optimal under some unknown cost function, similar to recent work on IOC by direct loss minimization (Doerr et al., 2015). The learned cost function cθ(xt, ut) can often also be used to optimize new policies for new instances of the task without additional cost learning. However, we found that on the most challenging tasks we tested, running policy learning with IOC in the loop for each new task instance typically succeeded more frequently than running IOC once and reusing the learned cost. We hypothesize that this is because training the policy on a new instance of the task provides the algorithm with additional information about task variation, thus producing a better cost function and reducing overfitting. The intuition behind this hypothesis is that the demonstrations only cover a small portion of the degrees of variation in the task. Observing samples from a new task instance provides the algorithm with a better idea of the particular factors that distinguish successful task executions from failures.

5. Representation and Regularization

We parametrize our cost functions as neural networks, expanding their expressive power and enabling IOC to be applied to the state of a robotic system directly, without hand-designed features. Our experiments in Section 6.2 confirm that an affine cost function is not expressive enough to learn some behaviors. Neural network parametrizations are particularly useful for learning visual representations on raw image pixels. In our experiments, we make use of the unsupervised visual feature learning method developed by Finn et al. (2016) to learn cost functions that depend on visual input. Learning cost functions on raw pixels is an interesting direction for future work, which we discuss in Section 7.

While the expressive power of nonlinear cost functions provides a range of benefits, it introduces significant model complexity to an already underspecified IOC objective. To mitigate this challenge, we propose two regularization methods for IOC. Prior methods regularize the IOC objective by penalizing the ℓ1 or ℓ2 norm of the cost parameters θ (Ziebart, 2010; Kalakrishnan et al., 2013). For high-dimensional nonlinear cost functions, this regularizer is often insufficient, since different entries in the parameter vector can have drastically different effects on the cost. We use two regularization terms. The first term encourages the cost of demo and sample trajectories to change locally at a constant rate (lcr), by penalizing the second time derivative:
glcr(τ) = Σ_{xt ∈ τ} [ (cθ(xt+1) − cθ(xt)) − (cθ(xt) − cθ(xt−1)) ]²

This term reduces high-frequency variation that is often symptomatic of overfitting, making the cost easier to reoptimize. Although sharp changes in the cost slope are sometimes preferred, we found that temporally slow-changing costs were able to adequately capture all of the behaviors in our experiments.
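The lcr term is straightforward to compute from the per-timestep costs along a trajectory; a minimal sketch, assuming the costs cθ(xt) are available as an array, is:

```python
import numpy as np

def lcr_regularizer(costs):
    """g_lcr(tau): penalize the discrete second time derivative of the cost along a
    trajectory, encouraging locally constant-rate (slowly varying) costs.

    costs -- array of shape (T,) holding c_theta(x_t) for t = 1..T.
    """
    second_diff = costs[2:] - 2.0 * costs[1:-1] + costs[:-2]   # (c_{t+1}-c_t) - (c_t - c_{t-1})
    return float(np.sum(second_diff ** 2))
```

In practice a weighted sum of this term over the demo and sample trajectories would be added to the IOC objective.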
The second regularizer is more specific to one-shot episodic tasks, and it encourages the cost of the states of a demo trajectory to decrease strictly monotonically in time, reflecting the fact that demonstrations for such tasks typically make steady progress toward the goal on some (potentially nonlinear) manifold. While this assumption does not always hold perfectly, we again found that this type of regularizer improved performance on the tasks in our evaluation. We show a detailed comparison with regard to both regularizers in Appendix E.

Figure 2. Comparison to prior work on simulated 2D navigation, reaching, and peg insertion tasks. (Plots: performance versus number of samples; legend: PIIRL, RelEnt, and ours, each with demo and random initialization, plus true cost and uniform.) Reported performance is averaged over 4 runs of IOC on 4 different initial conditions. For peg insertion, the depth of the hole is 0.1m, marked as a dashed line. Distances larger than this amount failed to insert the peg.

6. Experimental Evaluation

We evaluated our sampling-based IOC algorithm on a set of robotic control tasks, both in simulation and on a real robotic platform. Each of the experiments involves complex second-order dynamics with force or torque control and no manually designed cost function features, with the raw state provided as input to the learned cost function.

We also tested the consistency of our algorithm on a toy point mass example for which the ground truth distribution is known. These experiments, discussed fully in Appendix B, show that using a maximum entropy version of the policy optimization objective (see Section 4.2) and using importance weights are both necessary for recovering the true distribution.

6.1. Simulated Comparisons

In this section, we provide simulated comparisons between guided cost learning and prior sample-based methods. We focus on task performance and sample complexity, and also perform comparisons across two different sampling distribution initializations and regularizations (in Appendix E).

To compare guided cost learning to prior methods, we ran experiments on three simulated tasks of varying difficulty, all using the MuJoCo physics simulator (Todorov et al., 2012). The first task is 2D navigation around obstacles, modeled on the task by Levine & Koltun (2012). This task has simple, linear dynamics and a low-dimensional state space, but a complex cost function, which we visualize in Figure 2. The second task involves a 3-link arm reaching towards a goal location in 2D, in the presence of physical obstacles. The third, most challenging, task is 3D peg insertion with a 7 DOF arm. This task is significantly more difficult than tasks evaluated in prior IOC work, as it involves complex contact dynamics between the peg and the table and high-dimensional, continuous state and action spaces. The arm is controlled by selecting torques at the joint motors at 20 Hz. More details on the experimental setup are provided in Appendix D.

In addition to the expert demonstrations, prior methods require a set of "suboptimal" samples for estimating the partition function. We obtain these samples in one of two ways: by using a baseline random controller that randomly explores around the initial state (random), and by fitting a linear-Gaussian controller to the demonstrations (demo). The latter initialization typically produces a motion that tracks the average demonstration with variance proportional to the variation between demonstrated motions.

Between 20 and 32 demonstrations were generated from a policy learned using the method of Levine & Abbeel (2014), with a ground truth cost function determined by the agent's pose relative to the goal.
We found that for the more precise peg insertion task, a relatively complex ground truth cost function was needed to afford the necessary degree of precision. We used a cost function of the form wd² + v log(d² + α), where d is the distance between the two tips of the peg and their target positions, and v and α are constants. Note that the affine cost is incapable of exactly representing this function. We generated demonstration trajectories under several different starting conditions. For 2D navigation, we varied the initial position of the agent, and for peg insertion, we varied the position of the peg hole. We then evaluated the performance of our method and prior sample-based methods (Kalakrishnan et al., 2013; Boularias et al., 2011) on each task from four arbitrarily-chosen test states. We chose these prior methods because, to our knowledge, they are the only methods which can handle unknown dynamics.

We used a neural network cost function with two hidden layers with 24–52 units and rectifying nonlinearities of the form max(z, 0), followed by linear connections to a set of features yt, which had a size of 20 for the 2D navigation task and 100 for the other two tasks. The cost is then given by

cθ(xt, ut) = ‖Ayt + b‖² + wu‖ut‖²,        (2)

with a fixed torque weight wu and the parameters consisting of A, b, and the network weights. These cost functions range from about 3,000 parameters for the 2D navigation task to 16,000 parameters for peg insertion. For further details, see Appendix C. Although the prior methods learn only linear cost functions, we can extend them to the nonlinear setting following the derivation in Section 4.1.
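As an illustration of the parameterization in Equation (2), a forward pass could look like the following sketch; the hidden width, feature dimension, torque weight, and input dimensions here are hypothetical placeholders rather than the values used in the paper.

```python
import numpy as np

class NeuralNetCost:
    """Two-hidden-layer cost of the form c(x, u) = ||A y + b||^2 + w_u ||u||^2,
    where y is the feature output of a ReLU network applied to the raw state x."""

    def __init__(self, dim_x, dim_feat=20, hidden=40, torque_weight=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.1 * rng.standard_normal((hidden, dim_x))
        self.W2 = 0.1 * rng.standard_normal((hidden, hidden))
        self.W3 = 0.1 * rng.standard_normal((dim_feat, hidden))   # linear map to features y_t
        self.A = 0.1 * rng.standard_normal((dim_feat, dim_feat))
        self.b = np.zeros(dim_feat)
        self.wu = torque_weight

    def __call__(self, x, u):
        h1 = np.maximum(self.W1 @ x, 0.0)   # rectifying nonlinearity max(z, 0)
        h2 = np.maximum(self.W2 @ h1, 0.0)
        y = self.W3 @ h2                    # features y_t
        return float(np.sum((self.A @ y + self.b) ** 2) + self.wu * np.sum(u ** 2))

# Hypothetical usage: a 26-dimensional state and 7-dimensional torque command.
cost = NeuralNetCost(dim_x=26)
c = cost(np.zeros(26), np.zeros(7))
```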
Figure 2 illustrates the tasks and shows results for each method after different numbers of samples from the test condition. In our method, five samples were used at each iteration of policy optimization, while for the prior methods, the number of samples corresponds to the number of "suboptimal" samples provided for cost learning. For the prior methods, additional samples were used to optimize the learned cost. The results indicate that our method is generally capable of learning tasks that are more complex than the prior methods, and is able to effectively handle complex, high-dimensional neural network cost functions. In particular, adding more samples for the prior methods generally does not improve their performance, because all of the samples are drawn from the same distribution.

6.2. Real-World Robotic Control

We also evaluated our method on a set of real robotic manipulation tasks using the PR2 robot, with comparisons to relative entropy IRL, which we found to be the better of the two prior methods in our simulated experiments. We chose two robotic manipulation tasks which involve complex dynamics and interactions with delicate objects, for which it is challenging to write down a cost function by hand. For all methods, we used a two-layer neural network cost parametrization and the regularization objective described in Section 5, and compared to an affine cost function on one task to evaluate the importance of nonlinear cost representations. The affine cost followed the form of Equation (2) but with yt equal to the input xt; note that a cost function that is quadratic in the state is linear in the coefficients of the monomials, and therefore corresponds to a linear parameterization. For both tasks, between 25 and 30 human demonstrations were provided via kinesthetic teaching, and each IOC algorithm was initialized by automatically fitting a controller to the demonstrations that roughly tracked the average trajectory. Full details on both tasks are in Appendix D, and summaries are below.

Figure 3. Dish placement and pouring tasks (images of the human demo, initial pose, and final pose for each task). The robot learned to place the plate gently into the correct slot, and to pour almonds, localizing the target cup using unsupervised visual features. A video of the learned controllers can be found at http://rll.berkeley.edu/gcl

In the first task, illustrated in Figure 3, the robot must gently place a grasped plate into a specific slot of a dish rack. The state space consists of the joint angles, the pose of the gripper relative to the target pose, and the time derivatives of each; the actions correspond to torques on the robot's motors; and the input to the cost function is the pose and velocity of the gripper relative to the target position. Note that we do not provide the robot with an existing trajectory tracking controller or any manually-designed policy representation beyond linear-Gaussian controllers, in contrast to prior methods that use trajectory following (Kalakrishnan et al., 2013) or dynamic movement primitives with features (Boularias et al., 2011). Our attempt to design a hand-crafted cost function for inserting the plate into the dish rack produced a fast but overly aggressive behavior that cracked one of the plates during learning.

The second task, also shown in Figure 3, consisted of pouring almonds from one cup to another. In order to succeed, the robot must keep the cup upright until reaching the target cup, then rotate the cup so that the almonds are poured. Instead of including the position of the target cup in the state space, we train autoencoder features from camera images captured from the demonstrations and add a pruned feature point representation and its time derivative to the state, as proposed by Finn et al. (2016). The input to the cost function includes these visual features, as well as the pose and velocity of the gripper. Note that the position of the target cup can only be obtained from the visual features, so the algorithm must learn to use them in the cost function in order to succeed at the task.
Table 1. Performance of guided cost learning (GCL) and relative entropy (RelEnt) IRL on placing a dish into a rack and pouring almonds into a cup. Sample counts are for IOC, omitting those for optimizing the learned cost. An affine cost is insufficient for representing the pouring task, motivating the use of a neural network (NN) cost. The pouring task with a neural network cost is evaluated for two positions of the target cup; average performance is reported.

                       RelEnt IRL    GCL q(ut|xt)    GCL reopt.
dish (NN)
  success rate         0%            100%            100%
  # samples            100           90              90
pouring (NN)
  success rate         10%           84.7%           34%
  # samples            150,150       75,130          75,130
pouring (affine)
  success rate         0%            0%              –
  # samples            150           120             –

The results, presented in Table 1, show that our algorithm successfully learned both tasks. The prior relative entropy IRL algorithm could not acquire a suitable cost function, due to the complexity of this domain. On the pouring task, where we also evaluated a simpler affine cost function, we found that only the neural network representation could recover a successful behavior, illustrating the need for rich and expressive function approximators when learning cost functions directly on raw state representations. (We did attempt to learn costs directly on image pixels, but found that the problem was too underdetermined to succeed; better image-specific regularization is likely required for this.)

The results in Table 1 also evaluate the generalizability of the cost function learned by our method and prior work. On the dish rack task, we can use the learned cost to optimize new policies for different target dish positions successfully, while the prior method does not produce a generalizable cost function. On the harder pouring task, we found that the learned cost succeeded less often on new positions. However, as discussed in Section 4.4, our method produces both a policy q(ut|xt) and a cost function cθ when trained on a novel instance of the task, and although the learned cost functions for this task were worse, the learned policy succeeded on the test positions when optimized with IOC in the inner loop using our algorithm. This indicates an interesting property of our approach: although the learned cost function is local in nature due to the choice of sampling distribution, the learned policy tends to succeed even when the cost function is too local to produce good results in very different situations. An interesting avenue for future work is to further explore the implications of this property, and to improve the generalizability of the learned cost by successively training policies on different novel instances of the task until enough global training data is available to produce a cost function that is a good fit to the demonstrations in previously unseen parts of the state space.

7. Discussion and Future Work

We presented an inverse optimal control algorithm that can learn complex, nonlinear cost representations, such as neural networks, and can be applied to high-dimensional systems with unknown dynamics. Our algorithm uses a sample-based approximation of the maximum entropy IOC objective, with samples generated from a policy learning algorithm based on local linear models (Levine & Abbeel, 2014). To our knowledge, this approach is the first to combine the benefits of sample-based IOC under unknown dynamics with nonlinear cost representations that directly use the raw state of the system, without the need for manual feature engineering. This allows us to apply our method to a variety of real-world robotic manipulation tasks. Our evaluation demonstrates that our method outperforms prior IOC algorithms on a set of simulated benchmarks, and achieves good results on several real-world tasks.

Our evaluation shows that our approach can learn good cost functions for a variety of simulated tasks. For complex robotic motion skills, the learned cost functions tend to explain the demonstrations only locally. This makes them difficult to reoptimize from scratch for new conditions. It should be noted that this challenge is not unique to our method: in our comparisons, no prior sample-based method was able to learn good global costs for these tasks. However, since our method interleaves cost optimization with policy learning, it still recovers successful policies for these tasks. For this reason, we can still learn from demonstration simply by retaining the learned policy, and discarding the cost function. This allows us to tackle substantially more challenging tasks that involve direct torque control of real robotic systems with feedback from vision.

To incorporate vision into our experiments, we used unsupervised learning to acquire a vision-based state representation, following prior work (Finn et al., 2016). An exciting avenue for future work is to extend our approach to learn cost functions directly from natural images. The principal challenge for this extension is to avoid overfitting when using substantially larger and more expressive networks. Our current regularization techniques mitigate overfitting to a high degree, but visual inputs tend to vary dramatically between demonstrations and on-policy samples, particularly when the demonstrations are provided by a human via kinesthetic teaching. One promising avenue for mitigating these challenges is to introduce regularization methods developed for domain adaptation in computer vision (Tzeng et al., 2015), to encode the prior knowledge that demonstrations have similar visual features to samples.
References

Abbeel, P. and Ng, A. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2004.

Aghasadeghi, N. and Bretl, T. Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals. In International Conference on Intelligent Robots and Systems (IROS), 2011.

Audiffren, J., Valko, M., Lazaric, A., and Ghavamzadeh, M. Maximum entropy semi-supervised inverse reinforcement learning. In International Joint Conference on Artificial Intelligence (IJCAI), July 2015.

Bagnell, J. A. and Schneider, J. Covariant policy search. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.

Boularias, A., Kober, J., and Peters, J. Relative entropy inverse reinforcement learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Byravan, A., Monfort, M., Ziebart, B., Boots, B., and Fox, D. Graph-based inverse optimal control for robot manipulation. In International Joint Conference on Artificial Intelligence (IJCAI), 2015.

Doerr, A., Ratliff, N., Bohg, J., Toussaint, M., and Schaal, S. Direct loss minimization inverse optimal control. In Proceedings of Robotics: Science and Systems (R:SS), Rome, Italy, July 2015.

Dragan, A. and Srinivasa, S. Formalizing assistive teleoperation. In Proceedings of Robotics: Science and Systems (R:SS), Sydney, Australia, July 2012.

Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. Deep spatial autoencoders for visuomotor learning. In International Conference on Robotics and Automation (ICRA), 2016.

Huang, D. and Kitani, K. Action-reaction: Forecasting the dynamics of human interaction. In European Conference on Computer Vision (ECCV), 2014.

Kalakrishnan, M., Pastor, P., Righetti, L., and Schaal, S. Learning objective functions for manipulation. In International Conference on Robotics and Automation (ICRA), 2013.

Levine, S., Popovic, Z., and Koltun, V. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems (NIPS), 2011.

Levine, S., Wagener, N., and Abbeel, P. Learning contact-rich manipulation skills with guided policy search. In International Conference on Robotics and Automation (ICRA), 2015.

Monfort, M., Lake, B. M., Ziebart, B., Lucey, P., and Tenenbaum, J. Softstar: Heuristic-guided probabilistic inference. In Advances in Neural Information Processing Systems (NIPS), pp. 2746–2754, 2015.

Muelling, K., Boularias, A., Mohler, B., Schölkopf, B., and Peters, J. Learning strategies in table tennis using inverse reinforcement learning. Biological Cybernetics, 108(5), 2014.

Ng, A., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), 1999.

Ng, A., Russell, S., et al. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2000.

Peters, J., Mülling, K., and Altün, Y. Relative entropy policy search. In AAAI Conference on Artificial Intelligence, 2010.

Ramachandran, D. and Amir, E. Bayesian inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, volume 51, 2007.

Ratliff, N., Bagnell, J. A., and Zinkevich, M. A. Maximum margin planning. In International Conference on Machine Learning (ICML), 2006.

Ratliff, N., Bradley, D., Bagnell, J. A., and Chestnutt, J. Boosting structured prediction for imitation learning. 2007.

Ratliff, N., Silver, D., and Bagnell, J. A. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1), 2009.