FEATURE-BASED VS GAN-BASED LEARNING FROM DEMONSTRATIONS: WHEN AND WHY
ABSTRACT
ACKNOWLEDGMENTS
This survey was proofread by Zhiyang Dou, Tairan He, Xuxin Cheng, Zhengyi Luo, and Chen Tessler.
Their expertise in the field and constructive suggestions were instrumental in shaping the final form
of this work.
1 DISCLAIMER
The terminology surrounding the use of offline reference data in reinforcement learning (RL) varies
widely across the literature. Terms such as imitation learning, learning from demonstrations, and
demonstration learning are often used interchangeably, despite referring to subtly different method-
ologies or assumptions.
In this survey, we adopt the term learning from demonstrations to specifically denote a class of
methods that utilize state-based, offline reference data to derive a reward signal. This reward signal
quantifies the similarity between the behavior of a learning agent and that of the reference trajectories,
and it is used to guide policy optimization.
This definition intentionally excludes methods based on behavior cloning that require action anno-
tations, such as those used in recent large-scale manipulation datasets (e.g., Gr00t N1 (Bjorck et al.,
2025), diffusion policy (Chi et al., 2023), Gemini Robotics (Team et al., 2025)). These approaches
assume access to expert action labels and thus follow a different paradigm than the class of methods
discussed here, which operate solely on state observations and rely on RL to generate control.
methods, such as feature-based versus GAN-based approaches. Practitioners often adopt one method
over another based on precedent or anecdotal success, without a systematic analysis of the algorithmic
factors that underlie their performance. As a result, conclusions drawn from empirical success may
conflate algorithmic merit with incidental choices in reward design, data selection, or architecture.
The objective of this article is to provide a principled comparison between feature-based and
GAN-based imitation methods, focusing on their fundamental assumptions, inductive biases, and
operational regimes. The exposition proceeds in two stages. First, we review the problem setting
from the perspective of physics-based control and reinforcement learning, including the formulation
of reward functions based on reference trajectories. Second, we examine the historical development
and current landscape of imitation methods, organized around the type of reward structure they use:
explicit, feature-based formulations versus implicit, adversarially learned metrics.
Our goal is not to advocate for one approach over the other in general, but to clarify the conditions
under which each is more suitable. By articulating the trade-offs involved, including scalability,
stability, generalization, and representation learning, we aim to provide a conceptual framework that
supports more informed method selection in future work.
In both character animation and robotics, physics-based control refers to a paradigm in which an
agent’s behavior is governed by the underlying physical dynamics of the system, either simulated or
real. Rather than prescribing trajectories explicitly, such as joint angles or end-effector poses, this
approach formulates control as a process of goal-directed optimization, where a policy generates
control signals (e.g., torques or muscle activations) to maximize an objective function under physical
constraints. This stands in contrast to kinematics-based or keyframe-based methods, which
often disregard dynamics and focus on geometrically feasible but potentially physically implausible
motions. Physics-based control ensures that resulting behaviors are not only kinematically valid but
also dynamically consistent, energy-conservative, and responsive to interaction forces, making it
particularly suited for tasks involving locomotion, balance, and physical interaction in uncertain or
dynamic environments.
The canonical formalism for this control paradigm is the Markov Decision Process (MDP), defined
by a tuple (S, A, T, R, γ), where S and A denote the state and action spaces, respectively. The
transition kernel T : S × A × S → [0, 1] captures the environment dynamics p(st+1 | st, at), while the
reward function R : S × A × S → ℝ maps transitions to scalar rewards. The agent seeks to learn a
policy πθ : S → A that maximizes the expected discounted return Eπθ[ Σt≥0 γ^t rt ], where rt is the
reward at time t and γ ∈ [0, 1] is the discount factor.
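To make the return objective concrete, the following minimal Python sketch (ours, not taken from any cited codebase) evaluates the discounted return of one finite rollout; the small numeric example checks the formula by hand.

```python
from typing import Sequence

def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Compute sum_{t >= 0} gamma^t * r_t for one finite rollout."""
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret

# Example with gamma = 0.9: 1.0 + 0.9*0.5 + 0.81*0.25 = 1.6525
print(discounted_return([1.0, 0.5, 0.25], gamma=0.9))
```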
In this context, the state s ∈ S typically encodes the agent’s physical configuration and dynamics,
such as joint positions, joint velocities, and root orientation, and may include exteroceptive inputs like
terrain geometry or object pose. The action a ∈ A corresponds to the control input applied to
the system, most commonly joint torques in torque-controlled settings, or target positions in PD-
controlled systems. In biomechanical models, actions may also represent muscle activations. By
integrating these elements within a physics simulator or physical system, physics-based control
enables emergent behaviors that are compatible with real-world dynamics, allowing policies to
discover strategies that are not only effective but also physically feasible.
In the context of learning from demonstrations, reward functions are typically derived from reference
data, rather than being manually engineered to reflect task success or motion quality. This setup
leverages recorded trajectories, often collected from motion capture, teleoperation, or other expert
sources, to define a notion of behavioral similarity. The policy is then optimized to minimize this
discrepancy, encouraging it to reproduce motions that are consistent with those in the demonstration
dataset.
Critically, the reward derived from demonstrations may serve either as a pure imitation objective,
where the policy is expected to replicate the demonstrated behavior as closely as possible, or as a
Figure 1: DeepMimic-style feature-based methods. The policy receives dense, per-frame rewards
by comparing hand-crafted features—such as joint positions and end-effector poses—between
its current state and a time-aligned reference state. A phase variable synchronizes policy and
demonstration trajectories, enabling accurate motion reproduction but limiting generalization across
diverse behaviors due to the lack of structured motion representation.
regularizing component that biases learning while allowing task-specific objectives to dominate.
This dual role makes demonstration-based rewards particularly valuable in high-dimensional control
problems where exploration is difficult and task-based rewards are sparse or poorly shaped. As such,
learning from demonstrations transforms the design of the reward function from a manual engineering
problem into one of defining or learning an appropriate similarity metric between agent and expert
behavior, either explicitly, through features, or implicitly, through discriminators or encoders.
While reference trajectories are often valued for their visual realism or naturalness, this perspective
underemphasizes their algorithmic utility: reference data serves as a critical mechanism for improv-
ing learning efficiency in high-dimensional control problems. Rather than functioning merely as
a constraint or prior, demonstrations provide structured guidance that biases policy exploration
toward plausible and meaningful behaviors.
This role becomes especially important as the complexity of the environment and agent increases. In
lower-dimensional settings, carefully engineered reward functions or manually designed curricula
have proven sufficient to elicit sophisticated behaviors through reinforcement learning alone (Rudin
et al., 2022). However, such strategies do not scale effectively to systems with high-dimensional
state-action spaces, where naïve exploration is inefficient and reward shaping becomes brittle or
intractable. Under these conditions, demonstration data offers a practical alternative to reward or
environment shaping, acting as an inductive bias that accelerates the discovery of viable behaviors.
In this light, reference motions are not ancillary constraints but primary learning signals, particularly
in regimes where task-based supervision is sparse or difficult to specify. This reframing justifies the
use of demonstrations not only for imitation but as a foundation for scalable and data-efficient policy
learning.
Feature-based imitation approaches can be traced back to DeepMimic (Peng et al., 2018), which
established a now-standard formulation for constructing reward signals based on explicit motion
matching. In this framework, the policy is aligned with a reference trajectory by introducing a phase
variable that indexes temporal progress through the motion. The reward is
computed by evaluating feature-wise distances—such as joint positions, velocities, orientations, and
end-effector positions—between the policy-generated trajectory and the reference, synchronized via
the phase. An abstracted overview is shown in Fig. 1.
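To make this reward structure concrete, the sketch below follows the general DeepMimic-style recipe described above: exponentiated, weighted feature errors between the simulated state and the phase-indexed reference frame. The feature keys, weights, and error scales are illustrative placeholders rather than the values of any particular implementation.

```python
import numpy as np

def imitation_reward(state, ref, w_pose=0.65, w_vel=0.1, w_end=0.15, w_root=0.1):
    """Weighted sum of exponentiated feature errors between the current simulated
    state and the time-aligned reference frame. `state` and `ref` are dicts of
    numpy arrays; the keys and scales here are illustrative assumptions."""
    r_pose = np.exp(-2.0  * np.sum((state["joint_pos"] - ref["joint_pos"]) ** 2))
    r_vel  = np.exp(-0.1  * np.sum((state["joint_vel"] - ref["joint_vel"]) ** 2))
    r_end  = np.exp(-40.0 * np.sum((state["end_eff"]   - ref["end_eff"])   ** 2))
    r_root = np.exp(-10.0 * np.sum((state["root_pos"]  - ref["root_pos"])  ** 2))
    return w_pose * r_pose + w_vel * r_vel + w_end * r_end + w_root * r_root

# The reference frame is selected by the phase variable phi in [0, 1), e.g.
# ref = motion_clip.frame_at(phi), with phi advanced at the clip's playback rate
# (motion_clip.frame_at is a hypothetical helper used only for illustration).
```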
Owing to their dense and explicit reward structure, these methods are highly effective at reproducing
fine-grained motion details. However, their scalability to diverse motion datasets is limited. While
DeepMimic introduces a one-hot motion identifier to enable multi-clip training, this encoding does
not model semantic or structural relationships between different motions. As a result, the policy
treats each motion clip as an isolated objective, which precludes generalization and often leads to
discontinuities at transition points.
Figure 2: GAN-based methods via adversarial rewards. A discriminator learns to distinguish short
transition snippets from policy-generated and demonstration data, providing an implicit reward signal
that guides the policy toward expert-like behavior. By operating on short windows without explicit
time alignment, this approach scales to diverse motion datasets and captures distributional similarity,
enabling smoother transitions across unstructured behaviors.
Although the phase variable handles temporal alignment within a given clip, there is no analogous
mechanism for enforcing spatial or semantic coherence across clips. Transitions between motions
are implemented via hard switching on motion identifiers, which can result in abrupt behavioral
changes and visually unnatural trajectories. What is missing in this setup is a structured representa-
tion space over motions—one that captures both temporal progression and the underlying topology
of behavioral variation. Such representations enable not only smoother transitions between behaviors
but also facilitate interpolation, compositionality, and improved generalization to motions not seen
during training. Policies trained over these structured motion spaces are better equipped to synthesize
new behaviors while preserving physical plausibility and stylistic fidelity.
To address the limitations of feature-based approaches in handling diverse motion data, Adversarial
Motion Priors (AMP) (Peng et al., 2021) introduced the use of adversarial training, building on earlier
frameworks such as GAIL (Ho & Ermon, 2016), where expert action labels are assumed. In the AMP
setting, a discriminator is trained to distinguish between state transitions generated by the policy
and those sampled from a dataset of reference trajectories. As the policy improves, its transitions
become increasingly indistinguishable from the expert data, thereby reducing the discriminator’s
ability to classify them correctly. The discriminator’s output serves as a reward signal, guiding the
policy toward behavioral fidelity. The system is illustrated in Fig. 2.
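The sketch below shows one common way to turn a discriminator score on a state transition into a policy reward, in the spirit of the setup above. The clipped least-squares form follows the style often paired with AMP-like discriminators, and a GAIL-style log form is included for contrast; the discriminator module, its architecture, and the feature concatenation are assumptions of this sketch.

```python
import torch

def adversarial_reward(discriminator, s, s_next, style="lsgan"):
    """Map a discriminator score on the transition (s, s') to a scalar imitation
    reward. `discriminator` is assumed to return a raw, unbounded score."""
    with torch.no_grad():
        d = discriminator(torch.cat([s, s_next], dim=-1))
    if style == "lsgan":
        # Least-squares variant (real -> +1, fake -> -1), clipped to stay non-negative.
        return torch.clamp(1.0 - 0.25 * (d - 1.0) ** 2, min=0.0)
    else:
        # GAIL-style reward from a sigmoid discriminator: -log(1 - D(s, s')).
        return -torch.log(torch.clamp(1.0 - torch.sigmoid(d), min=1e-6))
```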
From an optimization standpoint, GAN-based methods treat the policy as a generator in a two-player
minimax game. These methods scale naturally to large and diverse motion datasets, as they operate
on short, fixed-length transition windows, typically spanning two to eight frames, rather than
full trajectories. This removes the need for phase-based or time-indexed alignment, making them
particularly effective in unstructured datasets. Additionally, the discriminator implicitly defines a
similarity metric over motion fragments, allowing transitions that are behaviorally similar to receive
comparable rewards even when not temporally aligned. As a result, policies trained under adversarial
objectives tend to exhibit smoother transitions across behaviors compared to methods relying on
discrete motion identifiers and hard switching. Because the reward is defined over distributional
similarity, rather than matching a specific trajectory, AMP and related techniques are well-suited
for stylization tasks or for serving as general motion priors that can be composed with task-specific
objectives.
Despite their empirical success across domains, including character animation (e.g., InterPhys (Hassan
et al., 2023), PACER (Rempe et al., 2023)) and robotics (Escontrela et al., 2022), adversarial imitation
introduces fundamental challenges that impact training reliability and policy expressiveness.
Discriminator Saturation A key challenge in adversarial setups is that the discriminator can rapidly
become overconfident, especially early in training when the policy generates trajectories that diverge
significantly from the reference distribution. In this regime, the discriminator easily classifies all
transitions correctly, producing near-zero gradients and leaving the policy without an informative reward signal.
Figure 3: Latent-conditioned GAN-based methods. The policy and discriminator are jointly con-
ditioned on learned motion embeddings, which are derived from demonstration data through unsu-
pervised or supervised representation learning. These latent variables structure the imitation space,
promoting behavioral diversity, stabilizing training, and enabling controllable skill generation beyond
what implicit adversarial objectives can achieve alone.
Figure 4: Feature-based methods with structured motion representations. The policy receives per-
frame rewards based on feature differences with reference states and is conditioned on compact
motion embeddings derived from demonstration data. This design preserves the interpretability
of hand-crafted objectives while enabling smoother transitions and broader generalization across
behaviors through learned motion structure.
While adversarial imitation methods offer flexibility and scalability with diverse reference data, they
impose significant practical burdens. Ensuring training stability, managing discriminator saturation,
and preventing mode collapse often require extensive architectural tuning. These limitations have
motivated a return to feature-based methods, now enhanced with structured motion representations,
as a more interpretable and controllable alternative to adversarial training. The core insight behind
this renewed direction is the importance of a well-structured motion representation space for enabling
smooth transitions and generalization across behaviors. While GAN-based methods rely on the
discriminator to implicitly induce such a representation, often requiring additional mechanisms to
extract, control, or condition on it, feature-based approaches allow for the explicit construction
of motion embeddings that are either precomputed or learned in parallel with policy training. This
explicitness simplifies conditioning and reward design, often reducing the reward to weighted feature
differences relative to a reference state. Such systems can be abstracted with a structure illustrated in
Fig. 4.
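A minimal sketch of how such a system might be wired together, assuming a pretrained (or jointly trained) encoder that maps a short reference window to a latent embedding z: the policy observes proprioception plus z instead of a one-hot clip identifier, while the reward remains an explicit feature difference against the current reference frame, as in the earlier feature-based sketch. All module names are illustrative.

```python
import numpy as np

def policy_observation(proprio, ref_window, encoder):
    """Condition the policy on a compact motion embedding instead of a clip ID.
    `encoder` maps a short window of reference frames to a latent vector z;
    its architecture and training procedure are assumptions of this sketch."""
    z = encoder(ref_window)               # latent summary of the upcoming reference motion
    return np.concatenate([proprio, z])   # policy input: proprioception + motion latent

# The reward stays explicit and interpretable, e.g. the weighted feature
# differences from the earlier sketch, evaluated against the current
# reference frame ref_window[0].
```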
As a result, a new class of imitation approaches has emerged that maintains the explicit reward
structure of traditional feature-based methods, but augments it with representation learning to
scale across tasks and motions. In many cases, reference frames, or compact summaries thereof, are
injected directly into the policy, providing frame-level tracking targets that guide behavior.
Sophisticated Motion Representation A central challenge for this class of methods is the construc-
tion of motion representations that support smooth transitions and structural generalization. Compact,
low-dimensional embeddings promote semantic understanding of inter-motion relationships and
improve sample efficiency.
To this end, some methods inject reference features or full motion states directly into the policy (e.g.,
PhysHOI (Wang et al., 2023), ExBody (Cheng et al., 2024), H2O (He et al., 2024b), HumanPlus (Fu
et al., 2024), MaskedMimic (Tessler et al., 2024), ExBody2 (Ji et al., 2024), OmniH2O (He et al.,
2024a), AMO (Li et al., 2025), TWIST (Ze et al., 2025), GMT (Chen et al., 2025)), preserving spatial
coherence in the motion space. Others pursue more abstract embeddings through self-supervised or
policy-conditioned learning. For instance, ControlVAE (Yao et al., 2022), PhysicsVAE (Won et al.,
2022), and NCP (Zhu et al., 2023) build representations via policy interaction, while VMP (Serifi et al.,
2024b) and RobotMDM (Serifi et al., 2024a) construct temporally and spatially coherent embeddings
using self-supervision. Frequency-domain methods such as PAE (Starke et al., 2022), FLD (Li et al.,
2024), and DFM (Watanabe et al., 2025) impose motion-inductive biases that capture the periodic
and hierarchical structure of motion. These techniques collectively extend the DeepMimic paradigm
by generalizing phase alignment and structural similarity beyond heuristics.
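As a toy illustration of the frequency-domain inductive bias (not the PAE, FLD, or DFM architectures themselves, which learn these quantities end-to-end), the snippet below extracts the dominant frequency, amplitude, and phase of a single joint-trajectory channel with a plain FFT.

```python
import numpy as np

def dominant_phase(signal, dt):
    """Estimate frequency, amplitude, and phase of the strongest periodic
    component of a 1-D joint trajectory sampled at interval dt. A toy stand-in
    for learned periodic embeddings, used only to illustrate the idea."""
    x = signal - signal.mean()              # remove the DC offset
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=dt)
    k = np.argmax(np.abs(spec[1:])) + 1     # skip the zero-frequency bin
    amplitude = 2.0 * np.abs(spec[k]) / len(x)
    phase = np.angle(spec[k])
    return freqs[k], amplitude, phase
```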
Inflexible Imitation Adaptation A limitation of these representation-driven feature-based methods
is that they often rely on explicit tracking of full trajectories, enforced by dense per-step rewards.
This design makes it difficult to adapt or deviate from the reference when auxiliary tasks require
flexibility, as is common in goal-directed or interaction-heavy settings.
To address this, some approaches introduce mechanisms to adaptively relax imitation constraints. For
example, MCP (Sleiman et al., 2024) introduces a fallback mechanism that adjusts phase progression
when key task objectives are not met. RobotKeyframing (Zargarbashi et al., 2024) proposes a
transformer-based attention model that encodes arbitrary sets of keyframes with flexible temporal
spacing. Other works incorporate high-level planning components to dictate intermediate reference
states, such as diffusion-based models in PARC (Xu et al., 2025) and HMI (Fan et al., 2025), or
planners that directly modulate the learned motion representations (e.g., VQ-PMC (Han et al., 2024),
Motion Priors Reimagined (Zhang et al., 2025)).
Together, these developments illustrate the interpretability and stability of feature-based imitation
when paired with structured motion representations. However, despite avoiding the instability of
adversarial training, these methods remain constrained by their reliance on explicit tracking and
overengineered representations, which can hinder adaptation in tasks requiring flexible deviation
from demonstrations.
GAN-based: AMP (Peng et al., 2021; Escontrela et al., 2022), InterPhys (Hassan et al., 2023),
PACER (Rempe et al., 2023), WASABI (Li et al., 2023b), HumanMimic (Tang et al., 2024),
CASSI (Li et al., 2023a), ASE (Peng et al., 2022), CALM (Tessler et al., 2023),
Multi-AMP (Vollenweider et al., 2023), CASE (Dou et al., 2023),
SMPLOlympics (Luo et al., 2024), FB-CPR (Tirinzoni et al., 2025), PHC (Luo et al., 2023a),
PHC+ (Luo et al., 2023b), PULSE (Luo et al., 2023b).

Feature-based: DeepMimic (Peng et al., 2018), PhysHOI (Wang et al., 2023), ExBody (Cheng et al., 2024),
H2O (He et al., 2024b), HumanPlus (Fu et al., 2024), MaskedMimic (Tessler et al., 2024),
ExBody2 (Ji et al., 2024), OmniH2O (He et al., 2024a), AMO (Li et al., 2025),
TWIST (Ze et al., 2025), GMT (Chen et al., 2025), ControlVAE (Yao et al., 2022),
PhysicsVAE (Won et al., 2022), NCP (Zhu et al., 2023), VMP (Serifi et al., 2024b),
RobotMDM (Serifi et al., 2024a), PAE (Starke et al., 2022), FLD (Li et al., 2024),
DFM (Watanabe et al., 2025), MCP (Sleiman et al., 2024),
RobotKeyframing (Zargarbashi et al., 2024), PARC (Xu et al., 2025), HMI (Fan et al., 2025),
VQ-PMC (Han et al., 2024), Motion Priors Reimagined (Zhang et al., 2025).
GAN-based approaches, such as AMP and its derivatives, use a discriminator to assign reward signals
based on the realism of short transition snippets. This formulation dispenses with time-aligned
supervision, allowing policies to imitate motion in a distributional sense rather than reproducing
specific trajectories. As a result, these methods scale naturally to unstructured or unlabeled data,
enabling smoother transitions between behaviors and generalization beyond the demonstrated clips.
Recent advances mitigate some of the core challenges of GAN-based imitation, namely discriminator
saturation and mode collapse, by introducing latent structure. These techniques learn motion embeddings
that condition both policy and discriminator, thereby stabilizing training and supporting controllable
behavior generation. These latent-conditioned GANs can also model semantic structure in motion
space, facilitating interpolation and compositionality.
Despite these benefits, GAN-based methods remain prone to training instability, require careful dis-
criminator design, and often offer coarser control over motion details. Their implicit reward structure
can obscure performance tuning and requires auxiliary mechanisms for precise task alignment.
In contrast, feature-based imitation methods like DeepMimic start with dense, per-frame reward
functions derived from specific motion features. This yields strong supervision for motion matching,
making them highly effective for replicating fine-grained details in demonstrated behavior. However,
traditional approaches are limited by their dependence on hard-coded alignment and lack of
structured motion representation, which restricts scalability and generalization.
Recent developments address these limitations by integrating learned motion representations into the
reward and policy structure. These efforts construct latent motion embeddings to structure behavior
across clips, enabling smoother transitions and support for more diverse or compositional motions.
This new generation of feature-based methods retains interpretability and strong reward signals while
gaining some of the flexibility previously unique to GAN-based setups.
Nevertheless, feature-based systems still face challenges in adapting to auxiliary tasks or goals that
require deviation from the reference trajectory. Their strong reliance on explicit tracking and dense
supervision can make them brittle in dynamic or multi-objective settings, where flexibility is crucial.
should focus on the properties most directly influenced by algorithmic design. These include reward
signal quality, training stability, generalization to novel motions or environments, and adaptability to
auxiliary tasks. By focusing on such factors, researchers and practitioners can better understand the
operational trade-offs between feature-based and GAN-based approaches, avoiding overgeneralized
claims and grounding comparisons in algorithmic substance rather than incidental outcome metrics.
This is partially true. GAN-based methods implicitly learn a similarity function via the discriminator.
However, this function may be ill-defined in early training, leading to discriminator saturation, where
the discriminator assigns uniformly high distances regardless of policy improvement. Moreover, the
discriminator may conflate resemblance to a single exemplar with similarity to the overall distribution,
resulting in mode collapse. Thus, while a learned metric exists, its utility and stability depend heavily
on discriminator design and representation quality.
No. This assertion overlooks a key implementation detail: the discriminator operates on selected
features of the agent state. Choosing these features is analogous to defining reward components in
feature-based methods. Insufficient features can prevent the discriminator from detecting meaningful
discrepancies, while overly complex inputs can lead to rapid overfitting and saturation. This trade-off
is particularly critical in tasks involving partially observed context (e.g., terrain or object interactions),
where feature selection significantly impacts training stability and convergence.
Not quite. While adversarial methods circumvent explicit manual weighting of reward components,
they are still sensitive to feature scaling and normalization. Input magnitudes shape the discriminator’s
sensitivity and therefore act as an implicit weighting scheme. Poorly calibrated inputs can bias the
reward signal, undermining the interpretability and reliability of the learned policy.
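A common mitigation, sketched below, is to normalize the features fed to the discriminator with running statistics so that raw magnitudes do not act as an unintended weighting scheme. The class is a generic running normalizer written for illustration and is not tied to any cited codebase.

```python
import numpy as np

class RunningNormalizer:
    """Normalize discriminator/observation features with running mean and std,
    preventing input magnitudes from implicitly re-weighting the reward."""
    def __init__(self, dim, eps=1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = eps

    def update(self, batch):
        """Fold a batch (shape [n, dim]) into the running statistics."""
        b_mean, b_var, b_n = batch.mean(0), batch.var(0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + b_n
        self.mean += delta * b_n / total
        # Parallel variance combination (Chan et al.).
        self.var = (self.var * self.count + b_var * b_n
                    + delta ** 2 * self.count * b_n / total) / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)
```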
This holds true only relative to early feature-based methods that lacked structured representations
and relied on hard switching between clips. Modern feature-based methods that leverage structured
motion embeddings can produce smooth, semantically meaningful transitions. Interpolation in learned
latent spaces supports temporally and spatially coherent motion generation, rivaling or exceeding
GAN-based transitions when appropriate representation learning is applied.
No. Both GAN-based and feature-based methods can incorporate task objectives. Feature-based
methods provide dense, frame-aligned imitation rewards, making them effective when the task aligns
closely with the reference motion, but less flexible when deviation is required. In contrast, GAN-based
methods offer distribution-level supervision, enabling greater adaptability to auxiliary goals. This
flexibility, however, comes at the cost of lower fidelity to the reference and a risk of mode collapse.
the cost of discarding fine motion details. Feature-based approaches, especially those employing
probabilistic or variational models, can also handle noise effectively through regularization and
representation smoothing.
Not necessarily. Scalability is more a function of motion representation quality than of paradigm.
Both GAN-based and feature-based methods can scale with large datasets if equipped with appropriate
latent encodings. The difference lies in when and how these representations are learned—feature-
based methods often rely on supervised or self-supervised embeddings, while GAN-based methods
may induce representations via adversarial feedback. Neither approach guarantees scalability without
careful design.
No. There is no intrinsic connection between the choice of imitation algorithm and sim-to-real transfer
efficacy. Transferability is determined primarily by external strategies such as domain randomization,
system identification, and regularization. While GAN-based approaches may respond more flexibly to
auxiliary rewards, they are also more sensitive to regularization, which can create the false impression
that certain regularizers are more effective in these methods.
Generalization depends less on the reward structure and more on the quality and organization of the
motion representation space. Both GAN-based and feature-based methods can generalize effectively
when equipped with well-structured embeddings. Failure modes arise not from the paradigm itself but
from inadequate inductive biases, insufficient diversity in training data, or poor temporal modeling.
Not necessarily. Designing robust feature-based systems involves selecting appropriate reward
features, constructing phase functions or embeddings, and managing temporal alignment. These
tasks can be as complex as designing a discriminator, particularly when the goal is to scale across
tasks or environments. Moreover, effective latent representations often require pretraining and careful
architectural choices to avoid collapse or disentanglement failure.
11 FINAL REMARKS
This survey has examined two major paradigms in learning from demonstrations, feature-based
and GAN-based methods, through the lens of reward structure, scalability, generalization, and
representation. The core distinction lies not merely in architectural components but in their respective
philosophies of supervision: explicit, hand-crafted rewards versus implicit, adversarially learned
objectives.
Feature-based methods offer dense, interpretable rewards that strongly anchor the policy to reference
trajectories, making them well-suited for tasks requiring high-fidelity reproduction of demonstrated
motions. However, they often struggle with generalization, particularly in multi-clip or unstructured
settings, due to the need for manually specified features and aligned references.
GAN-based methods, in contrast, provide more flexible and data-driven reward structures through
discriminative objectives. This enables them to scale naturally to diverse datasets and to support
smoother transitions and behavior interpolation. Yet, they often encounter challenges related to
training stability, reward sparsity, and loss of fine-grained motion detail.
It is important to recognize that many problems commonly attributed to one paradigm reappear
in different forms in the other. For instance, mode collapse in GANs mirrors the brittleness of poor
motion representations in feature-based methods. Similarly, while feature-based methods offer strong
guidance for motion tracking, they may fail to generalize or adapt when rigid reward definitions are
misaligned with auxiliary tasks or dynamic environments.
Rather than presenting these two paradigms as mutually exclusive, recent trends point toward a
convergent perspective, one that emphasizes the centrality of structured motion representations.
Whether derived from self-supervised learning, latent encodings, or manually designed summaries,
these representations serve as a bridge between the strengths of each approach: the interpretability
and controllability of explicit rewards and the scalability and adaptability of adversarial training.
Ultimately, the decision between using a feature-based or GAN-based approach is not a question of
universal superiority. Instead, it should be guided by the specific constraints and priorities of the
application: fidelity versus diversity, interpretability versus flexibility, or training simplicity versus
large-scale generalization. Understanding these trade-offs and their relationship to reward structure
and motion representation is essential for designing robust, scalable, and expressive imitation learning
systems.
REFERENCES
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang,
Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist
humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt:
General motion tracking for humanoid whole-body control. arXiv preprint arXiv:2506.14770,
2025.
Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, and Xiaolong Wang. Expressive
whole-body control for humanoid robots. arXiv preprint arXiv:2402.16796, 2024.
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake,
and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The
International Journal of Robotics Research, pp. 02783649241273668, 2023.
Zhiyang Dou, Xuelin Chen, Qingnan Fan, Taku Komura, and Wenping Wang. C·ase: Learning
conditional adversarial skill embeddings for physics-based characters. In SIGGRAPH Asia 2023
Conference Papers, pp. 1–11, 2023.
Alejandro Escontrela, Xue Bin Peng, Wenhao Yu, Tingnan Zhang, Atil Iscen, Ken Goldberg, and
Pieter Abbeel. Adversarial motion priors make good substitutes for complex reward functions. In
2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 25–32.
IEEE, 2022.
Yahao Fan, Tianxiang Gui, Kaiyang Ji, Shutong Ding, Chixuan Zhang, Jiayuan Gu, Jingyi Yu, Jingya
Wang, and Ye Shi. One policy but many worlds: A scalable unified policy for versatile humanoid
locomotion. arXiv preprint arXiv:2505.18780, 2025.
Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid
shadowing and imitation from humans. arXiv preprint arXiv:2406.10454, 2024.
Lei Han, Qingxu Zhu, Jiapeng Sheng, Chong Zhang, Tingguang Li, Yizheng Zhang, He Zhang,
Yuzhen Liu, Cheng Zhou, Rui Zhao, et al. Lifelike agility and play in quadrupedal robots using
reinforcement learning and generative pre-trained models. Nature Machine Intelligence, 6(7):
787–798, 2024.
Mohamed Hassan, Yunrong Guo, Tingwu Wang, Michael Black, Sanja Fidler, and Xue Bin Peng. Syn-
thesizing physical character-scene interactions. In ACM SIGGRAPH 2023 Conference Proceedings,
pp. 1–9, 2023.
Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu
Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body
teleoperation and learning. arXiv preprint arXiv:2406.08858, 2024a.
Tairan He, Zhengyi Luo, Wenli Xiao, Chong Zhang, Kris Kitani, Changliu Liu, and Guanya Shi.
Learning human-to-humanoid real-time whole-body teleoperation. In 2024 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pp. 8944–8951. IEEE, 2024b.
Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in neural
information processing systems, 29, 2016.
Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiaolong Wang.
Exbody2: Advanced expressive humanoid whole-body control. arXiv preprint arXiv:2412.13196,
2024.
Chenhao Li, Sebastian Blaes, Pavel Kolev, Marin Vlastelica, Jonas Frey, and Georg Martius. Versatile
skill control via self-supervised adversarial imitation of unlabeled mixed motions. In 2023 IEEE
international conference on robotics and automation (ICRA), pp. 2944–2950. IEEE, 2023a.
Chenhao Li, Marin Vlastelica, Sebastian Blaes, Jonas Frey, Felix Grimminger, and Georg Martius.
Learning agile skills via adversarial imitation of rough partial demonstrations. In Conference on
Robot Learning, pp. 342–352. PMLR, 2023b.
Chenhao Li, Elijah Stanger-Jones, Steve Heim, and Sangbae Kim. Fld: Fourier latent dynamics for
structured motion representation and learning. arXiv preprint arXiv:2402.13820, 2024.
Jialong Li, Xuxin Cheng, Tianshu Huang, Shiqi Yang, Ri-Zhao Qiu, and Xiaolong Wang. Amo:
Adaptive motion optimization for hyper-dexterous humanoid whole-body control. arXiv preprint
arXiv:2505.03738, 2025.
Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time
simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pp. 10895–10904, 2023a.
Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, and Weipeng
Xu. Universal humanoid motion representations for physics-based control. arXiv preprint
arXiv:2310.04582, 2023b.
Zhengyi Luo, Jiashun Wang, Kangni Liu, Haotian Zhang, Chen Tessler, Jingbo Wang, Ye Yuan,
Jinkun Cao, Zihui Lin, Fengyi Wang, et al. Smplolympics: Sports environments for physically
simulated humanoids. arXiv preprint arXiv:2407.00187, 2024.
Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel Van de Panne. Deepmimic: Example-
guided deep reinforcement learning of physics-based character skills. ACM Transactions On
Graphics (TOG), 37(4):1–14, 2018.
Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial
motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG),
40(4):1–20, 2021.
Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. Ase: Large-scale reusable
adversarial skill embeddings for physically simulated characters. ACM Transactions On Graphics
(TOG), 41(4):1–17, 2022.
Davis Rempe, Zhengyi Luo, Xue Bin Peng, Ye Yuan, Kris Kitani, Karsten Kreis, Sanja Fidler, and
Or Litany. Trace and pace: Controllable pedestrian animation via guided trajectory diffusion.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
13756–13766, 2023.
Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using
massively parallel deep reinforcement learning. In Conference on Robot Learning, pp. 91–100.
PMLR, 2022.
Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross, and Moritz Bächer. Robot motion
diffusion model: Motion generation for robotic characters. In SIGGRAPH Asia 2024 Conference
Papers, pp. 1–9, 2024a.
Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross, and Moritz Bächer. Vmp: Versatile
motion priors for robustly tracking motion on physical characters. In Computer Graphics Forum,
volume 43, pp. e15175. Wiley Online Library, 2024b.
Jean-Pierre Sleiman, Mayank Mittal, and Marco Hutter. Guided reinforcement learning for robust
multi-contact loco-manipulation. In 8th Annual Conference on Robot Learning (CoRL 2024),
2024.
Sebastian Starke, Ian Mason, and Taku Komura. Deepphase: Periodic autoencoders for learning
motion phase manifolds. ACM Transactions on Graphics (ToG), 41(4):1–13, 2022.
Annan Tang, Takuma Hiraoka, Naoki Hiraoka, Fan Shi, Kento Kawaharazuka, Kunio Kojima, Kei
Okada, and Masayuki Inaba. Humanmimic: Learning natural locomotion and transitions for
humanoid robot via wasserstein adversarial imitation. In 2024 IEEE International Conference on
Robotics and Automation (ICRA), pp. 13107–13114. IEEE, 2024.
Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montser-
rat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza,
Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint
arXiv:2503.20020, 2025.
Chen Tessler, Yoni Kasten, Yunrong Guo, Shie Mannor, Gal Chechik, and Xue Bin Peng. Calm:
Conditional adversarial latent models for directable virtual characters. In ACM SIGGRAPH 2023
Conference Proceedings, pp. 1–9, 2023.
Chen Tessler, Yunrong Guo, Ofir Nabati, Gal Chechik, and Xue Bin Peng. Maskedmimic: Unified
physics-based character control through masked motion inpainting. ACM Transactions on Graphics
(TOG), 43(6):1–21, 2024.
Andrea Tirinzoni, Ahmed Touati, Jesse Farebrother, Mateusz Guzek, Anssi Kanervisto, Yingchen Xu,
Alessandro Lazaric, and Matteo Pirotta. Zero-shot whole-body humanoid control via behavioral
foundation models. arXiv preprint arXiv:2504.11054, 2025.
Eric Vollenweider, Marko Bjelonic, Victor Klemm, Nikita Rudin, Joonho Lee, and Marco Hutter.
Advanced skills through multiple adversarial motion priors in reinforcement learning. In 2023
IEEE International Conference on Robotics and Automation (ICRA), pp. 5120–5126. IEEE, 2023.
Yinhuai Wang, Jing Lin, Ailing Zeng, Zhengyi Luo, Jian Zhang, and Lei Zhang. Physhoi: Physics-
based imitation of dynamic human-object interaction. arXiv preprint arXiv:2312.04393, 2023.
Ryo Watanabe, Chenhao Li, and Marco Hutter. Dfm: Deep fourier mimic for expressive dance
motion learning. arXiv preprint arXiv:2502.10980, 2025.
Jungdam Won, Deepak Gopinath, and Jessica Hodgins. Physics-based character controllers using
conditional vaes. ACM Transactions on Graphics (TOG), 41(4):1–12, 2022.
Michael Xu, Yi Shi, KangKang Yin, and Xue Bin Peng. Parc: Physics-based augmentation with
reinforcement learning for character controllers. arXiv preprint arXiv:2505.04002, 2025.
Heyuan Yao, Zhenhua Song, Baoquan Chen, and Libin Liu. Controlvae: Model-based learning of
generative controllers for physics-based characters. ACM Transactions on Graphics (TOG), 41(6):
1–16, 2022.
Fatemeh Zargarbashi, Jin Cheng, Dongho Kang, Robert Sumner, and Stelian Coros. Robotkeyframing:
Learning locomotion with high-level objectives via mixture of dense and sparse rewards. arXiv
preprint arXiv:2407.11562, 2024.
Yanjie Ze, Zixuan Chen, João Pedro Araújo, Zi-ang Cao, Xue Bin Peng, Jiajun Wu, and C Karen
Liu. Twist: Teleoperated whole-body imitation system. arXiv preprint arXiv:2505.02833, 2025.
Zewei Zhang, Chenhao Li, Takahiro Miki, and Marco Hutter. Motion priors reimagined: Adapting
flat-terrain skills for complex quadruped mobility. arXiv preprint arXiv:2505.16084, 2025.
Qingxu Zhu, He Zhang, Mengting Lan, and Lei Han. Neural categorical priors for physics-based
character control. ACM Transactions on Graphics (TOG), 42(6):1–16, 2023.