FEATURE-BASED VS. GAN-BASED LEARNING FROM DEMONSTRATIONS: WHEN AND WHY
Chenhao Li Marco Hutter Andreas Krause
ETH AI Center ETH Zurich ETH Zurich
chenhli@[Link] mahutter@[Link] krausea@[Link]

ABSTRACT

This survey provides a comparative analysis of feature-based and GAN-based approaches to learning
from demonstrations, with a focus on the structure of reward functions and their implications for
policy learning. Feature-based methods offer
dense, interpretable rewards that excel at high-fidelity motion imitation, yet often
require sophisticated representations of references and struggle with generalization
in unstructured settings. GAN-based methods, in contrast, use implicit, distribu-
tional supervision that enables scalability and adaptation flexibility, but are prone
to training instability and coarse reward signals. Recent advancements in both
paradigms converge on the importance of structured motion representations, which
enable smoother transitions, controllable synthesis, and improved task integration.
We argue that the dichotomy between feature-based and GAN-based methods is
increasingly nuanced: rather than one paradigm dominating the other, the choice
should be guided by task-specific priorities such as fidelity, diversity, interpretabil-
ity, and adaptability. This work outlines the algorithmic trade-offs and design
considerations that underlie method selection, offering a framework for principled
decision-making in learning from demonstrations.

ACKNOWLEDGMENTS

This survey was proofread by Zhiyang Dou, Tairan He, Xuxin Cheng, Zhengyi Luo, and Chen Tessler.
Their expertise in the field and constructive suggestions were instrumental in shaping the final form
of this work.

1 DISCLAIMER
The terminology surrounding the use of offline reference data in reinforcement learning (RL) varies
widely across the literature. Terms such as imitation learning, learning from demonstrations, and
demonstration learning are often used interchangeably, despite referring to subtly different method-
ologies or assumptions.
In this survey, we adopt the term learning from demonstrations to specifically denote a class of
methods that utilize state-based, offline reference data to derive a reward signal. This reward signal
quantifies the similarity between the behavior of a learning agent and that of the reference trajectories,
and it is used to guide policy optimization.
This definition intentionally excludes methods based on behavior cloning that require action anno-
tations, such as those used in recent large-scale manipulation datasets (e.g., Gr00t N1 (Bjorck et al.,
2025), diffusion policy (Chi et al., 2023), Gemini Robotics (Team et al., 2025)). These approaches
assume access to expert action labels and thus follow a different paradigm than the class of methods
discussed here, which operate solely on state observations and rely on RL to generate control.

2 MOTIVATION AND SCOPE


While learning from demonstrations has become a widely adopted strategy in both robotics and
character animation, the field lacks consistent guidance on when to prefer particular classes of
methods, such as feature-based versus GAN-based approaches. Practitioners often adopt one method
over another based on precedent or anecdotal success, without a systematic analysis of the algorithmic
factors that underlie their performance. As a result, conclusions drawn from empirical success may
conflate algorithmic merit with incidental choices in reward design, data selection, or architecture.
The objective of this article is to provide a principled comparison between feature-based and
GAN-based imitation methods, focusing on their fundamental assumptions, inductive biases, and
operational regimes. The exposition proceeds in two stages. First, we review the problem setting
from the perspective of physics-based control and reinforcement learning, including the formulation
of reward functions based on reference trajectories. Second, we examine the historical development
and current landscape of imitation methods, organized around the type of reward structure they use:
explicit, feature-based formulations versus implicit, adversarially learned metrics.
Our goal is not to advocate for one approach over the other in general, but to clarify the conditions
under which each is more suitable. By articulating the trade-offs involved, including scalability,
stability, generalization, and representation learning, we aim to provide a conceptual framework that
supports more informed method selection in future work.

3 PHYSICS-BASED CONTROL, STATES AND ACTIONS

In both character animation and robotics, physics-based control refers to a paradigm in which an
agent’s behavior is governed by the underlying physical dynamics of the system, either simulated or
real. Rather than prescribing trajectories explicitly, such as joint angles or end-effector poses, this
approach formulates control as a process of goal-directed optimization, where a policy generates
control signals (e.g., torques or muscle activations) to maximize an objective function under physical
constraints. This stands in contrast to kinematics-based or keyframe-based methods, which
often disregard dynamics and focus on geometrically feasible but potentially physically implausible
motions. Physics-based control ensures that resulting behaviors are not only kinematically valid but
also dynamically consistent, energy-conservative, and responsive to interaction forces, making it
particularly suited for tasks involving locomotion, balance, and physical interaction in uncertain or
dynamic environments.
The canonical formalism for this control paradigm is the Markov Decision Process (MDP), defined
by a tuple (S, A, T, R, γ), where S and A denote the state and action spaces, respectively. The
transition kernel T : S × A → S captures the environment dynamics p(s_{t+1} | s_t, a_t), while the
reward function R : S × A × S → ℝ maps transitions to scalar rewards. The agent seeks to learn a
policy π_θ : S → A that maximizes the expected discounted return E_{π_θ}[Σ_{t≥0} γ^t r_t], where r_t
is the reward at time t and γ ∈ [0, 1] is the discount factor.
In this context, the state s ∈ S typically encodes the agent’s physical configuration and dynamics,
such as joint positions, joint velocities, root orientation, and may include exteroceptive inputs like
terrain geometry or object pose. The action a ∈ A corresponds to the control input applied to
the system, most commonly joint torques in torque-controlled settings, or target positions in PD-
controlled systems. In biomechanical models, actions may also represent muscle activations. By
integrating these elements within a physics simulator or physical system, physics-based control
enables emergent behaviors that are compatible with real-world dynamics, allowing policies to
discover strategies that are not only effective but also physically feasible.
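
To make the objective concrete, the following minimal Python sketch estimates the discounted return of one rollout. It assumes a generic gym-style reset()/step() simulator interface and an arbitrary policy callable; both are placeholders for illustration, not objects defined in this survey.

```python
def rollout_return(env, policy, gamma=0.99, horizon=1000):
    """Estimate the discounted return of one episode in a physics-based MDP.

    `env` is assumed to expose a gym-style reset()/step() interface and
    `policy` maps states to control inputs (e.g., joint torques or PD targets);
    both are hypothetical placeholders.
    """
    state = env.reset()
    total_return, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)                     # control signal a_t
        state, reward, done, _ = env.step(action)  # simulator advances the physical dynamics
        total_return += discount * reward          # accumulate gamma^t * r_t
        discount *= gamma
        if done:
            break
    return total_return
```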

4 RETHINKING LEARNING FROM DEMONSTRATIONS

In the context of learning from demonstrations, reward functions are typically derived from reference
data, rather than being manually engineered to reflect task success or motion quality. This setup
leverages recorded trajectories, often collected from motion capture, teleoperation, or other expert
sources, to define a notion of behavioral similarity. The policy is then optimized to minimize this
discrepancy, encouraging it to reproduce motions that are consistent with those in the demonstration
dataset.
Critically, the reward derived from demonstrations may serve either as a pure imitation objective,
where the policy is expected to replicate the demonstrated behavior as closely as possible, or as a


Figure 1: DeepMimic-style feature-based methods. The policy receives dense, per-frame rewards
by comparing hand-crafted features—such as joint positions and end-effector poses—between
its current state and a time-aligned reference state. A phase variable synchronizes policy and
demonstration trajectories, enabling accurate motion reproduction but limiting generalization across
diverse behaviors due to the lack of structured motion representation.

regularizing component that biases learning while allowing task-specific objectives to dominate.
This dual role makes demonstration-based rewards particularly valuable in high-dimensional control
problems where exploration is difficult and task-based rewards are sparse or poorly shaped. As such,
learning from demonstrations transforms the design of the reward function from a manual engineering
problem into one of defining or learning an appropriate similarity metric between agent and expert
behavior, either explicitly, through features, or implicitly, through discriminators or encoders.
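
As a deliberately simple illustration of this dual role, the reward can be written as a weighted combination of an imitation term and a task term. The weights below are hypothetical knobs for illustration, not values prescribed by any particular method.

```python
def combined_reward(r_imitation, r_task, w_imitation=0.5, w_task=0.5):
    """Blend an imitation reward with a task reward.

    Setting w_imitation high (and w_task low) approaches pure imitation,
    whereas a small w_imitation uses the demonstrations only as a
    regularizer that biases exploration toward plausible behaviors.
    """
    return w_imitation * r_imitation + w_task * r_task
```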
While reference trajectories are often valued for their visual realism or naturalness, this perspective
underemphasizes their algorithmic utility: reference data serves as a critical mechanism for improv-
ing learning efficiency in high-dimensional control problems. Rather than functioning merely as
a constraint or prior, demonstrations provide structured guidance that biases policy exploration
toward plausible and meaningful behaviors.
This role becomes especially important as the complexity of the environment and agent increases. In
lower-dimensional settings, carefully engineered reward functions or manually designed curricula
have proven sufficient to elicit sophisticated behaviors through reinforcement learning alone (Rudin
et al., 2022). However, such strategies do not scale effectively to systems with high-dimensional
state-action spaces, where naïve exploration is inefficient and reward shaping becomes brittle or
intractable. Under these conditions, demonstration data offers a practical alternative to reward or
environment shaping, acting as an inductive bias that accelerates the discovery of viable behaviors.
In this light, reference motions are not ancillary constraints but primary learning signals, particularly
in regimes where task-based supervision is sparse or difficult to specify. This reframing justifies the
use of demonstrations not only for imitation but as a foundation for scalable and data-efficient policy
learning.

5 FEATURE-BASED IMITATION: ORIGINS AND LIMITATIONS

Feature-based imitation approaches can be traced back to DeepMimic (Peng et al., 2018), which
established a now-standard formulation for constructing reward signals based on explicit motion
matching. In this framework, the policy is aligned with a reference trajectory by introducing a phase
variable, which serves as a learned proxy for temporal progress through the motion. The reward is
computed by evaluating feature-wise distances—such as joint positions, velocities, orientations, and
end-effector positions—between the policy-generated trajectory and the reference, synchronized via
the phase. An abstracted overview is shown in Fig. 1.
Owing to their dense and explicit reward structure, these methods are highly effective at reproducing
fine-grained motion details. However, their scalability to diverse motion datasets is limited. While
DeepMimic introduces a one-hot motion identifier to enable multi-clip training, this encoding does
not model semantic or structural relationships between different motions. As a result, the policy
treats each motion clip as an isolated objective, which precludes generalization and often leads to
discontinuities at transition points.
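
The reward structure can be sketched as follows. The feature keys, error scales, and weights are illustrative of the DeepMimic-style formulation rather than the paper's exact quantities, and the reference state is assumed to be the frame selected by the phase variable.

```python
import numpy as np

def feature_based_reward(state, ref_state,
                         weights=(0.65, 0.1, 0.15, 0.1),
                         scales=(2.0, 0.1, 40.0, 10.0)):
    """DeepMimic-style dense imitation reward from hand-crafted features.

    `state` and `ref_state` are dicts of numpy arrays (joint angles, joint
    velocities, end-effector positions, root position); `ref_state` is the
    time-aligned reference frame indexed by the phase variable. Keys, weights,
    and scales are illustrative placeholders.
    """
    keys = ("joint_angles", "joint_velocities", "end_effector_pos", "root_pos")
    reward = 0.0
    for key, w, k in zip(keys, weights, scales):
        sq_err = np.sum((state[key] - ref_state[key]) ** 2)
        reward += w * np.exp(-k * sq_err)  # each term is bounded in (0, 1]
    return reward
```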


Figure 2: GAN-based methods via adversarial rewards. A discriminator learns to distinguish short
transition snippets from policy-generated and demonstration data, providing an implicit reward signal
that guides the policy toward expert-like behavior. By operating on short windows without explicit
time alignment, this approach scales to diverse motion datasets and captures distributional similarity,
enabling smoother transitions across unstructured behaviors.

Although the phase variable handles temporal alignment within a given clip, there is no analogous
mechanism for enforcing spatial or semantic coherence across clips. Transitions between motions
are implemented via hard switching on motion identifiers, which can result in abrupt behavioral
changes and visually unnatural trajectories. What is missing in this setup is a structured representa-
tion space over motions—one that captures both temporal progression and the underlying topology
of behavioral variation. Such representations enable not only smoother transitions between behaviors
but also facilitate interpolation, compositionality, and improved generalization to motions not seen
during training. Policies trained over these structured motion spaces are better equipped to synthesize
new behaviors while preserving physical plausibility and stylistic fidelity.

6 IMPLICIT REWARDS FOR MOTION DIVERSITY: GAN-BASED IMITATION

To address the limitations of feature-based approaches in handling diverse motion data, Adversarial
Motion Priors (AMP) (Peng et al., 2021) introduced the use of adversarial training, building on earlier
frameworks such as GAIL (Ho & Ermon, 2016), where expert action labels are assumed. In the AMP
setting, a discriminator is trained to distinguish between state transitions generated by the policy
and those sampled from a dataset of reference trajectories. As the policy improves, its transitions
become increasingly indistinguishable from the expert data, thereby reducing the discriminator’s
ability to classify them correctly. The discriminator’s output serves as a reward signal, guiding the
policy toward behavioral fidelity. The system is illustrated in Fig. 2.
From an optimization standpoint, GAN-based methods treat the policy as a generator in a two-player
minimax game. These methods scale naturally to large and diverse motion datasets, as they operate
on short, fixed-length transition windows, typically spanning two to eight frames, rather than
full trajectories. This removes the need for phase-based or time-indexed alignment, making them
particularly effective in unstructured datasets. Additionally, the discriminator implicitly defines a
similarity metric over motion fragments, allowing transitions that are behaviorally similar to receive
comparable rewards even when not temporally aligned. As a result, policies trained under adversarial
objectives tend to exhibit smoother transitions across behaviors compared to methods relying on
discrete motion identifiers and hard switching. Because the reward is defined over distributional
similarity, rather than matching a specific trajectory, AMP and related techniques are well-suited
for stylization tasks or for serving as general motion priors that can be composed with task-specific
objectives.
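
A minimal sketch of such an adversarial reward is given below in PyTorch, assuming transition features have already been selected. The least-squares targets and the clipped reward transform follow the formulation reported for AMP, but the network architecture and shapes are illustrative.

```python
import torch
import torch.nn as nn

class TransitionDiscriminator(nn.Module):
    """MLP scoring short (s, s') transition snippets; architecture is illustrative."""

    def __init__(self, feature_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

def adversarial_reward(discriminator, s, s_next):
    """Map discriminator scores to a bounded style reward.

    Assumes a least-squares objective in which the discriminator targets +1 on
    reference transitions and -1 on policy transitions, as in AMP-style setups.
    """
    with torch.no_grad():
        d = discriminator(s, s_next)
    return torch.clamp(1.0 - 0.25 * (d - 1.0) ** 2, min=0.0)
```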
Despite their empirical success across domains, including character animation (e.g., InterPhys (Hassan
et al., 2023), PACER (Rempe et al., 2023)) and robotics (Escontrela et al., 2022), adversarial imitation
introduces fundamental challenges that impact training reliability and policy expressiveness.
Discriminator Saturation. A key challenge in adversarial setups is that the discriminator can rapidly
become overconfident, especially early in training when the policy generates trajectories that diverge
significantly from the reference distribution. In this regime, the discriminator easily classifies all
transitions correctly, producing near-zero gradients and leaving the policy without informative reward


Figure 3: Latent-conditioned GAN-based methods. The policy and discriminator are jointly con-
ditioned on learned motion embeddings, which are derived from demonstration data through unsu-
pervised or supervised representation learning. These latent variables structure the imitation space,
promoting behavioral diversity, stabilizing training, and enabling controllable skill generation beyond
what implicit adversarial objectives can achieve alone.

signals. This phenomenon is particularly problematic in high-dimensional or difficult environments,
such as rough terrain locomotion or manipulation tasks, where meaningful exploration is essential
but sparse.
Solutions such as Wasserstein-based objectives (e.g., WASABI (Li et al., 2023b), HumanMimic (Tang
et al., 2024)) aim to retain useful gradients, and therefore informative reward signals, even in the
face of a strong discriminator.
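
As a generic sketch of this idea, the critic below follows a standard WGAN-GP objective rather than the exact WASABI or HumanMimic losses: its unbounded score keeps a non-vanishing gradient, and hence a usable reward, even when policy and reference transitions are easy to tell apart.

```python
import torch

def wasserstein_critic_loss(critic, expert_batch, policy_batch, gp_weight=10.0):
    """WGAN-GP-style critic loss over transition features (illustrative sketch).

    `critic` maps a batch of transition features to unbounded scalar scores.
    Minimizing this loss widens the expert-policy score gap while the gradient
    penalty keeps the critic approximately 1-Lipschitz.
    """
    score_expert = critic(expert_batch).mean()
    score_policy = critic(policy_batch).mean()

    # Gradient penalty on random interpolates between expert and policy samples.
    alpha = torch.rand(expert_batch.size(0), 1, device=expert_batch.device)
    interp = (alpha * expert_batch + (1 - alpha) * policy_batch).requires_grad_(True)
    grad = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    penalty = ((grad.norm(2, dim=-1) - 1.0) ** 2).mean()

    return -(score_expert - score_policy) + gp_weight * penalty
```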
Mode Collapse. Another failure mode is the collapse of behavioral diversity: the policy may converge
to producing only a narrow subset of trajectories that reliably fool the discriminator, ignoring the
wider variation present in the demonstrations. While the discriminator implicitly encourages local
smoothness in the reward landscape, AMP lacks a structured motion representation that would
enable global diversity or controllable behavior synthesis. Consequently, the resulting policies often
underutilize the full range of skills present in the data.
To counteract this limitation, a variety of techniques introduce latent representations to provide
structured control over motion variation as shown in Fig. 3.
Unsupervised approaches like CASSI (Li et al., 2023a), ASE (Peng et al., 2022), and CALM (Tessler
et al., 2023) learn continuous embeddings over motion space, optimizing mutual information be-
tween latent codes and observed behaviors to preserve diversity. These embeddings are then used
to condition the policy, enabling the generation of distinct behaviors from different regions of the
latent space. Other approaches rely on category-level supervision to guide the learning process. For
example, Multi-AMP (Vollenweider et al., 2023), CASE (Dou et al., 2023), and SMPLOlympics (Luo
et al., 2024) use motion class annotations to condition both the discriminator and the policy, thereby
restricting collapse to occur only within class-specific subregions. In contrast, FB-CPR (Tirinzoni
et al., 2025) adopts a representation-based solution, learning forward-backward encodings to structure
the discriminator’s feedback. Several other extensions train individual motion primitives progressively
(e.g., PHC (Luo et al., 2023a), PHC+ (Luo et al., 2023b)) and use a conditioned skill composer to
recover motion diversity. Others introduce representation distillation with variational
bottlenecks, as in PULSE (Luo et al., 2023b), to form compressed yet expressive motion embeddings
for controllable generation.
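
A common ingredient across these latent-conditioned variants is that both the policy input and the discriminator input are augmented with a motion embedding z. The sketch below shows only this conditioning pattern; it is not the architecture of any specific method, and the policy is assumed to receive the same z alongside its state.

```python
import torch
import torch.nn as nn

class LatentConditionedDiscriminator(nn.Module):
    """Scores (s, s') transitions conditioned on a motion embedding z (illustrative)."""

    def __init__(self, feature_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feature_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, s_next, z):
        # The same z that conditions the policy is appended to the transition features,
        # so realism is judged relative to the commanded skill rather than globally.
        return self.net(torch.cat([s, s_next, z], dim=-1))
```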
Together, these developments highlight both the flexibility and complexity of adversarial imitation
learning. While GAN-based methods naturally scale to large and diverse datasets, they benefit
substantially from the addition of structured motion representations, whether learned, annotated, or
composed, to stabilize training and recover controllable, diverse behavior.


Figure 4: Feature-based methods with structured motion representations. The policy receives per-
frame rewards based on feature differences with reference states and is conditioned on compact
motion embeddings derived from demonstration data. This design preserves the interpretability
of hand-crafted objectives while enabling smoother transitions and broader generalization across
behaviors through learned motion structure.

7 FEATURE-BASED IMITATION WITH STRUCTURED REPRESENTATIONS

While adversarial imitation methods offer flexibility and scalability with diverse reference data, they
impose significant practical burdens. Ensuring training stability, managing discriminator saturation,
and preventing mode collapse often require extensive architectural tuning. These limitations have
motivated a return to feature-based methods, now enhanced with structured motion representations,
as a more interpretable and controllable alternative to adversarial training. The core insight behind
this renewed direction is the importance of a well-structured motion representation space for enabling
smooth transitions and generalization across behaviors. While GAN-based methods rely on the
discriminator to implicitly induce such a representation, often requiring additional mechanisms to
extract, control, or condition on it, feature-based approaches allow for the explicit construction
of motion embeddings that are either precomputed or learned in parallel with policy training. This
explicitness simplifies conditioning and reward design, often reducing the reward to weighted feature
differences relative to a reference state. Such systems can be abstracted with a structure illustrated in
Fig. 4.
As a result, a new class of imitation approaches has emerged that maintains the explicit reward
structure of traditional feature-based methods, but augments it with representation learning to
scale across tasks and motions. In many cases, reference frames, or compact summaries thereof, are
injected directly into the policy, providing frame-level tracking targets that guide behavior.
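
One simple way to realize this injection is to concatenate the proprioceptive state with either raw upcoming reference frames or a compact embedding of them. The helper below is a hypothetical sketch; the names, shapes, and optional encoder are chosen for illustration only.

```python
import numpy as np

def build_tracking_observation(proprio, ref_window, encoder=None):
    """Assemble a policy observation for feature-based tracking.

    `proprio` is the agent's proprioceptive state (1-D array), `ref_window` a
    list of upcoming reference frames (1-D arrays), and `encoder` an optional,
    hypothetical motion-embedding model mapping the window to a compact vector.
    """
    if encoder is not None:
        target = encoder(ref_window)          # compact motion embedding
    else:
        target = np.concatenate(ref_window)   # raw frame-level tracking targets
    return np.concatenate([proprio, target])
```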
Sophisticated Motion Representation. A central challenge for this class of methods is the construc-
tion of motion representations that support smooth transitions and structural generalization. Compact,
low-dimensional embeddings promote semantic understanding of inter-motion relationships and
improve sample efficiency.
To this end, some methods inject reference features or full motion states directly into the policy (e.g.,
PhysHOI (Wang et al., 2023), ExBody (Cheng et al., 2024), H2O (He et al., 2024b), HumanPlus (Fu
et al., 2024), MaskedMimic (Tessler et al., 2024), ExBody2 (Ji et al., 2024), OmniH2O (He et al.,
2024a), AMO (Li et al., 2025), TWIST (Ze et al., 2025), GMT (Chen et al., 2025)), preserving spatial
coherence in the motion space. Others pursue more abstract embeddings through self-supervised or
policy-conditioned learning. For instance, ControlVAE (Yao et al., 2022), PhysicsVAE (Won et al.,
2022), and NCP (Zhu et al., 2023) build representations via policy interaction, while VMP (Serifi et al.,
2024b) and RobotMDM (Serifi et al., 2024a) construct temporally and spatially coherent embeddings
using self-supervision. Frequency-domain methods such as PAE (Starke et al., 2022), FLD (Li et al.,
2024), and DFM (Watanabe et al., 2025) impose motion-inductive biases that capture the periodic
and hierarchical structure of motion. These techniques collectively extend the DeepMimic paradigm
by generalizing phase alignment and structural similarity beyond heuristics.
Inflexible Imitation Adaptation. A limitation of these representation-driven feature-based methods
is that they often rely on explicit tracking of full trajectories, enforced by dense per-step rewards.
This design makes it difficult to adapt or deviate from the reference when auxiliary tasks require
flexibility, as is common in goal-directed or interaction-heavy settings.
To address this, some approaches introduce mechanisms to adaptively relax imitation constraints. For
example, MCP (Sleiman et al., 2024) introduces a fallback mechanism that adjusts phase progression
when key task objectives are not met. RobotKeyframing (Zargarbashi et al., 2024) proposes a
transformer-based attention model that encodes arbitrary sets of keyframes with flexible temporal
spacing. Other works incorporate high-level planning components to dictate intermediate reference
states, such as diffusion-based models in PARC (Xu et al., 2025) and HMI (Fan et al., 2025), or
planners that directly modulate the learned motion representations (e.g., VQ-PMC (Han et al., 2024),
Motion Priors Reimagined (Zhang et al., 2025)).
Together, these developments illustrate the interpretability and stability of feature-based imitation
when paired with structured motion representations. However, despite avoiding the instability of
adversarial training, these methods remain constrained by their reliance on explicit tracking and
overengineered representations, which can hinder adaptation in tasks requiring flexible deviation
from demonstrations.

8 SUMMARY: STRENGTHS, LIMITATIONS, AND EMERGING DIRECTIONS


Learning from demonstrations has evolved into two primary methodological paradigms: feature-
based methods, which use explicit, hand-crafted reward formulations, and GAN-based methods,
which employ discriminators to implicitly shape behavior. Each offers distinct advantages and faces
unique challenges, especially as the field shifts toward learning from large, diverse, and unstructured
motion datasets. We summarize the aforementioned works in Table 1.
Table 1: Taxonomy of learning from demonstration methods.

GAN-based
AMP (Peng et al., 2021; Escontrela et al., 2022), InterPhys (Hassan et al., 2023),
PACER (Rempe et al., 2023), WASABI (Li et al., 2023b), HumanMimic (Tang et al., 2024),
CASSI (Li et al., 2023a), ASE (Peng et al., 2022), CALM (Tessler et al., 2023),
Multi-AMP (Vollenweider et al., 2023), CASE (Dou et al., 2023),
SMPLOlympics (Luo et al., 2024), FB-CPR (Tirinzoni et al., 2025), PHC (Luo et al., 2023a),
PHC+ (Luo et al., 2023b), PULSE (Luo et al., 2023b)
Feature-based
DeepMimic (Peng et al., 2018), PhysHOI (Wang et al., 2023), ExBody (Cheng et al., 2024),
H2O (He et al., 2024b), HumanPlus (Fu et al., 2024), MaskedMimic (Tessler et al., 2024),
ExBody2 (Ji et al., 2024), OmniH2O (He et al., 2024a), AMO (Li et al., 2025),
TWIST (Ze et al., 2025), GMT (Chen et al., 2025), ControlVAE (Yao et al., 2022),
PhysicsVAE (Won et al., 2022), NCP (Zhu et al., 2023), VMP (Serifi et al., 2024b),
RobotMDM (Serifi et al., 2024a), PAE (Starke et al., 2022), FLD (Li et al., 2024),
DFM (Watanabe et al., 2025), MCP (Sleiman et al., 2024),
RobotKeyframing (Zargarbashi et al., 2024), PARC (Xu et al., 2025), HMI (Fan et al., 2025),
VQ-PMC (Han et al., 2024), Motion Priors Reimagined (Zhang et al., 2025)

8.1 GAN-BASED METHODS

GAN-based approaches, such as AMP and its derivatives, use a discriminator to assign reward signals
based on the realism of short transition snippets. This formulation dispenses with time-aligned
supervision, allowing policies to imitate motion in a distributional sense rather than reproducing
specific trajectories. As a result, these methods scale naturally to unstructured or unlabeled data,
enabling smoother transitions between behaviors and generalization beyond the demonstrated clips.
Recent advances mitigate some of the core challenges of GAN-based imitation, namely, discriminator
saturation and mode collapse, by introducing latent structure. These techniques learn motion embeddings
that condition both policy and discriminator, thereby stabilizing training and supporting controllable
behavior generation. These latent-conditioned GANs can also model semantic structure in motion
space, facilitating interpolation and compositionality.
Despite these benefits, GAN-based methods remain prone to training instability, require careful dis-
criminator design, and often offer coarser control over motion details. Their implicit reward structure
can obscure performance tuning and requires auxiliary mechanisms for precise task alignment.

8.2 FEATURE-BASED METHODS

In contrast, feature-based imitation methods like DeepMimic start with dense, per-frame reward
functions derived from specific motion features. This yields strong supervision for motion matching,
making them highly effective for replicating fine-grained details in demonstrated behavior. However,
traditional approaches are limited by their dependence on hard-coded alignment and lack of
structured motion representation, which restricts scalability and generalization.
Recent developments address these limitations by integrating learned motion representations into the
reward and policy structure. These efforts construct latent motion embeddings to structure behavior
across clips, enabling smoother transitions and support for more diverse or compositional motions.
This new generation of feature-based methods retains interpretability and strong reward signals while
gaining some of the flexibility previously unique to GAN-based setups.
Nevertheless, feature-based systems still face challenges in adapting to auxiliary tasks or goals that
require deviation from the reference trajectory. Their strong reliance on explicit tracking and dense
supervision can make them brittle in dynamic or multi-objective settings, where flexibility is crucial.

Table 2: Comparative analysis.

Criterion | GAN-Based Methods | Feature-Based Methods
Reward signal | implicit, coarse | explicit, dense
Scalability | high (unstructured data) | moderate (depends on representation)
Generalization | strong with latent conditioning | strong with good embeddings
Training stability | challenging (saturation, collapse) | stable but sensitive to inductive bias
Interpretability | low to moderate | high
Control | indirect (via discriminator or latent) | direct (via features or embeddings)
Task integration | flexible | precise but less adaptable

9 ON METRICS AND MISCONCEPTIONS

In evaluating learning from demonstration algorithms, it is common practice to reference metrics
such as motion naturalness, energy efficiency, or cost of transport. While these properties are
intuitively appealing, they can be misleading indicators of algorithmic performance. Crucially, such
metrics are not inherent to the learning algorithm itself but are instead highly dependent on the
quality and structure of the reference data. For instance, if a policy trained via a particular algorithm
exhibits smoother or more energy-efficient behavior, this outcome often reflects characteristics
of the underlying demonstrations rather than advantages intrinsic to the algorithmic formulation.
Consequently, attributing these observed properties to the learning method risks conflating algorithmic
capability with dataset bias.
Moreover, these high-level metrics offer limited diagnostic value when comparing algorithm classes.
They do not capture fundamental differences in reward design, training stability, scalability, or
generalization capacity. A GAN-based approach may yield visually smoother transitions due to its
distributional objectives, but this benefit must be weighed against the challenges of motion diversity
and tracking accuracy. Conversely, a feature-based method may produce high-fidelity imitation in
terms of kinematic features but struggle with generalization due to its reliance on well-structured
representations. To conduct a rigorous and meaningful comparison between methods, evaluation
should focus on the properties most directly influenced by algorithmic design. These include reward
signal quality, training stability, generalization to novel motions or environments, and adaptability to
auxiliary tasks. By focusing on such factors, researchers and practitioners can better understand the
operational trade-offs between feature-based and GAN-based approaches, avoiding overgeneralized
claims and grounding comparisons in algorithmic substance rather than incidental outcome metrics.

10 DEBUNKING COMMON BELIEFS


Despite a growing body of research, misconceptions remain prevalent in discussions of GAN-based
versus feature-based learning from demonstrations. Below, we revisit some common claims, clarify
their limitations, and situate them within a more rigorous analytical framework.

“GAN-based methods automatically develop a distance metric between reference and policy motions.”

This is partially true. GAN-based methods implicitly learn a similarity function via the discriminator.
However, this function may be ill-defined in early training, leading to discriminator saturation, where
the discriminator assigns uniformly high distances regardless of policy improvement. Moreover, the
discriminator may conflate resemblance to a single exemplar with similarity to the overall distribution,
resulting in mode collapse. Thus, while a learned metric exists, its utility and stability depend heavily
on discriminator design and representation quality.

“GAN-based methods do not require hand-crafted features.”

No. This assertion overlooks a key implementation detail: the discriminator operates on selected
features of the agent state. Choosing these features is analogous to defining reward components in
feature-based methods. Insufficient features can prevent the discriminator from detecting meaningful
discrepancies, while overly complex inputs can lead to rapid overfitting and saturation. This trade-off
is particularly critical in tasks involving partially observed context (e.g., terrain or object interactions),
where feature selection significantly impacts training stability and convergence.

“GAN-based methods avoid hand-tuned reward weights for different features.”

Not quite. While adversarial methods circumvent explicit manual weighting of reward components,
they are still sensitive to feature scaling and normalization. Input magnitudes shape the discriminator’s
sensitivity and therefore act as an implicit weighting scheme. Poorly calibrated inputs can bias the
reward signal, undermining the interpretability and reliability of the learned policy.

“GAN-based methods yield smoother transitions between motions.”

This holds true only relative to early feature-based methods that lacked structured representations
and relied on hard switching between clips. Modern feature-based methods that leverage structured
motion embeddings can produce smooth, semantically meaningful transitions. Interpolation in learned
latent spaces supports temporally and spatially coherent motion generation, rivaling or exceeding
GAN-based transitions when appropriate representation learning is applied.

“Only GAN-based methods can be combined with task rewards.”

No. Both GAN-based and feature-based methods can incorporate task objectives. Feature-based
methods provide dense, frame-aligned imitation rewards, making them effective when the task aligns
closely with the reference motion, but less flexible when deviation is required. In contrast, GAN-based
methods offer distribution-level supervision, enabling greater adaptability to auxiliary goals. This
flexibility, however, comes at the cost of lower fidelity to the reference and a risk of mode collapse.

“GAN-based methods deal better with unstructured or noisy reference motions.”

This is an oversimplification. GAN-based methods can exhibit robustness to small inconsistencies
in demonstrations due to their distributional supervision. However, this robustness often comes at
the cost of discarding fine motion details. Feature-based approaches, especially those employing
probabilistic or variational models, can also handle noise effectively through regularization and
representation smoothing.

“GAN-based methods scale better.”

Not necessarily. Scalability is more a function of motion representation quality than of paradigm.
Both GAN-based and feature-based methods can scale with large datasets if equipped with appropriate
latent encodings. The difference lies in when and how these representations are learned—feature-
based methods often rely on supervised or self-supervised embeddings, while GAN-based methods
may induce representations via adversarial feedback. Neither approach guarantees scalability without
careful design.

“GAN-based methods transfer better to real-world deployment.”

No. There is no intrinsic connection between the choice of imitation algorithm and sim-to-real transfer
efficacy. Transferability is determined primarily by external strategies such as domain randomization,
system identification, and regularization. While GAN-based approaches may respond more flexibly to
auxiliary rewards, they are also more sensitive to regularization, which can create the false impression
that certain regularizers are more effective in these methods.

“Feature-based methods generalize better to unseen motion inputs.”

Generalization depends less on the reward structure and more on the quality and organization of the
motion representation space. Both GAN-based and feature-based methods can generalize effectively
when equipped with well-structured embeddings. Failure modes arise not from the paradigm itself but
from inadequate inductive biases, insufficient diversity in training data, or poor temporal modeling.

“Feature-based methods are easier to implement.”

Not necessarily. Designing robust feature-based systems involves selecting appropriate reward
features, constructing phase functions or embeddings, and managing temporal alignment. These
tasks can be as complex as designing a discriminator, particularly when the goal is to scale across
tasks or environments. Moreover, effective latent representations often require pretraining and careful
architectural choices to avoid collapse or disentanglement failure.

11 FINAL REMARKS

This survey has examined two major paradigms in learning from demonstrations, feature-based
and GAN-based methods, through the lens of reward structure, scalability, generalization, and
representation. The core distinction lies not merely in architectural components but in their respective
philosophies of supervision: explicit, hand-crafted rewards versus implicit, adversarially learned
objectives.
Feature-based methods offer dense, interpretable rewards that strongly anchor the policy to reference
trajectories, making them well-suited for tasks requiring high-fidelity reproduction of demonstrated
motions. However, they often struggle with generalization, particularly in multi-clip or unstructured
settings, due to the need for manually specified features and aligned references.
GAN-based methods, in contrast, provide more flexible and data-driven reward structures through
discriminative objectives. This enables them to scale naturally to diverse datasets and to support
smoother transitions and behavior interpolation. Yet, they often encounter challenges related to
training stability, reward sparsity, and loss of fine-grained motion detail.
It is important to recognize that many problems commonly attributed to one paradigm reappear
in different forms in the other. For instance, mode collapse in GANs mirrors the brittleness of poor
motion representations in feature-based methods. Similarly, while feature-based methods offer strong
guidance for motion tracking, they may fail to generalize or adapt when rigid reward definitions are
misaligned with auxiliary tasks or dynamic environments.

Rather than presenting these two paradigms as mutually exclusive, recent trends point toward a
convergent perspective, one that emphasizes the centrality of structured motion representations.
Whether derived from self-supervised learning, latent encodings, or manually designed summaries,
these representations serve as a bridge between the strengths of each approach: the interpretability
and controllability of explicit rewards and the scalability and adaptability of adversarial training.
Ultimately, the decision between using a feature-based or GAN-based approach is not a question of
universal superiority. Instead, it should be guided by the specific constraints and priorities of the
application: fidelity versus diversity, interpretability versus flexibility, or training simplicity versus
large-scale generalization. Understanding these trade-offs and their relationship to reward structure
and motion representation is essential for designing robust, scalable, and expressive imitation learning
systems.

REFERENCES
Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang,
Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist
humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt:
General motion tracking for humanoid whole-body control. arXiv preprint arXiv:2506.14770,
2025.
Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, and Xiaolong Wang. Expressive
whole-body control for humanoid robots. arXiv preprint arXiv:2402.16796, 2024.
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake,
and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The
International Journal of Robotics Research, pp. 02783649241273668, 2023.
Zhiyang Dou, Xuelin Chen, Qingnan Fan, Taku Komura, and Wenping Wang. C·ASE: Learning
conditional adversarial skill embeddings for physics-based characters. In SIGGRAPH Asia 2023
Conference Papers, pp. 1–11, 2023.
Alejandro Escontrela, Xue Bin Peng, Wenhao Yu, Tingnan Zhang, Atil Iscen, Ken Goldberg, and
Pieter Abbeel. Adversarial motion priors make good substitutes for complex reward functions. In
2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 25–32.
IEEE, 2022.
Yahao Fan, Tianxiang Gui, Kaiyang Ji, Shutong Ding, Chixuan Zhang, Jiayuan Gu, Jingyi Yu, Jingya
Wang, and Ye Shi. One policy but many worlds: A scalable unified policy for versatile humanoid
locomotion. arXiv preprint arXiv:2505.18780, 2025.
Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid
shadowing and imitation from humans. arXiv preprint arXiv:2406.10454, 2024.
Lei Han, Qingxu Zhu, Jiapeng Sheng, Chong Zhang, Tingguang Li, Yizheng Zhang, He Zhang,
Yuzhen Liu, Cheng Zhou, Rui Zhao, et al. Lifelike agility and play in quadrupedal robots using
reinforcement learning and generative pre-trained models. Nature Machine Intelligence, 6(7):
787–798, 2024.
Mohamed Hassan, Yunrong Guo, Tingwu Wang, Michael Black, Sanja Fidler, and Xue Bin Peng. Syn-
thesizing physical character-scene interactions. In ACM SIGGRAPH 2023 Conference Proceedings,
pp. 1–9, 2023.
Tairan He, Zhengyi Luo, Xialin He, Wenli Xiao, Chong Zhang, Weinan Zhang, Kris Kitani, Changliu
Liu, and Guanya Shi. Omnih2o: Universal and dexterous human-to-humanoid whole-body
teleoperation and learning. arXiv preprint arXiv:2406.08858, 2024a.
Tairan He, Zhengyi Luo, Wenli Xiao, Chong Zhang, Kris Kitani, Changliu Liu, and Guanya Shi.
Learning human-to-humanoid real-time whole-body teleoperation. In 2024 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), pp. 8944–8951. IEEE, 2024b.
Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in neural
information processing systems, 29, 2016.
Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, and Xiaolong Wang.
Exbody2: Advanced expressive humanoid whole-body control. arXiv preprint arXiv:2412.13196,
2024.
Chenhao Li, Sebastian Blaes, Pavel Kolev, Marin Vlastelica, Jonas Frey, and Georg Martius. Versatile
skill control via self-supervised adversarial imitation of unlabeled mixed motions. In 2023 IEEE
international conference on robotics and automation (ICRA), pp. 2944–2950. IEEE, 2023a.
Chenhao Li, Marin Vlastelica, Sebastian Blaes, Jonas Frey, Felix Grimminger, and Georg Martius.
Learning agile skills via adversarial imitation of rough partial demonstrations. In Conference on
Robot Learning, pp. 342–352. PMLR, 2023b.

Chenhao Li, Elijah Stanger-Jones, Steve Heim, and Sangbae Kim. Fld: Fourier latent dynamics for
structured motion representation and learning. arXiv preprint arXiv:2402.13820, 2024.
Jialong Li, Xuxin Cheng, Tianshu Huang, Shiqi Yang, Ri-Zhao Qiu, and Xiaolong Wang. Amo:
Adaptive motion optimization for hyper-dexterous humanoid whole-body control. arXiv preprint
arXiv:2505.03738, 2025.
Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time
simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pp. 10895–10904, 2023a.
Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, and Weipeng
Xu. Universal humanoid motion representations for physics-based control. arXiv preprint
arXiv:2310.04582, 2023b.
Zhengyi Luo, Jiashun Wang, Kangni Liu, Haotian Zhang, Chen Tessler, Jingbo Wang, Ye Yuan,
Jinkun Cao, Zihui Lin, Fengyi Wang, et al. Smplolympics: Sports environments for physically
simulated humanoids. arXiv preprint arXiv:2407.00187, 2024.
Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel Van de Panne. Deepmimic: Example-
guided deep reinforcement learning of physics-based character skills. ACM Transactions On
Graphics (TOG), 37(4):1–14, 2018.
Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial
motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG),
40(4):1–20, 2021.
Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. Ase: Large-scale reusable
adversarial skill embeddings for physically simulated characters. ACM Transactions On Graphics
(TOG), 41(4):1–17, 2022.
Davis Rempe, Zhengyi Luo, Xue Bin Peng, Ye Yuan, Kris Kitani, Karsten Kreis, Sanja Fidler, and
Or Litany. Trace and pace: Controllable pedestrian animation via guided trajectory diffusion.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
13756–13766, 2023.
Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using
massively parallel deep reinforcement learning. In Conference on Robot Learning, pp. 91–100.
PMLR, 2022.
Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross, and Moritz Bächer. Robot motion
diffusion model: Motion generation for robotic characters. In SIGGRAPH Asia 2024 Conference
Papers, pp. 1–9, 2024a.
Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross, and Moritz Bächer. Vmp: Versatile
motion priors for robustly tracking motion on physical characters. In Computer Graphics Forum,
volume 43, pp. e15175. Wiley Online Library, 2024b.
Jean-Pierre Sleiman, Mayank Mittal, and Marco Hutter. Guided reinforcement learning for robust
multi-contact loco-manipulation. In 8th Annual Conference on Robot Learning (CoRL 2024),
2024.
Sebastian Starke, Ian Mason, and Taku Komura. Deepphase: Periodic autoencoders for learning
motion phase manifolds. ACM Transactions on Graphics (ToG), 41(4):1–13, 2022.
Annan Tang, Takuma Hiraoka, Naoki Hiraoka, Fan Shi, Kento Kawaharazuka, Kunio Kojima, Kei
Okada, and Masayuki Inaba. Humanmimic: Learning natural locomotion and transitions for
humanoid robot via wasserstein adversarial imitation. In 2024 IEEE International Conference on
Robotics and Automation (ICRA), pp. 13107–13114. IEEE, 2024.
Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montser-
rat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza,
Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint
arXiv:2503.20020, 2025.

Chen Tessler, Yoni Kasten, Yunrong Guo, Shie Mannor, Gal Chechik, and Xue Bin Peng. Calm:
Conditional adversarial latent models for directable virtual characters. In ACM SIGGRAPH 2023
Conference Proceedings, pp. 1–9, 2023.
Chen Tessler, Yunrong Guo, Ofir Nabati, Gal Chechik, and Xue Bin Peng. Maskedmimic: Unified
physics-based character control through masked motion inpainting. ACM Transactions on Graphics
(TOG), 43(6):1–21, 2024.
Andrea Tirinzoni, Ahmed Touati, Jesse Farebrother, Mateusz Guzek, Anssi Kanervisto, Yingchen Xu,
Alessandro Lazaric, and Matteo Pirotta. Zero-shot whole-body humanoid control via behavioral
foundation models. arXiv preprint arXiv:2504.11054, 2025.
Eric Vollenweider, Marko Bjelonic, Victor Klemm, Nikita Rudin, Joonho Lee, and Marco Hutter.
Advanced skills through multiple adversarial motion priors in reinforcement learning. In 2023
IEEE International Conference on Robotics and Automation (ICRA), pp. 5120–5126. IEEE, 2023.
Yinhuai Wang, Jing Lin, Ailing Zeng, Zhengyi Luo, Jian Zhang, and Lei Zhang. Physhoi: Physics-
based imitation of dynamic human-object interaction. arXiv preprint arXiv:2312.04393, 2023.
Ryo Watanabe, Chenhao Li, and Marco Hutter. Dfm: Deep fourier mimic for expressive dance
motion learning. arXiv preprint arXiv:2502.10980, 2025.
Jungdam Won, Deepak Gopinath, and Jessica Hodgins. Physics-based character controllers using
conditional vaes. ACM Transactions on Graphics (TOG), 41(4):1–12, 2022.
Michael Xu, Yi Shi, KangKang Yin, and Xue Bin Peng. Parc: Physics-based augmentation with
reinforcement learning for character controllers. arXiv preprint arXiv:2505.04002, 2025.
Heyuan Yao, Zhenhua Song, Baoquan Chen, and Libin Liu. Controlvae: Model-based learning of
generative controllers for physics-based characters. ACM Transactions on Graphics (TOG), 41(6):
1–16, 2022.
Fatemeh Zargarbashi, Jin Cheng, Dongho Kang, Robert Sumner, and Stelian Coros. Robotkeyframing:
Learning locomotion with high-level objectives via mixture of dense and sparse rewards. arXiv
preprint arXiv:2407.11562, 2024.
Yanjie Ze, Zixuan Chen, João Pedro Araújo, Zi-ang Cao, Xue Bin Peng, Jiajun Wu, and C Karen
Liu. Twist: Teleoperated whole-body imitation system. arXiv preprint arXiv:2505.02833, 2025.
Zewei Zhang, Chenhao Li, Takahiro Miki, and Marco Hutter. Motion priors reimagined: Adapting
flat-terrain skills for complex quadruped mobility. arXiv preprint arXiv:2505.16084, 2025.
Qingxu Zhu, He Zhang, Mengting Lan, and Lei Han. Neural categorical priors for physics-based
character control. ACM Transactions on Graphics (TOG), 42(6):1–16, 2023.
