Humanoid-GPT: Scaling Data and Structure
for Zero-Shot Motion Tracking
CVPR 2026

* equal contribution   † corresponding author
1Tsinghua University   2Galbot Inc.   3Beihang University   4Shanghai Jiao Tong University   5Peking University   6Shanghai Qi Zhi Institute

Real-time whole-body control and zero-shot motion tracking of Humanoid-GPT on a Unitree G1 in everyday home scenarios.

Comparison with SONIC

We compare Humanoid-GPT against SONIC, the strongest prior generalist tracker, across four representative categories. Humanoid-GPT delivers smoother daily behaviors, more expressive dances, more agile high-dynamic motions, and steadier balance. In each video, the left side shows SONIC and the right side shows Humanoid-GPT (ours).

Daily Motion

Dance Motion

High-Dynamic Motion

Balance Motion

Demo Videos

Humanoid-GPT performs robust whole-body control across a wide range of real-world tasks in a zero-shot manner, from physical labor to sports and security scenarios.

Digging

Disinfection

Soccer

Security

Humanoid-GPT Teaser

Real-World Experiments

None of the motions shown here were seen during training, so they test generalization directly. Our method tracks diverse, complex, and highly dynamic motions in a zero-shot manner, including a wide variety of dance motions.


Highlights

1. Current humanoid motion trackers face a fundamental agility–generalization trade-off: they can either track agile motions in-domain or generalize to unseen movements, but not both.
2. We assemble an unprecedented 2B-frame motion corpus by aggregating all major mocap datasets (AMASS, LAFAN1, Motion-X++, PHUMA, MotionMillion) with large-scale in-house recordings.
3. We propose Harmonic Motion Embedding (HME), a novel metric for quantifying and categorizing motion data diversity directly from motion data itself.
4. We introduce Humanoid-GPT, a Transformer-based humanoid tracker trained via expert distillation that achieves both highly dynamic motion tracking and unprecedented zero-shot generalization.
5. Extensive experiments demonstrate clear scaling laws: enlarging both data and model capacity yields consistent improvements in tracking accuracy and generalization.

Method Overview

Humanoid-GPT Pipeline

Humanoid-GPT Training Pipeline. (a) We curate a large-scale motion corpus by aggregating multiple mocap datasets and performing motion retargeting to the Unitree G1 humanoid. (b) Motion experts are trained via reinforcement learning on clustered motion data using Harmonic Motion Embedding (HME). (c) A GPT-style Transformer is trained via DAgger distillation to consolidate all expert knowledge into a single generalist tracker with causal temporal attention.
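The distillation step in (c) can be illustrated with a toy DAgger loop. This is a minimal sketch under stated assumptions, not the authors' training code: the 1-D dynamics, the proportional-controller expert, and the linear student (`expert_action`, `step`, `fit_linear`, `dagger`) are all hypothetical stand-ins for the simulator, the RL motion experts, and the Transformer. It only shows the defining pattern of DAgger: roll out the current student, have the expert relabel every visited state, aggregate the data, and retrain.

```python
import random


def expert_action(state):
    """Hypothetical expert: a proportional controller driving the state to 0."""
    return -0.8 * state


def step(state, action, rng):
    """Toy 1-D dynamics with small noise (a stand-in for the simulator)."""
    return state + action + rng.gauss(0.0, 0.01)


def fit_linear(dataset):
    """Least-squares fit of action = w * state over the aggregated dataset."""
    num = sum(s * a for s, a in dataset)
    den = sum(s * s for s, _ in dataset)
    return num / den if den > 0 else 0.0


def dagger(iterations=5, horizon=20, seed=0):
    rng = random.Random(seed)
    dataset = []  # aggregated (state, expert label) pairs across all iterations
    w = 0.0       # student policy: action = w * state
    for it in range(iterations):
        # The first rollout follows the expert; later rollouts follow the student.
        beta = 1.0 if it == 0 else 0.0
        state = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            label = expert_action(state)   # expert relabels every visited state
            dataset.append((state, label))
            action = beta * label + (1.0 - beta) * (w * state)
            state = step(state, action, rng)
        w = fit_linear(dataset)            # retrain the student on all data so far
    return w, len(dataset)


student_w, n_samples = dagger()
```

Because the states are collected under the student's own rollouts, the student is supervised on the states it actually reaches, which is the property that separates DAgger from plain behavior cloning.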

Comparison with Prior Works

Humanoid-GPT is the first work that combines a Transformer-based architecture, agile motion tracking, zero-shot generalization, and billion-scale training data.

Method               Tracker      Agile  Zero-shot  #Frames
HumanPlus            Transformer                    7.2M
OmniH2O              MLP                            7.2M
ASAP                 MLP                            -
GMT                  MoE-MLP                        6.0M
UniTracker           MLP                            7.2M
TWIST                MLP                            ~9.2M
Any2Track            MLP                            9.1M
SONIC                MLP                            100M
Humanoid-GPT (ours)  Transformer                    2.0B

Data Diversity Analysis

We propose Harmonic Motion Embedding (HME) to quantify motion data diversity in a latent space. Our curated dataset exhibits both a larger embedding scale and broader latent coverage, with an approximately 4-5× increase in log-volume compared with AMASS.
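One common way to turn "latent coverage" into a single number is the log-volume of the covariance ellipsoid of the embeddings. The sketch below assumes that reading; the page does not specify how HME computes log-volume, so `log_volume`, the `eps` regularizer, and the toy 2-D points are illustrative assumptions rather than the authors' metric.

```python
import math


def covariance(points):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for p in points:
        c = [p[j] - mean[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += c[i] * c[j] / (n - 1)
    return cov


def log_volume(points, eps=1e-9):
    """Assumed diversity proxy: 0.5 * log det(cov + eps * I), via Cholesky."""
    cov = covariance(points)
    d = len(cov)
    chol = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(cov[i][i] + eps - s)
            else:
                chol[i][j] = (cov[i][j] - s) / chol[j][j]
    # det(cov) equals the squared product of the Cholesky diagonal,
    # so 0.5 * log det is the sum of the log-diagonal entries.
    return sum(math.log(chol[i][i]) for i in range(d))


# Toy 2-D "embeddings" (made up for illustration) and a copy stretched by 2x.
base = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.2)]
wide = [(2.0 * x, 2.0 * y) for x, y in base]
gain = log_volume(wide) - log_volume(base)
```

Under this proxy, stretching every embedding by a factor of 2 in a d-dimensional space adds d · log 2 to the log-volume, so the measure rewards spread along every latent direction rather than along a single dominant axis.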

Data Diversity Comparison

Dataset diversity in the HME embedding space.

Data Distribution

Data distribution visualization.

Scaling Laws

The Transformer-based Humanoid-GPT exhibits clear scaling laws: enlarging both the motion corpus and the model capacity yields consistent and substantial gains in tracking accuracy and stability.
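A scaling trend of this kind is often summarized by fitting a power law, error ≈ a · N^(-b), as a straight line in log-log space. The following is a minimal sketch of that fit; the functional form is an assumption, and the data points are synthetic values generated from a known law purely to exercise the code. They are not results from the paper.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error = a * size**(-b) by least squares on (log size, log error)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, b)


# Synthetic example only: dataset sizes in frames and made-up tracking errors.
sizes = [1e6, 1e7, 1e8, 1e9]
errors = [5.0 * n ** -0.25 for n in sizes]
a, b = fit_power_law(sizes, errors)
```

With real measurements, a roughly constant fitted exponent `b` across the range of sizes is what would justify describing the trend as a scaling law rather than a one-off improvement.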

Data Scaling

Data Scaling Curve on Zero-shot Performance

Model Scalability

Model Scalability Comparison

Inference Optimization

We carefully optimized the deployment pipeline using ONNX and TensorRT compilation, achieving an end-to-end inference latency of under 1.5 ms on a single NVIDIA RTX 4090 GPU, approximately 5× faster than TWIST.
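Latency figures like this are usually reported from a warmed-up timing loop rather than a single call. The harness below is a minimal sketch of such a measurement, assuming a generic Python callable; `measure_latency_ms`, `dummy_policy`, and the 64-dimensional input are hypothetical stand-ins and do not reproduce the ONNX/TensorRT pipeline described above.

```python
import statistics
import time


def measure_latency_ms(infer, make_input, warmup=20, runs=200):
    """Time `infer` on fresh inputs and return latency statistics in ms."""
    for _ in range(warmup):          # warm-up calls are excluded from the stats
        infer(make_input())
    samples = []
    for _ in range(runs):
        x = make_input()             # input construction is not timed
        start = time.perf_counter()
        infer(x)
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p99_ms": samples[min(len(samples) - 1, int(0.99 * len(samples)))],
        "max_ms": samples[-1],
    }


def dummy_policy(obs):
    """Hypothetical placeholder for the deployed tracker."""
    return [0.5 * v for v in obs]


stats = measure_latency_ms(dummy_policy, lambda: [0.0] * 64)
```

When the callable launches GPU work asynchronously, the timer should stop only after the device has been synchronized; otherwise the loop measures launch overhead instead of end-to-end latency. For a real-time controller, the tail values (`p99_ms`, `max_ms`) are typically more informative than the median.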


Comparison of inference latency among different optimization methods.

BibTeX

@article{humanoidgpt26,
  title={Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking},
  author={Qi, Zekun and Chen, Xuchuan and Liu, Dairu and Lin, Chenghuai and Lian, Yunrui and Liang, Sikai and Zhang, Zhikai and Guan, Yu and Wang, Jilong and Zhang, Wenyao and Yu, Xinqiang and Wang, He and Yi, Li},
  journal={arXiv preprint arXiv:2606.03985},
  year={2026}
}