We compare Humanoid-GPT against SONIC, the strongest prior generalist tracker, across four representative categories. Humanoid-GPT delivers smoother daily behaviors, more expressive dances, more agile high-dynamic motions, and steadier balance. In each video, the left side shows SONIC and the right side shows Humanoid-GPT (ours).
Daily Motion
Dance Motion
High-Dynamic Motion
Balance Motion
Humanoid-GPT performs robust whole-body control across a wide range of real-world tasks in a zero-shot manner, from physical labor to sports and security scenarios.
Digging
Disinfection
Soccer
Security
None of the motions shown here appear in the training data, so every clip tests generalization. Humanoid-GPT tracks diverse, complex, and highly dynamic motions in a zero-shot manner, including a wide variety of dances.
Humanoid-GPT Training Pipeline. (a) We curate a large-scale motion corpus by aggregating multiple mocap datasets and performing motion retargeting to the Unitree G1 humanoid. (b) Motion experts are trained via reinforcement learning on clustered motion data using Harmonic Motion Embedding (HME). (c) A GPT-style Transformer is trained via DAgger distillation to consolidate all expert knowledge into a single generalist tracker with causal temporal attention.
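To make the distillation step in (c) concrete, here is a minimal sketch of the DAgger loop on a toy 1-D tracking problem. Everything in it is an illustrative assumption rather than the Humanoid-GPT implementation: `expert_action` is a hypothetical proportional controller standing in for an RL-trained motion expert, and the student is a single scalar gain rather than a Transformer. The structure is what DAgger prescribes: roll out the student, label the states it visits with the expert, aggregate the labels, and refit.

```python
import random


def expert_action(state, target):
    # Hypothetical expert: a proportional controller (stand-in for an RL expert).
    return 0.5 * (target - state)


def fit_gain(errors, actions):
    # Least-squares fit of action = gain * error (a line through the origin).
    num = sum(e * a for e, a in zip(errors, actions))
    den = sum(e * e for e in errors)
    return num / den if den > 0 else 0.0


def dagger(iterations=5, horizon=20, seed=0):
    rng = random.Random(seed)
    gain = 0.0                   # student policy: action = gain * error
    errors, actions = [], []     # aggregated dataset, grown every iteration
    for it in range(iterations):
        state = rng.uniform(-1.0, 1.0)
        target = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            error = target - state
            label = expert_action(state, target)
            # The expert labels every visited state, whoever is driving.
            errors.append(error)
            actions.append(label)
            # Iteration 0 follows the expert; later iterations follow the student.
            state += label if it == 0 else gain * error
        gain = fit_gain(errors, actions)  # retrain on all data collected so far
    return gain


if __name__ == "__main__":
    print(dagger())
```

Because the student is supervised on states from its own rollouts, it learns to recover from its own mistakes instead of only imitating states the expert would have reached.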
Humanoid-GPT is the first work that combines a Transformer-based architecture, agile motion tracking, zero-shot generalization, and billion-scale training data.
| Method | Tracker | Agile | Zero-shot | #Frames |
|---|---|---|---|---|
| HumanPlus | Transformer | ✗ | ✗ | 7.2M |
| OmniH2O | MLP | ✗ | ✗ | 7.2M |
| ASAP | MLP | ✓ | ✗ | - |
| GMT | MoE-MLP | ✓ | ✗ | 6.0M |
| UniTracker | MLP | ✓ | ✗ | 7.2M |
| TWIST | MLP | ✗ | ~ | 9.2M |
| Any2Track | MLP | ✓ | ✗ | 9.1M |
| SONIC | MLP | ✓ | ✓ | 100M |
| Humanoid-GPT (ours) | Transformer | ✓ | ✓ | 2.0B |
We propose Harmonic Motion Embedding (HME) to quantify the diversity of motion data in a latent space. Our curated dataset exhibits both a larger embedding scale and broader latent coverage, with an approximately 4-5× increase in log-volume relative to AMASS.
Dataset diversity in the HME embedding space.
Data distribution visualization.
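This page does not spell out how log-volume is computed. One common proxy for the volume a point cloud occupies is half the log-determinant of its covariance matrix, and the sketch below uses that definition as an assumption; it is not necessarily the HME metric. The function names and the small regularizer `eps` are hypothetical.

```python
import math


def covariance(points):
    # Sample covariance matrix of a list of equal-length vectors.
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for p in points:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (p[i] - mean[i]) * (p[j] - mean[j])
    return [[c / (n - 1) for c in row] for row in cov]


def log_volume(points, eps=1e-9):
    # Assumed proxy: 0.5 * log det(cov + eps * I), computed via Cholesky.
    cov = covariance(points)
    d = len(cov)
    chol = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(cov[i][i] + eps - s)
            else:
                chol[i][j] = (cov[i][j] - s) / chol[j][j]
    # log det = 2 * sum(log diagonal), so half of it is the plain sum.
    return sum(math.log(chol[i][i]) for i in range(d))


if __name__ == "__main__":
    narrow = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    wide = [(2.0 * x, 2.0 * y) for x, y in narrow]
    print(log_volume(wide) - log_volume(narrow))
```

Under this proxy, stretching a d-dimensional embedding cloud by a factor s along every axis raises the log-volume by d·log(s), so the measure rewards broader coverage in every latent direction.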
The Transformer-based Humanoid-GPT exhibits clear scaling laws: enlarging both the motion corpus and the model capacity yields consistent and substantial gains in tracking accuracy and stability.
Data Scaling Curve on Zero-shot Performance
Model Scalability Comparison
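Scaling trends like these are often summarized by fitting a power law, error ≈ a · N^(-b), with a straight-line fit in log-log space. The sketch below shows that fit under the assumption that a power law is a reasonable model here; the page does not state a functional form. The data in the usage example is synthetic and generated from a known law purely to check the fit; it is not taken from our experiments.

```python
import math


def fit_power_law(sizes, errors):
    # Ordinary least squares on (log N, log error); returns (a, b) for a * N**(-b).
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


if __name__ == "__main__":
    # Synthetic check: points generated from error = 3.0 * N**(-0.2).
    sizes = [1e6, 1e7, 1e8, 1e9]
    errors = [3.0 * n ** -0.2 for n in sizes]
    a, b = fit_power_law(sizes, errors)
    print(a, b)
```

A larger fitted exponent b means the error falls faster as the corpus or model grows, which gives a single number for comparing architectures.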
We optimize the deployment pipeline with ONNX export and TensorRT compilation, achieving an end-to-end inference latency of under 1.5 ms on a single NVIDIA RTX 4090 GPU, approximately 5× faster than TWIST.
Comparison of inference latency among different optimization methods.
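Latency figures of this kind are usually gathered with a warm-up phase followed by many timed calls. The harness below is a generic sketch of that procedure, not our benchmarking script: `measure_latency_ms` and its parameters are hypothetical names, and `infer` is a placeholder for whatever callable runs one policy step. For GPU inference, the callable must block until the result is ready, otherwise only the launch time is measured.

```python
import statistics
import time


def measure_latency_ms(infer, warmup=20, runs=200):
    # Warm-up calls let caches, allocators, and lazy initialization settle.
    for _ in range(warmup):
        infer()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - start) * 1e3)  # seconds to ms
    samples.sort()
    p99_index = min(len(samples) - 1, int(0.99 * len(samples)))
    return {
        "median": statistics.median(samples),
        "p99": samples[p99_index],
        "max": samples[-1],
    }


if __name__ == "__main__":
    # Placeholder workload standing in for one inference call.
    stats = measure_latency_ms(lambda: sum(range(10_000)))
    print(stats)
```

Reporting the median alongside a tail percentile matters for control loops, where an occasional slow step can be more harmful than a slightly higher average.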
@article{humanoidgpt26,
title={Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking},
author={Qi, Zekun and Chen, Xuchuan and Liu, Dairu and Lin, Chenghuai and Lian, Yunrui and Liang, Sikai and Zhang, Zhikai and Guan, Yu and Wang, Jilong and Zhang, Wenyao and Yu, Xinqiang and Wang, He and Yi, Li},
journal={arXiv preprint arXiv:2606.03985},
year={2026}
}