Humanoid-GPT: Scaling Data and Structure
for Zero-Shot Motion Tracking
CVPR 2026

Zekun Qi¹²*, Xuchuan Chen²³*, Dairu Liu²*, Chenghuai Lin²*, Yunrui Lian¹², Sikai Liang², Zhikai Zhang¹²,
Yu Guan¹², Jilong Wang², Wenyao Zhang²⁴, Xinqiang Yu², He Wang²⁵†, Li Yi¹⁶†

* equal contribution † corresponding author

¹Tsinghua University ²Galbot Inc. ³Beihang University ⁴Shanghai Jiao Tong University ⁵Peking University ⁶Shanghai Qi Zhi Institute


Real-time whole-body control and zero-shot motion tracking of Humanoid-GPT on a Unitree G1 in everyday home scenarios.

Comparison with SONIC

We compare Humanoid-GPT against SONIC, the strongest prior generalist tracker, across four representative categories. Humanoid-GPT delivers smoother daily behaviors, more expressive dances, more agile high-dynamic motions, and steadier balance. In each video, the left side shows SONIC and the right side shows Humanoid-GPT (ours).

Daily Motion

Dance Motion

High-Dynamic Motion

Balance Motion

Demo Videos

Humanoid-GPT performs robust whole-body control across a wide range of real-world tasks in a zero-shot manner, from physical labor to sports and security scenarios.

Digging

Disinfection

Soccer

Security

Real-World Experiments

All motions shown here were excluded from training, so they test generalization rather than memorization. Our method tracks diverse, complex, and highly dynamic motions in a zero-shot manner, including a wide variety of dance motions.

Highlights

1. Current humanoid motion trackers face a fundamental agility–generalization trade-off: they can either track agile motions in-domain or generalize to unseen movements, but not both.

2. We assemble an unprecedented 2B-frame motion corpus by aggregating all major mocap datasets (AMASS, LAFAN1, Motion-X++, PHUMA, MotionMillion) with large-scale in-house recordings.

3. We propose Harmonic Motion Embedding (HME), a novel metric for quantifying and categorizing motion data diversity directly from motion data itself.

4. We introduce Humanoid-GPT, a Transformer-based humanoid tracker trained via expert distillation that achieves both highly dynamic motion tracking and unprecedented zero-shot generalization.

5. Extensive experiments demonstrate clear scaling laws: enlarging both data and model capacity yields consistent improvements in tracking accuracy and generalization.

Method Overview

Humanoid-GPT Training Pipeline. (a) We curate a large-scale motion corpus by aggregating multiple mocap datasets and performing motion retargeting to the Unitree G1 humanoid. (b) Motion experts are trained via reinforcement learning on clustered motion data using Harmonic Motion Embedding (HME). (c) A GPT-style Transformer is trained via DAgger distillation to consolidate all expert knowledge into a single generalist tracker with causal temporal attention.
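Stage (c) relies on DAgger, in which the student controls the rollout while the expert supplies the action labels for the states the student actually visits. The following is a minimal sketch of that loop on a toy one-dimensional system, not the authors' implementation: the dynamics, the linear student, and every function name (`expert_action`, `rollout`, `fit_student`, `dagger`) are assumptions made for illustration.

```python
# Toy DAgger loop. A 1-D integrator stands in for the humanoid simulator,
# and a single linear gain stands in for the Transformer student.
# All names and dynamics here are hypothetical.

def expert_action(state: float) -> float:
    """Stand-in for a trained motion expert: push the state toward zero."""
    return -0.5 * state

def rollout(gain: float, start: float, steps: int) -> list[float]:
    """Run the student policy (action = gain * state) and record visited states."""
    states, state = [], start
    for _ in range(steps):
        states.append(state)
        state = state + gain * state  # simple integrator dynamics
    return states

def fit_student(dataset: list[tuple[float, float]]) -> float:
    """Least-squares fit of the linear student on (state, expert action) pairs."""
    num = sum(s * a for s, a in dataset)
    den = sum(s * s for s, _ in dataset)
    return num / den

def dagger(iterations: int = 3, start: float = 1.0, steps: int = 10) -> float:
    gain = 0.0  # untrained student
    dataset: list[tuple[float, float]] = []
    for _ in range(iterations):
        visited = rollout(gain, start, steps)                 # 1. student acts
        dataset += [(s, expert_action(s)) for s in visited]   # 2. expert relabels
        gain = fit_student(dataset)                           # 3. refit on the aggregate
    return gain

student_gain = dagger()
print(f"student gain after DAgger: {student_gain:.3f}")
```

The key design point is step 1: because the data comes from the student's own state distribution, the student learns how to recover from its own mistakes, which plain behavior cloning on expert rollouts does not cover.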

Comparison with Prior Works

Humanoid-GPT is the first work that combines a Transformer-based architecture, agile motion tracking, zero-shot generalization, and billion-scale training data.

Method	Tracker	Agile	Zero-shot	#Frames
HumanPlus	Transformer	✗	✗	7.2M
OmniH2O	MLP	✗	✗	7.2M
ASAP	MLP	✓	✗	-
GMT	MoE-MLP	✓	✗	6.0M
UniTracker	MLP	✓	✗	7.2M
TWIST	MLP	✗	~	9.2M
Any2Track	MLP	✓	✗	9.1M
SONIC	MLP	✓	✓	100M
Humanoid-GPT (ours)	Transformer	✓	✓	2.0B
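
To make the data-scale gap in the table concrete, the short script below computes how many times larger the 2.0B-frame corpus is than each prior method's training set. The frame counts are copied from the table; the helper name `scale_ratios` is introduced here only for illustration.

```python
# Frame counts copied from the comparison table. ASAP is omitted because
# the table lists no frame count for it.
FRAMES = {
    "HumanPlus": 7.2e6,
    "OmniH2O": 7.2e6,
    "GMT": 6.0e6,
    "UniTracker": 7.2e6,
    "TWIST": 9.2e6,
    "Any2Track": 9.1e6,
    "SONIC": 100e6,
}
HUMANOID_GPT_FRAMES = 2.0e9

def scale_ratios(frames: dict[str, float], reference: float) -> dict[str, float]:
    """Return reference / count for every method in frames."""
    return {name: reference / count for name, count in frames.items()}

ratios = scale_ratios(FRAMES, HUMANOID_GPT_FRAMES)
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:<12}{ratio:>6.0f}x")
```

By these numbers, the corpus is 20 times larger than SONIC's and more than 200 times larger than those of the remaining methods with a reported frame count.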

Data Diversity Analysis

We propose Harmonic Motion Embedding (HME) to quantify motion data diversity in a latent space. Our curated dataset exhibits both a higher embedding scale and broader latent coverage, with an approximately 4-5× increase in log-volume compared with AMASS.

Dataset diversity in the HME embedding space.

Data distribution visualization.
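
This page does not spell out how the log-volume is computed. One common way to measure how much of a latent space a point cloud covers is half the log-determinant of its sample covariance, and the sketch below uses that definition as an assumption. The 2-D points are made up and are not HME embeddings; `log_volume_2d` is a hypothetical helper.

```python
import math

def log_volume_2d(points: list[tuple[float, float]]) -> float:
    """Half the log-determinant of the 2-D sample covariance.

    This is an assumed coverage measure, not necessarily the one used for HME.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return 0.5 * math.log(sxx * syy - sxy ** 2)

# Made-up embeddings: "broad" is "narrow" stretched by 3x along both axes.
narrow = [(0.0, 0.0), (1.0, 0.2), (0.3, 1.0), (1.2, 1.1), (0.5, 0.4)]
broad = [(3.0 * x, 3.0 * y) for x, y in narrow]

gain = log_volume_2d(broad) - log_volume_2d(narrow)
print(f"log-volume gain: {gain:.3f}")
```

Under this definition, scaling a d-dimensional cloud by a factor s raises the log-volume by d·log(s), so the measure rewards both larger embedding scale and wider spread along every latent direction.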

Scaling Laws

The Transformer-based Humanoid-GPT exhibits clear scaling laws: enlarging both the motion corpus and the model capacity yields consistent and substantial gains in tracking accuracy and stability.

Data Scaling Curve on Zero-shot Performance

Model Scalability Comparison
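
Scaling trends like these are often summarized by fitting a power law, error ≈ a · N^(−b), as a straight line in log-log space. The page does not state which functional form the authors fit, so the sketch below is an assumption, and the data points are synthetic values generated inside the script, not measured results.

```python
import math

def fit_power_law(sizes: list[float], errors: list[float]) -> tuple[float, float]:
    """Fit error ≈ a * size**(-b) by ordinary least squares on log-log data."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic curve with a = 2.0 and b = 0.3, used only to exercise the fit.
sizes = [1e6, 1e7, 1e8, 1e9]
errors = [2.0 * n ** -0.3 for n in sizes]

a, b = fit_power_law(sizes, errors)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A larger exponent b means each tenfold increase in data buys a larger relative drop in tracking error, which makes b a convenient single number for comparing architectures.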

Inference Optimization

We optimized the deployment pipeline with ONNX export and TensorRT compilation, achieving an end-to-end inference latency of under 1.5 ms on a single NVIDIA RTX 4090 GPU, approximately 5× faster than TWIST.

Comparison of inference latency among different optimization methods.
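
Latency figures of this kind are usually gathered with a warm-up phase followed by many timed calls. The harness below is a generic sketch of that procedure, not the authors' benchmark: `dummy_policy_step` is a hypothetical CPU stand-in for one forward pass of the compiled policy, and the warm-up and run counts are arbitrary.

```python
import statistics
import time
from typing import Callable

def benchmark(step: Callable[[], object], warmup: int = 20, runs: int = 200) -> dict[str, float]:
    """Return the median and worst-case latency of step() in milliseconds."""
    for _ in range(warmup):  # warm caches before timing
        step()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        step()
        samples.append((time.perf_counter() - start) * 1e3)
    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}

def dummy_policy_step() -> int:
    """Hypothetical stand-in for one inference call of a deployed policy."""
    return sum(i * i for i in range(1000))

stats = benchmark(dummy_policy_step)
print(f"median {stats['median_ms']:.3f} ms, max {stats['max_ms']:.3f} ms")
```

When the step runs on a GPU, the device has to be synchronized before the timer is stopped, because kernel launches are asynchronous and the call would otherwise return before the work has finished. Reporting the worst case alongside the median matters for real-time control, where a single slow step can miss the control deadline.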

BibTeX

@article{humanoidgpt26,
  title={Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking},
  author={Qi, Zekun and Chen, Xuchuan and Liu, Dairu and Lin, Chenghuai and Lian, Yunrui and Liang, Sikai and Zhang, Zhikai and Guan, Yu and Wang, Jilong and Zhang, Wenyao and Yu, Xinqiang and Wang, He and Yi, Li},
  journal={arXiv preprint arXiv:2606.03985},
  year={2026}
}
