Official repository for the project TraceGen: World Modeling in 3D Trace-Space Enables Learning from Cross-Embodiment Videos.
Project Website: tracegen.github.io
arXiv: 2511.21690
- Training/testing labels for the five datasets (Libero, Robomimic, Droid, Epickitchen, Bridge), along with the checkpoints trained on each and their metrics, are now available. See the Hugging Face collection for all assets: https://huggingface.co/collections/furonghuang-lab/tracegen
- The official leaderboard is hosted at: https://huggingface.co/furonghuang-lab/TraceGenBenchmark
For the data generation pipeline TraceForge, which prepares cross-embodiment 3D trace datasets for TraceGen training, please refer to the TraceForge GitHub Repository.
TraceForge is a scalable data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces. It consists of three stages (a minimal code sketch follows the list):
- Camera motion compensation: Estimating camera pose and depth, and applying world-to-camera alignment
- Speed retargeting: Normalizing motion speeds across different embodiments
- 3D point tracking: Using predicted camera poses and depth to reconstruct scene-level 3D trajectories for both robot and object motion
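To make the camera motion compensation and 3D lifting steps concrete, here is a minimal NumPy sketch that back-projects tracked 2D points using estimated depth and moves them into a shared world frame. The function and argument names are illustrative only, not the TraceForge API:

```python
import numpy as np


def lift_tracks_to_world(tracks_2d, depths, intrinsics, cam_to_world):
    """Lift per-frame 2D tracks into world-frame 3D traces (illustrative sketch).

    tracks_2d:    (T, N, 2) pixel coordinates of N tracked points over T frames
    depths:       (T, N) estimated depth sampled at each tracked point
    intrinsics:   (3, 3) camera intrinsic matrix
    cam_to_world: (T, 4, 4) estimated camera-to-world pose per frame
    """
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    num_frames, num_points = tracks_2d.shape[0], tracks_2d.shape[1]
    traces_world = np.zeros((num_frames, num_points, 3))
    for t in range(num_frames):
        # Back-project pixels into camera-frame 3D points using depth.
        z = depths[t]
        x = (tracks_2d[t, :, 0] - cx) * z / fx
        y = (tracks_2d[t, :, 1] - cy) * z / fy
        pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)  # (N, 4) homogeneous
        # Transform into a shared world frame to compensate for camera motion.
        traces_world[t] = (cam_to_world[t] @ pts_cam.T).T[:, :3]
    return traces_world
```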
We provide two ways to install the TraceGen conda environment. Both were tested with PyTorch 2.4.1 and CUDA 12.4.
Option 1: Create and install the environment from `environment.yml`:

```bash
conda env create -f environment.yml
conda activate trace_gen
```

Option 2: Build the environment step by step.

- Create a conda environment:

```bash
conda create -n trace_gen python=3.10
conda activate trace_gen
```

- Install PyTorch (we tested 2.4.1):

```bash
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.4 -c pytorch -c nvidia
```

- Install all dependencies:

```bash
pip install -r requirements.txt
```

After installing the environment:

- Set up your local configuration by creating a local config file with your dataset paths:

```bash
cp cfg/train.local.yaml.example cfg/train.local.yaml
# Edit cfg/train.local.yaml with your dataset directories and checkpoint paths
```

- Start training: see the Training section below for detailed examples.
Datasets prepared through TraceForge should be organized as follows:
```
data/
├── episode_01/
│   ├── images/
│   │   ├── episode_01_0.png
│   │   ├── episode_01_5.png
│   │   └── ...
│   ├── samples/
│   │   ├── episode_01_0.npz      # Contains 'keypoints' array [N, 2]
│   │   ├── episode_01_5.npz
│   │   └── ...
│   ├── depth/                    # (optional)
│   │   ├── episode_01_0_raw.npz
│   │   ├── episode_01_5_raw.npz
│   │   └── ...
│   └── task_descriptions.json
├── episode_02/
└── ...
```
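For reference, here is a minimal sketch of reading one episode in this layout; the actual loaders live in `dataio/` and may differ in details such as frame subsampling:

```python
import json
from pathlib import Path

import numpy as np
from PIL import Image


def load_episode(episode_dir):
    """Illustrative reader for one TraceForge-style episode directory."""
    episode_dir = Path(episode_dir)
    # Sort frames by their integer index (episode_01_0, episode_01_5, ...).
    image_paths = sorted(
        episode_dir.glob("images/*.png"),
        key=lambda p: int(p.stem.split("_")[-1]),
    )
    frames, keypoints = [], []
    for img_path in image_paths:
        frames.append(np.asarray(Image.open(img_path)))
        # Matching per-frame keypoints live under samples/ with the same stem.
        sample = np.load(episode_dir / "samples" / f"{img_path.stem}.npz")
        keypoints.append(sample["keypoints"])  # (N, 2) per frame
    tasks = json.loads((episode_dir / "task_descriptions.json").read_text())
    return frames, keypoints, tasks
```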
Training with Small Datasets
When training with small datasets, frequent visualization and checkpoint saving can be inefficient. We recommend the following configuration adjustments:
```yaml
save_every: 20
num_log_steps_per_epoch: 0  # Disable intra-epoch logging
eval_every: 20
visualize_every: 20
val_split: 0.1              # Or larger, to avoid zero validation samples
```

Training with Large Datasets
For large datasets, to ensure adequate logging frequency and avoid sparse checkpoints:
```yaml
save_every: 1
num_log_steps_per_epoch: 10  # Or higher for more frequent intra-epoch logging
eval_every: 1
visualize_every: 1
val_split: 0.01              # Small enough that validation does not take too long
```

Single-GPU
```bash
export CUDA_VISIBLE_DEVICES=0
python train.py \
--config cfg/train.yaml \
--override \
train.batch_size=6 \
train.lr_decoder=1.5e-4 \
model.decoder.num_layers=12 \
model.decoder.num_attention_heads=16 \
model.decoder.latent_dim=768 \
data.num_workers=4 \
hardware.mixed_precision=true \
logging.use_wandb=true \
logging.log_every=2000
```

Multi-GPU
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --standalone --nproc_per_node=4 \
train.py \
--config cfg/train.yaml \
--override \
train.batch_size=8 \
train.lr_decoder=1.5e-4 \
model.decoder.num_layers=6 \
model.decoder.num_attention_heads=12 \
model.decoder.latent_dim=768 \
data.num_workers=4 \
hardware.mixed_precision=true \
logging.use_wandb=true \
logging.log_every=2000
```

Multi-GPU, resuming from a pretrained checkpoint
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --standalone --nproc_per_node=4 \
train.py \
--config cfg/train.yaml \
--override \
train.batch_size=8 \
train.lr_decoder=1.5e-4 \
model.decoder.num_layers=6 \
model.decoder.num_attention_heads=12 \
model.decoder.latent_dim=768 \
data.num_workers=4 \
hardware.mixed_precision=true \
logging.use_wandb=true \
logging.log_every=2000 \
--resume {path_to_pretrained_checkpoint}
```

Note: Replace `{path_to_pretrained_checkpoint}` with the path to your downloaded TraceGen checkpoint. A TraceGen model pretrained on TraceForge-123k is available at https://huggingface.co/JayLee131/TraceGen.
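If you prefer to fetch the checkpoint programmatically, one option is `huggingface_hub` (this is an assumption about your tooling, not a requirement of the repo); the exact checkpoint filename inside the Hub repo depends on the release:

```python
# Optional helper: download the pretrained TraceGen weights from the Hugging Face Hub.
# Requires `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

# Downloads the JayLee131/TraceGen repo into the local HF cache and returns its path;
# point --resume at the checkpoint file inside this directory.
ckpt_dir = snapshot_download(repo_id="JayLee131/TraceGen")
print(ckpt_dir)
```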
If you enable Weights & Biases logging (`logging.use_wandb=true`), you can monitor:
- Training and validation losses
- Generated trajectory visualizations
- Predicted trajectory MSE
Testing on the TraceGen benchmarks has been released!
This dataset defines the official evaluation protocol for the TraceGen benchmark. Models are evaluated on 5 environments with the following metrics:
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
- Endpoint MSE
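For intuition, the three metrics can be understood as sketched below over a predicted trace and its ground truth; this is an illustrative definition, and the official benchmark script (`test_benchmark.py`) is the reference for exact normalization and averaging:

```python
import numpy as np


def trace_metrics(pred, gt):
    """Illustrative metric definitions for a predicted trace vs. its ground truth.

    pred, gt: (T, N, 3) arrays of N tracked points over T timesteps.
    """
    err = pred - gt
    mse = float(np.mean(err ** 2))             # Mean Squared Error over all points and steps
    mae = float(np.mean(np.abs(err)))          # Mean Absolute Error
    endpoint_mse = float(np.mean((pred[-1] - gt[-1]) ** 2))  # Squared error at the final timestep
    return {"MSE": mse, "MAE": mae, "Endpoint MSE": endpoint_mse}
```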
The official leaderboard is hosted at: https://huggingface.co/furonghuang-lab/TraceGenBenchmark
| Environment | Metric | TraceGen (×1e−2) |
|---|---|---|
| EpicKitchen | MSE | 0.445 |
| | MAE | 2.721 |
| | Endpoint MSE | 0.791 |
| Droid | MSE | 0.206 |
| | MAE | 1.289 |
| | Endpoint MSE | 0.285 |
| Bridge | MSE | 0.653 |
| | MAE | 2.419 |
| | Endpoint MSE | 0.607 |
| Libero | MSE | 0.276 |
| | MAE | 1.442 |
| | Endpoint MSE | 0.385 |
| Robomimic | MSE | 0.138 |
| | MAE | 1.416 |
| | Endpoint MSE | 0.151 |
Multi-GPU
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --standalone --nproc_per_node=4 \
test_benchmark.py \
--config cfg/train.yaml \
--override \
train.batch_size=8 \
train.lr_decoder=1.5e-4 \
model.decoder.num_layers=6 \
model.decoder.num_attention_heads=12 \
model.decoder.latent_dim=768 \
data.num_workers=4 \
hardware.mixed_precision=true \
logging.use_wandb=true \
logging.log_every=2000 \
--resume {path_to_pretrained_checkpoint}
```
Single-GPU
```bash
export CUDA_VISIBLE_DEVICES=0
python test_benchmark.py \
--config cfg/train.yaml \
--override \
train.batch_size=8 \
train.lr_decoder=1.5e-4 \
model.decoder.num_layers=6 \
model.decoder.num_attention_heads=12 \
model.decoder.latent_dim=768 \
data.num_workers=4 \
hardware.mixed_precision=true \
logging.use_wandb=true \
logging.log_every=2000 \
--resume {path_to_pretrained_checkpoint}
```
High-level overview of the repository structure:
```
Trace_gen/
├── cfg/               # Configuration files
├── dataio/            # Data loading and preprocessing
├── models/            # Model architectures
├── losses/            # Loss functions
├── trainer/           # Training loop implementation
├── utils/             # Utility functions
├── train.py           # Main training script
├── test_example.py    # Example testing script
├── test_helpers.py    # Testing utilities
├── environment.yml    # Conda environment file
├── requirements.txt   # Python dependencies
└── README.md          # This file
```
If you find this work useful, please consider citing our paper:
```bibtex
@article{lee2025tracegen,
  title={TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos},
  author={Lee, Seungjae and Jung, Yoonkyo and Chun, Inkook and Lee, Yao-Chih and Cai, Zikui and Huang, Hongjia and Talreja, Aayush and Dao, Tan Dat and Liang, Yongyuan and Huang, Jia-Bin and Huang, Furong},
  journal={arXiv preprint arXiv:2511.21690},
  year={2025}
}
```

Our code modifies and builds upon:
- CogVideoX from HuggingFace Diffusers for the 3D trace generation model.
- Prismatic VLMs for insights on multimodal encoder design.

