TL;DR: ORV generates robot videos under the geometric guidance of 4D occupancy, achieves higher control precision, shows strong generalization, produces multiview-consistent videos, and supports simulation-to-real visual transfer.
ORV: 4D Occupancy-centric Robot Video Generation
Xiuyu Yang*, Bohan Li*, Shaocong Xu, Nan Wang, Chongjie Ye, Zhaoxi Chen, Minghan Qin, Yikang Ding, Zheng Zhu, Xin Jin, Hang Zhao, Hao Zhao
Preprint (arXiv 2506.03079)
If you find our work useful in your research, please consider citing our paper:
@article{yang2025orv,
title={ORV: 4D Occupancy-centric Robot Video Generation},
author={Yang, Xiuyu and Li, Bohan and Xu, Shaocong and Wang, Nan and Ye, Chongjie and Chen, Zhaoxi and Qin, Minghan and Ding, Yikang and Jin, Xin and Zhao, Hang and Zhao, Hao},
journal={arXiv preprint arXiv:2506.03079},
year={2025}
}
Clone the ORV repository first:
git clone --recurse-submodules https://github.com/OrangeSodahub/ORV.git
cd ORV/
Create a new Python environment:
conda create -n orv python=3.10
conda activate orv
pip install -r requirements.txt
Note that we use CUDA 11.8 by default; please modify the lines in requirements.txt shown below to match your own version:
torch==2.5.1 --index-url https://download.pytorch.org/whl/cu118
torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118
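After installing the requirements, a quick sanity check (not part of the official scripts) that PyTorch sees your GPU and the expected CUDA build:

```python
# Minimal environment check; run inside the `orv` conda environment.
import torch

print(torch.__version__)          # expect 2.5.1 (+cu118 with the default wheels)
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # should be True on a GPU machine
```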
Our checkpoints are hosted in a Hugging Face repo; feel free to download them.
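For example, the checkpoints can be fetched with huggingface_hub; the repo id below is a placeholder, so substitute the actual ORV checkpoint repo linked above:

```python
# Sketch: download the released checkpoints with huggingface_hub.
# "<orv-checkpoint-repo>" is a placeholder -- replace it with the actual repo id.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<orv-checkpoint-repo>",
    local_dir="checkpoints/orv",
)
print("Checkpoints downloaded to", local_dir)
```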
For the BridgeV2 and RT-1 data (single-view), we primarily reuse the video-trajectory data from IRASim (originally from the OXE version). We also provide the download links below for convenience:
| Data | Train | Evaluation |
|---|---|---|
| BridgeV2 | bridge_train_data | bridge_eval_data |
| RT-1 | rt1_train_data | rt1_eval_data |
These versions of the data have raw resolutions of 480×640 for BridgeV2 and 256×320 for RT-1; we train ORV models at a preprocessed resolution of 320×480 (please refer to Section E.3 in the paper for details).
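For reference, a minimal sketch of resizing a raw frame to the training resolution; the exact crop/resize pipeline follows Section E.3 of the paper, and this snippet only illustrates a plain resize with OpenCV:

```python
# Illustrative only: downscale a raw 480x640 BridgeV2 frame to 320x480.
import cv2
import numpy as np

def resize_frame(frame: np.ndarray, target_hw=(320, 480)) -> np.ndarray:
    h, w = target_hw
    return cv2.resize(frame, (w, h), interpolation=cv2.INTER_AREA)  # cv2 expects (W, H)

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy raw frame
print(resize_frame(frame).shape)                 # (320, 480, 3)
```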
Please download the official BridgeV2 data (each episode has 1~3 views) from the official BridgeV2 tfds and then extract the usable bridge data:
bash scripts/extract_data_tfds.sh bridge
We follow the official Droid tutorials to download the DROID dataset (each episode has 2 views) in RLDS format and then extract it:
# download raw tfds data (~1.7TB)
gsutil -m cp -r gs://gresearch/robotics/droid <path_to_your_target_dir>
# extract
bash scripts/extract_data_tfds.sh droid
These versions of the data have raw resolutions of 256×256 for BridgeV2 and 180×256 for Droid; we train ORV models at 320×480 for BridgeV2 and 256×384 for Droid.
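If you want to inspect the raw RLDS/TFDS episodes before extraction, here is a minimal sketch with tensorflow_datasets; the observation key names differ between BridgeV2 and Droid, so treat them as assumptions and check the dataset features first:

```python
# Sketch: peek into raw RLDS episodes with tensorflow_datasets.
import tensorflow_datasets as tfds

# "<path_to_your_target_dir>/droid" is a placeholder for the downloaded dataset directory.
builder = tfds.builder_from_directory("<path_to_your_target_dir>/droid")
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    for step in episode["steps"]:   # RLDS stores each episode as a nested dataset of steps
        obs = step["observation"]
        print(list(obs.keys()))     # camera images, robot state, etc. (dataset-specific)
        break
```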
To be finished.
Training or evaluation with preprocessed latent data, instead of encoding videos or images online, dramatically saves memory and time. We use the VAE loaded from the Hugging Face model THUDM/CogVideoX-2b.
Please refer to scripts/encode_dataset.sh and scripts/encode_dataset_dist.sh to encode images or videos into latents and save them to disk. First check the arguments in the scripts (--dataset, --data_root and --output_dir) and then run:
# single process
bash scripts/encode_dataset.sh $SPLIT $BATCH
# multiple processes
bash scripts/encode_dataset_dist.sh $GPU $SPLIT $BATCH
where $SPLIT is one of 'train', 'val', 'test', $GPU is the number of devices, and $BATCH is the batch size of the dataloader (we recommend using 1).
For the data reused from IRASim, please ignore the processed latent data; only the raw .mp4 data will be used.
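For reference, the latent encoding uses the CogVideoX-2b VAE via diffusers; below is a minimal sketch of the core call (the actual scripts additionally handle batching, normalization and saving to disk):

```python
# Sketch: encode a video clip into latents with the CogVideoX-2b VAE (diffusers).
# The real pipeline lives behind scripts/encode_dataset*.sh; this only shows the VAE call.
import torch
from diffusers import AutoencoderKLCogVideoX

vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16
).to("cuda").eval()

# Dummy clip: (batch, channels, frames, height, width), values in [-1, 1].
video = torch.randn(1, 3, 17, 320, 480, dtype=torch.float16, device="cuda")

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample() * vae.config.scaling_factor

print(latents.shape)  # temporally and spatially downsampled latents
```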
We first obtain the basic single-view action-to-video generation model through SFT, starting from the pretrained THUDM/CogVideoX-2b (text-to-video) model. Please check out and run the following script:
bash scripts/train_control_traj-image_finetune_2b.sh --dataset_type $DATASET
where $DATASET is chosen from ['bridgev2', 'rt1', 'droid'].
- CUDA devices: please set the correct value for the key ACCELERATE_CONFIG_FILE in these .sh scripts, which is used for accelerate launching. Predefined .yaml files are at config/accelerate.
- Experimental settings: each configuration in the config/traj_image_*.yaml files corresponds to one training experimental setting and one model. Please set the correct value for the key EXP_CONFIG_PATH in the scripts.
We incorporate occupancy-derived conditions to achieve more accurate control. First set the correct path to the stage-1 pretrained model in config/traj_image_condfull_2b_finetune.yaml:
transformer:
  <<: *runtime
  pretrained_model_name_or_path: THUDM/CogVideoX-2b
  transformer_model_name_or_path: outputs/orv_bridge_traj-image_480-320_finetune_2b_30k/checkpoint
Then run the following script (the yaml config file above is set in this script):
bash scripts/train_control_traj-image-cond_finetune.sh
This step further extends the single-view generation model to a multiview generation model. First set the correct path to the pretrained single-view model in config/traj_image_2b_multiview.yaml and then run the following script:
bash scripts/train_control_traj-image-multiview.sh
Note that the RGB and condition data of all views need to be processed into latents first.
Generally, run the following script to perform inference with the trained model on a specific dataset:
# single process
bash scripts/eval_control_to_video.sh
# multiple processes
bash scripts/eval_control_to_video_dist.sh $GPU
Please choose the correct *.yaml configuration file in the scripts:
- eval_traj_image_2b_finetune.yaml: base action-to-video model
- eval_traj_image_cond_2b_finetune.yaml: single-view occupancy-conditioned model
- eval_traj_image_condfull_2b_multiview.yaml: multiview occupancy-conditioned model
Set the keys GT_PATH and PRED_PATH in the following script and run it to calculate the metrics (refer to Section E.4 in the paper for more details):
bash scripts/eval_metrics.sh
To be finished.
- Release arXiv technical report
- Release full code
- Release checkpoints
- Finish the full instructions
- Release processed data
Thanks to these excellent open-source works and models: CogVideoX; diffusers.
