This is the official code base for the paper Vid2World: Crafting Video Diffusion Models to Interactive World Models.
Give it a star ⭐ if you find our work useful!
- 🚩 2025-12: We release all model checkpoints on 🤗 Huggingface.
- 🚩 2025-12: We release code for training, inference, and evaluation.
We repurpose internet-scale pretrained video diffusion models into interactive world models:
- Converts non-causal video diffusion backbones into autoregressive, temporally causal architectures with frame-level action conditioning.
- Enables high-fidelity, action-conditioned video simulation and scalable world model learning across robot manipulation, 3D game simulation, and open-world navigation.
Note: The code is tested on Ubuntu 20.04, 22.04, and AlmaLinux 9.5.
First create your conda environment:
```
conda create -n v2w python=3.8 -y
conda activate v2w
```

Then, install dependencies:

```
pip install -r requirements.txt
```
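Optionally, verify the environment before continuing; a minimal sanity check (assuming PyTorch is installed via `requirements.txt`, which the `torch.distributed.launch` commands below rely on):

```python
# Optional sanity check: confirm the environment sees your GPUs before training.
# Assumes PyTorch was installed via requirements.txt.
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```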
For training and evaluation:

- Download the base video model (DynamiCrafter, 320×512) and save it to `checkpoints/dynamicrafter_512_v1/model.ckpt`.
- Download the pretrained i3d model and save it to `checkpoints/i3d/i3d_torchscript.pt`.
At this point, your checkpoints folder should look like this:
```
checkpoints
├── dynamicrafter_512_v1
│   └── model.ckpt
└── i3d
    └── i3d_torchscript.pt
```
At the moment, we provide the following models:

| File | Domain | Weight Transfer Method | Action Guidance | Training Steps |
|---|---|---|---|---|
| Vid2World-RT1 | RT-1 | Extrapolative | ✔️ | 100k |
| Vid2World-CSGO | CSGO | Extrapolative | ✔️ | 100k |
| Vid2World-RECON | RECON | Extrapolative | ✔️ | 100k |
| Vid2World-RT1-NAG | RT-1 | Extrapolative | ❌ | 30k |
| Vid2World-RT1-Masked-NAG | RT-1 | Masked | ❌ | 30k |
| Vid2World-RT1-30k | RT-1 | Extrapolative | ✔️ | 30k |
| Vid2World-RT1-Masked | RT-1 | Masked | ✔️ | 30k |
| Vid2World-RT1-Shift | RT-1 | Shift | ✔️ | 30k |
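One way to fetch a released checkpoint is through `huggingface_hub`; the sketch below uses a placeholder repo id, since the exact Hugging Face repository names are not reproduced here:

```python
# Sketch of fetching a released checkpoint with huggingface_hub.
# The repo id below is a placeholder -- substitute the actual Hugging Face repo
# of the model you picked from the table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<org>/Vid2World-RT1")  # hypothetical repo id
print("Checkpoint files downloaded to:", local_dir)
```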
Before inference, make sure you change `<your_pretrained_checkpoint>` in the config file to the path of your local checkpoint.
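If you prefer to script this, a minimal sketch (assuming the config literally contains the `<your_pretrained_checkpoint>` placeholder; the config path is just an example, use whichever config you are about to run):

```python
# Minimal sketch: point a config at your local checkpoint by replacing the
# placeholder string in place. The config path is an example, and this assumes
# the file literally contains the <your_pretrained_checkpoint> placeholder.
from pathlib import Path

cfg = Path("configs/manipulation/config_rt1_action_control_test.yaml")  # example config
cfg.write_text(
    cfg.read_text().replace("<your_pretrained_checkpoint>", "/path/to/model.ckpt")
)
```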
🤖 Robot Manipulation 🦾 (demo video: all_combined.mp4)

🎮 Game Simulation 🕹️ (demo video: all_combined.1.mp4)

🗺️ Open-World Navigation 🧭 (demo video: all_combined.3.mp4)
For more showcases, check out our Project Page.
To download and preprocess the dataset:
- Download the RT-1 Robot Action Dataset from OXE.
- Run the following command in the repo to save the processed dataset to your desired local folder.
```
python lvdm/data/oxe_data_converter.py --dataset_name fractal20220817_data --input_path {path to downloaded OXE} --output_path {path to stored npz}
```

For inference, download the corresponding pretrained model from 🤗 Huggingface; see QuickStart.
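Before launching training, you can optionally inspect one converted episode; the sketch below only lists whatever keys the converter stored, since the exact output format is not documented here:

```python
# Quick look at one converted episode before training (a sketch: the exact keys
# and shapes depend on oxe_data_converter.py's output, so we just print whatever
# is stored).
from pathlib import Path

import numpy as np

npz_dir = Path("/path/to/stored_npz")        # the --output_path used above
sample = next(npz_dir.rglob("*.npz"))        # assumes episodes are saved as .npz files
with np.load(sample, allow_pickle=True) as data:
    for key in data.files:
        arr = data[key]
        print(key, getattr(arr, "shape", type(arr)))
```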
To launch training with the RT-1 dataset, go to `configs/manipulation/config_rt1_train.yaml` and change `<your_data_dir>` to your local data directory. To launch training on 1 node with 4 GPUs, use the following command:
```
python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_addr=127.0.0.1 --master_port=12869 --node_rank=0 ./main/trainer.py --base configs/manipulation/config_rt1_train.yaml --train --name training_512_v1.0 --logdir <your_log_dir> --devices 4 lightning.trainer.num_nodes=1
```

For ablation experiments, we provide the corresponding configurations in `configs/ablation`.
| File | Weight Transfer Method | Action Guidance | Model Checkpoint |
|---|---|---|---|
| `config_rt1_*_masked_nag.yaml` | Masked | ❌ | 🤗 Vid2World-RT1-Masked-NAG |
| `config_rt1_*_extrp_nag.yaml` | Extrapolative | ❌ | 🤗 Vid2World-RT1-NAG |
| `config_rt1_*_shift.yaml` | Shift | ✔️ | 🤗 Vid2World-RT1-Shift |
| `config_rt1_*_masked.yaml` | Masked | ✔️ | 🤗 Vid2World-RT1-Masked |
| `config_rt1_*_all.yaml` | Extrapolative | ✔️ | 🤗 Vid2World-RT1-30k |
We provide two inference setups: Auto-Regressive Generation, which generates the sequence frame by frame, and Non-Auto-Regressive Generation, which generates the full sequence in a single pass.
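The difference is only in how frames are produced at inference time; the schematic below is illustrative only (the `predict_next` / `predict_sequence` methods are hypothetical, not the repository's actual API):

```python
# Schematic of the two setups (illustration only, not this repo's code).
# Auto-regressive: each new frame is predicted from previously generated frames
# and the current action, then fed back in. Non-auto-regressive: the full clip
# is generated in one pass conditioned on the whole action sequence.

def rollout_autoregressive(model, context_frames, actions):
    frames = list(context_frames)
    for action in actions:
        next_frame = model.predict_next(frames, action)      # hypothetical method
        frames.append(next_frame)
    return frames

def rollout_non_autoregressive(model, context_frames, actions):
    return model.predict_sequence(context_frames, actions)   # hypothetical method
```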
Before running the experiments, make sure you have downloaded or trained the corresponding checkpoints and updated the data paths in the config files you use.
For auto-regressive generation, run:
```
python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_addr=127.0.0.1 --master_port=12869 --node_rank=0 ./main/trainer.py --base code_release_configs/manipulation/config_rt1_test_ar.yaml --val --name training_512_v1.0 --logdir <your_log_dir> --devices 4 lightning.trainer.num_nodes=1
```

When running ablations, switch to the corresponding configuration file.
For non-auto-regressive generation, run:
```
python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_addr=127.0.0.1 --master_port=12869 --node_rank=0 ./main/trainer.py --base code_release_configs/manipulation/config_rt1_test_nar.yaml --val --name training_512_v1.0 --logdir <your_log_dir> --devices 4 lightning.trainer.num_nodes=1
```

The action-control test evaluates the model's ability to respond to different `world_vector` actions (X+, X-, Y+, Y-, Z+, Z-).
First, update the config file `configs/manipulation/config_rt1_action_control_test.yaml`:

- Set `pretrained_checkpoint` to your checkpoint path
- Set `data_dir` to your RT-1 data directory
Then run:
```
python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_addr=127.0.0.1 --master_port=12869 --node_rank=0 ./main/trainer.py --base configs/manipulation/config_rt1_action_control_test.yaml --val --name rt1_action_control_test --logdir <your_log_dir> --devices 4 lightning.trainer.num_nodes=1
```

Results will be saved to the directory specified by the config file's `save_dir` parameter. Each batch visualizes 8 action variants side by side for comparison.
To download and preprocess the data, please follow the steps from DIAMOND, specifically:

- Download the `.tar` files in `dataset_dm_scraped_dust2_tars` from this dataset repo.
- Use the provided script to process the dataset at full and low resolution. For our purposes, we use only the `full_res` folder (a quick check is sketched below).
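A quick, optional check that the processed split is populated (the directory path below is an assumption for illustration, not something this repo prescribes):

```python
# Optional sketch: confirm the full_res split is non-empty before pointing the
# CSGO training config at it. The path is a placeholder for your own output dir.
from pathlib import Path

full_res = Path("/path/to/processed_csgo/full_res")   # your processed-data location
n_files = sum(1 for p in full_res.rglob("*") if p.is_file())
print(f"{n_files} files under {full_res}")
```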
For inference, download the corresponding pretrained model from 🤗 Huggingface; see QuickStart.
To launch training with the CSGO dataset, go to `configs/game/config_csgo_train.yaml` and change `<your_data_dir>` to your local data directory. To launch training on 1 node with 4 GPUs, use the following command:
```
python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_addr=127.0.0.1 --master_port=12869 --node_rank=0 ./main/trainer.py --base configs/game/config_csgo_train.yaml --train --name training_512_v1.0 --logdir <your_log_dir> --devices 4 lightning.trainer.num_nodes=1
```

For inference, run:
```
python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_addr=127.0.0.1 --master_port=12869 --node_rank=0 ./main/trainer.py --base configs/game/config_csgo_test.yaml --val --name training_512_v1.0 --logdir <your_log_dir> --devices 4 lightning.trainer.num_nodes=1
```

For long rollout inference on CSGO, run:
```
python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_addr=127.0.0.1 --master_port=12869 --node_rank=0 ./main/trainer.py --base configs/game/config_csgo_test_long_rollout.yaml --val --name training_512_v1.0 --logdir <your_log_dir> --devices 4 lightning.trainer.num_nodes=1
```

For long rollout inference on previously unseen games (Valorant, Delta Force), run:
Valorant:
```
python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --master_addr=127.0.0.1 --master_port=12869 --node_rank=0 ./main/trainer.py --base configs/game/config_csgo_test_long_rollout_valorant.yaml --val --name training_512_v1.0 --logdir <your_log_dir> --devices 2 lightning.trainer.num_nodes=1
```

Delta Force:
```
python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --master_addr=127.0.0.1 --master_port=12879 --node_rank=0 ./main/trainer.py --base configs/game/config_csgo_test_long_rollout_delta_force.yaml --val --name training_512_v1.0 --logdir <your_log_dir> --devices 2 lightning.trainer.num_nodes=1
```

To download and preprocess the data, please follow the steps from NoMaD, specifically:
- Download the RECON dataset.
- Change the preprocessing resolution to (640,480).
- Run `process_recon.py` to save the processed dataset to your desired local folder.
For inference, download the corresponding pretrained model from 🤗 Huggingface; see QuickStart.
To launch training with the RECON dataset, go to `configs/navigation/config_recon_train.yaml` and change `<your_data_dir>` to your local data directory. To launch training on 1 node with 4 GPUs, use the following command:
```
python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_addr=127.0.0.1 --master_port=12869 --node_rank=0 ./main/trainer.py --base configs/navigation/config_recon_train.yaml --train --name training_512_v1.0 --logdir <your_log_dir> --devices 4 lightning.trainer.num_nodes=1
```

Following NWM, we evaluate performance under two setups: single-step generation and auto-regressive generation. Although our model generates auto-regressively in both setups, the data splits differ, and we support both.
Change the `<data_dir>` and `<path_to_pretrained_checkpoint>` in `configs/navigation/config_recon_test_single_step.yaml`, then run:
```
python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_addr=127.0.0.1 --master_port=12869 --node_rank=0 ./main/trainer.py --base configs/navigation/config_recon_test_single_step.yaml --val --name training_512_v1.0 --logdir <your_log_dir> --devices 4 lightning.trainer.num_nodes=1
```

Change the `<data_dir>` and `<path_to_pretrained_checkpoint>` in `configs/navigation/config_recon_test_rollout.yaml`, then run:
```
python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_addr=127.0.0.1 --master_port=12869 --node_rank=0 ./main/trainer.py --base configs/navigation/config_recon_test_rollout.yaml --val --name training_512_v1.0 --logdir <your_log_dir> --devices 4 lightning.trainer.num_nodes=1
```

Note: Check out this issue if you encounter the following error message:

```
ImportError: cannot import name 'trunc_normal_' from 'utils' (unknown location)
```
For evaluation, after running the inference code, calculate the metrics by running:

```
python eval.py --exp_folder <your_log_image_dir> --env <rt1/csgo/recon_time/recon_rollout>
```
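If you have results for several domains, a small convenience sketch that loops `eval.py` over all four environments (the log-image folders are placeholders for your own runs):

```python
# Convenience sketch: run eval.py for all four environments in sequence, using
# only the flags shown above. Replace the folders with your own log-image dirs.
import subprocess

runs = {
    "rt1":           "/path/to/rt1_log_images",
    "csgo":          "/path/to/csgo_log_images",
    "recon_time":    "/path/to/recon_time_log_images",
    "recon_rollout": "/path/to/recon_rollout_log_images",
}
for env, exp_folder in runs.items():
    subprocess.run(
        ["python", "eval.py", "--exp_folder", exp_folder, "--env", env],
        check=True,
    )
```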
If you find our code useful, please consider citing our paper:

```
@article{huang2025vid2world0,
  title={Vid2World: Crafting Video Diffusion Models to Interactive World Models},
  author={Siqiao Huang and Jialong Wu and Qixing Zhou and Shangchen Miao and Mingsheng Long},
  year={2025},
  journal={arXiv preprint arXiv:2505.14357}
}
```

If you have any questions, please contact [email protected].
We sincerely appreciate the following GitHub repos, whose valuable codebases we build upon:
- https://github.com/Doubiiu/DynamiCrafter
- https://github.com/thuml/iVideoGPT
- https://github.com/facebookresearch/nwm
- https://github.com/eloialonso/diamond
- https://github.com/universome/stylegan-v

