Chenhui Gou³ · Yefei He⁴ · Feng Chen¹† · Xinyu Zhang⁵‡ · Lingqiao Liu¹†‡
¹Adelaide University ²The Australian National University ³Monash University ⁴Zhejiang University ⁵University of Auckland
\* Equal contribution † Project lead ‡ Corresponding author
LiveWorld enables persistent out-of-sight dynamics. Instead of freezing unobserved regions, our framework explicitly decouples world evolution from observation rendering. We register stationary Monitors to autonomously fast-forward the temporal progression of active entities in the background. As the observer explores the scene along the target trajectory, our state-aware renderer projects the continuously evolved world states to synthesize the final observation, ensuring that dynamic events progress naturally even when entities are completely out of the observer's view.
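The decoupling above can be illustrated with a toy loop (purely illustrative, not the actual LiveWorld code): a stationary Monitor keeps advancing an entity's state on every tick, and the renderer only projects whatever currently falls inside the observer's view.

```python
# Toy illustration (NOT the actual LiveWorld implementation): world evolution
# is decoupled from observation rendering, so an entity's state keeps
# advancing even when no frame renders it.

class Monitor:
    """Fast-forwards one entity's temporal state, independent of rendering."""
    def __init__(self, entity_id, position=0.0, speed=1.0):
        self.entity_id = entity_id
        self.position = position  # 1-D stand-in for the entity's world state
        self.speed = speed

    def tick(self, dt):
        # Runs every step, whether or not the observer can see the entity.
        self.position += self.speed * dt


def render(monitors, view_min, view_max):
    """Project only entities whose state falls inside the observer's view."""
    return {m.entity_id: m.position
            for m in monitors if view_min <= m.position <= view_max}


monitors = [Monitor("dog", position=0.0, speed=2.0)]
frames = []
for step in range(5):
    for m in monitors:
        m.tick(dt=1.0)                         # the world evolves every step...
    frames.append(render(monitors, 0.0, 4.0))  # ...but rendering may miss it

# The dog leaves the view after two frames, yet its state reaches 10.0:
# out-of-sight dynamics stay live instead of freezing.
```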
- Inference code and pretrained weights (initial version in the paper)
- Training code
- Training data preparation pipeline
- Inference sample preparation pipeline
- Demo data and examples (inference + training)
- LiveBench and its inference scripts
- Model trained on 1.3B backbone
- Model (14B/1.3B) with better training
Tested environment:
- Ubuntu 22.04, Python 3.11, CUDA 12.8, PyTorch 2.8 (cu128)
- H100 96GB / H200 140GB, 128GB+ system memory
- Flash Attention 2.8.3 (Flash Attention 3 is also supported on Hopper GPUs but requires building from source)
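Before running setup, a quick version sanity check can save a failed install. The helper below is a hypothetical convenience script, not part of the repo; the version floors mirror the tested environment above.

```python
# Hypothetical pre-setup sanity check (not part of the repo). Version floors
# mirror the tested environment: Python 3.11, PyTorch 2.8, Flash Attention 2.8.3.
import importlib.metadata as md
import sys

def parse_version(v):
    """'2.8.0+cu128' -> (2, 8, 0): keep only leading numeric fields."""
    parts = []
    for field in v.split("+")[0].split("."):
        digits = "".join(ch for ch in field if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_min(installed, required):
    return parse_version(installed) >= parse_version(required)

if __name__ == "__main__":
    print("python >= 3.11:", sys.version_info >= (3, 11))
    for pkg, floor in [("torch", "2.8"), ("flash-attn", "2.8.3")]:
        try:
            print(pkg, "ok:", meets_min(md.version(pkg), floor))
        except md.PackageNotFoundError:
            print(pkg, "not installed")
```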
1. Create conda environment
conda create -n liveworld python=3.11 -y
conda activate liveworld

2. Clone the repository and install ffmpeg
git clone https://github.com/ZichengDuan/LiveWorld.git
cd LiveWorld
# Install ffmpeg (either way works)
conda install -c conda-forge ffmpeg -y
# or: sudo apt-get update && sudo apt-get install -y ffmpeg

3. Install PyTorch
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

4. Setup environment (install dependencies + download model weights)
bash setup_env.sh

This script will:
- Install Python dependencies from setup/requirements.txt
- Install LiveWorld and local packages (SAM3, Stream3R)
- Download all pretrained weights (~100GB) into ckpts/
- Download example data into examples/
📦 Downloaded model weights
| Model | Source | Purpose |
|---|---|---|
| LiveWorld State Adapter + LoRA | ZichengD/LiveWorld | Core LiveWorld weights |
| Wan2.1-T2V-14B | Wan-AI/Wan2.1-T2V-14B | Backbone |
| Wan2.1-Fun-1.3B-InP | alibaba-pai/Wan2.1-Fun-1.3B-InP | VAE (data preparation) |
| Wan2.1-T2V-14B-StepDistill | lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill | Distilled backbone (optional fast inference) |
| Qwen3-VL-8B-Instruct | Qwen/Qwen3-VL-8B-Instruct | Entity detection |
| SAM3 | facebook/sam3 | Video segmentation |
| STream3R | yslan/STream3R | 3D reconstruction |
| DINOv3 | facebook/dinov3-vith16plus-pretrain-lvd1689m | Entity matching |
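Once setup_env.sh finishes, it can be worth confirming the weights actually landed on disk. The snippet below is a hypothetical check, not part of the repo; the folder names under ckpts/ are assumptions based on the table above, so adjust them to match what the script actually creates.

```python
# Hypothetical post-setup check: confirm the expected weight folders exist
# under ckpts/. The entries in REQUIRED are assumptions based on the weights
# table; adjust them to what setup_env.sh actually downloads.
from pathlib import Path

def missing_checkpoints(root, required):
    """Return the required entries that are absent under `root`."""
    root = Path(root)
    return [name for name in required if not (root / name).exists()]

REQUIRED = [
    "Wan2.1-T2V-14B",   # backbone (assumed folder name)
    "sam3",             # video segmentation (assumed folder name)
    "STream3R",         # 3D reconstruction (assumed folder name)
]

if __name__ == "__main__":
    gaps = missing_checkpoints("ckpts", REQUIRED)
    print("all present" if not gaps else f"missing: {gaps}")
```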
Place source images in examples/inference_sample/raw/, then run:
bash create_infer_sample.sh

This generates per-image inference configs under examples/inference_sample/processed/, including:
- Scene point cloud (via Stream3R)
- Camera trajectories
- Entity detection and storyline (via Qwen3-VL)
A pre-built sample (kid_coffee) is already included under examples/inference_sample/processed/, so you can skip this step and go directly to inference.
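If you prepare several images, a small helper can enumerate the generated configs so you know what to point infer.sh at. This is an illustrative sketch, not repo code; only the processed/&lt;image&gt;/infer_scripts/*.yaml layout is taken from this README.

```python
# Illustrative helper (not part of the repo): list the per-image inference
# configs that create_infer_sample.sh produced under processed/.
from pathlib import Path

def list_infer_configs(processed_root):
    """Map each processed image name to its generated YAML config paths."""
    configs = {}
    for image_dir in sorted(Path(processed_root).iterdir()):
        script_dir = image_dir / "infer_scripts"
        if script_dir.is_dir():
            configs[image_dir.name] = sorted(str(p) for p in script_dir.glob("*.yaml"))
    return configs

# Example: list_infer_configs("examples/inference_sample/processed")
```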
Edit infer.sh to set your config path and GPU, then run:
bash infer.sh

Example infer.sh:
export CUDA_VISIBLE_DEVICES=0
python scripts/infer.py \
--config examples/inference_sample/processed/kid_coffee/infer_scripts/case1_right.yaml \
--system-config configs/infer_system_config_few_step_14B.yaml \
--output-root outputs \
--device cuda:0

Two system configs are provided:
- configs/infer_system_config_14B.yaml – full-step inference
- configs/infer_system_config_few_step_14B.yaml – 4-step distilled inference (faster)
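To sweep multiple configs without editing infer.sh each time, the invocation above can also be assembled programmatically. A minimal sketch (build_infer_cmd is hypothetical; the flags match the example infer.sh):

```python
# Hypothetical wrapper around scripts/infer.py (flags taken from the example
# infer.sh above); useful for sweeping several configs in one go.
import subprocess

def build_infer_cmd(config,
                    system_config="configs/infer_system_config_few_step_14B.yaml",
                    output_root="outputs",
                    device="cuda:0"):
    """Return the argv list for one inference run."""
    return ["python", "scripts/infer.py",
            "--config", config,
            "--system-config", system_config,
            "--output-root", output_root,
            "--device", device]

# Example (uncomment to actually run):
# cfg = "examples/inference_sample/processed/kid_coffee/infer_scripts/case1_right.yaml"
# subprocess.run(build_infer_cmd(cfg), check=True)
```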
Place source videos in examples/training_sample/raw/ (organized by dataset name), then run:
bash create_train_sample.sh

This runs a 4-step pipeline:
- Build samples – clip extraction, entity detection (Qwen3-VL), segmentation (SAM3), geometry estimation (Stream3R), sample construction
- Captioning – generate text descriptions with Qwen3-VL
- VAE encode – encode videos to latent space
- Pack LMDB – package into sharded LMDB for training
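Step 4 packs samples into a sharded LMDB. As a rough illustration of how shard assignment can stay deterministic across runs, here is a generic key-hashing sketch; it is not necessarily the scheme step4a_pack_lmdb.py uses.

```python
# Generic sharding sketch (not necessarily the repo's scheme): hash the sample
# key, take it modulo the shard count, so the same key always lands in the
# same shard across runs and machines.
import hashlib

def shard_index(sample_key, num_shards):
    """Stable shard id for a sample key."""
    digest = hashlib.md5(sample_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def group_into_shards(sample_keys, num_shards):
    """Bucket keys by shard, ready to be written to per-shard LMDB files."""
    shards = {i: [] for i in range(num_shards)}
    for key in sample_keys:
        shards[shard_index(key, num_shards)].append(key)
    return shards
```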
Example training samples from MIRA, RealEstate10K, and SpatiaVID_HQ are included under examples/training_sample/.
Edit train.sh to set your GPU configuration, then run:
bash train.sh

Training configs:
- configs/train_liveworld_14B.yaml – 14B backbone
- configs/train_liveworld_1-3B.yaml – 1.3B backbone
Both train.sh and create_train_sample.sh support multi-node multi-GPU; edit the NODES and CUDA_VISIBLE_DEVICES_LIST arrays at the top of each script.
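The NODES and CUDA_VISIBLE_DEVICES_LIST arrays form a launch matrix. A hypothetical sketch of how such arrays can expand into per-process assignments, assuming one comma-separated device string per node (the scripts' actual semantics may differ):

```python
# Hypothetical expansion of the launch arrays (assumes one comma-separated
# CUDA_VISIBLE_DEVICES string per node; check the scripts for the real rules).
def launch_matrix(nodes, cuda_visible_devices_list):
    """Pair each node with its GPU list, yielding (node, local_rank, gpu)."""
    if len(nodes) != len(cuda_visible_devices_list):
        raise ValueError("expected one CUDA_VISIBLE_DEVICES entry per node")
    jobs = []
    for node, devices in zip(nodes, cuda_visible_devices_list):
        for rank, gpu in enumerate(devices.split(",")):
            jobs.append((node, rank, int(gpu)))
    return jobs

# Example: launch_matrix(["node0", "node1"], ["0,1", "2,3"])
```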
LiveWorld/
├── infer.sh                    # Inference entry point
├── train.sh                    # Training entry point
├── create_infer_sample.sh      # Inference sample preparation
├── create_train_sample.sh      # Training data preparation
├── setup_env.sh                # One-click environment setup
├── setup/                      # Installation scripts & requirements
├── configs/
│   ├── infer_system_config_14B.yaml           # Full-step inference config
│   ├── infer_system_config_few_step_14B.yaml  # 4-step distilled inference config
│   ├── train_liveworld_14B.yaml               # 14B training config
│   ├── train_liveworld_1-3B.yaml              # 1.3B training config
│   └── data_preparation.yaml                  # Data preparation config
├── liveworld/                  # Core package
│   ├── trainer.py              # Task definition + training loop
│   ├── wrapper.py              # Model wrappers (VAE, text encoder, State Adapter)
│   ├── dataset.py              # LMDB dataset loader
│   ├── utils.py                # Utilities
│   ├── geometry_utils.py       # Geometry & projection utilities
│   └── pipelines/
│       ├── pipeline_unified_backbone.py  # Unified Backbone
│       ├── pointcloud_updater.py         # Stream3R point cloud handler
│       └── monitor_centric/              # Monitor-Centric Evolution Pipeline
├── scripts/
│   ├── infer.py                # Inference script
│   ├── train.py                # Training script
│   ├── create_infer_sample/    # Inference sample creation
│   │   ├── assemble_event_bench.py    # Main assembly (trajectory + storyline)
│   │   ├── build_scene_pointcloud.py  # Scene 3D reconstruction
│   │   └── plot_trajectories_3d.py    # Trajectory visualization
│   ├── create_train_data/      # Training data processing steps
│   │   ├── step1_build_samples.py  # Clip extraction + geometry + samples
│   │   ├── step2_captioning.py     # Video captioning
│   │   ├── step3_vae_encode.py     # VAE encoding
│   │   ├── step4a_pack_lmdb.py     # LMDB packing
│   │   └── step4b_cache_keys.py    # Key caching
│   └── dataset_preparation/    # Legacy data preparation
├── examples/                   # Sample data
│   ├── inference_sample/       # Inference example
│   │   ├── raw/                # Source images (input)
│   │   └── processed/          # Generated configs + point clouds (output)
│   └── training_sample/        # Training example
│       ├── raw/                # Source videos (input)
│       ├── processed/          # Extracted samples (intermediate)
│       └── processed_lmdb/     # Packed LMDB (output)
├── misc/
│   ├── sam3/                   # SAM3 (local package)
│   └── STream3R/               # Stream3R (local package)
└── ckpts/                      # Model weights (downloaded by setup_env.sh)
How to summon foreground entities into the scene?
By default, the system only generates scene_text (background description) and does not automatically produce fg_text. To introduce foreground entities (e.g., a person, animal, or object), manually add a fg_text field to the corresponding iteration in your inference YAML config:
iter_input:
  '0':
    scene_text: The brick wall backdrop remains visible behind the stall...
    fg_text: 'On the right, a lovely corgi dog sits on a wooden bench under the wall, staying steadily on the bench and resting.'

The fg_text describes the foreground entity you want to appear and its behavior. You can edit this in any generated config under examples/inference_sample/processed/<image>/infer_scripts/*.yaml.
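Programmatically, the same edit amounts to inserting one key into the loaded config. A minimal sketch (add_fg_text is hypothetical; YAML load/save, e.g. via PyYAML, is omitted):

```python
# Hypothetical helper: inject fg_text into one iteration of a loaded config
# dict. The nested iter_input -> '<iteration>' layout follows the YAML
# example above; load/save of the file itself is left out.
def add_fg_text(config, iteration, fg_text):
    """Insert a foreground-entity description into one iteration's input."""
    config.setdefault("iter_input", {}).setdefault(str(iteration), {})["fg_text"] = fg_text
    return config

cfg = {"iter_input": {"0": {"scene_text": "The brick wall backdrop..."}}}
add_fg_text(cfg, 0, "On the right, a lovely corgi dog sits on a wooden bench.")
```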
We thank the authors of Wan2.1, SAM3, STream3R, Qwen3-VL, and DINOv3 for their outstanding open-source contributions.
We also acknowledge that the concept of reasoning about the out-of-sight world shares a similar spirit with Out of Sight, Not Out of Mind (Plizzari et al., 3DV 2025) [paper, code], which explored spatial cognition of off-screen objects in egocentric video perception: a very different domain, but a kindred high-level insight.
If you find this work helpful, please consider citing:
@article{duan2026liveworld,
title={LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models},
author={Duan, Zicheng and Xia, Jiatong and Zhang, Zeyu and Zhang, Wenbo and Zhou, Gengze and Gou, Chenhui and He, Yefei and Chen, Feng and Zhang, Xinyu and Liu, Lingqiao},
journal={arXiv preprint arXiv:2603.07145},
year={2026}
}