
🌏 LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models

¹Adelaide University  ²The Australian National University  ³Monash University  ⁴Zhejiang University  ⁵University of Auckland

* Equal contribution  † Project lead  ‡ Corresponding author


teaser

LiveWorld enables persistent out-of-sight dynamics. Instead of freezing unobserved regions, our framework explicitly decouples world evolution from observation rendering. We register stationary Monitors to autonomously fast-forward the temporal progression of active entities in the background. As the observer explores the scene along the target trajectory, our state-aware renderer projects the continuously evolved world states to synthesize the final observation, ensuring that dynamic events progress naturally even when entities are completely out of the observer's view.

📋 TODOs

  • Inference code and pretrained weights (initial version in the paper)
  • Training code
  • Training data preparation pipeline
  • Inference sample preparation pipeline
  • Demo data and examples (inference + training)
  • LiveBench and its inference scripts
  • Model trained on 1.3B backbone
  • Model (14B/1.3B) with better training

🔧 Installation

Tested environment:

  • Ubuntu 22.04, Python 3.11, CUDA 12.8, PyTorch 2.8 (cu128)
  • H100 96GB / H200 140GB, 128GB+ system memory
  • Flash Attention 2.8.3 (Flash Attention 3 is also supported on Hopper GPUs but requires building from source)

1. Create conda environment

conda create -n liveworld python=3.11 -y
conda activate liveworld

2. Clone the repository and install ffmpeg

git clone https://github.com/ZichengDuan/LiveWorld.git
cd LiveWorld

# Install ffmpeg (either way works)
conda install -c conda-forge ffmpeg -y
# or: sudo apt-get update && sudo apt-get install -y ffmpeg

3. Install PyTorch

pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

4. Setup environment (install dependencies + download model weights)

bash setup_env.sh

This script will:

  • Install Python dependencies from setup/requirements.txt
  • Install LiveWorld and local packages (SAM3, Stream3R)
  • Download all pretrained weights (~100GB) into ckpts/
  • Download example data into examples/
📦 Downloaded model weights

| Model | Source | Purpose |
|---|---|---|
| LiveWorld State Adapter + LoRA | ZichengD/LiveWorld | Core LiveWorld weights |
| Wan2.1-T2V-14B | Wan-AI/Wan2.1-T2V-14B | Backbone |
| Wan2.1-Fun-1.3B-InP | alibaba-pai/Wan2.1-Fun-1.3B-InP | VAE (data preparation) |
| Wan2.1-T2V-14B-StepDistill | lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill | Distilled backbone (optional fast inference) |
| Qwen3-VL-8B-Instruct | Qwen/Qwen3-VL-8B-Instruct | Entity detection |
| SAM3 | facebook/sam3 | Video segmentation |
| STream3R | yslan/STream3R | 3D reconstruction |
| DINOv3 | facebook/dinov3-vith16plus-pretrain-lvd1689m | Entity matching |
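If the ~100GB download is interrupted, it can help to check which weight folders actually reached disk before re-running setup_env.sh. A minimal sketch, assuming each model in the table above lands in a same-named subdirectory of ckpts/ (the real layout may differ; adjust EXPECTED accordingly):

```python
# Hedged sketch: report which expected checkpoint folders are missing under
# ckpts/. The folder names below are assumptions drawn from the model table,
# not guaranteed to match setup_env.sh's actual layout.
from pathlib import Path

EXPECTED = [
    "Wan2.1-T2V-14B",
    "Wan2.1-Fun-1.3B-InP",
    "Qwen3-VL-8B-Instruct",
    "sam3",
    "STream3R",
]

def missing_checkpoints(root: str) -> list[str]:
    """Return the expected checkpoint folders not present under `root`."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).is_dir()]

if __name__ == "__main__":
    missing = missing_checkpoints("ckpts")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected checkpoint folders found.")
```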

🚀 Inference

1. Prepare inference samples from source images

Place source images in examples/inference_sample/raw/, then run:

bash create_infer_sample.sh

This generates per-image inference configs under examples/inference_sample/processed/, including:

  • Scene point cloud (via Stream3R)
  • Camera trajectories
  • Entity detection and storyline (via Qwen3-VL)

A pre-built sample (kid_coffee) is already included under examples/inference_sample/processed/, so you can skip this step and go directly to inference.

2. Run inference

Edit infer.sh to set your config path and GPU, then run:

bash infer.sh

Example infer.sh:

export CUDA_VISIBLE_DEVICES=0
python scripts/infer.py \
    --config examples/inference_sample/processed/kid_coffee/infer_scripts/case1_right.yaml \
    --system-config configs/infer_system_config_few_step_14B.yaml \
    --output-root outputs \
    --device cuda:0

Two system configs are provided:

  • configs/infer_system_config_14B.yaml (full-step inference)
  • configs/infer_system_config_few_step_14B.yaml (4-step distilled inference, faster)
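To sweep a case over both system configs, the infer.sh invocation above can also be assembled programmatically. A minimal sketch; build_infer_cmd is a hypothetical helper, and the flags simply mirror the scripts/infer.py call shown in the example infer.sh:

```python
# Hedged sketch: build the scripts/infer.py command line for each system
# config so both variants can be launched (e.g. via subprocess.run) in a loop.
def build_infer_cmd(config, system_config, output_root="outputs", device="cuda:0"):
    """Return the argv list matching the infer.sh example in this README."""
    return [
        "python", "scripts/infer.py",
        "--config", config,
        "--system-config", system_config,
        "--output-root", output_root,
        "--device", device,
    ]

SYSTEM_CONFIGS = [
    "configs/infer_system_config_14B.yaml",           # full-step
    "configs/infer_system_config_few_step_14B.yaml",  # 4-step distilled
]
```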

πŸ‹οΈ Training

1. Prepare training data

Place source videos in examples/training_sample/raw/ (organized by dataset name), then run:

bash create_train_sample.sh

This runs a 4-step pipeline:

  1. Build samples: clip extraction, entity detection (Qwen3-VL), segmentation (SAM3), geometry estimation (Stream3R), sample construction
  2. Captioning: generate text descriptions with Qwen3-VL
  3. VAE encode: encode videos to latent space
  4. Pack LMDB: package into sharded LMDB for training

Example training samples from MIRA, RealEstate10K, and SpatiaVID_HQ are included under examples/training_sample/.
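The pipeline steps above correspond to the per-step scripts under scripts/create_train_data/ in the project tree. A hedged sketch of the sequence as a list of command lines; create_train_sample.sh remains the supported entry point, and step_commands is a hypothetical helper:

```python
# Hedged sketch: the 4-step pipeline as an ordered list of script invocations.
# Script paths are taken from the project structure; any per-step CLI flags
# they require are omitted here.
STEPS = [
    "scripts/create_train_data/step1_build_samples.py",
    "scripts/create_train_data/step2_captioning.py",
    "scripts/create_train_data/step3_vae_encode.py",
    "scripts/create_train_data/step4a_pack_lmdb.py",
    "scripts/create_train_data/step4b_cache_keys.py",
]

def step_commands(python="python"):
    """Return one argv list per step; each step runs before the next begins."""
    return [[python, step] for step in STEPS]
```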

2. Run training

Edit train.sh to set your GPU configuration, then run:

bash train.sh

Training configs:

  • configs/train_liveworld_14B.yaml (14B backbone)
  • configs/train_liveworld_1-3B.yaml (1.3B backbone)

Both train.sh and create_train_sample.sh support multi-node multi-GPU runs: edit the NODES and CUDA_VISIBLE_DEVICES_LIST arrays at the top of each script.
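Conceptually, the NODES and CUDA_VISIBLE_DEVICES_LIST arrays describe a pool of (node, GPU) slots that work is fanned out over. The sketch below illustrates one plausible round-robin assignment; the values and the assign_slots helper are illustrative, not taken from train.sh:

```python
# Hedged sketch: round-robin work items over (node, gpu) slots, mimicking the
# multi-node fan-out the scripts' arrays imply. Node names and GPU lists here
# are made-up examples.
from itertools import cycle

NODES = ["node0", "node1"]
CUDA_VISIBLE_DEVICES_LIST = {"node0": [0, 1], "node1": [0]}

def assign_slots(samples):
    """Pair each sample with a (node, gpu) slot, cycling through the pool."""
    slots = [(n, g) for n in NODES for g in CUDA_VISIBLE_DEVICES_LIST[n]]
    return [(sample, slot) for sample, slot in zip(samples, cycle(slots))]
```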

πŸ“ Project Structure

LiveWorld/
├── infer.sh                        # Inference entry point
├── train.sh                        # Training entry point
├── create_infer_sample.sh          # Inference sample preparation
├── create_train_sample.sh          # Training data preparation
├── setup_env.sh                    # One-click environment setup
├── setup/                          # Installation scripts & requirements
├── configs/
│   ├── infer_system_config_14B.yaml           # Full-step inference config
│   ├── infer_system_config_few_step_14B.yaml  # 4-step distilled inference config
│   ├── train_liveworld_14B.yaml               # 14B training config
│   ├── train_liveworld_1-3B.yaml              # 1.3B training config
│   └── data_preparation.yaml                  # Data preparation config
├── liveworld/                      # Core package
│   ├── trainer.py                  # Task definition + training loop
│   ├── wrapper.py                  # Model wrappers (VAE, text encoder, State Adapter)
│   ├── dataset.py                  # LMDB dataset loader
│   ├── utils.py                    # Utilities
│   ├── geometry_utils.py           # Geometry & projection utilities
│   └── pipelines/
│       ├── pipeline_unified_backbone.py    # Unified Backbone
│       ├── pointcloud_updater.py           # Stream3R point cloud handler
│       └── monitor_centric/                # Monitor-Centric Evolution Pipeline
├── scripts/
│   ├── infer.py                    # Inference script
│   ├── train.py                    # Training script
│   ├── create_infer_sample/        # Inference sample creation
│   │   ├── assemble_event_bench.py     # Main assembly (trajectory + storyline)
│   │   ├── build_scene_pointcloud.py   # Scene 3D reconstruction
│   │   └── plot_trajectories_3d.py     # Trajectory visualization
│   ├── create_train_data/          # Training data processing steps
│   │   ├── step1_build_samples.py      # Clip extraction + geometry + samples
│   │   ├── step2_captioning.py         # Video captioning
│   │   ├── step3_vae_encode.py         # VAE encoding
│   │   ├── step4a_pack_lmdb.py         # LMDB packing
│   │   └── step4b_cache_keys.py        # Key caching
│   └── dataset_preparation/        # Legacy data preparation
├── examples/                       # Sample data
│   ├── inference_sample/           # Inference example
│   │   ├── raw/                        # Source images (input)
│   │   └── processed/                  # Generated configs + point clouds (output)
│   └── training_sample/            # Training example
│       ├── raw/                        # Source videos (input)
│       ├── processed/                  # Extracted samples (intermediate)
│       └── processed_lmdb/             # Packed LMDB (output)
├── misc/
│   ├── sam3/                       # SAM3 (local package)
│   └── STream3R/                   # Stream3R (local package)
└── ckpts/                          # Model weights (downloaded by setup_env.sh)

❓ FAQ

How do I summon foreground entities into the scene?

By default, the system only generates scene_text (background description) and does not automatically produce fg_text. To introduce foreground entities (e.g., a person, animal, or object), manually add a fg_text field to the corresponding iteration in your inference YAML config:

iter_input:
  '0':
    scene_text: The brick wall backdrop remains visible behind the stall...
    fg_text: 'On the right, a lovely corgi dog sits on a wooden bench under the wall, staying steadily on the bench and resting.'

The fg_text describes the foreground entity you want to appear and its behavior. You can edit this in any generated config under examples/inference_sample/processed/<image>/infer_scripts/*.yaml.
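Programmatically, the edit amounts to adding one key to the chosen iteration of the parsed config. A sketch using a plain dict standing in for the loaded YAML (round-trip it with your YAML library of choice); add_fg_text is a hypothetical helper, not part of LiveWorld:

```python
# Hedged sketch: inject a foreground description into iteration '0' of an
# inference config. The structure mirrors the FAQ snippet above; loading and
# saving the YAML file itself is left to the reader's YAML library.
config = {
    "iter_input": {
        "0": {
            "scene_text": "The brick wall backdrop remains visible behind the stall...",
        }
    }
}

def add_fg_text(cfg: dict, iteration: str, fg_text: str) -> dict:
    """Attach a foreground-entity description to one iteration, in place."""
    cfg["iter_input"][iteration]["fg_text"] = fg_text
    return cfg

add_fg_text(config, "0", "On a wooden bench under the wall sits a lovely corgi, resting steadily.")
```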

🙏 Acknowledgements

We thank the authors of Wan2.1, SAM3, STream3R, Qwen3-VL, and DINOv3 for their outstanding open-source contributions.

We also acknowledge that the concept of reasoning about the out-of-sight world shares a similar spirit with Out of Sight, Not Out of Mind (Plizzari et al., 3DV 2025) [paper, code], which explored spatial cognition of off-screen objects in egocentric video perception: a different domain, but a kindred high-level insight.

πŸ“ Citation

If you find this work helpful, please consider citing:

@article{duan2026liveworld,
  title={LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models},
  author={Duan, Zicheng and Xia, Jiatong and Zhang, Zeyu and Zhang, Wenbo and Zhou, Gengze and Gou, Chenhui and He, Yefei and Chen, Feng and Zhang, Xinyu and Liu, Lingqiao},
  journal={arXiv preprint arXiv:2603.07145},
  year={2026}
}

About

Official implementation of the paper LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models.