Generated Reality: Human-centric World Simulation via Interactive Video Generation with Hand and Camera Control

Linxi Xie1,2,*, Lisong C. Sun1,*, Ashley Neall1,3,*, Tong Wu1, Shengqu Cai1, Gordon Wetzstein1

1Stanford University   2NYU Shanghai   3UNC Chapel Hill
*Equal contribution.


Introduction

TL;DR: We condition a video diffusion model on 3D hand poses and camera trajectories tracked from VR equipment, enabling interactive, egocentric world simulation with dexterous hand-object interactions.

[Figure: Generated Reality teaser]

Our method takes a reference image and tracked hand keypoints (21 joints per hand, in camera space) as input and generates egocentric videos grounded in real human motion. A MotionEncoder compresses the per-frame 3D hand pose sequence into a compact embedding injected directly into a Wan2.2 video diffusion transformer. Joint hand–camera conditioning allows the model to disambiguate hand motion from head/camera movement, enabling more coherent interactions.

Installation

This codebase is built on top of DiffSynth-Studio with modifications to support hand motion conditioning and our custom training pipeline.

git clone https://github.com/LafouCC/generated-reality.git
cd generated-reality
pip install -e .
pip install accelerate pandas imageio torchvision

Download the base model weights (Wan2.2-Fun-5B-Control), together with the T5 encoder and VAE checkpoints listed below, and place them under models/:

models/
├── Wan-AI/Wan2.2-Fun-5B-Control/
│   └── diffusion_pytorch_model.safetensors
├── Wan-AI/Wan2.1-T2V-1.3B/
│   └── models_t5_umt5-xxl-enc-bf16.pth
└── Wan-AI/Wan2.2-TI2V-5B/
    └── Wan2.2_VAE.pth
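Before launching training or inference, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name missing_weights is hypothetical, and the paths are copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_WEIGHTS = [
    "Wan-AI/Wan2.2-Fun-5B-Control/diffusion_pytorch_model.safetensors",
    "Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan-AI/Wan2.2-TI2V-5B/Wan2.2_VAE.pth",
]


def missing_weights(model_base):
    """Return the expected weight files that are absent under model_base."""
    base = Path(model_base)
    return [rel for rel in EXPECTED_WEIGHTS if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weights("models")
    if missing:
        print("Missing weight files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected weight files found.")
```

An empty return value means every file in the tree was found under the given base directory.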

Datasets

The curated datasets used for training/evaluation are provided here:

Note: Some videos may be encoded as mp4v, which can appear as green frames in macOS preview. The video content is valid; for playback compatibility, re-encode to H.264.
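One way to do the re-encoding is to call ffmpeg from Python. This is a sketch that assumes ffmpeg with the libx264 encoder is installed and on your PATH; the helper names h264_command and reencode are illustrative and are not part of this repository.

```python
import subprocess


def h264_command(src, dst):
    """Build an ffmpeg command that re-encodes src to H.264 at dst."""
    return [
        "ffmpeg",
        "-i", src,
        "-c:v", "libx264",      # H.264 video encoder
        "-pix_fmt", "yuv420p",  # pixel format most players can decode
        dst,
    ]


def reencode(src, dst):
    """Run the re-encode; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(h264_command(src, dst), check=True)
```

Write the output to a new file rather than overwriting the source, so the original video stays available for training.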

Datasets follow the structure below. Each split has a metadata.csv with the columns video, reference_image, control_video, and motion. Each motion entry is a JSON file with per-frame 3D hand keypoints in the format {frame_idx: {left: [63 floats], right: [63 floats]}}, where 63 = 21 joints × 3 coordinates.

dataset/
├── train/
│   └── metadata.csv
└── valid/
    └── metadata.csv
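When assembling your own dataset, a quick check that metadata.csv has the four expected columns can catch problems before training starts. The sketch below uses only the standard library; load_metadata is a hypothetical helper, not the loader used by the training code.

```python
import csv

# Column names as listed in the dataset description.
REQUIRED_COLUMNS = ("video", "reference_image", "control_video", "motion")


def load_metadata(csv_path):
    """Read metadata.csv and return its rows as a list of dicts."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing:
            raise ValueError(f"metadata.csv is missing columns: {missing}")
        return list(reader)
```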

An example sample is provided under example_test_data/valid/.
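To inspect a motion file, the flat 63-float lists can be split into per-joint triples. The sketch below makes two assumptions that the format description does not state: the floats are laid out joint by joint as (x, y, z), and frame keys parse as integers. The names to_joints and load_motion are illustrative.

```python
import json

NUM_JOINTS = 21  # joints per hand, as stated in the introduction
COORDS = 3       # x, y, z per joint (assumed ordering)


def to_joints(flat):
    """Split a flat list of 63 floats into 21 [x, y, z] triples."""
    if len(flat) != NUM_JOINTS * COORDS:
        raise ValueError(f"expected {NUM_JOINTS * COORDS} floats, got {len(flat)}")
    return [flat[i:i + COORDS] for i in range(0, len(flat), COORDS)]


def load_motion(json_path):
    """Return a list of (frame_idx, left_joints, right_joints), sorted by frame."""
    with open(json_path) as f:
        raw = json.load(f)
    frames = []
    for key in sorted(raw, key=int):  # JSON object keys are strings
        hands = raw[key]
        frames.append((int(key), to_joints(hands["left"]), to_joints(hands["right"])))
    return frames
```

Sorting keys numerically matters because JSON object keys are strings, so a plain sort would place frame "10" before frame "2".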

Training

We provide an example launch script, launch.sh, configured for the Wan2.2-Fun-5B-Control model and the GigaHands dataset. Adapt the model paths, dataset path, and hyperparameters to your setup.

The trainable parameters are a LoRA adapter on all DiT blocks and a lightweight MotionEncoder. Both are saved together in a single .safetensors checkpoint.

Note for the Wan2.2 14B model family: The 14B architecture splits the denoising schedule across a high-noise DiT and a low-noise DiT. When fine-tuning the low-noise DiT, pass --high_noise_checkpoint_path pointing to your trained high-noise checkpoint. The MotionEncoder is then warm-started from the high-noise stage instead of being initialized from scratch, so its training continues across the two stages.

Inference

python inference.py \
  --checkpoint_path path/to/checkpoint.safetensors \
  --metadata_path path/to/valid/metadata.csv \
  --model_base path/to/models \
  --output_dir outputs/

By default, inference runs on all available GPUs in parallel. Use --num_gpus N to limit the number of GPUs used, or --start_gpu K to offset the GPU index at which assignment starts.
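To illustrate how a fixed number of GPUs and a starting offset could divide the work, here is a hypothetical round-robin assignment of metadata rows to devices. This is only a sketch of the general idea: the function shard_rows is not taken from inference.py, which may split samples differently.

```python
def shard_rows(num_rows, num_gpus, start_gpu=0):
    """Assign row indices to GPU ids round-robin.

    Returns a dict mapping GPU id -> list of row indices.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    shards = {start_gpu + g: [] for g in range(num_gpus)}
    for row in range(num_rows):
        shards[start_gpu + row % num_gpus].append(row)
    return shards
```

Under this scheme, shard_rows(5, 2, start_gpu=2) returns {2: [0, 2, 4], 3: [1, 3]}: five samples are split across GPUs 2 and 3.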

Citation

If you find this work helpful, please cite:

@article{xie2025generatedreality,
  title     = {Generated Reality: Human-centric World Simulation via Interactive Video Generation with Hand and Camera Control},
  author    = {Xie, Linxi and Sun, Lisong C. and Neall, Ashley and Wu, Tong and Cai, Shengqu and Wetzstein, Gordon},
  journal   = {arXiv preprint arXiv:2602.18422},
  year      = {2026}
}

About

Official codebase for "Generated Reality: Human-centric World Simulation via Interactive Video Generation with Hand and Camera Control" (CVPR 2026).
