Generated Reality: Human-centric World Simulation via Interactive Video Generation with Hand and Camera Control

[arXiv] [Project Page]

Linxi Xie¹,²,*, Lisong C. Sun¹,*, Ashley Neall¹,³,*, Tong Wu¹, Shengqu Cai¹, Gordon Wetzstein¹

¹Stanford University ²NYU Shanghai ³UNC Chapel Hill
*Equal contribution.

Introduction

TL;DR: We condition a video diffusion model on 3D hand poses and camera trajectories tracked from VR equipment, enabling interactive, egocentric world simulation with dexterous hand-object interactions.

Our method takes a reference image and tracked hand keypoints (21 joints per hand, in camera space) as input and generates egocentric videos grounded in real human motion. A MotionEncoder compresses the per-frame 3D hand pose sequence into a compact embedding injected directly into a Wan2.2 video diffusion transformer. Joint hand–camera conditioning allows the model to disambiguate hand motion from head/camera movement, enabling more coherent interactions.
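The ambiguity that joint conditioning resolves can be illustrated with a small geometric sketch. It assumes a standard rigid camera model with a camera-to-world rotation R and camera center c, so that p_cam = R^T (p_world - c). The function name and the numbers are hypothetical and not taken from the codebase.

```python
def world_to_camera(point, rotation, center):
    """Map a world-space point into camera space: p_cam = R^T (p_world - c).

    `rotation` is a 3x3 camera-to-world matrix given as nested lists
    (an assumed convention for this illustration).
    """
    d = [p - c for p, c in zip(point, center)]
    # (R^T d)_j = sum_i R[i][j] * d[i]
    return [sum(rotation[i][j] * d[i] for i in range(3)) for j in range(3)]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Case A: the hand moves 0.1 along +x while the camera stays at the origin.
hand_moved = world_to_camera([0.1, 0.0, 0.5], IDENTITY, [0.0, 0.0, 0.0])

# Case B: the hand stays put while the camera moves 0.1 along -x.
camera_moved = world_to_camera([0.0, 0.0, 0.5], IDENTITY, [-0.1, 0.0, 0.0])

# Both cases give the same camera-space keypoint, so the hand pose alone
# cannot tell them apart; the camera trajectory supplies the missing signal.
print(hand_moved == camera_moved)
```

In both cases the camera-space keypoint is identical, which is why a model that sees only camera-space hand poses cannot tell whether the hand or the head moved.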

Installation

This codebase is built on top of DiffSynth-Studio with modifications to support hand motion conditioning and our custom training pipeline.

git clone https://github.com/LafouCC/generated-reality.git
cd generated-reality
pip install -e .
pip install accelerate pandas imageio torchvision

Download the base model weights (Wan2.2-Fun-5B-Control), together with the text encoder and VAE files shown below, and place them under models/:

models/
├── Wan-AI/Wan2.2-Fun-5B-Control/
│   └── diffusion_pytorch_model.safetensors
├── Wan-AI/Wan2.1-T2V-1.3B/
│   └── models_t5_umt5-xxl-enc-bf16.pth
└── Wan-AI/Wan2.2-TI2V-5B/
    └── Wan2.2_VAE.pth
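Before training or inference, it can help to confirm that the weights are where the tree above expects them. The following is a minimal sketch using only the standard library. The helper name missing_weights is hypothetical, and the file list is copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_WEIGHTS = [
    "Wan-AI/Wan2.2-Fun-5B-Control/diffusion_pytorch_model.safetensors",
    "Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan-AI/Wan2.2-TI2V-5B/Wan2.2_VAE.pth",
]


def missing_weights(model_base="models"):
    """Return the expected weight files that are not present under model_base."""
    base = Path(model_base)
    return [rel for rel in EXPECTED_WEIGHTS if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing weight files:")
        for rel in missing:
            print("  models/" + rel)
    else:
        print("All expected weight files found.")
```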

Datasets

The curated datasets used for training/evaluation are provided here:

Note: Some videos may be encoded as mp4v, which can appear as green frames in macOS preview. The video content is valid; for playback compatibility, re-encode to H.264.
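One way to do the re-encode is with ffmpeg, which is not part of this repository and is assumed to be installed separately. The sketch below builds a standard ffmpeg command using the libx264 encoder and the yuv420p pixel format, which is widely supported by players. The helper names are hypothetical.

```python
import shutil
import subprocess


def h264_command(src, dst):
    """Build an ffmpeg command that re-encodes src to H.264 at dst."""
    return [
        "ffmpeg",
        "-i", src,
        "-c:v", "libx264",      # H.264 video encoder
        "-pix_fmt", "yuv420p",  # pixel format most players can decode
        dst,
    ]


def reencode(src, dst):
    """Run the re-encode, failing early if ffmpeg is not on PATH."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(h264_command(src, dst), check=True)
```

For example, reencode("clip.mp4", "clip_h264.mp4") would write a new file and leave the original untouched.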

Datasets follow the structure below, with a metadata.csv containing the columns video, reference_image, control_video, and motion. Each motion entry is a JSON file with per-frame 3D hand keypoints in the format {frame_idx: {left: [63 floats], right: [63 floats]}}, where 63 = 21 joints × 3 coordinates per hand.

dataset/
├── train/
│   └── metadata.csv
└── valid/
    └── metadata.csv
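A quick check that a metadata.csv has the four columns listed above can catch formatting problems before a long run. This is a minimal sketch using the standard csv module. The function name read_metadata is hypothetical and is not part of the repository.

```python
import csv

# Column names as listed in the dataset description above.
REQUIRED_COLUMNS = ("video", "reference_image", "control_video", "motion")


def read_metadata(csv_path):
    """Read metadata.csv and return its rows as dicts.

    Raises ValueError if any required column is missing from the header.
    """
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing:
            raise ValueError(f"{csv_path} is missing columns: {missing}")
        return list(reader)
```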

An example sample is provided under example_test_data/valid/.
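To inspect a motion file, the 63 floats per hand can be regrouped into 21 joints. The sketch below assumes that the floats are stored joint by joint as consecutive (x, y, z) triples and that frame indices appear as string keys, as JSON requires. Neither the ordering nor the helper names are confirmed by the repository.

```python
import json

JOINTS_PER_HAND = 21
FLOATS_PER_HAND = JOINTS_PER_HAND * 3  # 63 floats per hand


def split_joints(flat):
    """Regroup 63 floats into 21 (x, y, z) tuples (assumed joint-major order)."""
    if len(flat) != FLOATS_PER_HAND:
        raise ValueError(f"expected {FLOATS_PER_HAND} floats, got {len(flat)}")
    return [tuple(flat[i:i + 3]) for i in range(0, FLOATS_PER_HAND, 3)]


def parse_motion(raw):
    """Convert {frame_idx: {left: [...], right: [...]}} into per-frame joints.

    Returns a dict keyed by integer frame index, in ascending order.
    """
    frames = {}
    for frame_idx, hands in raw.items():
        frames[int(frame_idx)] = {
            side: split_joints(hands[side]) for side in ("left", "right")
        }
    return dict(sorted(frames.items()))


def load_motion(path):
    """Load and parse a motion JSON file from disk."""
    with open(path) as f:
        return parse_motion(json.load(f))
```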

Training

We provide an example launch script at launch.sh. The script is configured for the Wan2.2-Fun-5B-Control model and the GigaHands dataset. Adapt the model paths, dataset path, and hyperparameters to your setup.

The trainable parameters are a LoRA adapter on all DiT blocks and a lightweight MotionEncoder. Both are saved together in a single .safetensors checkpoint.

Note for the Wan2.2 14B model family: The 14B architecture splits the denoising schedule between a high-noise DiT and a low-noise DiT. When fine-tuning the low-noise DiT, pass --high_noise_checkpoint_path pointing to your trained high-noise checkpoint. The MotionEncoder is then warm-started from the high-noise stage instead of being initialized from scratch, so its training continues across the two stages.
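Conceptually, warm-starting means copying only the MotionEncoder weights from the high-noise checkpoint into the low-noise stage's initial state. The sketch below illustrates the idea with plain dictionaries standing in for state dicts. The key prefix "motion_encoder." and both function names are assumptions for illustration; the actual key names and loading logic in train.py may differ.

```python
def motion_encoder_weights(state_dict, prefix="motion_encoder."):
    """Select entries whose key starts with the (assumed) MotionEncoder prefix."""
    return {k: v for k, v in state_dict.items() if k.startswith(prefix)}


def warm_start(low_noise_init, high_noise_checkpoint, prefix="motion_encoder."):
    """Return a copy of low_noise_init with MotionEncoder entries overwritten.

    Entries outside the prefix (for example LoRA weights) are left as they
    were in low_noise_init.
    """
    merged = dict(low_noise_init)
    merged.update(motion_encoder_weights(high_noise_checkpoint, prefix))
    return merged
```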

Inference

python inference.py \
  --checkpoint_path path/to/checkpoint.safetensors \
  --metadata_path path/to/valid/metadata.csv \
  --model_base path/to/models \
  --output_dir outputs/

By default, inference runs on all available GPUs in parallel. Use --num_gpus N to limit the number of GPUs used, or --start_gpu K to offset GPU indexing.
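As a rough mental model of how the two flags interact, the sketch below assigns samples to GPUs K through K+N-1 in round-robin order. This scheduling is an assumption made for illustration; inference.py may distribute work differently, and the function name is hypothetical.

```python
def assign_samples(num_samples, num_gpus, start_gpu=0):
    """Assign sample indices to GPU ids round-robin (illustrative only).

    GPU ids run from start_gpu to start_gpu + num_gpus - 1.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    shards = {start_gpu + g: [] for g in range(num_gpus)}
    for i in range(num_samples):
        shards[start_gpu + i % num_gpus].append(i)
    return shards
```

Under this assumption, five samples with --num_gpus 2 --start_gpu 2 would place samples 0, 2, and 4 on GPU 2 and samples 1 and 3 on GPU 3.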

Citation

If you find this work helpful, please cite:

@article{xie2025generatedreality,
  title     = {Generated Reality: Human-centric World Simulation via Interactive Video Generation with Hand and Camera Control},
  author    = {Xie, Linxi and Sun, Lisong C. and Neall, Ashley and Wu, Tong and Cai, Shengqu and Wetzstein, Gordon},
  journal   = {arXiv preprint arXiv:2602.18422},
  year      = {2025}
}
