We propose a 4D video generation model that enforces geometric consistency across multiple camera views to predict spatio-temporally aligned RGB-D videos from a single RGB-D image per view. We further demonstrate applications to robot manipulation by extracting gripper poses from generated videos using an off-the-shelf pose tracking algorithm. We show that the model generalizes to novel viewpoints and enables robots to leverage multi-view information for planning.
| 📄 Paper | 🌐 Project Page | 📦 Dataset | 🤗 Hugging Face |
|---|---|---|---|
| arXiv | Website | Stanford Mirror | Dataset · Checkpoints |
Zeyi Liu¹ · Shuang Li¹ · Eric Cousineau² · Siyuan Feng² · Benjamin Burchfiel² · Shuran Song¹
¹ Stanford University
² Toyota Research Institute
Robotic manipulation requires understanding how 3D geometry evolves over time under agent actions. However, most video generation models are trained with single-view RGB videos, limiting their ability to reason about geometry and cross-view consistency.
This project introduces a geometry-aware 4D video generation pipeline that:
- Models multi-view RGB-D observations across time
- Enforces cross-view geometric consistency via pointmaps
- Learns temporally coherent latent dynamics suitable for manipulation
The resulting models serve as strong foundations for world modeling, policy learning, and planning in robotics.
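To make the pointmap idea concrete, the sketch below unprojects a single view's depth map into a world-frame pointmap using camera intrinsics and extrinsics. This is a minimal illustration of the standard unprojection step (function and variable names are ours), not the exact implementation used in this repository.

```python
import numpy as np

def depth_to_world_pointmap(depth, K, cam_to_world):
    """Unproject an (H, W) depth map into an (H, W, 3) world-frame pointmap.

    depth:        (H, W) metric depth
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera-to-world extrinsic transform
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)

    # Back-project pixels into camera coordinates: X_cam = depth * K^-1 [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth[..., None]

    # Move to the world frame so pointmaps from different views share one frame
    pts_h = np.concatenate([pts_cam, np.ones_like(depth)[..., None]], axis=-1)
    pts_world = pts_h @ cam_to_world.T
    return pts_world[..., :3]
```

Because pointmaps from all camera views live in a single world frame, cross-view geometric consistency can be measured and enforced directly on them.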
We release a multi-view, multi-task robotic manipulation dataset collected in both simulation and the real world.
Simulation tasks (LBM):
- StoreCerealBoxUnderShelf
- PutSpatulaOnTableFromUtensilCrock
- PlaceAppleFromBowlIntoBin
Real-world robot manipulation tasks:
- BimanualAddOrangeSlicesToBowl
- BimanualChopCucumber
- BimanualCupOnSaucer
- BimanualTwistCapOffBottle
- Simulation: 50 demonstrations per task
- Real world: 10 demonstrations per task
- 16 RGB-D camera views per timestep, sampled from the upper hemisphere
- Synchronized robot actions and observations
- Simulation data collected in the Large Behavior Model (LBM) environment
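The released files are the authoritative reference for the on-disk format; purely as an illustration of the statistics above, a single timestep could be represented by a structure like the following (all field names and shapes are our assumptions, not the released schema):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Timestep:
    """Hypothetical container for one timestep of the multi-view data."""
    rgb: np.ndarray         # (16, H, W, 3) uint8, one RGB image per camera view
    depth: np.ndarray       # (16, H, W) float32, metric depth per view
    intrinsics: np.ndarray  # (16, 3, 3) per-view camera intrinsics
    extrinsics: np.ndarray  # (16, 4, 4) camera-to-world transforms
    action: np.ndarray      # (A,) synchronized robot action for this timestep
```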
📥 Download links:
- Dataset: https://real.stanford.edu/4dgen/data/
- Hugging Face mirror: https://huggingface.co/datasets/Zeyi/4dgen-dataset
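To pull the Hugging Face mirror programmatically, a `huggingface_hub` call along these lines should work (the local directory is an arbitrary choice; the repo id comes from the link above):

```python
from huggingface_hub import snapshot_download

# Download the full dataset mirror into a local folder.
local_path = snapshot_download(
    repo_id="Zeyi/4dgen-dataset",
    repo_type="dataset",
    local_dir="data/4dgen",
)
print("Dataset downloaded to:", local_path)
```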
We provide multiple checkpoints to support different stages of the pipeline:
- Stable Video Diffusion (SVD) backbones
- Task-specific VAEs for RGB and pointmap latents
- 4D video generation models fine-tuned on manipulation data
📦 Checkpoints:
- SVD / base models: https://real.stanford.edu/4dgen/checkpoints/
- Fine-tuned VAEs: https://real.stanford.edu/4dgen/checkpoints/VAE/
- 4D generation outputs: https://real.stanford.edu/4dgen/checkpoints/outputs/
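As one example of using the released weights: assuming the fine-tuned VAEs are saved in the standard `diffusers` AutoencoderKL layout (an assumption on our part; the checkpoint files are authoritative), they could be loaded and sanity-checked like this. The local path is a placeholder for wherever you downloaded the checkpoint.

```python
import torch
from diffusers import AutoencoderKL

# Assumption: the fine-tuned VAE follows the standard diffusers AutoencoderKL
# format. Replace the path with your downloaded checkpoint directory.
vae = AutoencoderKL.from_pretrained("checkpoints/VAE/rgb_vae")  # hypothetical path
vae.eval()

with torch.no_grad():
    dummy = torch.randn(1, 3, 256, 256)               # placeholder RGB frame
    latents = vae.encode(dummy).latent_dist.sample()  # encode to latent space
    recon = vae.decode(latents).sample                # decode back to pixels
print(latents.shape, recon.shape)
```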
We recommend using conda or mamba.
```bash
cd 4dgen
conda env create -f environment.yml
conda activate video_policy
conda install pytorch3d
```

Tested on:
- Ubuntu 22.04
- CUDA 12.2
Fine-tune the autoencoder:

```bash
CUDA_VISIBLE_DEVICES=<GPU_IDS> \
HYDRA_FULL_ERROR=1 \
python scripts/train.py --config-name=finetune_autoencoder_workspace
```

Fine-tune the SVD-based 4D video generation model:

```bash
CUDA_VISIBLE_DEVICES=<GPU_IDS> \
HYDRA_FULL_ERROR=1 \
python scripts/train.py --config-name=finetune_svd_lightning_workspace
```

Notes:
- Tested on 4× NVIDIA A6000 (48GB)
- Batch size: 1
- Training time: ~2 days
Run the provided evaluation example:
```bash
python notebooks/eval.py
```

This script demonstrates loading a trained checkpoint and generating multi-view 4D predictions.
We show representative qualitative results illustrating multi-view RGB-D video generation.
If you find this project useful, please consider citing:
```bibtex
@article{liu2025geometry,
  title={Geometry-aware 4D Video Generation for Robot Manipulation},
  author={Liu, Zeyi and Li, Shuang and Cousineau, Eric and Feng, Siyuan and Burchfiel, Benjamin and Song, Shuran},
  journal={arXiv preprint arXiv:2507.01099},
  year={2025}
}
```

This project is released for research use. Please see the repository for license details.
💬 Questions or issues? Feel free to open a GitHub issue or reach out via the project page.