
🧠 Geometry-aware 4D Video Generation for Robot Manipulation

4DGen teaser

We propose a 4D video generation model that enforces geometric consistency across multiple camera views to predict spatio-temporally aligned RGB-D videos from a single RGB-D image per view. We further demonstrate applications to robot manipulation by extracting gripper poses from generated videos using an off-the-shelf pose tracking algorithm. We show that the model generalizes to novel viewpoints and enables robots to leverage multi-view information for planning.

4DGen real video


🔗 Project Links

📄 Paper (arXiv) · 🌐 Project Page · 📦 Dataset (Stanford Mirror) · 🤗 Hugging Face (Dataset · Checkpoints)

👥 Authors

Zeyi Liu¹ · Shuang Li¹ · Eric Cousineau² · Siyuan Feng² · Benjamin Burchfiel² · Shuran Song¹

¹ Stanford University
² Toyota Research Institute


🧩 Overview

Robotic manipulation requires understanding how 3D geometry evolves over time under agent actions. However, most video generation models are trained with single-view RGB videos, limiting their ability to reason about geometry and cross-view consistency.

This project introduces a geometry-aware 4D video generation pipeline that:

  • Models multi-view RGB-D observations across time
  • Enforces cross-view geometric consistency via pointmaps
  • Learns temporally coherent latent dynamics suitable for manipulation

The resulting models serve as strong foundations for world modeling, policy learning, and planning in robotics.
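The cross-view consistency above hinges on pointmaps: per-pixel 3D coordinates obtained by back-projecting each view's depth map into a shared world frame. As a minimal sketch (not the repository's implementation), assuming a standard pinhole camera with intrinsics `K` and a camera-to-world extrinsic, the back-projection can be written as:

```python
import numpy as np

def depth_to_pointmap(depth, K, cam_to_world):
    """Back-project a metric depth map into a per-pixel map of 3D world points.

    depth:        (H, W) metric depth
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world extrinsic
    Returns an (H, W, 3) pointmap in the shared world frame.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates scaled by depth: z * [u, v, 1].
    pix = np.stack([u, v, np.ones_like(u)], axis=-1) * depth[..., None]
    # Lift to the camera frame via the inverse intrinsics.
    cam_pts = pix @ np.linalg.inv(K).T
    # Transform into the world frame with the extrinsic rotation and translation.
    return cam_pts @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
```

Because every view's pointmap lives in the same world frame, geometric agreement between views can be checked (or enforced) by directly comparing the 3D points that different cameras assign to the same scene surface.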


📦 Dataset

We release a multi-view, multi-task robotic manipulation dataset collected in simulation.

Tasks

Simulation tasks (LBM):

  • StoreCerealBoxUnderShelf
  • PutSpatulaOnTableFromUtensilCrock
  • PlaceAppleFromBowlIntoBin

Real-world robot manipulation tasks:

  • BimanualAddOrangeSlicesToBowl
  • BimanualChopCucumber
  • BimanualCupOnSaucer
  • BimanualTwistCapOffBottle

Key Properties

  • Simulation: 50 demonstrations per task
  • Real world: 10 demonstrations per task
  • 16 RGB-D camera views per timestep, sampled from the upper hemisphere
  • Synchronized robot actions and observations
  • Simulation data collected in the Large Behavior Model (LBM) environment
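A loader for per-view RGB-D frames might look like the sketch below. The directory layout and file names here are purely hypothetical, for illustration only; consult the dataset download page for the actual format.

```python
import numpy as np
from pathlib import Path

# Hypothetical layout (an assumption, not the released format):
#   <root>/<task>/<episode>/view_<VV>/rgb_<tttttt>.npy
#   <root>/<task>/<episode>/view_<VV>/depth_<tttttt>.npy
def load_frame(root, task, episode, view, t):
    """Load one synchronized RGB-D frame for a single camera view."""
    base = Path(root) / task / episode / f"view_{view:02d}"
    rgb = np.load(base / f"rgb_{t:06d}.npy")      # (H, W, 3) uint8
    depth = np.load(base / f"depth_{t:06d}.npy")  # (H, W) float32, meters
    return rgb, depth
```

With 16 views per timestep, a full multi-view observation is then just `[load_frame(root, task, ep, v, t) for v in range(16)]`.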

📥 Download links:


🧠 Pre-trained Models

We provide multiple checkpoints to support different stages of the pipeline:

  • Stable Video Diffusion (SVD) backbones
  • Task-specific VAEs for RGB and pointmap latents
  • 4D video generation models fine-tuned on manipulation data

📦 Checkpoints:


⚙️ Installation

We recommend using conda or mamba.

cd 4dgen
conda env create -f environment.yml
conda activate video_policy
conda install pytorch3d

Tested on:

  • Ubuntu 22.04
  • CUDA 12.2

🔧 Training

1️⃣ Fine-tune the VAE

CUDA_VISIBLE_DEVICES=<GPU_IDS> \
HYDRA_FULL_ERROR=1 \
python scripts/train.py --config-name=finetune_autoencoder_workspace

2️⃣ Train the 4D Video Generation Model

CUDA_VISIBLE_DEVICES=<GPU_IDS> \
HYDRA_FULL_ERROR=1 \
python scripts/train.py --config-name=finetune_svd_lightning_workspace

Notes:

  • Tested on 4× NVIDIA A6000 (48GB)
  • Batch size: 1
  • Training time: ~2 days

🔍 Inference

Run the provided evaluation example:

python notebooks/eval.py

This script demonstrates loading a trained checkpoint and generating multi-view 4D predictions.
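Since the model predicts depth for all 16 views, the generated frames at each timestep can be fused into a single world-frame point cloud for downstream planning. The sketch below shows one way to do that fusion under standard pinhole assumptions; it is illustrative and not the repository's own pipeline.

```python
import numpy as np

def fuse_views(depths, Ks, cam_to_worlds):
    """Merge per-view depth maps into one world-frame point cloud.

    depths:        list of (H, W) metric depth maps, one per camera
    Ks:            list of (3, 3) intrinsics
    cam_to_worlds: list of (4, 4) camera-to-world extrinsics
    Returns an (N, 3) array of fused world-frame points.
    """
    clouds = []
    for depth, K, T in zip(depths, Ks, cam_to_worlds):
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        # Back-project each pixel: z * [u, v, 1] lifted through K^-1.
        pix = np.stack([u, v, np.ones_like(u)], axis=-1) * depth[..., None]
        cam_pts = pix.reshape(-1, 3) @ np.linalg.inv(K).T
        # Move into the shared world frame and collect.
        clouds.append(cam_pts @ T[:3, :3].T + T[:3, 3])
    return np.concatenate(clouds, axis=0)
```

The fused cloud gives a planner a single geometric representation of the predicted future scene, which is the multi-view advantage the paper highlights.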

🎥 Qualitative Results

We show representative qualitative results illustrating multi-view RGB-D video generation.

Generated RGB-D Videos

Task 1

Task 1 RGB Task 1 depth

Task 2

Task 2 RGB Task 2 depth

Task 3

Task 3 RGB Task 3 depth

📚 Citation

If you find this project useful, please consider citing:

@article{liu2025geometry,
  title={Geometry-aware 4D Video Generation for Robot Manipulation},
  author={Liu, Zeyi and Li, Shuang and Cousineau, Eric and Feng, Siyuan and Burchfiel, Benjamin and Song, Shuran},
  journal={arXiv preprint arXiv:2507.01099},
  year={2025}
}

📄 License

This project is released for research use. Please see the repository for license details.


💬 Questions or issues? Feel free to open a GitHub issue or reach out via the project page.
