
🧠 Geometry-aware 4D Video Generation for Robot Manipulation

4DGen teaser

We propose a 4D video generation model that enforces geometric consistency across multiple camera views to predict spatio-temporally aligned RGB-D videos from a single RGB-D image per view. We further demonstrate applications to robot manipulation by extracting gripper poses from generated videos using an off-the-shelf pose tracking algorithm. We show that the model generalizes to novel viewpoints and enables robots to leverage multi-view information for planning.

4DGen real video


🔗 Project Links

📄 Paper (arXiv) · 🌐 Project Page · 📦 Dataset (Stanford Mirror) · 🤗 Hugging Face (Dataset · Checkpoints)

👥 Authors

Zeyi Liu¹ · Shuang Li¹ · Eric Cousineau² · Siyuan Feng² · Benjamin Burchfiel² · Shuran Song¹

¹ Stanford University
² Toyota Research Institute


🧩 Overview

Robotic manipulation requires understanding how 3D geometry evolves over time under agent actions. However, most video generation models are trained with single-view RGB videos, limiting their ability to reason about geometry and cross-view consistency.

This project introduces a geometry-aware 4D video generation pipeline that:

  • Models multi-view RGB-D observations across time
  • Enforces cross-view geometric consistency via pointmaps
  • Learns temporally coherent latent dynamics suitable for manipulation

The resulting models serve as strong foundations for world modeling, policy learning, and planning in robotics.
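The cross-view consistency above hinges on pointmaps: per-pixel 3D coordinates obtained by back-projecting each view's depth map into a shared world frame. As a minimal sketch (not the repository's implementation), assuming a standard pinhole camera with intrinsics `K` and a camera-to-world extrinsic, the back-projection can be written as:

```python
import numpy as np

def depth_to_pointmap(depth, K, cam_to_world):
    """Back-project a metric depth map into a per-pixel map of 3D world points.

    depth:        (H, W) metric depth
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world extrinsic
    Returns an (H, W, 3) pointmap in the shared world frame.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates scaled by depth: z * [u, v, 1].
    pix = np.stack([u, v, np.ones_like(u)], axis=-1) * depth[..., None]
    # Lift to the camera frame via the inverse intrinsics.
    cam_pts = pix @ np.linalg.inv(K).T
    # Transform into the world frame with the extrinsic rotation and translation.
    return cam_pts @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
```

Because every view's pointmap lives in the same world frame, geometric agreement between views can be checked (or enforced) by directly comparing the 3D points that different cameras assign to the same scene surface.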


📦 Dataset

We release a multi-view, multi-task robotic manipulation dataset collected in simulation.

Tasks

Simulation tasks (LBM):

  • StoreCerealBoxUnderShelf
  • PutSpatulaOnTableFromUtensilCrock
  • PlaceAppleFromBowlIntoBin

Real-world robot manipulation tasks:

  • BimanualAddOrangeSlicesToBowl
  • BimanualChopCucumber
  • BimanualCupOnSaucer
  • BimanualTwistCapOffBottle

Key Properties

  • Simulation: 50 demonstrations per task
  • Real world: 10 demonstrations per task
  • 16 RGB-D camera views per timestep, sampled from the upper hemisphere
  • Synchronized robot actions and observations
  • Simulation data collected in the Large Behavior Model (LBM) environment
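A loader for per-view RGB-D frames might look like the sketch below. The directory layout and file names here are purely hypothetical, for illustration only; consult the dataset download page for the actual format.

```python
import numpy as np
from pathlib import Path

# Hypothetical layout (an assumption, not the released format):
#   <root>/<task>/<episode>/view_<VV>/rgb_<tttttt>.npy
#   <root>/<task>/<episode>/view_<VV>/depth_<tttttt>.npy
def load_frame(root, task, episode, view, t):
    """Load one synchronized RGB-D frame for a single camera view."""
    base = Path(root) / task / episode / f"view_{view:02d}"
    rgb = np.load(base / f"rgb_{t:06d}.npy")      # (H, W, 3) uint8
    depth = np.load(base / f"depth_{t:06d}.npy")  # (H, W) float32, meters
    return rgb, depth
```

With 16 views per timestep, a full multi-view observation is then just `[load_frame(root, task, ep, v, t) for v in range(16)]`.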

📥 Download links:


🧠 Pre-trained Models

We provide multiple checkpoints to support different stages of the pipeline:

  • Stable Video Diffusion (SVD) backbones
  • Task-specific VAEs for RGB and pointmap latents
  • 4D video generation models fine-tuned on manipulation data

📦 Checkpoints:


⚙️ Installation

We recommend using conda or mamba.

cd 4dgen
conda env create -f environment.yml
conda activate video_policy
conda install pytorch3d

Tested on:

  • Ubuntu 22.04
  • CUDA 12.2

🔧 Training

1️⃣ Fine-tune the VAE

CUDA_VISIBLE_DEVICES=<GPU_IDS> \
HYDRA_FULL_ERROR=1 \
python scripts/train.py --config-name=finetune_autoencoder_workspace

2️⃣ Train the 4D Video Generation Model

CUDA_VISIBLE_DEVICES=<GPU_IDS> \
HYDRA_FULL_ERROR=1 \
python scripts/train.py --config-name=finetune_svd_lightning_workspace

Notes:

  • Tested on 4× NVIDIA A6000 (48GB)
  • Batch size: 1
  • Training time: ~2 days

🔍 Inference

Run the provided evaluation example:

python notebooks/eval.py

This script demonstrates loading a trained checkpoint and generating multi-view 4D predictions.
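Since the model predicts depth for all 16 views, the generated frames at each timestep can be fused into a single world-frame point cloud for downstream planning. The sketch below shows one way to do that fusion under standard pinhole assumptions; it is illustrative and not the repository's own pipeline.

```python
import numpy as np

def fuse_views(depths, Ks, cam_to_worlds):
    """Merge per-view depth maps into one world-frame point cloud.

    depths:        list of (H, W) metric depth maps, one per camera
    Ks:            list of (3, 3) intrinsics
    cam_to_worlds: list of (4, 4) camera-to-world extrinsics
    Returns an (N, 3) array of fused world-frame points.
    """
    clouds = []
    for depth, K, T in zip(depths, Ks, cam_to_worlds):
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        # Back-project each pixel: z * [u, v, 1] lifted through K^-1.
        pix = np.stack([u, v, np.ones_like(u)], axis=-1) * depth[..., None]
        cam_pts = pix.reshape(-1, 3) @ np.linalg.inv(K).T
        # Move into the shared world frame and collect.
        clouds.append(cam_pts @ T[:3, :3].T + T[:3, 3])
    return np.concatenate(clouds, axis=0)
```

The fused cloud gives a planner a single geometric representation of the predicted future scene, which is the multi-view advantage the paper highlights.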

🎥 Qualitative Results

We show representative qualitative results illustrating multi-view RGB-D video generation.

Generated RGB-D Videos

Task 1

Task 1 RGB Task 1 depth

Task 2

Task 2 RGB Task 2 depth

Task 3

Task 3 RGB Task 3 depth

📚 Citation

If you find this project useful, please consider citing:

@article{liu2025geometry,
  title={Geometry-aware 4D Video Generation for Robot Manipulation},
  author={Liu, Zeyi and Li, Shuang and Cousineau, Eric and Feng, Siyuan and Burchfiel, Benjamin and Song, Shuran},
  journal={arXiv preprint arXiv:2507.01099},
  year={2025}
}

📄 License

This project is released for research use. Please see the repository for license details.


💬 Questions or issues? Feel free to open a GitHub issue or reach out via the project page.
