FVP: 4D Visual Pre-training for Robot Learning

[Website] [arXiv] [ICCV 2025]

FVP is a novel 3D point cloud representation learning pipeline for robotic manipulation. Unlike prior work based on contrastive learning or masked signal modeling, FVP trains 3D visual representations by conditioning a diffusion model on the point cloud of the preceding frame and having it predict the point cloud of the current frame.
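
As a rough illustration of this idea, the sketch below pairs each frame with its predecessor and applies a standard DDPM forward-noising step to the current frame. The function names (`make_frame_pairs`, `forward_noise`), the linear beta schedule, the step count, and the choice to noise all six channels are assumptions made for illustration, not settings taken from this repository; see `train_gpu.py` for the actual training code.

```python
import numpy as np


def make_frame_pairs(point_clouds):
    """Pair each frame with its predecessor.

    point_clouds: array of shape (T, Np, C).
    Returns (previous, current), each of shape (T - 1, Np, C).
    """
    if point_clouds.ndim != 3 or point_clouds.shape[0] < 2:
        raise ValueError("expected shape (T, Np, C) with T >= 2")
    return point_clouds[:-1], point_clouds[1:]


def forward_noise(x0, step, alphas_cumprod, rng):
    """Standard DDPM forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a_bar = alphas_cumprod[step]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps


# Hypothetical linear beta schedule with 100 steps (an assumption, not the repo's setting).
betas = np.linspace(1e-4, 2e-2, 100)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
sequence = rng.random((5, 128, 6))  # T=5 frames, Np=128 points, [x, y, z, r, g, b]
previous, current = make_frame_pairs(sequence)

# In a typical noise-prediction setup, a denoiser would receive
# (noisy_current, step, previous) and be trained to recover eps.
noisy_current, eps = forward_noise(current, step=50, alphas_cumprod=alphas_cumprod, rng=rng)
```

Here `previous[i]` is the conditioning input for predicting `current[i]`, so a sequence of T frames yields T - 1 training pairs.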

This is a PyTorch implementation of the paper FVP: 4D Visual Pre-training for Robot Learning:

@article{cheng2025fvp,
    author    = {Chengkai Hou and Yanjie Ze and Yankai Fu and Zeyu Gao and Yue Yu and Songbo Hu and Shanghang Zhang and Huazhe Xu},
    title     = {FVP: 4D Visual Pre-training for Robot Learning},
    journal   = {ICCV},
    year      = {2025},
}

❗ This repo contains configs and experiments for both simulation and real-world datasets.

Requirements

3D Diffusion Policy

Please see DP3 installation instructions.

FVP

In addition to a PyTorch environment, please install:

conda install pyyaml
pip install ema-pytorch tensorboard

Simulation Dataset Generation

You can generate simulated demonstrations by following the DP3 instructions, for example:

cd your_path/3D-Diffusion-Policy-master
bash scripts/gen_demonstration_adroit.sh hammer

Real-world Dataset Generation

We store the real-world dataset as a dictionary that follows the same format as the simulation dataset:

"point_cloud": Array of shape (T, Np, 6), where Np is the number of points and the 6 channels are [x, y, z, r, g, b]. Note: we strongly recommend cropping out the table and background so that only the task-relevant points remain in the observation; this proved effective in our real-world experiments.
"image": Array of shape (T, H, W, 3)
"depth": Array of shape (T, H, W)
"agent_pos": Array of shape (T, Nd), where Nd is the action dimension of the robot agent, e.g. 22 for our dexhand tasks (6D end-effector pose + 16D joint positions)
"action": Array of shape (T, Nd). We use relative end-effector position control for the robot arm and relative joint-angle position control for the dex hand.

You can follow this example to collect a real-world dataset.
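
Separately from the linked example, the dictionary layout listed above can be sanity-checked with a few lines of NumPy before training. This is a minimal sketch: `check_episode` is a hypothetical helper, and the sizes and dtypes used for the dummy episode (T=10, Np=1024, 84x84 images) are placeholders rather than values used in this repo.

```python
import numpy as np


def check_episode(episode, action_dim):
    """Validate one episode dictionary against the layout listed above.

    Returns the number of timesteps T, or raises ValueError on a mismatch.
    """
    T = episode["point_cloud"].shape[0]
    expected_ndim = {"point_cloud": 3, "image": 4, "depth": 3, "agent_pos": 2, "action": 2}
    for key, ndim in expected_ndim.items():
        arr = episode[key]
        if arr.ndim != ndim:
            raise ValueError(f"{key}: expected {ndim} dims, got {arr.ndim}")
        if arr.shape[0] != T:
            raise ValueError(f"{key}: expected {T} timesteps, got {arr.shape[0]}")
    if episode["point_cloud"].shape[2] != 6:
        raise ValueError("point_cloud: last axis must be [x, y, z, r, g, b]")
    if episode["image"].shape[3] != 3:
        raise ValueError("image: last axis must have 3 channels")
    if episode["image"].shape[1:3] != episode["depth"].shape[1:3]:
        raise ValueError("image and depth must share (H, W)")
    for key in ("agent_pos", "action"):
        if episode[key].shape[1] != action_dim:
            raise ValueError(f"{key}: expected {action_dim} dims per step")
    return T


# Dummy episode with placeholder sizes (assumptions, not the repo's values).
T, Np, H, W, Nd = 10, 1024, 84, 84, 22
episode = {
    "point_cloud": np.zeros((T, Np, 6), dtype=np.float32),
    "image": np.zeros((T, H, W, 3), dtype=np.uint8),
    "depth": np.zeros((T, H, W), dtype=np.float32),
    "agent_pos": np.zeros((T, Nd), dtype=np.float32),
    "action": np.zeros((T, Nd), dtype=np.float32),
}

num_steps = check_episode(episode, action_dim=Nd)
```

Running such a check on every recorded episode catches shape mismatches before they surface as errors inside the training loop.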

FVP Pre-training

In config/dp3.yaml, set the dataset path to point to your own dataset. Then run FVP pre-training:

python train_gpu.py --config config/dp3.yaml

DP3 Post-training

Load the weights pre-trained with FVP, then run the standard DP3 training command:

bash scripts/train_policy.sh dp3 adroit_hammer 0112 0 0
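
How the pre-trained weights are wired into DP3 is not spelled out here. One common pattern is to copy only the entries whose names and shapes match the target model and leave everything else at its initial value. The sketch below shows that matching step on plain dictionaries of NumPy arrays standing in for PyTorch state dicts; the `select_matching_weights` helper and all key names are hypothetical.

```python
import numpy as np


def select_matching_weights(pretrained, target, prefix=""):
    """Return entries of `pretrained` whose (prefixed) name and shape match `target`."""
    matched, skipped = {}, []
    for name, value in pretrained.items():
        key = prefix + name
        if key in target and target[key].shape == value.shape:
            matched[key] = value
        else:
            skipped.append(name)
    return matched, skipped


# Stand-ins for state dicts; every key name here is hypothetical.
pretrained = {
    "mlp.0.weight": np.ones((64, 6)),
    "mlp.0.bias": np.ones(64),
    "decoder.weight": np.ones((6, 64)),  # pre-training-only head, absent from the policy
}
policy = {
    "encoder.mlp.0.weight": np.zeros((64, 6)),
    "encoder.mlp.0.bias": np.zeros(64),
    "head.weight": np.zeros((22, 64)),
}

matched, skipped = select_matching_weights(pretrained, policy, prefix="encoder.")
policy.update(matched)  # with PyTorch: model.load_state_dict(matched, strict=False)
```

Inspecting `skipped` after loading is a cheap way to confirm that the encoder weights were actually picked up rather than silently ignored because of a name mismatch.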
