
VideoVLA [NeurIPS 2025]

VideoVLA is a simple approach that explores the potential of directly transforming large video generation models into robotic vision-language-action (VLA) manipulators.

This repository contains the official implementation of the paper:

VideoVLA: Video Generators Can Be Generalizable Robot Manipulators (NeurIPS 2025)

🔗 Project Page: Project Website

📄 Paper: Paper Link

1. Quick Start

First, prepare the runtime environment and install all required dependencies by running:

bash build.sh

2. Downloading the Pretrained Checkpoint

Our method relies on pretrained components from CogVideo. You can follow the official CogVideo instructions to obtain the pretrained checkpoints: CogVideo

Specifically, download:

  • T5 checkpoint
  • VAE checkpoint
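Before editing the config, it helps to verify that both checkpoints are actually on disk. The sketch below assumes a local `pretrained/` layout, which is an illustrative choice, not a directory structure prescribed by this repository:

```shell
# Hypothetical layout: place the downloaded CogVideo components under
# pretrained/. The directory names here are assumptions for illustration.
mkdir -p pretrained/t5 pretrained/vae

# Check that each checkpoint directory exists before pointing the
# YAML config at it.
for d in pretrained/t5 pretrained/vae; do
  if [ -d "$d" ]; then
    echo "found $d"
  else
    echo "missing $d" >&2
  fi
done
```

If either directory is reported missing, re-run the CogVideo download steps before continuing.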

After downloading, update the checkpoint paths in the following configuration file:

config_use/action_config/videovla_config.yaml

Make sure the paths correctly point to the downloaded T5 and VAE checkpoints before starting training or evaluation.
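As a rough sketch, the checkpoint-path fields might look like the fragment below. The key names are assumptions for illustration; match them against the actual fields in `config_use/action_config/videovla_config.yaml`:

```yaml
# Illustrative only: key names are assumptions, not the repo's actual schema.
model:
  conditioner_config:
    t5_model_dir: /path/to/t5-checkpoint    # downloaded T5 weights
  first_stage_config:
    vae_ckpt_path: /path/to/vae-checkpoint  # downloaded VAE weights
```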


3. Inference

This section describes how to run inference with a trained model checkpoint to generate both video predictions and robot actions.

python sample_video_action.py \
  --base config_use/action_config/videovla_config.yaml config_use/action_config/inference_config/inference.yaml
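Passing two files to `--base` suggests the configs are merged, with later files overriding earlier ones. The snippet below is a minimal sketch of that merge semantics, assuming CogVideo-style behavior; the config keys shown are hypothetical, not taken from this repository:

```python
# Sketch of deep-merging two --base configs, where the second (inference)
# config overrides or extends the first (model) config. Assumption:
# later files win on conflicts, as in CogVideo-style launchers.
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`, returning a new dict."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

# Hypothetical contents of the two YAML files after loading:
model_cfg = {"model": {"t5_path": "/path/to/t5", "vae_path": "/path/to/vae"}}
infer_cfg = {"model": {"checkpoint": "/path/to/trained.ckpt"},
             "sampling": {"num_frames": 16}}

merged = deep_merge(model_cfg, infer_cfg)
print(merged["model"]["t5_path"])        # kept from the first config
print(merged["model"]["checkpoint"])     # added by the inference config
```

This is why the order of the files after `--base` matters: inference-specific settings should come last so they take precedence.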

Citations

@inproceedings{videovla,
    title={VideoVLA: Video Generators Can Be Generalizable Robot Manipulators},
    author={Yichao Shen and Fangyun Wei and Zhiying Du and Yaobo Liang and Yan Lu and Jiaolong Yang and Nanning Zheng and Baining Guo},
    booktitle={The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)},
    year={2025},
    url={https://openreview.net/forum?id=UPHlqbZFZB}
}
  
