VideoVLA is a simple approach that explores directly transforming large video generation models into generalizable robotic VLA manipulators.
This repository contains the official implementation of the paper:
VideoVLA: Video Generators Can Be Generalizable Robot Manipulators (NeurIPS 2025)
🔗 Project Page: Project Website
📄 Paper: Paper Link
First, prepare the runtime environment and install all required dependencies by running:

```shell
bash build.sh
```

Our method relies on pretrained components from CogVideo. You can follow the official CogVideo instructions to obtain the pretrained checkpoints: CogVideo
Specifically, download:
- T5 checkpoint
- VAE checkpoint
After downloading, update the checkpoint paths in the following configuration file:
`config_use/action_config/videovla_config.yaml`
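The relevant entries look roughly like the sketch below. The key names and nesting here are illustrative assumptions, not the file's actual schema; edit whichever keys in `videovla_config.yaml` hold the T5 and VAE paths:

```yaml
# Hypothetical excerpt of config_use/action_config/videovla_config.yaml.
# Key names are assumptions -- match them to the keys actually in the file.
model:
  t5_checkpoint: /path/to/t5/checkpoint    # downloaded T5 weights
  vae_checkpoint: /path/to/vae/checkpoint  # downloaded CogVideo VAE weights
```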
Make sure the paths correctly point to the downloaded T5 and VAE checkpoints before starting training or evaluation.
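Before launching a long run, it can save time to sanity-check that the configured paths actually exist. A minimal sketch, assuming PyYAML is available and that the checkpoint paths live under keys named `t5_checkpoint` and `vae_checkpoint` (both names are assumptions; substitute the real keys from the config):

```python
import os

import yaml  # PyYAML


def missing_checkpoints(config_path, keys=("t5_checkpoint", "vae_checkpoint")):
    """Return the configured checkpoint paths that do not exist on disk.

    The key names are illustrative; pass the actual keys used in
    videovla_config.yaml. Keys absent from the config are skipped.
    """
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    return [cfg[k] for k in keys if k in cfg and not os.path.exists(cfg[k])]
```

An empty return value means every configured path was found; otherwise the returned list names the paths that need fixing before training or evaluation.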
This section describes how to run inference with a trained model checkpoint to generate video and action.
```shell
python sample_video_action.py \
  --base config_use/action_config/videovla_config.yaml config_use/action_config/inference_config/inference.yaml
```

If you find this work useful, please cite:

```bibtex
@inproceedings{
videovla,
title={VideoVLA: Video Generators Can Be Generalizable Robot Manipulators},
author={Yichao Shen and Fangyun Wei and Zhiying Du and Yaobo Liang and Yan Lu and Jiaolong Yang and Nanning Zheng and Baining Guo},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)},
year={2025},
url={https://openreview.net/forum?id=UPHlqbZFZB}
}
```