VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang, Canyu Zhang, Chongyang Ma
Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse, complex, real-world instructions. To address this generalization gap, we propose 🔥VIVA🔥, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video–instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods.
All code has been tested on:
- NVIDIA H100 GPU (80G)
- Debian 12 (bookworm)
- CUDA 12.4
- Python 3.11
- PyTorch 2.7.1
- Create a new conda environment:
    conda create -n VIVA python==3.11
    conda activate VIVA
- Make sure fastvideo and flash-attn are properly installed:
    pip install fastvideo
    pip install flash-attn
- We recommend the following package versions:
    pip install transformers==4.47.0 peft==0.17.1
- Download the pretrained VIVA checkpoint from Hugging Face to ./ckpts:
    huggingface-cli download xiaoyan03/VIVA ckpts.zip --repo-type model --local-dir .
    unzip ckpts.zip
    rm ckpts.zip
- Data Preparation: Please follow the folder structure at data/example.
- Preprocess:
    bash scripts/preprocess.sh
- Inference:
    bash scripts/inference.sh
  Feel free to adjust the cfg-scale (VIDEO_SCALES) in scripts/inference.sh.
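If preprocessing or inference fails, a mismatch with the tested versions above is one thing worth ruling out. The sketch below is a hypothetical helper, not a script shipped with this repository: it compares the running Python version and the installed torch, transformers, and peft versions against the ones named in this README, and checks that fastvideo and flash-attn are present. The function names (`installed_version`, `matches`, `check_environment`) are illustrative assumptions, and the script does not check the GPU, the CUDA toolkit, or the operating system.

```python
"""Hypothetical environment check; not part of the VIVA repository."""
import sys
from importlib import metadata

# Versions copied from this README; the check itself is an illustrative assumption.
TESTED_PYTHON = (3, 11)
TESTED_PACKAGES = {
    "torch": "2.7.1",
    "transformers": "4.47.0",
    "peft": "0.17.1",
}
# Named in the README without a version, so only their presence is checked.
REQUIRED_PACKAGES = ("fastvideo", "flash-attn")


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def matches(installed, tested):
    """Compare versions while ignoring a local build tag such as '+cu124'."""
    return installed is not None and installed.split("+")[0] == tested


def check_environment():
    """Collect readable warnings; an empty list means everything matched."""
    warnings = []
    if sys.version_info[:2] != TESTED_PYTHON:
        current = f"{sys.version_info.major}.{sys.version_info.minor}"
        warnings.append(f"Python {current} differs from tested 3.11")
    for name, tested in TESTED_PACKAGES.items():
        found = installed_version(name)
        if not matches(found, tested):
            warnings.append(f"{name}: found {found}, tested with {tested}")
    for name in REQUIRED_PACKAGES:
        if installed_version(name) is None:
            warnings.append(f"{name}: not installed")
    return warnings


if __name__ == "__main__":
    for line in check_environment() or ["Environment matches the tested versions."]:
        print(line)
```

Saved under any name and run inside the VIVA conda environment, the script prints one line per mismatch. A warning does not necessarily mean the code will fail; it only marks a difference from the configuration the authors report testing.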
Please cite our paper if you find this repository useful:
@article{cong2025viva,
title={VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization},
author={Cong, Xiaoyan and Yang, Haotian and Wang, Angtian and Wang, Yizhi and Yang, Yiding and Zhang, Canyu and Ma, Chongyang},
journal={arXiv preprint arXiv:2512.16906},
year={2025}
}