VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang, Canyu Zhang, Chongyang Ma

Abstract

Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse, complex real-world instructions. To address this generalization gap, we propose 🔥VIVA🔥, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video–instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods.
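
To make "relative rewards" concrete, here is a minimal sketch of the group-normalized advantage used in standard Group Relative Policy Optimization. It is not the Edit-GRPO implementation: the function name, the `eps` value, the choice of population standard deviation, and the reward numbers are all illustrative assumptions, and the actual reward models used by VIVA are not described in this README.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples (standard GRPO form).

    Each sample is scored relative to the other samples drawn for the same
    input, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four edits sampled from the same video and instruction.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
print(advantages)
```

Samples that score above the group mean get a positive advantage and are reinforced; samples below the mean get a negative one.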

⚙️ Installation

All code has been tested on:

NVIDIA H100 GPU (80G)
Debian 12 (bookworm)
CUDA 12.4
Python 3.11
PyTorch 2.7.1

Create a new conda environment:

conda create -n VIVA python==3.11
conda activate VIVA

Make sure fastvideo and flash-attn are properly installed:

pip install fastvideo
pip install flash-attn

We recommend:

pip install transformers==4.47.0 peft==0.17.1
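
After installing, a quick check of what actually landed in the environment can save a failed run later. The sketch below uses only the standard library; the `RECOMMENDED` table and the `check_packages` helper are illustrative names, not part of this repository, and the pins are copied from the recommendation above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the recommendation above; None means "any installed version".
RECOMMENDED = {
    "transformers": "4.47.0",
    "peft": "0.17.1",
    "fastvideo": None,
    "flash-attn": None,
}


def check_packages(recommended=RECOMMENDED):
    """Return {package: (installed_version_or_None, ok)} for each package."""
    report = {}
    for name, pinned in recommended.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        ok = installed is not None and (pinned is None or installed == pinned)
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_packages().items():
        status = "ok" if ok else "check"
        print(f"{name:14s} {installed or 'missing':12s} {status}")
```

A package flagged `check` is either missing or installed at a version other than the recommended pin; other versions may still work but are untested here.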

Download the pretrained VIVA checkpoint from Hugging Face to ./ckpts:

huggingface-cli download xiaoyan03/VIVA ckpts.zip --repo-type model --local-dir .
unzip ckpts.zip
rm ckpts.zip

🚀 Inference

Data Preparation: Please follow the folder structure at data/example.
Preprocess

bash scripts/preprocess.sh

Inference

bash scripts/inference.sh

Feel free to adjust the cfg-scale (VIDEO_SCALES) in scripts/inference.sh.
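
For intuition on what this knob does, here is a sketch of the standard classifier-free guidance rule, assuming that is what cfg-scale refers to. How scripts/inference.sh consumes VIDEO_SCALES is not documented here, so the `apply_cfg` function and the toy numbers are purely illustrative.

```python
def apply_cfg(cond, uncond, scale):
    """Standard classifier-free guidance on two predictions.

    Starts from the unconditional prediction and moves toward (or past)
    the conditional one, depending on the scale.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]    # toy prediction conditioned on the source video
uncond = [0.0, 1.0]  # toy prediction without that condition
for scale in (1.0, 3.0, 5.0):
    print(scale, apply_cfg(cond, uncond, scale))
```

A scale of 1 reproduces the conditional prediction; larger values push the output further in the direction of the condition, which typically strengthens adherence at some cost to naturalness.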

🎓 Citation

Please cite our paper if you find this repository useful:

@article{cong2025viva,
  title={VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization},
  author={Cong, Xiaoyan and Yang, Haotian and Wang, Angtian and Wang, Yizhi and Yang, Yiding and Zhang, Canyu and Ma, Chongyang},
  journal={arXiv preprint arXiv:2512.16906},
  year={2025}
}
