Oliver-Cong02/VIVA


VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization

Website Paper


Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang, Canyu Zhang, Chongyang Ma

Abstract

Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse, complex real-world instructions. To address this generalization gap, we propose 🔥VIVA🔥, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video–instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods.
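To make the "relative rewards" idea concrete, here is a minimal sketch of the group-relative advantage used in standard Group Relative Policy Optimization: each sampled output is scored against the mean and standard deviation of its own sampling group. This is not the Edit-GRPO implementation from this repository. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: returns (r - group mean) / (group std + eps)
    for every reward in the group.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, variants exist
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up scalar rewards for four edits sampled from the same instruction.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# Edits that score above the group mean get a positive advantage,
# edits below the mean get a negative one.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because the advantage depends only on how an edit compares with the other edits in its group, no separate value network is needed to provide a baseline.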

⚙️ Installation

All code has been tested on:

  • NVIDIA H100 GPU (80G)
  • Debian 12 (bookworm)
  • CUDA 12.4
  • Python 3.11
  • PyTorch 2.7.1
  1. Create a new conda environment:
conda create -n VIVA python==3.11
conda activate VIVA
  2. Make sure fastvideo and flash-attn are properly installed:
pip install fastvideo
pip install flash-attn
  3. We recommend pinning the following versions:
pip install transformers==4.47.0 peft==0.17.1 
  4. Download the pretrained VIVA checkpoint from Hugging Face into ./ckpts:
huggingface-cli download xiaoyan03/VIVA ckpts.zip --repo-type model --local-dir .
unzip ckpts.zip
rm ckpts.zip
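After installing, a quick check like the following can confirm that the environment matches the versions listed above. This is a sketch, not part of the repository: `RECOMMENDED`, `installed_version`, and `find_mismatches` are hypothetical names, and the script assumes the PyPI distribution names `torch`, `transformers`, and `peft`. It uses only the Python standard library.

```python
import sys
from importlib import metadata

# Versions copied from the tested setup and the recommended pins above.
RECOMMENDED = {"torch": "2.7.1", "transformers": "4.47.0", "peft": "0.17.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(recommended, lookup=installed_version):
    """Map each mismatching package to a (recommended, found) pair."""
    mismatches = {}
    for package, wanted in recommended.items():
        found = lookup(package)
        # Compare only the public version, ignoring local tags such as "+cu124".
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


print("Python", sys.version.split()[0])
for package, (wanted, found) in find_mismatches(RECOMMENDED).items():
    print(f"{package}: recommended {wanted}, found {found or 'not installed'}")
```

If the script prints only the Python version, every listed package matches its recommended version.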

🚀 Inference

  1. Data Preparation: Please follow the folder structure of data/example.

  2. Preprocess

bash scripts/preprocess.sh
  3. Inference
bash scripts/inference.sh

You can adjust the CFG scale (VIDEO_SCALES) in scripts/inference.sh.
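For intuition about what this scale controls, here is a minimal sketch of standard classifier-free guidance, assuming "cfg" has its usual meaning. The README does not say how VIDEO_SCALES is consumed inside the model, so the function below is a hypothetical illustration of the general technique on plain Python lists, not the repository's code.

```python
def apply_cfg(uncond, cond, scale):
    """Blend unconditional and conditional predictions.

    Standard classifier-free guidance: uncond + scale * (cond - uncond).
    scale = 1.0 returns the conditional prediction unchanged; larger values
    push the result further in the direction the condition suggests.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy two-element "predictions" (made-up numbers).
uncond = [0.0, 1.0]
cond = [1.0, 3.0]

for scale in (1.0, 2.0, 4.0):
    print(scale, apply_cfg(uncond, cond, scale))
```

In general, higher scales tend to follow the conditioning more strongly at the risk of artifacts, and lower scales stay closer to the unconditional prediction, so it is worth trying a few values.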

🎓 Citation

Please cite our paper if you find this repository useful:

@article{cong2025viva,
  title={VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization},
  author={Cong, Xiaoyan and Yang, Haotian and Wang, Angtian and Wang, Yizhi and Yang, Yiding and Zhang, Canyu and Ma, Chongyang},
  journal={arXiv preprint arXiv:2512.16906},
  year={2025}
}
