Haoji Zhang*, Xin Gu*, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, Yansong Tang†
*Equal contributions, †Correspondence
We propose VITAL, a tool-augmented framework that enables advanced long video reasoning and temporal grounding.
We also introduce MTVR, a high-quality multi-task video reasoning training dataset.
The dataset is available here.
This project is based on verl 0.4.0.dev, vllm 0.8.5.post1, transformers 4.51.1, torch 2.6.0, and Python 3.10. Install the dependencies:
```bash
bash setup.sh
```

The project directory is organized as follows:

```
VITAL/
├── data/
│   ├── actnet/
│   ├── charades/
│   ├── longvideo-reason/
│   ├── mmvu/
│   ├── nextgqa/
│   ├── rextime/
│   ├── vidchapters/
│   ├── Video-R1-data/
│   ├── videomme/
│   ├── videommmu/
│   ├── vidi/
│   ├── vsibench/
├── models/
│   ├── Qwen2.5-VL-3B-Instruct/
│   ├── Qwen2.5-VL-7B-Instruct/
├── outputs/                  # an empty folder to store the outputs
```
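A fresh clone may not contain these folders yet. Below is a minimal sketch for creating the empty skeleton, assuming you run it from the VITAL/ repository root; the `data/` subfolders are filled in during the data preparation steps below.

```bash
# Minimal sketch, assuming the current directory is the VITAL/ repository root.
# The data/ subfolders are populated later when the datasets are downloaded.
mkdir -p data models outputs
```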
- Prepare the Qwen2.5-VL-7B-Instruct model (see the sketch after this list for the 3B variant):

```bash
mkdir -p models
huggingface-cli download --resume-download --local-dir-use-symlinks False Qwen/Qwen2.5-VL-7B-Instruct --revision main --local-dir ./models/Qwen2.5-VL-7B-Instruct
```

- Prepare the datasets for training and evaluation. You can download the datasets listed under `data/` above from their official websites.
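If you also need the Qwen2.5-VL-3B-Instruct model shown in the directory layout, a similar command should work. This is a sketch that mirrors the 7B download above and assumes the same huggingface-cli setup:

```bash
# Sketch: download the 3B checkpoint the same way as the 7B one above.
huggingface-cli download --resume-download --local-dir-use-symlinks False Qwen/Qwen2.5-VL-3B-Instruct --revision main --local-dir ./models/Qwen2.5-VL-3B-Instruct
```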
Run the following scripts to train and evaluate the model:

```bash
bash train_stage_1_sft.sh
bash train_stage_2_dgrpo.sh
bash train_stage_3_sft.sh
bash train_stage_4_dgrpo.sh
```

Note:
- You need to set some configurations in the corresponding scripts, such as `PRETRAINED_CKPT` and `YOUR_WANDB_API_KEY` (see the sketch below).
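A minimal sketch of what that configuration might look like inside a training script; the values are placeholders, and for the later stages `PRETRAINED_CKPT` would point to the checkpoint produced by the previous stage:

```bash
# Sketch only: placeholder values, adjust the paths and keys to your own setup.
PRETRAINED_CKPT=./models/Qwen2.5-VL-7B-Instruct   # or the checkpoint saved by the previous stage
YOUR_WANDB_API_KEY=xxxxxxxx                       # your Weights & Biases API key
```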
If you find this project useful in your research, please consider citing:
```bibtex
@article{zhang2025thinking,
  title={Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning},
  author={Zhang, Haoji and Gu, Xin and Li, Jiawen and Ma, Chixiang and Bai, Sule and Zhang, Chubin and Zhang, Bowen and Zhou, Zhichao and He, Dongliang and Tang, Yansong},
  journal={arXiv preprint arXiv:2508.04416},
  year={2025}
}
```
