This repository contains the official implementation of our work, which introduces a new paradigm for training large vision-language models. We argue that the key to unlocking advanced multimodal reasoning lies in moving beyond uniform, coarse learning signals and instead putting a spotlight on token-level visual perception.
Our proposed algorithm, Visually-Perceptive Policy Optimization (VPPO), is the first to directly implement this principle. It intelligently focuses policy updates on the critical moments of visually-grounded reasoning, leading to state-of-the-art performance, superior training efficiency, and a more robust learning process.
- [2026-01-30] 🎉 Our paper has been accepted to ICLR 2026!
- [2025-11-11] The training script for VPPO-8B is now available! We've updated our codebase to fully support training based on Qwen3-VL-8B-Instruct. You can find the script at `examples/configs/train_vppo_8b.sh`.
- [2025-11-07] We have released VPPO-8B, a new model that achieves excellent performance. It was trained using our VPPO algorithm, starting from the Qwen3-VL-8B-Instruct model. Compared to our previous training runs, we increased the `max response length` and the `Entropy Penalty Coefficient`. You can find more details and access the model in our Hugging Face Models collection.
Standard reinforcement learning methods for LVLMs suffer from a fundamental flaw: they treat every token in a generated response as equally important. A single reward is broadcast indiscriminately, rewarding generic phrases just as much as the critical step where the model perceives a key detail from the image.
Our analysis reveals two key truths about multimodal reasoning:
- Token visual dependency is sparse: Only a small fraction of tokens in a reasoning chain are highly dependent on the visual input. These are the pivotal moments of visually-grounded reasoning.
- Trajectory visual dependency is heterogeneous: Not all correct solutions are equal. Some are robustly grounded in visual evidence, while others are "lucky guesses" based on linguistic priors.
Left: Most tokens have low visual dependency. Right: Trajectories show a wide range of visual dependency. Standard RL treats them all the same.
This misalignment causes signal dilution, slowing down learning and preventing models from developing genuine multimodal perception and reasoning skills.
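The sparse token-level dependency described above can be probed with a simple contrastive measure, for instance the KL divergence between the model's next-token distributions computed with and without the image. The sketch below is our own illustration of this idea; the function name and the exact dependency definition are assumptions, not necessarily the paper's formulation:

```python
import torch
import torch.nn.functional as F

def token_visual_dependency(logits_with_image: torch.Tensor,
                            logits_without_image: torch.Tensor) -> torch.Tensor:
    """Per-token KL divergence between the model's next-token distributions
    computed with and without the visual input.

    Both inputs have shape (seq_len, vocab_size); the result has shape
    (seq_len,). Larger values mark tokens whose prediction shifts when the
    image is removed, i.e. the visually dependent tokens.
    """
    log_p = F.log_softmax(logits_with_image, dim=-1)
    log_q = F.log_softmax(logits_without_image, dim=-1)
    # KL(p || q), summed over the vocabulary at each position
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)

# Toy example: 6 positions over a 10-token vocabulary; only position 2
# changes when the image is dropped, so only it gets a nonzero score.
torch.manual_seed(0)
with_img = torch.randn(6, 10)
without_img = with_img.clone()
without_img[2] += 3.0 * torch.randn(10)

dep = token_visual_dependency(with_img, without_img)
```

In a real LVLM the "without image" pass would replace or mask the visual tokens; here random logits stand in for the two forward passes.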
VPPO is a novel policy gradient algorithm designed to solve this problem by reshaping the learning signal at two levels of granularity:
- Macro-Level 🎯 Trajectory Advantage Shaping (TAS): We re-weight the advantage of each trajectory based on its average visual dependency. This prioritizes learning from robust, perception-grounded reasoning paths.
- Micro-Level 🔦 Token Gradient Filtering (TGF): We construct a sparse gradient mask to focus policy updates exclusively on the top-k% most visually-dependent tokens. This puts a "spotlight" on what truly matters, reducing gradient variance and leading to more stable and effective training.
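The two mechanisms can be sketched in a few lines of PyTorch. Everything here (the function name, its signature, and how the scores are combined) is our own illustration and not the repository's actual API:

```python
import torch

def vppo_shaping(advantages: torch.Tensor,
                 token_dep: torch.Tensor,
                 traj_dep: float,
                 top_k_frac: float = 0.2) -> torch.Tensor:
    """Hypothetical helper illustrating VPPO's two-level signal shaping.

    advantages: (T,) per-token advantages broadcast from a single reward
    token_dep:  (T,) per-token visual dependency scores
    traj_dep:   trajectory-level (e.g. mean) visual dependency
    """
    # Macro level (TAS): re-weight the whole trajectory by its visual
    # dependency, so perception-grounded trajectories dominate the update.
    shaped = advantages * traj_dep

    # Micro level (TGF): zero the signal everywhere except the top-k%
    # most visually dependent tokens.
    k = max(1, int(top_k_frac * token_dep.numel()))
    top_idx = torch.topk(token_dep, k).indices
    mask = torch.zeros_like(shaped)
    mask[top_idx] = 1.0
    return shaped * mask

# Toy trajectory of 10 tokens with a uniform advantage of 1.0
adv = torch.ones(10)
dep = torch.arange(10.0)   # tokens 8 and 9 are the most visually dependent
out = vppo_shaping(adv, dep, traj_dep=0.5, top_k_frac=0.2)
```

With `top_k_frac=0.2`, only 2 of the 10 tokens keep a nonzero (TAS-scaled) advantage; the remaining tokens contribute no gradient, which is the "spotlight" effect.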
By focusing the learning signal, VPPO establishes a new state-of-the-art across 8 challenging multimodal reasoning benchmarks.
VPPO not only achieves higher final performance but also gets there faster and more reliably.
- Python 3.10
- PyTorch 2.8.0
- CUDA 12.8
```bash
# Create and activate a conda environment
conda create -n vppo python=3.10
conda activate vppo

# Clone the repository
git clone https://github.com/huaixuheqing/VPPO-RL
cd VPPO-RL

# Install dependencies
pip install -e .
```

The training pipeline is adapted from EasyR1. We provide example scripts for training 7B and 8B models with VPPO.
- Hardware for Qwen2.5-VL-7B: 8 x H800 (80G) GPUs.
- Hardware for Qwen2.5-VL-32B: 32 x H800 (80G) GPUs.
```bash
# To train the VPPO-7B model
bash examples/configs/train_vppo_7b.sh

# To train the VPPO-8B model
bash examples/configs/train_vppo_8b.sh
```

Our evaluation leverages the framework from PAPO-Eval. To replicate our results, download our evaluation data from the VPPO-Eval Hugging Face dataset and place the `data` folder from that dataset directly into your local PAPO-Eval repository. Once the data is in place, you can run the evaluation scripts by selecting the desired benchmark name; a complete list of available benchmark names can be found in `data/dataset_info.json`. All results in the paper are reported as average accuracy@8 with an inference temperature of 1.0.
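For reference, the headline metric can be computed with a small helper. This assumes "average accuracy@8" means the mean per-sample correctness over 8 sampled responses per problem (our reading of the metric; the helper name and shape are illustrative, not code from the repository):

```python
def avg_accuracy_at_k(per_problem_correct, k=8):
    """Average accuracy@k: for each problem, sample k responses, average
    their 0/1 correctness, then average across problems.

    per_problem_correct: list of length-k 0/1 lists, one per problem.
    """
    assert all(len(c) == k for c in per_problem_correct)
    return sum(sum(c) / k for c in per_problem_correct) / len(per_problem_correct)

# Two problems, 8 sampled answers each (1 = correct, 0 = incorrect)
score = avg_accuracy_at_k([
    [1, 1, 1, 0, 1, 0, 1, 1],   # 6/8 correct
    [0, 0, 1, 0, 0, 0, 1, 0],   # 2/8 correct
])
# → (0.75 + 0.25) / 2 = 0.5
```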
| Benchmark | Hugging Face Link | Focus Domain |
|---|---|---|
| Geo3k | hiyouga/geometry3k | Geometric Reasoning |
| We-Math | We-Math/We-Math | Math Reasoning |
| MMK12 | FanqingM/MMK12 | Math Reasoning |
| MathVerse | AI4Math/MathVerse | Math Reasoning |
| MathVision | MathLLMs/MathVision | Math Reasoning |
| DynaMath | DynaMath/DynaMath_Sample | Math Reasoning |
| LogicVista | lscpku/LogicVista | Logical Reasoning |
| MMMU-Pro | MMMU/MMMU_Pro | Multi-discipline |
Note: We filter instances from MathVerse, MathVision, and DynaMath to ensure verifiable, exact-match evaluation. All datasets, including these filtered subsets, are publicly available on our Hugging Face.
If you find our work on token perception and the VPPO algorithm useful in your research, please cite our paper:
```bibtex
@article{huang2025spotlight,
  title={Spotlight on Token Perception for Multimodal Reinforcement Learning},
  author={Huang, Siyuan and Qu, Xiaoye and Li, Yafu and Luo, Yun and He, Zefeng and Liu, Daizong and Cheng, Yu},
  journal={arXiv preprint arXiv:2510.09285},
  year={2025}
}
```

Our codebase is built upon the excellent work of EasyR1. We are grateful to the original authors for their valuable contributions.