Inspired by the human visual system's top-down, task-driven search, we propose Multi-turn Grounding-based Policy Optimization (MGPO). MGPO equips LMMs with interpretable, iterative visual grounding: the model predicts key regions, crops sub-images, and reasons over both the original and focused views.
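The loop below is a minimal sketch of this multi-turn grounding rollout, assuming a hypothetical `model.generate` interface and a `<box>(x1, y1, x2, y2)</box>` output format; the actual verl-based implementation in this repo may differ in details.

```python
# Minimal sketch of the MGPO multi-turn grounding loop (hypothetical helper
# names; not the exact repo implementation).
import re
from PIL import Image

def parse_box(response: str):
    """Extract a predicted region like '<box>(x1, y1, x2, y2)</box>' from the
    model's text, or return None if the model answered directly."""
    m = re.search(r"<box>\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)</box>", response)
    return tuple(int(v) for v in m.groups()) if m else None

def mgpo_rollout(model, question: str, image: Image.Image, max_turns: int = 2) -> str:
    """Multi-turn grounding: the LMM predicts a key region, the environment
    crops it from the ORIGINAL high-resolution image, and the clear sub-image
    is appended to the context for the next reasoning turn."""
    context = [("image", image), ("text", question)]
    for _ in range(max_turns):
        response = model.generate(context)   # assumed LMM interface
        box = parse_box(response)
        if box is None:                      # no box -> model answered directly
            return response
        sub_image = image.crop(box)          # crop at full resolution, not the resized view
        context += [("text", response), ("image", sub_image)]
    return model.generate(context)           # force a final answer after the last turn
```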
Key advantages:
- Interpretable, Top-down Visual Reasoning: MGPO highlights which image regions are attended to at each step.
- Breaks Pixel Limits: Even when the full image must be downsampled to fit the model's pixel budget and becomes blurry, MGPO identifies key regions and crops clear sub-images from the original resolution for further analysis.
- No Extra Grounding Annotations Needed: MGPO is trained only with a binary answer-correctness reward (see the sketch after this list), yet learns robust grounding.
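Since no bounding-box supervision is used, the training signal reduces to checking the final answer. A minimal sketch of such a binary outcome reward, assuming a simple normalized exact-match check (the repo's verifier may be more elaborate):

```python
# Hypothetical binary outcome reward: 1.0 if the final answer matches the
# ground truth, 0.0 otherwise. Intermediate grounding (crop coordinates)
# receives no direct supervision; grounding emerges from RL alone.
def binary_reward(final_answer: str, ground_truth: str) -> float:
    normalize = lambda s: s.strip().lower()
    return 1.0 if normalize(final_answer) == normalize(ground_truth) else 0.0
```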
Our code is built on verl; the training code and scripts are available at
(Figure: examples of models trained with multi-turn grounding-based RL on high-resolution real-world tasks. The model first identifies key regions, which are then automatically cropped and returned as sub-images. Notably, despite a reward derived only from the binary correctness of the final answer, the model gradually develops robust grounding capability throughout the RL process.)
- MGPO outperforms both SFT and GRPO on high-resolution tasks.
- +5.4% on MME-Realworld (ID) and +5.2% on V* Bench (OOD) over the GRPO baseline.
- Surpasses OpenAI’s o1 and GPT-4o on V* Bench, despite using a smaller model and less data.
If you find our work useful for your research, please consider citing:
```bibtex
@misc{huang2025highresolutionvisualreasoningmultiturn,
      title={High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning},
      author={Xinyu Huang and Yuhao Dong and Weiwei Tian and Bo Li and Rui Feng and Ziwei Liu},
      year={2025},
      eprint={2507.05920},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.05920},
}
```
