Inspired by the human visual system's top-down, task-driven search, we propose Multi-turn Grounding-based Policy Optimization (MGPO). MGPO equips LMMs with interpretable, iterative visual grounding: the model predicts key regions, crops sub-images, and reasons over both the original and focused views.
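The loop below is a minimal sketch of this multi-turn grounding rollout, assuming a hypothetical `model.generate` interface and a `<box>(x1, y1, x2, y2)</box>` output format; the actual verl-based implementation in this repo may differ in details.

```python
# Minimal sketch of the MGPO multi-turn grounding loop (hypothetical helper
# names; not the exact repo implementation).
import re
from PIL import Image

def parse_box(response: str):
    """Extract a predicted region like '<box>(x1, y1, x2, y2)</box>' from the
    model's text, or return None if the model answered directly."""
    m = re.search(r"<box>\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)</box>", response)
    return tuple(int(v) for v in m.groups()) if m else None

def mgpo_rollout(model, question: str, image: Image.Image, max_turns: int = 2) -> str:
    """Multi-turn grounding: the LMM predicts a key region, the environment
    crops it from the ORIGINAL high-resolution image, and the clear sub-image
    is appended to the context for the next reasoning turn."""
    context = [("image", image), ("text", question)]
    for _ in range(max_turns):
        response = model.generate(context)   # assumed LMM interface
        box = parse_box(response)
        if box is None:                      # no box -> model answered directly
            return response
        sub_image = image.crop(box)          # crop at full resolution, not the resized view
        context += [("text", response), ("image", sub_image)]
    return model.generate(context)           # force a final answer after the last turn
```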
Key advantages:
- Interpretable, Top-down Visual Reasoning: MGPO highlights which image regions are attended to at each step.
- Breaks Pixel Limits: Even when the full image must be downsampled to fit the model's pixel budget and becomes blurry, MGPO identifies key regions and crops clear sub-images from the original resolution for further analysis.
- No Extra Grounding Annotations Needed: MGPO is trained only with a binary answer-correctness reward (see the sketch after this list), yet learns robust grounding.
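Since no bounding-box supervision is used, the training signal reduces to checking the final answer. A minimal sketch of such a binary outcome reward, assuming a simple normalized exact-match check (the repo's verifier may be more elaborate):

```python
# Hypothetical binary outcome reward: 1.0 if the final answer matches the
# ground truth, 0.0 otherwise. Intermediate grounding (crop coordinates)
# receives no direct supervision; grounding emerges from RL alone.
def binary_reward(final_answer: str, ground_truth: str) -> float:
    normalize = lambda s: s.strip().lower()
    return 1.0 if normalize(final_answer) == normalize(ground_truth) else 0.0
```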
Our code is built on verl; the training code and scripts are available at
(Figure: examples of models trained with multi-turn grounding-based RL on high-resolution real-world tasks. The model first identifies key regions, which are then automatically cropped and returned as sub-images. Notably, despite a reward derived only from the binary correctness of the final answer, the model gradually develops robust grounding capability throughout the RL process.)
- MGPO outperforms both SFT and GRPO on high-resolution tasks.
- +5.4% on MME-Realworld (ID) and +5.2% on V* Bench (OOD) over the GRPO baseline.
- Surpasses OpenAI’s o1 and GPT-4o on V* Bench, despite using a smaller model and less data.
If you find our work useful for your research, please consider citing:
```bibtex
@misc{huang2025highresolutionvisualreasoningmultiturn,
      title={High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning},
      author={Xinyu Huang and Yuhao Dong and Weiwei Tian and Bo Li and Rui Feng and Ziwei Liu},
      year={2025},
      eprint={2507.05920},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.05920},
}
```
