A new RL method for visual reasoning that significantly outperforms vanilla GRPO and bypasses the need for explicit chain-of-thought supervision during training.
🤗 Hugging Face | 📑 Paper | 📖 Blog
This is the official implementation of the paper 'Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning'.
- [2025/06/16]: 🔥 We have released our code.
- [2025/06/03]: 🔥 Model checkpoints are available at [🤗HuggingFace].
- [2025/05/20]: 🔥 We have released our paper [Arxiv].
We reveal a critical limitation of GRPO when applied to vision-language models (VLMs): the tendency to develop shortcut learning. This finding highlights the need for better training techniques that ensure robust reasoning capabilities. To address the shortcut learning problem, we propose Visionary-R1.
The core of reinforcement learning is sampling training data with the policy model. In visual reasoning tasks, the sampled reasoning paths are evaluated only by the final answer. Because of the shortcut issue, where the model may produce an answer without genuine reasoning or ignore the visual input and rely mainly on textual patterns in the question, samples with correct answers can fail to provide useful reasoning guidance, which impedes the model's reasoning ability.
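For context, GRPO scores a group of responses sampled for the same prompt and normalizes their rewards within the group. The sketch below shows the standard group-relative advantage computation with answer-only rewards; it is a minimal illustration of the general GRPO recipe, not necessarily the exact implementation in this repo.

import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard GRPO advantage: normalize each sampled response's reward
    against the mean/std of its group (all samples drawn for the same prompt)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 responses sampled for one image-question pair, rewarded only by
# whether the final answer is correct (1.0) or not (0.0). A correct answer
# reached via a shortcut receives the same positive advantage as a
# well-reasoned one, which is exactly the failure mode described above.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))  # [ 1. -1.  1. -1.]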
In the example below, directly applying GRPO can lead to shortcuts on simple samples, since the model can reach the correct answer without detailed reasoning. This shortcut thinking, however, fails to generalize to more complex samples and ultimately impairs the model's overall reasoning ability.
The cornerstone of this approach is a structured caption–reason–answer training format, where the model must first generate a detailed caption of the image before proceeding to reasoning and answering the question.
This structured process ensures that the model doesn’t rely on superficial cues or patterns, as it often does in traditional setups. Instead, the captioning step forces the model to engage in a deeper analysis of the image context. By requiring detailed captions regardless of whether the question is easy or difficult, the framework encourages the model to adopt a consistent, robust problem-solving approach. This not only mitigates shortcut learning but also enhances the model’s ability to generalize across different data distributions.
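As an illustration, the caption–reason–answer structure can be enforced with a prompt template plus a simple format check. The tag names and prompt wording below are placeholders chosen for this sketch and may differ from the exact prompt used in this repo.

import re

# Hypothetical system prompt illustrating the caption -> reason -> answer structure.
SYSTEM_PROMPT = (
    "First describe the image in detail inside <caption> </caption>, "
    "then reason step by step inside <think> </think>, "
    "and finally give the answer inside <answer> </answer>."
)

# A response is well-formed only if it contains the three blocks in order.
FORMAT_PATTERN = re.compile(
    r"<caption>.*?</caption>\s*<think>.*?</think>\s*<answer>.*?</answer>",
    re.DOTALL,
)

def is_well_formed(response: str) -> bool:
    return FORMAT_PATTERN.search(response) is not None

example = (
    "<caption>A bar chart comparing favorable ratings from 2013 to 2016.</caption>"
    "<think>The 2015 bar for 'Favorable' reads 38.</think>"
    "<answer>Yes</answer>"
)
print(is_well_formed(example))  # True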
To further ensure that the captions are meaningful and informative, we apply auxiliary supervision using reinforcement learning from AI feedback. This involves imposing a caption reward, which is combined with standard accuracy and format rewards during policy optimization. The integration of these rewards incentivizes the model to produce captions that are well-structured and contextually rich.
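As a rough illustration, the three reward signals could be combined as follows. The function names, weights, tag names, and the pass-through caption score are assumptions made for this sketch, not the exact reward functions used in this repo.

def accuracy_reward(predicted: str, ground_truth: str) -> float:
    """1.0 if the predicted final answer matches the ground truth, else 0.0."""
    return float(predicted.strip().lower() == ground_truth.strip().lower())

def format_reward(response: str) -> float:
    """1.0 if the response contains the caption, reasoning, and answer blocks
    (tag names are placeholders for illustration)."""
    tags = ("<caption>", "</caption>", "<think>", "</think>", "<answer>", "</answer>")
    return float(all(tag in response for tag in tags))

def caption_reward(judge_score: float) -> float:
    """Placeholder for the AI-feedback signal: Visionary-R1 scores the caption
    with a feedback model; here that score is simply passed through."""
    return judge_score

def total_reward(response: str, predicted: str, ground_truth: str, judge_score: float,
                 w_acc: float = 1.0, w_fmt: float = 0.5, w_cap: float = 0.5) -> float:
    # Illustrative weighting of the accuracy, format, and caption rewards;
    # the actual coefficients used during policy optimization may differ.
    return (w_acc * accuracy_reward(predicted, ground_truth)
            + w_fmt * format_reward(response)
            + w_cap * caption_reward(judge_score))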
| Model | Size | Strategy | Data | MathVista | MathVision | MMStar | MMBench |
|---|---|---|---|---|---|---|---|
| Closed-source models | |||||||
| GPT-4o | - | - | - | 63.8 | 31.2 | 65.1 | 84.3 |
| GPT-o1 | - | - | - | 71.8 | 63.2 | 67.5 | 83.8 |
| Claude3.5-Sonnet | - | - | - | 67.7 | 37.9 | 65.1 | 82.6 |
| Claude3.7-Sonnet | - | - | - | 74.5 | 58.6 | 68.8 | 82.0 |
| Gemini-1.5-Pro | - | - | - | 63.9 | 19.2 | 59.1 | 73.9 |
| Gemini-2.5-Pro | - | - | - | 82.7 | 73.3 | 77.5 | 90.1 |
| Open-source models | |||||||
| Qwen2.5-VL | 3B | - | - | 62.3 | 21.2 | 55.9 | 79.1 |
| InternVL2.5 | 4B | - | - | 60.5 | 20.9 | 58.3 | 81.1 |
| MiniCPM-V2.6 | 8B | - | - | 60.6 | 17.5 | 57.5 | 81.5 |
| LLaMA3.2 | 11B | - | - | 51.5 | - | 49.8 | 65.8 |
| Reasoning models | |||||||
| Ovis | 4B | SFT | CoT | 66.6 | - | 59.5 | 79.3 |
| Mulberry | 7B | SFT | CoT | 63.1 | - | 61.3 | - |
| R1-Onevision | 7B | SFT+RL | CoT | 64.1 | 29.9 | - | - |
| Insight-V | 7B | SFT+RL | CoT | 59.9 | - | 61.5 | 82.3 |
| R1-VL | 7B | SFT+RL | CoT | 63.5 | 24.7 | 60.0 | - |
| LLaVA-CoT | 11B | SFT | CoT | 54.8 | - | 57.6 | 75.0 |
| Our models | |||||||
| Base Model | 3B | - | - | 61.5 | 19.1 | 52.4 | 82.1 |
| SFT | 3B | SFT | QA | 54.6 | 7.0 | 61.9 | 80.7 |
| GRPO | 3B | RL | QA | 61.8 | 20.3 | 54.3 | 78.6 |
| Visionary-R1 | 3B | RL | QA | 69.4 | 24.7 | 66.5 | 84.1 |
git clone [email protected]:maifoundations/Visionary-R1.git
cd Visionary-R1
# build environment
conda create -n visionary-r1 python=3.12
conda activate visionary-r1
pip install -e ".[all]"
The training data should be organized in the following format.
{
    "solution": "<answer> Yes </answer>",
    "prompt": [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Is the value of Favorable 38 in 2015?"},
            ],
        },
    ],
    "problem_type": "numerical",
}
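For example, a raw QA sample could be wrapped into this format with a small helper like the one below. The helper name is hypothetical, and SYSTEM_PROMPT stands for the repo's own system prompt; the image itself is supplied separately alongside the {"type": "image"} placeholder.

def build_record(question: str, answer: str, problem_type: str, system_prompt: str) -> dict:
    """Wrap a raw image-question-answer sample into the training format shown above."""
    return {
        "solution": f"<answer> {answer} </answer>",
        "prompt": [
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": question},
                ],
            },
        ],
        "problem_type": problem_type,
    }

record = build_record(
    question="Is the value of Favorable 38 in 2015?",
    answer="Yes",
    problem_type="numerical",
    system_prompt="...",  # the repo's SYSTEM_PROMPT
)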
To run Visionary-R1 with Qwen2.5-VL-3B:
torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" \
--nnodes="${ARNOLD_WORKER_NUM}" \
--node_rank="${ARNOLD_ID}" \
--master_addr="${METIS_WORKER_0_HOST}" \
--master_port="${port_in_cmd}" \
src/open_r1/grpo_vllm_caption.py \
--deepspeed scripts/zero3.json \
--output_dir checkpoints/Visionary-R1 \
--model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
--dataset_name ${YOUR_DATA_PATH} \
--max_prompt_length 4096 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--logging_steps 1 \
--learning_rate 5e-7 \
--beta 0.04 \
--bf16 \
--report_to wandb \
--gradient_checkpointing true \
--attn_implementation flash_attention_2 \
--max_pixels 2359296 \
--save_total_limit 1 \
--num_train_epochs 1 \
--run_name Visionary-R1
@article{xia2025visionary,
title={Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning},
author={Xia, Jiaer and Zang, Yuhang and Gao, Peng and Li, Yixuan and Zhou, Kaiyang},
journal={arXiv preprint arXiv:2505.14677},
year={2025}
}
Our design and code build upon the following projects: [MM-EUREKA], [R1-V], [Video-R1].

