Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

A new RL method for visual reasoning that significantly outperforms vanilla GRPO and removes the need for explicit chain-of-thought supervision during training.

🤗 Hugging Face | 📑 Paper | 📖 Blog

This is the official implementation of the paper 'Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning'.

News📰

  • [2025/06/16]:🔥We have released our code.
  • [2025/06/03]:🔥Model checkpoints are available at [🤗HuggingFace].
  • [2025/05/20]:🔥We have released our paper [arXiv].

Overview✈️

We reveal a critical limitation of GRPO when applied to vision-language models (VLMs): a tendency to develop shortcut learning. This finding highlights the need for better training techniques to ensure robust reasoning capabilities. To address the shortcut learning problem, we propose Visionary-R1.

The core of reinforcement learning is sampling training data with the policy model. In visual reasoning tasks, the sampled reasoning paths are evaluated based only on the final answer. Because of the shortcut issue, where the model may produce an answer without proper reasoning, or may disregard the visual input and rely mainly on textual patterns in the question, samples with correct answers can fail to provide useful reasoning guidance and thus impede the model's reasoning ability.

In the example below, the direct application of GRPO can lead to shortcuts when handling simple samples, as the model can arrive at the correct answer without detailed reasoning. However, this shortcut thinking struggles to generalize to more complex samples, ultimately impairing the model’s overall reasoning ability.

The cornerstone of this approach is a structured caption–reason–answer training format, where the model must first generate a detailed caption of the image before proceeding to reasoning and answering the question.

This structured process ensures that the model doesn’t rely on superficial cues or patterns, as it often does in traditional setups. Instead, the captioning step forces the model to engage in a deeper analysis of the image context. By requiring detailed captions regardless of whether the question is easy or difficult, the framework encourages the model to adopt a consistent, robust problem-solving approach. This not only mitigates shortcut learning but also enhances the model’s ability to generalize across different data distributions.

To further ensure that the captions are meaningful and informative, we apply auxiliary supervision using reinforcement learning from AI feedback. This involves imposing a caption reward, which is combined with standard accuracy and format rewards during policy optimization. The integration of these rewards incentivizes the model to produce captions that are well-structured and contextually rich.
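
As a rough sketch of how these signals could be combined (illustrative only, not the repository's reward code), the Python snippet below computes an accuracy reward and a format reward from a sampled completion and adds a weighted caption score from an AI-feedback judge. The <answer> tag matches the data format shown later in this README; the <caption> and <think> tags, the helper names, and the caption_weight hyperparameter are assumptions made for illustration.

import re

# Illustrative output structure for caption -> reason -> answer responses.
# The <answer> tag matches the data format below; <caption> and <think> are
# assumed tag names used here only for illustration.
STRUCTURE = re.compile(
    r"<caption>.*?</caption>\s*<think>.*?</think>\s*<answer>.*?</answer>",
    re.DOTALL,
)

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the structured format, else 0.0."""
    return 1.0 if STRUCTURE.fullmatch(completion.strip()) else 0.0

def accuracy_reward(completion: str, solution: str) -> float:
    """Return 1.0 if the predicted answer matches the ground-truth solution."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    predicted = match.group(1).strip().lower() if match else ""
    target = re.sub(r"</?answer>", "", solution).strip().lower()
    return 1.0 if predicted and predicted == target else 0.0

def total_reward(completion: str, solution: str, caption_score: float,
                 caption_weight: float = 1.0) -> float:
    """Combine accuracy, format, and caption rewards for policy optimization.

    caption_score is assumed to come from an AI-feedback judge that rates the
    <caption> span; caption_weight is a hypothetical hyperparameter.
    """
    return (accuracy_reward(completion, solution)
            + format_reward(completion)
            + caption_weight * caption_score)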

Main Results🗒️

| Model | Size | Strategy | Data | MathVista | MathVision | MMStar | MMBench |
|---|---|---|---|---|---|---|---|
| Closed-source models | | | | | | | |
| GPT-4o | - | - | - | 63.8 | 31.2 | 65.1 | 84.3 |
| GPT-o1 | - | - | - | 71.8 | 63.2 | 67.5 | 83.8 |
| Claude3.5-Sonnet | - | - | - | 67.7 | 37.9 | 65.1 | 82.6 |
| Claude3.7-Sonnet | - | - | - | 74.5 | 58.6 | 68.8 | 82.0 |
| Gemini-1.5-Pro | - | - | - | 63.9 | 19.2 | 59.1 | 73.9 |
| Gemini-2.5-Pro | - | - | - | 82.7 | 73.3 | 77.5 | 90.1 |
| Open-source models | | | | | | | |
| Qwen2.5-VL | 3B | - | - | 62.3 | 21.2 | 55.9 | 79.1 |
| InternVL2.5 | 4B | - | - | 60.5 | 20.9 | 58.3 | 81.1 |
| MiniCPM-V2.6 | 8B | - | - | 60.6 | 17.5 | 57.5 | 81.5 |
| LLaMA3.2 | 11B | - | - | 51.5 | - | 49.8 | 65.8 |
| Reasoning models | | | | | | | |
| Ovis | 4B | SFT | CoT | 66.6 | - | 59.5 | 79.3 |
| Mulberry | 7B | SFT | CoT | 63.1 | - | 61.3 | - |
| R1-Onevision | 7B | SFT+RL | CoT | 64.1 | 29.9 | - | - |
| Insight-V | 7B | SFT+RL | CoT | 59.9 | - | 61.5 | 82.3 |
| R1-VL | 7B | SFT+RL | CoT | 63.5 | 24.7 | 60.0 | - |
| LLaVA-CoT | 11B | SFT | CoT | 54.8 | - | 57.6 | 75.0 |
| Our models | | | | | | | |
| Base Model | 3B | - | - | 61.5 | 19.1 | 52.4 | 82.1 |
| SFT | 3B | SFT | QA | 54.6 | 7.0 | 61.9 | 80.7 |
| GRPO | 3B | RL | QA | 61.8 | 20.3 | 54.3 | 78.6 |
| Visionary-R1 | 3B | RL | QA | 69.4 | 24.7 | 66.5 | 84.1 |

Setup📐

Environment

git clone git@github.com:maifoundations/Visionary-R1.git
cd Visionary-R1

# build environment
conda create -n visionary-r1 python=3.12
conda activate visionary-r1

pip install -e ".[all]"

Data Preparation

The training data should be organized in the following format.

{
    "solution": "<answer> Yes </answer>",
    "prompt": [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Is the value of Favorable 38 in 2015?"},
            ],
        },
    ],
    "problem_type": "numerical",
}
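
For reference, here is a minimal, hypothetical helper that wraps a (question, answer, problem_type) triple into a record of this shape. SYSTEM_PROMPT is a placeholder for the system prompt actually used in training, and how the paired image is stored depends on the dataset loader behind ${YOUR_DATA_PATH}.

# Hypothetical helper producing one training record in the format above.
# SYSTEM_PROMPT is a placeholder; substitute the prompt used for training.
SYSTEM_PROMPT = "You are a helpful assistant."  # placeholder, not the actual prompt

def make_record(question: str, answer: str, problem_type: str = "numerical") -> dict:
    return {
        "solution": f"<answer> {answer} </answer>",
        "prompt": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": question},
                ],
            },
        ],
        "problem_type": problem_type,
    }

record = make_record("Is the value of Favorable 38 in 2015?", "Yes")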

Training

To run Visionary-R1 with Qwen2.5-VL-3B:

torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" \
    --nnodes="${ARNOLD_WORKER_NUM}" \
    --node_rank="${ARNOLD_ID}" \
    --master_addr="${METIS_WORKER_0_HOST}" \
    --master_port="${port_in_cmd}" \
    src/open_r1/grpo_vllm_caption.py \
    --deepspeed scripts/zero3.json \
    --output_dir checkpoints/Visionary-R1 \
    --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
    --dataset_name ${YOUR_DATA_PATH} \
    --max_prompt_length 4096 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --logging_steps 1 \
    --learning_rate 5e-7 \
    --beta 0.04 \
    --bf16 \
    --report_to wandb \
    --gradient_checkpointing true \
    --attn_implementation flash_attention_2 \
    --max_pixels 2359296 \
    --save_total_limit 1 \
    --num_train_epochs 1 \
    --run_name Visionary-R1
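
After training (or after downloading the released checkpoint from Hugging Face), the model can be queried with the standard transformers pipeline for Qwen2.5-VL. The sketch below is a generic inference example, not a script from this repository; the model path and image path are placeholders, and the system prompt used in training (not reproduced here) should be prepended if the checkpoint expects it.

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Placeholder model path -- point this at the released checkpoint or your own
# checkpoints/Visionary-R1 output directory.
MODEL_ID = "checkpoints/Visionary-R1"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("example_chart.png")  # placeholder image path
messages = [
    # If the checkpoint expects the training-time system prompt, prepend it
    # here as a {"role": "system", ...} message.
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Is the value of Favorable 38 in 2015?"},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024)

generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])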

Citation🎓

@article{xia2025visionary,
  title={Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning},
  author={Xia, Jiaer and Zang, Yuhang and Gao, Peng and Li, Yixuan and Zhou, Kaiyang},
  journal={arXiv preprint arXiv:2505.14677},
  year={2025}
}

Acknowledgment

We learned from the design of, and reused code from, the following projects: [MM-EUREKA], [R1-V], [Video-R1].
