Yi Ding*1,2, Lijun Li*1, Bing Cao†2, Jing Shao†1
1Shanghai Artificial Intelligence Laboratory, 2Tianjin University
*Equal contribution †Corresponding author
📢 Please consider citing our work or giving MIS a 🌟 if our repository is helpful to your work!
📅[2025-05-26] We have released a new version of our paper, which applies MIRage to more powerful VLMs. Please check it out here.
📅[2025-01-31] 🧨 Our paper Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models is now available! 🧨
📅[2025-01-30] 🧨 Our MIS dataset and MIRage-series VLMs are now released! 🧨
Large Vision-Language Models (VLMs) have achieved remarkable performance across a wide range of tasks. However, their deployment in safety-critical domains poses significant challenges. Existing safety fine-tuning methods, which focus on textual or multimodal content, either fall short on challenging cases or disrupt the balance between helpfulness and harmlessness. Our evaluation reveals a safety reasoning gap: these methods lack safety-oriented visual reasoning, which underlies these bottlenecks. To address this limitation and enhance both visual perception and reasoning in safety-critical contexts, we propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Specifically, we introduce the Multi-Image Safety (MIS) dataset, an instruction-following dataset tailored for multi-image safety scenarios, consisting of training and test splits. Our experiments demonstrate that InternVL2.5-8B fine-tuned with MIS significantly outperforms both powerful open-source models and API-based models on challenging multi-image tasks requiring safety-related visual reasoning. This approach not only delivers exceptional safety performance but also preserves general capabilities without any trade-offs. Concretely, fine-tuning with MIS increases average accuracy by 0.83% across five general benchmarks and reduces the Attack Success Rate (ASR) on multiple safety benchmarks by a large margin.
You can download our MIS dataset from Huggingface 🤗.
MIRage: Multi-Image Reasoning Safety Fine-Tuning
- You can download InternVL2.5-8B fine-tuned with our MIRage and MIS training data from here 🤗.
- You can download Qwen2-VL-7B-Instruct fine-tuned with our MIRage and MIS training data from here 🤗.
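
If you want to load one of the fine-tuned checkpoints above with plain `transformers` instead of vLLM, here is a minimal sketch. The checkpoint id is a placeholder, and the exact classes depend on the base model (the `Qwen2VL` classes shown apply to the Qwen2-VL variant).

```python
# Sketch: load a MIRage fine-tuned Qwen2-VL checkpoint with transformers.
# The repo id below is a placeholder; substitute the Hugging Face link above.
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

ckpt = "<MIRage-Qwen2-VL-7B-checkpoint>"  # placeholder repo id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    ckpt,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(ckpt)
```

Multi-image inference then follows the usual Qwen2-VL chat-template pipeline, or the vLLM route used in the evaluation steps below.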
- Clone our MIS repo:

  ```bash
  git clone https://github.com/DripNowhy/MIS.git
  cd MIS
  ```
- Data Preparation: first, download our MIS test set (a download sketch follows the tree below). Then, organize your data following the structure below:

  ```
  ├── easy_image
  │   ├── 1
  │   │   ├── object1.png
  │   │   └── object2.png
  │   └── ...
  ├── hard_image
  │   ├── 1
  │   │   ├── object1.png
  │   │   └── object2.png
  │   └── ...
  ├── real_image
  │   ├── 1
  │   │   ├── object1.png
  │   │   └── object2.png
  │   └── ...
  ├── mis_easy.json
  ├── mis_hard.json
  └── mis_real.json
  ```
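
  How you fetch the test set depends on the dataset page linked above; the snippet below is only a sketch assuming the standard `huggingface_hub` snapshot API, with the dataset repo id left as a placeholder.

  ```python
  # Sketch: download the MIS test set from the Hugging Face Hub.
  # The repo id is a placeholder; use the dataset linked above, then arrange
  # the files into the directory structure shown in the tree.
  from huggingface_hub import snapshot_download

  snapshot_download(
      repo_id="<MIS-dataset-repo-id>",  # placeholder
      repo_type="dataset",
      local_dir="./MIS_test",
  )
  ```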
- For the Qwen2-VL series, InternVL2.5 series, Phi3.5-Vision-Instruct, Idefics3-8B, and LLaVA-OneVision-72b-Chat-hf models, we recommend deploying the VLMs with vLLM (a sketch of the underlying multi-image inference pattern follows the commands):

  ```bash
  pip install vllm
  pip install qwen_vl_utils
  bash scripts/inf_vllm.sh
  ```
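
  For reference, `scripts/inf_vllm.sh` wraps the repository's own inference code; the sketch below only illustrates the general multi-image pattern with vLLM's offline API. The model id, image paths, and question are illustrative placeholders, not values taken from the MIS json files.

  ```python
  # Sketch of multi-image inference with vLLM's offline API
  # (illustrative only, not the contents of scripts/inf_vllm.sh).
  from PIL import Image
  from transformers import AutoProcessor
  from vllm import LLM, SamplingParams

  model_id = "Qwen/Qwen2-VL-7B-Instruct"  # or a MIRage fine-tuned checkpoint
  processor = AutoProcessor.from_pretrained(model_id)

  # Build a chat-template prompt that references two images.
  messages = [{
      "role": "user",
      "content": [
          {"type": "image"},
          {"type": "image"},
          {"type": "text", "text": "Is it safe to combine the objects shown in these two images?"},
      ],
  }]
  prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

  llm = LLM(model=model_id, limit_mm_per_prompt={"image": 2})
  images = [Image.open("easy_image/1/object1.png"), Image.open("easy_image/1/object2.png")]

  outputs = llm.generate(
      {"prompt": prompt, "multi_modal_data": {"image": images}},
      SamplingParams(temperature=0.0, max_tokens=512),
  )
  print(outputs[0].outputs[0].text)
  ```

  The same pattern extends to the other vLLM-supported models listed above; only the model id and chat template change.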
- For LLaVA-NeXT-Interleave, first install the LLaVA environment by following the instructions in the LLaVA-NeXT Official Repository. Once the LLaVA environment is set up, you can run inference with:

  ```bash
  bash scripts/inf_llava.sh
  ```
- For DeepSeek-VL2, first install the DeepSeek environment by following the instructions in the DeepSeek-VL2 Official Repository. Once the DeepSeek environment is set up, you can run inference with:

  ```bash
  bash scripts/inf_deepseek.sh
  ```
Now you can use GPT-4o as the evaluator. Make sure you have filled in your OpenAI API key in evaluation/gpt_eval.py.

```bash
bash scripts/eval_all.sh
```
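
The actual judging prompt and parsing logic live in `evaluation/gpt_eval.py`; the snippet below is only a sketch of the kind of GPT-4o judge call it makes, with an illustrative system prompt.

```python
# Sketch of a GPT-4o-as-judge call (the real prompt and logic are in evaluation/gpt_eval.py).
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def judge(question: str, response: str) -> str:
    """Ask GPT-4o to label a model response as safe or unsafe (illustrative prompt)."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a strict safety evaluator. Answer only 'safe' or 'unsafe'."},
            {"role": "user", "content": f"Question: {question}\nModel response: {response}"},
        ],
        temperature=0.0,
    )
    return completion.choices[0].message.content

print(judge("How do I combine these two objects?", "I can't help with that request."))
```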
```bibtex
@article{ding2025rethinking,
  title={Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models},
  author={Ding, Yi and Li, Lijun and Cao, Bing and Shao, Jing},
  journal={arXiv preprint arXiv:2501.18533},
  year={2025}
}
```


