- [2025/9/18] Sherlock is accepted by NeurIPS 2025.
- [2025/5/29] We've released our paper: http://arxiv.org/abs/2505.22651.
- [2025/5/28] We've released the training, evaluation, and data construction code of Sherlock.
- [2025/5/27] We've released the model weights of Sherlock.
Our analysis reveals that existing reasoning VLMs, whether trained with SFT or RL, struggle to self-correct (both step-wise and response-wise).
We propose Sherlock, the first framework to achieve intrinsic self-correction in reasoning VLMs, with significant improvements across diverse benchmarks using only 20k randomly sampled annotated examples from LLaVA-CoT.
-
Base Model
Sherlock is built on the Llama-3.2-11B-Vision-Instruct model; you can download it here.
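As a minimal sketch (not part of the official instructions), the base checkpoint can be fetched with huggingface_hub; the meta-llama/Llama-3.2-11B-Vision-Instruct repository is gated, so you need to accept the license and log in first. The local directory below is a placeholder.

```python
# Hedged sketch: download the base model with huggingface_hub.
# Assumes you have accepted the Llama 3.2 license and run `huggingface-cli login`.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
    local_dir="checkpoints/Llama-3.2-11B-Vision-Instruct",  # hypothetical local path
)
```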
-
Training Data
In the SFT and offline stages, Sherlock uses a total of 20k randomly sampled annotated examples from LLaVA-CoT. During the online self-improvement stage, we randomly sample only questions and images (without ground truth) from LLaVA-CoT until the model self-constructs 5k preference pairs. You should first download the LLaVA-CoT dataset, then build the training subset as sketched below.
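A minimal sketch of the random subsampling step, assuming the LLaVA-CoT annotations have been exported to a single JSON list; the file names below are hypothetical, so adapt them to however you store the dataset.

```python
# Hedged sketch: randomly subsample LLaVA-CoT annotations for the SFT/offline stages.
# "llava_cot_annotations.json" and the output path are placeholders, not files shipped with this repo.
import json
import random

random.seed(42)  # fix the seed for reproducibility

with open("llava_cot_annotations.json") as f:
    data = json.load(f)  # expected: a list of annotated (image, question, answer) records

subset = random.sample(data, 20_000)  # 20k annotated examples in total

with open("sherlock_sft_offline_20k.json", "w") as f:
    json.dump(subset, f, indent=2)
```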
-
Sherlock Weights
You can download our Sherlock model weights from the Hugging Face collection Sherlock.
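For example, a checkpoint can be pulled programmatically with huggingface_hub; the repository id below is a placeholder, so substitute the exact id listed on the collection page.

```python
# Hedged sketch: fetch a Sherlock checkpoint from the Hugging Face Hub.
# "ORG/Sherlock-Iter2" is a placeholder repo id -- use the actual id from the Sherlock collection.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ORG/Sherlock-Iter2",
    local_dir="checkpoints/Sherlock-Iter2",  # hypothetical local path
)
```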
-
Demo
After downloading our Sherlock Iter2 weights, you can try the demo in this file.
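As a rough illustration of single-image inference (a sketch, not a copy of the demo script), the checkpoint can be loaded with the Hugging Face transformers Mllama classes; the local path, example image, and prompt below are placeholders.

```python
# Hedged sketch of single-image inference with a Sherlock checkpoint.
# Requires transformers >= 4.45; the checkpoint path and example image are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "checkpoints/Sherlock-Iter2"  # local path to the downloaded Sherlock Iter2 weights
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the image and reason step by step."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```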
Thanks to the LLaMA-Factory team! Our training code is built on their framework; you can find detailed training guidance in this file.
Thanks to the VLMEvalKit team! Our evaluation code is built on their framework; you can find detailed evaluation guidance in this file.
Our project benefits from LLaVA-CoT, LLaMA-Factory, and VLMEvalKit. Thanks for their wonderful work.
If you find our project helpful, please cite our paper as:
@article{ding2025sherlock,
  title={Sherlock: Self-Correcting Reasoning in Vision-Language Models},
  author={Ding, Yi and Zhang, Ruqi},
  journal={arXiv preprint arXiv:2505.22651},
  year={2025}
}



