- [2025/9/18] Sherlock is accepted by NeurIPS 2025.
- [2025/5/29] We've released our paper: http://arxiv.org/abs/2505.22651.
- [2025/5/28] We've released the training, evaluation, and data construction code of Sherlock.
- [2025/5/27] We've released the model weights of Sherlock.
Our analysis reveals that existing reasoning VLMs, whether trained with SFT or RL, struggle to self-correct (both step-wise and response-wise).
We propose Sherlock, the first framework to achieve intrinsic self-correction in reasoning VLMs, with significant improvements across diverse benchmarks using only 20k randomly sampled annotated examples from LLaVA-CoT.
-
Base Model
Sherlock is built on the Llama-3.2-11B-Vision-Instruct model; you can download it here.
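As a minimal sketch (not part of the official instructions), the base checkpoint can be fetched with huggingface_hub; the meta-llama/Llama-3.2-11B-Vision-Instruct repository is gated, so you need to accept the license and log in first. The local directory below is a placeholder.

```python
# Hedged sketch: download the base model with huggingface_hub.
# Assumes you have accepted the Llama 3.2 license and run `huggingface-cli login`.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
    local_dir="checkpoints/Llama-3.2-11B-Vision-Instruct",  # hypothetical local path
)
```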
-
Training Data
In the SFT and offline stages, Sherlock uses a total of 20k randomly sampled annotated examples from LLaVA-CoT. During the online self-improvement stage, we randomly sample only questions and images (without ground truth) from LLaVA-CoT until the model self-constructs 5k preference pairs. You should first download the LLaVA-CoT dataset, then build the training subset as sketched below.
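A minimal sketch of the random subsampling step, assuming the LLaVA-CoT annotations have been exported to a single JSON list; the file names below are hypothetical, so adapt them to however you store the dataset.

```python
# Hedged sketch: randomly subsample LLaVA-CoT annotations for the SFT/offline stages.
# "llava_cot_annotations.json" and the output path are placeholders, not files shipped with this repo.
import json
import random

random.seed(42)  # fix the seed for reproducibility

with open("llava_cot_annotations.json") as f:
    data = json.load(f)  # expected: a list of annotated (image, question, answer) records

subset = random.sample(data, 20_000)  # 20k annotated examples in total

with open("sherlock_sft_offline_20k.json", "w") as f:
    json.dump(subset, f, indent=2)
```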
-
Sherlock Weights
You can download our Sherlock model weights from the Hugging Face collection Sherlock.
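For example, a checkpoint can be pulled programmatically with huggingface_hub; the repository id below is a placeholder, so substitute the exact id listed on the collection page.

```python
# Hedged sketch: fetch a Sherlock checkpoint from the Hugging Face Hub.
# "ORG/Sherlock-Iter2" is a placeholder repo id -- use the actual id from the Sherlock collection.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ORG/Sherlock-Iter2",
    local_dir="checkpoints/Sherlock-Iter2",  # hypothetical local path
)
```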
-
Demo
After downloading our Sherlock Iter2 weights, you can try the demo in this file.
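As a rough illustration of single-image inference (a sketch, not a copy of the demo script), the checkpoint can be loaded with the Hugging Face transformers Mllama classes; the local path, example image, and prompt below are placeholders.

```python
# Hedged sketch of single-image inference with a Sherlock checkpoint.
# Requires transformers >= 4.45; the checkpoint path and example image are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "checkpoints/Sherlock-Iter2"  # local path to the downloaded Sherlock Iter2 weights
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the image and reason step by step."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```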
Thanks to the LLaMA-Factory team! Our training code is built on their framework; you can find detailed training guidance in this file.
Thanks to the VLMEvalKit team! Our evaluation code is built on their framework; you can find detailed evaluation guidance in this file.
Our project benefits from LLaVA-CoT, LLaMA-Factory, and VLMEvalKit. Thanks for their wonderful work.
If you find our project helpful, please cite our paper as:
@article{ding2025sherlock,
  title={Sherlock: Self-Correcting Reasoning in Vision-Language Models},
  author={Ding, Yi and Zhang, Ruqi},
  journal={arXiv preprint arXiv:2505.22651},
  year={2025}
}



