DripNowhy/Sherlock


If you find our project helpful, please consider giving us a star ⭐ on GitHub!


📣 News

🔥 Highlights

Sherlock Self-correction

Our analysis reveals that existing reasoning VLMs, whether trained with SFT or RL, struggle to self-correct (both step-wise and response-wise).

Sherlock Pipeline

We propose Sherlock, the first framework to achieve intrinsic self-correction in reasoning VLMs, delivering significant improvements across diverse benchmarks while using only 20k randomly sampled annotated examples from LLaVA-CoT.

Sherlock Result

🔧 Usage

Preparation

  • Base Model

    Sherlock is built on the Llama3.2-Vision-11B-Instruct model; you can download it here.

  • Training Data

    For the SFT and offline stages, Sherlock randomly samples a total of 20k annotated examples from LLaVA-CoT. During the online self-improvement stage, we randomly sample only questions and images (without ground truth) from LLaVA-CoT until the model self-constructs 5k preference pairs. You should first download the LLaVA-CoT dataset.

  • Sherlock Weights

    You can download our Sherlock model weights from the Hugging Face collection Sherlock.

  • Demo

    After downloading our Sherlock Iter2 weights, you can try the demo in this file.
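The data-preparation step above (drawing 20k annotated examples at random from LLaVA-CoT) can be sketched as follows. This is a minimal illustration, not the repo's actual script; the file name and record layout are assumptions.

```python
import json
import random

def sample_subset(examples, k, seed=42):
    """Randomly sample k examples without replacement.

    Deterministic for a fixed seed, so the subset is reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(examples, k)

# Hypothetical usage, assuming LLaVA-CoT annotations stored as a JSON list:
# with open("llava_cot_annotations.json") as f:
#     data = json.load(f)
# sft_subset = sample_subset(data, 20_000)
```

Fixing the seed keeps the 20k subset reproducible across runs, which matters when comparing the SFT and offline stages on the same data.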

Training

Thanks to the LLaMA-Factory team: our training code is built on their framework. You can find detailed training guidance in this file.
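As an illustrative sketch only (the config path is hypothetical; the actual recipes are in the training guide linked above), LLaMA-Factory runs are typically launched through its CLI with a YAML config:

```shell
# Hypothetical config path -- see the repo's training guide for the real recipe.
# A LLaMA-Factory config typically sets model_name_or_path, stage (sft/dpo),
# dataset (registered in dataset_info.json), template, and output_dir.
llamafactory-cli train configs/sherlock_sft.yaml
```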

Evaluation

Thanks to the VLMEvalKit team: our evaluation code is built on their framework. You can find detailed evaluation guidance in this file.
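VLMEvalKit evaluations are usually launched via its `run.py` entry point. The invocation below is a sketch, not the repo's exact command: the model identifier is an assumption (a custom model must first be registered in VLMEvalKit's model config), and the benchmark name is just one example.

```shell
# Illustrative VLMEvalKit invocation; the model name "sherlock_iter2" is
# hypothetical and would need to be registered in VLMEvalKit's config first.
python run.py --data MMStar --model sherlock_iter2 --verbose
```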

🎉 Acknowledgement

Our project benefits from LLaVA-CoT, LLaMA-Factory, and VLMEvalKit. Thanks for their wonderful work.

📃 Citation

If you find our project helpful, please cite our paper as:

@article{ding2025sherlock,
  title={Sherlock: Self-Correcting Reasoning in Vision-Language Models},
  author={Ding, Yi and Zhang, Ruqi},
  journal={arXiv preprint arXiv:2505.22651},
  year={2025}
}

About

[NeurIPS 2025] Official Implementation of paper "Sherlock: Self-Correcting Reasoning in Vision-Language Models"
