Interpretable Open-Vocabulary Referring Object Detection with
Reverse Contrast Attention
Drandreb Earl O. Juanico1, Rowel O. Atienza1,2, Jeffrey Kenneth Go3
1AI Graduate Program, University of the Philippines Diliman, Quezon City
2EEEI, University of the Philippines Diliman, Quezon City
3Samsung R&D Institute Philippines
[email protected],
[email protected] •
[email protected] •
[email protected]
Reverse Contrast Attention (RCA) is a simple add-on that helps vision-language models pick out the right object in an image more accurately, without any retraining. It rescales the model's attention so that subtle but informative signals count for more than extreme ones. On a benchmark where the model must localize any object mentioned in a sentence, RCA raised accuracy in most models, sometimes by more than 25 percent. It is especially helpful for systems that fuse image and text late in their processing, though other designs also see gains.
Bottom line: RCA makes these multimodal AIs both easier to interpret and better at their job.
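As a rough intuition, the attention tweak can be pictured as a reweighting of each query's attention distribution that pulls weight away from extreme values and toward moderate ones. The sketch below is purely illustrative and is not the paper's exact formulation: the function name, the min-max normalization, and the `alpha` exponent are all assumptions made for the example.

```python
import torch

def rca_reweight(attn_probs: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Illustrative RCA-style reweighting (NOT the paper's exact formula).

    Dampens the relative influence of extreme attention values and boosts
    moderate ones, then renormalizes each query row to sum to 1.
    attn_probs: (..., num_queries, num_keys) softmax probabilities.
    """
    # Min-max normalize each row so the contrast transform is scale-free.
    lo = attn_probs.amin(dim=-1, keepdim=True)
    hi = attn_probs.amax(dim=-1, keepdim=True)
    norm = (attn_probs - lo) / (hi - lo + 1e-8)
    # Reverse the contrast: values near the extremes (0 or 1) receive the
    # smallest boost, values near the middle receive the largest.
    boost = 1.0 - (2.0 * norm - 1.0).abs() ** alpha
    reweighted = attn_probs * (1.0 + boost)
    # Renormalize so the result is still a valid attention distribution.
    return reweighted / reweighted.sum(dim=-1, keepdim=True)
```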
Attention is More Than You Need
Results
Evaluation on the COCO 2017 validation set (rare classes), using FitAP as the metric
| Model | Params | LLM | Vision Encoder | FitAP (pre-RCA) | FitAP (post-RCA) | % Change |
|---|---|---|---|---|---|---|
| Ovis2-34B | 34.9B | Qwen2.5-32B | AIMv2-1B | 3.23869 | 3.52222 | +8.75 |
| SAIL-VL-1.6-8B | 8.33B | Qwen2.5-7B | AIMv2 Huge | 4.84873 | 5.67149 | +17.0 |
| WeThink-Qwen2.5VL-7B | 8.29B | Qwen2.5-7B | QwenViT | 39.9640 | 37.7606 | -5.51 |
| Qwen2.5-VL-7B | 8.29B | Qwen2.5-7B | QwenViT | 37.0005 | 46.8535 | +26.6 |
| MiniCPM-o-2.6 | 8.67B | Qwen2.5-7B | SigLIP-400M | 0.03064 | 0.07334 | +139 |
| valley2.dpo | 8.88B | Qwen2.5-7B | SigLIP-400M | 11.5145 | 11.6927 | +1.55 |
| Kimi-VL-A3B | 16.4B | Moonlight-16B-A3B | MoonViT | 30.7194 | 32.2176 | +4.88 |
| Ristretto-3B | 3.84B | Qwen2.5-3B | SigLIP-400M | 9.12887 | 7.94552 | -13.0 |
| POINTS1.5-Qwen2.5-7B | 8.3B | Qwen2.5-7B | NaViT | 9.75203 | 9.45686 | -3.03 |
| Valley-Eagle | 8.9B | Qwen2.5-7B | SigLIP-400M | 11.7736 | 11.2598 | -4.36 |
| Gemma3-27B | 27.4B | Gemma3-27B | SigLIP-400M | 2.74179 | 3.01913 | +10.1 |
| VARCO-VISION-14B | 15.2B | Qwen2.5-14B | SigLIP-400M | 27.3592 | 28.7003 | +4.90 |
| DeepSeek-VL2 | 27.5B | DeepSeekMoE-27B | SigLIP-400M | 3.38530 | 3.99586 | +18.0 |
| PaliGemma2-3B-mix-448 | 3B | Gemma2-2B | SigLIP-400M | 38.7982 | 41.1179 | +5.98 |
| Moondream2 | 1.9B | Phi-1.5 | SigLIP-400M | 47.0039 | 47.0819 | +0.17 |
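The % Change column is the relative FitAP gain, `100 * (post - pre) / pre`. A quick check against the Qwen2.5-VL-7B row:

```python
pre, post = 37.0005, 46.8535
pct_change = 100 * (post - pre) / pre
print(f"{pct_change:+.1f}")  # +26.6
```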
Example RCA-based improvements on a few VLMs, without retraining or fine-tuning:
Figure 1. RCA effect on Kimi-VL-A3B
Figure 2. RCA effect on PaliGemma2-3B-mix-448
Figure 3. RCA effect on Qwen2.5-VL-7B
- We removed the `*.safetensors` and `*.gguf` files originally found in the respective Hugging Face repository of each model. You may re-download them to the appropriate model directory using `wget <Huggingface-repo>/file`.
- The code has been tested on a DGX x86_64 system using `conda` environments. Please install Anaconda3 first.
- Clone this repository with `git clone https://github.com/earl-juanico/rca.git`, then `cd rca`.
- Run the notebook `Test-<vlm>` for the `vlm` of interest. These notebooks are found in `./exploratory`. A sketch of where RCA slots into an attention layer is shown after this list.
- You may need to install the correct conda environment settings; see the top cell of each notebook.
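Because RCA only post-processes attention probabilities, it can be inserted into a forward pass at inference time with no weight updates. The generic sketch below (the function name is hypothetical, and `rca_fn` could be the `rca_reweight` sketch above) shows the one structural constraint: the reweighting must act on the probabilities before they mix the value vectors, since modifying them afterward would not change the layer's output.

```python
import torch
import torch.nn.functional as F

def sdpa_with_rca(q, k, v, rca_fn):
    """Scaled dot-product attention with a plug-in reweighting step.

    rca_fn post-processes the softmax probabilities *before* they
    aggregate the value vectors. No learned weights are modified, so
    the patch works at inference time without retraining.
    q, k, v: (..., seq_len, head_dim) tensors.
    """
    scale = q.size(-1) ** -0.5
    probs = F.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    probs = rca_fn(probs)  # plug-in step, e.g. the rca_reweight sketch
    return probs @ v

# Toy check with random single-head tensors (batch=1, seq=4, dim=8).
q, k, v = (torch.randn(1, 4, 8) for _ in range(3))
out = sdpa_with_rca(q, k, v, rca_fn=lambda p: p)  # identity = vanilla SDPA
```

Each VLM names and structures its attention modules differently (QwenViT, SigLIP, MoonViT, and so on), which is presumably why the repository provides one `Test-<vlm>` notebook per model rather than a single generic patch.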
You may read an additional description of RCA in this blog.
If you use this work, please cite:
@article{juanico2025interpretable,
title={Interpretable Open-Vocabulary Referring Object Detection with Reverse Contrast Attention},
author={Juanico, Drandreb Earl O and Atienza, Rowel O and Go, Jeffrey Kenneth},
journal={arXiv preprint arXiv:2507.19891},
year={2025}
}
