
# Interpretable Open-Vocabulary Referring Object Detection with Reverse Contrast Attention

Drandreb Earl O. Juanico¹, Rowel O. Atienza¹˒², Jeffrey Kenneth Go³
¹AI Graduate Program, University of the Philippines Diliman, Quezon City
²EEEI, University of the Philippines Diliman, Quezon City
³Samsung R&D Institute Philippines
[email protected], [email protected], [email protected], [email protected]

*Illustration of reverse contrast attention*

## Summary (from the full paper on arXiv)

Reverse Contrast Attention (RCA) is a simple add-on that makes vision-language models pick out the right object in an image more accurately, without any retraining. It reweights the model's attention so that subtle but useful signals matter more than extreme ones. On a benchmark where the model must localize any object mentioned in a sentence, RCA raised accuracy in most models, sometimes by more than 25 percent. It is especially helpful for systems that fuse image and text late in their processing, though other architectures still see gains.

Bottom line: RCA makes these multimodal AIs both easier to interpret and better at their job.
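To make the idea concrete, here is a minimal, purely illustrative sketch of "reversing the contrast" of an attention row: each weight is reflected about the row's min/max midpoint, so the subtlest weights become the strongest, and the row is then renormalized. This is a hypothetical interpretation for intuition only, not the exact formulation from the paper (see the arXiv preprint for that).

```python
import numpy as np

def reverse_contrast(attn: np.ndarray) -> np.ndarray:
    """Illustrative contrast reversal of attention weights, per row.

    Reflects each weight about the row's midpoint (hi + lo - w),
    so the smallest weights become the largest, then renormalizes
    each row to sum to 1.
    """
    lo = attn.min(axis=-1, keepdims=True)
    hi = attn.max(axis=-1, keepdims=True)
    flipped = (hi + lo) - attn                      # reflect about midpoint
    return flipped / flipped.sum(axis=-1, keepdims=True)

attn = np.array([[0.70, 0.20, 0.10]])               # one attention row
out = reverse_contrast(attn)
print(out)  # the formerly smallest weight is now the largest
```

The reflection preserves the ordering information of the row while inverting which tokens dominate, which matches the summary's description of letting subtle signals outweigh extreme ones.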

## Attention is More Than You Need

## Results

Evaluation on the COCO 2017 validation set, restricted to rare classes, using FitAP as the metric.

Table 1. Selected VLMs from OC-MMAL, their architectures, and FitAP before and after applying RCA.

| Model | Params | LLM | Vision | pre-RCA | post-RCA | % Change |
|-------|--------|-----|--------|---------|----------|----------|
| Ovis2-34B | 34.9B | Qwen2.5-32B | AIMv2-1B | 3.23869 | 3.52222 | +8.75 |
| SAIL-VL-1.6-8B | 8.33B | Qwen2.5-7B | AIMv2 Huge | 4.84873 | 5.67149 | +17.0 |
| WeThink-Qwen2.5VL-7B | 8.29B | Qwen2.5-7B | QwenViT | 39.9640 | 37.7606 | -5.51 |
| Qwen2.5-VL-7B | 8.29B | Qwen2.5-7B | QwenViT | 37.0005 | 46.8535 | +26.6 |
| MiniCPM-o-2.6 | 8.67B | Qwen2.5-7B | SigLIP-400M | 0.03064 | 0.07334 | +139 |
| valley2.dpo | 8.88B | Qwen2.5-7B | SigLIP-400M | 11.5145 | 11.6927 | +1.55 |
| Kimi-VL-A3B | 16.4B | Moonlight-16B-A3B | MoonViT | 30.7194 | 32.2176 | +4.88 |
| Ristretto-3B | 3.84B | Qwen2.5-3B | SigLIP-400M | 9.12887 | 7.94552 | -13.0 |
| POINTS1.5-Qwen2.5-7B | 8.3B | Qwen2.5-7B | NaViT | 9.75203 | 9.45686 | -3.03 |
| Valley-Eagle | 8.9B | Qwen2.5-7B | SigLIP-400M | 11.7736 | 11.2598 | -4.36 |
| Gemma3-27B | 27.4B | Gemma3-27B | SigLIP-400M | 2.74179 | 3.01913 | +10.1 |
| VARCO-VISION-14B | 15.2B | Qwen2.5-14B | SigLIP-400M | 27.3592 | 28.7003 | +4.90 |
| DeepSeek-VL2 | 27.5B | DeepSeekMoE-27B | SigLIP-400M | 3.38530 | 3.99586 | +18.0 |
| PaliGemma2-3B-mix-448 | 3B | Gemma2-2B | SigLIP-400M | 38.7982 | 41.1179 | +5.98 |
| Moondream2 | 1.9B | Phi-1.5 | SigLIP-400M | 47.0039 | 47.0819 | +0.17 |
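The "% Change" column is the relative change in FitAP from pre-RCA to post-RCA. A one-line helper reproduces it, checked here against the Qwen2.5-VL-7B row:

```python
def pct_change(pre: float, post: float) -> float:
    """Percent change in FitAP after applying RCA."""
    return (post - pre) / pre * 100

# Qwen2.5-VL-7B row from Table 1
print(round(pct_change(37.0005, 46.8535), 1))  # 26.6
```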

Example RCA-based improvements on a few VLMs, without retraining or fine-tuning:

*Figure 1. RCA effect on Kimi-VL-A3B*

*Figure 2. RCA effect on PaliGemma2-3B-mix-448*

*Figure 3. RCA effect on Qwen2.5-VL-7B*

## Instructions

  1. We removed the `*.safetensors` and `*.gguf` files originally found in each model's Hugging Face repository. Re-download them into the appropriate model directory with `wget <Huggingface-repo>/file`.
  2. The code has been tested on a DGX x86_64 system using conda environments. Please install Anaconda3 first.
  3. Clone this repository: `git clone https://github.com/earl-juanico/rca.git`
  4. `cd rca`
  5. Run the notebook `Test-<vlm>` for the VLM of interest. These notebooks are found in `./exploratory`.
  6. You may need to install the correct conda environment settings. See the top cell of the notebook.
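Steps 3 through 6 above amount to a short shell session. The environment name and Python version below are illustrative; the top cell of each notebook specifies the actual packages to install:

```shell
# Clone the repository and enter it
git clone https://github.com/earl-juanico/rca.git
cd rca

# Create and activate a conda environment (name and version are
# illustrative; see the top cell of each notebook for specifics)
conda create -n rca python=3.10 -y
conda activate rca

# Open the notebooks for the VLMs of interest
jupyter notebook exploratory/
```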

## Additional details

You may read a more detailed description of RCA in this blog.

## Cite

If you use this work, please cite:

@article{juanico2025interpretable,
  title={Interpretable Open-Vocabulary Referring Object Detection with Reverse Contrast Attention},
  author={Juanico, Drandreb Earl O and Atienza, Rowel O and Go, Jeffrey Kenneth},
  journal={arXiv preprint arXiv:2507.19891},
  year={2025}
}
