Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models

This is the official implementation of the paper: "Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models."

Abstract

Detailed image captioning is essential for tasks like data generation and aiding visually impaired individuals. High-quality captions require a balance between precision and recall, which remains challenging for current multimodal large language models (MLLMs). In this work, we hypothesize that this limitation stems from weakening and increasingly noisy visual attention as responses lengthen. To address this issue, we propose SPARC (Selective Progressive Attention ReCalibration), a training-free method that enhances the contribution of visual tokens during decoding. SPARC is founded on three key observations: (1) increasing the influence of all visual tokens reduces recall; thus, SPARC selectively amplifies visual tokens; (2) as captions lengthen, visual attention becomes noisier, so SPARC identifies critical visual tokens by leveraging attention differences across time steps; (3) as visual attention gradually weakens, SPARC reinforces it to preserve its influence. Our experiments, incorporating both automated and human evaluations, demonstrate that existing methods improve the precision of MLLMs at the cost of recall. In contrast, our proposed method enhances both precision and recall with minimal computational overhead.
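The three observations above can be illustrated with a small, self-contained sketch. This is not the code shipped in this repository: it is a simplified toy under stated assumptions, working on plain lists of attention weights rather than on a model. The function name `recalibrate_step` and the parameters `alpha` (amplification factor), `tau` (selection threshold), and `beta` (smoothing factor for the running reference) are hypothetical names chosen for illustration.

```python
def recalibrate_step(attn, ema, scale, alpha=1.1, tau=1.5, beta=0.9, eps=1e-8):
    """One decoding step of a simplified selective, progressive recalibration.

    attn  : current attention weights over the visual tokens
    ema   : running average of past attention, used as the reference
            when measuring attention differences across time steps
    scale : per-token cumulative amplification factors
    Returns (boosted_attn, new_ema, new_scale, selected_indices).
    """
    # Selective: pick only tokens whose attention rose sharply
    # relative to the running reference.
    ratio = [a / (m + eps) for a, m in zip(attn, ema)]
    selected = [i for i, r in enumerate(ratio) if r > tau]

    # Progressive: amplification accumulates each time a token is selected,
    # so its influence is reinforced as the caption gets longer.
    new_scale = list(scale)
    for i in selected:
        new_scale[i] *= alpha

    # Apply the cumulative factors to the current attention weights.
    boosted = [a * s for a, s in zip(attn, new_scale)]

    # Update the running reference with the unmodified attention.
    new_ema = [beta * m + (1 - beta) * a for m, a in zip(ema, attn)]
    return boosted, new_ema, new_scale, selected


if __name__ == "__main__":
    # Toy attention over four visual tokens at three decoding steps.
    steps = [
        [0.25, 0.25, 0.25, 0.25],
        [0.10, 0.60, 0.15, 0.15],
        [0.10, 0.20, 0.55, 0.15],
    ]
    ema, scale = list(steps[0]), [1.0] * 4
    for t, attn in enumerate(steps[1:], start=1):
        boosted, ema, scale, selected = recalibrate_step(attn, ema, scale)
        print(f"step {t}: selected={selected} scale={scale}")
```

In this toy run only the token whose attention jumps at a given step is amplified, and a token selected earlier keeps its accumulated factor at later steps. How the actual method chooses layers, thresholds, and normalization is described in the paper.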

Installation

# Create a conda environment
conda create -n sparc python=3.10 -y
conda activate sparc

# LLaVA installation
cd LLaVA
pip install --upgrade pip  # Enables PEP 660 support
pip install -e .

# Install additional dependencies
cd ..
pip install -r requirments.txt  # the file in the repository is spelled "requirments.txt"
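After installation, a quick check that the key packages are importable can save a failed run later. The snippet below is a minimal sketch using only the standard library. The module names `llava` and `torch` are assumptions about what the steps above install; adjust the list to match your environment.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # "llava" and "torch" are assumed module names, not confirmed by this README.
    missing = missing_modules(["llava", "torch"])
    print("missing:", missing or "none")
```

Run it inside the activated `sparc` environment; an empty result means the listed modules were found.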

Evaluation

IIW-400 Dataset (CLAIR Evaluation)

Download the dataset from IIW-400.
Edit the dataset path in scripts/llava/captioning_iiw400.sh.
Run caption generation:
```
bash scripts/llava/captioning_iiw400.sh
```
Update the dataset path in scripts/clair_iiw_eval.sh.
Run CLAIR evaluation:
```
bash scripts/clair_iiw_eval.sh
```

DOCCI Dataset (CLAIR Evaluation)

Download the dataset from DOCCI.
Edit the dataset path in scripts/llava/captioning_docci.sh.
Run caption generation:
```
bash scripts/llava/captioning_docci.sh
```
Update the dataset path in scripts/clair_docci_eval.sh.
Run CLAIR evaluation:
```
bash scripts/clair_docci_eval.sh
```

CHAIR Evaluation

Download the COCO 2014 validation images and annotation files.
Edit the dataset path in scripts/llava/captioning_coco.sh.
Run caption generation:
```
bash scripts/llava/captioning_coco.sh
```
Update the dataset path in scripts/chair_eval.sh.
Run CHAIR evaluation:
```
bash scripts/chair_eval.sh
```
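For readers unfamiliar with the metric, CHAIR is commonly reported in two forms: CHAIR_i, the fraction of mentioned objects that are not in the image, and CHAIR_s, the fraction of captions containing at least one such object. The sketch below illustrates the arithmetic only and is not the evaluation script used here. It assumes object names have already been extracted from each caption and counts unique objects per caption, with no synonym mapping. The function name `chair_scores` and the recall definition (ground-truth objects covered by the caption) are assumptions for illustration.

```python
def chair_scores(mentioned, ground_truth):
    """Compute simplified CHAIR_i, CHAIR_s, and object recall.

    mentioned / ground_truth: lists with one set of object names per image.
    """
    total_mentions = hallucinated_mentions = captions_with_halluc = 0
    covered = total_gt = 0
    for pred, gt in zip(mentioned, ground_truth):
        halluc = pred - gt                      # objects named but not in the image
        total_mentions += len(pred)
        hallucinated_mentions += len(halluc)
        captions_with_halluc += bool(halluc)    # counts 1 if any hallucination
        covered += len(pred & gt)               # ground-truth objects mentioned
        total_gt += len(gt)
    n = len(mentioned)
    return {
        "CHAIR_i": hallucinated_mentions / max(total_mentions, 1),
        "CHAIR_s": captions_with_halluc / max(n, 1),
        "recall": covered / max(total_gt, 1),
    }


if __name__ == "__main__":
    mentioned = [{"dog", "frisbee", "car"}, {"cat", "sofa"}]
    ground_truth = [{"dog", "frisbee", "person"}, {"cat", "sofa", "tv"}]
    print(chair_scores(mentioned, ground_truth))
```

In this toy example one of five mentioned objects is hallucinated (CHAIR_i = 0.2), one of two captions contains a hallucination (CHAIR_s = 0.5), and four of six ground-truth objects are covered. Lower CHAIR values and higher recall are better.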

Acknowledgement

This project is based on the LLaVA codebase.
The evaluation code is based on CHAIR and CLAIR.

Citation

If you find this work useful for your research, please cite:

@misc{jung2025visualattentionfadesselective,
  title={Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models},
  author={Mingi Jung and Saehyung Lee and Eunji Kim and Sungroh Yoon},
  year={2025},
  eprint={2502.01419},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.01419}
}
