SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

This repository contains the PyTorch implementation of SparseMM (ICCV 2025).

Project Page | arXiv Paper

Introducing SparseMM

We investigate how MLLMs process visual inputs by analyzing their attention mechanisms and reveal a surprising sparsity phenomenon: only a small subset (less than about 5%) of attention heads in the LLM actively contributes to visual understanding. We term these Visual Heads. To identify them efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis.
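
The snippet below is a minimal sketch of the idea behind head-level visual-relevance scoring, assuming a simple "attention mass on visual tokens" metric; the function and tensor names are hypothetical and this is not the repository's exact scoring code.

import torch

def visual_head_scores(attn, visual_token_mask):
    # attn: (layers, heads, queries, keys) attention weights gathered while the
    # model responds to a query about a visual concept.
    # visual_token_mask: (keys,) bool mask that is True for image tokens.
    mass_on_visual = attn[..., visual_token_mask].sum(dim=-1)  # mass on image tokens per query
    return mass_on_visual.mean(dim=-1)  # average over query positions -> (layers, heads)

# Toy example with random attention: rank heads and keep the top ~5% as visual heads.
L, H, Q, K = 4, 8, 16, 128
attn = torch.softmax(torch.randn(L, H, Q, K), dim=-1)
mask = torch.zeros(K, dtype=torch.bool)
mask[:64] = True  # pretend the first 64 key positions are image tokens
scores = visual_head_scores(attn, mask)
top_k = max(1, int(0.05 * scores.numel()))
top = torch.topk(scores.flatten(), top_k).indices
print("visual heads (layer, head):", [(i // H, i % H) for i in top.tolist()])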

Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in the LLM based on their visual scores, leveraging the sparsity of visual heads to accelerate MLLM inference. Unlike prior KV-Cache acceleration methods that ignore the particularity of visual information, SparseMM prioritizes retaining visual semantics during decoding.
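
Below is a minimal sketch of what score-based, asymmetric KV-Cache budgeting can look like, assuming a simple proportional policy with a small per-head floor; the function name and policy details are illustrative, not SparseMM's exact implementation.

import torch

def allocate_kv_budget(scores, total_budget, min_per_head=4):
    # scores: (heads,) visual scores for one layer's heads.
    # Every head gets a small uniform floor of cache slots; the remaining
    # slots are split in proportion to the visual scores.
    num_heads = scores.numel()
    floor = torch.full((num_heads,), min_per_head, dtype=torch.long)
    remaining = total_budget - int(floor.sum())
    weights = scores / scores.sum()
    extra = torch.floor(weights * remaining).long()
    return floor + extra  # per-head KV-cache budget, summing to <= total_budget

scores = torch.tensor([0.02, 0.01, 0.55, 0.03, 0.30, 0.02, 0.05, 0.02])
print(allocate_kv_budget(scores, total_budget=256))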

Main idea of Visual Head

(figure: Visual_Head.png)

SparseMM for MLLM Acceleration

(figure: SparseMM.png)

Main Results

Results on Multi-modal Benchmarks

(figure: main_result.png)

Efficiency Evaluation for SparseMM

(figure: efficiency.png)

Visualization of Visual Head

(figure: viz.png)

Get Started

Install

  1. Clone this repository:
git clone https://github.com/CR400AF-A/SparseMM.git
cd SparseMM
  2. Initialize your environment:
conda create -n sparsemm python=3.10 -y
conda activate sparsemm
  3. Install packages

Compile the CUDA code for Flatten Cache Storage. If you encounter a CUDA compile error, check your GPU virtual architecture and GPU features (i.e., your GPU's compute capability), then change the corresponding compile flags in csrc/build.py; an illustrative example is given after the commands below.

pip install packaging torch==2.5.1
pip uninstall ninja && pip cache purge && pip install ninja --no-cache-dir
cd csrc && make
cd ..
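
For reference, nvcc architecture flags typically look like the following; the variable name here is hypothetical, so check csrc/build.py for the actual flag definition and match the values to your GPU.

# Illustrative nvcc architecture flags (the variable name is hypothetical;
# edit whatever flag list csrc/build.py actually defines). Match the values to
# your GPU's compute capability, e.g. sm_80 (A100), sm_89 (RTX 4090), sm_90 (H100).
cuda_arch_flags = [
    "-gencode", "arch=compute_80,code=sm_80",
]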

Install other packages

pip install -e .
pip install flash-attn==2.4.1 --no-build-isolation # currently only FlashAttention is supported
pip install qwen-vl-utils
  4. Install lmms-eval for evaluation:
cd lmms-eval
pip install -e .
cd ..

Chase Visual Head

  1. Download the datasets:
huggingface-cli download --repo-type dataset --resume-download nnethercott/synthdog-en-detection --local-dir /path/to/datasets/synthdog-en-detection

huggingface-cli download --repo-type dataset --resume-download detection-datasets/coco --local-dir /path/to/datasets/coco
  2. Process the datasets:
python3 scripts/chase_visual_head/process_data.py
python3 scripts/chase_visual_head/process_data_coco.py
  3. Chase visual heads:
bash scripts/chase_visual_head/llava.sh
bash scripts/chase_visual_head/llava_coco.sh
bash scripts/chase_visual_head/qwen.sh

Eval

bash scripts/eval/llava.sh
bash scripts/eval/mistral.sh
bash scripts/eval/qwen.sh

Viz

bash scripts/others/viz.sh

Speed and Memory

bash scripts/others/speed_and_memory.sh

Citation

If you find this repository useful, please consider citing:

@article{wang2025sparsemm,
  title={SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs},
  author={Wang, Jiahui and Liu, Zuyan and Rao, Yongming and Lu, Jiwen},
  journal={arXiv preprint arXiv:2506.05344},
  year={2025}
}

Acknowledgement

  • Our codebase is built on AdaKV and PyramidKV.

  • Thanks to the lmms-eval team for building such a useful evaluation system!
