This repository provides the implementation for our paper: "Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment", AAAI 2026.
- Install necessary packages.
```bash
conda create -n vdrop python=3.10 -y
conda activate vdrop

# Install dependencies
pip install --upgrade pip
pip install -e .
```

- (Optional) Install FlashAttention for further inference acceleration.

```bash
pip install flash-attn --no-build-isolation
```
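To sanity-check the installation (this assumes the editable install exposes the upstream LLaVA package name `llava`; adjust if your package name differs):

```bash
# Assumes the editable install exposes the package as "llava", as in upstream LLaVA.
python -c "import llava; print('installation OK')"
```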
Our method performs progressive visual token reduction at both the visual encoder and the LLM decoding phase. The main arguments are:

- `--dominant '42'`: number of dominant tokens retained from the visual encoder.
- `--contextual '6'`: number of contextual tokens retained alongside the dominant ones from the visual encoder.
- `--layer_list '[8,16,24]'`: LLM layers after which token reduction is applied.
- `--image_token_list "[[30,5],[22,4],[16,3]]"`: token retention schedule per LLM layer, formatted as a list of `[dominant, contextual]` pairs.
These example settings correspond to retaining 32 visual tokens on average.
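The 32-token figure can be reproduced with a short back-of-the-envelope calculation. The sketch below is our own illustration, assuming a 32-layer LLM (e.g., the Vicuna-7B backbone of LLaVA-1.5-7B) with reduction applied after the listed layers:

```python
# Illustrative sketch (not part of the repo): reproduce the "average of 32
# tokens" figure, assuming a 32-layer LLM with reduction after layers 8/16/24.
layer_list = [8, 16, 24]
image_token_list = [[30, 5], [22, 4], [16, 3]]
num_layers = 32

# 42 dominant + 6 contextual tokens leave the visual encoder; each reduction
# step then shrinks the budget to dominant + contextual for that stage.
budgets = [42 + 6] + [d + c for d, c in image_token_list]  # [48, 35, 26, 19]
boundaries = [0] + layer_list + [num_layers]               # segment edges
segment_lengths = [boundaries[i + 1] - boundaries[i] for i in range(len(budgets))]

avg = sum(b * n for b, n in zip(budgets, segment_lengths)) / num_layers
print(avg)  # 32.0
```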
We follow the original LLaVA evaluation protocol on 9 image understanding benchmarks.
Before evaluation, prepare the datasets following the LLaVA `Evaluation.md` instructions, and download the LLaVA-1.5-7B checkpoint from Hugging Face.
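For example, the checkpoint can be fetched with `huggingface-cli` (the local directory below is our choice, not a repo convention):

```bash
# Requires huggingface_hub; the target directory is an illustrative choice.
huggingface-cli download liuhaotian/llava-v1.5-7b --local-dir ./checkpoints/llava-v1.5-7b
```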
We provide an evaluation script for each benchmark:

```bash
bash scripts/v1_5/visiondrop_eval/${DATASET}.sh
```
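For instance, to evaluate on MME (the script name below is our assumption, following LLaVA's benchmark naming; substitute the benchmark you need):

```bash
bash scripts/v1_5/visiondrop_eval/mme.sh
```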
If you find this project useful in your research, please consider citing:

```bibtex
@article{xu2025visiondrop,
  author  = {Rui Xu and Yunke Wang and Yong Luo and Bo Du},
  title   = {Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment},
  journal = {arXiv preprint arXiv:2506.22283},
  year    = {2025},
}
```
This work builds upon several excellent open-source projects. Thanks to the original authors for their contributions to the community.
