VisionDrop

This repository provides the implementation for our paper: "Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment", AAAI 2026.

🧩 Method

📦 Environment Setup

  1. Install the necessary packages.

conda create -n vdrop python=3.10 -y
conda activate vdrop

# Install dependencies
pip install --upgrade pip
pip install -e .

  2. (Optional) Install FlashAttention for further inference acceleration.

pip install flash-attn --no-build-isolation

⚙️ Key Arguments

Our method performs progressive visual token reduction at both the visual encoder and the LLM decoding phase. The main arguments are:

  • --dominant '42' : Number of dominant tokens retained from the visual encoder.
  • --contextual '6' : Number of contextual tokens retained alongside dominant ones from the visual encoder.
  • --layer_list '[8,16,24]' : LLM layers after which token reduction is applied.
  • --image_token_list "[[30,5],[22,4],[16,3]]" : Token retention schedule per reduction layer, formatted as a list of [dominant, contextual] pairs.

These example settings correspond to an average token retention of 32 tokens.
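The arithmetic behind that number can be sketched as follows. This assumes the LLM has 32 decoder layers (as in LLaVA-1.5-7B) and that the retained count is averaged uniformly over all layers; the function name `average_retention` is ours, not part of the codebase:

```python
def average_retention(dominant, contextual, layer_list, image_token_list, num_layers):
    """Average number of visual tokens retained per LLM layer.

    Assumption (not from the repo): reduction happens after each layer in
    layer_list, so a token count holds for all layers up to that boundary.
    """
    tokens = dominant + contextual  # tokens entering the first LLM layer
    total = 0
    prev = 0
    for i, boundary in enumerate(layer_list + [num_layers]):
        total += tokens * (boundary - prev)  # tokens held for this span of layers
        prev = boundary
        if i < len(image_token_list):
            tokens = sum(image_token_list[i])  # reduce at the boundary
    return total / num_layers

# Example settings from above: 48 tokens until layer 8, then 35, 26, 19.
print(average_retention(42, 6, [8, 16, 24], [[30, 5], [22, 4], [16, 3]], 32))
# → 32.0
```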

🚀 Efficient Inference

We follow the original LLaVA evaluation protocol on 9 image understanding benchmarks.

Before evaluation, prepare the datasets following the LLaVA Evaluation.md instructions, and download LLaVA-1.5-7B checkpoints from Hugging Face.

We provide the evaluation scripts for each benchmark:

bash scripts/v1_5/visiondrop_eval/${DATASET}.sh
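To run several benchmarks back to back, a simple loop works. The dataset names below are examples (common LLaVA-1.5 benchmarks); substitute whichever scripts actually exist under scripts/v1_5/visiondrop_eval/:

```shell
# Sketch: print the eval command for each benchmark (names are assumptions).
# Replace `echo` with a direct invocation to actually run them.
for DATASET in gqa textvqa pope; do
  echo "bash scripts/v1_5/visiondrop_eval/${DATASET}.sh"
done
```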

🔗 Citation

If you find this project useful in your research, please consider citing:

@article{xu2025visiondrop,
    author    = {Rui Xu and Yunke Wang and Yong Luo and Bo Du},
    title     = {Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment},
    journal   = {arXiv preprint arXiv:2506.22283},
    year      = {2025},
}

❤️ Acknowledgments

This work builds upon several excellent open-source projects. Thanks to the original authors for their contributions to the community.
