Yicheng Ji*, Jun Zhang*, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li (* equal contribution)
[EMNLP 2025 Main] SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning 🔗 arXiv 2508.16201
Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model’s speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes the remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68× decoding speedup for LLaVA-OneVision-72B and 2.11× speedup for Qwen2.5-VL-32B.
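For intuition, below is a minimal PyTorch sketch of the two-stage pruning idea described above. It is not the repository's implementation: the tensor shapes, the `keep_ratio` and `stage1_frac` parameters, and the use of a pre-aggregated per-token attention score from the verifier are illustrative assumptions.

```python
# Conceptual sketch of verifier-guided two-stage video token pruning.
# Stage I keeps the tokens that receive the most verifier attention;
# Stage II fills the remaining budget with a spatially uniform subset.
import torch

def prune_video_tokens(video_tokens: torch.Tensor,
                       verifier_attn: torch.Tensor,
                       keep_ratio: float = 0.1,      # e.g. prune ~90% of tokens
                       stage1_frac: float = 0.5) -> torch.Tensor:
    """video_tokens: (N, D) video token embeddings fed to the draft model.
    verifier_attn: (N,) attention mass each video token received in the verifier.
    Returns the kept tokens in their original order."""
    n_tokens = video_tokens.size(0)
    n_keep = max(1, int(n_tokens * keep_ratio))
    n_stage1 = int(n_keep * stage1_frac)

    # Stage I: select highly informative tokens via verifier attention.
    stage1_idx = torch.topk(verifier_attn, n_stage1).indices

    # Stage II: prune the remaining tokens in a (spatially) uniform manner.
    remaining = torch.ones(n_tokens, dtype=torch.bool)
    remaining[stage1_idx] = False
    remaining_idx = remaining.nonzero(as_tuple=True)[0]
    n_stage2 = n_keep - n_stage1
    stride = max(1, remaining_idx.numel() // max(1, n_stage2))
    stage2_idx = remaining_idx[::stride][:n_stage2]

    keep_idx = torch.cat([stage1_idx, stage2_idx]).sort().values
    return video_tokens[keep_idx]
```

In the full framework, the pruned token set is only used for the draft model's speculation; the verifier still checks the drafted tokens against the unpruned input, which is why decoding remains lossless.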
Install the required dependencies:

```bash
conda create -n SpecVLM python=3.10 -y
conda activate SpecVLM
pip install torch torchvision
pip install -r requirements.txt
```

Then download the required models and dataset (a scripted download example follows this list):

- For LLaVA-OneVision models: https://huggingface.co/llava-hf
- For Qwen2.5-VL models: https://huggingface.co/Qwen
- For the VideoDetailCaption dataset: https://huggingface.co/datasets/lmms-lab/VideoDetailCaption
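As a convenience, the checkpoints and dataset can be fetched with `huggingface_hub`. The exact repository IDs and local directories below are assumptions; substitute the model variants you plan to evaluate.

```python
# Hedged download sketch using huggingface_hub (repo IDs are examples only).
from huggingface_hub import snapshot_download

# Target / draft model checkpoints (pick the sizes you need).
snapshot_download("llava-hf/llava-onevision-qwen2-72b-ov-hf",
                  local_dir="models/llava-ov-72b")
snapshot_download("Qwen/Qwen2.5-VL-32B-Instruct",
                  local_dir="models/qwen2.5-vl-32b")

# Evaluation data.
snapshot_download("lmms-lab/VideoDetailCaption", repo_type="dataset",
                  local_dir="data/VideoDetailCaption")
```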
Run the demo script to quickly evaluate SpecVLM:

```bash
sh run.sh
```

Please also adjust the model path, data path, pruning ratio, and number of frames in the run.sh file. After running the script, the evaluation results will be stored in results/.
- Our method primarily targets resource-constrained long-video scenarios, where GPU memory bandwidth is the main bottleneck during inference. Users are advised to set the input length according to their GPU capacity. In theory, SpecVLM achieves higher acceleration ratios as the number of frames grows.
- In principle, our approach is lossless, with only minimal impact introduced by the attention implementation and data type settings. Given the insensitivity of draft models to token pruning, we also recommend uniform pruning as a compatibility-friendly alternative.
If you find SpecVLM useful or relevant to your research, please kindly cite our paper:
@inproceedings{ji2025specvlm,
title={{SpecVLM}: Enhancing Speculative Decoding of Video {LLMs} via Verifier-Guided Token Pruning},
author={Ji, Yicheng and Zhang, Jun and Xia, Heming and Chen, Jinpeng and Shou, Lidan and Chen, Gang and Li, Huan},
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
pages={7216--7230},
year={2025}
}
