Yicheng Ji*, Jun Zhang*, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li (* equal contribution)
[EMNLP 2025 Main] SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning 🔗 arXiv 2508.16201
Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model’s speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes the remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68× decoding speedup for LLaVA-OneVision-72B and 2.11× speedup for Qwen2.5-VL-32B.
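For intuition, below is a minimal PyTorch sketch of the two-stage pruning idea described above. It is not the repository's implementation: the tensor shapes, the `keep_ratio` and `stage1_frac` parameters, and the use of a pre-aggregated per-token attention score from the verifier are illustrative assumptions.

```python
# Conceptual sketch of verifier-guided two-stage video token pruning.
# Stage I keeps the tokens that receive the most verifier attention;
# Stage II fills the remaining budget with a spatially uniform subset.
import torch

def prune_video_tokens(video_tokens: torch.Tensor,
                       verifier_attn: torch.Tensor,
                       keep_ratio: float = 0.1,      # e.g. prune ~90% of tokens
                       stage1_frac: float = 0.5) -> torch.Tensor:
    """video_tokens: (N, D) video token embeddings fed to the draft model.
    verifier_attn: (N,) attention mass each video token received in the verifier.
    Returns the kept tokens in their original order."""
    n_tokens = video_tokens.size(0)
    n_keep = max(1, int(n_tokens * keep_ratio))
    n_stage1 = int(n_keep * stage1_frac)

    # Stage I: select highly informative tokens via verifier attention.
    stage1_idx = torch.topk(verifier_attn, n_stage1).indices

    # Stage II: prune the remaining tokens in a (spatially) uniform manner.
    remaining = torch.ones(n_tokens, dtype=torch.bool)
    remaining[stage1_idx] = False
    remaining_idx = remaining.nonzero(as_tuple=True)[0]
    n_stage2 = n_keep - n_stage1
    stride = max(1, remaining_idx.numel() // max(1, n_stage2))
    stage2_idx = remaining_idx[::stride][:n_stage2]

    keep_idx = torch.cat([stage1_idx, stage2_idx]).sort().values
    return video_tokens[keep_idx]
```

In the full framework, the pruned token set is only used for the draft model's speculation; the verifier still checks the drafted tokens against the unpruned input, which is why decoding remains lossless.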
Install the required dependencies:

```bash
conda create -n SpecVLM python=3.10 -y
conda activate SpecVLM
pip install torch torchvision
pip install -r requirements.txt
```

Then download the required models and dataset (a scripted download example follows this list):

- For LLaVA-OneVision models: https://huggingface.co/llava-hf
- For Qwen2.5-VL models: https://huggingface.co/Qwen
- For the VideoDetailCaption dataset: https://huggingface.co/datasets/lmms-lab/VideoDetailCaption
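As a convenience, the checkpoints and dataset can be fetched with `huggingface_hub`. The exact repository IDs and local directories below are assumptions; substitute the model variants you plan to evaluate.

```python
# Hedged download sketch using huggingface_hub (repo IDs are examples only).
from huggingface_hub import snapshot_download

# Target / draft model checkpoints (pick the sizes you need).
snapshot_download("llava-hf/llava-onevision-qwen2-72b-ov-hf",
                  local_dir="models/llava-ov-72b")
snapshot_download("Qwen/Qwen2.5-VL-32B-Instruct",
                  local_dir="models/qwen2.5-vl-32b")

# Evaluation data.
snapshot_download("lmms-lab/VideoDetailCaption", repo_type="dataset",
                  local_dir="data/VideoDetailCaption")
```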
Run the demo script to quickly evaluate SpecVLM:

```bash
sh run.sh
```

Please also adjust the model path, data path, pruning ratio, and number of frames in the run.sh file. After running the script, the evaluation results will be stored in results/.
- Our method primarily targets resource-constrained long-video scenarios, where GPU memory bandwidth is the main bottleneck during inference. Users are advised to set the input length according to their GPU capacity. In theory, SpecVLM achieves higher acceleration ratios as the number of frames grows.
- In principle, our approach is lossless, with only minimal impact introduced by the attention implementation and data type settings. Given the insensitivity of draft models to token pruning, we also recommend uniform pruning as a compatibility-friendly alternative.
If you find SpecVLM useful or relevant to your research, please kindly cite our paper:
@inproceedings{ji2025specvlm,
title={{SpecVLM}: Enhancing Speculative Decoding of Video {LLMs} via Verifier-Guided Token Pruning},
author={Ji, Yicheng and Zhang, Jun and Xia, Heming and Chen, Jinpeng and Shou, Lidan and Chen, Gang and Li, Huan},
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
pages={7216--7230},
year={2025}
}
