Code for the paper "CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs"
Authors: Insu Han, Zeliang Zhang, Zhiyuan Wang, Yifan Zhu, Susan Liang, Jiani Liu, Haiting Lin, Mingjie Zhao, Chenliang Xu, Kun Wan, Wentian Zhao
This repository provides a guide for setting up and running InternVL with KVcacheQuant for efficient inference.
- Install the required packages (e.g., InternVL, Triton): `pip install internvl triton==3.2.0`
- Download the InternVL2.5-26B/8B model from HuggingFace.
- Change the batch size (line 104 in `infer.py`).
- Set the bit number (line 13 in `calibquant.py`).
- Run inference: `python infer.py`
- Ensure all dependencies are installed before running the script.
- Adjust parameters to suit your hardware for the best performance.
- If you encounter issues, refer to the official documentation or repository.
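For intuition about what the bit number controls, here is a minimal sketch of uniform min-max KV cache quantization in NumPy. This is only an illustration of b-bit round-trip quantization of a key/value tensor; it is not the paper's calibrated scheme, whose details live in `calibquant.py` (function names and tensor shapes below are made up for the example):

```python
import numpy as np

def quantize_kv(x: np.ndarray, bits: int = 1):
    """Uniform per-token min-max quantization (illustrative sketch only).

    `x` has shape (..., head_dim); with bits=1, each entry is mapped to
    one of two levels {0, 1} per token.
    """
    levels = 2 ** bits - 1                       # number of steps in the grid
    xmin = x.min(axis=-1, keepdims=True)         # per-token range
    xmax = x.max(axis=-1, keepdims=True)
    scale = np.maximum(xmax - xmin, 1e-8) / levels
    q = np.clip(np.round((x - xmin) / scale), 0, levels).astype(np.uint8)
    return q, scale, xmin

def dequantize_kv(q: np.ndarray, scale: np.ndarray, xmin: np.ndarray):
    """Map the integer codes back to (approximate) float values."""
    return q.astype(scale.dtype) * scale + xmin

# Example: quantize a fake key cache to 1 bit and reconstruct it.
k = np.random.randn(2, 8, 64)                    # (heads, tokens, head_dim)
q, scale, xmin = quantize_kv(k, bits=1)
k_hat = dequantize_kv(q, scale, xmin)
```

With `bits=1` each token's entries collapse to its per-token minimum or maximum; the reconstruction error is bounded by half the quantization step, which is what the calibration in the paper is designed to reduce further.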
@article{han2025calibquant,
title={CalibQuant: 1-Bit KV Cache Quantization via Calibration for Multimodal LLMs},
author={Han, Insu and Zhang, Zeliang and Zhu, Yifan and Liang, Susan and Wang, Zhiyuan and Liu, Jiani and Lin, Haiting and Zhao, Mingjie and Xu, Chenliang and Wan, Kun and Zhao, Wentian},
journal={arXiv preprint arXiv:2502.14882},
year={2025}
}