LLaVA-KD: A Framework of Distilling Multimodal Large Language Models (ICCV 2025)

Yuxuan Cai1*, Jiangning Zhang2,3*, Haoyang He2, Xinwei He4, Ao Tong1,

Zhenye Gan3, Chengjie Wang3, Zhucun Xue2, Yong Liu2, Xiang Bai1

1Huazhong University of Science and Technology,

2Zhejiang University, 3Youtu Lab, Tencent, 4Huazhong Agricultural University

[Paper]

Abstract

The success of Large Language Models (LLMs) has inspired the development of Multimodal Large Language Models (MLLMs) for unified understanding of vision and language. However, the increasing model size and computational complexity of large-scale MLLMs ($l$-MLLMs) limit their use in resource-constrained scenarios. Although small-scale MLLMs ($s$-MLLMs) are designed to reduce computational costs, they typically suffer from performance degradation. To mitigate this limitation, we propose a novel LLaVA-KD framework to transfer knowledge from $l$-MLLMs to $s$-MLLMs. Specifically, we introduce Multimodal Distillation (MDist) to transfer the teacher model's robust representations across both visual and linguistic modalities, and Relation Distillation (RDist) to transfer the teacher model's ability to capture visual token relationships. Additionally, we propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy: (1) Distilled Pre-Training to strengthen the alignment between visual and linguistic representations in $s$-MLLMs, (2) Supervised Fine-Tuning to equip the $s$-MLLMs with multimodal understanding capacity, and (3) Distilled Fine-Tuning to refine the $s$-MLLM's knowledge. Our approach significantly improves $s$-MLLM performance without altering the model architecture. Extensive experiments and ablation studies validate the effectiveness of each proposed component.
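
As a rough illustration of the two objectives, the sketch below is our own simplification, not the official implementation: it shows a KL-based multimodal distillation term over teacher/student token distributions and a relation term that matches pairwise cosine similarities between visual tokens. Tensor shapes, the temperature, and the exact relation metric are assumptions.

    import torch.nn.functional as F

    def mdist_loss(student_logits, teacher_logits, temperature=2.0):
        # MDist (sketch): KL divergence between the teacher's and student's
        # token distributions, applied over visual and textual positions alike.
        t = temperature
        log_p_student = F.log_softmax(student_logits / t, dim=-1)
        p_teacher = F.softmax(teacher_logits / t, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

    def rdist_loss(student_visual, teacher_visual):
        # RDist (sketch): align the pairwise cosine-similarity matrices of the
        # visual tokens produced by the student and the teacher.
        s = F.normalize(student_visual, dim=-1)   # (B, N, D_student)
        t = F.normalize(teacher_visual, dim=-1)   # (B, N, D_teacher)
        rel_student = s @ s.transpose(-1, -2)     # (B, N, N)
        rel_teacher = t @ t.transpose(-1, -2)     # (B, N, N)
        return F.mse_loss(rel_student, rel_teacher)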


Overview

(Overview figure.)


📜 Main Results on 10 Popular Benchmarks

Benchmark results against SoTA MLLMs. Compared with its counterparts, LLaVA-KD achieves highly competitive results among current small-scale MLLMs. AVG: the average over the nine benchmarks (excluding MMMU) for comprehensive comparison. $^\dagger$: results reproduced using the official code.


🛠️ Installation

  • Based on Python 3.12 and PyTorch 2.6.0 (a quick environment check is sketched after these steps)

  • Prepare the environment

    python3.12 -m pip install --no-cache-dir --upgrade -r requirements.txt
    python3.12 -m pip install numpy==1.26.2
    python3.12 -m pip install urllib3==1.26.6
  • Install CUDA (the runfile below installs CUDA 12.9.1)

    sh cuda_12.9.1_575.57.08_linux.run
  • Install cuSPARSELt

    cd ../LLaVA_KD_whls/
    rpm -i cusparselt-local-repo-rhel9-0.7.1-1.0-1.x86_64.rpm
    dnf clean all
    dnf -y install libcusparselt0 libcusparselt-devel
  • Install bitsandbytes

    cd ../LLaVA_KD_whls/bitsandbytes-0.46.0
    python3.12 setup.py install 
  • Install deepspeed

    python3.12 -m pip install ptflops
    python3.12 -m pip install deepspeed==0.14.4
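
After the steps above, a short sanity check like the following (our addition, not part of the repository) can confirm that the pinned packages import and report the expected versions:

    # Optional environment check; expected versions follow the pins above.
    import torch
    import numpy
    import deepspeed
    import bitsandbytes

    print("torch:", torch.__version__)                # expect 2.6.0
    print("numpy:", numpy.__version__)                # expect 1.26.2
    print("deepspeed:", deepspeed.__version__)        # expect 0.14.4
    print("bitsandbytes:", bitsandbytes.__version__)  # expect 0.46.0
    print("CUDA available:", torch.cuda.is_available())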
    
    

LLaVA-KD Weights

| Model | Vision Encoder | LLM | CKPTs |
|---|---|---|---|
| LLaVA-KD-1B-Base-Qwen1.5 | siglip-so400m-patch14-384 | Qwen/Qwen1.5-0.5B | LLaVA-KD-Base-siglip-Qwen1.5-0.5B |
| LLaVA-KD-2B-Base-Qwen1.5 | siglip-so400m-patch14-384 | Qwen/Qwen1.5-1.8B | LLaVA-KD-Base-siglip-Qwen1.5-1.8B |
| LLaVA-KD-1B-Base-Qwen2.5 | siglip-so400m-patch14-384 | Qwen/Qwen2.5-0.5B | LLaVA-KD-Base-siglip-Qwen2.5-0.5B |
| LLaVA-KD-2B-Base-Qwen2.5 | siglip-so400m-patch14-384 | Qwen/Qwen2.5-1.5B | LLaVA-KD-Base-siglip-Qwen2.5-1.5B |
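
One way to fetch a checkpoint is via huggingface_hub, assuming the CKPT names above correspond to Hugging Face repositories; the organization prefix below is a placeholder, so use the actual hub path linked in the table:

    from huggingface_hub import snapshot_download

    # "<org>" is a placeholder for the organization hosting the LLaVA-KD
    # checkpoints; take the real repo ID from the table above.
    local_dir = snapshot_download(
        repo_id="<org>/LLaVA-KD-Base-siglip-Qwen1.5-0.5B",
        local_dir="./pretrained_ckpt/LLaVA-KD-Base-siglip-Qwen1.5-0.5B",
    )
    print("checkpoint downloaded to", local_dir)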

💻 Evaluation

Please evaluate the model according to Evaluation.md.

Quickstart

Download the pre-trained vision encoder, LLM, and LLaVA-KD weights to ./pretrained_ckpt, then run:

python quick_inference.py --model_path ./pretrained_ckpt/LLaVAKD_Model_Path --image_file ./image_test/img_test_1.jpg  --query "What is that orange thing behind the girl?"
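
To run the same query over a folder of images, a small wrapper such as the sketch below can be used; it is our addition and simply re-invokes quick_inference.py with the flags shown above:

    import subprocess
    from pathlib import Path

    MODEL_PATH = "./pretrained_ckpt/LLaVAKD_Model_Path"
    QUERY = "What is that orange thing behind the girl?"

    # Call quick_inference.py once per image in ./image_test, reusing the CLI
    # flags from the single-image example above.
    for image in sorted(Path("./image_test").glob("*.jpg")):
        subprocess.run(
            ["python", "quick_inference.py",
             "--model_path", MODEL_PATH,
             "--image_file", str(image),
             "--query", QUERY],
            check=True,
        )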


☑️ TODO List

  • Release the training code
  • Release the checkpoints

💫 Citation

If you find this code useful, don't forget to star the repo and cite the paper.

@article{cai2024llava,
  title={LLaVA-KD: A Framework of Distilling Multimodal Large Language Models},
  author={Cai, Yuxuan and Zhang, Jiangning and He, Haoyang and He, Xinwei and Tong, Ao and Gan, Zhenye and Wang, Chengjie and Bai, Xiang},
  journal={arXiv preprint arXiv:2410.16236},
  year={2024}
}

💘 Acknowledgements

We thank the great works TinyLLaVA and LLaVA for supporting our research.
