LLaVA-KD: A Framework of Distilling Multimodal Large Language Models (ICCV 2025)

Yuxuan Cai1*, Jiangning Zhang2,3*, Haoyang He2, Xinwei He4, Ao Tong1,

Zhenye Gan3, Chengjie Wang3, Zhucun Xue2, Yong Liu2, Xiang Bai1

1Huazhong University of Science and Technology,

2Zhejiang University, 3Youtu Lab, Tencent, 4Huazhong Agricultural University

[Paper]

Abstract

The success of Large Language Models (LLMs) has inspired the development of Multimodal Large Language Models (MLLMs) for unified understanding of vision and language. However, the increasing model size and computational complexity of large-scale MLLMs ($l$-MLLMs) limit their use in resource-constrained scenarios. Although small-scale MLLMs ($s$-MLLMs) are designed to reduce computational costs, they typically suffer from performance degradation. To mitigate this limitation, we propose a novel LLaVA-KD framework to transfer knowledge from $l$-MLLMs to $s$-MLLMs. Specifically, we introduce Multimodal Distillation (MDist) to transfer the teacher model's robust representations across both visual and linguistic modalities, and Relation Distillation (RDist) to transfer the teacher model's ability to capture visual token relationships. Additionally, we propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy: (1) Distilled Pre-Training to strengthen the alignment between visual and linguistic representations in $s$-MLLMs, (2) Supervised Fine-Tuning to equip the $s$-MLLMs with multimodal understanding capacity, and (3) Distilled Fine-Tuning to refine the $s$-MLLM's knowledge. Our approach significantly improves $s$-MLLM performance without altering the model architecture. Extensive experiments and ablation studies validate the effectiveness of each proposed component.
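
As a rough illustration of the two objectives, the sketch below is our own simplification, not the official implementation: it shows a KL-based multimodal distillation term over teacher/student token distributions and a relation term that matches pairwise cosine similarities between visual tokens. Tensor shapes, the temperature, and the exact relation metric are assumptions.

    import torch.nn.functional as F

    def mdist_loss(student_logits, teacher_logits, temperature=2.0):
        # MDist (sketch): KL divergence between the teacher's and student's
        # token distributions, applied over visual and textual positions alike.
        t = temperature
        log_p_student = F.log_softmax(student_logits / t, dim=-1)
        p_teacher = F.softmax(teacher_logits / t, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

    def rdist_loss(student_visual, teacher_visual):
        # RDist (sketch): align the pairwise cosine-similarity matrices of the
        # visual tokens produced by the student and the teacher.
        s = F.normalize(student_visual, dim=-1)   # (B, N, D_student)
        t = F.normalize(teacher_visual, dim=-1)   # (B, N, D_teacher)
        rel_student = s @ s.transpose(-1, -2)     # (B, N, N)
        rel_teacher = t @ t.transpose(-1, -2)     # (B, N, N)
        return F.mse_loss(rel_student, rel_teacher)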


Overview

(Overview figure.)


📜 Main Results on 10 Popular Benchmarks

Benchmark results against SoTA MLLMs. Compared with its counterparts, LLaVA-KD achieves highly competitive results among current small-scale MLLMs. AVG: the average over the nine benchmarks (excluding MMMU) for comprehensive comparison. $^\dagger$: results reproduced using the official code.


🛠️ Installation

  • Based on Python 3.12 and PyTorch 2.6.0 (a quick environment check is sketched after these steps)

  • Prepare the environment

    python3.12 -m pip install --no-cache-dir --upgrade -r requirements.txt
    python3.12 -m pip install numpy==1.26.2
    python3.12 -m pip install urllib3==1.26.6
  • Install CUDA (the runfile below installs CUDA 12.9.1)

    sh cuda_12.9.1_575.57.08_linux.run
  • Install cuSPARSELt

    cd ../LLaVA_KD_whls/
    rpm -i cusparselt-local-repo-rhel9-0.7.1-1.0-1.x86_64.rpm
    dnf clean all
    dnf -y install libcusparselt0 libcusparselt-devel
  • Install bitsandbytes

    cd ../LLaVA_KD_whls/bitsandbytes-0.46.0
    python3.12 setup.py install 
  • Install deepspeed

    python3.12 -m pip install ptflops
    python3.12 -m pip install deepspeed==0.14.4
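
After the steps above, a short sanity check like the following (our addition, not part of the repository) can confirm that the pinned packages import and report the expected versions:

    # Optional environment check; expected versions follow the pins above.
    import torch
    import numpy
    import deepspeed
    import bitsandbytes

    print("torch:", torch.__version__)                # expect 2.6.0
    print("numpy:", numpy.__version__)                # expect 1.26.2
    print("deepspeed:", deepspeed.__version__)        # expect 0.14.4
    print("bitsandbytes:", bitsandbytes.__version__)  # expect 0.46.0
    print("CUDA available:", torch.cuda.is_available())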
    
    

LLaVA-KD Weights

| Model | Vision Encoder | LLM | CKPTs |
|---|---|---|---|
| LLaVA-KD-1B-Base-Qwen1.5 | siglip-so400m-patch14-384 | Qwen/Qwen1.5-0.5B | LLaVA-KD-Base-siglip-Qwen1.5-0.5B |
| LLaVA-KD-2B-Base-Qwen1.5 | siglip-so400m-patch14-384 | Qwen/Qwen1.5-1.8B | LLaVA-KD-Base-siglip-Qwen1.5-1.8B |
| LLaVA-KD-1B-Base-Qwen2.5 | siglip-so400m-patch14-384 | Qwen/Qwen2.5-0.5B | LLaVA-KD-Base-siglip-Qwen2.5-0.5B |
| LLaVA-KD-2B-Base-Qwen2.5 | siglip-so400m-patch14-384 | Qwen/Qwen2.5-1.5B | LLaVA-KD-Base-siglip-Qwen2.5-1.5B |
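
One way to fetch a checkpoint is via huggingface_hub, assuming the CKPT names above correspond to Hugging Face repositories; the organization prefix below is a placeholder, so use the actual hub path linked in the table:

    from huggingface_hub import snapshot_download

    # "<org>" is a placeholder for the organization hosting the LLaVA-KD
    # checkpoints; take the real repo ID from the table above.
    local_dir = snapshot_download(
        repo_id="<org>/LLaVA-KD-Base-siglip-Qwen1.5-0.5B",
        local_dir="./pretrained_ckpt/LLaVA-KD-Base-siglip-Qwen1.5-0.5B",
    )
    print("checkpoint downloaded to", local_dir)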

💻 Evaluation

Please evaluate the model according to Evaluation.md.

Quickstart

Download the pre-trained vision encoder, LLM, and LLaVA-KD weights to ./pretrained_ckpt, then run:

python quick_inference.py --model_path ./pretrained_ckpt/LLaVAKD_Model_Path --image_file ./image_test/img_test_1.jpg  --query "What is that orange thing behind the girl?"
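
To run the same query over a folder of images, a small wrapper such as the sketch below can be used; it is our addition and simply re-invokes quick_inference.py with the flags shown above:

    import subprocess
    from pathlib import Path

    MODEL_PATH = "./pretrained_ckpt/LLaVAKD_Model_Path"
    QUERY = "What is that orange thing behind the girl?"

    # Call quick_inference.py once per image in ./image_test, reusing the CLI
    # flags from the single-image example above.
    for image in sorted(Path("./image_test").glob("*.jpg")):
        subprocess.run(
            ["python", "quick_inference.py",
             "--model_path", MODEL_PATH,
             "--image_file", str(image),
             "--query", QUERY],
            check=True,
        )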


☑️ TODO List

  • Release the training code
  • Release the checkpoints

💫 Citation

If you find this code useful, don't forget to star the repo and cite the paper.

@article{cai2024llava,
  title={LLaVA-KD: A Framework of Distilling Multimodal Large Language Models},
  author={Cai, Yuxuan and Zhang, Jiangning and He, Haoyang and He, Xinwei and Tong, Ao and Gan, Zhenye and Wang, Chengjie and Bai, Xiang},
  journal={arXiv preprint arXiv:2410.16236},
  year={2024}
}

💘 Acknowledgements

We thank the great works TinyLLaVA and LLaVA for supporting our research.
