
RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models

News

  • 2025.3.16 The RAP dataset is now available. Access it here.🔥🔥
  • 2025.2.27 RAP is accepted by CVPR 2025!🎉🎉
  • 2024.11.24 Release code and model weights.

Personalize Your Multimodal Large Language Model via Retrieval Augmented Generation.

RAP-MLLM
Introduce user-specific concepts to RAP-MLLM, and it can remember them and achieve strong performance on a variety of personalized multimodal generation tasks.

Visit our Project Page for more demonstrations.

📋 Contents

  • Install
  • Models
  • Demo
  • Data
  • Training
  • Evaluation
  • BibTeX
  • Acknowledgement

Install

  1. Clone the repo into a local folder.

git clone https://github.com/Hoar012/RAP-MLLM.git
cd RAP-MLLM

  2. Install packages.

conda create -n rap python=3.10 -y
conda activate rap
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install -r requirements.txt
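
If you want to confirm the environment before moving on, a minimal check like the following should run inside the new rap environment; it only assumes that PyTorch and flash-attn were installed by the steps above.

# sanity_check.py -- optional, quick verification of the freshly installed environment.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn is not importable; re-run: pip install flash-attn --no-build-isolation")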

Models

Pretrained model weights are available on Hugging Face.

  • RAP-LLaVA: RAP-LLaVA-13b
  • RAP-Phi3-V: RAP-Phi3-mini
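
If you prefer to download the weights ahead of time, a small huggingface_hub snippet such as the one below works; the local directory is an arbitrary choice, and the repo id matches the model path used in the commands later in this README.

# download_weights.py -- optional helper to pre-download the RAP-LLaVA weights.
# local_dir is an arbitrary choice; swap repo_id to fetch the RAP-Phi3-V weights instead.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Hoar012/RAP-LLaVA-13b",
    local_dir="checkpoints/RAP-LLaVA-13b",
)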

Demo

Build Your Personal Database:

Each concept record in the database can be structured in the following format:

{
    "concept_dict": {
        "<concept>": {
            "name": "concept_name",
            "image": "image_path",
            "info": "",
            "category": ""
        }
    },
    "path_to_concept": {
        "image_path": "<concept>",
    }
}

We provide an example of the database in example_database.
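
As a starting point, the short sketch below writes a single-concept database file in the format shown above; the concept tag, image path, info text, and output filename are placeholders, so follow the layout of example_database for the exact files the demo expects.

# build_database.py -- minimal sketch for assembling a personal concept database.
# All names and paths below are placeholders for illustration.
import json

concept_tag = "<my_dog>"
image_path = "example_database/images/my_dog.jpg"

database = {
    "concept_dict": {
        concept_tag: {
            "name": "my_dog",
            "image": image_path,
            "info": "A golden retriever that belongs to the user.",
            "category": "pet",
        }
    },
    "path_to_concept": {
        image_path: concept_tag,
    },
}

with open("my_database.json", "w") as f:
    json.dump(database, f, indent=4)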

CLI Demo:

python cli.py --model-path Hoar012/RAP-LLaVA-13b --image-file /path/to/test_image --retrieval --database example_database --topK 1

Data

Please check Data for more detail.

Training

We provide the training scripts with DeepSpeed below. Try training on your own dataset!

  • RAP-LLaVA: script
  • RAP-Phi3-V: script
  • LLaVA-LoRA: script

Evaluation

Prepare Data

Please download the test data used in the paper from the repositories of MyVLM and Yo'LLaVA.

We also provide the images for multi-concept evaluation in this Google Drive link.

In addition, we provide the full database used for question answering at this Google Drive link.

Evaluation on Image Captioning

python eval/caption.py  --eval-file /path/to/eval_file --model-path Hoar012/RAP-LLaVA-13b --retrieval --database /path/to/database --topK 2

The eval-file records the image paths to be evaluated and their corresponding target concepts, formatted as follows:

{
    "/path/to/image": [
        "target_concept"
    ]
}
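
For reference, a small script like the one below produces an eval file in this format; the image paths and concept names are placeholders.

# make_eval_file.py -- minimal sketch for building a captioning eval file.
# Image paths and concept names are placeholders.
import json

eval_entries = {
    "/path/to/images/my_dog_1.jpg": ["my_dog"],
    "/path/to/images/my_dog_2.jpg": ["my_dog"],
    "/path/to/images/my_cat_1.jpg": ["my_cat"],
}

with open("caption_eval.json", "w") as f:
    json.dump(eval_entries, f, indent=4)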

Evaluation on Question Answering

python eval/VQA.py --eval-file eval/yollava-visual-qa.json --model-path Hoar012/RAP-LLaVA-13b --retrieval --database /path/to/database --topK 1

Replace /path/to/output_file with the path to your output file, then run the following command to obtain the accuracy:

python eval/eval_qa.py --output_path /path/to/output_file

Evaluation on Visual Recognition

python eval/recognition.py --eval-file eval/recognition_test.json --model-path Hoar012/RAP-LLaVA-13b --retrieval --database /path/to/database --topK 1

BibTeX

@InProceedings{Hao_2025_CVPR,
    author    = {Hao, Haoran and Han, Jiaming and Li, Changsheng and Li, Yu-Feng and Yue, Xiangyu},
    title     = {RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {14538-14548}
}

Acknowledgement

LLaVA, MyVLM, YoLLaVA
