mRAG-gim

Code for the CIKM '25 paper "Multimodal RAG Enhanced Visual Description".

Prerequisites

Begin by creating a Conda environment with the required packages using the following command:

conda env create -f requirements.yml

Data Setup

The model uses both the MS-COCO and Flickr30k datasets. You'll need to download and set them up as follows:

cd data

Create a new directory for MS-COCO and download the training and validation images:

mkdir mscoco && cd mscoco
wget http://images.cocodataset.org/zips/train2014.zip
unzip train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
unzip val2014.zip
cd ../..

The commands above place the images in data/mscoco/train2014 and data/mscoco/val2014.
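Before parsing, it can help to confirm that the image folders are where you expect them. The following is a minimal sketch, not part of the repository: the helper name missing_dirs is hypothetical, and the train2014/val2014 folder names are assumed from the archives unpacked above.

```python
from pathlib import Path


def missing_dirs(root, expected):
    """Return the names in `expected` that are not directories under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


# Assumed layout: data/mscoco/train2014 and data/mscoco/val2014.
absent = missing_dirs("data/mscoco", ["train2014", "val2014"])
if absent:
    print("Missing image folders:", ", ".join(absent))
else:
    print("MS-COCO image folders found.")
```

Run it from the repository root; adjust the root path if you keep the images elsewhere.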

Flickr30k:

Apply for access to the Flickr30k dataset and save the images to ./datasets/flickr30k.

Annotations:

Download the train/val/test annotations for both datasets from here and save them to the annotations directory.

Parse the datasets:

Once all the files are in place, parse both datasets with these scripts:

python data_prep/pcoco.py
python data_prep/pflickr30k.py

Reproducing Results

Follow these steps to reproduce the results from our paper.

Model Preparation

Before you begin, you will need to prepare the models.

Decoder Model:

(Optional) For the decoder model, download DeBERTaV3-base by following the instructions on its Hugging Face page.

Extracting Embeddings

Next, you need to extract the CLIP language embeddings for the captions.

Generate embeddings:

Run the following command. This process will take some time as it extracts embeddings for all CLIP backbones and saves them to the data/ directory.

python data_prep/gen_embeddings.py --datadir datasets/mscoco --data mscoco --vis-encoder RN50x64

Computing Mappings

Before computing the mappings, you will need to download the English spaCy pipeline for stop-word removal.

Download the spaCy pipeline:

python -m spacy download en_core_web_sm

Now, you can compute the mappings by executing this command:

python align_captions.py --dataset mscoco --vis-encoder RN50x64

You can set the --dataset argument to either mscoco or flickr30k.

Generating Captions

Finally, you can generate captions for the MS-COCO and Flickr30k datasets.

MS-COCO:

Generate captions on the test split with the following command:

python generate_captions.py --k 18 --mscoco --vis-encoder RN50x64 --train-method linear_reg --decoding greedy

Flickr30k:

To generate captions for the Flickr30k dataset, set --datadir to the Flickr30k test images and pass --flickr30k in place of --mscoco:

python generate_captions.py --datadir data/flickr30k/imgs_test.pkl --flickr30k --k 18 --vis-encoder RN50x64 --train-method linear_reg --decoding greedy

Hyperparameters:

--k: The number of captions provided in the prompt.

--decoding: Supports greedy, sampling, nucleus, and topk.
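To show how the four --decoding options differ, here is a toy sketch that applies each strategy to a single next-token distribution. It is an illustration only and does not reproduce the code in generate_captions.py; the function names, the probabilities, and the cut-offs (k=2, top_p=0.9) are made up for the example.

```python
import random


def greedy(probs):
    """Pick the single most probable token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def nucleus_filter(probs, top_p):
    """Keep the smallest high-probability set whose mass reaches top_p."""
    kept, mass = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept}


def sample(probs, rng):
    """Draw one token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Toy next-token distribution (made-up values).
next_token = {"dog": 0.5, "cat": 0.3, "bird": 0.15, "car": 0.05}
rng = random.Random(0)

print("greedy:  ", greedy(next_token))
print("sampling:", sample(next_token, rng))
print("topk:    ", sample(top_k_filter(next_token, 2), rng))
print("nucleus: ", sample(nucleus_filter(next_token, 0.9), rng))
```

Greedy decoding is deterministic, while the other three draw at random from the full, top-k, or nucleus-truncated distribution, so repeated runs with different seeds can yield different captions.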

Results and Evaluation

The generated captions are saved as a JSON file within a new results directory.

The metrics (BLEU, CIDEr-D, ROUGE-L, and SPICE) are computed with the code from the tylin/coco-caption repository. The required annotation files are in the annotations/ directory.
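The coco-caption tools read results as a JSON list of entries with an image_id and a caption. A quick sanity check before evaluation might look like the sketch below. It assumes the file written to the results directory follows that layout, which this README does not state, and check_caption_results is a hypothetical helper, not a function from this repository.

```python
import json


def check_caption_results(path):
    """Load a results file and verify the COCO caption results layout.

    Returns the number of entries, or raises ValueError on the first problem.
    """
    with open(path, encoding="utf-8") as fh:
        results = json.load(fh)
    if not isinstance(results, list):
        raise ValueError("expected a JSON list of result entries")
    for i, entry in enumerate(results):
        if not isinstance(entry, dict) or not {"image_id", "caption"}.issubset(entry):
            raise ValueError(f"entry {i} lacks 'image_id' or 'caption'")
        if not isinstance(entry["caption"], str):
            raise ValueError(f"entry {i}: caption must be a string")
    return len(results)
```

Call it with the path of the generated JSON file; a returned count equal to the number of test images suggests the file is complete.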

Citation

If you find this code or our research useful, please cite our paper:

ACM Reference Format:

Amit Kumar Jaiswal, Haiming Liu, Ingo Frommholz. 2025. Multimodal RAG Enhanced Visual Description. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM ’25), November 10–14, 2025, Seoul, Republic of Korea. ACM, Seoul, Korea, 5 pages. https://doi.org/10.1145/3746252.3760826

BibTeX:

@inproceedings{jaiswal2025multimodal,
  author    = {Jaiswal, Amit Kumar and Liu, Haiming and Frommholz, Ingo},
  title     = {Multimodal RAG Enhanced Visual Description},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)},
  year      = {2025},
  publisher = {ACM},
  address   = {Seoul, Republic of Korea},
  doi       = {10.1145/3746252.3760826},
  url       = {https://doi.org/10.1145/3746252.3760826}
}
