Code for the CIKM '25 paper "Multimodal RAG Enhanced Visual Description".
Begin by creating a Conda environment with the required packages using the following command:
conda env create -f requirements.yml
The model uses both the MS-COCO and Flickr30k datasets. You'll need to download and set them up as follows:
cd data
- MS-COCO:
Create a new directory for MS-COCO and download the training and validation images:
mkdir mscoco && cd mscoco
wget http://images.cocodataset.org/zips/train2014.zip
unzip train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
unzip val2014.zip
cd ../..
The images should now be stored in data/mscoco, in the train2014 and val2014 subdirectories created by unzip.
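Before moving on, it can help to confirm that the archives unpacked completely. The helper below is a minimal sketch and not part of this repository: `count_images` is a hypothetical name, and the path you pass in is whichever directory you unzipped into.

```python
from pathlib import Path


def count_images(root, suffixes=(".jpg", ".jpeg", ".png")):
    """Count image files anywhere under root; return 0 if root is missing."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )


if __name__ == "__main__":
    # Hypothetical paths: adjust to wherever the archives were extracted.
    for split in ("train2014", "val2014"):
        print(split, count_images(Path("data") / "mscoco" / split))
```

For reference, the published COCO 2014 splits contain 82,783 training and 40,504 validation images, so noticeably lower counts usually point to an interrupted download or unzip.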
- Flickr30k:
Apply for access to the Flickr30k dataset and save the images to ./datasets/flickr30k.
Download the train/val/test annotations for both datasets from here and save them to the annotations directory.
Parse the datasets:
Once all the files are in place, parse both datasets with these scripts:
python data_prep/pcoco.py
python data_prep/pflickr30k.py
Follow these steps to reproduce the results from our paper.
Before you begin, you will need to prepare the models.
- Decoder Model:
(Optional) For the decoder model, download DeBERTaV3-base by following the instructions on its Hugging Face page.
Next, you need to extract the CLIP language embeddings for the captions.
Run the following command. This process will take some time as it extracts embeddings for all CLIP backbones and saves them to the data/ directory.
python data_prep/gen_embeddings.py --datadir datasets/mscoco --data mscoco --vis-encoder RN50x64
Before computing the mappings, you will need to download the English spaCy pipeline for stop-word removal.
Download the spaCy pipeline:
python -m spacy download en_core_web_sm
Now, you can compute the mappings by executing this command:
python align_captions.py --dataset mscoco --vis-encoder RN50x64
You can set the --dataset argument to either mscoco or flickr30k.
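To give a rough idea of what a mapping between embedding spaces can look like, the sketch below fits a least-squares linear map from image embeddings to caption embeddings and then retrieves the nearest captions by cosine similarity. This is an illustration only, not the code in align_captions.py: the function names, the toy shapes, and the random data are all assumptions, and the link to the linear_reg training method is inferred from the flag name alone.

```python
import numpy as np


def fit_linear_map(img_emb: np.ndarray, txt_emb: np.ndarray) -> np.ndarray:
    """Least-squares W such that img_emb @ W approximates txt_emb."""
    W, *_ = np.linalg.lstsq(img_emb, txt_emb, rcond=None)
    return W


def top_k_captions(query: np.ndarray, caption_emb: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k caption embeddings most cosine-similar to query."""
    q = query / np.linalg.norm(query)
    c = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]


# Toy data standing in for CLIP features (hypothetical shapes and values).
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(200, 16))   # 200 images, 16-dim image space
true_W = rng.normal(size=(16, 8))      # ground-truth map used to build the toy targets
txt_emb = img_emb @ true_W             # 200 captions, 8-dim text space

W = fit_linear_map(img_emb, txt_emb)
mapped = img_emb[0] @ W                # project one image into the text space
nearest = top_k_captions(mapped, txt_emb, k=3)
```

In this toy setup the targets are an exact linear function of the inputs, so the fitted map recovers it almost perfectly; real embeddings would only be approximated.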
Finally, you can generate captions for the MS-COCO and Flickr30k datasets.
- MS-COCO:
Generate captions on the test split with the following command:
python generate_captions.py --k 18 --mscoco --vis-encoder RN50x64 --train-method linear_reg --decoding greedy
- Flickr30k:
To generate captions for Flickr30k, pass --flickr30k instead of --mscoco and point --datadir at the Flickr30k test images:
python generate_captions.py --datadir data/flickr30k/imgs_test.pkl --flickr30k --k 18 --vis-encoder RN50x64 --train-method linear_reg --decoding greedy
--k: The number of captions provided in the prompt.
--decoding: Supports greedy, sampling, nucleus, and topk.
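For readers unfamiliar with the decoding options, the sketch below shows how greedy, sampling, top-k, and nucleus decoding are commonly defined for a single next-token distribution. It is a generic illustration with hypothetical function names and toy logits, not the implementation used by generate_captions.py.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Pick the single most probable token index."""
    return max(range(len(probs)), key=probs.__getitem__)


def sample(probs, rng):
    """Draw one token index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = set(sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def nucleus_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


rng = random.Random(0)                      # fixed seed for repeatability
probs = softmax([2.0, 1.0, 0.5, -1.0])      # toy logits over a 4-token vocabulary
token_greedy = greedy(probs)
token_sampled = sample(probs, rng)
token_topk = sample(top_k_filter(probs, k=2), rng)
token_nucleus = sample(nucleus_filter(probs, p=0.8), rng)
```

Greedy decoding is deterministic, while the other three trade determinism for diversity; top-k and nucleus restrict sampling to the most probable tokens.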
The generated captions are saved as a JSON file within a new results directory.
The metrics (BLEU, CIDEr-D, ROUGE-L, and SPICE) are computed using the code from the tylin/coco-caption repository. The necessary annotation files can be found in the annotations/ directory.
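As a quick intuition for what these n-gram metrics measure, the sketch below computes the clipped n-gram precision that BLEU is built on. It is a simplified illustration with hypothetical function names: it omits the brevity penalty and the geometric mean over n-gram orders, uses plain whitespace tokenization, and is not a substitute for the coco-caption evaluation code.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams matched in the references, with counts clipped."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


refs = ["the cat is on the mat", "there is a cat on the mat"]
score = clipped_precision("the the the the the the the", refs, n=1)
```

Clipping is what stops a degenerate caption from scoring well by repeating one common word: here only two of the seven occurrences of "the" are credited.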
If you find this code or our research useful, please cite our paper:
ACM Reference Format:
Amit Kumar Jaiswal, Haiming Liu, Ingo Frommholz. 2025. Multimodal RAG Enhanced Visual Description. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM ’25), November 10–14, 2025, Seoul, Republic of Korea. ACM, Seoul, Korea, 5 pages. https://doi.org/10.1145/3746252.3760826
BibTeX:
@inproceedings{jaiswal2025multimodal,
author = {Jaiswal, Amit Kumar and Liu, Haiming and Frommholz, Ingo},
title = {Multimodal RAG Enhanced Visual Description},
booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)},
year = {2025},
publisher = {ACM},
address = {Seoul, Republic of Korea},
doi = {10.1145/3746252.3760826},
url = {https://doi.org/10.1145/3746252.3760826}
}