mRAG-gim

Code for CIKM'25 paper - Multimodal RAG Enhanced Visual Description

Prerequisites

Begin by creating a Conda environment with the required packages using the following command:

conda env create -f requirements.yml

Data Setup

The model uses both the MS-COCO and Flickr30k datasets. Download and set them up as follows.

  • MS-COCO:

From the data directory, create a new directory for MS-COCO and download the training and validation images:

cd data

mkdir mscoco && cd mscoco
wget http://images.cocodataset.org/zips/train2014.zip
unzip train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
unzip val2014.zip
cd ../..

With the commands above, the images end up under data/mscoco (in train2014/ and val2014/).
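
Before moving on, it can help to confirm that both archives were unpacked where expected. The following is a minimal sketch and not part of the repository: it assumes the directory created by the download commands above (data/mscoco with train2014/ and val2014/ inside) and that the images are .jpg files. The helper name count_images is hypothetical.

```python
from pathlib import Path


def count_images(root, splits=("train2014", "val2014")):
    """Return {split: number of .jpg files} for each split folder under root.

    Assumes the layout created by the download commands above;
    a missing folder is reported as 0 instead of raising.
    """
    root = Path(root)
    counts = {}
    for split in splits:
        split_dir = root / split
        if split_dir.is_dir():
            counts[split] = sum(1 for _ in split_dir.glob("*.jpg"))
        else:
            counts[split] = 0
    return counts


if __name__ == "__main__":
    # Run from the repository root after the download step.
    for split, n in count_images("data/mscoco").items():
        status = "ok" if n > 0 else "missing or empty"
        print(f"{split}: {n} images ({status})")
```

A count of 0 for either split usually means the archive was unpacked in a different directory than the one being checked.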

  • Flickr30k:

Apply for access to the Flickr30k dataset and save the images to ./datasets/flickr30k.

Annotations:

Download the train/val/test annotations for both datasets from here and save them to the annotations directory.

Parse the datasets:

Once all the files are in place, parse both datasets with these scripts:

python data_prep/pcoco.py
python data_prep/pflickr30k.py

Reproducing Results

Follow these steps to reproduce the results from our paper.

Model Preparation

Before you begin, you will need to prepare the models.

  • Decoder Model:

(Optional) For the decoder model, download DeBERTaV3-base by following the instructions on its Hugging Face page.

Extracting Embeddings

Next, you need to extract the CLIP language embeddings for the captions.

Generate embeddings:

Run the following command. This process will take some time as it extracts embeddings for all CLIP backbones and saves them to the data/ directory.


python data_prep/gen_embeddings.py --datadir datasets/mscoco --data mscoco --vis-encoder RN50x64

Computing Mappings

Before computing the mappings, you will need to download the English spaCy pipeline for stop-word removal.

Download spaCy:

python -m spacy download en_core_web_sm

Now, you can compute the mappings by executing this command:

python align_captions.py --dataset mscoco --vis-encoder RN50x64

You can set the --dataset argument to either mscoco or flickr30k.
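
To compute the mappings for both datasets in one go, the command can be wrapped in a small driver. This is only a sketch: it reuses the flags shown above, assumes embeddings have already been generated for both datasets with the same encoder, and the function names alignment_commands and run_all are hypothetical.

```python
import shlex
import subprocess
import sys


def alignment_commands(datasets=("mscoco", "flickr30k"), vis_encoder="RN50x64"):
    """Build one align_captions.py command line per dataset."""
    return [
        [sys.executable, "align_captions.py",
         "--dataset", name, "--vis-encoder", vis_encoder]
        for name in datasets
    ]


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print only; call run_all(alignment_commands()) from the
    # repository root to actually execute the commands.
    for cmd in alignment_commands():
        print(shlex.join(cmd))
```

Printing first makes it easy to check the exact command lines before committing to a long run.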

Generating Captions

Finally, you can generate captions for the MS-COCO and Flickr30k datasets.

  • MS-COCO:

Generate captions on the test split with the following command:

python generate_captions.py --k 18 --mscoco --vis-encoder RN50x64 --train-method linear_reg --decoding greedy

  • Flickr30k:

To generate captions for the Flickr30k dataset, set --datadir to the Flickr30k test images and pass --flickr30k instead of --mscoco:

python generate_captions.py --datadir data/flickr30k/imgs_test.pkl --flickr30k --k 18 --vis-encoder RN50x64 --train-method linear_reg --decoding greedy

Hyperparameters:

--k: The number of captions provided in the prompt.

--decoding: Supports greedy, sampling, nucleus, and topk.
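
For readers unfamiliar with the four decoding options, the sketch below shows how each one would pick the next token from a single step's logits. It is a generic illustration in plain Python, not the code used by generate_captions.py; the function pick_token and its parameters top_k and top_p (with their defaults) are assumptions made for this example.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, strategy="greedy", top_k=3, top_p=0.9, rng=random):
    """Choose the next token id from one step's logits.

    Illustrative only: parameter names and defaults are assumptions,
    not the options of generate_captions.py.
    """
    probs = softmax(logits)
    ids = list(range(len(probs)))
    if strategy == "greedy":
        # Always take the single most likely token.
        return max(ids, key=lambda i: probs[i])
    if strategy == "sampling":
        # Draw from the full distribution.
        candidates = ids
    else:
        ranked = sorted(ids, key=lambda i: probs[i], reverse=True)
        if strategy == "topk":
            # Keep only the top_k most likely tokens.
            candidates = ranked[:top_k]
        elif strategy == "nucleus":
            # Keep the smallest prefix whose cumulative mass reaches top_p.
            candidates, mass = [], 0.0
            for i in ranked:
                candidates.append(i)
                mass += probs[i]
                if mass >= top_p:
                    break
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    # random.choices renormalises the weights of the surviving candidates.
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, -1.0]
    for name in ("greedy", "sampling", "nucleus", "topk"):
        print(name, pick_token(logits, name, rng=random.Random(0)))
```

Greedy decoding is deterministic, which makes it the natural choice when reproducing reported numbers; the other three trade reproducibility for more varied captions.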

Results and Evaluation

The generated captions are saved as a JSON file within a new results directory.

The metrics (BLEU, CIDEr-D, ROUGE-L, and SPICE) are computed using the code from the tylin/coco-caption repository. The necessary annotation files can be found in the annotations/ directory.
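
The coco-caption tools load results as a JSON list of objects, each with an image_id and a caption. Whether the file written to the results directory already has exactly this shape is not stated here, so the sketch below is a hedged pre-check that can be run on the loaded JSON before evaluation. The function name validate_caption_results is hypothetical, and the integer and uniqueness checks assume COCO-style image ids with one caption per image.

```python
import json


def validate_caption_results(entries):
    """Return a list of problems found in COCO-style caption results.

    Expected shape: [{"image_id": <int>, "caption": <str>}, ...].
    An empty return value means the entries look usable.
    """
    if not isinstance(entries, list):
        return ["top-level JSON value must be a list"]
    problems = []
    seen = set()
    for idx, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {idx}: not an object")
            continue
        image_id = entry.get("image_id")
        caption = entry.get("caption")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(image_id, int) or isinstance(image_id, bool):
            problems.append(f"entry {idx}: image_id must be an integer")
        elif image_id in seen:
            problems.append(f"entry {idx}: duplicate image_id {image_id}")
        else:
            seen.add(image_id)
        if not isinstance(caption, str) or not caption.strip():
            problems.append(f"entry {idx}: caption must be a non-empty string")
    return problems


if __name__ == "__main__":
    # Inline sample; in practice, pass json.load(...) of the results file.
    sample = json.loads('[{"image_id": 42, "caption": "a dog on a beach"}]')
    print(validate_caption_results(sample) or "looks fine")
```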

Citation

If you find this code or our research useful, please cite our paper:

ACM Reference Format:

Amit Kumar Jaiswal, Haiming Liu, Ingo Frommholz. 2025. Multimodal RAG Enhanced Visual Description. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM ’25), November 10–14, 2025, Seoul, Republic of Korea. ACM, Seoul, Korea, 5 pages. https://doi.org/10.1145/3746252.3760826

BibTeX:

@inproceedings{jaiswal2025multimodal,
  author    = {Jaiswal, Amit Kumar and Liu, Haiming and Frommholz, Ingo},
  title     = {Multimodal RAG Enhanced Visual Description},
  booktitle = {Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)},
  year      = {2025},
  publisher = {ACM},
  address   = {Seoul, Republic of Korea},
  doi       = {10.1145/3746252.3760826},
  url       = {https://doi.org/10.1145/3746252.3760826}
}
