TejasS1233/multimodal_rag

Multimodal RAG

A multimodal retrieval-augmented generation (RAG) system that embeds text, images, video, audio, and PDFs using CLIP and Gemini Embedding models, stores the vectors in ChromaDB, and generates answers with Groq via LiteLLM.

Features

  • Multiple Embedding Models: CLIP (local), Gemini 001 (text-only), Gemini 2 (multimodal)
  • Multimodal Support: Text, Images, Video, Audio, PDFs
  • Vector Storage: ChromaDB for persistent storage
  • LLM Integration: Groq via LiteLLM for generation

Installation

cd multimodal_rag
uv sync

Configuration

Create a .env file with your API keys:

# Groq API key (for the LLM)
GROQ_API_KEY=your_groq_api_key
# Google API Key (for Gemini Embedding)
GOOGLE_API_KEY=your_google_api_key
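
As a rough sketch of how these keys might be validated before any API call, the snippet below checks that both variables named above are present. The helper name `missing_keys` is hypothetical and not part of this repository, and the sketch assumes the .env file has already been loaded into the process environment by whatever loader the project uses.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("GROQ_API_KEY", "GOOGLE_API_KEY")


def missing_keys(env=None):
    """Return the required keys that are absent or empty in `env`.

    `env` defaults to the real process environment; passing a plain dict
    makes the check easy to exercise without touching real credentials.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Calling `missing_keys()` at startup and stopping when the returned list is non-empty gives a clearer failure than an authentication error from the first embedding request.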

Running the Benchmark

uv run python benchmark_full.py
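
The contents of benchmark_full.py are not shown here, so the following is only a minimal sketch of how an "Avg Time" figure like those reported below could be measured. The names `average_time` and `dummy_embed` are hypothetical; `dummy_embed` is a placeholder standing in for a real CLIP or Gemini embedding call.

```python
import time
from statistics import mean


def average_time(embed_fn, inputs, repeats=3):
    """Return the mean wall-clock seconds per call of `embed_fn` over `inputs`."""
    durations = []
    for _ in range(repeats):
        for item in inputs:
            start = time.perf_counter()
            embed_fn(item)
            durations.append(time.perf_counter() - start)
    return mean(durations)


def dummy_embed(text):
    # Placeholder only: a real benchmark would call the embedding model here.
    return [float(len(text))]


avg_seconds = average_time(dummy_embed, ["a short query", "another query"])
print(f"{avg_seconds:.6f}s per call")
```

Timing each call separately with `time.perf_counter` keeps the measurement to wall-clock latency, which is what matters when comparing a local model against a remote API.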

Benchmark Results

Text Embeddings

Model       | Dimensions | Avg Time | Notes
---------------------------------------------------------
CLIP        | 384        | 0.173s   | Local, free, fastest
Gemini 001  | 3072       | 3.822s   | Text-only
Gemini 2    | 3072       | 4.354s   | Multimodal

Winner: CLIP is 25.2x faster than Gemini 2

Image Embeddings

Model       | Dimensions | Avg Time | Notes
------------------------------------------------------
CLIP        | 1024       | 13.895s  | Local, free
Gemini 001  | N/A        | N/A      | Not supported
Gemini 2    | 3072       | 9.619s   | Native multimodal

Winner: Gemini 2 is 1.4x faster than CLIP
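
The two "Winner" ratios follow directly from the average times reported above. This short check recomputes them; the dictionary and function names are illustrative only.

```python
# Average times in seconds, copied from the text and image results above.
text_times = {"CLIP": 0.173, "Gemini 001": 3.822, "Gemini 2": 4.354}
image_times = {"CLIP": 13.895, "Gemini 2": 9.619}


def speedup(slower, faster):
    """How many times faster the `faster` time is than the `slower` time."""
    return slower / faster


text_ratio = round(speedup(text_times["Gemini 2"], text_times["CLIP"]), 1)
image_ratio = round(speedup(image_times["CLIP"], image_times["Gemini 2"]), 1)
print(text_ratio, image_ratio)  # 25.2 1.4
```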

Video/Audio/PDF

Model       | Video | Audio | PDF
---------------------------------
CLIP        | No    | No    | No
Gemini 001  | No    | No    | No
Gemini 2    | Yes   | Yes   | Yes
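
One way to use this support matrix in code is a lookup that returns the models able to embed a given modality. The sketch below transcribes the results from this README; the `SUPPORT` mapping and `models_for` helper are hypothetical and not taken from the repository.

```python
# Support matrix transcribed from the benchmark results in this README.
SUPPORT = {
    "CLIP": {"text", "image"},
    "Gemini 001": {"text"},
    "Gemini 2": {"text", "image", "video", "audio", "pdf"},
}


def models_for(modality):
    """List, in alphabetical order, the models that can embed `modality`."""
    return sorted(name for name, kinds in SUPPORT.items() if modality in kinds)


print(models_for("image"))  # ['CLIP', 'Gemini 2']
print(models_for("video"))  # ['Gemini 2']
```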

Performance Summary

Modality    | CLIP         | Gemini 001   | Gemini 2
-------------------------------------------------------
TEXT        | 0.173s       | 3.822s       | 4.354s
IMAGE       | 13.895s      | N/A          | 9.619s
VIDEO       | N/A          | N/A          | Supported
AUDIO       | N/A          | N/A          | Supported
PDF         | N/A          | N/A          | Supported

About

Testing specialised embedders against multimodal Gemini 2 embeddings.
