Multimodal RAG

A multimodal retrieval-augmented generation (RAG) system that embeds text, images, video, audio, and PDFs using CLIP and Gemini Embedding models.

Features

Multiple Embedding Models: CLIP (local), Gemini 001 (text-only), Gemini 2 (multimodal)
Multimodal Support: Text, Images, Video, Audio, PDFs
Vector Storage: ChromaDB for persistent storage
LLM Integration: Groq via LiteLLM for generation
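
The retrieval step behind these features can be illustrated with a minimal sketch. It is not the project's implementation: it swaps ChromaDB for a plain in-memory list, uses hand-written toy vectors instead of real CLIP or Gemini embeddings, and assumes cosine similarity as the ranking metric, which may not match the project's ChromaDB configuration. The names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[tuple[str, float]]:
    """Rank stored (doc_id, vector) pairs by similarity to the query and keep the best k."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real embeddings (assumption).
    store = [
        ("cat.jpg", [1.0, 0.0]),
        ("notes.txt", [0.0, 1.0]),
        ("clip.mp4", [0.9, 0.1]),
    ]
    for doc_id, score in top_k([1.0, 0.0], store, k=2):
        print(doc_id, round(score, 3))
```

In a full pipeline, the retrieved items would then be passed as context to the LLM for generation.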

Installation

cd multimodal_rag
uv sync

Configuration

Create a .env file with your API keys:

# Groq API key (for LLM generation)
GROQ_API_KEY=your_groq_api_key
# Google API Key (for Gemini Embedding)
GOOGLE_API_KEY=your_google_api_key
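
Before running anything, it can help to confirm that both keys are actually set. The following is a sketch using only the standard library; how the project itself loads `.env` is not stated here, so the parser and the helper names `parse_env` and `missing_keys` are assumptions for illustration.

```python
import os
from pathlib import Path

# Key names taken from the .env example above.
REQUIRED_KEYS = ("GROQ_API_KEY", "GOOGLE_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    file_values = parse_env(env_path.read_text()) if env_path.exists() else {}
    # Variables already set in the shell take precedence over the file.
    shell_values = {k: os.environ[k] for k in REQUIRED_KEYS if k in os.environ}
    missing = missing_keys({**file_values, **shell_values})
    print("Missing keys:", ", ".join(missing) if missing else "none")
```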

Running the Benchmark

uv run python benchmark_full.py
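
For readers who want to reproduce an "Avg Time" style measurement on their own inputs, here is one way such a figure could be collected. This is a sketch and not the contents of `benchmark_full.py`: `average_latency` and `fake_embed` are hypothetical names, and `fake_embed` merely stands in for a real CLIP or Gemini call.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def average_latency(
    embed: Callable[[str], Sequence[float]],
    inputs: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return mean wall-clock seconds per embed() call over inputs x repeats."""
    timings = []
    for _ in range(repeats):
        for item in inputs:
            start = time.perf_counter()
            embed(item)
            timings.append(time.perf_counter() - start)
    return mean(timings)


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a model call; returns a fixed-size vector."""
    return [float(len(text))] * 8


if __name__ == "__main__":
    avg = average_latency(fake_embed, ["a cat", "a dog on a beach"])
    print(f"avg seconds per call: {avg:.6f}")
```

For API-backed models, a timing like this includes network round trips, so results will vary with connection quality and load.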

Benchmark Results

Text Embeddings

Model	Dimensions	Avg Time	Notes
CLIP	384	0.173s	Local, free, fastest
Gemini 001	3072	3.822s	Text-only
Gemini 2	3072	4.354s	Multimodal

Winner: CLIP is 25.2x faster than Gemini 2

Image Embeddings

Model	Dimensions	Avg Time	Notes
CLIP	1024	13.895s	Local, free
Gemini 001	N/A	N/A	Not supported
Gemini 2	3072	9.619s	Native multimodal

Winner: Gemini 2 is 1.4x faster than CLIP

Video/Audio/PDF

Model	Video	Audio	PDF
CLIP	No	No	No
Gemini 001	No	No	No
Gemini 2	Yes	Yes	Yes
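
The support matrix across the three results tables can be expressed as a small lookup for choosing a model per modality. This is an illustrative sketch: the `SUPPORT` mapping is transcribed from the tables above, while the function name `models_for` is an assumption and not part of the project's code.

```python
# Modalities each model can embed, per the benchmark tables.
SUPPORT = {
    "CLIP": {"text", "image"},
    "Gemini 001": {"text"},
    "Gemini 2": {"text", "image", "video", "audio", "pdf"},
}


def models_for(modality: str) -> list[str]:
    """Return the models that can embed the given modality, in table order."""
    key = modality.lower()
    return [name for name, kinds in SUPPORT.items() if key in kinds]


if __name__ == "__main__":
    for modality in ("text", "image", "video", "audio", "pdf"):
        print(f"{modality}: {', '.join(models_for(modality))}")
```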

Performance Summary

Modality | CLIP    | Gemini 001 | Gemini 2
-------------------------------------------
TEXT     | 0.173s  | 3.822s     | 4.354s
IMAGE    | 13.895s | N/A        | 9.619s
VIDEO    | N/A     | N/A        | Supported
AUDIO    | N/A     | N/A        | Supported
PDF      | N/A     | N/A        | Supported
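
The speedup figures quoted in the results are ratios of the average times. The sketch below recomputes them from the numbers reported above; the helper names `fastest` and `speedup` are illustrative only.

```python
# Average times in seconds, copied from the benchmark tables.
TIMES = {
    "text": {"CLIP": 0.173, "Gemini 001": 3.822, "Gemini 2": 4.354},
    "image": {"CLIP": 13.895, "Gemini 2": 9.619},
}


def fastest(modality: str) -> str:
    """Return the model with the lowest average time for a modality."""
    times = TIMES[modality]
    return min(times, key=times.get)


def speedup(modality: str, fast: str, slow: str) -> float:
    """Return how many times faster `fast` is than `slow` (slow time / fast time)."""
    return TIMES[modality][slow] / TIMES[modality][fast]


if __name__ == "__main__":
    print(fastest("text"), round(speedup("text", "CLIP", "Gemini 2"), 1))    # CLIP 25.2
    print(fastest("image"), round(speedup("image", "Gemini 2", "CLIP"), 1))  # Gemini 2 1.4
```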
