A comprehensive multimodal retrieval system supporting text, images, video, and audio embeddings using CLIP and Gemini Embedding models.
- Multiple Embedding Models: CLIP (local), Gemini 001 (text-only), Gemini 2 (multimodal)
- Multimodal Support: Text, Images, Video, Audio, PDFs
- Vector Storage: ChromaDB for persistent storage
- LLM Integration: Groq via LiteLLM for generation
cd multimodal_rag
uv syncCreate a .env file with your API keys:
# Groq API Keys (for LLM)
GROQ_API_KEY=your_groq_api_key
# Google API Key (for Gemini Embedding)
GOOGLE_API_KEY=your_google_api_keyuv run python benchmark_full.py
| Model | Dimensions | Avg Time | Notes |
|---|---|---|---|
| CLIP | 384 | 0.173s | Local, free, fastest |
| Gemini 001 | 3072 | 3.822s | Text-only |
| Gemini 2 | 3072 | 4.354s | Multimodal |
Winner: CLIP is 25.2x faster than Gemini 2
| Model | Dimensions | Avg Time | Notes |
|---|---|---|---|
| CLIP | 1024 | 13.895s | Local, free |
| Gemini 001 | N/A | N/A | Not supported |
| Gemini 2 | 3072 | 9.619s | Native multimodal |
Winner: Gemini 2 is 1.4x faster than CLIP
| Model | Video | Audio | |
|---|---|---|---|
| CLIP | No | No | No |
| Gemini 001 | No | No | No |
| Gemini 2 | Yes | Yes | Yes |
Modality | CLIP | Gemini 001 | Gemini 2
-------------------------------------------------------
TEXT | 0.173s | 3.822s | 4.354s
IMAGE | 13.895s | N/A | 9.619s
VIDEO | N/A | N/A | Supported
AUDIO | N/A | N/A | Supported
PDF | N/A | N/A | Supported