feat(memory): add gemini-embedding-2-preview as supported embedding model #42487
Description
Summary
Add support for gemini-embedding-2-preview as an option alongside the existing gemini-embedding-001 default. Users should be able to opt in via config.
Background
Google released gemini-embedding-2-preview (docs) with significantly better specs than gemini-embedding-001:
| | gemini-embedding-001 | gemini-embedding-2-preview |
|---|---|---|
| Modalities | Text only | Text, image, video, audio, PDF |
| Dimensions | 768 | 3072 (configurable: 768, 1536, 3072) |
| Input tokens | 2048 | 8192 |
| Task types | ❌ | ✅ (RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, etc.) |
| Matryoshka | ❌ | ✅ (truncatable dimensions) |
The multimodal support is the headline feature: the same embedding space covers text, images, video frames, audio clips, and PDFs, enabling cross-modal retrieval (e.g. search with text and get an image back, or vice versa). The higher input token limit (8192 vs 2048) means fewer chunk splits for long memory entries.
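The "Matryoshka" row means a full-length vector can be shortened by keeping a prefix of its components. The following is a minimal sketch of doing that on the client, assuming the truncated vector should be rescaled to unit length before similarity comparisons; `truncateEmbedding` is a hypothetical name, not an existing helper in this codebase.

```typescript
// Hypothetical helper: keep the first `dims` components of a Matryoshka-style
// embedding and rescale the result to unit length.
function truncateEmbedding(vector: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims <= 0 || dims > vector.length) {
    throw new RangeError(`dims must be an integer in 1..${vector.length}`);
  }
  const head = vector.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  // An all-zero prefix cannot be normalized; return it unchanged.
  return norm === 0 ? head : head.map((x) => x / norm);
}

// Usage: shrink a toy 4-dimensional vector to 2 dimensions.
const shortened = truncateEmbedding([3, 4, 12, 0], 2);
console.log(shortened); // [0.6, 0.8]
```

Requesting a smaller `outputDimensionality` from the API achieves the same result server-side; local truncation is only useful when full-size vectors are already stored.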
Required Changes
1. src/memory/embedding-model-limits.ts
Add the new model's token limit:
"gemini:gemini-embedding-2-preview": 8192,
2. src/memory/embeddings-gemini.ts
- Support the `outputDimensionality` parameter (3072 default; configurable)
- Support the `taskType` field: pass `RETRIEVAL_DOCUMENT` for storage, `RETRIEVAL_QUERY` for search
- Support multimodal `parts` in the request body: `inlineData` (base64 image/audio/video) and `fileData` (URI for PDFs, videos via File API)
- The new fields are omitted for older models (backward compatible)
- Both single (`embedContent`) and batch (`batchEmbedContents`) endpoints support multimodal parts
Text request shape (backward compat):
{
"model": "models/gemini-embedding-2-preview",
"content": { "parts": [{ "text": "..." }] },
"taskType": "RETRIEVAL_DOCUMENT",
"outputDimensionality": 3072
}
Multimodal request shape (image example):
{
"model": "models/gemini-embedding-2-preview",
"content": {
"parts": [
{ "inlineData": { "mimeType": "image/png", "data": "<base64>" } },
{ "text": "optional caption or context" }
]
},
"taskType": "RETRIEVAL_DOCUMENT",
"outputDimensionality": 3072
}
Supported MIME types: image/png, image/jpeg, image/webp, image/gif, video/mp4, audio/mp3, audio/wav, application/pdf, and others supported by the Gemini API.
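One way to keep older models backward compatible is to attach `taskType` and `outputDimensionality` only when the selected model is known to accept them. The sketch below illustrates that idea; all type and function names are hypothetical, and the `fileUri` field name inside `fileData` is an assumption based on the general Gemini API part shape rather than something stated in this issue.

```typescript
// Sketch only: type and function names are hypothetical, not the project's API.
type GeminiTaskType = "RETRIEVAL_DOCUMENT" | "RETRIEVAL_QUERY";

type GeminiPart =
  | { text: string }
  | { inlineData: { mimeType: string; data: string } }
  | { fileData: { mimeType: string; fileUri: string } }; // `fileUri` is assumed

interface EmbedRequest {
  model: string;
  content: { parts: GeminiPart[] };
  taskType?: GeminiTaskType;
  outputDimensionality?: number;
}

// Assumption: only this model receives the new fields.
const EXTENDED_MODELS = new Set<string>(["gemini-embedding-2-preview"]);

function buildEmbedRequest(
  model: string,
  parts: GeminiPart[],
  purpose: "store" | "search",
  outputDimensionality: number = 3072,
): EmbedRequest {
  const request: EmbedRequest = { model: `models/${model}`, content: { parts } };
  if (EXTENDED_MODELS.has(model)) {
    // Storage writes embed documents; searches embed queries.
    request.taskType = purpose === "store" ? "RETRIEVAL_DOCUMENT" : "RETRIEVAL_QUERY";
    request.outputDimensionality = outputDimensionality;
  }
  return request;
}

// Usage: the older model gets the plain text-only shape.
const legacyRequest = buildEmbedRequest("gemini-embedding-001", [{ text: "hello" }], "store");
const previewRequest = buildEmbedRequest("gemini-embedding-2-preview", [{ text: "hello" }], "search");
console.log(JSON.stringify(legacyRequest));
console.log(JSON.stringify(previewRequest));
```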
3. src/memory/embeddings-model-normalize.ts
Ensure gemini-embedding-2-preview passes through normalization without being mangled.
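A quick check for this could look like the sketch below. The normalizer shown here is a stand-in: it assumes normalization means trimming whitespace and stripping provider-style prefixes, which may not match what `embeddings-model-normalize.ts` actually does. The prefix list and function name are hypothetical.

```typescript
// Hypothetical normalizer; the real rules live in embeddings-model-normalize.ts
// and may differ. Assumed behavior: trim, then strip known prefixes.
const KNOWN_PREFIXES = ["models/", "gemini/", "google/"];

function normalizeGeminiModel(raw: string, fallback: string = "gemini-embedding-001"): string {
  let model = raw.trim();
  if (model === "") return fallback;
  for (const prefix of KNOWN_PREFIXES) {
    if (model.startsWith(prefix)) model = model.slice(prefix.length);
  }
  return model;
}

// Each spelling should resolve to the same bare model name.
const spellings = [
  "gemini-embedding-2-preview",
  "models/gemini-embedding-2-preview",
  " gemini/gemini-embedding-2-preview ",
];
for (const spelling of spellings) {
  console.log(normalizeGeminiModel(spelling)); // gemini-embedding-2-preview
}
```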
4. New: src/memory/embeddings-gemini-multimodal.ts (or extend existing)
Helper to build a multimodal content object from a file path or URL: it detects the MIME type and chooses between inline base64 encoding and a File API upload based on size. This keeps embeddings-gemini.ts clean.
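The decision logic of such a helper might look like the sketch below. It covers only MIME detection and the inline-versus-upload choice, not the actual file reading or upload. All names are hypothetical, and the 4 MiB cutoff is an assumed placeholder rather than a documented API limit.

```typescript
// Sketch only: names and the size threshold are assumptions, not project code.
const MIME_BY_EXTENSION = new Map<string, string>([
  ["png", "image/png"],
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["webp", "image/webp"],
  ["gif", "image/gif"],
  ["mp4", "video/mp4"],
  ["mp3", "audio/mp3"],
  ["wav", "audio/wav"],
  ["pdf", "application/pdf"],
]);

// Assumed cutoff for inline base64 payloads; tune against the real API limits.
const MAX_INLINE_BYTES = 4 * 1024 * 1024;

interface UploadPlan {
  kind: "inline" | "fileApi";
  mimeType: string;
}

function detectMimeType(pathOrUrl: string): string | undefined {
  // Drop any query string or fragment before reading the extension.
  const clean = pathOrUrl.replace(/[?#].*$/, "");
  const dot = clean.lastIndexOf(".");
  if (dot < 0) return undefined;
  return MIME_BY_EXTENSION.get(clean.slice(dot + 1).toLowerCase());
}

function planUpload(pathOrUrl: string, sizeBytes: number): UploadPlan {
  const mimeType = detectMimeType(pathOrUrl);
  if (mimeType === undefined) throw new Error(`Unsupported file type: ${pathOrUrl}`);
  return { kind: sizeBytes <= MAX_INLINE_BYTES ? "inline" : "fileApi", mimeType };
}

// Usage: a small image goes inline, a large video goes through the File API.
console.log(planUpload("notes/diagram.PNG", 120000));
console.log(planUpload("https://example.com/talk.mp4?x=1", 50000000));
```

Keeping the plan separate from the I/O makes the part-construction logic easy to unit test without touching the network.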
5. Config / docs
- Update the `memorySearch.model` config reference to list `gemini-embedding-2-preview` as a valid option
- Document supported input types beyond text
- Add a note that switching models requires re-indexing existing memory (vector dimensions change: 768 → 3072)
- `DEFAULT_GEMINI_EMBEDDING_MODEL` stays `gemini-embedding-001`; existing configs unaffected
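The re-indexing note could also be backed by a runtime check. The sketch below assumes the index stores the model name and vector size it was built with; the `IndexMetadata` and `MemorySearchConfig` shapes and both function names are hypothetical. The default sizes are taken from the comparison table in this issue.

```typescript
// Hypothetical shapes: the real config and index metadata may look different.
interface MemorySearchConfig {
  model: string;
  dimensionality?: number; // optional override proposed in this issue
}

interface IndexMetadata {
  model: string;
  dimensions: number;
}

// Default output sizes as listed in the comparison table in this issue.
const DEFAULT_DIMENSIONS = new Map<string, number>([
  ["gemini-embedding-001", 768],
  ["gemini-embedding-2-preview", 3072],
]);

function resolveDimensions(config: MemorySearchConfig): number {
  const dims = config.dimensionality ?? DEFAULT_DIMENSIONS.get(config.model);
  if (dims === undefined) throw new Error(`Unknown embedding model: ${config.model}`);
  return dims;
}

// Vectors from different models are not comparable, so a model change forces
// a re-index even when the dimension count happens to match.
function needsReindex(index: IndexMetadata, config: MemorySearchConfig): boolean {
  return index.model !== config.model || index.dimensions !== resolveDimensions(config);
}

const existingIndex: IndexMetadata = { model: "gemini-embedding-001", dimensions: 768 };
console.log(needsReindex(existingIndex, { model: "gemini-embedding-001" })); // false
console.log(needsReindex(existingIndex, { model: "gemini-embedding-2-preview" })); // true
```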
Acceptance Criteria
- `model: "gemini-embedding-2-preview"` in config produces valid text embeddings
- Token limit resolved as 8192 for the new model
- `outputDimensionality` defaults to 3072; configurable via optional `memorySearch.dimensionality` field
- `taskType` passed appropriately for write vs. search operations
- Multimodal parts (image, audio, video, PDF) can be embedded when using the new model
- Existing `gemini-embedding-001` behavior unchanged (text-only, no extra fields)
- Unit tests cover limit resolution, text request shape, and multimodal part construction
- Docs updated with model option, multimodal support, and re-index warning
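The limit-resolution criterion could be exercised with a lookup like the one sketched below. It mirrors the `provider:model` key format from section 1, but the map, the function name, and the fallback value are assumptions for illustration, not the contents of `embedding-model-limits.ts`.

```typescript
// Hypothetical lookup using "provider:model" keys.
const EMBEDDING_TOKEN_LIMITS = new Map<string, number>([
  ["gemini:gemini-embedding-001", 2048],
  ["gemini:gemini-embedding-2-preview", 8192],
]);

// Assumed conservative default for models that are not listed.
const FALLBACK_TOKEN_LIMIT = 2048;

function resolveTokenLimit(provider: string, model: string): number {
  return EMBEDDING_TOKEN_LIMITS.get(`${provider}:${model}`) ?? FALLBACK_TOKEN_LIMIT;
}

console.log(resolveTokenLimit("gemini", "gemini-embedding-2-preview")); // 8192
console.log(resolveTokenLimit("gemini", "gemini-embedding-001")); // 2048
```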
Notes
- `gemini-embedding-2-preview` is in preview; the model name may change on GA. Consider aliasing once stable.
- Batch endpoint (`batchEmbedContents`) also accepts `outputDimensionality` at the top level; ensure both paths pass it.
- For large files (video, long audio), use the Gemini File API upload path rather than inline base64.
- Cross-modal retrieval (embed image, search with text) works out of the box since both live in the same vector space.
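Because all modalities share one vector space, retrieval needs no modality-specific logic: a text query vector is compared against stored vectors of any kind. The sketch below shows this with cosine similarity over toy vectors; the `StoredEmbedding` shape and function names are hypothetical, and real vectors would come from the embedding endpoint.

```typescript
// Hypothetical record shape for an indexed memory entry.
interface StoredEmbedding {
  id: string;
  modality: "text" | "image" | "audio" | "video" | "pdf";
  vector: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns a new array sorted from most to least similar to the query.
function rankBySimilarity(query: number[], entries: StoredEmbedding[]): StoredEmbedding[] {
  return entries
    .slice()
    .sort((x, y) => cosineSimilarity(query, y.vector) - cosineSimilarity(query, x.vector));
}

// Usage with toy 2-dimensional vectors: the image entry is closest to the query.
const indexed: StoredEmbedding[] = [
  { id: "note-1", modality: "text", vector: [0, 1] },
  { id: "photo-1", modality: "image", vector: [0.9, 0.1] },
];
console.log(rankBySimilarity([1, 0], indexed).map((entry) => entry.id)); // ["photo-1", "note-1"]
```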