feat(embedding): add multimodal embedding items (#373)
Conversation
Hi @jundot, this PR focuses on adding support for multimodal embedding items. The fix for custom processor input preparation ( When you have time, could you please help review it and let me know if there’s anything you’d like me to adjust or improve? Thanks!
Thanks @MasakiMu319, I reviewed the full diff and traced through every text embedding path to confirm nothing breaks for existing usage. Looks good. I noticed a few minor things (dead Merging this.
Why:
- add structured multimodal embedding inputs without breaking the existing text input path
- support custom embedding processors that need image-aware input preparation
- keep native embedding loading safe by accepting extra HF weights while rejecting missing or shape-incompatible core weights

What:
- add an items-based embedding request format for text and image inputs
- route structured items through embedding normalization, engine, and custom processor preparation
- count usage from prepared multimodal inputs and preserve empty-string text items
- extend tests for multimodal requests, custom processors, and native loading validation
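As a rough illustration of the items-based request format described above, a client-side payload builder might look like the sketch below. The `model`, `input`, and `items` field names follow the PR description; the helper function itself is hypothetical, not part of the codebase.

```python
import json

def build_embedding_request(model, *, input=None, items=None):
    """Build a /v1/embeddings payload; `input` and `items` are mutually exclusive."""
    if (input is None) == (items is None):
        raise ValueError("provide exactly one of 'input' or 'items'")
    body = {"model": model}
    if input is not None:
        body["input"] = input
    else:
        body["items"] = items  # each item is a {text, image} payload
    return body

# Mixed text and image items for a multimodal embedding model.
request = build_embedding_request(
    "Qwen3-VL-Embedding-2B-mxfp8",
    items=[
        {"text": "a photo of a cat"},
        {"image": "https://example.com/cat.png"},
    ],
)
print(json.dumps(request, indent=2))
```

The mutual-exclusivity check mirrors the contract stated in the summary: a request carries either the legacy text `input` or the structured `items`, never both.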
…fallback
- remove unused is_likely_local_image_path() and its `import os`
- reset total_tokens to None when compiled embed path fails, so the eager fallback recomputes instead of using a stale value
- narrow bare except in _count_prepared_tokens() to (TypeError, ValueError)
@MasakiMu319 v0.3.0rc1 is out with your multimodal embedding work included (#373, #369). https://github.com/jundot/omlx/releases/tag/v0.3.0rc1 — if you get a chance, please give it a test and let me know if anything looks off. Thanks!
I’ve tested the multimodal embedding flow in
Summary
- add `items` support to `/v1/embeddings` for text and image inputs
- keep the existing `input` text path while making `input` and `items` mutually exclusive

Why
The existing embeddings API only accepted plain text input, which blocked multimodal embedding models such as `Qwen3-VL-Embedding-2B-mxfp8` from using their processor-defined text/image input path. The server also treated all embedding requests as text-only, so image-aware processors could not receive structured items at all.

At the same time, native embedding loading needed more precise validation. We want to allow extra Hugging Face checkpoint weights, but we should still fail closed when required native weights are missing or when provided tensor shapes do not match the expected model parameters.
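The fail-closed validation described above could be sketched at the shape-metadata level like this. The helper name and dict-based interface are hypothetical; the real loader operates on actual tensors.

```python
def validate_native_weights(required_shapes, checkpoint_shapes):
    """Fail closed on missing or shape-incompatible core weights.

    required_shapes:   parameter name -> expected shape of the native model.
    checkpoint_shapes: tensor name -> shape found in the checkpoint.
    Extra Hugging Face tensors in the checkpoint are deliberately accepted.
    """
    missing = sorted(name for name in required_shapes if name not in checkpoint_shapes)
    if missing:
        raise ValueError(f"missing native weights: {missing}")
    mismatched = sorted(
        name
        for name, shape in required_shapes.items()
        if tuple(checkpoint_shapes[name]) != tuple(shape)
    )
    if mismatched:
        raise ValueError(f"shape-incompatible native weights: {mismatched}")
```

Note the asymmetry: unknown checkpoint keys pass silently (extra HF weights), while any problem with a required key raises immediately.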
What
- add `EmbeddingInputItem` and `items` to the embeddings request schema
- accept `{text, image}` payloads
- route structured items through `prepare_embedding_inputs` / `prepare_model_inputs`

Testing
```
uv run pytest tests/test_embedding.py tests/integration/test_server_endpoints.py -k 'embedding'
```

Manual verification
- `Qwen3-VL-Embedding-2B-mxfp8` forced to `embedding`
- `/v1/embeddings` succeeds for:
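A minimal sketch of the `EmbeddingInputItem` shape this PR introduces, assuming the validation rules stated in the description (empty-string text is preserved; an item must carry at least one field); the dataclass body is illustrative, not copied from the implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EmbeddingInputItem:
    """One structured embedding item carrying text, an image reference, or both."""
    text: Optional[str] = None
    image: Optional[str] = None

    def __post_init__(self):
        # An empty string is a valid (and preserved) text item; only an item
        # with neither field set is rejected.
        if self.text is None and self.image is None:
            raise ValueError("item must set at least one of 'text' or 'image'")

items = [
    EmbeddingInputItem(text=""),  # empty-string text must survive normalization
    EmbeddingInputItem(image="https://example.com/cat.png"),
    EmbeddingInputItem(text="a cat", image="https://example.com/cat.png"),
]
```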