
feat(embedding): add multimodal embedding items #373

Merged
jundot merged 1 commit into jundot:main from MasakiMu319:feat/multimodal-embeddings on Mar 29, 2026
@MasakiMu319
Contributor

Summary

  • add structured items support to /v1/embeddings for text and image inputs
  • keep the existing input text path while making input and items mutually exclusive
  • route multimodal embedding requests through custom processor preparation when available
  • preserve native embedding loading safety by validating required weights and tensor shapes before allowing non-strict loading

Why

The existing embeddings API only accepted plain text input, which blocked multimodal embedding models such as Qwen3-VL-Embedding-2B-mxfp8 from using their processor-defined text/image input path. The server also treated all embedding requests as text-only, so image-aware processors could not receive structured items at all.

At the same time, native embedding loading needed more precise validation. We want to allow extra Hugging Face checkpoint weights, but we should still fail closed when required native weights are missing or when provided tensor shapes do not match the expected model parameters.
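A minimal sketch of the fail-closed check described above, assuming weights are represented as a mapping from parameter name to shape. The function name `validate_checkpoint` and the example keys are hypothetical and are not taken from the oMLX code:

```python
from typing import List, Mapping, Tuple

Shape = Tuple[int, ...]


def validate_checkpoint(
    expected: Mapping[str, Shape], checkpoint: Mapping[str, Shape]
) -> List[str]:
    """Return the extra checkpoint keys that the model does not use.

    Raises ValueError when a required key is missing or a shape differs,
    so non-strict loading is only allowed after both checks pass.
    """
    missing = sorted(k for k in expected if k not in checkpoint)
    if missing:
        raise ValueError(f"missing required weights: {missing}")

    mismatched = sorted(
        k for k, shape in expected.items() if tuple(checkpoint[k]) != tuple(shape)
    )
    if mismatched:
        raise ValueError(f"shape mismatch for weights: {mismatched}")

    # Extra Hugging Face weights are tolerated and simply reported.
    return sorted(k for k in checkpoint if k not in expected)
```

Returning the ignored keys, rather than silently dropping them, makes it easy to log what was skipped during non-strict loading.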

What

  • add EmbeddingInputItem and items to the embeddings request schema
  • normalize structured embedding items into internal {text, image} payloads
  • update the embeddings server path and engine interface to accept structured items
  • pass multimodal items directly to custom embedding processors via prepare_embedding_inputs / prepare_model_inputs
  • reject image inputs for standard text-only embedding processors with a 400
  • count usage from prepared multimodal inputs so image-only and text+image requests report non-zero usage
  • allow empty-string text items in the structured request format to match the legacy text input behavior
  • validate native embedding checkpoints for required keys and shape compatibility while still allowing extra checkpoint weights
  • add regression coverage for structured requests, multimodal usage accounting, native loading validation, and endpoint behavior
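The request handling listed above could look roughly like the following. This is a sketch using standard-library dataclasses rather than the project's actual schema classes. Only the name `EmbeddingInputItem` and the `{text, image}` payload shape come from this PR; the helper names, the error messages, and the rule that an item needs at least one field are assumptions:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Union

Payload = Dict[str, Optional[str]]


@dataclass
class EmbeddingInputItem:
    text: Optional[str] = None
    image: Optional[str] = None  # URL, data URI, or local path


def normalize_request(
    input: Union[str, Sequence[str], None] = None,
    items: Optional[Sequence[EmbeddingInputItem]] = None,
) -> List[Payload]:
    """Turn either request form into internal {text, image} payloads."""
    # `input` and `items` are mutually exclusive: exactly one must be set.
    if (input is None) == (items is None):
        raise ValueError("provide exactly one of 'input' or 'items'")
    if input is not None:
        texts = [input] if isinstance(input, str) else list(input)
        return [{"text": t, "image": None} for t in texts]
    normalized: List[Payload] = []
    for item in items:
        # An empty string is still valid text; only None counts as absent.
        if item.text is None and item.image is None:
            raise ValueError("each item needs 'text', 'image', or both")
        normalized.append({"text": item.text, "image": item.image})
    return normalized


def ensure_text_only(payloads: Sequence[Payload]) -> None:
    """Guard for text-only processors; a server would map this to a 400."""
    if any(p["image"] is not None for p in payloads):
        raise ValueError("this embedding model does not accept image inputs")
```

Comparing against `None` instead of testing truthiness is what lets an empty-string text item through, matching the legacy `input` behavior.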

Testing

  • uv run pytest tests/test_embedding.py tests/integration/test_server_endpoints.py -k 'embedding'

Manual verification

  • started oMLX with Qwen3-VL-Embedding-2B-mxfp8 forced to embedding
  • verified /v1/embeddings succeeds for:
    • text-only items
    • image-only items
    • text+image items
    • image inputs provided as URL
    • image inputs provided as data URI
    • image inputs provided as local path
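For reference, request bodies covering the combinations above might be built as follows. This sketch assumes each item is a JSON object with optional `text` and `image` fields and that the body carries a `model` field. The example URL, the local path, and the helper names are placeholders, not values from the PR:

```python
import base64
import json

MODEL = "Qwen3-VL-Embedding-2B-mxfp8"  # model named in this PR


def image_data_uri(raw: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_payload(items: list) -> dict:
    """Assumed request body: a model name plus a list of {text, image} items."""
    return {"model": MODEL, "items": items}


# Placeholder bytes, not a real image; substitute the contents of an image file.
fake_image = b"placeholder image bytes"

payloads = {
    "text_only": build_payload([{"text": "a red bicycle"}]),
    "image_url": build_payload([{"image": "https://example.com/bike.png"}]),
    "image_data_uri": build_payload([{"image": image_data_uri(fake_image)}]),
    "image_local_path": build_payload([{"image": "/path/to/bike.png"}]),
    "text_and_image": build_payload(
        [{"text": "a red bicycle", "image": "https://example.com/bike.png"}]
    ),
}

for name, payload in payloads.items():
    # Each body would be POSTed to /v1/embeddings on the running server.
    print(name, json.dumps(payload)[:80])
```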

@MasakiMu319
Contributor Author

Hi @jundot,

This PR focuses on adding multimodal embedding items support.

The fix for custom processor input preparation (fix(embedding): support custom processor input preparation) is in a separate PR (#369). This PR depends on that change and focuses on the multimodal embedding items feature.

When you have time, could you please help review it and let me know if there’s anything you’d like me to adjust or improve?

Thanks!

@jundot
Owner

jundot commented Mar 29, 2026

Thanks @MasakiMu319, reviewed the full diff and traced through every text embedding path to confirm nothing breaks for existing usage. Looks good.

I noticed a few minor things (dead is_likely_local_image_path(), bare except in _count_prepared_tokens(), stale total_tokens on compiled fallback, inconsistent oq_manager access pattern). None are blockers; I will clean them up in a follow-up commit.

Merging this.

Why:
- add structured multimodal embedding inputs without breaking the existing text input path
- support custom embedding processors that need image-aware input preparation
- keep native embedding loading safe by accepting extra HF weights while rejecting missing or shape-incompatible core weights

What:
- add an items-based embedding request format for text and image inputs
- route structured items through embedding normalization, engine, and custom processor preparation
- count usage from prepared multimodal inputs and preserve empty-string text items
- extend tests for multimodal requests, custom processors, and native loading validation
jundot force-pushed the feat/multimodal-embeddings branch from 1dd4212 to ba6a774 on March 29, 2026 09:38
jundot merged commit cdf0046 into jundot:main on Mar 29, 2026
jundot added a commit that referenced this pull request Mar 29, 2026
…fallback

- remove unused is_likely_local_image_path() and its import os
- reset total_tokens to None when compiled embed path fails, so the eager fallback recomputes instead of using a stale value
- narrow bare except in _count_prepared_tokens() to (TypeError, ValueError)
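
The narrowed exception handling in the last item can be illustrated with a standalone sketch. It assumes the prepared inputs expose a nested `attention_mask` list whose entries sum to the token count. That layout and the public name `count_prepared_tokens` are assumptions for illustration and may not match the real `_count_prepared_tokens()`:

```python
from typing import Any, Mapping, Optional


def count_prepared_tokens(prepared: Mapping[str, Any]) -> Optional[int]:
    """Count tokens from an attention mask, or return None if not countable."""
    mask = prepared.get("attention_mask")
    try:
        return sum(int(v) for row in mask for v in row)
    except (TypeError, ValueError):
        # TypeError: mask is missing or not iterable.
        # ValueError: an entry cannot be converted to int.
        # Anything else (e.g. KeyboardInterrupt) now propagates instead of
        # being swallowed by a bare `except:`.
        return None
```

Returning `None` gives the caller a clear signal to recompute usage another way, which fits the commit's reset of `total_tokens` to `None` on the fallback path.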
@jundot
Owner

jundot commented Mar 29, 2026

@MasakiMu319 v0.3.0rc1 is out with your multimodal embedding work included (#373, #369). https://github.com/jundot/omlx/releases/tag/v0.3.0rc1 — if you get a chance, please give it a test and let me know if anything looks off. thanks!

@MasakiMu319
Contributor Author

I’ve tested the multimodal embedding flow in v0.3.0rc1, and it works as expected for text-only, image-only, and text+image items (including image URLs, data URIs, and local file paths). Thanks again for cutting the release and the follow-up fixes!

AlexTzk pushed a commit to AlexTzk/omlx that referenced this pull request Mar 29, 2026
AlexTzk pushed a commit to AlexTzk/omlx that referenced this pull request Mar 29, 2026
…piled fallback