feat(embedding): add multimodal embedding items by MasakiMu319 · Pull Request #373 · jundot/omlx

MasakiMu319 · 2026-03-24T15:37:31Z

Summary

add structured items support to /v1/embeddings for text and image inputs
keep the existing input text path while making input and items mutually exclusive
route multimodal embedding requests through custom processor preparation when available
preserve native embedding loading safety by validating required weights and tensor shapes before allowing non-strict loading
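A minimal sketch of the mutual-exclusion rule between input and items, using only standard-library dataclasses. The class layout and the text/image field names are assumptions for illustration; the PR names EmbeddingInputItem and items but does not show the schema here. Rejecting a request that sets neither field is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class EmbeddingInputItem:
    # Field names are assumed; the PR only says items carry text and image inputs.
    text: Optional[str] = None
    image: Optional[str] = None


@dataclass
class EmbeddingRequest:
    model: str
    input: Optional[Union[str, List[str]]] = None
    items: Optional[List[EmbeddingInputItem]] = None

    def __post_init__(self) -> None:
        has_input = self.input is not None
        has_items = self.items is not None
        # Reject requests that set both fields or neither of them.
        if has_input == has_items:
            raise ValueError("provide exactly one of 'input' or 'items'")
```

With this shape, a legacy request such as EmbeddingRequest(model="m", input="hello") still validates unchanged, while a request that sets both fields fails before it reaches the engine.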

Why

The existing embeddings API only accepted plain text input, which blocked multimodal embedding models such as Qwen3-VL-Embedding-2B-mxfp8 from using their processor-defined text/image input path. The server also treated all embedding requests as text-only, so image-aware processors could not receive structured items at all.

At the same time, native embedding loading needed more precise validation. We want to allow extra Hugging Face checkpoint weights, but we should still fail closed when required native weights are missing or when provided tensor shapes do not match the expected model parameters.
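The fail-closed check described above can be sketched as follows, with weights represented only by their shapes. The function name validate_checkpoint and the dict-of-shapes representation are hypothetical; this is not the loader code from the PR.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def validate_checkpoint(expected: Dict[str, Shape], checkpoint: Dict[str, Shape]) -> List[str]:
    """Return the extra checkpoint keys that are safe to ignore.

    Raises ValueError when a required weight is missing or has the wrong shape.
    """
    missing = sorted(key for key in expected if key not in checkpoint)
    if missing:
        raise ValueError(f"missing required weights: {missing}")

    mismatched = sorted(
        key for key, shape in expected.items() if tuple(checkpoint[key]) != tuple(shape)
    )
    if mismatched:
        raise ValueError(f"shape mismatch for weights: {mismatched}")

    # Extra keys (for example, unused Hugging Face heads) are tolerated.
    return sorted(key for key in checkpoint if key not in expected)
```

Only after a check like this passes would non-strict loading be allowed, so the extra keys are skipped deliberately rather than silently.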

What

add EmbeddingInputItem and items to the embeddings request schema
normalize structured embedding items into internal {text, image} payloads
update the embeddings server path and engine interface to accept structured items
pass multimodal items directly to custom embedding processors via prepare_embedding_inputs / prepare_model_inputs
reject image inputs for standard text-only embedding processors with a 400
count usage from prepared multimodal inputs so image-only and text+image requests report non-zero usage
allow empty-string text items in the structured request format to match the legacy text input behavior
validate native embedding checkpoints for required keys and shape compatibility while still allowing extra checkpoint weights
add regression coverage for structured requests, multimodal usage accounting, native loading validation, and endpoint behavior
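The normalization and text-only rejection steps in the list above might look roughly like this. The helper names, the BadRequest exception, and the dict-based item shape are assumptions made for the sketch; only the {text, image} payload shape, the 400 status, and the empty-string rule come from the PR description.

```python
from typing import Any, Dict, List, Optional

Payload = Dict[str, Optional[str]]


class BadRequest(Exception):
    """Stand-in for whatever the server maps to an HTTP 400 response."""

    status_code = 400


def normalize_items(items: List[Dict[str, Any]]) -> List[Payload]:
    normalized: List[Payload] = []
    for index, item in enumerate(items):
        text = item.get("text")
        image = item.get("image")
        # Compare against None so that an empty-string text item stays valid.
        if text is None and image is None:
            raise BadRequest(f"items[{index}] needs text, image, or both")
        normalized.append({"text": text, "image": image})
    return normalized


def ensure_text_only(payloads: List[Payload]) -> List[str]:
    """Return plain texts for a text-only processor, rejecting any image input."""
    for index, payload in enumerate(payloads):
        if payload["image"] is not None:
            raise BadRequest(f"items[{index}] has an image, but this model is text-only")
    return [payload["text"] or "" for payload in payloads]
```

Checking for None instead of truthiness is what lets {"text": ""} pass, which mirrors the legacy behavior for empty text input.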

Testing

uv run pytest tests/test_embedding.py tests/integration/test_server_endpoints.py -k 'embedding'

Manual verification

started oMLX with Qwen3-VL-Embedding-2B-mxfp8 forced to embedding
verified /v1/embeddings succeeds for:
- text-only items
- image-only items
- text+image items
- image inputs provided as URL
- image inputs provided as data URI
- image inputs provided as local path
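For reference, request bodies covering the cases above could be built like this. The text and image field names inside each item, the example URL, and the local path are placeholders chosen for illustration; the script only constructs and prints JSON and does not contact a server.

```python
import base64
import json

MODEL = "Qwen3-VL-Embedding-2B-mxfp8"


def data_uri(raw: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URI."""
    return f"data:{mime};base64," + base64.b64encode(raw).decode("ascii")


def build_body(items: list) -> dict:
    return {"model": MODEL, "items": items}


# Placeholder bytes stand in for a real image file.
fake_image = b"not a real PNG"

bodies = {
    "text_only": build_body([{"text": "a photo of a cat"}]),
    "image_url": build_body([{"image": "https://example.com/cat.png"}]),
    "image_data_uri": build_body([{"image": data_uri(fake_image)}]),
    "image_local_path": build_body([{"image": "/path/to/cat.png"}]),
    "text_and_image": build_body(
        [{"text": "a photo of a cat", "image": "https://example.com/cat.png"}]
    ),
}

for name, body in bodies.items():
    print(name, json.dumps(body))
```

Each body would be sent as the JSON payload of a POST to /v1/embeddings.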

MasakiMu319 · 2026-03-26T02:57:42Z

Hi @jundot,

This PR focuses on adding multimodal embedding items support.

The fix for custom processor input preparation (fix(embedding): support custom processor input preparation) is in a separate PR (#369). This PR depends on that change and focuses on the multimodal embedding items feature.

When you have time, could you please help review it and let me know if there’s anything you’d like me to adjust or improve?

Thanks!

jundot · 2026-03-29T09:36:50Z

Thanks @MasakiMu319, reviewed the full diff and traced through every text embedding path to confirm nothing breaks for existing usage. Looks good.

I noticed a few minor things (dead is_likely_local_image_path(), bare except in _count_prepared_tokens(), stale total_tokens on compiled fallback, inconsistent oq_manager access pattern). None are blockers, I will clean them up in a follow-up commit.

Merging this.

Why:
- add structured multimodal embedding inputs without breaking the existing text input path
- support custom embedding processors that need image-aware input preparation
- keep native embedding loading safe by accepting extra HF weights while rejecting missing or shape-incompatible core weights

What:
- add an items-based embedding request format for text and image inputs
- route structured items through embedding normalization, engine, and custom processor preparation
- count usage from prepared multimodal inputs and preserve empty-string text items
- extend tests for multimodal requests, custom processors, and native loading validation

…fallback
- remove unused is_likely_local_image_path() and its import os
- reset total_tokens to None when compiled embed path fails, so the eager fallback recomputes instead of using a stale value
- narrow bare except in _count_prepared_tokens() to (TypeError, ValueError)
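The last two fixes can be illustrated with a small sketch. Everything here is hypothetical: the function names, the list-of-token-lists batch, the padded count on the compiled path, and the use of RuntimeError to signal a compiled failure are assumptions, not the code from the follow-up commit.

```python
from typing import Callable, List, Optional, Tuple

Batch = List[List[int]]
Vectors = List[List[float]]


def count_prepared_tokens(prepared: object) -> int:
    """Best-effort token count; only the expected failure types are swallowed."""
    try:
        return sum(len(row) for row in prepared)  # type: ignore[union-attr]
    except (TypeError, ValueError):
        # Narrowed from a bare except, so unrelated errors still propagate.
        return 0


def embed_with_fallback(
    batch: Batch,
    compiled: Callable[[Batch], Vectors],
    eager: Callable[[Batch], Vectors],
) -> Tuple[Vectors, int]:
    total_tokens: Optional[int] = None
    try:
        # Assumed: the compiled path records a count for its padded batch first.
        total_tokens = len(batch) * max(len(row) for row in batch)
        vectors = compiled(batch)
    except RuntimeError:
        # Drop the count so the eager path does not report a stale value.
        total_tokens = None
        vectors = eager(batch)
    if total_tokens is None:
        total_tokens = count_prepared_tokens(batch)
    return vectors, total_tokens
```

Without the reset in the except branch, the fallback result would carry the padded count from the failed compiled attempt.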

jundot · 2026-03-29T11:12:18Z

@MasakiMu319 v0.3.0rc1 is out with your multimodal embedding work included (#373, #369). https://github.com/jundot/omlx/releases/tag/v0.3.0rc1 — if you get a chance, please give it a test and let me know if anything looks off. thanks!

MasakiMu319 · 2026-03-29T12:53:55Z

I’ve tested the multimodal embedding flow in v0.3.0rc1, and it works as expected for text-only, image-only, and text+image items (including image URLs, data URIs, and local file paths). Thanks again for cutting the release and the follow-up fixes!

jundot force-pushed the main branch from 65b4ef1 to 2e39d71 on March 28, 2026 01:20

jundot force-pushed the feat/multimodal-embeddings branch from 1dd4212 to ba6a774 on March 29, 2026 09:38

jundot merged commit cdf0046 into jundot:main Mar 29, 2026

jundot merged 1 commit into jundot:main from MasakiMu319:feat/multimodal-embeddings
