Releases: jundot/omlx
v0.3.4
This release fixes the memory increase issue introduced by mlx-vlm batching support in v0.3.3.
Highlights
Improved Gemma 4 support
Upgraded mlx-lm to 4469ad4 (BatchGenerator refactor) and mlx-vlm to 90732bd. Gemma 4 vision, audio, and MoE model support. Multi-image vision tower crash fix for different resolutions. Gemma 4 reasoning parser and agentic tool calling by @TipKnuckle (#565). Added omlx-specific customizations for Gemma 4 VLM continuous batching compatibility.
TurboQuant: near-zero overhead decode
Rewrote BatchTurboQuantKVCache as a TurboQuantKVCache subclass instead of wrapping with delegation. Combined with @Blaizzy's new fused Metal kernels (score+softmax+value in 1 dispatch), decode overhead dropped from 43% to 8% vs baseline.
Fixed double-softmax bug in hybrid attention (#556) that caused models to lose focus and loop. Fixed continuous batching shape mismatch (#559) when multiple requests join the batch.
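The wrapper-to-subclass refactor can be sketched as follows. This is an illustrative toy, not the actual omlx code: class internals are made up, and only the structural difference (per-call forwarding vs inherited methods) is shown.

```python
# Hypothetical sketch of replacing a delegation wrapper with a subclass,
# so hot-path methods are inherited directly instead of forwarded per call.
# Class internals are illustrative, not the real omlx implementation.

class TurboQuantKVCache:
    """Stand-in for the single-request quantized KV cache."""
    def __init__(self):
        self.offset = 0

    def update(self, n_tokens: int) -> int:
        self.offset += n_tokens
        return self.offset


class WrappedBatchCache:
    """Old style: wrap with delegation -- every call pays a forwarding hop."""
    def __init__(self):
        self._inner = TurboQuantKVCache()

    def update(self, n_tokens: int) -> int:
        return self._inner.update(n_tokens)   # per-call indirection


class BatchTurboQuantKVCache(TurboQuantKVCache):
    """New style: subclass -- inherited methods run with no forwarding;
    only batch-specific behavior is overridden or added."""
    def extend(self, caches) -> None:
        # illustrative batch-management hook
        for c in caches:
            self.offset = max(self.offset, c.offset)


wrapped = WrappedBatchCache()
direct = BatchTurboQuantKVCache()
assert wrapped.update(4) == direct.update(4) == 4
```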
Qwen3.5-4B-MLX-4bit, 8k context, 3-bit TQ:
| | baseline | TurboQuant | ratio |
|---|---|---|---|
| decode | 117.9 tok/s | 109.6 tok/s | 0.93x |
| peak mem | 5.19 GB | 4.90 GB | -5.6% |
| KV cache | 0.30 GB | 0.10 GB | -67% |
Continuous batching with TurboQuant
TurboQuant now works with multiple concurrent requests. Batch operations (merge/extract/extend/filter) handle quantized state correctly across batch size changes.
New Features
- Vision feature cache for multi-turn image reuse
- MCP tool call loop, engine status timeline in chat UI by @rayone (#509)
- Gemma 4 reasoning parser and agentic tool calling by @TipKnuckle (#565)
- Auto-update Homebrew formula on tag push
Bug Fixes
- Fix VLM decode model memory duplication (#582)
- Fix VLM batched decode degeneration via mlx-lm decode model
- Fix grammar constrained generation for new BatchGenerator pipelining
- Fix Gemma 4 multi-image vision tower crash on different resolutions
- Preserve image_url parts in Gemma 4 message extractor
- Apply Gemma 4 message extractor in Anthropic handler
- Fix VLM sanitize proxy missing audio_tower attribute
- Pad undersized RotatingKVCache on SSD restore to prevent merge crash
- Fix null num_experts in oQ for Gemma 4 dense models (#554)
- Fix bfloat16 audio in TTS wav conversion (#551)
- Bypass proxy for local oMLX health checks by @MKuBMax (#558)
- Fix chat UI: hide `_ui:false` messages, remove stray `</think>` on abort
Dependencies
- Bump mlx-lm to 4469ad4 (BatchGenerator refactor + Gemma 4)
- Bump mlx-vlm to 90732bd (fused TurboQuant Metal kernels)
New Contributors
- @MKuBMax — Bypass proxy for local health checks (#558)
- @rayone — MCP tool call loop, engine status timeline, and chat UI polish (#509)
Full changelog: v0.3.2...v0.3.4
v0.3.2
Highlights
Gemma 4 support
Bumped mlx-vlm to 43b9b20, which adds Gemma 4 vision, audio, and MoE model support. Also includes chunked prefill fixes for KV-shared models.
TurboQuant is back
Based on @Blaizzy's mlx-vlm TurboQuant integration. Imports mlx-vlm's multi-codec engine directly — Prod, MSE, Polar, and Split codecs with fractional bit support (e.g. 3.5-bit).
I built BatchTurboQuantKVCache on top of mlx-vlm's single-request TurboQuantKVCache for omlx's continuous batching scheduler. KV cache is quantized immediately during prefill to reduce peak memory, with batch-quantized decode tokens (every 32 tokens) and hybrid attention for buffered fp16 + quantized state.
Enable from the admin dashboard per-model settings or via model_settings.json.
Qwen3.5-27B-4bit, 3-bit TQ:
| | 32k baseline | 32k TQ | 128k baseline | 128k TQ |
|---|---|---|---|---|
| KV cache mem | 2.14 GB | 0.54 GB (-75%) | 8.14 GB | 1.70 GB (-79%) |
| Peak mem | 22.47 GB | 21.11 GB (-1.4 GB) | 37.66 GB | 33.55 GB (-4.1 GB) |
| Prefill | 362 tok/s | 353 tok/s | 238 tok/s | 226 tok/s |
| Decode | 28.4 tok/s | 17.9 tok/s | 19.4 tok/s | 7.3 tok/s |
Peak memory savings scale with context length. Decode speed tradeoff is inherent to quantized KV attention — TQ is designed for memory-constrained long context, not speed.
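The buffered quantization scheme above (fp16 tail flushed into quantized storage every 32 decode tokens) can be sketched in miniature. This is a toy model under stated assumptions, not the omlx implementation: it uses symmetric 4-bit integer quantization on numpy arrays in place of TurboQuant's codebook kernels.

```python
import numpy as np

GROUP = 32  # decode tokens buffered in fp16 before each batch quantization

class BufferedQuantCache:
    """Toy model of the hybrid scheme (not the omlx implementation):
    recent decode tokens stay in an fp16 buffer, and every 32 tokens the
    buffer is flushed into 4-bit quantized storage, so attention mixes
    quantized history with a small full-precision tail."""

    def __init__(self):
        self.groups = []   # (int8 indices, scale) per flushed group
        self.buffer = []   # fp16 tail, fewer than GROUP tokens

    def append(self, kv: np.ndarray) -> None:
        self.buffer.append(kv.astype(np.float16))
        if len(self.buffer) == GROUP:
            block = np.stack(self.buffer).astype(np.float32)
            scale = float(np.abs(block).max()) / 7 + 1e-8  # symmetric 4-bit
            idx = np.clip(np.round(block / scale), -7, 7).astype(np.int8)
            self.groups.append((idx, scale))
            self.buffer = []

    def materialize(self) -> np.ndarray:
        """Dequantized history + fp16 tail, as attention would see it."""
        parts = [idx.astype(np.float32) * scale for idx, scale in self.groups]
        if self.buffer:
            parts.append(np.stack(self.buffer).astype(np.float32))
        return np.concatenate(parts)

cache = BufferedQuantCache()
tokens = np.linspace(-1.0, 1.0, 40 * 4).reshape(40, 4).astype(np.float32)
for t in tokens:
    cache.append(t)

assert len(cache.groups) == 1 and len(cache.buffer) == 8
approx = cache.materialize()
assert approx.shape == (40, 4)
assert np.abs(approx - tokens).max() < 0.1   # coarse but bounded error
```

The design point the sketch illustrates: quantizing in groups amortizes the quantization cost over 32 tokens, while the fp16 tail keeps the most recent tokens at full precision.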
Bug Fixes
- Fix EXIF orientation not applied when loading images in VLM engine
- Fix chunked prefill for Gemma 4 KV-shared models
Dependencies
- Bump mlx-vlm to 43b9b20 (Gemma 4, TurboQuant)
Full changelog: v0.3.1...v0.3.2
v0.3.1
Bug Fixes
- fix TTL expiration unloading models with active in-flight requests — all engine types (LLM, VLM, embedding, reranker, STT, TTS, STS) now report active request count so TTL check skips busy engines (#522)
- fix VLM mRoPE position state lost during prefill — multi-turn conversations on Qwen2-VL/Qwen2.5-VL could produce degraded output (#531)
- fix race condition between snapshot writer thread and cleanup
- fix thinking fallback tool call extraction too greedy — tightened regex to prevent false matches (#484)
- fix model aliases not resolving in audio endpoints (#525)
- fix missing mlx-audio optional deps for TTS/STT/STS (#515)
- fix `force_lm` benchmark loading failing on VLM-only models (#487)
Improvements
- make xgrammar optional — auto-detects install method (pip vs uv) and shows correct install command
- enable faulthandler for native crash diagnostics (#511, #520)
- re-download notice toggle in HF uploader
- oQ: update descriptions to reflect current implementation, temporarily disable enhanced quantization UI
- deps: bump mlx-vlm to 9db27b5
New Contributors
- @latent-variable made their first contribution in #517
v0.3.0
Highlights
Audio support — STT, TTS, STS
mlx-audio integration brings three new engine types for audio models on Apple Silicon, by @ethannortharc.
- STT (Speech-to-Text): Whisper, Qwen3-ASR, Parakeet, Voxtral
- TTS (Text-to-Speech): Qwen3-TTS, Kokoro, F5-TTS, Sesame CSM, Dia, Spark, CosyVoice
- STS (Speech-to-Speech): DeepFilterNet, MossFormer2, SAMAudio, LFM2.5-Audio
Three new OpenAI-compatible endpoints: /v1/audio/transcriptions, /v1/audio/speech, /v1/audio/process (oMLX-specific). Audio models are automatically detected from the mlx-audio registry and show up in the admin dashboard alongside LLM/VLM models.
Optional dependency — pip install 'omlx[audio]'. Homebrew and DMG builds include mlx-audio by default. (#365)
XGrammar constrained decoding
xgrammar-based structured output generation by @leuski. Enforces grammar constraints at the logit level using bitmasks, running in parallel with the model forward pass via mx.async_eval.
Supported grammar types:
- `json` — JSON schema validation
- `regex` — regular expression patterns
- `grammar` — EBNF/GBNF grammars
- `choice` — allowed string lists
Uses vLLM-compatible structured_outputs field (in extra_body). Per-model reasoning_parser config maps xgrammar's structural tags to model protocols (Qwen, Harmony, DeepSeek, Llama, etc.). Performance overhead is 9–24% on decode (no TTFT impact), decreasing with larger models.
Optional dependency — pip install 'omlx[grammar]'. Homebrew and DMG builds include xgrammar by default. Without it, response_format falls back to prompt injection and structured_outputs returns a 400 with install instructions. (#335)
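A request body using the vLLM-compatible `structured_outputs` field might look like the following. This is illustrative only: the model name and JSON schema are placeholders, and OpenAI SDK clients would pass `structured_outputs` via `extra_body` as described above.

```python
import json

# Illustrative request body for the structured_outputs field.
# Model name and schema are placeholders, not real endpoints.
payload = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Name a city and its population."}],
    "structured_outputs": {
        "json": {  # one of the grammar types: json, regex, grammar, choice
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "population": {"type": "integer"},
            },
            "required": ["city", "population"],
        }
    },
}

body = json.dumps(payload)
assert set(json.loads(body)["structured_outputs"]) == {"json"}
```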
New Features
- XTC sampler — XTC (eXclude Top Choices) sampling support. Pass `xtc_probability` and `xtc_threshold` through any API endpoint. Defaults to 0.0 (disabled) (#337 by @blightbow)
- MCP Streamable HTTP — MCP now supports Streamable HTTP transport in addition to stdio (#286 by @tianfeng98)
- Multimodal embedding items — `/v1/embeddings` accepts structured `items` with text + image input. Tested with `Qwen3-VL-Embedding-2B-mxfp8` (#373 by @MasakiMu319)
- Custom processor embedding support — Embedding requests route through custom processor hooks when available, fixing models like Qwen3-VL-Embedding that broke on the generic tokenizer path (#369 by @MasakiMu319)
- System prompt support in chat UI — Chat interface now accepts system prompts
- Clear all SSD cache button — Admin dashboard has a new button to clear all SSD cache blocks
- SSD cache size display — Shows SSD cache size even when no models are loaded
- Responsive admin dashboard — Admin dashboard now works on mobile devices
- Real-time menubar updates — macOS menubar status (Stopped → Starting → Running) and button states now update in real time while the menu is open, without needing to close and reopen (#426 by @EmotionalAmo)
- App bundle size reduction — Stripped torch, cv2, pyarrow, pandas, sympy from the app bundle (~780MB saved)
- mlx-vlm bump — Updated to v0.4.2 (7f7a04c)
- mlx-embeddings bump — Updated to v0.1.0 (32981fa)
Bug Fixes
- Fix `ContentPart` list content causing 500 error when Qwen Code CLI sends `[{"type": "text", ...}]` instead of a string (#433 by @mbauer)
- Fix prefix index storing mutable reference to `block_ids`, causing cache corruption on CoW block changes (#391 by @andafterall)
- Fix Anthropic `document` content blocks (base64 PDF) rejected by Pydantic validation (#434)
- Fix "Total Tokens Processed" metric ignoring reasoning/thinking tokens — renamed to "Total Prefill Tokens" for clarity (#430)
- Fix MCP client resource leak on partial connect failure
- Fix MCP config errors crashing server startup — now gracefully disables MCP features (#474)
- Fix `ValueError` from `processor.apply_chat_template` in VLM engine
- Fix fake `<think>` tag prepended in TRACE log output
- Fix VLM cache proxy using wrong offset for batched mask sizing
- Fix spec `prefill_threshold` not propagating from model_settings to scheduler
- Fix `max_position_embeddings` not detected in nested `text_config`
- Fix `voice` param not mapped to `instruct` for VoiceDesign TTS models
- Fix `voice` param not routed to `generate()` for CustomVoice TTS models (#467)
- Fix `instruct` fallback degrading CustomVoice audio quality
- Fix Devstral/Mistral tool calling broken by empty `tool_call_end` marker
- Fix `settings.json` overwritten on restart without explicit CLI args (#471)
- Fix admin dashboard Lucide icons not rendering in Safari
- Fix 8-bit quantization inconsistency — unified to affine/gs64 (removed mxfp8/gs32)
- Fix admin dashboard polling continuing when browser tab is hidden (#352)
- Fix `_prefix_index` type annotations to match tuple storage
- Fix menubar menu status duplication and `_build_menu` guard during menu open
- Fix mlx-audio dependency conflicts with mlx-lm version pinning
- Fix ffmpeg discovery and video container support in GUI app
- Fix auto-generate README when frontmatter-only stub exists
- Sync 25 stale tests with current implementation
New Contributors
- @ethannortharc — Audio integration (STT/TTS/STS) (#365)
- @leuski — XGrammar constrained decoding (#335)
- @tianfeng98 — MCP Streamable HTTP (#286)
- @MasakiMu319 — Multimodal embedding items (#373, #369)
- @mbauer — ContentPart list type error fix (#433)
- @andafterall — Prefix index mutable reference fix (#391)
- @EmotionalAmo — Menubar real-time update (#426)
Full changelog: v0.2.24...v0.3.0
v0.3.0rc1
This is a pre-release of oMLX 0.3.0. It will go through 1 day of testing before the official release. If you find any bugs, please report them on the issues page!
Highlights
Audio support — STT, TTS, STS
mlx-audio integration brings three new engine types for audio models on Apple Silicon, by @ethannortharc.
- STT (Speech-to-Text): Whisper, Qwen3-ASR, Parakeet, Voxtral
- TTS (Text-to-Speech): Qwen3-TTS, Kokoro, F5-TTS, Sesame CSM, Dia, Spark, CosyVoice
- STS (Speech-to-Speech): DeepFilterNet, MossFormer2, SAMAudio, LFM2.5-Audio
Three new OpenAI-compatible endpoints: /v1/audio/transcriptions, /v1/audio/speech, /v1/audio/process (oMLX-specific). Audio models are automatically detected from the mlx-audio registry and show up in the admin dashboard alongside LLM/VLM models.
Optional dependency — pip install 'omlx[audio]'. Homebrew and DMG builds include mlx-audio by default. (#365)
XGrammar constrained decoding
xgrammar-based structured output generation by @leuski. Enforces grammar constraints at the logit level using bitmasks, running in parallel with the model forward pass via mx.async_eval.
Supported grammar types:
- `json` — JSON schema validation
- `regex` — regular expression patterns
- `grammar` — EBNF/GBNF grammars
- `choice` — allowed string lists
Uses vLLM-compatible structured_outputs field (in extra_body). Per-model reasoning_parser config maps xgrammar's structural tags to model protocols (Qwen, Harmony, DeepSeek, Llama, etc.). Performance overhead is 9–24% on decode (no TTFT impact), decreasing with larger models.
Optional dependency — pip install 'omlx[grammar]'. Homebrew and DMG builds include xgrammar by default. Without it, response_format falls back to prompt injection and structured_outputs returns a 400 with install instructions. (#335)
New Features
- XTC sampler — XTC (eXclude Top Choices) sampling support. Pass `xtc_probability` and `xtc_threshold` through any API endpoint. Defaults to 0.0 (disabled) (#337 by @blightbow)
- MCP Streamable HTTP — MCP now supports Streamable HTTP transport in addition to stdio (#286 by @tianfeng98)
- Multimodal embedding items — `/v1/embeddings` accepts structured `items` with text + image input. Tested with `Qwen3-VL-Embedding-2B-mxfp8` (#373 by @MasakiMu319)
- Custom processor embedding support — Embedding requests route through custom processor hooks when available, fixing models like Qwen3-VL-Embedding that broke on the generic tokenizer path (#369 by @MasakiMu319)
- System prompt support in chat UI — Chat interface now accepts system prompts
- Clear all SSD cache button — Admin dashboard has a new button to clear all SSD cache blocks
- SSD cache size display — Shows SSD cache size even when no models are loaded
- Responsive admin dashboard — Admin dashboard now works on mobile devices
- Real-time menubar updates — macOS menubar status (Stopped → Starting → Running) and button states now update in real time while the menu is open, without needing to close and reopen (#426 by @EmotionalAmo)
- App bundle size reduction — Stripped torch, cv2, pyarrow, pandas, sympy from the app bundle (~780MB saved)
- mlx-vlm bump — Updated to v0.4.2 (7f7a04c)
- mlx-embeddings bump — Updated to v0.1.0 (32981fa)
Bug Fixes
- Fix `ContentPart` list content causing 500 error when Qwen Code CLI sends `[{"type": "text", ...}]` instead of a string (#433 by @mbauer)
- Fix prefix index storing mutable reference to `block_ids`, causing cache corruption on CoW block changes (#391 by @andafterall)
- Fix Anthropic `document` content blocks (base64 PDF) rejected by Pydantic validation (#434)
- Fix "Total Tokens Processed" metric ignoring reasoning/thinking tokens — renamed to "Total Prefill Tokens" for clarity (#430)
- Fix MCP client resource leak on partial connect failure
- Fix `ValueError` from `processor.apply_chat_template` in VLM engine
- Fix fake `<think>` tag prepended in TRACE log output
- Fix VLM cache proxy using wrong offset for batched mask sizing
- Fix spec `prefill_threshold` not propagating from model_settings to scheduler
- Fix `max_position_embeddings` not detected in nested `text_config`
- Fix `voice` param not mapped to `instruct` for VoiceDesign TTS models
- Fix 8-bit quantization inconsistency — unified to affine/gs64 (removed mxfp8/gs32)
- Fix admin dashboard polling continuing when browser tab is hidden (#352)
- Fix `_prefix_index` type annotations to match tuple storage
- Fix menubar menu status duplication and `_build_menu` guard during menu open
- Fix mlx-audio dependency conflicts with mlx-lm version pinning
- Fix auto-generate README when frontmatter-only stub exists
- Sync 25 stale tests with current implementation
New Contributors
- @ethannortharc — Audio integration (STT/TTS/STS) (#365)
- @leuski — XGrammar constrained decoding (#335)
- @tianfeng98 — MCP Streamable HTTP (#286)
- @MasakiMu319 — Multimodal embedding items (#373, #369)
- @mbauer — ContentPart list type error fix (#433)
- @andafterall — Prefix index mutable reference fix (#391)
- @EmotionalAmo — Menubar real-time update (#426)
Full changelog: v0.2.24...v0.3.0rc1
v0.2.24
Critical Bug Fixes
- Fix VLM loading failure on all Qwen3.5 models — `transformers` 5.4.0 (released March 27) rewrote `Qwen2VLImageProcessor` from a numpy/PIL to a torch/torchvision backend, breaking VLM loading in environments without torch. Every Qwen3.5 model failed VLM init and fell back to LLM, causing double model loading and ~2x peak memory usage. Pinned `transformers>=5.0.0,<5.4.0`. (#431)
- Fix IOKit kernel panic (completeMemory prepare count underflow) — Immediate `mx.clear_cache()` after request completion raced with IOKit's asynchronous reference count cleanup, causing kernel panics on M1/M2/M3 devices. Deferred Metal buffer clearing by 8 generation steps to allow IOKit callbacks to complete. (#435)
- Fix swap during model load with memory guard enabled — `mx.set_memory_limit()` caused MLX to aggressively reclaim cached buffers during model loading, creating alloc/free churn that pushed the system into swap. Removed Metal-level memory limits entirely since all memory protection uses `mx.get_active_memory()` polling instead. (#429)
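The deferred-clearing fix can be sketched as a simple step-counted queue. This is a toy illustration of the pattern, not the actual omlx code: the real fix defers `mx.clear_cache()`, while here the clear is just a counter.

```python
from collections import deque

DEFER_STEPS = 8  # from the fix above: wait 8 generation steps before clearing

class DeferredClearQueue:
    """Toy sketch of deferring cache clears so asynchronous driver
    callbacks can finish before buffers are released. Stand-in for the
    real mx.clear_cache() deferral."""
    def __init__(self):
        self._pending = deque()  # step at which each clear was requested
        self._step = 0
        self.cleared = 0

    def request_clear(self) -> None:
        self._pending.append(self._step)

    def tick(self) -> None:
        """One generation step; execute clears whose deferral has elapsed."""
        self._step += 1
        while self._pending and self._step - self._pending[0] >= DEFER_STEPS:
            self._pending.popleft()
            self.cleared += 1   # stand-in for the actual clear

q = DeferredClearQueue()
q.request_clear()
for _ in range(7):
    q.tick()
assert q.cleared == 0          # still inside the deferral window
q.tick()
assert q.cleared == 1          # executed on the 8th step
```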
v0.2.23
Updating to 0.2.23 is strongly recommended. 0.2.22 contains critical bugs that cause crashes and memory issues on long context and concurrent requests. Sorry for the trouble.
Critical Bug Fixes
- Fix Metal buffer accumulation during prefill causing crashes — 0.2.22 disabled buffer clearing between prefill chunks, causing GPU memory to build up across chunks until the Metal driver crashes. This affected all devices but was especially severe on machines with less memory. (#410, #412, #421)
- Fix TTFT spikes from stale Metal buffers between requests — Freed buffers accumulated in the Metal buffer pool across requests, forcing expensive emergency GC during the next prefill. (#411)
- Fix KVCache offset mismatch in cache reconstruction — Stored meta_state offset could exceed actual tensor length after partial prefix match, causing `broadcast_shapes` errors on hybrid attention models (Qwen3.5) at concurrency > 1. (#409)
Bug Fixes
- Fix MoE router gate quantization causing model load failure
- Fix TurboQuant KV cache conversion missing in cache-merge prefill path (#422)
- Disable experimental TurboQuant feature pending further optimization
Improvements
- oQ: Enhance bit allocation strategy
- oQ: Enable enhanced quantization for Nemotron-H models
v0.2.22
Bug Fixes
- Fix GPU synchronization before `batch_generator.remove()` in request abort path
- Fix prefill performance regression from unnecessary per-chunk `_sync_and_clear_cache()` calls (#396)
- Fix images being stripped from Anthropic `tool_result` content for VLM models (#393)
- Fix GPTQ axis mismatch — align dequantize-quantize grouping with `mx.quantize`
- Fix GPTQ `group_size` fallback crash on non-power-of-2 output dimensions
- Fix accuracy benchmark forcing LM engine to avoid VLM empty responses
Improvements
- Support `x-api-key` header for Anthropic SDK compatibility (#379)
- oQ: MLP asymmetry for dense models — reduce `up_proj` bits while protecting `gate_proj`/`down_proj`
- oQ: GPTQ performance and stability improvements, rename enhanced suffix to `e`
v0.2.21
Highlights
TurboQuant KV cache (experimental)
This is an experimental feature and may not work correctly in all scenarios.
Codebook-quantized KV cache that compresses key-value states during generation. Based on TurboQuant — random orthogonal rotation + Beta distribution codebook + boundary-based scalar quantization.
How it works: Prefill runs at full fp16 speed (no quality loss). At the first decode token, the accumulated KV cache is quantized to 3-bit or 4-bit codebook indices. Decode attention uses a fused 2-pass Flash Attention Metal kernel that reads directly from packed indices — no dequantization, no fp16 intermediate tensors.
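The core quantization step can be sketched as boundary-based scalar quantization against a fixed codebook. This is a simplified toy under stated assumptions: the real TurboQuant pipeline applies a random orthogonal rotation and uses a Beta-distribution codebook, while this sketch skips the rotation and uses an 8-level (3-bit) uniform codebook on normalized values.

```python
import numpy as np

# Toy boundary-based scalar quantization with a fixed codebook.
# Simplified: no orthogonal rotation, uniform rather than Beta codebook.
rng = np.random.default_rng(0)

codebook = np.linspace(-1.0, 1.0, 8)             # 3-bit: 8 levels
boundaries = (codebook[:-1] + codebook[1:]) / 2  # decision boundaries

x = rng.standard_normal(1024).astype(np.float32)
scale = float(np.abs(x).max())                   # per-tensor normalization
idx = np.searchsorted(boundaries, x / scale).astype(np.uint8)  # packed indices
x_hat = codebook[idx] * scale                    # dequantized reconstruction

assert idx.max() <= 7                            # fits in 3 bits
# nearest-level error is at most half the codebook spacing
assert np.abs(x - x_hat).max() <= scale * (1 / 7) + 1e-5
```

Decode attention in the real kernel reads the packed indices directly; the `x_hat` reconstruction here only exists to show the round trip.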
Needle-in-Haystack (Qwen3.5-35B-A3B, 3-bit TurboQuant)
| Context | Baseline | TurboQuant | KV Memory |
|---|---|---|---|
| 32K | ✅ | ✅ | 735MB → 195MB (-73%) |
| 64K | ✅ | ✅ | 1407MB → 327MB (-77%) |
| 128K | ✅ | ✅ | 2749MB → 589MB (-79%) |
Performance
| Model | Prefill Speed | Decode Speed |
|---|---|---|
| Qwen3.5-35B-A3B | 95% | 87% |
| Qwen3.5-27B | 97% | 95% |
Speed values are percentage of fp16 baseline performance.
Enable from admin UI → model settings → experimental features → TurboQuant KV Cache toggle.
oQe — enhanced quantization with GPTQ weight optimization
oQe adds Hessian-based GPTQ error compensation on top of oQ's sensitivity-driven mixed-precision system. Standard quantization rounds each weight independently. GPTQ processes columns sequentially and adjusts remaining weights to compensate for rounding error, guided by the inverse Hessian of calibration inputs. The result is the same mlx-lm compatible format — identical structure, identical inference speed — but with substantially lower output error.
For MoE models, routed experts (90%+ of total parameters) are processed using a batched algorithm: all experts in a layer share the same Hessian, so column-by-column optimization runs on all experts simultaneously. Qwen3.5-35B-A3B (256 experts × 40 layers): ~6 minutes vs ~90 minutes sequential (15x speedup, identical results).
Supported architectures: Qwen3.5 MoE/dense, MiniMax-M2.5, GLM, Step-3.5, Nemotron-Cascade, Llama/Mistral, VLM (vision weights kept fp16). See the oQ documentation for details.
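The column-by-column error compensation shows up clearly in a tiny worked example. This is a toy sketch of the GPTQ idea, not oQe's implementation: with two identical input columns, the rounding error on the first weight is pushed onto the second via the inverse Hessian, and the second weight then rounds differently, cancelling most of the output error.

```python
import numpy as np

# Toy GPTQ-style sketch (not the oQe implementation): two identical input
# columns, so any rounding error on weight 0 can be absorbed by weight 1.
X = np.ones((4, 2))                    # calibration inputs
w = np.array([0.6, 0.6])               # weights to quantize on a 1.0 grid
quant = np.round

# Round-to-nearest baseline: both weights round up, errors add
q_rtn = quant(w)                       # [1.0, 1.0]

# GPTQ-style: quantize column by column, compensating remaining weights
# with the inverse (damped) Hessian of the calibration inputs
H = X.T @ X + 1e-6 * np.eye(2)
Hinv = np.linalg.inv(H)
w_work = w.copy()
q = np.zeros(2)
for j in range(2):
    q[j] = quant(w_work[j])
    err = (w_work[j] - q[j]) / Hinv[j, j]
    w_work[j + 1:] -= err * Hinv[j, j + 1:]

rtn_err = np.linalg.norm(X @ (w - q_rtn))   # ~1.6: both errors add up
gptq_err = np.linalg.norm(X @ (w - q))      # ~0.4: mostly cancelled
assert q[0] == 1.0 and q[1] == 0.0          # error pushed onto weight 1
assert gptq_err < rtn_err
```

The batched MoE variant described above exploits that all experts in a layer share one `Hinv`, so this per-column loop can run across experts at once.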
Bug Fixes
- Fix VLM cache proxy to use the authoritative mx.array offset instead of the unreliable `_offset` shortcut
- Fix monotonic `_offset` for BatchRotatingKVCache in VLM proxy
- Fix disk-full logging for SSD cache writes (#342)
- Fix LoRA adapter directories appearing in model discovery and admin downloads (#356)
- Fix generation memory guard with Metal cache cleanup on OOM failure (#372)
- Fix function_call_output accepting list/dict and serializing to JSON string (#367)
- Fix download popup menu z-index in accuracy benchmark UI (#370)
- Fix oq_calibration_data.json missing from package data (#374)
- Fix merge chat_template_kwargs in eval to prevent duplicate kwarg error
- Fix TemplateResponse calls for Starlette 1.0 compatibility (#351)
- Fix homebrew formula HEAD install support
Full changelog: v0.2.20...v0.2.21
v0.2.20
Highlights
oQ — oMLX universal dynamic quantization
Quantization should not be exclusive to any particular inference server. oQ produces standard mlx-lm compatible models that work everywhere — oMLX, mlx-lm, and any app that supports MLX safetensors. No custom loader required.
oQ is a data-driven mixed-precision quantization system for Apple Silicon. Instead of assigning bits by fixed rules or tensor type, oQ measures each layer's actual quantization sensitivity through calibration and allocates bits where the data says they matter most. See the oQ documentation for details.
Benchmarks (Qwen3.5-35B-A3B)
| Benchmark | Samples | 2-bit mlx-lm | 2-bit oQ | 3-bit mlx-lm | 3-bit oQ | 4-bit mlx-lm | 4-bit oQ |
|---|---|---|---|---|---|---|---|
| MMLU | 300 | 14.0% | 64.0% | 76.3% | 85.0% | 79.7% | 83.3% |
| TRUTHFULQA | 300 | 17.0% | 80.0% | 81.7% | 86.7% | 87.7% | 88.0% |
| HUMANEVAL | 164 (full) | 0.0% | 78.0% | 84.8% | 86.6% | 87.2% | 85.4% |
| MBPP | 300 | 0.3% | 63.3% | 69.0% | 72.0% | 71.7% | 74.3% |
- oQ2-oQ8 levels with sensitivity-driven mixed-precision bit allocation
- oQ3.5 base 3-bit + routed expert down_proj 4-bit (Super Weights protection)
- AWQ weight equalization rewritten from scratch following the llm-compressor reference implementation. fixed critical double-scaling bug on hybrid attention models (Qwen3.5) and added per-layer mask-aware calibration
- Sensitivity-driven budget plan. mandatory lm_head 8-bit protection, then data-driven tier allocation (+4/+2/+1 bits) with greedy fallback. no hardcoded tensor-type priorities — calibration data decides which layers matter
- Proxy sensitivity model. select a quantized version of the source model for layer sensitivity analysis with ~4x less memory. 90% top-10 overlap with full-precision measurement validated on Qwen3.5-35B
- New calibration dataset. 600 samples from codeparrot/self-instruct-starcoder (real code), allenai/c4 (web text), Open-Orca (conversation), gsm8k (reasoning), and wikipedia multilingual. replaces the old HumanEval/MBPP-only code samples
- VLM support. quantize vision-language models with vision weight preservation (fp16)
- FP8 model support. use native FP8 models (MiniMax, DeepSeek) as quantization source
- MiniMax M2.5 support. block_sparse_moe architecture with SwitchGLU fused experts
- DeepSeek V3.2 support. shared_experts (plural) + MLA projections. MLP AWQ works, MLA attention AWQ planned
- Nemotron support. backbone.embeddings path detection for sensitivity measurement on hybrid Mamba+MoE+Attention architecture
- AWQ grid size setting. configurable n_grid (10 fast / 20 recommended) from the web UI
- HuggingFace Hub uploader. upload quantized models directly from the dashboard
- blocks inference requests during quantization to prevent conflicts
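The sensitivity-driven budget plan above can be sketched as a greedy allocator. This is a hypothetical simplification, not oQ's implementation: the tiers are reduced to +2/+1 grants, and the layer names, sensitivities, and budget are made up for the example.

```python
# Hypothetical sketch of sensitivity-driven bit allocation: a fixed base
# precision, mandatory lm_head protection, then extra bits granted
# greedily to the most sensitive layers until the budget runs out.
BASE_BITS = 3

def allocate_bits(sensitivity: dict, extra_bit_budget: int) -> dict:
    bits = {name: BASE_BITS for name in sensitivity}
    if "lm_head" in bits:
        bits["lm_head"] = 8                      # mandatory protection
    # most sensitive layers first (lm_head already handled)
    order = sorted((n for n in sensitivity if n != "lm_head"),
                   key=sensitivity.get, reverse=True)
    budget = extra_bit_budget
    for name in order:
        grant = min(2, budget)                   # simplified tiers: +2 then +1
        if grant == 0:
            break
        bits[name] += grant
        budget -= grant
    return bits

sens = {"lm_head": 9.0, "layers.0.down_proj": 3.2,
        "layers.0.up_proj": 0.4, "layers.1.down_proj": 2.1}
plan = allocate_bits(sens, extra_bit_budget=3)
assert plan["lm_head"] == 8
assert plan["layers.0.down_proj"] == 5          # +2, most sensitive
assert plan["layers.1.down_proj"] == 4          # +1, budget exhausted
assert plan["layers.0.up_proj"] == 3            # base bits only
```

The point of the greedy order is the one stated above: no hardcoded tensor-type priorities — the calibration-measured sensitivities alone decide which layers get extra bits.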
Intelligence benchmark suite
Evaluate model intelligence across knowledge, reasoning, math, and coding benchmarks. All datasets bundled locally for offline use.
- Knowledge: MMLU, ARC-Challenge, KMMLU (Korean), CMMLU (Chinese), JMMLU (Japanese)
- Reasoning: HellaSwag, Winogrande, TruthfulQA, GSM8K
- Coding: HumanEval (164 function completions, pass@1), MBPP
- benchmark queue for sequential multi-model evaluation with persistent results
- comparison table with mode/sample columns and text export
- sample size options: 30/50/100/200/300/500/1000/2000/Full
- batch processing: 1x/2x/4x/8x/16x/32x
- download raw results as JSON
New Features
- Prefill memory guard. prevents kernel panics on large context by detecting head_dim>128 O(n^2) SDPA fallback and enforcing safe prefill chunk sizes
- Native BERT/XLMRoBERTa embedding. load BERT-family embedding models (bge-m3, mxbai-embed) without mlx-embeddings fallback (#330 by @yes999zc)
- Jina v3 reranker. reranking via
<|score_token|>logits for jinaai/jina-reranker-v3-mlx (#331 by @yes999zc) - Partial mode. assistant message prefill support for Moonshot/Kimi K2 models (
partialfield +namefield passthrough) (#306 by @blightbow) - Codex smart config merging. non-destructive config merge with reasoning model auto-detection (#249 by @JasonYeYuhe)
- i18n normalization. normalize translation files against en.json with missing key detection (#247 by @xiaoran007)
- Web dashboard generating status. show generating status for active requests after prefill completes
Experimental Features
- SpecPrefill. attention-based sparse prefill for MoE models. reduces prefill compute by skipping low-attention tokens. system prompt is protected from token dropping to preserve instruction following.
Bug Fixes
- Fix lucide icon rendering race condition with Alpine.js microtask
- Fix chat streaming failure not sending error message to client (#342)
- Fix TTL auto-unload during benchmark causing Metal GPU crash
- Fix MC benchmarks (MMLU, HellaSwag, TruthfulQA) always scoring 0% due to max_tokens=1
- Fix HumanEval scoring. prepend prompt imports when model returns function only
- Fix MBPP scoring. include test cases in prompt so model uses correct function name
- Fix benchmark code extraction. extract last answer/code block instead of first
- Fix benchmark penalties. force neutral presence_penalty=0 and repetition_penalty=1
- Fix think prefix false positive for disabled thinking patterns (
<think></think>) - Fix responses API image support for VLM + missing prompt_tokens in completions usage
- Fix SSE streaming behind nginx reverse proxy (X-Accel-Buffering header) (#309)
- Fix CausalLM-based embedding model detection (Qwen3-Embedding) (#327)
- Fix web dashboard unload tooltip clipping in active models box (#314)
- Fix web dashboard 401 warning log spam from dashboard polling
- Fix web dashboard model settings not showing for embedding/reranker models
- Fix PEP 735 dependency-groups for
uv sync --dev(#305 by @blightbow)
New Contributors
- @blightbow made their first contribution in #305
- @yes999zc made their first contribution in #330
- @JasonYeYuhe made their first contribution in #249
- @xiaoran007 made their first contribution in #247
Full changelog: v0.2.19...v0.2.20


