Comparing changes

base repository: jundot/omlx
base: v0.3.1
head repository: jundot/omlx
compare: v0.3.2
  • 10 commits
  • 26 files changed
  • 1 contributor

Commits on Apr 2, 2026

  1. deps: bump mlx-vlm to 9db27b5

    jundot committed Apr 2, 2026
    12e668a

Commits on Apr 3, 2026

  1. formula: bump to v0.3.1

    jundot committed Apr 3, 2026
    4cd0e2e
  2. refactor: clean up oQ codebase for upcoming enhanced quantization redesign
    
    remove legacy GPTQ/clip optimization code and the quantize_oq()
    full-model path; only the streaming quantization path remains.

    fix dense-model budget-plan bug: gate_proj/up_proj were excluded from
    the sensitivity boost due to a dead MLP-asymmetry reference.
    jundot committed Apr 3, 2026
    127b08e
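The budget-plan fix above can be pictured as a per-projection bit assignment where sensitive projections get extra bits. The sketch below is purely illustrative: the projection names follow common transformer naming, and the base/boost values and the membership of gate_proj/up_proj in the sensitive set are assumptions, not omlx's actual plan.

```python
# Hypothetical per-projection bit-budget plan with a sensitivity boost.
# The fixed bug, as described in the commit, was that gate_proj/up_proj
# fell out of the boosted set via a stale MLP-asymmetry check.
BASE_BITS = 3.0
SENSITIVITY_BOOST = 0.5
SENSITIVE = {"q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj"}

def plan_bits(projections):
    """Assign a (possibly fractional) bit width to each projection."""
    return {p: BASE_BITS + (SENSITIVITY_BOOST if p in SENSITIVE else 0.0)
            for p in projections}

plan = plan_bits(["q_proj", "gate_proj", "down_proj"])
print(plan)  # {'q_proj': 3.5, 'gate_proj': 3.5, 'down_proj': 3.0}
```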
  3. deps: bump mlx-vlm to 43b9b20

    jundot committed Apr 3, 2026
    489b4d0
  4. refactor: replace turboquant with mlx-vlm import instead of custom implementation
    
    old turboquant had an MSE-only codec with C++ Metal extensions and was
    350x slower than SDPA. now imports from mlx-vlm, which has multi-codec
    support (MSE, Prod, Polar, Split), fractional bits (e.g. 3.5), and 3
    optimized decode paths via inline mx.fast.metal_kernel. only
    BatchTurboQuantKVCache is implemented locally, for omlx's continuous
    batching scheduler. re-enables turboquant in the admin UI.
    jundot committed Apr 3, 2026
    b627377
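A fractional average bit width like the 3.5 mentioned above is commonly realized by mixing two integer widths across quantization groups. The toy below shows that generic construction; it is not necessarily how mlx-vlm implements it.

```python
# Generic way to hit a fractional average bit width: quantize some groups
# at floor(avg) bits and the rest at floor(avg)+1 bits.
def group_bit_widths(n_groups, avg_bits=3.5):
    lo, hi = int(avg_bits), int(avg_bits) + 1
    n_hi = round((avg_bits - lo) * n_groups)  # groups that get the extra bit
    return [hi] * n_hi + [lo] * (n_groups - n_hi)

widths = group_bit_widths(8)
print(widths)                       # [4, 4, 4, 4, 3, 3, 3, 3]
print(sum(widths) / len(widths))    # 3.5
```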
  5. fix: apply EXIF orientation transpose in image loading

    omlx was missing ImageOps.exif_transpose() that mlx-vlm applies when
    loading images. phone photos with EXIF rotation tags were passed to the
    vision model in the wrong orientation, causing inaccurate recognition.
    jundot committed Apr 3, 2026
    5086d1c
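The fix boils down to one Pillow call in the image-loading path. A minimal sketch (load_image() is a hypothetical helper, not omlx's actual function):

```python
from PIL import Image, ImageOps

def load_image(path_or_file):
    """Load an image and normalize its orientation before inference."""
    img = Image.open(path_or_file).convert("RGB")
    # ImageOps.exif_transpose() rotates/flips the pixels according to the
    # EXIF Orientation tag; without it, phone photos reach the vision
    # model sideways or upside down. It is a no-op copy when the tag is
    # absent.
    return ImageOps.exif_transpose(img)
```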
  6. fix: turboquant immediate quantization and eval hang workaround

    quantize the KV cache immediately during prefill (rather than deferring
    to the first decode token) so peak memory is reduced throughout
    inference. use concat-based state growth instead of in-place writes.
    eval logits instead of cache states to work around an mx.eval
    NamedTuple bulk-traversal hang. disable the prefill_attention Metal
    kernels for D>128 (they hang on Qwen3.5-27B with head_dim=256) and
    fall back to dequantize+SDPA.
    
    Qwen3.5-27B-4bit pp32768/tg128:
      baseline: peak 21.82GB, TTFT 85.0s, tg 30.6 tok/s
      TQ 3-bit: peak 21.11GB, TTFT 92.8s, tg 16.3 tok/s
    jundot committed Apr 3, 2026
    26ecd0b
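The "quantize each prefill chunk immediately, grow state by concatenation" idea can be sketched as follows, using NumPy as a stand-in for mlx arrays. All names and the crude 3-bit codec are illustrative, not omlx's API.

```python
import numpy as np

def quantize_3bit(x):
    """Crude per-row 3-bit affine quantization (illustrative only)."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 7.0 + 1e-8
    return np.clip(np.round((x - lo) / scale), 0, 7).astype(np.uint8)

class StreamingQuantCache:
    """KV-cache sketch: quantize during prefill, grow by concatenation."""
    def __init__(self):
        self.q = None  # quantized codes

    def append(self, chunk):
        codes = quantize_3bit(chunk)
        # Concat-based growth: build a new array each time instead of
        # writing into a preallocated buffer in place.
        self.q = codes if self.q is None else np.concatenate([self.q, codes])

cache = StreamingQuantCache()
for chunk in np.split(np.random.randn(128, 64), 4):  # 4 prefill chunks
    cache.append(chunk)  # quantized right away, before any decode token
print(cache.q.shape)  # (128, 64)
```

Quantizing per chunk means the full-precision keys/values for a chunk can be freed as soon as the chunk is encoded, which is why peak memory drops for long prefills.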
  7. fix: turboquant batch-quantize decode tokens and hybrid attention

    batch-quantize every 32 decode tokens instead of quantizing per token,
    improving GPU utilization (the rotation matmul runs on 32 tokens rather
    than 1). decode attention uses a hybrid approach: TQ Metal kernels for
    the quantized old tokens plus a standard dot product for the buffered
    fp16 recent tokens.

    also fixes prefill_attention: disable it for chunked prefill, because
    mlx-vlm's value kernel unrolls n_repeats*L at compile time, hanging the
    Metal shader compiler for large L (e.g. 2048).
    
    Qwen3.5-27B-4bit pp32768/tg128 (3-bit TQ):
      before: TTFT 92.8s, tg 16.3 tok/s, peak 21.11GB
      after:  TTFT 92.8s, tg 19.2 tok/s, peak 21.11GB
      (baseline without TQ: tg 30.6 tok/s, peak 21.82GB)
    jundot committed Apr 3, 2026
    c2c8d93
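The buffering scheme above can be sketched as a cache that holds recent tokens in fp16 and flushes them to the quantized store 32 at a time. Class and method names here are assumptions for illustration, not omlx's real code.

```python
import numpy as np

FLUSH_EVERY = 32  # decode tokens per quantization batch

class HybridDecodeCache:
    """Recent tokens stay fp16 (plain dot-product attention); older tokens
    are quantized in batches of 32 (TQ-kernel attention in the real code)."""
    def __init__(self):
        self.quantized_len = 0  # tokens already in the quantized store
        self.buffer = []        # recent fp16 tokens

    def append(self, token_kv):
        self.buffer.append(token_kv)
        if len(self.buffer) >= FLUSH_EVERY:
            batch = np.stack(self.buffer)  # (32, d): one batched matmul
            self._quantize_batch(batch)    # stand-in for the TQ encode path
            self.quantized_len += len(self.buffer)
            self.buffer.clear()

    def _quantize_batch(self, batch):
        pass  # real code would rotate + encode the whole batch at once

cache = HybridDecodeCache()
for _ in range(100):
    cache.append(np.zeros(64, dtype=np.float16))
print(cache.quantized_len, len(cache.buffer))  # 96 4
```

Attention then runs in two pieces per step: the quantized store (here 96 tokens) via the quantized kernels, and the small fp16 tail (here 4 tokens) via ordinary SDPA, with the two partial results combined.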
  8. bump 0.3.2

    jundot committed Apr 3, 2026
    98849a4
  9. 5ce8a4d