Comparing changes

base repository: jundot/omlx
base: v0.3.1
head repository: jundot/omlx
compare: v0.3.2
  • 10 commits
  • 26 files changed
  • 1 contributor

Commits on Apr 2, 2026

  1. deps: bump mlx-vlm to 9db27b5

    jundot committed Apr 2, 2026
    12e668a

Commits on Apr 3, 2026

  1. formula: bump to v0.3.1

    jundot committed Apr 3, 2026
    4cd0e2e
  2. refactor: clean up oQ codebase for upcoming enhanced quantization redesign
    
    remove legacy GPTQ/clip optimization code and the quantize_oq()
    full-model path; only the streaming quantization path remains.

    fix dense-model budget-plan bug: gate_proj/up_proj were excluded from
    the sensitivity boost due to a dead MLP-asymmetry reference.
    jundot committed Apr 3, 2026
    127b08e
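The budget-plan fix above can be pictured as a per-projection bit assignment where sensitive projections get extra bits. The sketch below is purely illustrative: the projection names follow common transformer naming, and the base/boost values and the membership of gate_proj/up_proj in the sensitive set are assumptions, not omlx's actual plan.

```python
# Hypothetical per-projection bit-budget plan with a sensitivity boost.
# The fixed bug, as described in the commit, was that gate_proj/up_proj
# fell out of the boosted set via a stale MLP-asymmetry check.
BASE_BITS = 3.0
SENSITIVITY_BOOST = 0.5
SENSITIVE = {"q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj"}

def plan_bits(projections):
    """Assign a (possibly fractional) bit width to each projection."""
    return {p: BASE_BITS + (SENSITIVITY_BOOST if p in SENSITIVE else 0.0)
            for p in projections}

plan = plan_bits(["q_proj", "gate_proj", "down_proj"])
print(plan)  # {'q_proj': 3.5, 'gate_proj': 3.5, 'down_proj': 3.0}
```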
  3. deps: bump mlx-vlm to 43b9b20

    jundot committed Apr 3, 2026
    489b4d0
  4. refactor: replace turboquant with mlx-vlm import instead of custom implementation
    
    old turboquant had an MSE-only codec with C++ Metal extensions and was
    350x slower than SDPA. now imports from mlx-vlm, which has multi-codec
    support (MSE, Prod, Polar, Split), fractional bits (e.g. 3.5), and 3
    optimized decode paths via inline mx.fast.metal_kernel. only
    BatchTurboQuantKVCache is implemented locally, for omlx's continuous
    batching scheduler. re-enables turboquant in the admin UI.
    jundot committed Apr 3, 2026
    b627377
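A fractional average bit width like the 3.5 mentioned above is commonly realized by mixing two integer widths across quantization groups. The toy below shows that generic construction; it is not necessarily how mlx-vlm implements it.

```python
# Generic way to hit a fractional average bit width: quantize some groups
# at floor(avg) bits and the rest at floor(avg)+1 bits.
def group_bit_widths(n_groups, avg_bits=3.5):
    lo, hi = int(avg_bits), int(avg_bits) + 1
    n_hi = round((avg_bits - lo) * n_groups)  # groups that get the extra bit
    return [hi] * n_hi + [lo] * (n_groups - n_hi)

widths = group_bit_widths(8)
print(widths)                       # [4, 4, 4, 4, 3, 3, 3, 3]
print(sum(widths) / len(widths))    # 3.5
```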
  5. fix: apply EXIF orientation transpose in image loading

    omlx was missing ImageOps.exif_transpose() that mlx-vlm applies when
    loading images. phone photos with EXIF rotation tags were passed to the
    vision model in the wrong orientation, causing inaccurate recognition.
    jundot committed Apr 3, 2026
    5086d1c
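The fix boils down to one Pillow call in the image-loading path. A minimal sketch (load_image() is a hypothetical helper, not omlx's actual function):

```python
from PIL import Image, ImageOps

def load_image(path_or_file):
    """Load an image and normalize its orientation before inference."""
    img = Image.open(path_or_file).convert("RGB")
    # ImageOps.exif_transpose() rotates/flips the pixels according to the
    # EXIF Orientation tag; without it, phone photos reach the vision
    # model sideways or upside down. It is a no-op copy when the tag is
    # absent.
    return ImageOps.exif_transpose(img)
```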
  6. fix: turboquant immediate quantization and eval hang workaround

    quantize the KV cache immediately during prefill (rather than deferring
    to the first decode token) so peak memory is reduced throughout
    inference. use concat-based state growth instead of in-place writes.
    eval logits instead of cache states to work around an mx.eval
    NamedTuple bulk-traversal hang. disable the prefill_attention Metal
    kernels for D>128 (they hang on Qwen3.5-27B with head_dim=256) and
    fall back to dequantize+SDPA.
    
    Qwen3.5-27B-4bit pp32768/tg128:
      baseline: peak 21.82GB, TTFT 85.0s, tg 30.6 tok/s
      TQ 3-bit: peak 21.11GB, TTFT 92.8s, tg 16.3 tok/s
    jundot committed Apr 3, 2026
    26ecd0b
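The "quantize each prefill chunk immediately, grow state by concatenation" idea can be sketched as follows, using NumPy as a stand-in for mlx arrays. All names and the crude 3-bit codec are illustrative, not omlx's API.

```python
import numpy as np

def quantize_3bit(x):
    """Crude per-row 3-bit affine quantization (illustrative only)."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 7.0 + 1e-8
    return np.clip(np.round((x - lo) / scale), 0, 7).astype(np.uint8)

class StreamingQuantCache:
    """KV-cache sketch: quantize during prefill, grow by concatenation."""
    def __init__(self):
        self.q = None  # quantized codes

    def append(self, chunk):
        codes = quantize_3bit(chunk)
        # Concat-based growth: build a new array each time instead of
        # writing into a preallocated buffer in place.
        self.q = codes if self.q is None else np.concatenate([self.q, codes])

cache = StreamingQuantCache()
for chunk in np.split(np.random.randn(128, 64), 4):  # 4 prefill chunks
    cache.append(chunk)  # quantized right away, before any decode token
print(cache.q.shape)  # (128, 64)
```

Quantizing per chunk means the full-precision keys/values for a chunk can be freed as soon as the chunk is encoded, which is why peak memory drops for long prefills.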
  7. fix: turboquant batch-quantize decode tokens and hybrid attention

    batch-quantize every 32 decode tokens instead of quantizing per token,
    improving GPU utilization (the rotation matmul runs on 32 tokens rather
    than 1). decode attention uses a hybrid approach: TQ Metal kernels for
    the quantized old tokens plus a standard dot product for the buffered
    fp16 recent tokens.

    also fixes prefill_attention: disable it for chunked prefill, because
    mlx-vlm's value kernel unrolls n_repeats*L at compile time, hanging the
    Metal shader compiler for large L (e.g. 2048).
    
    Qwen3.5-27B-4bit pp32768/tg128 (3-bit TQ):
      before: TTFT 92.8s, tg 16.3 tok/s, peak 21.11GB
      after:  TTFT 92.8s, tg 19.2 tok/s, peak 21.11GB
      (baseline without TQ: tg 30.6 tok/s, peak 21.82GB)
    jundot committed Apr 3, 2026
    c2c8d93
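The buffering scheme above can be sketched as a cache that holds recent tokens in fp16 and flushes them to the quantized store 32 at a time. Class and method names here are assumptions for illustration, not omlx's real code.

```python
import numpy as np

FLUSH_EVERY = 32  # decode tokens per quantization batch

class HybridDecodeCache:
    """Recent tokens stay fp16 (plain dot-product attention); older tokens
    are quantized in batches of 32 (TQ-kernel attention in the real code)."""
    def __init__(self):
        self.quantized_len = 0  # tokens already in the quantized store
        self.buffer = []        # recent fp16 tokens

    def append(self, token_kv):
        self.buffer.append(token_kv)
        if len(self.buffer) >= FLUSH_EVERY:
            batch = np.stack(self.buffer)  # (32, d): one batched matmul
            self._quantize_batch(batch)    # stand-in for the TQ encode path
            self.quantized_len += len(self.buffer)
            self.buffer.clear()

    def _quantize_batch(self, batch):
        pass  # real code would rotate + encode the whole batch at once

cache = HybridDecodeCache()
for _ in range(100):
    cache.append(np.zeros(64, dtype=np.float16))
print(cache.quantized_len, len(cache.buffer))  # 96 4
```

Attention then runs in two pieces per step: the quantized store (here 96 tokens) via the quantized kernels, and the small fp16 tail (here 4 tokens) via ordinary SDPA, with the two partial results combined.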
  8. bump 0.3.2

    jundot committed Apr 3, 2026
    98849a4
  9. 5ce8a4d