
Comparing changes
base repository: jundot/omlx
base: v0.2.20
head repository: jundot/omlx
compare: v0.2.21
  • 17 commits
  • 41 files changed
  • 3 contributors

Commits on Mar 23, 2026

  1. Commit a9334ae
  2. fix: update TemplateResponse calls for Starlette 1.0 compatibility (#351)
    
    * fix: update TemplateResponse calls for Starlette 1.0 compatibility
    
    Starlette 1.0.0 changed the TemplateResponse signature from
    TemplateResponse(name, context) to TemplateResponse(request, name, context).
    
    The old positional API passed the context dict as the `name` parameter,
    causing Jinja2's LRUCache to receive an unhashable dict as a cache key:
    
        TypeError: unhashable type: 'dict'
    
    This broke the admin dashboard (login, dashboard, chat pages) with
    HTTP 500, while the /v1/ inference API was unaffected.
    
    - Update 3 TemplateResponse calls in admin/routes.py
    - Pass explicit empty context dict for consistency
    - Add TestLoginPage and TestDashboardPage test classes
    - Update existing TestChatPageApiKeyInjection assertions
    
    * address review: bump fastapi floor + improve login_page test assertion
    
    - Bump fastapi>=0.100.0 to >=0.108.0 to match the Starlette 1.0
      TemplateResponse(request, name, context) signature requirement
      (Codex P1 review)
    - Use assert_called_once_with in login_page test to verify all args
      including context dict (Gemini review)
    Regis-RCR authored Mar 23, 2026
    Commit ca890e5
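The failure mode this commit describes can be reproduced in miniature. The sketch below is illustrative, not oMLX's admin code: Jinja2's LRUCache keys compiled templates by template name, so when the old positional call passes the context dict in the `name` slot, hashing the key raises `TypeError`.

```python
# Minimal reproduction of the bug class described above (illustrative,
# not oMLX's code): Jinja2's LRUCache keys compiled templates by name,
# so an unhashable dict passed as `name` raises TypeError.
template_cache = {}

def get_template(name):
    # stand-in for Jinja2's environment.get_template(), which consults
    # an LRUCache keyed by the template name
    if name not in template_cache:          # the dict lookup hashes the key
        template_cache[name] = f"<compiled {name}>"
    return template_cache[name]

get_template("login.html")                  # fine: str is hashable

try:
    # old positional call: the context dict lands in the `name` slot
    get_template({"request": None, "user": "admin"})
except TypeError as exc:
    print(exc)  # unhashable type: 'dict'
```

The fix is to use the request-first signature, `TemplateResponse(request, name, context)`, so the name argument is always a string.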
  3. Commit 293c33a
  4. Commit 6cc709c

Commits on Mar 24, 2026

  1. Commit 66b099b
  2. Commit 3cd6f9e
  3. fix: resolve download popup menu z-index issue in accuracy benchmark (#370)
    
    animate-fade-in-up ended with transform: translateY(0), which creates
    a stacking context per card and traps the popup's z-50 inside it.
    changed to transform: none, which does not create a stacking context.
    jundot committed Mar 24, 2026
    Commit 7bc1402
  4. Commit 86e14a4
  5. fix: add generation memory guard and Metal cache cleanup on failure (#372)
    
    - defer scheduling new requests when active memory exceeds soft limit
      during generation to prevent Metal allocation failures
    - clear Metal buffer cache in fail_all_requests() to reclaim fragmented
      memory after batch generation errors
    - rename "Prefill memory guard" to "Memory guard" in admin UI and move
      toggle to top of Resource Management section
    jundot committed Mar 24, 2026
    Commit 57dc615
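The deferral logic in the first bullet can be sketched as follows. This is a hypothetical sketch, not oMLX's scheduler: the function name, arguments, and the soft-limit value are all illustrative.

```python
# Illustrative sketch of a soft-limit generation memory guard (names and
# threshold are hypothetical, not oMLX's API): new requests are deferred
# while active memory exceeds the soft limit, so running decodes finish
# and free memory before more prefill work is admitted.
SOFT_LIMIT_BYTES = 48 * 1024**3  # example soft cap

def admit_requests(pending, active_memory_bytes, soft_limit=SOFT_LIMIT_BYTES):
    """Return the requests to schedule now; defer everything when over the limit."""
    if active_memory_bytes >= soft_limit:
        return []            # defer scheduling; retry on the next step
    admitted = list(pending)
    pending.clear()
    return admitted
```

Deferring rather than rejecting keeps requests queued until memory pressure subsides, which is what prevents Metal allocation failures mid-generation.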
  6. fix: skip LoRA adapters in model discovery and admin downloads (#356)

    detect adapter_config.json to filter out LoRA/PEFT adapters that oMLX
    cannot load. show warning badge instead of download button in admin UI.
    jundot committed Mar 24, 2026
    Commit 8558d7f
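The detection rule this commit describes is simple to sketch; the helper names below are illustrative, not oMLX's actual functions:

```python
# Hypothetical sketch of the adapter filter described above: a directory
# containing adapter_config.json is a LoRA/PEFT adapter, not a full model,
# so it is excluded from model discovery (function names are illustrative).
from pathlib import Path

def is_lora_adapter(model_dir: Path) -> bool:
    return (model_dir / "adapter_config.json").is_file()

def discover_models(root: Path) -> list:
    return sorted(d for d in root.iterdir()
                  if d.is_dir() and not is_lora_adapter(d))
```

adapter_config.json is the marker file PEFT writes next to adapter weights, which is why its presence reliably distinguishes an adapter from a standalone model.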
  7. fix: use monotonic _offset for BatchRotatingKVCache in VLM proxy (#353)

    _IntOffsetCacheProxy returned _idx for RoPE offset, but
    BatchRotatingKVCache._idx wraps at max_size (e.g. 1024 -> 0).
    after the sliding window fills, RoPE positions reset to 0 causing
    gibberish output on Gemma3 and other sliding-window VLM models.
    
    use _offset (monotonic, never wraps) instead. only
    BatchRotatingKVCache has _offset, so BatchKVCache and other cache
    types are unaffected.
    jundot committed Mar 24, 2026
    Commit 5fdfd38
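The wrap-around described above can be modeled in a few lines. This is a toy model, not oMLX's cache classes: in a rotating KV cache the ring-buffer write index wraps at max_size, while the monotonic offset keeps counting, and only the latter is a valid RoPE position once the sliding window fills.

```python
# Toy model of the bug described above (not oMLX's cache classes): the
# ring-buffer write index (_idx) wraps at max_size, while the monotonic
# token count (_offset) never wraps.
class RotatingCacheSketch:
    def __init__(self, max_size: int):
        self.max_size = max_size
        self._idx = 0      # ring-buffer write position (wraps)
        self._offset = 0   # total tokens seen (never wraps)

    def append_token(self):
        self._idx = (self._idx + 1) % self.max_size
        self._offset += 1

cache = RotatingCacheSketch(max_size=1024)
for _ in range(1025):
    cache.append_token()
print(cache._idx)     # 1    -> wrong RoPE position after the wrap
print(cache._offset)  # 1025 -> correct monotonic position
```

Using the wrapped index as a RoPE offset resets positions to near zero after max_size tokens, which matches the gibberish-after-window-fill symptom on Gemma3.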
  8. Commit e150eb0
  9. fix: use authoritative mx.array offset in VLM cache proxy

    replace _idx/_offset shortcuts with direct offset[0].item() extraction.
    _idx wraps at max_size (continuous generation), _offset diverges after
    merge() which sets it to buffer size instead of actual token count
    (SSD cache restore). the mx.array offset is always correct.
    
    also add Gemma3-12B-QAT to boundary cache consistency tests.
    jundot committed Mar 24, 2026
    Commit 6f3a33d

Commits on Mar 25, 2026

  1. oq: GPTQ-based enhanced quantization with batched MoE expert processing

    - implement GPTQ column-wise error compensation for all quantizable weights
    - batched expert GPTQ processes 256 experts simultaneously (15x faster)
    - shared Hessian across experts in each MoE layer
    - sensitivity budget assigns per-tensor bits before GPTQ optimization
    - fix float32 norm weights to bfloat16 for mlx-lm inference parity
    - add eos_token_id from generation_config.json
    - add Step-3.5 MoE support (moe.*_proj pattern)
    - update admin UI: Enhanced Quantization(+) with GPTQ description
    - update docs/oQ_Quantization.md for GPTQ-based pipeline
    - clean up legacy equalization code
    jundot committed Mar 25, 2026
    Commit 228a4d9
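The column-wise error compensation at the heart of GPTQ can be sketched in heavily simplified form. Everything below is a toy: a flat compensation coefficient stands in for the inverse-Hessian terms, and none of it is oMLX's actual implementation (which, per the commit, batches the sweep across 256 MoE experts with a shared Hessian).

```python
# Toy sketch of GPTQ-style column-wise error compensation. The real
# algorithm derives per-column compensation from the inverse Hessian of
# the layer inputs; here a flat coefficient stands in for those terms.
def quantize(x: float, step: float = 0.5) -> float:
    return round(x / step) * step

def gptq_sweep(row, comp: float = 0.5):
    """Quantize weights left to right, folding each element's
    quantization error into the next, not-yet-quantized element."""
    row = list(row)
    out = []
    for j in range(len(row)):
        q = quantize(row[j])
        out.append(q)
        if j + 1 < len(row):
            row[j + 1] += (row[j] - q) * comp  # push error forward
    return out
```

For example, naive round-to-nearest maps [0.3, 0.3] to [0.5, 0.5] (sum 1.0 vs. the true 0.6), while the compensated sweep yields [0.5, 0.0], keeping the aggregate output much closer to the unquantized weights.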
  2. feat: TurboQuant KV cache compression with fused Flash Attention

    codebook-quantized KV cache that reduces memory ~50-70% during decode
    with near-lossless quality. lazy quantization keeps prefill at fp16
    speed, then compresses at decode start.
    
    core:
    - TurboQuantKVCache + BatchTurboQuantKVCache with full batch lifecycle
    - 2-pass fused Flash Attention Metal kernel (no dequant during decode)
    - boundary-based quantization (19x faster than argmin)
    - batch decode_attention via same fused kernel (B>1 grid dispatch)
    
    integration:
    - attention patch with VLM _IntOffsetCacheProxy unwrap
    - prefix cache: save quantized blocks to SSD, dequant to KVCache on
      restore for merge compatibility. meta_state stores (offset, bits, seed)
    - type_registry: TurboQuantKVCache recognized as sliceable
    - admin UI: turboquant toggle with 3-bit/4-bit selection
    jundot committed Mar 25, 2026
    Commit 8c2696e
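Scalar 4-bit quantization of a KV row with a per-row scale gives a feel for where the ~50-70% memory saving comes from. This is a toy sketch only: per the commit, TurboQuant uses codebooks and dequantizes inside a fused Flash Attention Metal kernel rather than materializing fp16 values.

```python
# Toy 4-bit scalar quantization of one KV row (illustrative only; the
# actual TurboQuant path uses codebooks and consumes quantized values
# inside a fused Flash Attention kernel).
def quantize_row_4bit(row):
    lo, hi = min(row), max(row)
    scale = (hi - lo) / 15 or 1.0       # 16 levels -> 4 bits per value
    codes = [round((x - lo) / scale) for x in row]
    return codes, lo, scale             # 4-bit codes + fp scale/offset

def dequantize_row(codes, lo, scale):
    return [lo + c * scale for c in codes]

row = [0.0, 0.25, 0.5, 1.0]
codes, lo, scale = quantize_row_4bit(row)
approx = dequantize_row(codes, lo, scale)
# each reconstructed value is within half a quantization step of the input
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(row, approx))
```

Storing 4-bit codes instead of 16-bit floats shrinks the cached values by roughly 4x before per-row metadata overhead, consistent with the 50-70% end-to-end figure once scales, offsets, and unquantized state are accounted for.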
  3. bump version to 0.2.21

    jundot committed Mar 25, 2026
    Commit ea09fc6
  4. Add files via upload

    jundot authored Mar 25, 2026
    Commit 710ef39