Kernel panic (IOGPUMemory completeMemory() prepare count underflow) after upgrading v0.2.20 → v0.2.23 #435
Description
Summary
Upgrading oMLX from v0.2.20 to v0.2.23 causes repeated kernel panics on Mac Studio M2 Ultra 64GB. Rolling back to v0.2.20 immediately resolves the issue. 4 identical panics occurred within ~1 hour of running v0.2.23.
Environment
- Hardware: Mac Studio M2 Ultra, 64GB unified memory
- macOS: 26.4 (Build 25E5233c), Darwin 25.4.0
- oMLX: v0.2.23 (panics) / v0.2.20 (stable)
- Models loaded: Qwen3.5-35B-A3B-4bit (pinned, ~20GB) + Qwen3.5-0.8B-4bit (SpecPrefill draft)
- Config:
```
--max-process-memory 80% --hot-cache-max-size 8GB --paged-ssd-cache-dir ~/.omlx/cache
```
- SSD cache: ~92GB (1163 files)
Panic details
All 4 panics are identical:
```
panic(cpu 18 caller 0xfffffe00427725d8): "completeMemory() prepare count underflow" @IOGPUMemory.cpp:550
```
Timestamps: 21:11, 21:29, 21:40, 21:48 (2026-03-27, ~10-20 min intervals)
Reproduction steps
1. Run oMLX v0.2.20 stably for days (no panics)
2. Upgrade and restart the service:
```
pip install "omlx @ git+https://github.com/jundot/[email protected]"
launchctl kickstart -k gui/502/ai.jarvis.omlx
```
3. oMLX starts normally, models load, inference works
4. Within 10-20 minutes: kernel panic
5. After reboot, panics repeat on each boot until rollback
What v0.2.23 changed (from source diff)
Comparing v0.2.20 and v0.2.23, the relevant changes are:
- New files: `patches/turboquant_attention.py`, `patches/turboquant_kv.py` (TurboQuant KV cache, disabled via hardcoded `turboquant_kv_enabled = False`)
- `cache/prefix_cache.py`: changed offset calculation to always use tensor shape instead of `meta_state`; added `TurboQuantKVCache` handling with block slicing
- `cache/paged_ssd_cache.py`: added TurboQuant tensor serialization and disk pressure handling (ENOSPC/EDQUOT)
- `cache/type_handlers.py`: changed cache offset logic to `cache.offset = keys.shape[2]` instead of using `meta_state`
- `engine/batched.py`: added TurboQuant initialization code (checks `turboquant_kv_enabled`, patches attention)
- `model_settings.py`: added `turboquant_kv_enabled` and `turboquant_kv_bits` fields
Even though TurboQuant is disabled, the new cache handling code (prefix_cache offset fix, type_handlers change) runs unconditionally and likely changes Metal buffer allocation/deallocation patterns.
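To make the suspected offset change concrete, here is a minimal Python sketch of the two behaviors described above. The class and field names (`meta_state`, `keys_shape`) are hypothetical stand-ins based on the issue text, not the actual oMLX source:

```python
# Hypothetical sketch of the cache offset change between v0.2.20 and v0.2.23.
# Names are illustrative; they mirror the issue description, not oMLX code.

class KVCacheEntry:
    def __init__(self, keys_shape, meta_offset):
        # keys_shape ~ (batch, heads, seq_len, head_dim)
        self.keys_shape = keys_shape
        self.meta_state = {"offset": meta_offset}
        self.offset = 0

def restore_offset_v0_2_20(entry):
    # Old behavior: trust the serialized meta_state offset.
    entry.offset = entry.meta_state["offset"]

def restore_offset_v0_2_23(entry):
    # New behavior: always derive the offset from the tensor's sequence axis.
    entry.offset = entry.keys_shape[2]
```

If `meta_state` and the tensor shape disagree (e.g. a preallocated buffer that is only partially filled), the two versions compute different offsets, which could change which buffers the engine considers live and when they are released.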
Analysis
The panic is a Metal GPU driver reference count underflow: the driver's completeMemory() function sees more complete calls than matching prepare calls, driving the prepare count negative. This is likely triggered by:
- Changed buffer lifecycle in the new prefix cache offset logic
- Metal buffer allocation patterns from TurboQuant infrastructure code (type registry, cache type detection) even when TurboQuant is disabled
- Interaction with SSD cache reconstruction on startup (92GB cache, 1163 files)
Workaround
Rolling back to v0.2.20 resolves the issue completely:
```
pip install "omlx @ git+https://github.com/jundot/[email protected]"
```
Related issues
- mlx-lm #883: Same IOGPUMemory panic with 30B model on M3 Ultra
- oMLX #300 (Kernel panic after running a few minutes): kernel panics reported in v0.2.13/v0.2.18, stable in v0.2.11