
Kernel panic (IOGPUMemory completeMemory() prepare count underflow) after upgrading v0.2.20 → v0.2.23 #435

@gizmax

Description

Summary

Upgrading oMLX from v0.2.20 to v0.2.23 causes repeated kernel panics on a Mac Studio M2 Ultra (64GB). Rolling back to v0.2.20 immediately resolves the issue. Four identical panics occurred within roughly one hour of running v0.2.23.

Environment

  • Hardware: Mac Studio M2 Ultra, 64GB unified memory
  • macOS: 26.4 (Build 25E5233c), Darwin 25.4.0
  • oMLX: v0.2.23 (panics) / v0.2.20 (stable)
  • Models loaded: Qwen3.5-35B-A3B-4bit (pinned, ~20GB) + Qwen3.5-0.8B-4bit (SpecPrefill draft)
  • Config: --max-process-memory 80% --hot-cache-max-size 8GB --paged-ssd-cache-dir ~/.omlx/cache
  • SSD cache: ~92GB (1163 files)

Panic details

All 4 panics are identical:

panic(cpu 18 caller 0xfffffe00427725d8): "completeMemory() prepare count underflow" @IOGPUMemory.cpp:550

Timestamps: 21:11, 21:29, 21:40, 21:48 (2026-03-27, ~10-20 min intervals)

Reproduction steps

  1. Run oMLX v0.2.20 stably for days (no panics)
  2. `pip install "omlx @ git+https://github.com/jundot/[email protected]"`
  3. `launchctl kickstart -k gui/502/ai.jarvis.omlx`
  4. oMLX starts normally, models load, inference works
  5. Within 10-20 minutes: kernel panic
  6. After reboot, the panic recurs on each boot until rollback

What v0.2.23 changed (from source diff)

Comparing v0.2.20 and v0.2.23, the relevant changes are:

  1. New files: patches/turboquant_attention.py, turboquant_kv.py (TurboQuant KV cache, disabled via hardcoded turboquant_kv_enabled = False)
  2. cache/prefix_cache.py: Changed offset calculation to always use tensor shape instead of meta_state. Added TurboQuantKVCache handling with block slicing.
  3. cache/paged_ssd_cache.py: Added TurboQuant tensor serialization, disk pressure handling (ENOSPC/EDQUOT)
  4. cache/type_handlers.py: Changed cache offset logic: cache.offset = keys.shape[2] instead of using meta_state
  5. engine/batched.py: Added TurboQuant initialization code (checks turboquant_kv_enabled, patches attention)
  6. model_settings.py: Added turboquant_kv_enabled and turboquant_kv_bits fields

Even though TurboQuant is disabled, the new cache handling code (prefix_cache offset fix, type_handlers change) runs unconditionally and likely changes Metal buffer allocation/deallocation patterns.
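The offset change can be illustrated with a minimal sketch. All class and function names below (`FakeTensor`, `KVCacheEntry`, the two `offset_*` helpers) are hypothetical stand-ins for illustration, not oMLX's actual API; the point is only that deriving the offset from the tensor shape instead of `meta_state` changes the value when the two disagree:

```python
from dataclasses import dataclass, field


class FakeTensor:
    """Stand-in for an MLX array; only exposes .shape."""

    def __init__(self, shape: tuple) -> None:
        self.shape = shape


@dataclass
class KVCacheEntry:
    """Hypothetical KV-cache entry (illustrative, not oMLX's real API)."""

    keys: FakeTensor                       # (batch, heads, seq_len, head_dim)
    meta_state: dict = field(default_factory=dict)


def offset_v0_2_20(cache: KVCacheEntry) -> int:
    # Old behavior (sketch): trust the serialized metadata.
    return cache.meta_state.get("offset", 0)


def offset_v0_2_23(cache: KVCacheEntry) -> int:
    # New behavior (sketch): always derive the offset from the tensor shape,
    # i.e. cache.offset = keys.shape[2]. If metadata and tensor shape disagree
    # (e.g. after SSD-cache reconstruction), the two versions return different
    # offsets, and downstream code slices and frees different buffer regions.
    return cache.keys.shape[2]


entry = KVCacheEntry(FakeTensor((1, 8, 128, 64)),      # seq_len = 128
                     meta_state={"offset": 96})        # stale metadata
print(offset_v0_2_20(entry))  # 96
print(offset_v0_2_23(entry))  # 128
```

With a stale `meta_state` (as might survive in a 92GB on-disk cache written by v0.2.20), the two versions disagree, which is consistent with the buffer-lifecycle change hypothesized below.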

Analysis

The panic is a reference-count underflow in the Metal GPU driver: completeMemory() is called more times than there were matching prepare calls, driving the prepare count negative. This is likely triggered by:

  • Changed buffer lifecycle in the new prefix cache offset logic
  • Metal buffer allocation patterns from TurboQuant infrastructure code (type registry, cache type detection) even when TurboQuant is disabled
  • Interaction with SSD cache reconstruction on startup (92GB cache, 1163 files)
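The underflow mechanism itself is easy to model. This is a toy prepare/complete counter, not Apple's driver code; it only shows why a single unbalanced complete (e.g. a buffer released twice by a changed cache lifecycle) is immediately fatal:

```python
class GPUMemoryRegion:
    """Toy model of a driver-side prepare count (not Apple's implementation)."""

    def __init__(self) -> None:
        self.prepare_count = 0

    def prepare(self) -> None:
        # Called when the GPU is about to use the buffer.
        self.prepare_count += 1

    def complete(self) -> None:
        # Called when GPU work on the buffer finishes. One extra complete()
        # drives the count negative -- the condition the kernel panics on.
        self.prepare_count -= 1
        if self.prepare_count < 0:
            raise RuntimeError("completeMemory() prepare count underflow")


region = GPUMemoryRegion()
region.prepare()
region.complete()      # balanced: fine
try:
    region.complete()  # unbalanced: underflow
except RuntimeError as e:
    print(e)           # completeMemory() prepare count underflow
```

In the real driver the panic takes down the whole machine rather than raising an exception, which is why a userspace bookkeeping bug in buffer alloc/free ordering can surface as a kernel panic.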

Workaround

Rolling back to v0.2.20 resolves the issue completely:

`pip install "omlx @ git+https://github.com/jundot/[email protected]"`
