Crashes during large-context inference; server stops with Metal/MLX SIGABRT #520

@zipzagster

Description

Bug: oMLX 0.3.0 server stops during large-context inference with MiniMax-M2.5-MLX-4bit, crashing in the Metal/MLX backend on macOS Tahoe 26.3.1

Summary

oMLX 0.3.0 crashed at 2026-04-01 18:56:56 -0500 while Hermes was sending a large-context request to MiniMax-M2.5-MLX-4bit through the oMLX API. During the request, oMLX went offline, stopped answering API calls, and the oMLX app showed “server stopped.”

The macOS crash report shows this was a native SIGABRT in the Metal/MLX backend, not a normal Python exception.

Environment

  • oMLX version: 0.3.0
  • App: oMLX.app
  • Process: python3
  • OS: macOS Tahoe 26.3.1
  • Hardware: Mac Studio M3 Ultra, 256 GB unified memory
  • Crash time: 2026-04-01 18:56:56 -0500
  • Client: Hermes
  • Model in use at crash time: MiniMax-M2.5-MLX-4bit

What I was doing

Hermes was processing a large-context request against the oMLX API using MiniMax-M2.5-MLX-4bit.

At the time of the crash:

  • oMLX was actively serving inference
  • Hermes was waiting on the API response
  • oMLX suddenly went offline
  • subsequent API calls failed
  • the oMLX app reported that the server stopped

Expected behavior

oMLX should continue serving the request or fail gracefully with an error response, without terminating the server.
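Because the failure is a native SIGABRT, a Python-level try/except inside the server cannot intercept it; the only reliable recovery point is a supervising parent process. A minimal sketch of that pattern (hypothetical, not oMLX's actual architecture; the child simulates the crash with `os.abort()`):

```python
import subprocess
import sys


def run_worker_once() -> int:
    """Run a (hypothetical) inference worker as a child process.

    A native abort in Metal/MLX kills the whole interpreter, so the
    crash is only observable from outside. Here the child simulates
    the SIGABRT with os.abort().
    """
    proc = subprocess.run([sys.executable, "-c", "import os; os.abort()"])
    return proc.returncode


def supervise(max_restarts: int = 3) -> int:
    """Restart the worker on abnormal exit, up to a limit."""
    restarts = 0
    while restarts < max_restarts:
        code = run_worker_once()
        if code == 0:
            break  # clean exit, nothing to do
        # On POSIX, a negative return code encodes the fatal signal number.
        print(f"worker died with signal {-code}, restarting")
        restarts += 1
    return restarts


if __name__ == "__main__":
    supervise()
```

This would not fix the underlying Metal assertion, but it would keep the API reachable instead of requiring a manual restart.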

Actual behavior

The oMLX server stopped during inference and became unavailable to API clients. The process crashed rather than returning a handled error.

Crash details

The macOS crash report shows:

  • Signal: SIGABRT / Abort trap: 6
  • Relevant stack frames:
    • __assert_rtn
    • MTLReportFailure
    • -[IOGPUMetalCommandBuffer encodeSignalEvent:value:]
    • -[AGXG15XFamilyCommandBuffer encodeSignalEvent:value:]
    • mlx::core::Event::signal(...)
    • mlx::core::eval_impl(...)

This appears to be an assertion failure in Apple Metal triggered during MLX event signaling / synchronization.

Reproduction

I do not yet have a minimal deterministic repro, but the observed scenario was:

  1. Start oMLX 0.3.0
  2. Serve MiniMax-M2.5-MLX-4bit over the API
  3. Send a large-context request from Hermes
  4. During inference, oMLX stops and the app reports server stopped

Frequency

Observed several times on previous oMLX versions as well.

Why this seems important

This is not just a bad model response or API timeout. The server process itself terminated during inference, which makes oMLX unavailable to clients until manually restarted.
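Until the crash is fixed, clients can at least detect the outage promptly. A small polling sketch (the URL is a placeholder; any endpoint that answers while the server is alive would do):

```python
import time
import urllib.error
import urllib.request


def is_server_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers at all (any HTTP status)."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True  # server answered, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused / reset / timed out


def watch(url: str, interval: float = 30.0) -> None:
    """Poll the server and report when it goes offline."""
    while True:
        if not is_server_up(url):
            print("oMLX appears to be down; restart it or alert an operator")
        time.sleep(interval)
```

This mirrors what Hermes observed: once the process aborts, every subsequent connection attempt fails outright.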

Possible area to investigate

The crash stack suggests a backend issue related to MLX/Metal synchronization during evaluation of a large request, possibly around:

  • mlx::core::Event::signal(...)
  • mlx::core::eval_impl(...)
  • Metal command buffer event signaling via encodeSignalEvent:value:

Metadata

Assignees

No one assigned

    Labels

needs time (not a quick fix - requires time to dig into)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests