Problem
After the prefill performance fix in b9cb0d8 ("remove per-chunk _sync_and_clear_cache() calls during prefill"), resending an identical prompt shows a periodic TTFT spike roughly every 3 messages. The prompt is a near-exact cache match each time, so TTFT should be consistently low.
Pattern: Messages 1-2 are fast (cache hit), message 3 has a significant TTFT spike, then 4-5 fast, 6 spikes, etc.
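To make the pattern above checkable from a list of measured TTFT values, something like the following could be used. This is a minimal sketch: `find_ttft_spikes`, `spike_period`, and the 3x-median threshold are hypothetical helpers and assumptions for illustration, not part of oMLX.

```python
from statistics import median


def find_ttft_spikes(ttfts_s, factor=3.0):
    """Return 1-based message numbers whose TTFT exceeds factor x the median.

    ttfts_s: TTFT measurements in seconds, one per message, in send order.
    factor:  assumed spike threshold relative to the median (hypothetical).
    """
    if not ttfts_s:
        return []
    baseline = median(ttfts_s)
    return [i for i, t in enumerate(ttfts_s, start=1) if t > factor * baseline]


def spike_period(spike_indices):
    """Return the gap between consecutive spikes if it is constant, else None."""
    gaps = {b - a for a, b in zip(spike_indices, spike_indices[1:])}
    return gaps.pop() if len(gaps) == 1 else None
```

If the report is accurate, feeding in the measured TTFTs for a run of identical prompts should flag messages 3, 6, ... and give a period of 3.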
Root cause
The fix in b9cb0d8 correctly removed per-chunk _sync_and_clear_cache() from the prefill loop to prevent Metal buffer fragmentation. But it also disabled mlx_cache_cleanup_interval entirely (512 → 0), leaving no Metal buffer cleanup during or after generation.
The only remaining cleanup is _sync_and_clear_cache() at scheduler.py:702, which fires at the prefill-to-generation transition, after the new request's prefill has already run. Stale Metal buffers from the previous request's generation are still occupying the allocator during the next request's prefill.
Over ~3 requests (about 512 generation steps at ~170 tokens per response, since 3 × 170 = 510), accumulated buffer pressure reaches a threshold at which the next prefill stalls on Metal allocator contention.
Suggested fix
Clear Metal buffers at request completion, not at next-request prefill.
The server is idle between serving a response and receiving the next request. Cleanup performed here is invisible to the user because there's no pending request whose TTFT clock is running. It's off-peak work: the cost is the same, but no one is waiting.
In _cleanup_finished(), after all request teardown is complete (~line 3821):
    if finished_ids:
        self._update_stop_tokens()
        # Reclaim Metal buffers from completed generation while the
        # server is idle. No pending request pays the latency cost.
        mx.clear_cache()
This is the right place because:
- _cleanup_finished already handles all other teardown (cache storage, ref count release, MLX array nulling). Metal buffer reclamation is the missing final step.
- No additional sync is needed. mx.synchronize(generation_stream) already ran at line 3659, so clear_cache() is safe.
- No periodic tuning required. This eliminates the need to find an mlx_cache_cleanup_interval value that works across response lengths and hardware. The interval stays at 0.
The existing _sync_and_clear_cache() at line 702 stays as-is for clearing prefill intermediates before generation starts.
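The ordering proposed above can be illustrated with a small stand-alone sketch. `ToyScheduler` and its fields are hypothetical and much simpler than the real scheduler. The cache-clearing call is injected as a plain callable so the example runs without MLX; in the real code that callable would be `mx.clear_cache`.

```python
class ToyScheduler:
    """Hypothetical stand-in for the scheduler, reduced to the cleanup step."""

    def __init__(self, clear_cache):
        # In the real scheduler this would be mx.clear_cache (assumption:
        # injected here only so the sketch is runnable without MLX).
        self._clear_cache = clear_cache
        self.running = {}

    def _cleanup_finished(self, finished_ids):
        # Per-request teardown happens first.
        for request_id in finished_ids:
            self.running.pop(request_id, None)

        if finished_ids:
            # Reclaim buffers once, after teardown, while no request's
            # TTFT clock is running.
            self._clear_cache()
```

The point of the sketch is only the placement: the clear runs once per batch of finished requests, after teardown, and not at all when nothing finished.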
Environment
- Machine: M3 Ultra (512GB)
- oMLX Version: 0.2.22 (includes b9cb0d8)
Related
- "... _sync_and_clear_cache() calls during prefill (Crashes #396)"