TTFT regression every ~3 messages on cache-hit prompts after b9cb0d8

## Problem

After the prefill performance fix in b9cb0d8 ("remove per-chunk `_sync_and_clear_cache()` calls during prefill"), resending an identical prompt shows a periodic TTFT spike roughly every 3 messages. The prompt is a near-exact cache match each time, so TTFT should be consistently low.

**Pattern:** Messages 1-2 are fast (cache hit), message 3 has a significant TTFT spike, then 4-5 fast, 6 spikes, etc.
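One way to confirm the pattern from logs is to flag TTFT samples that exceed a multiple of the median. This is a minimal sketch: the helper name `find_ttft_spikes`, the 3x factor, and the sample timings are illustrative assumptions, not measurements from this report.

```python
from statistics import median


def find_ttft_spikes(ttfts_s, factor=3.0):
    """Return 1-based message numbers whose TTFT exceeds factor * median."""
    baseline = median(ttfts_s)
    return [i for i, t in enumerate(ttfts_s, start=1) if t > factor * baseline]


# Hypothetical timings in seconds, shaped like the reported pattern.
samples = [0.20, 0.21, 1.90, 0.19, 0.22, 2.05]
spikes = find_ttft_spikes(samples)
print(spikes)
```

With real measurements, a spike list that steps by about 3 would match the behavior described here.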

## Root cause

The fix in b9cb0d8 correctly removed per-chunk `_sync_and_clear_cache()` from the prefill loop to prevent Metal buffer fragmentation. But it also disabled `mlx_cache_cleanup_interval` entirely (512 → 0), leaving no Metal buffer cleanup during or after generation.

The only remaining cleanup is `_sync_and_clear_cache()` at `scheduler.py:702`, which fires at the prefill-to-generation transition, *after* the new request's prefill has already run. Stale Metal buffers from the previous request's generation are still occupying the allocator during the next request's prefill.

Over roughly 3 requests (about 512 generation steps at ~170 tokens per response), the accumulated buffer pressure appears to reach a threshold at which the next prefill stalls on Metal allocator contention.
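The "every ~3 messages" estimate is simple arithmetic. The sketch below assumes one generation step per output token and treats 512 (the old `mlx_cache_cleanup_interval` value) as the relevant step count; both are assumptions based on the numbers quoted in this issue.

```python
# Values quoted in this issue; the constant names are illustrative.
OLD_CLEANUP_INTERVAL = 512  # previous mlx_cache_cleanup_interval, in generation steps
TOKENS_PER_RESPONSE = 170   # approximate response length observed

# Assumes one generation step per output token.
responses_per_interval = OLD_CLEANUP_INTERVAL / TOKENS_PER_RESPONSE
print(f"{responses_per_interval:.2f} responses per {OLD_CLEANUP_INTERVAL} steps")
```

Longer or shorter responses would shift the period accordingly, which is consistent with the spike being tied to accumulated generation steps rather than to a request count.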

## Suggested fix

**Clear Metal buffers at request completion, not at next-request prefill.**

The server is idle between sending a response and receiving the next request. Cleanup performed in that window is invisible to the user because no pending request has its TTFT clock running: the cost is the same, but no one is waiting on it.

In `_cleanup_finished()`, after all request teardown is complete (~line 3821):

```python
        if finished_ids:
            self._update_stop_tokens()
            # Reclaim Metal buffers from completed generation while the
            # server is idle. No pending request pays the latency cost.
            mx.clear_cache()
```

This is the right place because:

- `_cleanup_finished` already handles all other teardown (cache storage, ref count release, MLX array nulling). Metal buffer reclamation is the missing final step.
- No additional sync is needed. `mx.synchronize(generation_stream)` already ran at line 3659, so `clear_cache()` is safe.
- No periodic tuning required. This eliminates the need to find an `mlx_cache_cleanup_interval` value that works across response lengths and hardware. The interval stays at 0.

The existing `_sync_and_clear_cache()` at line 702 stays as-is for clearing prefill intermediates before generation starts.
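The ordering can be illustrated without MLX by injecting the cache-clear call. This is a toy sketch, not the project's scheduler: `ToyScheduler`, its `running` dict, and the injected `clear_cache` callable are hypothetical stand-ins, and in the real code the callable would be `mx.clear_cache`.

```python
class ToyScheduler:
    """Hypothetical stand-in for the scheduler; only the cleanup ordering matters."""

    def __init__(self, clear_cache):
        self._clear_cache = clear_cache  # mx.clear_cache in the real code
        self.running = {}

    def _cleanup_finished(self, finished_ids):
        for request_id in finished_ids:
            # Stands in for the existing per-request teardown.
            self.running.pop(request_id, None)
        if finished_ids:
            # Reclaim buffers only when a request actually finished,
            # i.e. while no new request is waiting on prefill.
            self._clear_cache()


calls = []
sched = ToyScheduler(clear_cache=lambda: calls.append("clear"))
sched.running = {"a": object(), "b": object()}
sched._cleanup_finished(set())   # nothing finished: no cache clear
sched._cleanup_finished({"a"})   # one request finished: one cache clear
print(calls, list(sched.running))
```

Guarding on `finished_ids` keeps the clear off the hot path: steps where nothing completes pay no extra cost.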

## Environment

- Machine: M3 Ultra (512GB)
- oMLX Version: 0.2.22 (includes b9cb0d8)

## Related

- b9cb0d8: "fix: remove per-chunk `_sync_and_clear_cache()` calls during prefill (#396)"
