Compaction deadlock blocks session recovery (/new, /reset queue behind timed-out compaction)

## Summary

Compaction timeouts create an unrecoverable deadlock on the main session lane. When compaction times out (at 300s or 600s), recovery commands (`/new`, `/reset`, `--reset-session`) queue behind it in the same session lane and cannot execute. The only recovery path is `kill -9` plus a manual rename of the session file, which took the user ~1 hour to discover.

This has occurred **twice in three days** (March 6 and March 8, 2026).

---

## Incident 1 — March 6, 2026

**Trigger:** Large `toolResult` payloads in session history (single blobs up to 399,999 and 167,483 chars).

### Compaction failures:

**Session `f84eb979`** (Anthropic claude-sonnet-4-6):
- 12:31 PM — compaction start. `pre.estTokens=417895`, `pre.toolResultChars=1,292,139`. Top contributor: `toolResult:gateway = 399,999 chars`
- 12:36 PM — **timeout** after 300,119ms
- 12:36 PM — retry. `pre.estTokens=260736`, `pre.toolResultChars=666,940`
- 12:41 PM — **timeout** after 300,150ms

**Session `46d47d54`** (openai-codex/gpt-5.3-codex):
- 6:56 PM — compaction start. `pre.estTokens=192823`, `pre.toolResultChars=514,373`. Top contributor: `toolResult:exec = 167,483 chars`
- 7:01 PM — **timeout** after 300,070ms
- 7:03 PM — retry with gpt-5.2-codex
- 7:08 PM — **timeout** after 300,071ms

**Additional failure mode:** Anthropic summarization returned repeated `429` rate-limit errors during compaction (~6:49–6:50 PM), so both full and partial summarization failed before the timeout was reached.

---

## Incident 2 — March 8, 2026

**Trigger:** Main Telegram DM session (`cd8786f3`) grew to ~3MB / 759 messages / ~1.19M characters with `compactionCount: 0` — compaction had **never completed successfully** on this session.

### Timeline (EST):
- **~4:15–4:32 PM** — Telegram polling stalls begin. Six stall detections with increasing backoff (2s → 30s).
- **4:20 PM** — First compaction timeout. `runId=9ad93d4f`, `timeoutMs=600000`. Gateway fell back to current snapshot.
- **5:19 PM** — Second compaction timeout. `runId=33a3c6ef`, `timeoutMs=600000`. Lane wait hit 506,539ms (8.4 minutes) with zero jobs ahead — the compaction itself was the blocker.
- **5:22–5:25 PM** — Subagent announce retries (4 attempts) all failed with gateway timeout (60,000ms each).
- **5:26–5:48 PM** — Six gateway restarts via SIGTERM. Each restart: gateway starts → Telegram poller connects → typing indicator shows ~2 min → typing TTL expires → no response → SIGTERM. Gateway could not break the cycle.
- **~5:50 PM** — User tried `/new` in TUI. TUI had stale auth token (v2026.2.26 token mismatch — 112 occurrences). Command did not execute.
- **~5:55 PM** — User tried `openclaw acp --session "agent:main:main" --reset-session`. Command **hung** — session locked in compaction, reset queued behind it.
- **~6:00 PM** — User tried new ACP session with `uuidgen`. Opened but did not affect Telegram DM routing (pinned to `agent:main:main`).
- **~9:45 PM** — **Resolution:** `kill -9`, manually renamed session `.jsonl` to `.jsonl.reset.manual`, LaunchAgent restarted gateway with fresh session.
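
For reference, the rename half of the manual fix can be sketched as a small helper. This is only an illustration of the workaround, not an OpenClaw tool: the function name `reset_session_file` is hypothetical, the caller has to supply the session file path (the issue does not record where it lives), and it assumes the gateway process has already been killed so nothing else is writing to the file.

```python
from pathlib import Path


def reset_session_file(session_file: Path, suffix: str = ".reset.manual") -> Path:
    """Move a session transcript aside so the next gateway start gets a fresh session.

    Hypothetical helper: assumes the gateway is stopped and that the caller
    knows the path of the session's .jsonl file.
    """
    if not session_file.is_file():
        raise FileNotFoundError(session_file)
    # e.g. "<id>.jsonl" becomes "<id>.jsonl.reset.manual", as in the manual fix.
    target = session_file.with_name(session_file.name + suffix)
    if target.exists():
        # Refuse to overwrite an earlier backup of the same session.
        raise FileExistsError(target)
    session_file.rename(target)
    return target
```

The old transcript is kept rather than deleted, so it stays available for debugging the compaction failure.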

### The deadlock:
Every incoming Telegram message triggered safeguard-mode compaction → compaction timed out after 10 minutes → blocked the session lane → all recovery commands (`/new`, `/reset`) entered the same lane queue → could not execute until compaction completed → compaction never completed.
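
A minimal model of this behavior, assuming the lane is a plain FIFO queue with a single worker (the gateway's real lane implementation is not shown in this issue, and `lane_worker` and `demo_fifo_lane` are illustrative names). The reset job is instant, but it cannot run until the compaction ahead of it finishes:

```python
import asyncio


async def lane_worker(queue: asyncio.Queue, log: list) -> None:
    """Run queued jobs strictly one at a time, in arrival order."""
    while True:
        name, job = await queue.get()
        await job()
        log.append(name)


async def demo_fifo_lane(compaction_seconds: float, observe_seconds: float) -> list:
    """Return the names of the jobs that completed within the observation window."""
    queue: asyncio.Queue = asyncio.Queue()
    log: list = []

    async def compaction() -> None:
        # Stands in for a summarization call that hangs until its timeout.
        await asyncio.sleep(compaction_seconds)

    async def reset() -> None:
        # Instant, but only once it reaches the head of the lane.
        pass

    queue.put_nowait(("compaction", compaction))
    queue.put_nowait(("/reset", reset))
    worker = asyncio.create_task(lane_worker(queue, log))
    await asyncio.sleep(observe_seconds)
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return log
```

With a compaction that outlasts the observation window, `asyncio.run(demo_fifo_lane(0.2, 0.05))` returns an empty list: neither job completed, and `/reset` never started.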

---

## Root Cause

1. **Session lane is single-threaded.** Compaction, message processing, and administrative commands (`/new`, `/reset`) all share the same lane. A timed-out compaction blocks everything.
2. **No compaction circuit breaker.** Sessions that fail compaction repeatedly will keep attempting it on every incoming message, consuming the full timeout window each time.
3. **No out-of-band session reset.** All reset paths go through the gateway session lane. If the lane is blocked, there is no recovery without filesystem surgery.
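
One possible way to address point 1 is to let administrative commands jump the queue and cancel whatever job is currently running. The sketch below is an assumption about how that could look, not a description of the gateway's code: `PreemptibleLane`, its `control` flag, and the log strings are all hypothetical.

```python
import asyncio
from collections import deque
from typing import Awaitable, Callable, Optional

Job = Callable[[], Awaitable[None]]


class PreemptibleLane:
    """Serial lane in which control commands preempt the running job."""

    def __init__(self) -> None:
        self._pending: deque = deque()  # entries: (name, job, control)
        self._wakeup = asyncio.Event()
        self._current: Optional[asyncio.Task] = None
        self._current_is_control = False
        self.log: list = []

    def submit(self, name: str, job: Job, control: bool = False) -> None:
        if control:
            # Control commands go to the front and abort a running normal job.
            self._pending.appendleft((name, job, True))
            if self._current is not None and not self._current_is_control:
                self._current.cancel()
        else:
            self._pending.append((name, job, False))
        self._wakeup.set()

    async def run(self) -> None:
        while True:
            while not self._pending:
                self._wakeup.clear()
                await self._wakeup.wait()
            name, job, control = self._pending.popleft()
            task = asyncio.ensure_future(job())
            self._current, self._current_is_control = task, control
            # asyncio.wait does not raise when the awaited task is cancelled.
            await asyncio.wait([task])
            self._current = None
            if task.cancelled():
                status = "aborted"
            elif task.exception() is not None:
                status = "failed"
            else:
                status = "done"
            self.log.append(f"{name}: {status}")


async def demo_preempt() -> list:
    lane = PreemptibleLane()
    runner = asyncio.create_task(lane.run())

    async def hung_compaction() -> None:
        await asyncio.sleep(3600)  # never finishes on its own

    async def reset() -> None:
        pass

    lane.submit("compaction", hung_compaction)
    await asyncio.sleep(0.01)  # let the compaction start
    lane.submit("/reset", reset, control=True)
    await asyncio.sleep(0.01)  # let the lane process the reset
    runner.cancel()
    await asyncio.gather(runner, return_exceptions=True)
    return lane.log
```

In this model `asyncio.run(demo_preempt())` yields `["compaction: aborted", "/reset: done"]`. A real implementation would also need the compaction to clean up on cancellation, for example by discarding a partially written summary.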

---

## Expected Behavior

- `/new` and `/reset` should **preempt or abort** an active compaction, not queue behind it
- Compaction should have a **circuit breaker** — after N failures, stop retrying on every message
- Session size should trigger a **warning or auto-action** before compaction becomes untenable (e.g., >500K chars or >500 messages)
- A CLI command should exist for **direct session file operations** without going through the gateway (e.g., `openclaw sessions reset --agent main --force`)
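
As a rough sketch of the circuit breaker and the size check: the class and function names, the failure limit, and the cool-down below are hypothetical, while the size thresholds are the example values from the list above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CompactionBreaker:
    """Stop retrying compaction after repeated failures (hypothetical defaults)."""

    max_failures: int = 2
    cooldown_s: float = 1800.0
    clock: Callable[[], float] = time.monotonic
    failures: int = 0
    opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a compaction attempt may start now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Half-open: permit one trial; a single further failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None


def session_too_large(
    total_chars: int,
    message_count: int,
    max_chars: int = 500_000,
    max_messages: int = 500,
) -> bool:
    """Flag sessions past the example thresholds (>500K chars or >500 messages)."""
    return total_chars > max_chars or message_count > max_messages
```

The gateway would call `allow()` before each compaction attempt and skip compaction while it returns `False`, so incoming messages no longer consume a full timeout window each. By the size check, the March 8 session (~1.19M characters, 759 messages) would have been flagged well before the first timeout.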

---

## Environment

- OpenClaw gateway (LaunchAgent, macOS)
- Compaction providers: `anthropic/claude-sonnet-4-6`, `openai-codex/gpt-5.3-codex`, `openai-codex/gpt-5.2-codex`
- Compaction timeouts: 300s (March 6), 600s (March 8)
- Channel: Telegram DM

## Log Sources

- Gateway logs: `~/.openclaw/logs/gateway.err.log`
- Session logs: `/tmp/openclaw/openclaw-2026-03-06.log`, `/tmp/openclaw/openclaw-2026-03-08.log`
