Compaction deadlock blocks session recovery (/new, /reset queue behind timed-out compaction) #40295
Description
Summary
Compaction timeouts create an unrecoverable deadlock on the main session lane. When compaction fails (timeout at 300s or 600s), recovery commands (/new, /reset, --reset-session) queue behind the compaction in the same session lane and cannot execute. The only recovery path is kill -9 + manual session file rename — which took the user ~1 hour to discover.
This has occurred twice in three days (March 6 and March 8, 2026).
Incident 1 — March 6, 2026
Trigger: Large toolResult payloads in session history (single blobs up to 399,999 and 167,483 chars).
Compaction failures:
Session f84eb979 (Anthropic claude-sonnet-4-6):
- 12:31 PM — compaction start. pre.estTokens=417895, pre.toolResultChars=1,292,139. Top contributor: toolResult:gateway = 399,999 chars
- 12:36 PM — timeout after 300,119ms
- 12:36 PM — retry. pre.estTokens=260736, pre.toolResultChars=666,940
- 12:41 PM — timeout after 300,150ms
Session 46d47d54 (openai-codex/gpt-5.3-codex):
- 6:56 PM — compaction start. pre.estTokens=192823, pre.toolResultChars=514,373. Top contributor: toolResult:exec = 167,483 chars
- 7:01 PM — timeout after 300,070ms
- 7:03 PM — retry with gpt-5.2-codex
- 7:08 PM — timeout after 300,071ms
Additional failure mode: Anthropic summarization returned repeated 429 rate-limit errors during compaction (~6:49–6:50 PM), causing both full and partial summarization to fail before the timeout even hit.
Incident 2 — March 8, 2026
Trigger: Main Telegram DM session (cd8786f3) grew to ~3MB / 759 messages / ~1.19M characters with compactionCount: 0 — compaction had never completed successfully on this session.
Timeline (EST):
- ~4:15–4:32 PM — Telegram polling stalls begin. Six stall detections with increasing backoff (2s → 30s).
- 4:20 PM — First compaction timeout. runId=9ad93d4f, timeoutMs=600000. Gateway fell back to current snapshot.
- 5:19 PM — Second compaction timeout. runId=33a3c6ef, timeoutMs=600000. Lane wait hit 506,539ms (8.4 minutes) with zero jobs ahead — the compaction itself was the blocker.
- 5:22–5:25 PM — Subagent announce retries (4 attempts) all failed with gateway timeout (60,000ms each).
- 5:26–5:48 PM — Six gateway restarts via SIGTERM. Each restart: gateway starts → Telegram poller connects → typing indicator shows ~2 min → typing TTL expires → no response → SIGTERM. Gateway could not break the cycle.
- ~5:50 PM — User tried /new in TUI. TUI had stale auth token (v2026.2.26 token mismatch — 112 occurrences). Command did not execute.
- ~5:55 PM — User tried openclaw acp --session "agent:main:main" --reset-session. Command hung — session locked in compaction, reset queued behind it.
- ~6:00 PM — User tried new ACP session with uuidgen. Opened but did not affect Telegram DM routing (pinned to agent:main:main).
- ~9:45 PM — Resolution: kill -9, manually renamed session .jsonl to .jsonl.reset.manual, LaunchAgent restarted gateway with fresh session.
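For reference, the manual rename in the last step could be scripted roughly as below. This is a hypothetical helper, not an existing OpenClaw command: the function name is made up, the session file location is left to the caller because it is not stated above, and the sketch assumes the gateway process has already been stopped.

```python
from pathlib import Path


def reset_session_file(session_file: Path, suffix: str = ".reset.manual") -> Path:
    """Rename a session .jsonl so the gateway starts with a fresh session.

    Hypothetical helper. Assumes the gateway is not running; renaming the
    file under a live gateway is not covered here.
    """
    if not session_file.is_file():
        raise FileNotFoundError(session_file)
    # Append the suffix instead of replacing ".jsonl", matching the
    # ".jsonl" -> ".jsonl.reset.manual" rename described above.
    target = session_file.with_name(session_file.name + suffix)
    if target.exists():
        raise FileExistsError(target)
    session_file.rename(target)
    return target
```

The old transcript is kept under the new name, so nothing is deleted and it can still be inspected afterwards.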
The deadlock:
Every incoming Telegram message triggered safeguard-mode compaction → compaction timed out after 10 minutes → blocked the session lane → all recovery commands (/new, /reset) entered the same lane queue → could not execute until compaction completed → compaction never completed.
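A simplified FIFO model of the lane illustrates the wait. It is only a sketch, not the gateway's scheduler: the job names, arrival times, and the 1s reset duration are made up, and the 600s duration is the compaction timeout from the logs above.

```python
def lane_start_times(jobs):
    """Return [(name, start_s)] for jobs run strictly in arrival order.

    jobs: list of (name, arrival_s, duration_s), sorted by arrival_s.
    Illustrative model of a single-threaded lane, nothing more.
    """
    free_at = 0  # simulated time at which the lane next becomes idle
    starts = []
    for name, arrival_s, duration_s in jobs:
        start = max(free_at, arrival_s)  # wait for everything queued ahead
        starts.append((name, start))
        free_at = start + duration_s
    return starts


# Two message-triggered compactions, each holding the lane for the full
# 600s timeout, followed by a /reset issued 10s in.
jobs = [
    ("compaction-1", 0, 600),
    ("compaction-2", 5, 600),
    ("/reset", 10, 1),
]
for name, start in lane_start_times(jobs):
    print(f"{name} starts at t={start}s")
```

In this model the reset waits one full timeout window per compaction queued ahead of it, and every new message that arrives before the reset adds another window.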
Root Cause
- Session lane is single-threaded. Compaction, message processing, and administrative commands (/new, /reset) all share the same lane. A timed-out compaction blocks everything.
- No compaction circuit breaker. Sessions that fail compaction repeatedly will keep attempting it on every incoming message, consuming the full timeout window each time.
- No out-of-band session reset. All reset paths go through the gateway session lane. If the lane is blocked, there is no recovery without filesystem surgery.
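One way to avoid the first problem is to let control commands cancel the active compaction instead of queueing behind it. The asyncio sketch below shows the idea under stated assumptions: `SessionLane`, `run_compaction`, and `reset` are hypothetical names, and the `sleep` stands in for the summarization call. It is not how the gateway is implemented.

```python
import asyncio
from typing import List, Optional


class SessionLane:
    """Toy lane: one active compaction at a time; reset cancels it."""

    def __init__(self) -> None:
        self._active: Optional[asyncio.Task] = None
        self.log: List[str] = []

    async def run_compaction(self, seconds: float) -> None:
        self._active = asyncio.current_task()
        try:
            await asyncio.sleep(seconds)  # stand-in for the summarization call
            self.log.append("compaction done")
        except asyncio.CancelledError:
            self.log.append("compaction aborted")
            raise
        finally:
            self._active = None

    async def reset(self) -> None:
        task = self._active
        if task is not None and not task.done():
            task.cancel()  # preempt instead of queueing behind it
            try:
                await task
            except asyncio.CancelledError:
                pass  # expected: the compaction was cancelled on purpose
        self.log.append("session reset")


async def demo() -> List[str]:
    lane = SessionLane()
    compaction = asyncio.create_task(lane.run_compaction(600))
    await asyncio.sleep(0)  # yield once so the compaction starts
    await lane.reset()
    return lane.log


print(asyncio.run(demo()))
```

The reset returns immediately here because the compaction is cancelled at its await point; a real summarization call would also need to honor cancellation for this to work.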
Expected Behavior
- /new and /reset should preempt or abort an active compaction, not queue behind it
- Compaction should have a circuit breaker — after N failures, stop retrying on every message
- Session size should trigger a warning or auto-action before compaction becomes untenable (e.g., >500K chars or >500 messages)
- A CLI command should exist for direct session file operations without going through the gateway (e.g., openclaw sessions reset --agent main --force)
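A minimal sketch of the circuit breaker and the size check, to make the proposal concrete. Everything here is hypothetical: the class and function names are invented, N=3 is an arbitrary default, and the thresholds are the example values suggested above (500K chars, 500 messages).

```python
from dataclasses import dataclass


@dataclass
class CompactionBreaker:
    """Stops compaction attempts after max_failures consecutive failures."""

    max_failures: int = 3  # hypothetical value for N
    failures: int = 0

    def allow(self) -> bool:
        # Checked before each attempt; False means "skip compaction".
        return self.failures < self.max_failures

    def record_failure(self) -> None:
        self.failures += 1

    def record_success(self) -> None:
        self.failures = 0  # any success closes the breaker again


def session_oversized(chars: int, messages: int,
                      max_chars: int = 500_000,
                      max_messages: int = 500) -> bool:
    """True when a session exceeds either example threshold."""
    return chars > max_chars or messages > max_messages


breaker = CompactionBreaker()
for _ in range(3):  # three timeouts in a row
    breaker.record_failure()
print(breaker.allow())                    # further attempts are skipped
print(session_oversized(1_190_000, 759))  # Incident 2 session size
```

A production breaker would probably also need a cooldown or a manual reset so that compaction can be retried later; that is left out here.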
Environment
- OpenClaw gateway (LaunchAgent, macOS)
- Compaction providers: anthropic/claude-sonnet-4-6, openai-codex/gpt-5.3-codex, openai-codex/gpt-5.2-codex
- Compaction timeouts: 300s (March 6), 600s (March 8)
- Channel: Telegram DM
Log Sources
- Gateway logs: ~/.openclaw/logs/gateway.err.log
- Session logs: /tmp/openclaw/openclaw-2026-03-06.log, /tmp/openclaw/openclaw-2026-03-08.log