Compaction deadlock blocks session recovery (/new, /reset queue behind timed-out compaction) #40295

@sene1337

Summary

Compaction timeouts create an unrecoverable deadlock on the main session lane. When compaction times out (at 300s or 600s), recovery commands (/new, /reset, --reset-session) queue behind it in the same session lane and cannot execute. The only recovery path is kill -9 followed by a manual rename of the session file, which took the user ~1 hour to discover.

This has occurred twice in three days (March 6 and March 8, 2026).


Incident 1 — March 6, 2026

Trigger: Large toolResult payloads in session history (single blobs up to 399,999 and 167,483 chars).

Compaction failures:

Session f84eb979 (Anthropic claude-sonnet-4-6):

  • 12:31 PM — compaction start. pre.estTokens=417895, pre.toolResultChars=1,292,139. Top contributor: toolResult:gateway = 399,999 chars
  • 12:36 PM — timeout after 300,119ms
  • 12:36 PM — retry. pre.estTokens=260736, pre.toolResultChars=666,940
  • 12:41 PM — timeout after 300,150ms

Session 46d47d54 (openai-codex/gpt-5.3-codex):

  • 6:56 PM — compaction start. pre.estTokens=192823, pre.toolResultChars=514,373. Top contributor: toolResult:exec = 167,483 chars
  • 7:01 PM — timeout after 300,070ms
  • 7:03 PM — retry with gpt-5.2-codex
  • 7:08 PM — timeout after 300,071ms

Additional failure mode: Anthropic summarization returned repeated 429 rate-limit errors during compaction (~6:49–6:50 PM), causing both full and partial summarization to fail before the timeout even hit.


Incident 2 — March 8, 2026

Trigger: Main Telegram DM session (cd8786f3) grew to ~3MB / 759 messages / ~1.19M characters with compactionCount: 0 — compaction had never completed successfully on this session.

Timeline (EST):

  • ~4:15–4:32 PM — Telegram polling stalls begin. Six stall detections with increasing backoff (2s → 30s).
  • 4:20 PM — First compaction timeout. runId=9ad93d4f, timeoutMs=600000. Gateway fell back to current snapshot.
  • 5:19 PM — Second compaction timeout. runId=33a3c6ef, timeoutMs=600000. Lane wait hit 506,539ms (8.4 minutes) with zero jobs ahead — the compaction itself was the blocker.
  • 5:22–5:25 PM — Subagent announce retries (4 attempts) all failed with gateway timeout (60,000ms each).
  • 5:26–5:48 PM — Six gateway restarts via SIGTERM. Each restart: gateway starts → Telegram poller connects → typing indicator shows ~2 min → typing TTL expires → no response → SIGTERM. Gateway could not break the cycle.
  • ~5:50 PM — User tried /new in TUI. TUI had stale auth token (v2026.2.26 token mismatch — 112 occurrences). Command did not execute.
  • ~5:55 PM — User tried openclaw acp --session "agent:main:main" --reset-session. Command hung — session locked in compaction, reset queued behind it.
  • ~6:00 PM — User tried new ACP session with uuidgen. Opened but did not affect Telegram DM routing (pinned to agent:main:main).
  • ~9:45 PM. Resolution: kill -9, manually renamed session .jsonl to .jsonl.reset.manual, LaunchAgent restarted gateway with fresh session.
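
For anyone hitting this before a fix lands, the manual rename from the resolution step can be scripted. This is a minimal sketch, not an OpenClaw tool: `reset_session_file` is a hypothetical helper, the session file location has to be supplied by the caller, and it assumes the gateway process has already been stopped.

```python
from pathlib import Path


def reset_session_file(session_file: Path, suffix: str = ".reset.manual") -> Path:
    """Rename a session .jsonl so the gateway no longer picks it up.

    Hypothetical helper. Assumes the gateway is already stopped; renaming
    while it is running could race with in-flight writes.
    """
    if not session_file.is_file():
        raise FileNotFoundError(session_file)
    # e.g. cd8786f3.jsonl -> cd8786f3.jsonl.reset.manual
    target = session_file.with_name(session_file.name + suffix)
    if target.exists():
        # Refuse to overwrite an earlier manual backup.
        raise FileExistsError(target)
    session_file.rename(target)
    return target
```

The old transcript is kept under the new name rather than deleted, so it can still be inspected afterwards.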

The deadlock:

Every incoming Telegram message triggered safeguard-mode compaction → compaction timed out after 10 minutes → blocked the session lane → all recovery commands (/new, /reset) entered the same lane queue → could not execute until compaction completed → compaction never completed.
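
The queueing part of this can be reproduced in isolation. The sketch below is not gateway code: it models a session lane as a single asyncio worker (an assumption about how the lane behaves), with a compaction stand-in that hangs until its timeout and a reset that is instant once it gets to run. Timeouts are shortened so it finishes quickly.

```python
import asyncio


async def lane_worker(queue: asyncio.Queue, log: list[str]) -> None:
    """Run queued jobs strictly one at a time, like a single session lane."""
    while True:
        name, job = await queue.get()
        try:
            await job()
            log.append(f"{name}: done")
        except asyncio.TimeoutError:
            log.append(f"{name}: timeout")
        finally:
            queue.task_done()


async def stuck_compaction(timeout_s: float) -> None:
    # Stand-in for a summarization call that never returns.
    await asyncio.wait_for(asyncio.sleep(3600), timeout=timeout_s)


async def reset() -> None:
    # Stand-in for /new or /reset: instant once it actually runs.
    return None


async def demo(timeout_s: float = 0.2) -> tuple[list[str], float]:
    queue: asyncio.Queue = asyncio.Queue()
    log: list[str] = []
    worker = asyncio.create_task(lane_worker(queue, log))
    loop = asyncio.get_running_loop()
    start = loop.time()
    await queue.put(("compaction", lambda: stuck_compaction(timeout_s)))
    await queue.put(("reset", reset))  # queued behind the compaction
    await queue.join()
    waited = loop.time() - start
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return log, waited


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The reset only runs after the compaction has used its full timeout window. With a 600s timeout and a new compaction triggered by every incoming message, that wait effectively never ends.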


Root Cause

  1. Session lane is single-threaded. Compaction, message processing, and administrative commands (/new, /reset) all share the same lane. A timed-out compaction blocks everything.
  2. No compaction circuit breaker. Sessions that fail compaction repeatedly will keep attempting it on every incoming message, consuming the full timeout window each time.
  3. No out-of-band session reset. All reset paths go through the gateway session lane. If the lane is blocked, there is no recovery without filesystem surgery.
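
One way to address points 1 and 3 is to let administrative commands cancel whatever the lane is running instead of queueing behind it. The sketch below illustrates that idea under stated assumptions: `PreemptibleLane`, `run`, and `control` are hypothetical names, and the real lane implementation may be structured quite differently.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class PreemptibleLane:
    """Single lane where control commands cancel the running job (hypothetical)."""

    def __init__(self) -> None:
        self._current: Optional[asyncio.Future] = None
        self._preempted = False
        self.events: list[str] = []

    async def run(self, name: str, job: Callable[[], Awaitable[None]]) -> None:
        """Run a normal, preemptible job such as compaction."""
        task = asyncio.ensure_future(job())
        self._current = task
        try:
            await task
            self.events.append(f"{name}: done")
        except asyncio.CancelledError:
            if not self._preempted:
                raise  # cancelled from elsewhere, not by a control command
            self.events.append(f"{name}: aborted")
        finally:
            self._current = None
            self._preempted = False

    async def control(self, name: str, command: Callable[[], Awaitable[None]]) -> None:
        """Run an admin command (/new, /reset), aborting any in-flight job first."""
        task = self._current
        if task is not None and not task.done():
            self._preempted = True
            task.cancel()
            await asyncio.wait({task})  # wait until the cancellation has landed
        await command()
        self.events.append(f"{name}: done")


async def demo() -> list[str]:
    async def reset() -> None:
        return None

    lane = PreemptibleLane()
    compaction = asyncio.create_task(
        lane.run("compaction", lambda: asyncio.sleep(3600))
    )
    await asyncio.sleep(0)  # let the compaction job start
    await lane.control("reset", reset)
    await compaction
    return lane.events
```

Here the reset completes immediately and the compaction is recorded as aborted. A real implementation would also need to decide what happens to jobs still waiting in the queue when a reset preempts the lane.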

Expected Behavior

  • /new and /reset should preempt or abort an active compaction, not queue behind it
  • Compaction should have a circuit breaker — after N failures, stop retrying on every message
  • Session size should trigger a warning or auto-action before compaction becomes untenable (e.g., >500K chars or >500 messages)
  • A CLI command should exist for direct session file operations without going through the gateway (e.g., openclaw sessions reset --agent main --force)
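
The circuit breaker and size check could look roughly like the following. This is a sketch only: `CompactionBreaker` and `session_oversized` are hypothetical names, the failure count and cooldown are placeholder values, and the size thresholds are the example figures from the list above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class CompactionBreaker:
    """Stop retrying compaction for a cooldown period after repeated failures."""

    max_failures: int = 3        # placeholder for "N failures"
    cooldown_s: float = 1800.0   # placeholder cooldown
    clock: Callable[[], float] = time.monotonic
    _failures: int = field(default=0, init=False)
    _opened_at: Optional[float] = field(default=None, init=False)

    def allow(self) -> bool:
        """Return True if a compaction attempt should be made now."""
        if self._opened_at is None:
            return True
        # After the cooldown, allow one trial attempt (half-open state).
        return self.clock() - self._opened_at >= self.cooldown_s

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self.clock()  # open, or re-open after a failed trial

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None


def session_oversized(
    total_chars: int,
    message_count: int,
    max_chars: int = 500_000,
    max_messages: int = 500,
) -> bool:
    """Flag sessions that have outgrown the example thresholds."""
    return total_chars > max_chars or message_count > max_messages
```

With these thresholds, the March 8 session (~1.19M characters, 759 messages) would have been flagged well before compaction became untenable. While the breaker is open, incoming messages would skip compaction instead of spending the full timeout window on each attempt.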

Environment

  • OpenClaw gateway (LaunchAgent, macOS)
  • Compaction providers: anthropic/claude-sonnet-4-6, openai-codex/gpt-5.3-codex, openai-codex/gpt-5.2-codex
  • Compaction timeouts: 300s (March 6), 600s (March 8)
  • Channel: Telegram DM

Log Sources

  • Gateway logs: ~/.openclaw/logs/gateway.err.log
  • Session logs: /tmp/openclaw/openclaw-2026-03-06.log, /tmp/openclaw/openclaw-2026-03-08.log
