
Compaction timeout races against channel timeout, causing stale-response loop #25272

@merlinrabens


Summary

When compaction triggers on a Telegram channel, three competing timeout layers race against each other. If the channel timeout fires first, it delivers a stale cached response and aborts the in-flight compaction. Since context is still over threshold, compaction immediately retriggers — creating a deterministic loop of stale responses until manual intervention.

Environment

  • OpenClaw: 2026.2.22-2 (latest stable)
  • Model: anthropic/claude-opus-4.6 (200K context window)
  • Channel: Telegram
  • Compaction mode: safeguard

Root Cause

Three timeout layers compete during compaction:

  Layer                             Default            Configurable?
  --------------------------------  -----------------  -------------
  channels.telegram.timeoutSeconds  240s               Yes
  EMBEDDED_COMPACTION_TIMEOUT_MS    300s               No (hardcoded in pi-embedded-CZp-Kzhd.js)
  Session lock maxHoldMs            300s + 120s grace  No

The channel timeout (240s) is the tightest. When Opus 4.6 takes >4 minutes to compact a large context, the Telegram channel gives up waiting, delivers the stale "current snapshot", and unsubscribes — which aborts the in-flight compaction with AbortError: Unsubscribed during compaction.

Since the context hasn't actually been compacted, it's still over threshold, so the next message triggers compaction again. Loop.
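The race can be sketched in a few lines. This is an illustrative model, not OpenClaw's actual code: timings are scaled from seconds to milliseconds, and all names are made up. The point is that whichever timer is tighter wins, and the channel timeout aborting the compaction is what produces the stale snapshot.

```typescript
// Scaled-down model of the race: channel timeout (240) vs. compaction (needs 250).
const CHANNEL_TIMEOUT_MS = 240; // stands in for channels.telegram.timeoutSeconds

// A compaction job that can be killed by the channel unsubscribing.
function compact(durationMs: number, signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const t = setTimeout(() => resolve("compacted context"), durationMs);
    signal.addEventListener("abort", () => {
      clearTimeout(t);
      reject(new Error("AbortError: Unsubscribed during compaction"));
    });
  });
}

async function deliver(compactionMs: number): Promise<string> {
  const ctrl = new AbortController();
  const staleSnapshot = new Promise<string>((resolve) =>
    setTimeout(() => {
      ctrl.abort(); // channel gives up and unsubscribes, aborting compaction
      resolve("stale snapshot");
    }, CHANNEL_TIMEOUT_MS),
  );
  return Promise.race([
    compact(compactionMs, ctrl.signal).catch(() => "stale snapshot"),
    staleSnapshot,
  ]);
}

// Compaction needs "250s" but the channel only waits "240s":
deliver(250).then((r) => console.log(r)); // "stale snapshot"
```

Because the aborted compaction leaves the context untouched, every subsequent message replays this same race with the same outcome.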

Reproduction

Observed three consecutive compaction attempts, each aborting at ~4 min:

Run 1 (23:03:17 → 23:07:17):
  compaction start → compaction wait aborted (timeout)
  "using current snapshot: timed out during compaction"
  "compaction promise rejected: AbortError: Unsubscribed during compaction"

Run 2 (23:07:18 → 23:11:18):
  [identical pattern]

Run 3 (23:11:20 → 23:15:20):
  [identical pattern]

User sees the same stale message delivered three times.

Workaround

Set the channel timeout above the compaction timeout:

{
  "channels": {
    "telegram": {
      "timeoutSeconds": 600
    }
  }
}

This stops the loop immediately — the channel waits long enough for compaction to finish. Additionally, lowering contextTokens (e.g. 128000) triggers compaction earlier on smaller context, which Opus can summarize within the 5-min window.
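Until an upstream fix lands, the safe invariant is simply timeoutSeconds > 300. A hypothetical startup check (not part of OpenClaw; names are illustrative) could catch the misconfiguration before it loops:

```typescript
// Hypothetical validation sketch: warn when the channel timeout is tighter
// than the hardcoded compaction timeout, since that ordering causes the loop.
const COMPACTION_TIMEOUT_S = 300; // EMBEDDED_COMPACTION_TIMEOUT_MS / 1000

interface ChannelConfig {
  timeoutSeconds: number;
}

function checkTimeouts(channel: ChannelConfig): string[] {
  const warnings: string[] = [];
  if (channel.timeoutSeconds <= COMPACTION_TIMEOUT_S) {
    warnings.push(
      `channel timeout (${channel.timeoutSeconds}s) <= compaction timeout ` +
        `(${COMPACTION_TIMEOUT_S}s): stale-response loop possible`,
    );
  }
  return warnings;
}

console.log(checkTimeouts({ timeoutSeconds: 240 }).length); // 1 (the default races)
console.log(checkTimeouts({ timeoutSeconds: 600 }).length); // 0 (workaround applied)
```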

Suggested Upstream Fixes

Any one of these would eliminate the race condition:

  1. Pause channel timeout during compaction. If the gateway knows compaction is in progress, the channel shouldn't be racing against it. This is the cleanest fix.

  2. Make EMBEDDED_COMPACTION_TIMEOUT_MS configurable. Something like compaction.timeoutMs in config. The hardcoded 300s is too short for Opus on large contexts.

  3. Allow configuring the compaction model (related: #14543, "feat: add model fallback support for /compact (compaction)"). Using Sonnet for compaction instead of the primary Opus model would be 3-5x faster and stay well within the timeout.

  4. Circuit breaker. After N consecutive compaction timeouts, force a session reset instead of looping forever with stale responses.
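Fix 4 is the smallest change in scope. A minimal sketch of the breaker logic (illustrative names, not OpenClaw's API): count consecutive timeouts, and once the limit is hit, force a reset instead of another retry.

```typescript
// Circuit-breaker sketch for fix #4: after N consecutive compaction
// timeouts, signal a session reset instead of looping on stale responses.
class CompactionBreaker {
  private failures = 0;
  constructor(private readonly maxFailures = 3) {}

  recordTimeout(): "retry" | "reset" {
    this.failures += 1;
    if (this.failures >= this.maxFailures) {
      this.failures = 0; // breaker trips: caller should reset the session
      return "reset";
    }
    return "retry"; // under the limit: allow another compaction attempt
  }

  recordSuccess(): void {
    this.failures = 0; // any successful compaction clears the streak
  }
}

// The three aborted runs in the log above would trip a 3-strike breaker:
const breaker = new CompactionBreaker(3);
console.log(breaker.recordTimeout()); // "retry"  (run 1)
console.log(breaker.recordTimeout()); // "retry"  (run 2)
console.log(breaker.recordTimeout()); // "reset"  (run 3)
```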

