Compaction timeout races against channel timeout, causing stale-response loop #25272
Description
Summary
When compaction triggers on a Telegram channel, three competing timeout layers race against each other. If the channel timeout fires first, it delivers a stale cached response and aborts the in-flight compaction. Since context is still over threshold, compaction immediately retriggers — creating a deterministic loop of stale responses until manual intervention.
Environment
- OpenClaw: 2026.2.22-2 (latest stable)
- Model: anthropic/claude-opus-4.6 (200K context window)
- Channel: Telegram
- Compaction mode: safeguard
Root Cause
Three timeout layers compete during compaction:
| Layer | Default | Configurable? |
|---|---|---|
| `channels.telegram.timeoutSeconds` | 240s | Yes |
| `EMBEDDED_COMPACTION_TIMEOUT_MS` | 300s | No (hardcoded in `pi-embedded-CZp-Kzhd.js`) |
| Session lock `maxHoldMs` | 300s + 120s grace | No |
The channel timeout (240s) is the tightest. When Opus 4.6 takes more than 4 minutes to compact a large context, the Telegram channel gives up waiting, delivers the stale "current snapshot", and unsubscribes, which aborts the in-flight compaction with `AbortError: Unsubscribed during compaction`.
Since the context hasn't actually been compacted, it's still over threshold, so the next message triggers compaction again. Loop.
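The race can be reproduced in miniature. The sketch below (Python, with the timeouts scaled down 1000x; all names are illustrative stand-ins, not OpenClaw internals) shows why the loop is deterministic: the channel timeout always fires before compaction finishes, the compaction task is cancelled, and the context size never shrinks, so the next message retriggers compaction.

```python
import asyncio

CHANNEL_TIMEOUT = 0.24      # stands in for channels.telegram.timeoutSeconds (240s)
COMPACTION_DURATION = 0.28  # stands in for Opus taking >4 min on a large context

async def compact(context_tokens: int) -> int:
    """Pretend compaction: slow summarization that halves the context."""
    await asyncio.sleep(COMPACTION_DURATION)
    return context_tokens // 2

async def handle_message(context_tokens: int, threshold: int) -> tuple[int, str]:
    """If over threshold, start compaction but race it against the channel timeout."""
    if context_tokens <= threshold:
        return context_tokens, "fresh response"
    task = asyncio.create_task(compact(context_tokens))
    try:
        # wait_for cancels the task on timeout: the channel "unsubscribes"
        # and the in-flight compaction aborts.
        context_tokens = await asyncio.wait_for(task, CHANNEL_TIMEOUT)
        return context_tokens, "fresh response"
    except asyncio.TimeoutError:
        # Context is unchanged, so the next message loops right back here.
        return context_tokens, "stale snapshot"

async def main():
    tokens, threshold = 200_000, 150_000
    for run in range(3):
        tokens, reply = await handle_message(tokens, threshold)
        # each run prints: run N: stale snapshot, context=200000
        print(f"run {run + 1}: {reply}, context={tokens}")

asyncio.run(main())
```

Raising `CHANNEL_TIMEOUT` above `COMPACTION_DURATION` makes the first run succeed and the loop disappears, which is exactly what the workaround below exploits.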
Reproduction
Observed three consecutive compaction attempts, each aborting at ~4 min:
Run 1 (23:03:17 → 23:07:17):

```
compaction start → compaction wait aborted (timeout)
"using current snapshot: timed out during compaction"
"compaction promise rejected: AbortError: Unsubscribed during compaction"
```

Run 2 (23:07:18 → 23:11:18): identical pattern.

Run 3 (23:11:20 → 23:15:20): identical pattern.
User sees the same stale message delivered three times.
Workaround
Set the channel timeout above the compaction timeout:
```json
{
  "channels": {
    "telegram": {
      "timeoutSeconds": 600
    }
  }
}
```

This stops the loop immediately: the channel now waits long enough for compaction to finish. Additionally, lowering `contextTokens` (e.g. to 128000) triggers compaction earlier on a smaller context, which Opus can summarize within the 5-minute window.
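Until an upstream fix lands, the invariant to preserve is simply that the channel timeout must exceed the compaction timeout. A minimal sanity check (Python sketch; the constant mirrors the hardcoded 300s value, and the helper name is hypothetical, not part of OpenClaw):

```python
# Hardcoded upstream in pi-embedded-CZp-Kzhd.js; not configurable as of 2026.2.22-2.
EMBEDDED_COMPACTION_TIMEOUT_MS = 300_000

def channel_timeout_is_safe(config: dict) -> bool:
    """True if the Telegram channel will outlast an embedded compaction run."""
    timeout_s = config["channels"]["telegram"]["timeoutSeconds"]
    return timeout_s * 1000 > EMBEDDED_COMPACTION_TIMEOUT_MS

# The workaround value passes; the 240s default does not.
assert channel_timeout_is_safe({"channels": {"telegram": {"timeoutSeconds": 600}}})
assert not channel_timeout_is_safe({"channels": {"telegram": {"timeoutSeconds": 240}}})
```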
Suggested Upstream Fixes
Any one of these would eliminate the race condition:
- **Pause the channel timeout during compaction.** If the gateway knows compaction is in progress, the channel shouldn't be racing against it. This is the cleanest fix.
- **Make `EMBEDDED_COMPACTION_TIMEOUT_MS` configurable.** Something like `compaction.timeoutMs` in config. The hardcoded 300s is too short for Opus on large contexts.
- **Allow configuring the compaction model** (related: feat: add model fallback support for /compact (compaction) #14543). Using Sonnet for compaction instead of the primary Opus model would be 3-5x faster and stay well within the timeout.
- **Circuit breaker.** After N consecutive compaction timeouts, force a session reset instead of looping forever with stale responses.
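The circuit-breaker idea is small enough to sketch. This is an illustrative Python sketch of the suggested behavior, not OpenClaw's actual API; class and method names are invented:

```python
class CompactionCircuitBreaker:
    """Trip after N consecutive compaction timeouts and request a session reset,
    instead of re-delivering stale snapshots forever."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def record_timeout(self) -> bool:
        """Record one aborted compaction; return True when the session should reset."""
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.max_failures

    def record_success(self) -> None:
        """A completed compaction closes the breaker again."""
        self.consecutive_failures = 0
```

With `max_failures=3`, the scenario in the reproduction above would have forced a reset after run 3 instead of delivering the same stale message a third time.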
Related Issues
- Compaction causes gateway to hang, requiring manual restart #13379 (closed — same symptom, root cause not identified)
- fix: wrap waitForCompactionRetry() in abortable() to prevent lane deadlock on timeout #13347 (gateway hang)
- Session lane stays stuck after embedded compaction run timeout #16331 (lanes stuck)
- feat: add model fallback support for /compact (compaction) #14543 (compaction model selection)