
Compaction timeout races against channel timeout, causing stale-response loop #25272

@merlinrabens


Summary

When compaction triggers on a Telegram channel, three competing timeout layers race against each other. If the channel timeout fires first, it delivers a stale cached response and aborts the in-flight compaction. Since context is still over threshold, compaction immediately retriggers — creating a deterministic loop of stale responses until manual intervention.

Environment

  • OpenClaw: 2026.2.22-2 (latest stable)
  • Model: anthropic/claude-opus-4.6 (200K context window)
  • Channel: Telegram
  • Compaction mode: safeguard

Root Cause

Three timeout layers compete during compaction:

  Layer                             Default            Configurable?
  --------------------------------  -----------------  -------------
  channels.telegram.timeoutSeconds  240s               Yes
  EMBEDDED_COMPACTION_TIMEOUT_MS    300s               No (hardcoded in pi-embedded-CZp-Kzhd.js)
  Session lock maxHoldMs            300s + 120s grace  No

The channel timeout (240s) is the tightest. When Opus 4.6 takes >4 minutes to compact a large context, the Telegram channel gives up waiting, delivers the stale "current snapshot", and unsubscribes — which aborts the in-flight compaction with AbortError: Unsubscribed during compaction.

Since the context hasn't actually been compacted, it's still over threshold, so the next message triggers compaction again. Loop.
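The race can be sketched in a few lines. This is an illustrative model, not OpenClaw's actual code: timings are scaled from seconds to milliseconds, and all names are made up. The point is that whichever timer is tighter wins, and the channel timeout aborting the compaction is what produces the stale snapshot.

```typescript
// Scaled-down model of the race: channel timeout (240) vs. compaction (needs 250).
const CHANNEL_TIMEOUT_MS = 240; // stands in for channels.telegram.timeoutSeconds

// A compaction job that can be killed by the channel unsubscribing.
function compact(durationMs: number, signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const t = setTimeout(() => resolve("compacted context"), durationMs);
    signal.addEventListener("abort", () => {
      clearTimeout(t);
      reject(new Error("AbortError: Unsubscribed during compaction"));
    });
  });
}

async function deliver(compactionMs: number): Promise<string> {
  const ctrl = new AbortController();
  const staleSnapshot = new Promise<string>((resolve) =>
    setTimeout(() => {
      ctrl.abort(); // channel gives up and unsubscribes, aborting compaction
      resolve("stale snapshot");
    }, CHANNEL_TIMEOUT_MS),
  );
  return Promise.race([
    compact(compactionMs, ctrl.signal).catch(() => "stale snapshot"),
    staleSnapshot,
  ]);
}

// Compaction needs "250s" but the channel only waits "240s":
deliver(250).then((r) => console.log(r)); // "stale snapshot"
```

Because the aborted compaction leaves the context untouched, every subsequent message replays this same race with the same outcome.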

Reproduction

Observed three consecutive compaction attempts, each aborting at ~4 min:

Run 1 (23:03:17 → 23:07:17):
  compaction start → compaction wait aborted (timeout)
  "using current snapshot: timed out during compaction"
  "compaction promise rejected: AbortError: Unsubscribed during compaction"

Run 2 (23:07:18 → 23:11:18):
  [identical pattern]

Run 3 (23:11:20 → 23:15:20):
  [identical pattern]

User sees the same stale message delivered three times.

Workaround

Set the channel timeout above the compaction timeout:

{
  "channels": {
    "telegram": {
      "timeoutSeconds": 600
    }
  }
}

This stops the loop immediately — the channel waits long enough for compaction to finish. Additionally, lowering contextTokens (e.g. 128000) triggers compaction earlier on smaller context, which Opus can summarize within the 5-min window.
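Until an upstream fix lands, the safe invariant is simply timeoutSeconds > 300. A hypothetical startup check (not part of OpenClaw; names are illustrative) could catch the misconfiguration before it loops:

```typescript
// Hypothetical validation sketch: warn when the channel timeout is tighter
// than the hardcoded compaction timeout, since that ordering causes the loop.
const COMPACTION_TIMEOUT_S = 300; // EMBEDDED_COMPACTION_TIMEOUT_MS / 1000

interface ChannelConfig {
  timeoutSeconds: number;
}

function checkTimeouts(channel: ChannelConfig): string[] {
  const warnings: string[] = [];
  if (channel.timeoutSeconds <= COMPACTION_TIMEOUT_S) {
    warnings.push(
      `channel timeout (${channel.timeoutSeconds}s) <= compaction timeout ` +
        `(${COMPACTION_TIMEOUT_S}s): stale-response loop possible`,
    );
  }
  return warnings;
}

console.log(checkTimeouts({ timeoutSeconds: 240 }).length); // 1 (the default races)
console.log(checkTimeouts({ timeoutSeconds: 600 }).length); // 0 (workaround applied)
```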

Suggested Upstream Fixes

Any one of these would eliminate the race condition:

  1. Pause channel timeout during compaction. If the gateway knows compaction is in progress, the channel shouldn't be racing against it. This is the cleanest fix.

  2. Make EMBEDDED_COMPACTION_TIMEOUT_MS configurable. Something like compaction.timeoutMs in config. The hardcoded 300s is too short for Opus on large contexts.

  3. Allow configuring the compaction model (related: #14543, "feat: add model fallback support for /compact (compaction)"). Using Sonnet for compaction instead of the primary Opus model would be 3-5x faster and stay well within the timeout.

  4. Circuit breaker. After N consecutive compaction timeouts, force a session reset instead of looping forever with stale responses.
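Fix 4 is the smallest change in scope. A minimal sketch of the breaker logic (illustrative names, not OpenClaw's API): count consecutive timeouts, and once the limit is hit, force a reset instead of another retry.

```typescript
// Circuit-breaker sketch for fix #4: after N consecutive compaction
// timeouts, signal a session reset instead of looping on stale responses.
class CompactionBreaker {
  private failures = 0;
  constructor(private readonly maxFailures = 3) {}

  recordTimeout(): "retry" | "reset" {
    this.failures += 1;
    if (this.failures >= this.maxFailures) {
      this.failures = 0; // breaker trips: caller should reset the session
      return "reset";
    }
    return "retry"; // under the limit: allow another compaction attempt
  }

  recordSuccess(): void {
    this.failures = 0; // any successful compaction clears the streak
  }
}

// The three aborted runs in the log above would trip a 3-strike breaker:
const breaker = new CompactionBreaker(3);
console.log(breaker.recordTimeout()); // "retry"  (run 1)
console.log(breaker.recordTimeout()); // "retry"  (run 2)
console.log(breaker.recordTimeout()); // "reset"  (run 3)
```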

