fix: wrap waitForCompactionRetry() in abortable() to prevent lane deadlock on timeout #13347

Closed
smartleos wants to merge 1 commit into openclaw:main from smartleos:fix/compaction-retry-abort-aware
@smartleos smartleos commented Feb 10, 2026

Summary

  • Wrap waitForCompactionRetry() in the existing abortable() helper so the embedded run timeout's abort signal can interrupt the compaction wait
  • Without this, a timeout during compaction permanently blocks the affected DM lane, leaks the session in processing state, and leaves a zombie run in the active count

Problem

In src/agents/pi-embedded-runner/run/attempt.ts, the main prompt call is correctly wrapped:

await abortable(activeSession.prompt(effectivePrompt));  // ✅ abort-aware

But the compaction wait immediately after is not:

await waitForCompactionRetry();  // ❌ bare await, abort signal ignored

When the run timeout fires during compaction (e.g., when the OpenAI Batch API is slow), abortRun(true) signals the runAbortController, but waitForCompactionRetry() never observes that signal, so the bare await stays pending for as long as compaction does. Control therefore never reaches the finally block that calls clearActiveEmbeddedRun().

Result: session stuck in processing, run never cleared, lane task never completes → DM channel permanently dead until gateway restart.
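
To illustrate the failure mode, here is a minimal sketch, not the code from attempt.ts. The names bareAttempt, RunState, and demo are hypothetical stand-ins, and a promise that never settles models a stalled compaction. It shows that aborting a controller has no effect on a bare await, so the finally block is not reached:

```typescript
// Hypothetical stand-ins; none of these names come from attempt.ts.
type RunState = { active: boolean };

async function bareAttempt(
  compactionWait: Promise<void>,
  state: RunState,
): Promise<void> {
  state.active = true;
  try {
    // Bare await: nothing on this line observes an abort signal.
    await compactionWait;
  } finally {
    // Only reached once compactionWait settles.
    state.active = false;
  }
}

async function demo(): Promise<boolean> {
  const controller = new AbortController();
  const state: RunState = { active: false };
  // Models a compaction wait that stalls indefinitely.
  const neverSettles = new Promise<void>(() => {});

  void bareAttempt(neverSettles, state);
  controller.abort(); // the "timeout" fires, but nothing is listening

  // Give the event loop time to run any pending cleanup.
  await new Promise((resolve) => setTimeout(resolve, 10));
  return state.active; // still true: the finally block never ran
}
```

In this sketch the run stays marked active after the abort, which mirrors the zombie run described above.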

Fix

One-line change — wrap in abortable() (already in scope, already used for the prompt):

- await waitForCompactionRetry();
+ await abortable(waitForCompactionRetry());

The existing catch block already handles AbortError correctly.
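
The real catch block is not shown in this PR, so the following is only a sketch of the pattern it presumably follows: treat an abort as an expected outcome and rethrow everything else. The helper isAbortError and the function settle are hypothetical, and checking `name === "AbortError"` is an assumption about how the abort is represented.

```typescript
// Hypothetical helper; the project's own abort check is not shown in this PR.
function isAbortError(err: unknown): boolean {
  return err instanceof Error && err.name === "AbortError";
}

// Awaits a wait promise and classifies the outcome.
// Abort errors are treated as an expected result; other errors propagate.
async function settle(wait: Promise<void>): Promise<"completed" | "aborted"> {
  try {
    await wait;
    return "completed";
  } catch (err) {
    if (isAbortError(err)) {
      return "aborted";
    }
    throw err;
  }
}
```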

Test plan

  • Existing tests pass (pnpm test)
  • abortable() immediately rejects when signal already aborted (covers case where timeout fired before reaching compaction wait)
  • abortable() rejects via abort listener when signal fires during wait (covers case where timeout fires while compaction is in-flight)
  • finally block runs in both cases, calling clearActiveEmbeddedRun() and unsubscribe()
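
The two abort cases in the test plan can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: this abortable() takes the signal as an explicit parameter (the real helper is described as already in scope and may close over runAbortController), and attempt() with its log array is a hypothetical stand-in in which "cleanup" represents clearActiveEmbeddedRun() and unsubscribe().

```typescript
function abortError(): Error {
  return Object.assign(new Error("aborted"), { name: "AbortError" });
}

// Assumed shape: rejects with an AbortError as soon as the signal aborts.
function abortable<T>(signal: AbortSignal, promise: Promise<T>): Promise<T> {
  if (signal.aborted) {
    // Case 1: the signal fired before the wait began.
    promise.catch(() => {}); // avoid an unhandled rejection on the abandoned promise
    return Promise.reject(abortError());
  }
  return new Promise<T>((resolve, reject) => {
    // Case 2: the signal fires while the wait is in flight.
    const onAbort = () => reject(abortError());
    signal.addEventListener("abort", onAbort, { once: true });
    promise
      .then(resolve, reject)
      .finally(() => signal.removeEventListener("abort", onAbort));
  });
}

// Hypothetical stand-in for the attempt body around the compaction wait.
async function attempt(
  signal: AbortSignal,
  wait: Promise<void>,
  log: string[],
): Promise<void> {
  try {
    await abortable(signal, wait);
    log.push("completed");
  } catch (err) {
    if (err instanceof Error && err.name === "AbortError") {
      log.push("aborted");
    } else {
      throw err;
    }
  } finally {
    log.push("cleanup"); // stands in for clearActiveEmbeddedRun() and unsubscribe()
  }
}
```

With a wait that never settles, both a pre-aborted signal and a signal aborted mid-wait should leave the log as "aborted" followed by "cleanup".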

Fixes #13341

Note

AI-assisted PR. The fix was identified through production log analysis and verified by reading the source. The one-line change connects two existing, well-tested mechanisms (abortable() and waitForCompactionRetry()) that were simply not wired together.

Greptile Overview

Greptile Summary

This PR changes the embedded runner’s post-prompt compaction wait to be abort-aware by wrapping waitForCompactionRetry() with the existing local abortable() helper in src/agents/pi-embedded-runner/run/attempt.ts. This makes the compaction-wait phase respond to the same runAbortController signal used for the main activeSession.prompt(...) call, so run timeouts/aborts can propagate through the compaction wait and reliably reach the outer finally cleanup that unsubscribes and clears the active embedded run/lane handle.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk.
  • The change is a single-line wiring fix that reuses an existing, locally-defined abort helper already used for the main prompt call. The abort error is explicitly recognized by isRunnerAbortError, and non-abort errors are still rethrown, so behavior outside the timeout/abort path remains unchanged.
  • src/agents/pi-embedded-runner/run/attempt.ts

…dlock on timeout

When an embedded run times out during the post-reply compaction phase,
abortRun(true) fires but the abort signal never reaches
waitForCompactionRetry() because it is a bare await — not wrapped in
abortable(). This causes the finally cleanup block to never execute,
permanently blocking the affected DM lane, leaking the session in
"processing" state, and leaving a zombie run in the active count.

The fix wraps waitForCompactionRetry() in the existing abortable()
helper (already used for activeSession.prompt() in the same scope),
so the abort signal properly interrupts the compaction wait and
allows the finally block to run clearActiveEmbeddedRun().

Fixes openclaw#13341
@sebslight
Member

Closing as duplicate of #12227. If this is incorrect, please contact us.

Labels

agents Agent runtime and tooling

Development

Successfully merging this pull request may close these issues.

[Bug]: Embedded run timeout fails to clean up session/lane state when compaction is in-flight (waitForCompactionRetry not abort-aware)