fix(embedded-runner): abort compaction wait on timeout by wangai-studio · Pull Request #15449 · openclaw/openclaw

wangai-studio · 2026-02-13T13:18:20Z

Summary

Fix a hang where an embedded attempt can get stuck waiting for compaction retry even after the run timeout/abort fires.
Add a regression test that reproduces the stuck compaction wait and asserts cleanup still runs.

Context / Bug

Before this change, runEmbeddedAttempt awaited waitForCompactionRetry() after the prompt completed. If auto-compaction entered a retry state and the expected end events never arrived, this wait could remain pending indefinitely.

Because the wait wasn’t tied to the run abort signal, a timeoutMs abort could still leave the attempt hanging before it reached its finally cleanup (clearActiveEmbeddedRun(...)). This manifests as sessions stuck in the processing state, repeated "stuck session" diagnostic log entries, and (for Telegram) no reply until a gateway restart clears the in-memory state.

Fix

Make the compaction retry wait abortable:

src/agents/pi-embedded-runner/run/attempt.ts: await abortable(waitForCompactionRetry())
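
The idea behind the fix can be illustrated with a minimal sketch. This is not the code from attempt.ts: abortableSketch, neverSettles, and runAttemptSketch are hypothetical names, and the real abortable helper may be implemented differently (the diff suggests it is already bound to the run's abort signal, whereas this sketch takes the signal as an argument). It only shows how racing a wait against an abort signal lets the finally cleanup run even when the wait itself never settles.

```typescript
// Hypothetical helper (assumption, not the project's `abortable`):
// rejects as soon as `signal` aborts, otherwise settles with `promise`.
function abortableSketch<T>(promise: Promise<T>, signal: AbortSignal): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const onAbort = () => reject(new Error("aborted"));
    if (signal.aborted) {
      onAbort();
    } else {
      signal.addEventListener("abort", onAbort, { once: true });
    }
    promise.then(
      (value) => {
        signal.removeEventListener("abort", onAbort);
        resolve(value);
      },
      (error) => {
        signal.removeEventListener("abort", onAbort);
        reject(error);
      },
    );
  });
}

// Stand-in for a compaction retry wait whose end events never arrive.
const neverSettles = (): Promise<void> => new Promise<void>(() => {});

// Hypothetical attempt flow: a timeout aborts the controller, and the
// abortable wait lets control reach the `finally` block.
async function runAttemptSketch(timeoutMs: number, onCleanup: () => void): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    await abortableSketch(neverSettles(), controller.signal);
    return "completed";
  } catch {
    return "aborted";
  } finally {
    clearTimeout(timer);
    onCleanup(); // stands in for clearActiveEmbeddedRun(...)
  }
}
```

Without the wrapper, awaiting neverSettles() directly would keep the function suspended inside the try block, so the finally block would never execute regardless of the abort.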

Evidence (Sanitized)

From a Telegram “no reply” incident (PII removed):

2026-02-12T13:01:08.804Z [agent/embedded] embedded run compaction start: runId=79b5f8f5-6107-46c9-8742-d5d34553eff1
2026-02-12T13:01:36.406Z [agent/embedded] embedded run compaction retry: runId=79b5f8f5-6107-46c9-8742-d5d34553eff1
2026-02-12T13:05:46.490Z [agent/embedded] embedded run timeout: runId=79b5f8f5-6107-46c9-8742-d5d34553eff1 sessionId=7f174700-813c-4fbc-a6f6-09cd04d67dbe timeoutMs=600000
2026-02-12T13:06:14.587Z [diagnostic] stuck session: sessionId=7f174700-813c-4fbc-a6f6-09cd04d67dbe sessionKey=unknown state=processing age=628s queueDepth=0

(Full sanitized excerpt available in my incident notes; happy to provide more if needed.)

Test Plan

pnpm check
pnpm build
pnpm vitest src/agents/pi-embedded-runner/run

AI-Assisted

AI-assisted (Codex). I reviewed the code and understand the change.

Greptile Overview

Greptile Summary

This PR makes the post-prompt waitForCompactionRetry() wait abortable by wrapping it in the attempt’s abort signal, preventing embedded runs from hanging past timeoutMs and skipping clearActiveEmbeddedRun(...) cleanup.

It also adds a regression test that simulates a compaction retry wait that never resolves and asserts the attempt exits on timeout and still runs cleanup. The behavior change is localized to the embedded runner attempt flow (src/agents/pi-embedded-runner/run/attempt.ts), specifically the section after the prompt completes where compaction retry waits previously could block indefinitely.

Confidence Score: 4/5

This PR is close to safe to merge, with the main risk being a potentially flaky regression test under fake timers.
The production change is small and correctly ties the compaction retry wait to the run abort signal. The primary remaining concern is the new test’s timing/awaiting pattern, which may intermittently fail in CI due to fake-timers microtask flushing rather than real logic regressions.
src/agents/pi-embedded-runner/run/attempt.compaction-timeout.test.ts

Last reviewed commit: bba5521

greptile-apps

2 files reviewed, 1 comment

greptile-apps · 2026-02-13T13:25:06Z

src/agents/pi-embedded-runner/run/attempt.compaction-timeout.test.ts

+    let finished = false;
+    void runPromise.finally(() => {
+      finished = true;
+    });
+
+    try {
+      await waitCalled;
+      await vi.advanceTimersByTimeAsync(timeoutMs + 1);
+      await Promise.resolve();
+
+      // Expect runner to end on timeout instead of hanging in compaction wait.
+      expect(finished).toBe(true);
+      expect(clearActiveEmbeddedRun).toHaveBeenCalledTimes(1);
+    } finally {


Flaky completion assertion

This test doesn’t await runPromise and instead asserts finished (set via runPromise.finally) after advancing fake timers and a single Promise.resolve(). With Vitest fake timers, the timeout callback + async finally cleanup (where clearActiveEmbeddedRun is called) can take additional microtask flushes, so the assertion can fail even when the fix works. Consider awaiting the promise (or awaiting the .finally() you attach) rather than relying on the finished flag/microtask timing.
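
The timing concern can be reproduced without any test framework. The sketch below is a simplified model, not the PR's test: multiHopRun and compareAwaitStrategies are hypothetical names, and the number of microtask hops is an assumption chosen for illustration. It shows that a single Promise.resolve() flush can observe the run before its cleanup has finished, while awaiting the chained promise always observes it afterwards.

```typescript
// Hypothetical stand-in for the attempt: it settles only after several
// microtask hops, as a timeout callback followed by an async `finally`
// cleanup might.
async function multiHopRun(log: string[]): Promise<void> {
  try {
    await Promise.resolve();
    await Promise.resolve();
  } finally {
    await Promise.resolve();
    log.push("cleanup"); // stands in for clearActiveEmbeddedRun(...)
  }
}

async function compareAwaitStrategies(): Promise<{
  afterSingleFlush: number;
  afterAwait: number;
}> {
  const log: string[] = [];
  const done = multiHopRun(log).finally(() => {
    log.push("finished");
  });

  // One microtask flush, as in the reviewed test: the run is still in flight.
  await Promise.resolve();
  const afterSingleFlush = log.length;

  // Awaiting the chained promise does not depend on how many hops remain.
  await done;
  const afterAwait = log.length;

  return { afterSingleFlush, afterAwait };
}
```

In the actual test, the equivalent change would presumably be to await runPromise (or the promise returned by its .finally()) after advancing the fake timers, and only then assert on clearActiveEmbeddedRun.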

When compaction enters a retry and the agent never emits the expected end events, waitForCompactionRetry() can hang forever. This prevents the attempt cleanup from running and keeps the active run state stuck in processing. Wrap the compaction wait in the runner abort controller and add a regression test. Test: pnpm vitest src/agents/pi-embedded-runner/run

…openclaw#15449)

…n-wait # Conflicts: # src/agents/pi-embedded-runner/run/attempt.ts

openclaw-barnacle · 2026-02-20T10:48:02Z