fix(agents): wait for agent idle before flushing pending tool results by rodbland2021 · Pull Request #13746 · openclaw/openclaw

rodbland2021 · 2026-02-11T00:02:26Z

Problem

Tool results are intermittently lost during normal agent operation, with the system inserting synthetic errors:

[openclaw] missing tool result in session history; inserted synthetic error result for transcript repair.

This causes agents to go silent and stop responding. It affects all versions, including 2026.2.9.

Fixes #8643, #13351. Refs #6682, #12595.

Root Cause

The bug is a race condition between pi-agent-core's auto-retry mechanism and the attempt lifecycle in run/attempt.ts.

When an API error occurs (e.g., overloaded_error, rate limit), pi-agent-core's _handleRetryableError() retries the LLM call via agent.continue(). When the retry succeeds with a tool call:

1. _resolveRetry() fires on message_end (assistant message received), before tool execution completes.
2. waitForRetry() resolves in agent-session.prompt().
3. prompt() returns to attempt.ts.
4. The attempt's finally block calls flushPendingToolResults().
5. The tool call was registered in the guard's pending map but never executed, so a synthetic error is inserted.
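
The sequence above can be reproduced in a small simulation. This is only a sketch: `PendingToolGuard`, `simulateRetry`, and `runAttempt` are hypothetical stand-ins written for illustration and are not the real pi-agent-core or OpenClaw APIs. The simulated tool takes 20 ms, and the retry promise resolves as soon as the assistant message arrives.

```typescript
// Hypothetical stand-ins for illustration; not the real pi-agent-core or OpenClaw APIs.
type ToolResult = { toolCallId: string; content: string; synthetic: boolean };

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

class PendingToolGuard {
  private pending = new Set<string>();
  readonly transcript: ToolResult[] = [];

  registerToolCall(id: string): void {
    this.pending.add(id);
  }

  recordResult(id: string, content: string): void {
    // A result that arrives after a flush is dropped: the synthetic entry is already recorded.
    if (!this.pending.delete(id)) return;
    this.transcript.push({ toolCallId: id, content, synthetic: false });
  }

  flushPending(): void {
    this.pending.forEach((id) => {
      this.transcript.push({ toolCallId: id, content: "missing tool result", synthetic: true });
    });
    this.pending.clear();
  }
}

// Simulated retry loop: the assistant message arrives, then the tool runs for 20 ms.
async function simulateRetry(guard: PendingToolGuard, resolveRetry: () => void): Promise<void> {
  guard.registerToolCall("tool-1");
  resolveRetry(); // fires on message_end, before the tool has run
  await sleep(20); // tool execution
  guard.recordResult("tool-1", "ok");
}

async function runAttempt(waitForIdleFirst: boolean): Promise<ToolResult[]> {
  const guard = new PendingToolGuard();
  let resolveRetry: () => void = () => {};
  const retryDone = new Promise<void>((resolve) => {
    resolveRetry = resolve;
  });
  const agentIdle = simulateRetry(guard, resolveRetry);
  try {
    await retryDone; // prompt() returns here
  } finally {
    if (waitForIdleFirst) await agentIdle;
    guard.flushPending();
  }
  await agentIdle; // let the simulated tool finish so a late result is observed
  return guard.transcript;
}
```

Calling `runAttempt(false)` yields a transcript holding only the synthetic entry, while `runAttempt(true)` yields only the real result.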

Evidence

From a real session (Arthur agent, Discord channel):

| Line | Timestamp | Event |
|------|-----------|-------|
| 407 | 23:27:55.659Z | Assistant: `overloaded_error`, empty content, `stopReason: "error"` |
| 408 | 23:28:01.726Z | Retry assistant: tool call `toolu_015kn1n1vixFyxMSyHCTWfPt` (exec), `stopReason: "toolUse"` |
| 409 | 23:28:01.779Z | Synthetic error inserted, only 53ms after the tool call |

The exec command normally takes 1-5 seconds, so a 53ms gap means the tool cannot have finished executing before its pending entry was flushed.
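
The gaps can be checked directly from the three timestamps in the table. The helper `timeOfDayMs` below is a hypothetical name introduced for this sketch; it only handles the `HH:MM:SS.mmmZ` shape shown above and assumes all three events fall on the same day.

```typescript
// Convert an "HH:MM:SS.mmmZ" time-of-day string into milliseconds since midnight.
function timeOfDayMs(stamp: string): number {
  const match = /^(\d{2}):(\d{2}):(\d{2})\.(\d{3})Z$/.exec(stamp);
  if (!match) throw new Error(`unexpected timestamp format: ${stamp}`);
  const [, h, m, s, ms] = match;
  return ((Number(h) * 60 + Number(m)) * 60 + Number(s)) * 1000 + Number(ms);
}

const errorAt = timeOfDayMs("23:27:55.659Z"); // line 407: overloaded_error
const toolCallAt = timeOfDayMs("23:28:01.726Z"); // line 408: retried tool call
const syntheticAt = timeOfDayMs("23:28:01.779Z"); // line 409: synthetic error

const retryDelayMs = toolCallAt - errorAt; // 6067
const flushGapMs = syntheticAt - toolCallAt; // 53
console.log({ retryDelayMs, flushGapMs });
```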

Why existing PRs don't fix this

- #9855, fix(agent): prevent session lock deadlock on timeout during compaction: addresses a compaction deadlock, which is a different code path.
- #3622, fix(agents): drop orphan tool results; #12487, fix(agents): strip orphaned tool_result when tool_use is sanitized on retry; #8294, fix(agents): validate tool_use exists before synthetic result creation: these improve the repair/cleanup logic after the loss but do not prevent it.
- #13282, fix(agents): instruct agent not to retry lost tool results: a prompt engineering workaround ("don't retry lost results").

None of them addresses the timing gap where waitForRetry() resolves before tool execution completes.

Fix

Add agent.waitForIdle() before every flushPendingToolResults() call site. There are three locations:

1. Main attempt finally block (`run/attempt.ts`)

The primary path — runs after every agent turn completes.

2. Session setup catch block (`run/attempt.ts`)

Error handler during session initialization. Can fire if session loading throws while an agent retry has tool calls in flight.

3. Compaction finally block (`compact.ts`)

Teardown after context compaction. Can flush while a concurrent retry's tools are still executing.

All three now await agent.waitForIdle() with a 30-second safety timeout before flushing:

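// Wait for the agent loop (including in-flight tool execution) to go idle,
// but never block teardown for more than 30 seconds.
if (session?.agent?.waitForIdle) {
  try {
    await Promise.race([
      session.agent.waitForIdle(),
      new Promise<void>((resolve) => setTimeout(resolve, 30_000)), // safety timeout
    ]);
  } catch { /* best-effort */ }
}
// Only now insert synthetic errors for tool calls that are still pending.
sessionManager?.flushPendingToolResults?.();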
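
Since the same pattern is used at three call sites, it could be factored into one helper that also clears the timer once the race settles. The following is a minimal sketch under assumptions: `flushAfterIdle`, `IdleAwareAgent`, and `ToolResultFlusher` are hypothetical names, and the interfaces only model the two optional methods used in the snippet above, not the real OpenClaw types.

```typescript
// Hypothetical shapes for illustration; the real OpenClaw types may differ.
interface IdleAwareAgent {
  waitForIdle?: () => Promise<void>;
}

interface ToolResultFlusher {
  flushPendingToolResults?: () => void;
}

async function flushAfterIdle(
  agent: IdleAwareAgent | undefined,
  sessionManager: ToolResultFlusher | undefined,
  timeoutMs = 30_000,
): Promise<void> {
  if (agent?.waitForIdle) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      await Promise.race([
        agent.waitForIdle(),
        new Promise<void>((resolve) => {
          timer = setTimeout(resolve, timeoutMs); // safety timeout
        }),
      ]);
    } catch {
      // Best-effort: a failing idle wait must not prevent the flush.
    } finally {
      // Clear the timer so it does not keep the process alive after an early idle.
      if (timer !== undefined) clearTimeout(timer);
    }
  }
  sessionManager?.flushPendingToolResults?.();
}
```

With a helper of this shape, each call site would reduce to a single `await flushAfterIdle(session?.agent, sessionManager)`, and the timeout would apply uniformly.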

Why all three are needed

Initial production testing (2026-02-11) showed that patching only the main finally block was insufficient — sub-agent sessions continued producing synthetic errors. Adding debug logging to the flush function revealed additional call sites being hit during sub-agent runs.

Upstream Root Cause

The deeper root cause is in @mariozechner/pi-agent-core's agent-session.ts: _resolveRetry() fires on the message_end event handler (when assistant message arrives) instead of on agent_end (when the full loop including tool execution completes).
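
The difference between the two events can be illustrated with a toy model of the retried loop. Everything here is an assumption made for illustration: `RetryGate` and `outstandingAtResolve` are hypothetical, and `tool_execution_end` is an invented event name standing in for "tools finished"; only `message_end` and `agent_end` come from the description above.

```typescript
// Toy model; not pi-agent-core's API. "tool_execution_end" is an invented placeholder event.
type AgentEvent = "message_end" | "tool_execution_end" | "agent_end";

const LOOP_EVENTS: AgentEvent[] = ["message_end", "tool_execution_end", "agent_end"];

class RetryGate {
  private resolved = false;
  private readonly resolveOn: AgentEvent;

  constructor(resolveOn: AgentEvent) {
    this.resolveOn = resolveOn;
  }

  handle(event: AgentEvent): void {
    if (event === this.resolveOn) this.resolved = true;
  }

  get isResolved(): boolean {
    return this.resolved;
  }
}

// Returns the loop events that had not yet happened when the retry gate resolved.
function outstandingAtResolve(resolveOn: AgentEvent): AgentEvent[] {
  const gate = new RetryGate(resolveOn);
  const outstanding: AgentEvent[] = [];
  for (const event of LOOP_EVENTS) {
    if (gate.isResolved) outstanding.push(event);
    else gate.handle(event);
  }
  return outstanding;
}

console.log(outstandingAtResolve("message_end")); // tool execution and agent_end still outstanding
console.log(outstandingAtResolve("agent_end")); // nothing outstanding
```

In this model, resolving on `message_end` hands control back to the caller while tool execution is still outstanding, which is the window in which the flush can run too early.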

We submitted an upstream PR to fix this at badlogic/pi-mono#1465, but it was auto-closed per their first-time contributor process. Issues are disabled on that repo, so we filed the bug as a discussion instead: badlogic/pi-mono#1466. Awaiting maintainer approval to resubmit.

This OpenClaw PR serves as a defensive workaround until the upstream fix lands in a new @mariozechner/pi-coding-agent release. The waitForIdle() calls become redundant but harmless once the upstream is fixed.

Testing

- Hot-patched on a production VPS running multiple agents (Kit, Arthur, Cyrus and others) that were experiencing this bug regularly
- Phase 1 (2026-02-11): patched main finally block only; sub-agents still showed synthetic errors
- Phase 2 (2026-02-11): added debug logging and patched all 3 call sites; monitoring for recurrence
- 3 unit tests covering the flush race condition
- CI checks (running)
- Manual verification: confirm synthetic errors stop after full patch

The fix is minimal (three waitForIdle() calls) with safety timeouts, so risk of regression is low.

When pi-agent-core's auto-retry mechanism handles overloaded/rate-limit errors, it resolves waitForRetry() on assistant message receipt, before tool execution completes in the retried agent loop. This causes the attempt's finally block to call flushPendingToolResults() while tools are still executing, inserting synthetic 'missing tool result' errors and causing silent agent failures.

The fix adds a waitForIdle() call before the flush to ensure the agent's retry loop (including tool execution) has fully completed.

Evidence from real session: tool call and synthetic error were only 53ms apart, so the tool never had a chance to execute before being flushed.

Root cause is in pi-agent-core's _resolveRetry() firing on message_end instead of agent_end, but this workaround in OpenClaw prevents the symptom without requiring an upstream fix.

Fixes openclaw#8643
Fixes openclaw#13351
Refs openclaw#6682, openclaw#12595

Validates that:
- Real tool results are not replaced by synthetic errors when they arrive in time
- Flush correctly inserts synthetic errors for genuinely orphaned tool calls
- Flush is a no-op after real tool results have already been received

Refs openclaw#8643, openclaw#13748

The original fix only covered the main run finally block, but there are two additional call sites that can trigger flushPendingToolResults while tools are still executing:

1. The catch block in attempt.ts (session setup error handler)
2. The finally block in compact.ts (compaction teardown)

Both now await agent.waitForIdle() with a 30s timeout before flushing, matching the pattern already applied to the main finally block.

Production testing on VPS with debug logging confirmed these additional paths can fire during sub-agent runs, producing spurious synthetic 'missing tool result' errors.

…openclaw#13746)

* fix(agents): wait for agent idle before flushing pending tool results
* test: add tests for tool result flush race condition
* fix(agents): add waitForIdle to all flushPendingToolResults call sites
* fix(agents): centralize idle-wait flush and clear timeout handle

Co-authored-by: Renue Development <[email protected]>
Co-authored-by: Peter Steinberger <[email protected]>

openclaw-barnacle bot added the agents Agent runtime and tooling label Feb 11, 2026

rodbland2021 mentioned this pull request Feb 11, 2026

fix(agent-session): resolve retry promise on agent_end instead of message_end badlogic/pi-mono#1465

Closed

rodbland2021 and others added 2 commits February 11, 2026 11:40

Merge branch 'main' into fix/tool-result-flush-race

d602bc1

mverrilli mentioned this pull request Feb 13, 2026

fix(agent): prevent session deadlock on timeout during tool execution #15688

Closed

fix(agents): centralize idle-wait flush and clear timeout handle

c110634

steipete merged commit d3b2135 into openclaw:main Feb 13, 2026
10 checks passed

openclaw-barnacle bot added the size: S label Feb 13, 2026

github-actions bot mentioned this pull request Feb 13, 2026

📡 Upstream Digest — 2026-02-13 20:29 UTC curtismercier/openclaw-mods#17

Open

adolago mentioned this pull request Feb 14, 2026

[OpenClaw #13746] Wait for agent idle before flushing pending tool results adolago/zee#335

Closed

4 tasks

nickytonline mentioned this pull request Feb 14, 2026

nickytonline/trusted auth docs update nickytonline/openclaw#1

Closed

steipete merged 5 commits into openclaw:main from rodbland2021:fix/tool-result-flush-race
