fix(agents): wait for agent idle before flushing pending tool results #13746
steipete merged 5 commits into openclaw:main on Feb 13, 2026
When pi-agent-core's auto-retry mechanism handles overloaded/rate-limit errors, it resolves waitForRetry() on assistant message receipt, before tool execution completes in the retried agent loop. This causes the attempt's finally block to call flushPendingToolResults() while tools are still executing, inserting synthetic 'missing tool result' errors and causing silent agent failures.

The fix adds a waitForIdle() call before the flush to ensure the agent's retry loop (including tool execution) has fully completed.

Evidence from a real session: the tool call and the synthetic error were only 53ms apart, so the tool never had a chance to execute before being flushed.

The root cause is in pi-agent-core's _resolveRetry() firing on message_end instead of agent_end, but this workaround in OpenClaw prevents the symptom without requiring an upstream fix.

Fixes openclaw#8643
Fixes openclaw#13351
Refs openclaw#6682, openclaw#12595
Validates that:
- Real tool results are not replaced by synthetic errors when they arrive in time
- Flush correctly inserts synthetic errors for genuinely orphaned tool calls
- Flush is a no-op after real tool results have already been received

Refs openclaw#8643, openclaw#13748
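The three behaviors listed above can be illustrated with a small, self-contained sketch. This is not OpenClaw's actual bookkeeping: the `PendingToolResults` class, its method names, and the `ToolResult` shape are all hypothetical and exist only to show what "flush" is expected to do in each case.

```typescript
// Hypothetical bookkeeping for tool calls awaiting results; not OpenClaw's real code.
type ToolResult = { toolCallId: string; content: string; synthetic: boolean };

class PendingToolResults {
  private pending: string[] = [];
  readonly transcript: ToolResult[] = [];

  // Record a tool call the assistant has issued.
  registerCall(toolCallId: string): void {
    this.pending.push(toolCallId);
  }

  // Record a real result. Returns false if the call is not pending
  // (for example, it was already flushed as orphaned).
  recordResult(toolCallId: string, content: string): boolean {
    const index = this.pending.indexOf(toolCallId);
    if (index === -1) return false;
    this.pending.splice(index, 1);
    this.transcript.push({ toolCallId, content, synthetic: false });
    return true;
  }

  // Insert a synthetic error for every call still pending; returns how many were inserted.
  flush(): number {
    const orphaned = this.pending;
    this.pending = [];
    for (const toolCallId of orphaned) {
      this.transcript.push({ toolCallId, content: "missing tool result", synthetic: true });
    }
    return orphaned.length;
  }
}

const tracker = new PendingToolResults();
tracker.registerCall("call-1");
tracker.registerCall("call-2");
tracker.recordResult("call-1", "exec output");
console.log(tracker.flush()); // 1 (only call-2 was orphaned)
console.log(tracker.flush()); // 0 (nothing left to flush)
```

In this sketch a real result that arrives after the flush is rejected, which is why flushing too early is harmful: the synthetic error has already taken the result's place.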
The original fix only covered the main run finally block, but there are two additional call sites that can trigger flushPendingToolResults while tools are still executing:

1. The catch block in attempt.ts (session setup error handler)
2. The finally block in compact.ts (compaction teardown)

Both now await agent.waitForIdle() with a 30s timeout before flushing, matching the pattern already applied to the main finally block.

Production testing on a VPS with debug logging confirmed that these additional paths can fire during sub-agent runs, producing spurious synthetic 'missing tool result' errors.
zhangyang-crazy-one pushed a commit to zhangyang-crazy-one/openclaw that referenced this pull request on Feb 13, 2026. The commit message (…openclaw#13746) repeats the three commit messages above and adds a fourth:

* fix(agents): centralize idle-wait flush and clear timeout handle

Co-authored-by: Renue Development <[email protected]>
Co-authored-by: Peter Steinberger <[email protected]>
Problem
Tool results are intermittently lost during normal agent operation, and the system inserts synthetic 'missing tool result' errors in their place.
This causes agents to go silent and stop responding. It affects all versions, including 2026.2.9.
Fixes #8643, #13351. Refs #6682, #12595.
Root Cause
The bug is a race condition between pi-agent-core's auto-retry mechanism and the attempt lifecycle in `run/attempt.ts`.
When an API error occurs (e.g., `overloaded_error`, rate limit), pi-agent-core's `_handleRetryableError()` retries the LLM call via `agent.continue()`. When the retry succeeds with a tool call:
1. `_resolveRetry()` fires on `message_end` (assistant message received), before tool execution completes
2. `waitForRetry()` resolves in `agent-session.prompt()`
3. `prompt()` returns to `attempt.ts`
4. The `finally` block calls `flushPendingToolResults()`
Evidence
From a real session (Arthur agent, Discord channel):
- `overloaded_error`, empty content, `stopReason: "error"`
- `toolu_015kn1n1vixFyxMSyHCTWfPt` (exec), `stopReason: "toolUse"`
The exec command normally takes 1-5 seconds. The 53 ms gap between the tool call and the synthetic error shows that the tool never executed before being flushed.
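The race described under Root Cause can be reproduced in miniature. The sketch below is a toy model, not pi-agent-core's API: `simulateAttempt`, the 5 ms and 50 ms delays, and the result strings are all invented for illustration. It only shows that a caller released at "message end" flushes before the tool has produced a result, unless it first waits for the whole loop to finish.

```typescript
// Toy model of the race; names and timings are illustrative, not pi-agent-core's API.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function simulateAttempt(waitForIdleFirst: boolean): Promise<string> {
  const realResults: string[] = [];

  // Resolved when the assistant message arrives (the point where the retry promise resolves today).
  let signalMessageEnd: () => void = () => {};
  const messageEnd = new Promise<void>((resolve) => {
    signalMessageEnd = resolve;
  });

  // The retried agent loop: assistant message first, then tool execution.
  const agentIdle = (async () => {
    await sleep(5); // assistant message with a tool call arrives
    signalMessageEnd(); // caller is released here, before the tool has run
    await sleep(50); // the tool is still executing
    realResults.push("exec output");
  })();

  await messageEnd; // prompt() returns to the caller
  if (waitForIdleFirst) await agentIdle; // the workaround: wait for the full loop

  // "Flush": use the real result if present, otherwise insert a synthetic error.
  const flushed = realResults.length > 0 ? realResults[0] : "synthetic: missing tool result";

  await agentIdle; // let the toy loop finish so no timers are left pending
  return flushed;
}

simulateAttempt(false).then((r) => console.log(r)); // synthetic: missing tool result
simulateAttempt(true).then((r) => console.log(r)); // exec output
```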
Why existing PRs don't fix this
None of the existing PRs addresses the timing gap where `waitForRetry()` resolves before tool execution completes.
Fix
Add `agent.waitForIdle()` before every `flushPendingToolResults()` call site. There are three locations:
1. Main attempt finally block (`run/attempt.ts`)
The primary path. It runs after every agent turn completes.
2. Session setup catch block (`run/attempt.ts`)
Error handler during session initialization. Can fire if session loading throws while an agent retry has tool calls in flight.
3. Compaction finally block (`compact.ts`)
Teardown after context compaction. Can flush while a concurrent retry's tools are still executing.
All three now await `agent.waitForIdle()` with a 30-second safety timeout before flushing.
Why all three are needed
Initial production testing (2026-02-11) showed that patching only the main finally block was insufficient — sub-agent sessions continued producing synthetic errors. Adding debug logging to the flush function revealed additional call sites being hit during sub-agent runs.
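The idle-wait-then-flush pattern applied at all three call sites could be centralized in one helper, roughly as sketched below. This is an assumption-laden sketch rather than the merged code: only `waitForIdle()`, `flushPendingToolResults()`, and the 30-second timeout come from this PR. The `IdleAgent` interface, the `flushAfterIdle` name, its parameters, and its return value are hypothetical.

```typescript
// Hypothetical shape: the PR names waitForIdle(), but this interface is an assumption.
interface IdleAgent {
  waitForIdle(): Promise<void>;
}

const IDLE_TIMEOUT_MS = 30_000; // the 30-second safety timeout mentioned in the PR

// Waits for the agent to go idle (or for the timeout), then flushes.
// The flush runs in `finally`, so it still happens if waitForIdle() rejects.
async function flushAfterIdle(
  agent: IdleAgent,
  flushPendingToolResults: () => void,
  timeoutMs: number = IDLE_TIMEOUT_MS,
): Promise<"idle" | "timeout"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  try {
    const idle = agent.waitForIdle().then(() => "idle" as const);
    return await Promise.race([idle, timeout]);
  } finally {
    // Clear the timer so a fast idle does not leave a 30 s handle keeping the process alive.
    if (timer !== undefined) clearTimeout(timer);
    flushPendingToolResults();
  }
}
```

Racing against a timeout keeps a stuck agent from blocking teardown forever, and clearing the handle matters in short-lived processes where a dangling timer would delay exit.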
Upstream Root Cause
The deeper root cause is in `@mariozechner/pi-agent-core`'s `agent-session.ts`: `_resolveRetry()` fires in the `message_end` event handler (when the assistant message arrives) instead of on `agent_end` (when the full loop, including tool execution, completes).
We submitted an upstream PR to fix this at badlogic/pi-mono#1465, but it was auto-closed per their first-time contributor process. Issues are disabled on that repo, so we filed the bug as a discussion instead: badlogic/pi-mono#1466. Awaiting maintainer approval to resubmit.
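The difference between resolving on `message_end` and on `agent_end` comes down to which events have already happened when the caller is released. The toy model below makes that ordering explicit. The `message_end` and `agent_end` names come from the text above. The `tool_execution_end` event and the `eventsSeenAtResolve` function are hypothetical and do not claim to match pi-agent-core's event set.

```typescript
// Toy event model; "tool_execution_end" is a made-up name for illustration.
type AgentEvent = "message_end" | "tool_execution_end" | "agent_end";

// Returns the events that had already fired when the retry was considered resolved.
function eventsSeenAtResolve(resolveOn: AgentEvent): AgentEvent[] {
  const loop: AgentEvent[] = ["message_end", "tool_execution_end", "agent_end"];
  const seen: AgentEvent[] = [];
  for (const event of loop) {
    seen.push(event);
    if (event === resolveOn) break;
  }
  return seen;
}

const currentBehavior = eventsSeenAtResolve("message_end");
const proposedBehavior = eventsSeenAtResolve("agent_end");
console.log(currentBehavior.indexOf("tool_execution_end") !== -1); // false
console.log(proposedBehavior.indexOf("tool_execution_end") !== -1); // true
```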
This OpenClaw PR serves as a defensive workaround until the upstream fix lands in a new `@mariozechner/pi-coding-agent` release. The `waitForIdle()` calls become redundant but harmless once the upstream is fixed.
Testing
The fix is minimal (three `waitForIdle()` calls) with safety timeouts, so the risk of regression is low.