fix(agent-session): resolve retry promise on agent_end instead of message_end #1465
Closed
rodbland2021 wants to merge 1 commit into badlogic:main
…sage_end

When auto-retry handles overloaded/rate-limit errors and the retry succeeds with tool calls (stopReason: 'toolUse'), _resolveRetry() was firing on message_end, before tool execution completed. This caused waitForRetry() to resolve early, allowing consumers (e.g., OpenClaw) to proceed with post-prompt cleanup while tools were still in flight. The result: tool results lost, synthetic errors inserted, agents going silent. Affected ~5-10% of tool calls during API errors.

Move _resolveRetry() to the agent_end handler so it only fires after the full agent loop (including tool execution) has completed.

Evidence from production: tool call and premature flush were only 53ms apart, so the tool never had a chance to execute.

Ref: openclaw/openclaw#8643, openclaw/openclaw#13351
Hi @rodbland2021, thanks for your interest in contributing! We ask new contributors to open an issue first before submitting a PR. This helps us discuss the approach and avoid wasted effort. Next steps:
This PR will be closed automatically. See https://github.com/badlogic/pi-mono/blob/main/CONTRIBUTING.md for more details.
Problem
When auto-retry handles API errors (overloaded, rate limit, etc.) and the retry succeeds with tool calls (`stopReason: 'toolUse'`), `_resolveRetry()` fires on `message_end`, before tool execution completes. This causes `waitForRetry()` to resolve early, allowing consumers to proceed with post-prompt cleanup while tools are still in flight.
Impact
This is the root cause of a widely-reported bug in OpenClaw where tool results are intermittently lost, with synthetic errors inserted:
Affects ~5-10% of tool calls during API error/retry scenarios. Multiple open issues: openclaw/openclaw#8643, openclaw/openclaw#13351. 15+ PRs attempted symptom-level fixes without identifying this root cause.
Evidence
From a production session:
- `overloaded_error` from Anthropic

53ms between tool call and flush. The exec command normally takes 1-5 seconds, so the tool never ran.
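A gap that short is what one would expect if cleanup is triggered by the assistant message rather than by the end of the agent loop. The following is a minimal sketch of that ordering, not the real `agent-session.ts` code: `simulateTurn`, `TURN_EVENTS`, and `RunResult` are hypothetical names, and the consumer's cleanup is modeled as a synchronous step instead of a promise continuation. Only the event names come from this PR.

```typescript
// Hypothetical, simplified model of one retried turn that ends in a tool call.
type AgentEvent =
  | "message_end"
  | "tool_execution_start"
  | "tool_execution_end"
  | "agent_end";

// Event order for the turn, as described in this PR.
const TURN_EVENTS: AgentEvent[] = [
  "message_end",
  "tool_execution_start",
  "tool_execution_end",
  "agent_end",
];

interface RunResult {
  toolFinishedBeforeCleanup: boolean;
  log: string[];
}

// Replays the turn and runs the consumer's cleanup as soon as the retry
// is considered resolved, i.e. on the event passed in.
function simulateTurn(resolveRetryOn: AgentEvent): RunResult {
  const log: string[] = [];
  let toolFinished = false;
  let toolFinishedBeforeCleanup = false;
  let cleanedUp = false;

  for (const event of TURN_EVENTS) {
    log.push(event);
    if (event === "tool_execution_end") toolFinished = true;
    if (event === resolveRetryOn && !cleanedUp) {
      cleanedUp = true;
      toolFinishedBeforeCleanup = toolFinished;
      log.push("cleanup");
    }
  }
  return { toolFinishedBeforeCleanup, log };
}

const buggy = simulateTurn("message_end"); // behavior before this PR
const fixed = simulateTurn("agent_end"); // behavior after this PR

console.log(buggy.log.join(" -> "));
// message_end -> cleanup -> tool_execution_start -> tool_execution_end -> agent_end
console.log(fixed.log.join(" -> "));
// message_end -> tool_execution_start -> tool_execution_end -> agent_end -> cleanup
```

In the first run, cleanup happens before the tool starts, which matches the lost tool results reported above. In the second, cleanup runs only after the tool has finished.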
The Bug
The bug is in the `message_end` handler in `agent-session.ts` (line ~364), which calls `_resolveRetry()`.
The `message_end` event fires when the assistant response arrives, but tool execution (`tool_execution_start` → `tool_execution_end`) happens afterward in the agent loop. So `waitForRetry()` resolves while tools are still pending.
Fix
Move `_resolveRetry()` to the `agent_end` handler, which fires after the full agent loop (including tool execution) completes.
What changes
- `_resolveRetry()` now fires on `agent_end` instead of `message_end`
- `waitForRetry()` only resolves after tool execution completes
- … `await waitForRetry()`, they just get the correct completion signal now
Current workaround
OpenClaw has a workaround in openclaw/openclaw#13746 that calls `agent.waitForIdle()` before cleanup. This PR is the proper upstream fix that eliminates the need for that workaround.
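For illustration, the consumer-side workaround might look roughly like the sketch below. It assumes a session object that exposes `waitForRetry()` and `agent.waitForIdle()` as named in this PR. The `SessionLike` interface, `finishPrompt`, and `makeFakeSession` are hypothetical and are not taken from OpenClaw or pi-mono.

```typescript
// Hypothetical minimal shape of the session; only the two method names
// come from this PR, the rest is assumed for the sketch.
interface SessionLike {
  waitForRetry(): Promise<void>;
  agent: { waitForIdle(): Promise<void> };
}

// Consumer-side workaround: also wait for the agent to go idle before cleanup,
// so an early waitForRetry() resolution cannot cut off in-flight tools.
async function finishPrompt(session: SessionLike, cleanup: () => void): Promise<void> {
  await session.waitForRetry();
  await session.agent.waitForIdle();
  cleanup();
}

// Fake session for the demo: waitForRetry() resolves immediately (the early
// resolution described above), while the "tool" finishes a few ticks later.
function makeFakeSession(order: string[]): SessionLike {
  const toolDone: Promise<void> = Promise.resolve()
    .then(() => undefined) // extra microtask hops stand in for tool latency
    .then(() => undefined)
    .then(() => {
      order.push("tool_execution_end");
    });
  return {
    waitForRetry: () => Promise.resolve(),
    agent: { waitForIdle: () => toolDone },
  };
}

const order: string[] = [];
const demo: Promise<string[]> = finishPrompt(makeFakeSession(order), () => {
  order.push("cleanup");
}).then(() => order);

demo.then((result) => console.log(result.join(" -> ")));
// tool_execution_end -> cleanup
```

With the upstream change in this PR, the second `await` would no longer be needed for correctness, since `waitForRetry()` itself would not resolve until the agent loop has ended.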