fix(agent-session): resolve retry promise on agent_end instead of message_end #1465

Closed
rodbland2021 wants to merge 1 commit into badlogic:main from
rodbland2021:fix/resolve-retry-on-agent-end

@rodbland2021

Problem

When auto-retry handles API errors (overloaded, rate limit, etc.) and the retry succeeds with tool calls (stopReason: 'toolUse'), _resolveRetry() fires on message_end, before tool execution completes. This causes waitForRetry() to resolve early, allowing consumers to proceed with post-prompt cleanup while tools are still in flight.

Impact

This is the root cause of a widely-reported bug in OpenClaw where tool results are intermittently lost, with synthetic errors inserted:

[openclaw] missing tool result in session history; inserted synthetic error result for transcript repair.

Affects ~5-10% of tool calls during API error/retry scenarios. Multiple open issues: openclaw/openclaw#8643, openclaw/openclaw#13351. 15+ PRs attempted symptom-level fixes without identifying this root cause.

Evidence

From a production session:

| Event | Timestamp |
| --- | --- |
| overloaded_error from Anthropic | 23:27:55.659Z |
| Retry assistant tool call | 23:28:01.726Z |
| Premature flush (synthetic error) | 23:28:01.779Z |

Only 53 ms elapsed between the tool call and the flush. The exec command normally takes 1-5 seconds, so the tool never ran.
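
The 53 ms figure can be re-derived from the timestamps above. This is only an arithmetic sketch: the helper name `toMs` and the placeholder calendar date are assumptions, since the log excerpt gives times of day only.

```typescript
// Times of day copied from the evidence above. The calendar date is a
// placeholder (assumption) because only the time of day was logged.
const toMs = (timeOfDay: string): number => Date.parse(`1970-01-01T${timeOfDay}`);

const overloadedAt = toMs("23:27:55.659Z");
const retryToolCallAt = toMs("23:28:01.726Z");
const prematureFlushAt = toMs("23:28:01.779Z");

// Delay between the API error and the successful retry response.
const retryDelayMs = retryToolCallAt - overloadedAt; // 6067
// Window the tool had before its result was declared missing.
const toolWindowMs = prematureFlushAt - retryToolCallAt; // 53

console.log({ retryDelayMs, toolWindowMs });
```

A 53 ms window is far below the 1-5 seconds the exec command usually needs, which is consistent with the flush happening before the tool ran.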

The Bug

In agent-session.ts, the message_end handler (line ~364):

// BEFORE (fires too early):
if (event.message.role === 'assistant') {
    // ...
    if (assistantMsg.stopReason !== 'error' && this._retryAttempt > 0) {
        this._resolveRetry(); // resolves BEFORE tools execute
    }
}

The message_end event fires when the assistant response arrives, but tool execution (tool_execution_start through tool_execution_end) happens afterwards in the agent loop. So waitForRetry() resolves while tools are still pending.
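
The ordering problem can be illustrated with a small replay. This is a simplified sketch, not the real event stream: it assumes a single turn with one tool call, and `replayTurn` and the `retry_resolved` marker are hypothetical names used only for illustration.

```typescript
type AgentEventType =
  | "message_end"
  | "tool_execution_start"
  | "tool_execution_end"
  | "agent_end";

// Assumed, simplified order of events for one retried turn whose
// assistant message requests a single tool call.
const turnEvents: AgentEventType[] = [
  "message_end",
  "tool_execution_start",
  "tool_execution_end",
  "agent_end",
];

// Replays the turn and records the point at which the retry promise
// would be resolved if resolution were tied to `resolveOn`.
function replayTurn(resolveOn: AgentEventType): string[] {
  const log: string[] = [];
  for (const event of turnEvents) {
    log.push(event);
    if (event === resolveOn) log.push("retry_resolved");
  }
  return log;
}

const resolvedOnMessageEnd = replayTurn("message_end");
const resolvedOnAgentEnd = replayTurn("agent_end");

console.log(resolvedOnMessageEnd.join(" -> "));
console.log(resolvedOnAgentEnd.join(" -> "));
```

In the first log the `retry_resolved` marker lands before `tool_execution_start`; in the second it lands after `tool_execution_end`.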

Fix

Move _resolveRetry() to the agent_end handler, which fires after the full agent loop (including tool execution) completes:

// AFTER (fires at the right time):
if (event.type === 'agent_end' && this._lastAssistantMessage) {
    const assistantMsg = this._lastAssistantMessage as AssistantMessage;
    if (assistantMsg.stopReason !== 'error' && this._retryAttempt > 0) {
        // Resolve AFTER tools have executed
        this._resolveRetry();
        return;
    }
    // ... existing retryable error / compaction checks
}
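
For readers who want to try the behaviour in isolation, here is a minimal, self-contained sketch of the same idea. `MiniSession`, `beginRetry`, and the `retrySettled` flag are hypothetical stand-ins rather than the real agent-session.ts API; only the retry bookkeeping is modelled.

```typescript
type StopReason = "stop" | "toolUse" | "error";

interface AssistantMessage {
  role: "assistant";
  stopReason: StopReason;
}

type SessionEvent =
  | { type: "message_end"; message: AssistantMessage }
  | { type: "agent_end" };

// Hypothetical stand-in for the session (assumption, not the real class).
class MiniSession {
  retryAttempt = 0;
  retrySettled = false; // synchronous flag so the effect is easy to observe
  private lastAssistantMessage?: AssistantMessage;
  private resolveRetry?: () => void;
  private retryPromise?: Promise<void>;

  // Assumed entry point called when an auto-retry begins.
  beginRetry(): void {
    this.retryAttempt += 1;
    this.retrySettled = false;
    this.retryPromise = new Promise<void>((resolve) => {
      this.resolveRetry = resolve;
    });
  }

  waitForRetry(): Promise<void> {
    return this.retryPromise ?? Promise.resolve();
  }

  handleEvent(event: SessionEvent): void {
    if (event.type === "message_end") {
      // Only remember the message; do not resolve the retry here.
      this.lastAssistantMessage = event.message;
      return;
    }
    // agent_end: the loop, including tool execution, has finished.
    const msg = this.lastAssistantMessage;
    this.lastAssistantMessage = undefined;
    if (msg && msg.stopReason !== "error" && this.retryAttempt > 0) {
      this.retrySettled = true;
      this.resolveRetry?.();
    }
  }
}

const session = new MiniSession();
session.beginRetry();
session.handleEvent({
  type: "message_end",
  message: { role: "assistant", stopReason: "toolUse" },
});
const settledAfterMessageEnd = session.retrySettled;
session.handleEvent({ type: "agent_end" });
const settledAfterAgentEnd = session.retrySettled;

console.log({ settledAfterMessageEnd, settledAfterAgentEnd });
```

In this sketch the retry stays pending after message_end and settles only once agent_end arrives, which is the ordering the change is meant to guarantee.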

What changes

  • _resolveRetry() now fires on agent_end instead of message_end
  • waitForRetry() only resolves after tool execution completes
  • No change to retry logic, backoff timing, or error handling
  • Backward compatible — consumers already await waitForRetry(), they just get the correct completion signal now

Current workaround

OpenClaw has a workaround in openclaw/openclaw#13746 that calls agent.waitForIdle() before cleanup. This PR is the proper upstream fix that eliminates the need for that workaround.

…sage_end

When auto-retry handles overloaded/rate-limit errors and the retry
succeeds with tool calls (stopReason: 'toolUse'), _resolveRetry() was
firing on message_end — before tool execution completed. This caused
waitForRetry() to resolve early, allowing consumers (e.g., OpenClaw)
to proceed with post-prompt cleanup while tools were still in flight.

The result: tool results lost, synthetic errors inserted, agents going
silent. Affected ~5-10% of tool calls during API errors.

Move _resolveRetry() to the agent_end handler so it only fires after
the full agent loop (including tool execution) has completed.

Evidence from production: tool call and premature flush were only 53ms
apart — the tool never had a chance to execute.

Ref: openclaw/openclaw#8643, openclaw/openclaw#13351
@github-actions
Contributor

Hi @rodbland2021, thanks for your interest in contributing!

We ask new contributors to open an issue first before submitting a PR. This helps us discuss the approach and avoid wasted effort.

Next steps:

  1. Open an issue describing what you want to change and why (keep it concise, write in your human voice, AI slop will be closed)
  2. Once a maintainer approves with lgtm, you'll be added to the approved contributors list
  3. Then you can submit your PR

This PR will be closed automatically. See https://github.com/badlogic/pi-mono/blob/main/CONTRIBUTING.md for more details.
