fix(agent-session): resolve retry promise on agent_end instead of message_end by rodbland2021 · Pull Request #1465 · badlogic/pi-mono

rodbland2021 · 2026-02-11T00:18:13Z

Problem

When auto-retry handles API errors (overloaded, rate limit, etc.) and the retry succeeds with tool calls (stopReason: 'toolUse'), _resolveRetry() fires on message_end — before tool execution completes. This causes waitForRetry() to resolve early, allowing consumers to proceed with post-prompt cleanup while tools are still in flight.

Impact

This is the root cause of a widely-reported bug in OpenClaw where tool results are intermittently lost, with synthetic errors inserted:

[openclaw] missing tool result in session history; inserted synthetic error result for transcript repair.

Affects ~5-10% of tool calls during API error/retry scenarios. Multiple open issues: openclaw/openclaw#8643, openclaw/openclaw#13351. 15+ PRs attempted symptom-level fixes without identifying this root cause.

Evidence

From a production session:

Event	Timestamp
`overloaded_error` from Anthropic	23:27:55.659Z
Retry assistant tool call	23:28:01.726Z
Premature flush (synthetic error)	23:28:01.779Z

53ms between tool call and flush. The exec command normally takes 1-5 seconds — the tool never ran.
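The gaps in the table can be checked with a few lines of arithmetic. This is only a sketch: the helper name `msBetween` is made up for illustration, and the date is a placeholder because the table gives times of day only.

```typescript
// Hypothetical helper: milliseconds between two UTC times of day on the same date.
// The date prefix is a placeholder; the evidence table lists only times.
function msBetween(start: string, end: string): number {
  const day = "1970-01-01T";
  return Date.parse(day + end) - Date.parse(day + start);
}

// Time from the overloaded_error to the retried assistant tool call.
const errorToRetryMs = msBetween("23:27:55.659Z", "23:28:01.726Z");

// Time from the retried tool call to the premature flush.
const toolCallToFlushMs = msBetween("23:28:01.726Z", "23:28:01.779Z");

console.log(errorToRetryMs);    // 6067
console.log(toolCallToFlushMs); // 53

// The exec command is described as taking 1-5 seconds, so a 53 ms gap
// is far too short for the tool to have run.
console.log(toolCallToFlushMs < 1000); // true
```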

The Bug

In agent-session.ts, the message_end handler (line ~364):

// BEFORE (fires too early):
if (event.message.role === 'assistant') {
    // ...
    if (assistantMsg.stopReason !== 'error' && this._retryAttempt > 0) {
        this._resolveRetry(); // resolves BEFORE tools execute
    }
}

The message_end event fires when the assistant response arrives, but tool execution (tool_execution_start → tool_execution_end) happens AFTER in the agent loop. So waitForRetry() resolves while tools are still pending.
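To make the ordering concrete, here is a minimal sketch that replays the event sequence described above and reports whether a tool call is still outstanding at the moment the retry promise would resolve. The event shapes and the function `toolPendingAtResolve` are simplified assumptions for illustration, not the actual types in agent-session.ts, and the sketch models a single tool call only.

```typescript
// Simplified, hypothetical event shapes; the real agent events carry more fields.
type AgentEvent =
  | { type: "message_end"; stopReason: "toolUse" | "stop" | "error" }
  | { type: "tool_execution_start" }
  | { type: "tool_execution_end" }
  | { type: "agent_end" };

// Event order for a retried turn whose assistant message requests one tool call.
const retriedTurn: AgentEvent[] = [
  { type: "message_end", stopReason: "toolUse" },
  { type: "tool_execution_start" },
  { type: "tool_execution_end" },
  { type: "agent_end" },
];

// Returns true if a tool call is still outstanding when the retry promise
// would be resolved on the given event type.
function toolPendingAtResolve(
  events: AgentEvent[],
  resolveOn: "message_end" | "agent_end",
): boolean {
  let toolPending = false;
  for (const event of events) {
    if (event.type === "message_end" && event.stopReason === "toolUse") {
      toolPending = true; // the assistant asked for a tool; it has not run yet
    } else if (event.type === "tool_execution_end") {
      toolPending = false; // the tool has produced its result
    }
    if (event.type === resolveOn) {
      return toolPending;
    }
  }
  return toolPending;
}

console.log(toolPendingAtResolve(retriedTurn, "message_end")); // true
console.log(toolPendingAtResolve(retriedTurn, "agent_end"));   // false
```

Resolving on message_end leaves the tool pending, which matches the premature flush seen in production. Resolving on agent_end does not.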

Fix

Move _resolveRetry() to the agent_end handler, which fires after the full agent loop (including tool execution) completes:

// AFTER (fires at the right time):
if (event.type === 'agent_end' && this._lastAssistantMessage) {
    const assistantMsg = this._lastAssistantMessage as AssistantMessage;
    if (assistantMsg.stopReason !== 'error' && this._retryAttempt > 0) {
        // Resolve AFTER tools have executed
        this._resolveRetry();
        return;
    }
    // ... existing retryable error / compaction checks
}

What changes

_resolveRetry() now fires on agent_end instead of message_end
waitForRetry() only resolves after tool execution completes
No change to retry logic, backoff timing, or error handling
Backward compatible — consumers already await waitForRetry(), they just get the correct completion signal now

Current workaround

OpenClaw has a workaround in openclaw/openclaw#13746 that calls agent.waitForIdle() before cleanup. This PR is the proper upstream fix that eliminates the need for that workaround.
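For illustration, the consumer-side workaround pattern might look roughly like the sketch below. Only `waitForIdle()` is named in this PR. The `IdleAwaitable` interface, `cleanupAfterIdle`, the fake agent, and the flush callback are hypothetical stand-ins, and none of this is OpenClaw's actual code.

```typescript
// Hypothetical minimal shape; only waitForIdle() is taken from the PR text.
interface IdleAwaitable {
  waitForIdle(): Promise<void>;
}

// Wait until the agent is idle (tools finished) before running cleanup.
async function cleanupAfterIdle(
  agent: IdleAwaitable,
  flushPendingToolResults: () => void,
): Promise<void> {
  await agent.waitForIdle();
  flushPendingToolResults();
}

// Fake agent for the demo: becomes idle once its simulated tool finishes.
function createFakeAgent(log: string[], toolMs: number): IdleAwaitable {
  const idle = new Promise<void>((resolve) => {
    setTimeout(() => {
      log.push("tool_execution_end");
      resolve();
    }, toolMs);
  });
  return { waitForIdle: () => idle };
}

async function demo(): Promise<string[]> {
  const log: string[] = [];
  const agent = createFakeAgent(log, 10);
  await cleanupAfterIdle(agent, () => {
    log.push("flush");
  });
  return log;
}

demo().then((log) => console.log(log.join(" -> ")));
// tool_execution_end -> flush
```

With the upstream fix in place, awaiting waitForRetry() alone should give the same ordering, so the extra wait becomes unnecessary.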

…sage_end

When auto-retry handles overloaded/rate-limit errors and the retry succeeds with tool calls (stopReason: 'toolUse'), _resolveRetry() was firing on message_end, before tool execution completed. This caused waitForRetry() to resolve early, allowing consumers (e.g., OpenClaw) to proceed with post-prompt cleanup while tools were still in flight.

The result: tool results lost, synthetic errors inserted, agents going silent. Affected ~5-10% of tool calls during API errors.

Move _resolveRetry() to the agent_end handler so it only fires after the full agent loop (including tool execution) has completed.

Evidence from production: tool call and premature flush were only 53ms apart, so the tool never had a chance to execute.

Ref: openclaw/openclaw#8643, openclaw/openclaw#13351

github-actions · 2026-02-11T00:18:22Z

Hi @rodbland2021, thanks for your interest in contributing!

We ask new contributors to open an issue first before submitting a PR. This helps us discuss the approach and avoid wasted effort.

Next steps:

Open an issue describing what you want to change and why (keep it concise, write in your human voice, AI slop will be closed)
Once a maintainer approves with lgtm, you'll be added to the approved contributors list
Then you can submit your PR

This PR will be closed automatically. See https://github.com/badlogic/pi-mono/blob/main/CONTRIBUTING.md for more details.

github-actions bot closed this Feb 11, 2026

rodbland2021 mentioned this pull request Feb 11, 2026

fix(agents): wait for agent idle before flushing pending tool results openclaw/openclaw#13746

Merged

6 tasks

rodbland2021 wants to merge 1 commit into badlogic:main from rodbland2021:fix/resolve-retry-on-agent-end
