fix(agents): wait for agent idle before flushing pending tool results #13746

Merged
steipete merged 5 commits into openclaw:main from rodbland2021:fix/tool-result-flush-race
Feb 13, 2026

rodbland2021 commented Feb 11, 2026

Problem

Tool results are intermittently lost during normal agent operation, with the system inserting synthetic errors:

[openclaw] missing tool result in session history; inserted synthetic error result for transcript repair.

This causes agents to go silent and stop responding. Affects all versions including 2026.2.9.

Fixes #8643, #13351. Refs #6682, #12595.

Root Cause

The bug is a race condition between pi-agent-core's auto-retry mechanism and the attempt lifecycle in run/attempt.ts.

When an API error occurs (e.g., overloaded_error, rate limit), pi-agent-core's _handleRetryableError() retries the LLM call via agent.continue(). When the retry succeeds with a tool call:

  1. _resolveRetry() fires on message_end (assistant message received) — before tool execution completes
  2. waitForRetry() resolves in agent-session.prompt()
  3. prompt() returns to attempt.ts
  4. The attempt's finally block calls flushPendingToolResults()
  5. The tool call was registered in the guard's pending map but never executed — synthetic error inserted
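
The sequence above can be reproduced in miniature. The sketch below is a simplified model, not the pi-agent-core or OpenClaw code: PendingToolGuard, runRetriedTurn, attempt, and the 20ms tool duration are hypothetical stand-ins chosen only to show why flushing right after the retry promise resolves loses the real result.

```typescript
// Simplified model of the race; every name here is a hypothetical stand-in.
type ToolResult = { toolCallId: string; content: string; synthetic: boolean };

class PendingToolGuard {
  private pending = new Set<string>();
  readonly transcript: ToolResult[] = [];

  registerCall(toolCallId: string): void {
    this.pending.add(toolCallId);
  }

  recordResult(toolCallId: string, content: string): void {
    // A result for a call that was already flushed is dropped.
    if (!this.pending.delete(toolCallId)) return;
    this.transcript.push({ toolCallId, content, synthetic: false });
  }

  // Every call still pending gets a synthetic error result.
  flushPendingToolResults(): void {
    this.pending.forEach((toolCallId) => {
      this.transcript.push({ toolCallId, content: "missing tool result", synthetic: true });
    });
    this.pending.clear();
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Models a retried turn: `retryResolved` settles when the assistant message
// arrives (message_end); `idle` settles only after the tool has run.
function runRetriedTurn(guard: PendingToolGuard, toolMs: number) {
  guard.registerCall("call-1");
  const retryResolved = Promise.resolve();
  const idle = sleep(toolMs).then(() => guard.recordResult("call-1", "ok"));
  return { retryResolved, idle };
}

async function attempt(waitForIdle: boolean): Promise<ToolResult[]> {
  const guard = new PendingToolGuard();
  const turn = runRetriedTurn(guard, 20);
  try {
    await turn.retryResolved; // prompt() returns at this point
    if (waitForIdle) await turn.idle; // the fix: wait for tool execution
  } finally {
    guard.flushPendingToolResults();
  }
  await turn.idle; // let the tool finish so the late result is observable
  return guard.transcript;
}

attempt(false).then((t) => console.log("without wait, synthetic:", t[0].synthetic)); // true
attempt(true).then((t) => console.log("with wait, synthetic:", t[0].synthetic)); // false
```

In the first run the real result arrives after the flush and is discarded, which matches step 5; in the second run the flush finds nothing pending.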

Evidence

From a real session (Arthur agent, Discord channel):

Line  Timestamp      Event
407   23:27:55.659Z  Assistant: overloaded_error, empty content, stopReason: "error"
408   23:28:01.726Z  Retry assistant: tool call toolu_015kn1n1vixFyxMSyHCTWfPt (exec), stopReason: "toolUse"
409   23:28:01.779Z  Synthetic error inserted, only 53ms after the tool call

The exec command normally takes 1-5 seconds, so a gap of only 53ms means the tool cannot have finished executing before its pending result was flushed.
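
For reference, the intervals can be recomputed from the three timestamps quoted in the transcript rows above. This is only arithmetic on those values; the 1970-01-01 date is a placeholder, since the rows record just the time of day.

```typescript
// Timestamps copied from transcript lines 407-409; the date is a placeholder.
const toMs = (time: string): number => Date.parse(`1970-01-01T${time}`);

const errorAt = toMs("23:27:55.659Z"); // overloaded_error
const toolCallAt = toMs("23:28:01.726Z"); // retried assistant message with tool call
const syntheticErrorAt = toMs("23:28:01.779Z"); // synthetic error inserted

console.log(toolCallAt - errorAt); // 6067 (ms between the error and the retried message)
console.log(syntheticErrorAt - toolCallAt); // 53 (ms between the tool call and the flush)
```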

Why existing PRs don't fix this

None address the timing gap where waitForRetry() resolves before tool execution completes.

Fix

Add agent.waitForIdle() before every flushPendingToolResults() call site. There are three locations:

1. Main attempt finally block (run/attempt.ts)

The primary path — runs after every agent turn completes.

2. Session setup catch block (run/attempt.ts)

Error handler during session initialization. Can fire if session loading throws while an agent retry has tool calls in flight.

3. Compaction finally block (compact.ts)

Teardown after context compaction. Can flush while a concurrent retry's tools are still executing.

All three now await agent.waitForIdle() with a 30-second safety timeout before flushing:

if (session?.agent?.waitForIdle) {
  try {
    await Promise.race([
      session.agent.waitForIdle(),
      new Promise<void>((resolve) => setTimeout(resolve, 30_000)),
    ]);
  } catch { /* best-effort */ }
}
sessionManager?.flushPendingToolResults?.();
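
One detail of the snippet above: when waitForIdle() wins the race, the 30-second timer stays scheduled until it fires. The last commit in this PR is titled "centralize idle-wait flush and clear timeout handle", which suggests the merged code addresses this. The following is a minimal sketch of what such a shared helper could look like, not the merged implementation: flushAfterIdle, IdleAgent, and ToolResultFlusher are hypothetical names, and the interfaces only assume the two optional methods used in the snippet.

```typescript
// Hypothetical shapes; the real OpenClaw and pi-agent-core types may differ.
interface IdleAgent {
  waitForIdle?: () => Promise<void>;
}
interface ToolResultFlusher {
  flushPendingToolResults?: () => void;
}

const DEFAULT_IDLE_TIMEOUT_MS = 30_000;

// Waits for the agent to go idle (bounded by timeoutMs), then flushes.
// Resolves to true only if waitForIdle() settled before the timeout.
async function flushAfterIdle(
  agent: IdleAgent | undefined,
  flusher: ToolResultFlusher | undefined,
  timeoutMs: number = DEFAULT_IDLE_TIMEOUT_MS,
): Promise<boolean> {
  let becameIdle = false;
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    if (agent?.waitForIdle) {
      const idle = agent.waitForIdle().then(() => {
        becameIdle = true;
      });
      const timeout = new Promise<void>((resolve) => {
        timer = setTimeout(resolve, timeoutMs);
      });
      await Promise.race([idle, timeout]);
    }
  } catch {
    // Best-effort: a failing waitForIdle() must not block the flush.
  } finally {
    // Clear the safety timer so it does not linger after the race settles.
    if (timer !== undefined) clearTimeout(timer);
    flusher?.flushPendingToolResults?.();
  }
  return becameIdle;
}
```

With a helper of this shape, each call site would reduce to a single line such as `await flushAfterIdle(session?.agent, sessionManager)`, and the timeout and cleanup logic would live in one place.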

Why all three are needed

Initial production testing (2026-02-11) showed that patching only the main finally block was insufficient — sub-agent sessions continued producing synthetic errors. Adding debug logging to the flush function revealed additional call sites being hit during sub-agent runs.
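
The debug logging itself is not shown in this PR. As an illustration of the technique only, a flush function can be wrapped so that every call records its calling stack frame; withCallSiteLogging is a hypothetical helper, and the stack-line indexing assumes the V8 stack format used by Node.js.

```typescript
type Flush = () => void;

// Wraps a flush function so that each call logs where it was invoked from.
// Hypothetical helper for illustration; not part of OpenClaw.
function withCallSiteLogging(flush: Flush, log: (line: string) => void): Flush {
  return () => {
    // V8 stack format: line 0 is "Error", line 1 is this wrapper,
    // line 2 is the frame that called the wrapped flush.
    const stack = new Error().stack ?? "";
    const caller = stack.split("\n")[2]?.trim() ?? "<unknown caller>";
    log(`flushPendingToolResults called from ${caller}`);
    flush();
  };
}

// Usage: wrap once, then call from any teardown path.
const loggedFlush = withCallSiteLogging(() => {}, (line) => console.log(line));

function exampleTeardown(): void {
  loggedFlush(); // logs one line describing the calling frame
}
exampleTeardown();
```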

Upstream Root Cause

The deeper root cause is in @mariozechner/pi-agent-core's agent-session.ts: _resolveRetry() fires on the message_end event handler (when assistant message arrives) instead of on agent_end (when the full loop including tool execution completes).
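
The difference between the two resolution points can be shown with a toy event sequence. This is a model of the ordering only, not the pi-agent-core code: TinyEmitter and eventsSeenWhenResolved are hypothetical, and "tool_done" is a made-up placeholder for whatever happens between message_end and agent_end.

```typescript
// Toy model; "tool_done" is a placeholder event, not a real pi-agent-core name.
type AgentEvent = "message_end" | "tool_done" | "agent_end";

class TinyEmitter {
  private listeners = new Map<AgentEvent, Array<() => void>>();

  once(event: AgentEvent, listener: () => void): void {
    const list = this.listeners.get(event) ?? [];
    list.push(listener);
    this.listeners.set(event, list);
  }

  emit(event: AgentEvent): void {
    const list = this.listeners.get(event) ?? [];
    this.listeners.delete(event); // "once" semantics
    list.forEach((listener) => listener());
  }
}

// Resolves with a snapshot of the events that had fired by the time the
// promise tied to `resolveOn` was resolved.
function eventsSeenWhenResolved(resolveOn: AgentEvent): Promise<AgentEvent[]> {
  const emitter = new TinyEmitter();
  const seen: AgentEvent[] = [];
  const resolved = new Promise<AgentEvent[]>((resolve) => {
    emitter.once(resolveOn, () => resolve(seen.slice()));
  });
  const turn: AgentEvent[] = ["message_end", "tool_done", "agent_end"];
  turn.forEach((event) => {
    seen.push(event);
    emitter.emit(event);
  });
  return resolved;
}

eventsSeenWhenResolved("message_end").then((seen) => console.log(seen.join(" > ")));
// message_end
eventsSeenWhenResolved("agent_end").then((seen) => console.log(seen.join(" > ")));
// message_end > tool_done > agent_end
```

Resolving on message_end hands control back before the tool step has happened; resolving on agent_end hands it back only after the whole loop has run.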

We submitted an upstream PR to fix this at badlogic/pi-mono#1465, but it was auto-closed per their first-time contributor process. Issues are disabled on that repo, so we filed the bug as a discussion instead: badlogic/pi-mono#1466. Awaiting maintainer approval to resubmit.

This OpenClaw PR serves as a defensive workaround until the upstream fix lands in a new @mariozechner/pi-coding-agent release. The waitForIdle() calls become redundant but harmless once the upstream is fixed.

Testing

  • Hot-patched on a production VPS running multiple agents (Kit, Arthur, Cyrus and others) that were experiencing this bug regularly
  • Phase 1 (2026-02-11): Patched main finally block only — sub-agents still showed synthetic errors
  • Phase 2 (2026-02-11): Added debug logging + patched all 3 call sites — monitoring for recurrence
  • 3 unit tests covering the flush race condition
  • CI checks (running)
  • Manual verification: confirm synthetic errors stop after full patch

The fix is minimal (three waitForIdle() calls) with safety timeouts, so risk of regression is low.

fix(agents): wait for agent idle before flushing pending tool results

When pi-agent-core's auto-retry mechanism handles overloaded/rate-limit
errors, it resolves waitForRetry() on assistant message receipt — before
tool execution completes in the retried agent loop. This causes the
attempt's finally block to call flushPendingToolResults() while tools
are still executing, inserting synthetic 'missing tool result' errors
and causing silent agent failures.

The fix adds a waitForIdle() call before the flush to ensure the agent's
retry loop (including tool execution) has fully completed.

Evidence from real session: tool call and synthetic error were only 53ms
apart — the tool never had a chance to execute before being flushed.

Root cause is in pi-agent-core's _resolveRetry() firing on message_end
instead of agent_end, but this workaround in OpenClaw prevents the
symptom without requiring an upstream fix.

Fixes openclaw#8643
Fixes openclaw#13351
Refs openclaw#6682, openclaw#12595

test: add tests for tool result flush race condition

Validates that:
- Real tool results are not replaced by synthetic errors when they arrive in time
- Flush correctly inserts synthetic errors for genuinely orphaned tool calls
- Flush is a no-op after real tool results have already been received

Refs openclaw#8643, openclaw#13748
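
The unit tests themselves are not reproduced on this page. The three behaviours listed in this commit message can be illustrated against a minimal stand-in guard; createGuard and its method names are hypothetical and model only those behaviours, not the real session guard.

```typescript
// Hypothetical stand-in guard; only the listed behaviours are modelled.
type Entry = { toolCallId: string; isError: boolean; text: string };

function createGuard() {
  const pending = new Set<string>();
  const history: Entry[] = [];
  return {
    history,
    onToolCall(id: string): void {
      pending.add(id);
    },
    onToolResult(id: string, text: string): void {
      if (pending.delete(id)) history.push({ toolCallId: id, isError: false, text });
    },
    // Inserts a synthetic error for each pending call; returns how many.
    flush(): number {
      let inserted = 0;
      pending.forEach((id) => {
        history.push({ toolCallId: id, isError: true, text: "missing tool result" });
        inserted += 1;
      });
      pending.clear();
      return inserted;
    },
  };
}

// A real result that arrived in time is kept, and the flush is a no-op.
const onTime = createGuard();
onTime.onToolCall("a");
onTime.onToolResult("a", "done");
const insertedOnTime = onTime.flush();
console.log(insertedOnTime, onTime.history.length, onTime.history[0].isError); // 0 1 false

// A call that never received a result is orphaned and gets a synthetic error.
const orphan = createGuard();
orphan.onToolCall("b");
const insertedOrphan = orphan.flush();
console.log(insertedOrphan, orphan.history[0].isError); // 1 true
```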

rodbland2021 and others added 2 commits February 11, 2026 11:40

fix(agents): add waitForIdle to all flushPendingToolResults call sites

The original fix only covered the main run finally block, but there are
two additional call sites that can trigger flushPendingToolResults while
tools are still executing:

1. The catch block in attempt.ts (session setup error handler)
2. The finally block in compact.ts (compaction teardown)

Both now await agent.waitForIdle() with a 30s timeout before flushing,
matching the pattern already applied to the main finally block.

Production testing on VPS with debug logging confirmed these additional
paths can fire during sub-agent runs, producing spurious synthetic
'missing tool result' errors.

steipete merged commit d3b2135 into openclaw:main Feb 13, 2026
10 checks passed
zhangyang-crazy-one pushed a commit to zhangyang-crazy-one/openclaw that referenced this pull request Feb 13, 2026
…openclaw#13746)

* fix(agents): wait for agent idle before flushing pending tool results

When pi-agent-core's auto-retry mechanism handles overloaded/rate-limit
errors, it resolves waitForRetry() on assistant message receipt — before
tool execution completes in the retried agent loop. This causes the
attempt's finally block to call flushPendingToolResults() while tools
are still executing, inserting synthetic 'missing tool result' errors
and causing silent agent failures.

The fix adds a waitForIdle() call before the flush to ensure the agent's
retry loop (including tool execution) has fully completed.

Evidence from real session: tool call and synthetic error were only 53ms
apart — the tool never had a chance to execute before being flushed.

Root cause is in pi-agent-core's _resolveRetry() firing on message_end
instead of agent_end, but this workaround in OpenClaw prevents the
symptom without requiring an upstream fix.

Fixes openclaw#8643
Fixes openclaw#13351
Refs openclaw#6682, openclaw#12595

* test: add tests for tool result flush race condition

Validates that:
- Real tool results are not replaced by synthetic errors when they arrive in time
- Flush correctly inserts synthetic errors for genuinely orphaned tool calls
- Flush is a no-op after real tool results have already been received

Refs openclaw#8643, openclaw#13748

* fix(agents): add waitForIdle to all flushPendingToolResults call sites

The original fix only covered the main run finally block, but there are
two additional call sites that can trigger flushPendingToolResults while
tools are still executing:

1. The catch block in attempt.ts (session setup error handler)
2. The finally block in compact.ts (compaction teardown)

Both now await agent.waitForIdle() with a 30s timeout before flushing,
matching the pattern already applied to the main finally block.

Production testing on VPS with debug logging confirmed these additional
paths can fire during sub-agent runs, producing spurious synthetic
'missing tool result' errors.

* fix(agents): centralize idle-wait flush and clear timeout handle

---------

Co-authored-by: Renue Development <[email protected]>
Co-authored-by: Peter Steinberger <[email protected]>
steipete added a commit to azade-c/openclaw that referenced this pull request Feb 14, 2026
…openclaw#13746)

Hansen1018 pushed a commit to Hansen1018/openclaw that referenced this pull request Feb 14, 2026
…openclaw#13746)

GwonHyeok pushed a commit to learners-superpumped/openclaw that referenced this pull request Feb 15, 2026
…openclaw#13746)

cloud-neutral pushed a commit to cloud-neutral-toolkit/openclawbot.svc.plus that referenced this pull request Feb 15, 2026
…openclaw#13746)


Labels

agents Agent runtime and tooling size: S

Development

Successfully merging this pull request may close these issues.

[Bug]: Intermittent "missing tool result in session history" errors
