fix(llm): defer Anthropic stream start event until after message_start #90697
Extended thinking sessions become permanently broken after a gateway restart or Anthropic
prompt-cache TTL expiry because the recovery retry for thinking-block signature validation
errors was silently skipped.
In pumpStreamWithRecovery (thinking.ts), the yieldedOutput flag prevents retry once any
non-error event has been emitted. streamAnthropic previously pushed {type:'start'} before
the SSE event loop, so when a pre-stream SSE error arrived (Anthropic sends signature
validation errors before message_start), yieldedOutput was already true and the retry
branch was skipped.
Deferring stream.push({type:'start'}) into the message_start handler keeps yieldedOutput=false
when a pre-stream error arrives, letting pumpStreamWithRecovery strip all thinking blocks
from context and retry the request cleanly.
Fixes openclaw#90667
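
To make the gate described above concrete, here is a minimal sketch of how a no-retry-after-output guard interacts with event ordering. It is not the code in thinking.ts: `pumpWithRecovery`, `Source`, and the event shapes are hypothetical names chosen for illustration, the stream is modeled as a plain array, and the sketch retries on any pre-output error, whereas the real recovery is described as matching the thinking-signature error pattern first.

```typescript
// Hypothetical event shape; the real types in thinking.ts may differ.
type StreamEvent =
  | { type: "start" }
  | { type: "text"; text: string }
  | { type: "error"; message: string };

// A source returns the full event sequence for one attempt.
// `stripThinking` is true only on the retry attempt.
type Source = (stripThinking: boolean) => StreamEvent[];

function pumpWithRecovery(source: Source): StreamEvent[] {
  const out: StreamEvent[] = [];
  let yieldedOutput = false;
  for (const event of source(false)) {
    if (event.type === "error") {
      if (!yieldedOutput) {
        // Nothing visible was emitted yet, so a clean retry is safe.
        return source(true);
      }
      // Output already reached the consumer: surface the error, no retry.
      out.push(event);
      return out;
    }
    yieldedOutput = true; // any non-error event closes the retry window
    out.push(event);
  }
  return out;
}

const signatureError: StreamEvent = {
  type: "error",
  message: "Invalid signature in thinking block",
};

// Old ordering: `start` is pushed before the pre-stream error is seen.
const eagerStart: Source = (strip) =>
  strip
    ? [{ type: "start" }, { type: "text", text: "ok" }]
    : [{ type: "start" }, signatureError];

// New ordering: the pre-stream error is the first event of the attempt.
const deferredStart: Source = (strip) =>
  strip
    ? [{ type: "start" }, { type: "text", text: "ok" }]
    : [signatureError];

const eagerResult = pumpWithRecovery(eagerStart);
const deferredResult = pumpWithRecovery(deferredStart);
```

With the eager source the pump returns `start` followed by the error and never retries; with the deferred source the error arrives first, so the retry runs and the second attempt's events are returned.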
Codex review: passed. Reviewed June 5, 2026, 10:17 PM ET / 02:17 UTC.

Summary
PR surface: Source +5, Tests +149. Total +154 across 4 files.
Reproducibility: Do we have a high-confidence way to reproduce the issue? Yes from source: current main emits …
Review metrics: none identified.
Merge readiness: Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Review details
Best possible solution: Land the narrow provider and transport ordering fix after normal maintainer/automerge gates, while preserving recovery's no-retry-after-output guard.
Is this the best way to solve the issue? Yes; fixing the provider and transport event ordering preserves the recovery safety invariant instead of weakening retry behavior after visible output has already streamed.
AGENTS.md: found and applied where relevant.
Codex review notes: model gpt-5.5, reasoning high; reviewed against 3a2f54e6a866.

What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.
… message_start
Applies the same start-event deferral fix from src/llm/providers/anthropic.ts to the
embedded-agent default path. resolveEmbeddedAgentStreamFn routes anthropic-messages
through createBoundaryAwareStreamFnForModel → createAnthropicMessagesTransportStreamFn,
so the thinking-block recovery bug (pumpStreamWithRecovery yieldedOutput gate) affects
the production embedded path via this file, not just the provider stream.
Moves stream.push({type:'start'}) from before the SDK event loop into the message_start
handler, keeping yieldedOutput=false in pumpStreamWithRecovery when an SSE event: error
arrives before message_start (as Anthropic sends for invalid thinking signatures).
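
As an illustration of the deferral described in this commit, the sketch below translates a simplified sequence of SSE frames into output events and emits `start` only from the `message_start` branch. The frame and event shapes, the `inputTokens` field, and the `translateFrames` helper are assumptions made for this example; they are not the types used in the transport file, and real Anthropic events carry more fields.

```typescript
// Hypothetical, simplified SSE frames. The event names are real Anthropic
// stream events; the payload fields are reduced for illustration.
type SseFrame =
  | { event: "message_start"; inputTokens: number }
  | { event: "content_block_delta"; text: string }
  | { event: "message_stop" }
  | { event: "error"; message: string };

// Hypothetical output events pushed to the consumer.
type OutEvent =
  | { type: "start"; usage: { input: number } }
  | { type: "text_delta"; delta: string }
  | { type: "done" }
  | { type: "error"; message: string };

function translateFrames(frames: SseFrame[], push: (e: OutEvent) => void): void {
  // Deliberately no push({ type: "start" }) here: emitting before the loop
  // would count as output even if the first frame turns out to be an error.
  for (const frame of frames) {
    switch (frame.event) {
      case "message_start":
        // `start` is emitted only once the server has accepted the request,
        // so it can also carry the usage reported by message_start.
        push({ type: "start", usage: { input: frame.inputTokens } });
        break;
      case "content_block_delta":
        push({ type: "text_delta", delta: frame.text });
        break;
      case "message_stop":
        push({ type: "done" });
        break;
      case "error":
        push({ type: "error", message: frame.message });
        return; // terminal: nothing follows an error frame
    }
  }
}

const normal: OutEvent[] = [];
translateFrames(
  [
    { event: "message_start", inputTokens: 12 },
    { event: "content_block_delta", text: "hi" },
    { event: "message_stop" },
  ],
  (e) => normal.push(e),
);

const failed: OutEvent[] = [];
translateFrames(
  [{ event: "error", message: "Invalid signature in thinking block" }],
  (e) => failed.push(e),
);
```

In the normal flow the consumer sees `start`, `text_delta`, `done`; in the pre-stream error flow it sees only `error`, which is the precondition the recovery gate relies on.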
@clawsweeper automerge
🦞✅ The automerge loop is complete.
openclaw#90697)

Summary:
- The branch moves Anthropic `start` emission into `message_start` handling for the provider and transport stream paths and adds focused ordering/error tests.
- PR surface: Source +5, Tests +149. Total +154 across 4 files.
- Reproducibility: Do we have a high-confidence way to reproduce the issue? Yes from source: current main emit ... ecovery intentionally refuses to retry after any non-error output; no live expired-cache run was performed.

Automerge notes:
- PR branch already contained follow-up commit before automerge: fix(agents): defer Anthropic transport stream start event until after…

Validation:
- ClawSweeper review passed for head 399a243.
- Required merge gates passed before the squash merge.

Prepared head SHA: 399a243
Review: openclaw#90697 (comment)
Co-authored-by: openperf <[email protected]>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Approved-by: takhoffman
Co-authored-by: takhoffman <[email protected]>
Summary
- Both Anthropic stream implementations pushed `{type:'start'}` before entering the SSE event loop. In `pumpStreamWithRecovery` (`thinking.ts`), `yieldedOutput` is set to `true` for every non-error event; once true, the thinking-block recovery retry is permanently disabled for that request. Anthropic sends thinking-signature validation errors as SSE `event: error` before `message_start` (pre-generation request validation). With the old placement, the `start` event set `yieldedOutput=true` before the SSE error arrived; recovery was silently skipped; the stale signatures stayed in the transcript; every subsequent turn also failed.
- This PR defers `stream.push({type:'start'})` into the `message_start` handler in both stream implementations so a pre-stream SSE error always arrives while `yieldedOutput=false`, letting `pumpStreamWithRecovery` strip all thinking blocks and retry cleanly.
- `src/llm/providers/anthropic.ts`: move `stream.push({type:'start'})` from before the SSE loop into the `message_start` event handler body (provider stream path).
- `src/agents/anthropic-transport-stream.ts`: same deferral for the embedded-agent default path (`resolveEmbeddedAgentStreamFn` → `createBoundaryAwareStreamFnForModel` → `createAnthropicMessagesTransportStreamFn`).
- `src/llm/providers/anthropic.test.ts`: two new tests: verify `start` is deferred until after `message_start` in normal flow; verify a pre-stream SSE error arrives with no preceding `start`.
- `src/agents/anthropic-transport-stream.test.ts`: same two ordering tests for the transport path.
- `pumpStreamWithRecovery`, `wrapAnthropicStreamWithRecovery`, `stripAllThinkingBlocks`: unchanged; the fix only corrects the provider-level event ordering.

Reproduction
- Before the fix: Anthropic returns `event: error` "Invalid signature in thinking block" before `message_start`; `{type:'start'}` was already emitted before the SSE loop; `yieldedOutput=true`; `pumpStreamWithRecovery` skips the retry; the error propagates; all subsequent turns also fail permanently.
- After the fix: `{type:'start'}` is deferred until inside the `message_start` handler in both stream implementations; `yieldedOutput=false` when the pre-stream SSE error arrives; `pumpStreamWithRecovery` detects the thinking-signature pattern, strips all thinking blocks, and retries; the session continues normally.

Real behavior proof
Behavior addressed (#90667): extended thinking sessions permanently broken after gateway restart or Anthropic prompt-cache TTL expiry; every turn after the first signature error fails because `yieldedOutput=true` silently blocked the recovery retry in both the provider and transport stream paths.

Real environment tested (Linux, Node 22; Vitest against the production `streamAnthropic` SSE event loop, `pumpStreamWithRecovery` recovery harness, and `createAnthropicMessagesTransportStreamFn` transport stream): `src/llm/providers/anthropic.test.ts` exercises the deferral ordering contract directly against `streamAnthropic`; `src/agents/embedded-agent-runner/thinking.test.ts` exercises `pumpStreamWithRecovery` with a stream that emits `{type:'error'}` before any `{type:'start'}`, the exact scenario this fix enables; `src/agents/anthropic-transport-stream.test.ts` covers the embedded-agent default path.

Exact steps or command run after this patch: `pnpm test src/llm/providers/anthropic.test.ts src/agents/embedded-agent-runner/thinking.test.ts src/agents/anthropic-transport-stream.test.ts`; `node scripts/run-oxlint.mjs` and `pnpm format:fix` on all four changed files.

Evidence after fix (Vitest output for touched test files):
The four new tests cover the deferral ordering contract: `start` is emitted only after `message_start` processing in both stream implementations; a pre-stream SSE error arrives with no preceding `start` event.

Observed result after fix: the "retries pre-content terminal stream-error events" case in `thinking.test.ts` exercises the exact recovery path: the stream emits `{type:'error'}` with no preceding `{type:'start'}` → retry fires (`callCount=2`). The new `anthropic.test.ts` and `anthropic-transport-stream.test.ts` tests confirm both implementations now uphold the `yieldedOutput=false` precondition that recovery requires.

What was not tested: live Anthropic API call with expired cache and thinking blocks (requires waiting 5+ min for cache TTL expiry in a real session).

Repro confirmation: the pre-existing "retries pre-content terminal stream-error events" test was already gated on `yieldedOutput=false`; it tested the recovery machinery in isolation but could not catch the provider-level ordering bug. Before this fix, both stream implementations set `yieldedOutput=true` before any SSE error arrived, so that test passed while the real provider path failed silently. The new tests close that gap by driving both stream implementations directly.

Risk / Mitigation
Risk: the `start` event now carries populated `usage` (filled from `message_start`) instead of a zero-usage snapshot. Mitigation: `start.partial` is always treated as a partial/in-progress snapshot by all consumers; no caller depends on usage being zero; the change is strictly more informative.

Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Fixes #90667
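
For readers unfamiliar with the recovery step the Summary refers to (stripping all thinking blocks before the retry), the following is a rough sketch of what such a pass could look like. It is not the repository's `stripAllThinkingBlocks`: the name `stripThinkingSketch`, the `Message` and `ContentBlock` shapes, and the choice to drop messages left with no content are all assumptions for illustration, loosely modeled on the Anthropic Messages API block types `thinking` and `redacted_thinking`.

```typescript
// Simplified content-block and message shapes (assumed for this sketch).
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "thinking"; thinking: string; signature: string }
  | { type: "redacted_thinking"; data: string };

interface Message {
  role: "user" | "assistant";
  content: ContentBlock[];
}

// Returns a new transcript without thinking blocks; the input is not mutated.
function stripThinkingSketch(messages: Message[]): Message[] {
  return messages
    .map((m) => ({
      role: m.role,
      content: m.content.filter(
        (b) => b.type !== "thinking" && b.type !== "redacted_thinking",
      ),
    }))
    // Assumption: a message left with no content is dropped before resending.
    .filter((m) => m.content.length > 0);
}

const transcript: Message[] = [
  { role: "user", content: [{ type: "text", text: "What is 2 + 2?" }] },
  {
    role: "assistant",
    content: [
      { type: "thinking", thinking: "Add the numbers.", signature: "stale-signature" },
      { type: "text", text: "4" },
    ],
  },
];

const cleaned = stripThinkingSketch(transcript);
```

Because the stale signatures live only inside the thinking blocks, a retry built from `cleaned` has nothing left for the signature validation to reject; the visible text of earlier turns is preserved.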