fix(orchestrator): preserve completed responses on client disconnect#1059

Merged
looplj merged 1 commit into looplj:release/v0.9.x from Chengxiwei:pr/outbound-client-disconnect
Mar 14, 2026

Conversation

Contributor

@Chengxiwei Chengxiwei commented Mar 14, 2026

What this fixes

When the client disconnects near the end of a streamed response, AxonHub may observe ctx cancellation before the local stream consumer sees the terminal event.

In the Claude Code -> AxonHub -> Codex chain, this can misclassify a fully completed response as canceled / failed even though the full response has already been streamed from the upstream provider.

Approach

This update keeps the completion proof generic instead of special-casing OpenAI Responses status values.

The outbound finalization logic now treats a response as completed when either:

  • an explicit terminal event was observed
  • or aggregation succeeds and meta.Usage != nil

At the same time:

  • explicit upstream stream errors remain failures
  • ctx cancellation alone is no longer treated as automatic failure if completion has already been proven
  • missing usage is not treated as proof of failure; it is just absence of one completion signal
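Stated as code, the rule above might look like the following minimal Go sketch. The type and function names (`FinalMeta`, `Usage`, `finalStatus`) are illustrative assumptions, not AxonHub's actual identifiers:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative types; not the actual AxonHub definitions.
type Usage struct{ TotalTokens int }

type FinalMeta struct {
	Usage *Usage // non-nil when aggregation recovered provider-reported usage
}

var errUpstream = errors.New("upstream stream error")

// finalStatus sketches the decision rule: an explicit error always fails,
// and either completion proof (terminal event or aggregated usage) wins
// over a bare ctx cancellation.
func finalStatus(terminalSeen bool, meta *FinalMeta, streamErr, ctxErr error) string {
	switch {
	case streamErr != nil:
		return "failed" // explicit upstream errors remain failures
	case terminalSeen:
		return "completed" // terminal event observed
	case meta != nil && meta.Usage != nil:
		return "completed" // aggregation succeeded with usage populated
	case ctxErr != nil:
		return "canceled" // cancellation with no completion proof
	default:
		return "failed" // no completion proof at all: treat as failed in this sketch
	}
}

func main() {
	// Client canceled after the response was effectively complete:
	fmt.Println(finalStatus(false, &FinalMeta{Usage: &Usage{TotalTokens: 42}}, nil, errUpstream)) // completed
}
```

The key point is ordering: an explicit stream error always wins, and missing usage merely removes one completion signal rather than proving failure.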

Why this shape

The maintainer feedback was to avoid protocol-specific completion handling in the orchestrator.

Using aggregated usage as an additional positive completion signal makes the rule work across providers / formats that already populate usage on complete responses, while still preserving terminal-event-based completion for streams that do not rely on usage.

Tests

This PR includes:

  • targeted orchestrator regression coverage in internal/server/orchestrator/outbound_test.go
  • an integration-style client-disconnect regression in integration_test/openai/responses/streaming/client_disconnect_test.go

Covered cases include:

  • missing usage does not count as aggregated completion
  • aggregated usage counts as completion
  • client cancel after the response is effectively complete is still persisted as completed
  • explicit stream errors remain failures

Verification

  • go test ./internal/server/orchestrator/...

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical edge case in the orchestrator's outbound streaming logic where client disconnections near the completion of a stream could lead to incorrectly marking a fully completed response as 'canceled' or 'failed'. The changes introduce a more robust finalization mechanism that aggregates partial stream data to determine the true completion status, ensuring that valid responses are always persisted correctly, regardless of client-side timing issues. This improves the reliability of stream processing, particularly in complex multi-hop proxy environments.

Highlights

  • Outbound Stream Finalization Logic: The Close method in OutboundPersistentStream was refactored to intelligently determine the final status of a streaming response, especially when a client disconnects prematurely. It now aggregates collected chunks to check for completion even if a terminal event was not explicitly received.
  • Protocol-Aware Completion Detection: Introduced logic to detect completion based on the API format. For OpenAI Responses, a status: 'completed' field is required in the aggregated payload. For chat completions, completion can be inferred even without an explicit status field.
  • Preservation of Completed Responses: If the aggregated response is determined to be complete, the execution status is now correctly persisted as completed, even if the downstream client disconnected or the context was canceled before a terminal event was processed.
  • Enhanced Error Handling: Explicit upstream stream errors (not just context cancellations) are now treated as failures, ensuring that genuinely incomplete or corrupted responses are correctly marked.
  • Detailed Finalization Diagnostics: Added comprehensive logging to trace the decision-making process during stream finalization, providing better insights into why a stream was marked as completed, canceled, or failed.
  • New Integration Test Case: A new integration test (TestResponsesStreamingClientDisconnectAfterOutputTextDone) was added to specifically simulate a client disconnecting immediately after receiving the final text payload but before the explicit response.completed event, verifying the fix in a realistic scenario.
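The protocol-aware detection described above (later generalized away in favor of the usage-based signal) boils down to reading a top-level status field from the aggregated body. A hypothetical sketch — `aggregatedStatus` is illustrative and may differ from the real `aggregatedOutboundResponseStatus` helper:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// aggregatedStatus pulls the top-level "status" field out of an
// aggregated response body; unparseable bodies or bodies without the
// field yield "".
func aggregatedStatus(body []byte) string {
	var payload struct {
		Status string `json:"status"`
	}
	if err := json.Unmarshal(body, &payload); err != nil {
		return ""
	}
	return payload.Status
}

func main() {
	fmt.Println(aggregatedStatus([]byte(`{"status":"completed","output":[]}`))) // completed
	// Chat-completion payloads carry no status field, so the check
	// falls through to other completion signals:
	fmt.Println(aggregatedStatus([]byte(`{"choices":[]}`)) == "") // true
}
```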
Changelog
  • integration_test/openai/responses/streaming/client_disconnect_test.go
    • Added TestResponsesStreamingClientDisconnectAfterOutputTextDone to simulate client disconnects after final text output but before explicit completion events.
  • internal/server/orchestrator/outbound.go
    • Imported encoding/json for JSON unmarshalling.
    • Refactored Close method to incorporate aggregation logic for determining stream completion status.
    • Introduced logFinalizationDecision function for detailed logging of stream finalization outcomes.
    • Added aggregatedOutboundResponseStatus function to extract status from aggregated JSON response bodies.
    • Implemented persistAggregatedResponse function to handle persistence of aggregated responses, including usage logs and metrics.
    • Created isCompletedAggregatedOutboundResponse function to perform protocol-aware completion detection based on API format and response status.
  • internal/server/orchestrator/outbound_test.go
    • Added new imports for authz, enttest, request, and requestexecution.
    • Modified mockTransformer to include fields for aggregatedResponse, aggregatedMeta, aggregatedErr, and apiFormat to support new aggregation tests.
    • Updated mockTransformer.AggregateStreamChunks and mockTransformer.APIFormat to use the new fields.
    • Added sliceEventStream helper struct for mocking stream events in tests.
    • Introduced TestIsCompletedAggregatedOutboundResponse to test the new completion detection logic.
    • Added TestOutboundPersistentStream_Close_AggregatedResponsesCompletionHandling with sub-tests covering various scenarios: in-progress responses, aggregated completed responses, and client-canceled streams with completed aggregated responses.
Activity
  • The author, Chengxiwei, has implemented new logic to preserve completed responses even if the client disconnects prematurely.
  • New unit tests were added in internal/server/orchestrator/outbound_test.go to cover various scenarios of aggregated response completion and client disconnection.
  • An integration test was added in integration_test/openai/responses/streaming/client_disconnect_test.go to validate the fix in a real-world client disconnect scenario.
  • Verification steps using go test and specific integration test commands were provided by the author.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses an important edge case in stream handling where a completed stream could be incorrectly marked as canceled if the client disconnects near completion. The fix introduces logic to aggregate response chunks and determine completion status from the aggregated payload, even without a terminal stream event. The changes are well-tested with new integration and unit tests. My review focuses on a couple of areas for simplification and improved maintainability in the new logic.

@looplj
Owner

looplj commented Mar 14, 2026

This should be made more generic; there's no need to special-case Responses.

After aggregation, if usage has data, the response should already be complete.

@Chengxiwei
Contributor Author

Chengxiwei commented Mar 14, 2026

Updated this based on your suggestion.

The completion check is now generic instead of special-casing Responses:

  • explicit terminal events still count as completion
  • after aggregation, meta.Usage != nil is treated as an additional positive completion signal
  • explicit upstream stream errors remain failures

So ctx cancellation by itself no longer forces a failed / canceled outcome if completion has already been proven.

I also cleaned up the earlier review leftovers around the removed status-based path and kept the regression coverage for the client-disconnect case.

@Chengxiwei Chengxiwei force-pushed the pr/outbound-client-disconnect branch 2 times, most recently from 5c88768 to f913e45 on March 14, 2026 13:18
@Chengxiwei Chengxiwei force-pushed the pr/outbound-client-disconnect branch from f913e45 to fed294c on March 14, 2026 13:20
@looplj looplj merged commit 6b11671 into looplj:release/v0.9.x Mar 14, 2026
2 checks passed