Subagents: fix duplicate delivery when announce completion times out (#41235)#42098
Subagents: fix duplicate delivery when announce completion times out (#41235)#42098vasujain00 wants to merge 7 commits intoopenclaw:mainfrom
Conversation
…etries on gateway timeout Bump the default announce timeout from 60s to 90s to avoid false timeouts during long-running agent processes. Additionally, modify the announce delivery logic to prevent retries on gateway timeouts, which resolves issues with duplicate delivery of results to users. Update relevant tests and documentation to reflect these changes.
Greptile SummaryThis PR fixes a duplicate-delivery bug (#41235) in the subagent announce flow by (1) bumping the default
Confidence Score: 4/5
Last reviewed commit: 27ac051 |
| function isGatewayTimeoutError(error: unknown): boolean { | ||
| return /gateway timeout/i.test(summarizeDeliveryError(error)); | ||
| } |
There was a problem hiding this comment.
Duplicated regex may diverge from the transient-error list
isGatewayTimeoutError hard-codes /gateway timeout/i separately from the identical entry already in TRANSIENT_ANNOUNCE_DELIVERY_ERROR_PATTERNS (line 112). If the pattern in the transient list is ever tightened or reworded (e.g. to disambiguate from a client-side timeout), isGatewayTimeoutError won't follow suit and the no-retry guard will silently stop working. Consider deriving it from the existing constant instead:
| function isGatewayTimeoutError(error: unknown): boolean { | |
| return /gateway timeout/i.test(summarizeDeliveryError(error)); | |
| } | |
| const GATEWAY_TIMEOUT_PATTERN = /gateway timeout/i; | |
| // (and replace the inline regex in TRANSIENT_ANNOUNCE_DELIVERY_ERROR_PATTERNS with the same constant) | |
| function isGatewayTimeoutError(error: unknown): boolean { | |
| return GATEWAY_TIMEOUT_PATTERN.test(summarizeDeliveryError(error)); | |
| } |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/subagent-announce.ts
Line: 163-165
Comment:
**Duplicated regex may diverge from the transient-error list**
`isGatewayTimeoutError` hard-codes `/gateway timeout/i` separately from the identical entry already in `TRANSIENT_ANNOUNCE_DELIVERY_ERROR_PATTERNS` (line 112). If the pattern in the transient list is ever tightened or reworded (e.g. to disambiguate from a client-side timeout), `isGatewayTimeoutError` won't follow suit and the no-retry guard will silently stop working. Consider deriving it from the existing constant instead:
```suggestion
const GATEWAY_TIMEOUT_PATTERN = /gateway timeout/i;
// (and replace the inline regex in TRANSIENT_ANNOUNCE_DELIVERY_ERROR_PATTERNS with the same constant)
function isGatewayTimeoutError(error: unknown): boolean {
return GATEWAY_TIMEOUT_PATTERN.test(summarizeDeliveryError(error));
}
```
How can I resolve this? If you propose a fix, please make it concise.| it("uses 90s timeout by default for direct announce agent call", async () => { | ||
| await runAnnounceFlowForTest("run-default-timeout"); | ||
|
|
||
| const directAgentCall = findGatewayCall( | ||
| (call) => call.method === "agent" && call.expectFinal === true, | ||
| ); | ||
| expect(directAgentCall?.timeoutMs).toBe(60_000); | ||
| expect(directAgentCall?.timeoutMs).toBe(90_000); | ||
| }); |
There was a problem hiding this comment.
No test for the core no-retry-on-gateway-timeout behavior
The test file only verifies the timeout value change (60s → 90s). The primary behavioral fix — that runAnnounceDeliveryWithRetry does not retry when noRetryOnGatewayTimeout is true and the error matches gateway timeout — is not tested at all.
Without a test that:
- sets
expectsCompletionMessage: true, - makes
callGatewaythrow a gateway-timeout error, and - asserts
callGatewaywas called exactly once,
a future refactor could inadvertently re-enable the retry and reintroduce the duplicate-delivery bug with no CI signal.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/subagent-announce.timeout.test.ts
Line: 134-141
Comment:
**No test for the core no-retry-on-gateway-timeout behavior**
The test file only verifies the timeout value change (60s → 90s). The primary behavioral fix — that `runAnnounceDeliveryWithRetry` does **not** retry when `noRetryOnGatewayTimeout` is true and the error matches `gateway timeout` — is not tested at all.
Without a test that:
1. sets `expectsCompletionMessage: true`,
2. makes `callGateway` throw a gateway-timeout error, and
3. asserts `callGateway` was called exactly once,
a future refactor could inadvertently re-enable the retry and reintroduce the duplicate-delivery bug with no CI signal.
How can I resolve this? If you propose a fix, please make it concise.Consolidate the error handling condition for gateway timeouts in the announce delivery function. This change improves code readability by reducing the number of lines and maintaining the same functionality. No changes to the overall behavior of the retry mechanism were made.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 508d4334f5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| ? "completion direct announce agent call" | ||
| : "direct announce agent call", | ||
| signal: params.signal, | ||
| noRetryOnGatewayTimeout: params.expectsCompletionMessage, |
There was a problem hiding this comment.
Limit no-retry timeout policy to external completion delivery
Setting noRetryOnGatewayTimeout from params.expectsCompletionMessage makes all completion announces non-retriable, including internal subagent-to-subagent flows where deliver is false (sendSubagentAnnounceDirectly) and duplicate user delivery is not the risk this patch targets. In those cases, a transient gateway timeout now fails after a single attempt, and fallback queuing is not guaranteed (maybeQueueSubagentAnnounce can return none when the requester session is not active), so completions can be dropped and parent orchestration can stall waiting for missing child results.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 73a5dffc8c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| ? "completion direct announce agent call" | ||
| : "direct announce agent call", | ||
| signal: params.signal, | ||
| noRetryOnGatewayTimeout: params.expectsCompletionMessage, |
There was a problem hiding this comment.
Restrict timeout no-retry to externally delivered completions
noRetryOnGatewayTimeout is currently tied to expectsCompletionMessage, but completion mode is also the default for nested/internal subagent announces where deliver is false (shouldDeliverExternally is false for subagent requesters). In that internal path, a transient gateway timeout now exits after a single attempt, and the completion fallback queue can legitimately return none when the requester session is not active, so the child completion is dropped and parent orchestration can stall waiting for that result. Fresh evidence: spawn defaults set expectsCompletionMessage to true, so this branch affects normal nested runs, not only explicit user-delivery completions.
Useful? React with 👍 / 👎.
Updated the type definitions for the onPayload callback in various stream functions to ensure better type safety. This includes casting the onPayload parameter to accept unknown payloads and optional model parameters. Additionally, modified the tsconfig.json to exclude specific directories from compilation. These changes improve code clarity and maintainability without altering existing functionality.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb4111b4ca
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| ? "completion direct announce agent call" | ||
| : "direct announce agent call", | ||
| signal: params.signal, | ||
| noRetryOnGatewayTimeout: params.expectsCompletionMessage, |
There was a problem hiding this comment.
Limit timeout no-retry to externally delivered completions
noRetryOnGatewayTimeout is now keyed only off expectsCompletionMessage, so completion announces skip retries even when they are internal (deliver=false) paths that do not risk duplicate user delivery. Fresh evidence in this commit is that noRetryOnGatewayTimeout is set unconditionally from expectsCompletionMessage, while external delivery is determined separately. In nested/internal subagent flows, a transient gateway timeout now fails after a single attempt and fallback queueing can still return none, which can drop the completion signal and leave parent orchestration waiting on a result that never arrives.
Useful? React with 👍 / 👎.
…equest_target usage Added a new section to the zizmor.yml file to specify dangerous triggers, allowing for automation based on labels while ensuring safety by not executing PR code. This change enhances the configuration for better handling of automated responses.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62192a923d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| ? "completion direct announce agent call" | ||
| : "direct announce agent call", | ||
| signal: params.signal, | ||
| noRetryOnGatewayTimeout: params.expectsCompletionMessage, |
There was a problem hiding this comment.
Gate timeout no-retry on external completion delivery
noRetryOnGatewayTimeout is keyed only off expectsCompletionMessage, but completion mode is also the default for nested/subagent requesters where deliver is false (shouldDeliverExternally becomes false). In that internal path, a transient gateway timeout now bails after the first attempt, and the fallback queue can legitimately return none when the requester session is not active, so the completion signal can be dropped and the parent run may wait indefinitely for a child result.
Useful? React with 👍 / 👎.
|
Closing this PR in favor of a new narrow replacement. The current branch fixes a real bug, but it is carrying unrelated churn and has become a poor review artifact. The replacement PR should keep only:
Carry forward these review points into the replacement:
The new PR should tell one story only: prevent duplicate externally delivered completion announces on gateway timeout without weakening internal subagent completion reliability. |
|
Replacement narrowed PR is up in #43847. This carries forward the reviewed scope from #42098 only for:
It preserves the review follow-ups from #42098:
|
Fixes #41235.
When a subagent completes and the completion announcement times out (gateway timeout after 60s), the retry mechanism was retrying the agent call. That could cause the main agent to receive/process the result more than once and send duplicate messages to the user (e.g. on Telegram).
Changes
expectsCompletionMessageis true and the error is a gateway timeout, treat it as non-retryable. The server may have already delivered to the user; retrying risks duplicate delivery.Files modified
src/agents/subagent-announce.tssrc/config/types.agent-defaults.tssrc/agents/subagent-announce.timeout.test.tsCHANGELOG.md