Skip to content

Subagents: fix duplicate delivery when announce completion times out (#41235)#42098

Closed
vasujain00 wants to merge 7 commits intoopenclaw:mainfrom
vasujain00:fix/subagent-announce-duplicate-delivery-41235
Closed

Subagents: fix duplicate delivery when announce completion times out (#41235)#42098
vasujain00 wants to merge 7 commits intoopenclaw:mainfrom
vasujain00:fix/subagent-announce-duplicate-delivery-41235

Conversation

@vasujain00
Copy link
Copy Markdown

Fixes #41235.

When a subagent completes and the completion announcement times out (gateway timeout after 60s), the retry mechanism was retrying the agent call. That could cause the main agent to receive/process the result more than once and send duplicate messages to the user (e.g. on Telegram).

Changes

  1. Increase default announce timeout from 60s to 90s – agent runs regularly take 60–80s, so the previous 60s default caused avoidable timeouts.
  2. Do not retry completion announce on gateway timeout – when expectsCompletionMessage is true and the error is a gateway timeout, treat it as non-retryable. The server may have already delivered to the user; retrying risks duplicate delivery.

Files modified

  • src/agents/subagent-announce.ts
  • src/config/types.agent-defaults.ts
  • src/agents/subagent-announce.timeout.test.ts
  • CHANGELOG.md

…etries on gateway timeout

Bump the default announce timeout from 60s to 90s to avoid false timeouts during long-running agent processes. Additionally, modify the announce delivery logic to prevent retries on gateway timeouts, which resolves issues with duplicate delivery of results to users. Update relevant tests and documentation to reflect these changes.
@openclaw-barnacle openclaw-barnacle bot added agents Agent runtime and tooling size: XS labels Mar 10, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR fixes a duplicate-delivery bug (#41235) in the subagent announce flow by (1) bumping the default announceTimeoutMs from 60 s to 90 s to match typical agent run lengths, and (2) making runAnnounceDeliveryWithRetry skip retries when a gateway-timeout error occurs on a completion-announce call (expectsCompletionMessage: true). The changes are well-scoped and the guard logic is correctly ordered before the general transient-retry path.

  • Timeout increase (DEFAULT_SUBAGENT_ANNOUNCE_TIMEOUT_MS 60 000 → 90 000 ms): reduces the frequency of gateway timeouts during normal operation, addressing the root trigger of the original bug.
  • No-retry guard (noRetryOnGatewayTimeout): prevents the retry loop from re-sending an agent completion call that the server may have already processed, eliminating the duplicate user message.
  • Test gap: the new noRetryOnGatewayTimeout behaviour — the core of the fix — has no direct test. Only the timeout value change is asserted. A test that mocks a gateway-timeout throw and confirms callGateway is invoked exactly once when expectsCompletionMessage is true would prevent regression.
  • Regex duplication: isGatewayTimeoutError reproduces the same /gateway timeout/i pattern already in TRANSIENT_ANNOUNCE_DELIVERY_ERROR_PATTERNS; extracting a shared constant would prevent silent divergence if the pattern is ever refined.

Confidence Score: 4/5

  • PR is safe to merge; logic is correct and well-targeted, with one minor test coverage gap for the new no-retry path.
  • The fix is small, well-understood, and directly addresses the described bug. The guard is applied at the right point in the retry loop and the timeout bump is conservative and well-justified. The only concern is that the primary behavioral change (no retry on gateway timeout) lacks a dedicated test, leaving a regression risk if the code is refactored later.
  • src/agents/subagent-announce.timeout.test.ts — missing test for the no-retry-on-gateway-timeout behaviour

Last reviewed commit: 27ac051

Comment on lines +163 to +165
function isGatewayTimeoutError(error: unknown): boolean {
return /gateway timeout/i.test(summarizeDeliveryError(error));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated regex may diverge from the transient-error list

isGatewayTimeoutError hard-codes /gateway timeout/i separately from the identical entry already in TRANSIENT_ANNOUNCE_DELIVERY_ERROR_PATTERNS (line 112). If the pattern in the transient list is ever tightened or reworded (e.g. to disambiguate from a client-side timeout), isGatewayTimeoutError won't follow suit and the no-retry guard will silently stop working. Consider deriving it from the existing constant instead:

Suggested change
function isGatewayTimeoutError(error: unknown): boolean {
return /gateway timeout/i.test(summarizeDeliveryError(error));
}
const GATEWAY_TIMEOUT_PATTERN = /gateway timeout/i;
// (and replace the inline regex in TRANSIENT_ANNOUNCE_DELIVERY_ERROR_PATTERNS with the same constant)
function isGatewayTimeoutError(error: unknown): boolean {
return GATEWAY_TIMEOUT_PATTERN.test(summarizeDeliveryError(error));
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/subagent-announce.ts
Line: 163-165

Comment:
**Duplicated regex may diverge from the transient-error list**

`isGatewayTimeoutError` hard-codes `/gateway timeout/i` separately from the identical entry already in `TRANSIENT_ANNOUNCE_DELIVERY_ERROR_PATTERNS` (line 112). If the pattern in the transient list is ever tightened or reworded (e.g. to disambiguate from a client-side timeout), `isGatewayTimeoutError` won't follow suit and the no-retry guard will silently stop working. Consider deriving it from the existing constant instead:

```suggestion
const GATEWAY_TIMEOUT_PATTERN = /gateway timeout/i;
// (and replace the inline regex in TRANSIENT_ANNOUNCE_DELIVERY_ERROR_PATTERNS with the same constant)

function isGatewayTimeoutError(error: unknown): boolean {
  return GATEWAY_TIMEOUT_PATTERN.test(summarizeDeliveryError(error));
}
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +134 to 141
it("uses 90s timeout by default for direct announce agent call", async () => {
await runAnnounceFlowForTest("run-default-timeout");

const directAgentCall = findGatewayCall(
(call) => call.method === "agent" && call.expectFinal === true,
);
expect(directAgentCall?.timeoutMs).toBe(60_000);
expect(directAgentCall?.timeoutMs).toBe(90_000);
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No test for the core no-retry-on-gateway-timeout behavior

The test file only verifies the timeout value change (60s → 90s). The primary behavioral fix — that runAnnounceDeliveryWithRetry does not retry when noRetryOnGatewayTimeout is true and the error matches gateway timeout — is not tested at all.

Without a test that:

  1. sets expectsCompletionMessage: true,
  2. makes callGateway throw a gateway-timeout error, and
  3. asserts callGateway was called exactly once,

a future refactor could inadvertently re-enable the retry and reintroduce the duplicate-delivery bug with no CI signal.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/subagent-announce.timeout.test.ts
Line: 134-141

Comment:
**No test for the core no-retry-on-gateway-timeout behavior**

The test file only verifies the timeout value change (60s → 90s). The primary behavioral fix — that `runAnnounceDeliveryWithRetry` does **not** retry when `noRetryOnGatewayTimeout` is true and the error matches `gateway timeout` — is not tested at all.

Without a test that:
1. sets `expectsCompletionMessage: true`,
2. makes `callGateway` throw a gateway-timeout error, and
3. asserts `callGateway` was called exactly once,

a future refactor could inadvertently re-enable the retry and reintroduce the duplicate-delivery bug with no CI signal.

How can I resolve this? If you propose a fix, please make it concise.

Consolidate the error handling condition for gateway timeouts in the announce delivery function. This change improves code readability by reducing the number of lines and maintaining the same functionality. No changes to the overall behavior of the retry mechanism were made.
@openclaw-barnacle openclaw-barnacle bot added the cli CLI command changes label Mar 10, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 508d4334f5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

? "completion direct announce agent call"
: "direct announce agent call",
signal: params.signal,
noRetryOnGatewayTimeout: params.expectsCompletionMessage,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Limit no-retry timeout policy to external completion delivery

Setting noRetryOnGatewayTimeout from params.expectsCompletionMessage makes all completion announces non-retriable, including internal subagent-to-subagent flows where deliver is false (sendSubagentAnnounceDirectly) and duplicate user delivery is not the risk this patch targets. In those cases, a transient gateway timeout now fails after a single attempt, and fallback queuing is not guaranteed (maybeQueueSubagentAnnounce can return none when the requester session is not active), so completions can be dropped and parent orchestration can stall waiting for missing child results.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 73a5dffc8c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

? "completion direct announce agent call"
: "direct announce agent call",
signal: params.signal,
noRetryOnGatewayTimeout: params.expectsCompletionMessage,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restrict timeout no-retry to externally delivered completions

noRetryOnGatewayTimeout is currently tied to expectsCompletionMessage, but completion mode is also the default for nested/internal subagent announces where deliver is false (shouldDeliverExternally is false for subagent requesters). In that internal path, a transient gateway timeout now exits after a single attempt, and the completion fallback queue can legitimately return none when the requester session is not active, so the child completion is dropped and parent orchestration can stall waiting for that result. Fresh evidence: spawn defaults set expectsCompletionMessage to true, so this branch affects normal nested runs, not only explicit user-delivery completions.

Useful? React with 👍 / 👎.

Updated the type definitions for the onPayload callback in various stream functions to ensure better type safety. This includes casting the onPayload parameter to accept unknown payloads and optional model parameters. Additionally, modified the tsconfig.json to exclude specific directories from compilation. These changes improve code clarity and maintainability without altering existing functionality.
@openclaw-barnacle openclaw-barnacle bot added size: M and removed cli CLI command changes size: XS labels Mar 11, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb4111b4ca

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

? "completion direct announce agent call"
: "direct announce agent call",
signal: params.signal,
noRetryOnGatewayTimeout: params.expectsCompletionMessage,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Limit timeout no-retry to externally delivered completions

noRetryOnGatewayTimeout is now keyed only off expectsCompletionMessage, so completion announces skip retries even when they are internal (deliver=false) paths that do not risk duplicate user delivery. Fresh evidence in this commit is that noRetryOnGatewayTimeout is set unconditionally from expectsCompletionMessage, while external delivery is determined separately. In nested/internal subagent flows, a transient gateway timeout now fails after a single attempt and fallback queueing can still return none, which can drop the completion signal and leave parent orchestration waiting on a result that never arrives.

Useful? React with 👍 / 👎.

vasujain00 and others added 2 commits March 11, 2026 16:57
…equest_target usage

Added a new section to the zizmor.yml file to specify dangerous triggers, allowing for automation based on labels while ensuring safety by not executing PR code. This change enhances the configuration for better handling of automated responses.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62192a923d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

? "completion direct announce agent call"
: "direct announce agent call",
signal: params.signal,
noRetryOnGatewayTimeout: params.expectsCompletionMessage,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Gate timeout no-retry on external completion delivery

noRetryOnGatewayTimeout is keyed only off expectsCompletionMessage, but completion mode is also the default for nested/subagent requesters where deliver is false (shouldDeliverExternally becomes false). In that internal path, a transient gateway timeout now bails after the first attempt, and the fallback queue can legitimately return none when the requester session is not active, so the completion signal can be dropped and the parent run may wait indefinitely for a child result.

Useful? React with 👍 / 👎.

@vincentkoc
Copy link
Copy Markdown
Member

Closing this PR in favor of a new narrow replacement.

The current branch fixes a real bug, but it is carrying unrelated churn and has become a poor review artifact. The replacement PR should keep only:

  • src/agents/subagent-announce.ts
  • src/agents/subagent-announce.timeout.test.ts
  • src/config/types.agent-defaults.ts
  • CHANGELOG.md

Carry forward these review points into the replacement:

  1. src/agents/subagent-announce.ts
  • Derive the gateway-timeout matcher from a shared GATEWAY_TIMEOUT_PATTERN constant instead of duplicating the regex separately from the transient-error list.
  • Gate noRetryOnGatewayTimeout on externally delivered completion announces only. Do not apply the no-retry rule to internal/nested completion flows where deliver=false, because those paths can legitimately need retries and otherwise risk dropping child completion signals.
  1. src/agents/subagent-announce.timeout.test.ts
  • Add one direct regression for the actual bug: completion announce + gateway-timeout error + external delivery path => exactly one gateway call, no retry.
  1. src/config/types.agent-defaults.ts
  • Keep the timeout default change only if it is still justified in the narrowed PR and tested alongside the no-retry behavior. Do not bring unrelated typing/runtime churn with it.

The new PR should tell one story only: prevent duplicate externally delivered completion announces on gateway timeout without weakening internal subagent completion reliability.

@vincentkoc
Copy link
Copy Markdown
Member

Replacement narrowed PR is up in #43847.

This carries forward the reviewed scope from #42098 only for:

  • src/agents/subagent-announce.ts
  • src/agents/subagent-announce.timeout.test.ts
  • src/config/types.agent-defaults.ts
  • CHANGELOG.md

It preserves the review follow-ups from #42098:

  • shared gateway-timeout matcher
  • no-retry only for externally delivered completion announces
  • direct regression for "completion announce gateway timeout does not retry"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Subagent announce completion timeout causes duplicate messages to user

2 participants