Skip to content

fix(discord): retry outbound sends on transient network errors#45161

Open
codexGW wants to merge 2 commits intoopenclaw:mainfrom
codexGW:fix/discord-retry-network-errors
Open

fix(discord): retry outbound sends on transient network errors#45161
codexGW wants to merge 2 commits intoopenclaw:mainfrom
codexGW:fix/discord-retry-network-errors

Conversation

@codexGW
Copy link
Copy Markdown
Contributor

@codexGW codexGW commented Mar 13, 2026

Summary

Discord outbound message sends silently fail on transient network errors (TypeError: fetch failed, ECONNRESET, etc.) because the retry logic only handles HTTP status codes (429, 5xx), not network-level exceptions.

This causes the gateway to log [discord] final reply failed: TypeError: fetch failed and silently drop the user's reply.

Root Cause

Two retry predicates only checked HTTP status codes:

  1. isRetryableDiscordError() in reply-delivery.ts — checks .status for 429/5xx, but TypeError has no .status property
  2. shouldRetry in createDiscordRetryRunner() — only matched RateLimitError instances

Network-level errors (TypeError: fetch failed, ECONNRESET, ETIMEDOUT, etc.) bypassed both predicates entirely.

Changes

  • Added isRetryableDiscordNetworkError() helper to retry-policy.ts that detects:
    • TypeError with messages matching fetch failed or network
    • Errors with Node.js network codes: ECONNRESET, ECONNREFUSED, ETIMEDOUT, UND_ERR_CONNECT_TIMEOUT
  • Wired it into both isRetryableDiscordError() and createDiscordRetryRunner()
  • Follows the same pattern already used by the Telegram retry policy

Tests Added (all passing)

Test Verifies
retries bot send on network TypeError then succeeds TypeError('fetch failed') → retry → success
retries bot send on ECONNRESET then succeeds {code: 'ECONNRESET'} → retry → success
does not retry on 401 auth error 401 → immediate fail, 1 attempt
does not retry on 400 validation error 400 → immediate fail, 1 attempt
exhausts retries on persistent network failure TypeError on all attempts → fail after max attempts
Discord retry runner retries network errors (unit) Runner-level TypeError retry

All 28 tests in both test files pass. No existing tests were modified.

Impact

  • Transient network failures now retry with existing backoff defaults (3 attempts, 500ms-30s, 0.1 jitter)
  • No behavior change for non-network errors (429/5xx/4xx unchanged)
  • Duplicate message risk on ambiguous transport failures is inherent to any retry-on-POST; kept attempt count low to minimize

Fixes #5152

Previously, Discord reply delivery only retried on HTTP 429/5xx status codes.
Network-level failures (TypeError: fetch failed, ECONNRESET, etc.) were not
retried, causing silent message loss on transient network issues.

This adds network error detection to both the reply delivery retry predicate
and the shared Discord retry runner, matching the pattern already used by
the Telegram retry policy.

Fixes openclaw#5152
@openclaw-barnacle openclaw-barnacle bot added channel: discord Channel integration: discord size: S labels Mar 13, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR fixes a real gap in Discord outbound retry logic by adding isRetryableDiscordNetworkError() to handle transport-level failures (TypeError: fetch failed, ECONNRESET, etc.) that were previously bypassing both retry predicates and causing silent message drops. The change is wired into both isRetryableDiscordError() in reply-delivery.ts and shouldRetry in createDiscordRetryRunner(), and follows the same structural pattern as the existing Telegram retry policy.

Key observations:

  • The onRetry verbose log in createDiscordRetryRunner still emits "rate limited" for all retry triggers, which will be misleading for operators when the actual cause is a network error (e.g. in provider.ts where verbose: shouldLogVerbose() is passed).
  • isRetryableDiscordNetworkError short-circuits with false when an error carries a .code that isn't in DISCORD_RETRYABLE_ERROR_CODES, without falling through to the TypeError + message check. This is safe today since fetch TypeErrors from undici don't carry a .code, but the asymmetry is fragile against future error shape changes.
  • The new "does not retry on 400 validation error" test is functionally identical to the pre-existing "does not retry on 4xx client errors" test and adds no new coverage.

Confidence Score: 4/5

  • Safe to merge — fixes a real silent failure mode with no behavior change for non-network errors; minor observability and logic-asymmetry issues flagged but none are blockers.
  • The core fix is correct and well-tested. The duplicate-message risk on network retries is acknowledged in the PR description, and the attempt count (3) is kept low to minimize it. The two flagged issues (misleading log message, early-return asymmetry in isRetryableDiscordNetworkError) are non-blocking style concerns that don't affect correctness under current error shapes.
  • Review the onRetry log message in src/infra/retry-policy.ts and the early-return logic in isRetryableDiscordNetworkError before merging.

Comments Outside Diff (1)

  1. src/infra/retry-policy.ts, line 93-99 (link)

    Misleading "rate limited" log message for network retries

    The onRetry log always says "rate limited" regardless of the error type. Now that the runner also retries on network errors (TypeError, ECONNRESET, etc.), operators using verbose: true (e.g. in provider.ts via shouldLogVerbose()) will see confusing log entries like discord command deploy rate limited, retry 1/2 in 500ms for what is actually a transient network failure.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/infra/retry-policy.ts
    Line: 93-99
    
    Comment:
    **Misleading "rate limited" log message for network retries**
    
    The `onRetry` log always says `"rate limited"` regardless of the error type. Now that the runner also retries on network errors (`TypeError`, `ECONNRESET`, etc.), operators using `verbose: true` (e.g. in `provider.ts` via `shouldLogVerbose()`) will see confusing log entries like `discord command deploy rate limited, retry 1/2 in 500ms` for what is actually a transient network failure.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/infra/retry-policy.ts
Line: 93-99

Comment:
**Misleading "rate limited" log message for network retries**

The `onRetry` log always says `"rate limited"` regardless of the error type. Now that the runner also retries on network errors (`TypeError`, `ECONNRESET`, etc.), operators using `verbose: true` (e.g. in `provider.ts` via `shouldLogVerbose()`) will see confusing log entries like `discord command deploy rate limited, retry 1/2 in 500ms` for what is actually a transient network failure.

```suggestion
      onRetry: params.verbose
        ? (info) => {
            const labelText = info.label ?? "request";
            const maxRetries = Math.max(1, info.maxAttempts - 1);
            const reason =
              info.err instanceof RateLimitError ? "rate limited" : "network error";
            log.warn(
              `discord ${labelText} ${reason}, retry ${info.attempt}/${maxRetries} in ${info.delayMs}ms`,
            );
          }
        : undefined,
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/discord/monitor/reply-delivery.test.ts
Line: 410-426

Comment:
**Duplicate test coverage for 400 error**

This test ("does not retry on 400 validation error") is functionally identical to the existing "does not retry on 4xx client errors" test immediately below it (lines 428–444): both use `Object.assign(new Error("bad request"), { status: 400 })` and assert exactly 1 call. The new test adds no new coverage and may cause confusion. Consider removing it or replacing it with a distinct 4xx status (e.g. 422 or 403) to broaden coverage without duplication.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/infra/retry-policy.ts
Line: 68-76

Comment:
**Early return suppresses TypeError fallback when an unknown `.code` is present**

The current logic short-circuits on any string `.code` property:

```
if (typeof (err as { code?: unknown }).code === "string") {
  return DISCORD_RETRYABLE_ERROR_CODES.has(code);  // returns false for unknown codes
}
return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);
```

A `TypeError("fetch failed")` that also happens to carry an unrecognized `.code` (e.g. some future undici error variant or a library wrapper that attaches a code) will be classified as non-retryable and never reach the `TypeError` + message check. This may not matter today since standard fetch `TypeError`s from undici don't carry a `.code`, but the asymmetry is fragile. Consider checking the code-based path only when the code actually matches, and falling through otherwise:

```typescript
export function isRetryableDiscordNetworkError(err: unknown): boolean {
  if (!(err instanceof Error)) {
    return false;
  }
  const code = (err as { code?: unknown }).code;
  if (typeof code === "string" && DISCORD_RETRYABLE_ERROR_CODES.has(code)) {
    return true;
  }
  return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);
}
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: f9de2ac

Comment on lines +410 to +426
it("does not retry on 400 validation error", async () => {
const validationErr = Object.assign(new Error("bad request"), { status: 400 });
sendMessageDiscordMock.mockRejectedValueOnce(validationErr);

await expect(
deliverDiscordReply({
replies: [{ text: "fail" }],
target: "channel:123",
token: "token",
runtime,
cfg,
textLimit: 2000,
}),
).rejects.toThrow("bad request");

expect(sendMessageDiscordMock).toHaveBeenCalledTimes(1);
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate test coverage for 400 error

This test ("does not retry on 400 validation error") is functionally identical to the existing "does not retry on 4xx client errors" test immediately below it (lines 428–444): both use Object.assign(new Error("bad request"), { status: 400 }) and assert exactly 1 call. The new test adds no new coverage and may cause confusion. Consider removing it or replacing it with a distinct 4xx status (e.g. 422 or 403) to broaden coverage without duplication.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/discord/monitor/reply-delivery.test.ts
Line: 410-426

Comment:
**Duplicate test coverage for 400 error**

This test ("does not retry on 400 validation error") is functionally identical to the existing "does not retry on 4xx client errors" test immediately below it (lines 428–444): both use `Object.assign(new Error("bad request"), { status: 400 })` and assert exactly 1 call. The new test adds no new coverage and may cause confusion. Consider removing it or replacing it with a distinct 4xx status (e.g. 422 or 403) to broaden coverage without duplication.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +68 to +76
export function isRetryableDiscordNetworkError(err: unknown): boolean {
if (!(err instanceof Error)) {
return false;
}
if (typeof (err as { code?: unknown }).code === "string") {
return DISCORD_RETRYABLE_ERROR_CODES.has((err as { code: string }).code);
}
return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early return suppresses TypeError fallback when an unknown .code is present

The current logic short-circuits on any string .code property:

if (typeof (err as { code?: unknown }).code === "string") {
  return DISCORD_RETRYABLE_ERROR_CODES.has(code);  // returns false for unknown codes
}
return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);

A TypeError("fetch failed") that also happens to carry an unrecognized .code (e.g. some future undici error variant or a library wrapper that attaches a code) will be classified as non-retryable and never reach the TypeError + message check. This may not matter today since standard fetch TypeErrors from undici don't carry a .code, but the asymmetry is fragile. Consider checking the code-based path only when the code actually matches, and falling through otherwise:

export function isRetryableDiscordNetworkError(err: unknown): boolean {
  if (!(err instanceof Error)) {
    return false;
  }
  const code = (err as { code?: unknown }).code;
  if (typeof code === "string" && DISCORD_RETRYABLE_ERROR_CODES.has(code)) {
    return true;
  }
  return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/retry-policy.ts
Line: 68-76

Comment:
**Early return suppresses TypeError fallback when an unknown `.code` is present**

The current logic short-circuits on any string `.code` property:

```
if (typeof (err as { code?: unknown }).code === "string") {
  return DISCORD_RETRYABLE_ERROR_CODES.has(code);  // returns false for unknown codes
}
return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);
```

A `TypeError("fetch failed")` that also happens to carry an unrecognized `.code` (e.g. some future undici error variant or a library wrapper that attaches a code) will be classified as non-retryable and never reach the `TypeError` + message check. This may not matter today since standard fetch `TypeError`s from undici don't carry a `.code`, but the asymmetry is fragile. Consider checking the code-based path only when the code actually matches, and falling through otherwise:

```typescript
export function isRetryableDiscordNetworkError(err: unknown): boolean {
  if (!(err instanceof Error)) {
    return false;
  }
  const code = (err as { code?: unknown }).code;
  if (typeof code === "string" && DISCORD_RETRYABLE_ERROR_CODES.has(code)) {
    return true;
  }
  return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);
}
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f9de2ac189

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

...retryConfig,
label,
shouldRetry: (err) => err instanceof RateLimitError,
shouldRetry: (err) => err instanceof RateLimitError || isRetryableDiscordNetworkError(err),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Remove duplicated network retries across Discord send path

deliverDiscordReply already wraps outbound sends in sendWithRetry and now retries network exceptions there, so adding isRetryableDiscordNetworkError to createDiscordRetryRunner introduces two retry layers for the same failure class. In the common monitor flow this means a single transient network error can trigger up to 3 inner retries per call and then 3 outer retries of the whole send (up to 9 POST attempts), which materially increases duplicate-message risk when the first POST succeeded but the response was lost and also delays final failure reporting. Network retry should live in only one of these layers.

Useful? React with 👍 / 👎.

- Remove network retry from createDiscordRetryRunner to avoid double
  retry layers (reply-delivery sendWithRetry already retries). This
  prevents up to 9 POST attempts on a single transient failure.
- Fix early-return in isRetryableDiscordNetworkError: fall through to
  TypeError check when .code is present but unrecognized, instead of
  short-circuiting to false.
- Replace duplicate 400 test with 403 permission error test for
  distinct coverage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: discord Channel integration: discord size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway crashes on unhandled fetch failure (TypeError: fetch failed)

1 participant