fix(discord): retry outbound sends on transient network errors by codexGW · Pull Request #45161 · openclaw/openclaw

codexGW · 2026-03-13T14:05:00Z

Summary

Discord outbound message sends silently fail on transient network errors (TypeError: fetch failed, ECONNRESET, etc.) because the retry logic only handles HTTP status codes (429, 5xx), not network-level exceptions.

This causes the gateway to log [discord] final reply failed: TypeError: fetch failed and silently drop the user's reply.

Root Cause

Two retry predicates only checked HTTP status codes:

isRetryableDiscordError() in reply-delivery.ts — checks .status for 429/5xx, but TypeError has no .status property
shouldRetry in createDiscordRetryRunner() — only matched RateLimitError instances

Network-level errors (TypeError: fetch failed, ECONNRESET, ETIMEDOUT, etc.) bypassed both predicates entirely.

Changes

Added isRetryableDiscordNetworkError() helper to retry-policy.ts that detects:
- TypeError with messages matching fetch failed or network
- Errors with Node.js network codes: ECONNRESET, ECONNREFUSED, ETIMEDOUT, UND_ERR_CONNECT_TIMEOUT
Wired it into both isRetryableDiscordError() and createDiscordRetryRunner()
Follows the same pattern already used by the Telegram retry policy

Tests Added (all passing)

Test	Verifies
retries bot send on network TypeError then succeeds	`TypeError('fetch failed')` → retry → success
retries bot send on ECONNRESET then succeeds	`{code: 'ECONNRESET'}` → retry → success
does not retry on 401 auth error	401 → immediate fail, 1 attempt
does not retry on 400 validation error	400 → immediate fail, 1 attempt
exhausts retries on persistent network failure	TypeError on all attempts → fail after max attempts
Discord retry runner retries network errors (unit)	Runner-level TypeError retry

All 28 tests in both test files pass. No existing tests were modified.

Impact

Transient network failures now retry with existing backoff defaults (3 attempts, 500ms-30s, 0.1 jitter)
No behavior change for non-network errors (429/5xx/4xx unchanged)
Duplicate message risk on ambiguous transport failures is inherent to any retry-on-POST; kept attempt count low to minimize

Fixes #5152

Previously, Discord reply delivery only retried on HTTP 429/5xx status codes. Network-level failures (TypeError: fetch failed, ECONNRESET, etc.) were not retried, causing silent message loss on transient network issues. This adds network error detection to both the reply delivery retry predicate and the shared Discord retry runner, matching the pattern already used by the Telegram retry policy. Fixes openclaw#5152

greptile-apps · 2026-03-13T14:07:46Z

Greptile Summary

This PR fixes a real gap in Discord outbound retry logic by adding isRetryableDiscordNetworkError() to handle transport-level failures (TypeError: fetch failed, ECONNRESET, etc.) that were previously bypassing both retry predicates and causing silent message drops. The change is wired into both isRetryableDiscordError() in reply-delivery.ts and shouldRetry in createDiscordRetryRunner(), and follows the same structural pattern as the existing Telegram retry policy.

Key observations:

The onRetry verbose log in createDiscordRetryRunner still emits "rate limited" for all retry triggers, which will be misleading for operators when the actual cause is a network error (e.g. in provider.ts where verbose: shouldLogVerbose() is passed).
isRetryableDiscordNetworkError short-circuits with false when an error carries a .code that isn't in DISCORD_RETRYABLE_ERROR_CODES, without falling through to the TypeError + message check. This is safe today since fetch TypeErrors from undici don't carry a .code, but the asymmetry is fragile against future error shape changes.
The new "does not retry on 400 validation error" test is functionally identical to the pre-existing "does not retry on 4xx client errors" test and adds no new coverage.

Confidence Score: 4/5

Safe to merge — fixes a real silent failure mode with no behavior change for non-network errors; minor observability and logic-asymmetry issues flagged but none are blockers.
The core fix is correct and well-tested. The duplicate-message risk on network retries is acknowledged in the PR description, and the attempt count (3) is kept low to minimize it. The two flagged issues (misleading log message, early-return asymmetry in isRetryableDiscordNetworkError) are non-blocking style concerns that don't affect correctness under current error shapes.
Review the onRetry log message in src/infra/retry-policy.ts and the early-return logic in isRetryableDiscordNetworkError before merging.

Comments Outside Diff (1)

src/infra/retry-policy.ts, line 93-99 (link)

Misleading "rate limited" log message for network retries

The onRetry log always says "rate limited" regardless of the error type. Now that the runner also retries on network errors (TypeError, ECONNRESET, etc.), operators using verbose: true (e.g. in provider.ts via shouldLogVerbose()) will see confusing log entries like discord command deploy rate limited, retry 1/2 in 500ms for what is actually a transient network failure.

Prompt To Fix With AI

This is a comment left during a code review.
Path: src/infra/retry-policy.ts
Line: 93-99

Comment:
**Misleading "rate limited" log message for network retries**

The `onRetry` log always says `"rate limited"` regardless of the error type. Now that the runner also retries on network errors (`TypeError`, `ECONNRESET`, etc.), operators using `verbose: true` (e.g. in `provider.ts` via `shouldLogVerbose()`) will see confusing log entries like `discord command deploy rate limited, retry 1/2 in 500ms` for what is actually a transient network failure.



How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/infra/retry-policy.ts
Line: 93-99

Comment:
**Misleading "rate limited" log message for network retries**

The `onRetry` log always says `"rate limited"` regardless of the error type. Now that the runner also retries on network errors (`TypeError`, `ECONNRESET`, etc.), operators using `verbose: true` (e.g. in `provider.ts` via `shouldLogVerbose()`) will see confusing log entries like `discord command deploy rate limited, retry 1/2 in 500ms` for what is actually a transient network failure.

```suggestion
      onRetry: params.verbose
        ? (info) => {
            const labelText = info.label ?? "request";
            const maxRetries = Math.max(1, info.maxAttempts - 1);
            const reason =
              info.err instanceof RateLimitError ? "rate limited" : "network error";
            log.warn(
              `discord ${labelText} ${reason}, retry ${info.attempt}/${maxRetries} in ${info.delayMs}ms`,
            );
          }
        : undefined,
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/discord/monitor/reply-delivery.test.ts
Line: 410-426

Comment:
**Duplicate test coverage for 400 error**

This test ("does not retry on 400 validation error") is functionally identical to the existing "does not retry on 4xx client errors" test immediately below it (lines 428–444): both use `Object.assign(new Error("bad request"), { status: 400 })` and assert exactly 1 call. The new test adds no new coverage and may cause confusion. Consider removing it or replacing it with a distinct 4xx status (e.g. 422 or 403) to broaden coverage without duplication.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/infra/retry-policy.ts
Line: 68-76

Comment:
**Early return suppresses TypeError fallback when an unknown `.code` is present**

The current logic short-circuits on any string `.code` property:

```
if (typeof (err as { code?: unknown }).code === "string") {
  return DISCORD_RETRYABLE_ERROR_CODES.has(code);  // returns false for unknown codes
}
return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);
```

A `TypeError("fetch failed")` that also happens to carry an unrecognized `.code` (e.g. some future undici error variant or a library wrapper that attaches a code) will be classified as non-retryable and never reach the `TypeError` + message check. This may not matter today since standard fetch `TypeError`s from undici don't carry a `.code`, but the asymmetry is fragile. Consider checking the code-based path only when the code actually matches, and falling through otherwise:

```typescript
export function isRetryableDiscordNetworkError(err: unknown): boolean {
  if (!(err instanceof Error)) {
    return false;
  }
  const code = (err as { code?: unknown }).code;
  if (typeof code === "string" && DISCORD_RETRYABLE_ERROR_CODES.has(code)) {
    return true;
  }
  return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);
}
```

How can I resolve this? If you propose a fix, please make it concise.

_{Last reviewed commit: f9de2ac}

greptile-apps · 2026-03-13T14:07:53Z

src/discord/monitor/reply-delivery.test.ts

+  it("does not retry on 400 validation error", async () => {
+    const validationErr = Object.assign(new Error("bad request"), { status: 400 });
+    sendMessageDiscordMock.mockRejectedValueOnce(validationErr);
+
+    await expect(
+      deliverDiscordReply({
+        replies: [{ text: "fail" }],
+        target: "channel:123",
+        token: "token",
+        runtime,
+        cfg,
+        textLimit: 2000,
+      }),
+    ).rejects.toThrow("bad request");
+
+    expect(sendMessageDiscordMock).toHaveBeenCalledTimes(1);
+  });


Duplicate test coverage for 400 error

This test ("does not retry on 400 validation error") is functionally identical to the existing "does not retry on 4xx client errors" test immediately below it (lines 428–444): both use Object.assign(new Error("bad request"), { status: 400 }) and assert exactly 1 call. The new test adds no new coverage and may cause confusion. Consider removing it or replacing it with a distinct 4xx status (e.g. 422 or 403) to broaden coverage without duplication.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/discord/monitor/reply-delivery.test.ts Line: 410-426 Comment: **Duplicate test coverage for 400 error** This test ("does not retry on 400 validation error") is functionally identical to the existing "does not retry on 4xx client errors" test immediately below it (lines 428–444): both use `Object.assign(new Error("bad request"), { status: 400 })` and assert exactly 1 call. The new test adds no new coverage and may cause confusion. Consider removing it or replacing it with a distinct 4xx status (e.g. 422 or 403) to broaden coverage without duplication. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-03-13T14:07:54Z

src/infra/retry-policy.ts

+export function isRetryableDiscordNetworkError(err: unknown): boolean {
+  if (!(err instanceof Error)) {
+    return false;
+  }
+  if (typeof (err as { code?: unknown }).code === "string") {
+    return DISCORD_RETRYABLE_ERROR_CODES.has((err as { code: string }).code);
+  }
+  return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);
+}


Early return suppresses TypeError fallback when an unknown .code is present

The current logic short-circuits on any string .code property:

if (typeof (err as { code?: unknown }).code === "string") { return DISCORD_RETRYABLE_ERROR_CODES.has(code); // returns false for unknown codes } return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);

A TypeError("fetch failed") that also happens to carry an unrecognized .code (e.g. some future undici error variant or a library wrapper that attaches a code) will be classified as non-retryable and never reach the TypeError + message check. This may not matter today since standard fetch TypeErrors from undici don't carry a .code, but the asymmetry is fragile. Consider checking the code-based path only when the code actually matches, and falling through otherwise:

export function isRetryableDiscordNetworkError(err: unknown): boolean { if (!(err instanceof Error)) { return false; } const code = (err as { code?: unknown }).code; if (typeof code === "string" && DISCORD_RETRYABLE_ERROR_CODES.has(code)) { return true; } return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message); }

Prompt To Fix With AI

This is a comment left during a code review. Path: src/infra/retry-policy.ts Line: 68-76 Comment: **Early return suppresses TypeError fallback when an unknown `.code` is present** The current logic short-circuits on any string `.code` property: ``` if (typeof (err as { code?: unknown }).code === "string") { return DISCORD_RETRYABLE_ERROR_CODES.has(code); // returns false for unknown codes } return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message); ``` A `TypeError("fetch failed")` that also happens to carry an unrecognized `.code` (e.g. some future undici error variant or a library wrapper that attaches a code) will be classified as non-retryable and never reach the `TypeError` + message check. This may not matter today since standard fetch `TypeError`s from undici don't carry a `.code`, but the asymmetry is fragile. Consider checking the code-based path only when the code actually matches, and falling through otherwise: ```typescript export function isRetryableDiscordNetworkError(err: unknown): boolean { if (!(err instanceof Error)) { return false; } const code = (err as { code?: unknown }).code; if (typeof code === "string" && DISCORD_RETRYABLE_ERROR_CODES.has(code)) { return true; } return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message); } ``` How can I resolve this? If you propose a fix, please make it concise.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f9de2ac189

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-13T14:11:31Z

src/infra/retry-policy.ts

      ...retryConfig,
      label,
-      shouldRetry: (err) => err instanceof RateLimitError,
+      shouldRetry: (err) => err instanceof RateLimitError || isRetryableDiscordNetworkError(err),


Remove duplicated network retries across Discord send path

deliverDiscordReply already wraps outbound sends in sendWithRetry and now retries network exceptions there, so adding isRetryableDiscordNetworkError to createDiscordRetryRunner introduces two retry layers for the same failure class. In the common monitor flow this means a single transient network error can trigger up to 3 inner retries per call and then 3 outer retries of the whole send (up to 9 POST attempts), which materially increases duplicate-message risk when the first POST succeeded but the response was lost and also delays final failure reporting. Network retry should live in only one of these layers.

Useful? React with 👍 / 👎.

- Remove network retry from createDiscordRetryRunner to avoid double retry layers (reply-delivery sendWithRetry already retries). This prevents up to 9 POST attempts on a single transient failure. - Fix early-return in isRetryableDiscordNetworkError: fall through to TypeError check when .code is present but unrecognized, instead of short-circuiting to false. - Replace duplicate 400 test with 403 permission error test for distinct coverage.

openclaw-barnacle bot added channel: discord Channel integration: discord size: S labels Mar 13, 2026

greptile-apps bot reviewed Mar 13, 2026

View reviewed changes

chatgpt-codex-connector bot reviewed Mar 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(discord): retry outbound sends on transient network errors#45161

fix(discord): retry outbound sends on transient network errors#45161
codexGW wants to merge 2 commits intoopenclaw:mainfrom
codexGW:fix/discord-retry-network-errors

codexGW commented Mar 13, 2026

Uh oh!

greptile-apps bot commented Mar 13, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

greptile-apps bot Mar 13, 2026

Uh oh!

greptile-apps bot Mar 13, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

codexGW commented Mar 13, 2026

Summary

Root Cause

Changes

Tests Added (all passing)

Impact

Uh oh!

greptile-apps bot commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Comments Outside Diff (1)

Uh oh!

greptile-apps bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps bot commented Mar 13, 2026 •

edited

Loading