fix(discord): retry outbound sends on transient network errors#45161
fix(discord): retry outbound sends on transient network errors#45161codexGW wants to merge 2 commits intoopenclaw:mainfrom
Conversation
Previously, Discord reply delivery only retried on HTTP 429/5xx status codes. Network-level failures (TypeError: fetch failed, ECONNRESET, etc.) were not retried, causing silent message loss on transient network issues. This adds network error detection to both the reply delivery retry predicate and the shared Discord retry runner, matching the pattern already used by the Telegram retry policy. Fixes openclaw#5152
Greptile SummaryThis PR fixes a real gap in Discord outbound retry logic by adding Key observations:
Confidence Score: 4/5
|
| it("does not retry on 400 validation error", async () => { | ||
| const validationErr = Object.assign(new Error("bad request"), { status: 400 }); | ||
| sendMessageDiscordMock.mockRejectedValueOnce(validationErr); | ||
|
|
||
| await expect( | ||
| deliverDiscordReply({ | ||
| replies: [{ text: "fail" }], | ||
| target: "channel:123", | ||
| token: "token", | ||
| runtime, | ||
| cfg, | ||
| textLimit: 2000, | ||
| }), | ||
| ).rejects.toThrow("bad request"); | ||
|
|
||
| expect(sendMessageDiscordMock).toHaveBeenCalledTimes(1); | ||
| }); |
There was a problem hiding this comment.
Duplicate test coverage for 400 error
This test ("does not retry on 400 validation error") is functionally identical to the existing "does not retry on 4xx client errors" test immediately below it (lines 428–444): both use Object.assign(new Error("bad request"), { status: 400 }) and assert exactly 1 call. The new test adds no new coverage and may cause confusion. Consider removing it or replacing it with a distinct 4xx status (e.g. 422 or 403) to broaden coverage without duplication.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/discord/monitor/reply-delivery.test.ts
Line: 410-426
Comment:
**Duplicate test coverage for 400 error**
This test ("does not retry on 400 validation error") is functionally identical to the existing "does not retry on 4xx client errors" test immediately below it (lines 428–444): both use `Object.assign(new Error("bad request"), { status: 400 })` and assert exactly 1 call. The new test adds no new coverage and may cause confusion. Consider removing it or replacing it with a distinct 4xx status (e.g. 422 or 403) to broaden coverage without duplication.
How can I resolve this? If you propose a fix, please make it concise.| export function isRetryableDiscordNetworkError(err: unknown): boolean { | ||
| if (!(err instanceof Error)) { | ||
| return false; | ||
| } | ||
| if (typeof (err as { code?: unknown }).code === "string") { | ||
| return DISCORD_RETRYABLE_ERROR_CODES.has((err as { code: string }).code); | ||
| } | ||
| return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message); | ||
| } |
There was a problem hiding this comment.
Early return suppresses TypeError fallback when an unknown .code is present
The current logic short-circuits on any string .code property:
if (typeof (err as { code?: unknown }).code === "string") {
return DISCORD_RETRYABLE_ERROR_CODES.has(code); // returns false for unknown codes
}
return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);
A TypeError("fetch failed") that also happens to carry an unrecognized .code (e.g. some future undici error variant or a library wrapper that attaches a code) will be classified as non-retryable and never reach the TypeError + message check. This may not matter today since standard fetch TypeErrors from undici don't carry a .code, but the asymmetry is fragile. Consider checking the code-based path only when the code actually matches, and falling through otherwise:
export function isRetryableDiscordNetworkError(err: unknown): boolean {
if (!(err instanceof Error)) {
return false;
}
const code = (err as { code?: unknown }).code;
if (typeof code === "string" && DISCORD_RETRYABLE_ERROR_CODES.has(code)) {
return true;
}
return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);
}Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/retry-policy.ts
Line: 68-76
Comment:
**Early return suppresses TypeError fallback when an unknown `.code` is present**
The current logic short-circuits on any string `.code` property:
```
if (typeof (err as { code?: unknown }).code === "string") {
return DISCORD_RETRYABLE_ERROR_CODES.has(code); // returns false for unknown codes
}
return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);
```
A `TypeError("fetch failed")` that also happens to carry an unrecognized `.code` (e.g. some future undici error variant or a library wrapper that attaches a code) will be classified as non-retryable and never reach the `TypeError` + message check. This may not matter today since standard fetch `TypeError`s from undici don't carry a `.code`, but the asymmetry is fragile. Consider checking the code-based path only when the code actually matches, and falling through otherwise:
```typescript
export function isRetryableDiscordNetworkError(err: unknown): boolean {
if (!(err instanceof Error)) {
return false;
}
const code = (err as { code?: unknown }).code;
if (typeof code === "string" && DISCORD_RETRYABLE_ERROR_CODES.has(code)) {
return true;
}
return err instanceof TypeError && DISCORD_NETWORK_ERROR_RE.test(err.message);
}
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f9de2ac189
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/infra/retry-policy.ts
Outdated
| ...retryConfig, | ||
| label, | ||
| shouldRetry: (err) => err instanceof RateLimitError, | ||
| shouldRetry: (err) => err instanceof RateLimitError || isRetryableDiscordNetworkError(err), |
There was a problem hiding this comment.
Remove duplicated network retries across Discord send path
deliverDiscordReply already wraps outbound sends in sendWithRetry and now retries network exceptions there, so adding isRetryableDiscordNetworkError to createDiscordRetryRunner introduces two retry layers for the same failure class. In the common monitor flow this means a single transient network error can trigger up to 3 inner retries per call and then 3 outer retries of the whole send (up to 9 POST attempts), which materially increases duplicate-message risk when the first POST succeeded but the response was lost and also delays final failure reporting. Network retry should live in only one of these layers.
Useful? React with 👍 / 👎.
- Remove network retry from createDiscordRetryRunner to avoid double retry layers (reply-delivery sendWithRetry already retries). This prevents up to 9 POST attempts on a single transient failure. - Fix early-return in isRetryableDiscordNetworkError: fall through to TypeError check when .code is present but unrecognized, instead of short-circuiting to false. - Replace duplicate 400 test with 403 permission error test for distinct coverage.
Summary
Discord outbound message sends silently fail on transient network errors (
TypeError: fetch failed,ECONNRESET, etc.) because the retry logic only handles HTTP status codes (429, 5xx), not network-level exceptions.This causes the gateway to log
[discord] final reply failed: TypeError: fetch failedand silently drop the user's reply.Root Cause
Two retry predicates only checked HTTP status codes:
isRetryableDiscordError()inreply-delivery.ts— checks.statusfor 429/5xx, butTypeErrorhas no.statuspropertyshouldRetryincreateDiscordRetryRunner()— only matchedRateLimitErrorinstancesNetwork-level errors (
TypeError: fetch failed,ECONNRESET,ETIMEDOUT, etc.) bypassed both predicates entirely.Changes
isRetryableDiscordNetworkError()helper toretry-policy.tsthat detects:TypeErrorwith messages matchingfetch failedornetworkECONNRESET,ECONNREFUSED,ETIMEDOUT,UND_ERR_CONNECT_TIMEOUTisRetryableDiscordError()andcreateDiscordRetryRunner()Tests Added (all passing)
TypeError('fetch failed')→ retry → success{code: 'ECONNRESET'}→ retry → successAll 28 tests in both test files pass. No existing tests were modified.
Impact
Fixes #5152