fix(discord): retry bot sends on rate limit and server errors during chunk delivery#33054
fix(discord): retry bot sends on rate limit and server errors during chunk delivery#33054AnCoSONG wants to merge 1 commit intoopenclaw:mainfrom
Conversation
…chunk delivery When a long reply is split into multiple Discord message chunks, a 429 rate-limit or 5xx error on any chunk causes all remaining chunks to be silently dropped. Add sendWithRetry() that retries up to 2 times with backoff for retryable errors (429/5xx) and respects retry-after headers. Closes openclaw#32887 Made-with: Cursor
Greptile SummaryThis PR adds a The retry logic is sound — Additionally, the four new retry tests use real Confidence Score: 2/5
Last reviewed commit: 88009d4 |
| it("retries bot send on 429 rate limit then succeeds", async () => { | ||
| const rateLimitErr = Object.assign(new Error("rate limited"), { status: 429 }); | ||
| sendMessageDiscordMock | ||
| .mockRejectedValueOnce(rateLimitErr) | ||
| .mockResolvedValueOnce({ messageId: "msg-1", channelId: "channel-1" }); | ||
|
|
||
| await deliverDiscordReply({ | ||
| replies: [{ text: "retry me" }], | ||
| target: "channel:123", | ||
| token: "token", | ||
| runtime, | ||
| textLimit: 2000, | ||
| }); | ||
|
|
||
| expect(sendMessageDiscordMock).toHaveBeenCalledTimes(2); | ||
| }); | ||
|
|
||
| it("retries bot send on 500 server error then succeeds", async () => { | ||
| const serverErr = Object.assign(new Error("internal"), { status: 500 }); | ||
| sendMessageDiscordMock | ||
| .mockRejectedValueOnce(serverErr) | ||
| .mockResolvedValueOnce({ messageId: "msg-1", channelId: "channel-1" }); | ||
|
|
||
| await deliverDiscordReply({ | ||
| replies: [{ text: "retry me" }], | ||
| target: "channel:123", | ||
| token: "token", | ||
| runtime, | ||
| textLimit: 2000, | ||
| }); | ||
|
|
||
| expect(sendMessageDiscordMock).toHaveBeenCalledTimes(2); | ||
| }); | ||
|
|
||
| it("does not retry on 4xx client errors", async () => { | ||
| const clientErr = Object.assign(new Error("bad request"), { status: 400 }); | ||
| sendMessageDiscordMock.mockRejectedValueOnce(clientErr); | ||
|
|
||
| await expect( | ||
| deliverDiscordReply({ | ||
| replies: [{ text: "fail" }], | ||
| target: "channel:123", | ||
| token: "token", | ||
| runtime, | ||
| textLimit: 2000, | ||
| }), | ||
| ).rejects.toThrow("bad request"); | ||
|
|
||
| expect(sendMessageDiscordMock).toHaveBeenCalledTimes(1); | ||
| }); | ||
|
|
||
| it("throws after exhausting retry attempts", async () => { | ||
| const rateLimitErr = Object.assign(new Error("rate limited"), { status: 429 }); | ||
| sendMessageDiscordMock.mockRejectedValue(rateLimitErr); | ||
|
|
||
| await expect( | ||
| deliverDiscordReply({ | ||
| replies: [{ text: "persistent failure" }], | ||
| target: "channel:123", | ||
| token: "token", | ||
| runtime, | ||
| textLimit: 2000, | ||
| }), | ||
| ).rejects.toThrow("rate limited"); | ||
|
|
||
| // 1 initial + 2 retries = 3 attempts | ||
| expect(sendMessageDiscordMock).toHaveBeenCalledTimes(3); | ||
| }); |
There was a problem hiding this comment.
Retry tests run real timers instead of mocked timers, which adds ~5 seconds of wall-clock sleep to the test suite:
- "retries bot send on 429…" — sleeps 1000 ms
- "retries bot send on 500…" — sleeps 1000 ms
- "throws after exhausting retry attempts" — sleeps 1000 + 2000 = 3000 ms
The "touches bound-thread activity" test (line 214) demonstrates the correct pattern with vi.useFakeTimers(). Consider wrapping each retry test similarly:
it("retries bot send on 429 rate limit then succeeds", async () => {
vi.useFakeTimers();
try {
const rateLimitErr = Object.assign(new Error("rate limited"), { status: 429 });
sendMessageDiscordMock
.mockRejectedValueOnce(rateLimitErr)
.mockResolvedValueOnce({ messageId: "msg-1", channelId: "channel-1" });
const promise = deliverDiscordReply({ ... });
await vi.runAllTimersAsync();
await promise;
expect(sendMessageDiscordMock).toHaveBeenCalledTimes(2);
} finally {
vi.useRealTimers();
}
});This eliminates the real delays while still verifying retry logic.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/discord/monitor/reply-delivery.test.ts
Line: 262-329
Comment:
Retry tests run real timers instead of mocked timers, which adds ~5 seconds of wall-clock sleep to the test suite:
- "retries bot send on 429…" — sleeps 1000 ms
- "retries bot send on 500…" — sleeps 1000 ms
- "throws after exhausting retry attempts" — sleeps 1000 + 2000 = 3000 ms
The "touches bound-thread activity" test (line 214) demonstrates the correct pattern with `vi.useFakeTimers()`. Consider wrapping each retry test similarly:
```ts
it("retries bot send on 429 rate limit then succeeds", async () => {
vi.useFakeTimers();
try {
const rateLimitErr = Object.assign(new Error("rate limited"), { status: 429 });
sendMessageDiscordMock
.mockRejectedValueOnce(rateLimitErr)
.mockResolvedValueOnce({ messageId: "msg-1", channelId: "channel-1" });
const promise = deliverDiscordReply({ ... });
await vi.runAllTimersAsync();
await promise;
expect(sendMessageDiscordMock).toHaveBeenCalledTimes(2);
} finally {
vi.useRealTimers();
}
});
```
This eliminates the real delays while still verifying retry logic.
How can I resolve this? If you propose a fix, please make it concise.
Additional Comments (1)
Compare with the identical call pattern in Without this wrapper, users sending text + media payloads that encounter transient errors will still experience silent delivery failure. Prompt To Fix With AIThis is a comment left during a code review.
Path: src/discord/monitor/reply-delivery.ts
Line: 295-302
Comment:
This `sendMessageDiscord` call (non-voice, media-with-text path) is not wrapped in `sendWithRetry`, leaving it vulnerable to the same 429/5xx failures that the PR claims to fix.
Compare with the identical call pattern in `sendDiscordChunkWithFallback` (line 143-150) and `sendAdditionalDiscordMedia` (line 163-171), which both wrap `sendMessageDiscord` in `sendWithRetry`. This path should also be wrapped with the same retry pattern to maintain consistency and ensure users sending text + media payloads receive the same transient-error resilience as other delivery paths.
Without this wrapper, users sending text + media payloads that encounter transient errors will still experience silent delivery failure.
How can I resolve this? If you propose a fix, please make it concise. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 88009d4a65
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const RETRY_BASE_DELAY_MS = 1000; | ||
|
|
||
| function isRetryableDiscordError(err: unknown): { retryable: boolean; retryAfterMs: number } { | ||
| const status = (err as { status?: number }).status ?? (err as { statusCode?: number }).statusCode; |
There was a problem hiding this comment.
Parse nested Discord status before deciding retryability
The new retry classifier only checks err.status and err.statusCode, but Discord failures in this repo are also handled with err.response.status (see extractDiscordErrorStatus in src/discord/monitor/thread-bindings.discord-api.ts). If sendMessageDiscord throws that shape for a 429 or 5xx, this function marks it non-retryable and the remaining chunks are still dropped after the first failed send, so the fix does not cover those real error objects.
Useful? React with 👍 / 👎.
|
Superseded by #33226 (combined Discord chunk ordering + retry fixes). |
Summary
sendWithRetry()wrapper around the bot sender path insendDiscordChunkWithFallbackandsendAdditionalDiscordMedia. Retries up to 2 times for 429/5xx with backoff, respectsretry-afterheader, and immediately throws for non-retryable 4xx errors.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
Security Impact (required)
No)No)No— same Discord API calls, just retried on failure)No)No)Repro + Verification
Environment
Steps
Expected
Actual
Evidence
Human Verification (required)
pnpm test src/discord/monitor/reply-delivery.test.ts— 13/13 passCompatibility / Migration
Yes)No)No)Failure Recovery (if this breaks)
src/discord/monitor/reply-delivery.tsRisks and Mitigations
Made with Cursor