fix(discord): add circuit breaker for WebSocket resume loop#29484
fix(discord): add circuit breaker for WebSocket resume loop#29484A386official wants to merge 2 commits intoopenclaw:mainfrom
Conversation
When the Discord gateway stalls (WebSocket opens but no HELLO received within 30s), OpenClaw enters an infinite resume loop because the underlying library resets its reconnect counter on each WebSocket open event. The stale session token is retried indefinitely. This adds an application-level circuit breaker that tracks consecutive stalls and, after 5, invalidates the session state (sessionId and resumeGatewayUrl) to force a fresh IDENTIFY instead of endlessly resuming a dead session. - Track consecutive stalls via debug event listener - After MAX_CONSECUTIVE_STALLS (5), clear gateway session state - Reset stall counter on successful HELLO or session resume - Log stall progress for observability (stall N/5) Fixes #13688
Greptile SummaryThis PR adds a circuit breaker to prevent infinite Discord WebSocket resume loops by tracking consecutive connection stalls and forcing a fresh IDENTIFY after 5 failures. Key changes:
Main concern:
Minor concerns:
The core circuit breaker logic is sound and well-tested, addressing a real production issue with good observability through logging. Confidence Score: 3/5
Last reviewed commit: 195edf0 |
| // Reset stall counter when gateway successfully receives HELLO | ||
| // (indicated by a successful resume or identify completing). | ||
| if (message.includes("Received HELLO") || message.includes("Session resumed")) { | ||
| consecutiveStalls = 0; | ||
| return; |
There was a problem hiding this comment.
verify that @buape/carbon actually emits debug messages containing "Received HELLO" and "Session resumed"
the circuit breaker's reset logic depends on detecting these exact strings in debug messages, but there's no evidence in the codebase that the library emits them. if these patterns don't match the actual library output, the counter won't reset on successful connections, causing the circuit breaker to trigger on 5 total stalls instead of 5 consecutive stalls (contradicting the PR description)
consider:
- add a comment documenting the exact debug message formats from the library
- or reset the counter in the timeout handler when
gateway.isConnected === true:
| // Reset stall counter when gateway successfully receives HELLO | |
| // (indicated by a successful resume or identify completing). | |
| if (message.includes("Received HELLO") || message.includes("Session resumed")) { | |
| consecutiveStalls = 0; | |
| return; | |
| // Reset stall counter when gateway successfully receives HELLO | |
| // (indicated by a successful resume or identify completing). | |
| if (message.includes("Received HELLO") || message.includes("Session resumed")) { | |
| consecutiveStalls = 0; | |
| return; | |
| } | |
| if (!message.includes("WebSocket connection opened")) { |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/discord/monitor/provider.lifecycle.ts
Line: 60-64
Comment:
verify that `@buape/carbon` actually emits debug messages containing "Received HELLO" and "Session resumed"
the circuit breaker's reset logic depends on detecting these exact strings in debug messages, but there's no evidence in the codebase that the library emits them. if these patterns don't match the actual library output, the counter won't reset on successful connections, causing the circuit breaker to trigger on 5 total stalls instead of 5 consecutive stalls (contradicting the PR description)
consider:
1. add a comment documenting the exact debug message formats from the library
2. or reset the counter in the timeout handler when `gateway.isConnected === true`:
```suggestion
// Reset stall counter when gateway successfully receives HELLO
// (indicated by a successful resume or identify completing).
if (message.includes("Received HELLO") || message.includes("Session resumed")) {
consecutiveStalls = 0;
return;
}
if (!message.includes("WebSocket connection opened")) {
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Good catch — verified this against the actual Carbon source. GatewayPlugin only emits three debug messages: "WebSocket connection opened", "WebSocket connection closed with code ...", and "Attempting resume/Reconnecting with backoff: ...". No HELLO or session resume messages.
Fixed in b23ac8f — removed the debug message listener entirely. The counter now resets in the timeout handler itself: when the 30s timeout fires and gateway.isConnected is true, that means HELLO was received and the session is healthy, so we reset to 0. This uses the documented public isConnected property instead of relying on internal debug strings.
| const state = ( | ||
| gateway as unknown as { | ||
| state?: { sessionId?: string | null; resumeGatewayUrl?: string | null }; | ||
| } | ||
| ).state; | ||
| if (state) { | ||
| state.sessionId = null; | ||
| state.resumeGatewayUrl = null; | ||
| } |
There was a problem hiding this comment.
unsafe type assertion accesses undocumented internal structure
if @buape/carbon changes its internal state property structure, this could fail silently. the GatewayPlugin type doesn't expose sessionId or resumeGatewayUrl, so you're relying on implementation details.
consider adding a comment explaining that this is accessing library internals, or check if the library provides a proper API to invalidate sessions
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/discord/monitor/provider.lifecycle.ts
Line: 84-92
Comment:
unsafe type assertion accesses undocumented internal structure
if `@buape/carbon` changes its internal `state` property structure, this could fail silently. the `GatewayPlugin` type doesn't expose `sessionId` or `resumeGatewayUrl`, so you're relying on implementation details.
consider adding a comment explaining that this is accessing library internals, or check if the library provides a proper API to invalidate sessions
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Checked — GatewayPlugin.state is protected and there's no public method to invalidate a session. Added a comment in b23ac8f documenting that this cast is pinned to @buape/[email protected] internals, so it'll get flagged if someone bumps the major version.
…bug strings @buape/carbon doesn't emit "Received HELLO" or "Session resumed" debug messages — the counter reset was effectively dead code. Instead, check gateway.isConnected when the 30s timeout fires: if the gateway is connected, HELLO was received and the counter resets to 0. Also documents the internal state cast (GatewayPlugin.state is protected with no public session invalidation method).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b23ac8f637
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (gateway?.isConnected) { | ||
| // HELLO was received and the gateway is healthy — reset the counter. | ||
| consecutiveStalls = 0; | ||
| } else { | ||
| consecutiveStalls++; |
There was a problem hiding this comment.
Reset stall counter on HELLO, not delayed connection state
The new circuit breaker increments consecutiveStalls whenever gateway.isConnected is false at the 30s timeout, but that also happens when a connection did receive HELLO and then dropped/reconnected before 30s. In that common flaky-network scenario, healthy (or partially healthy) sessions are miscounted as “no HELLO” stalls, and after five such cycles this code clears sessionId/resumeGatewayUrl and forces IDENTIFY unnecessarily, which can increase event loss/reconnect churn instead of just breaking stale-resume loops.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This doesn't happen in practice. When a connection drops, Carbon immediately opens a new WebSocket, which emits another "WebSocket connection opened" debug message. That clears the old timeout and starts a fresh 30s window (line 63-65). So the 30s timer from the original WS open never fires.
The only scenario where the timeout fires with isConnected === false is when the WebSocket stays open for the full 30s but no HELLO arrives, which is exactly the stale-resume loop we're targeting.
|
Addressed in b23ac8f. The original code never relied on "Received HELLO" or "Session resumed" strings - those were suggested by the review bot but were never in the actual implementation. The reset logic has always used Verified against tl;dr the concern about unverified debug strings was a false positive - the code already does what was suggested (option 2 in the review). |
|
Regarding the 3 failing CI checks — these are all pre-existing failures on
All 3 also fail on the latest |
Problem
The Discord WebSocket gateway can enter an infinite resume loop where it endlessly retries with a stale session token. This has been observed across multiple production deployments, with users reporting 2,142 WebSocket close events in 48 hours and outages lasting 1-7+ hours requiring manual gateway restarts (#13688).
Root Cause
When the WebSocket opens but no HELLO is received within 30 seconds, the zombie timeout handler calls
gateway.disconnect()+gateway.connect(false)to force a reconnect. However,@buape/carbon's internalreconnectAttemptscounter resets to 0 on every WebSocket open event, so the library'smaxAttempts: 50circuit breaker is never reached.The loop:
Production impact (from issue reports)
Solution
Add an application-level circuit breaker to the zombie timeout handler in `provider.lifecycle.ts`:
This breaks the infinite loop because a fresh IDENTIFY creates a new session rather than endlessly retrying a stale one.
Changes
`src/discord/monitor/provider.lifecycle.ts` — 1 file, ~30 lines changed:
`src/discord/monitor/provider.lifecycle.test.ts` — 4 new tests:
Test Plan
Fixes #13688
Supersedes #15762 (closed, merge conflicts)