fix(discord): add circuit breaker for WebSocket resume loop #15762
funmerlin wants to merge 2 commits into openclaw:main
…#13180)

When the Discord WebSocket connection enters a stall loop (connects but never receives HELLO), the existing zombie timeout handler would disconnect and reconnect indefinitely. The library's reconnectAttempts counter resets on every WebSocket open event, so the maxAttempts circuit breaker is never reached.

This adds an application-level circuit breaker:

- Track consecutive stalls (WS opens but no HELLO within 30s)
- After 5 consecutive stalls, invalidate the session state (sessionId, resumeGatewayUrl) and force a fresh IDENTIFY instead of resume
- Log stall count on each attempt for observability
- Reset counter on successful HELLO receipt

This breaks the infinite resume loop observed in production where 1400+ reconnect attempts occurred over 12+ hours with a stale session token.

Fixes: openclaw#13180
bfc1ccb to f92900f
This pull request has been automatically marked as stale due to inactivity.
Confirming this bug — root cause is in
…ect-circuit-breaker
Closing as AI-assisted stale-fix triage. Linked issue #13180 ("Discord WebSocket: resume loop needs circuit breaker") is currently closed and was closed on 2026-02-14T02:07:23Z with state reason completed. If this specific implementation is still needed on current main, please reopen #15762 (or open a new focused fix PR) and reference #13180 for fast re-triage.
This was closed citing that linked issue #13180 is resolved, but #13180 was closed as a duplicate of #13688, which remains open and is actively reproduced multiple times per week. The root issue is unsolved. MarcBickel's Feb 21 comment contains a full root cause trace and a working 10-line patch tracing the bug to @buape/carbon's reconnect counter reset. That work shouldn't be lost. Requesting this PR be reopened or the patch carried forward into a new PR.
This can stay closed, or perhaps should close once the fix for buape/carbon#353 has made its way into our codebase. If @MarcBickel is right, then there is no reason to actually merge this workaround, which only combats symptoms. You are right, though, that the reason given for closing was faulty 😅
Problem
The Discord WebSocket connection can enter an unrecoverable resume loop where it endlessly retries with a stale session token. Observed in production: 1,400+ reconnect attempts over 12+ hours before manual intervention.
Root cause
When the WS opens but never receives a HELLO (Discord gateway stall), OpenClaw's zombie timeout handler calls gateway.disconnect() + gateway.connect(false) to force a reconnect. However, the underlying library (@buape/carbon) resets reconnectAttempts = 0 on every WebSocket open event, so the library's own circuit breaker (maxAttempts) is never reached.

The zombie timeout effectively creates an infinite loop:
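A minimal simulation of that interaction may make the failure mode easier to see. It is a sketch under stated assumptions, not @buape/carbon's actual code: `simulateStallLoop`, the `LibraryReconnectState` shape, and the `maxAttempts` value of 5 are all hypothetical. Only the behaviour being modelled (the counter is zeroed on every socket open) comes from the description above.

```typescript
// Hypothetical model of a reconnect counter; NOT the real @buape/carbon API.
interface LibraryReconnectState {
  reconnectAttempts: number;
  maxAttempts: number; // assumed value, for illustration only
}

// Runs `cycles` rounds of: reconnect attempt -> socket opens -> no HELLO ->
// zombie timeout forces another reconnect.
// Returns true if the library-level breaker (maxAttempts) ever trips.
function simulateStallLoop(cycles: number, resetOnOpen: boolean): boolean {
  const lib: LibraryReconnectState = { reconnectAttempts: 0, maxAttempts: 5 };
  for (let i = 0; i < cycles; i++) {
    lib.reconnectAttempts += 1; // a reconnect is attempted
    if (lib.reconnectAttempts > lib.maxAttempts) {
      return true; // breaker trips: the library gives up
    }
    if (resetOnOpen) {
      lib.reconnectAttempts = 0; // socket opened, so the counter is cleared
    }
    // No HELLO arrives; the zombie timeout starts the next cycle.
  }
  return false; // breaker never tripped
}
```

With `resetOnOpen` set to `true` the counter never gets past 1, so the function returns `false` however many cycles are run. With it set to `false` the breaker trips on the sixth attempt.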
Timeline from production logs (Feb 13, 2026)
Total: 717 resume attempts, 36 connection stalls, 708 WS close code 1005.
Fix
Add an application-level circuit breaker to the zombie timeout handler:
- After 5 consecutive stalls, invalidate the session state (sessionId, resumeGatewayUrl) and force a fresh IDENTIFY instead of trying to resume with a dead session token

This breaks the loop because a fresh IDENTIFY creates a new session rather than trying to resume a stale one.
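The counting logic can be sketched as a small class. This is an illustration rather than the code in the PR: `StallCircuitBreaker`, `GatewayState`, and the `"resume"` / `"fresh-identify"` return values are hypothetical names. Tripping on the fifth stall and resetting the counter after a trip are assumptions. The threshold of 5 and the two cleared fields come from the description.

```typescript
// Hypothetical shape holding only the two fields the PR says it clears.
interface GatewayState {
  sessionId: string | null;
  resumeGatewayUrl: string | null;
}

const MAX_STALL_RETRIES = 5; // threshold named in the PR description

class StallCircuitBreaker {
  private consecutiveStalls = 0;

  // HELLO arrived: the connection is healthy, so the streak is over.
  onHello(): void {
    this.consecutiveStalls = 0;
  }

  // The socket opened but HELLO did not arrive in time.
  // Returns how the caller should reconnect.
  onStall(state: GatewayState): "resume" | "fresh-identify" {
    this.consecutiveStalls += 1;
    if (this.consecutiveStalls >= MAX_STALL_RETRIES) {
      // Drop the stale session so the next connect cannot attempt a resume.
      state.sessionId = null;
      state.resumeGatewayUrl = null;
      this.consecutiveStalls = 0; // assumption: start a new streak after tripping
      return "fresh-identify";
    }
    return "resume";
  }

  get stalls(): number {
    return this.consecutiveStalls;
  }
}
```

A caller would invoke `onStall` from the zombie timeout handler and `onHello` when the gateway greets the client, logging `stalls` on each attempt for observability.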
Changes
src/discord/monitor/provider.ts: Added MAX_STALL_RETRIES (5) and a consecutiveStalls counter to the zombie timeout handler. On circuit breaker trip, nullifies gateway.state.sessionId and gateway.state.resumeGatewayUrl before reconnecting.

Fixes #13180
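For context, a handler like this needs a way to decide that a stall has happened. The sketch below shows one possible HELLO watchdog. It assumes the caller can observe "socket opened" and "HELLO received" events. `HelloWatchdog`, `Scheduler`, and `HELLO_TIMEOUT_MS` are hypothetical names, and the timer is injected so the logic can be exercised without waiting. The 30s window is the one mentioned in the commit message.

```typescript
// Injected timer interface (hypothetical) so the watchdog is testable.
type TimerHandle = unknown;
interface Scheduler {
  set(fn: () => void, ms: number): TimerHandle;
  clear(handle: TimerHandle): void;
}

const HELLO_TIMEOUT_MS = 30_000; // 30s window from the commit message

class HelloWatchdog {
  private pending: { handle: TimerHandle } | null = null;
  private onStall: () => void;
  private scheduler: Scheduler;

  constructor(onStall: () => void, scheduler: Scheduler) {
    this.onStall = onStall;
    this.scheduler = scheduler;
  }

  // Socket opened: start waiting for HELLO.
  socketOpened(): void {
    this.disarm();
    const handle = this.scheduler.set(() => {
      this.pending = null;
      this.onStall(); // HELLO never arrived within the window
    }, HELLO_TIMEOUT_MS);
    this.pending = { handle };
  }

  // HELLO received: the connection is alive, so cancel the timer.
  helloReceived(): void {
    this.disarm();
  }

  private disarm(): void {
    if (this.pending !== null) {
      this.scheduler.clear(this.pending.handle);
      this.pending = null;
    }
  }
}
```

In real use the scheduler would wrap `setTimeout` and `clearTimeout`, and `onStall` would increment the stall counter before reconnecting.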
Greptile Overview
Greptile Summary
This change adds an application-level circuit breaker to Discord gateway zombie-connection handling in src/discord/monitor/provider.ts. It tracks consecutive stalls where the WebSocket opens but no HELLO is observed within 30s; after 5 stalls it clears the gateway session identifiers before reconnecting, forcing a fresh IDENTIFY instead of endlessly resuming a stale session.

The logic is implemented by listening to gateway debug messages, resetting the stall counter on HELLO-related markers, and incrementing/reconnecting when the HELLO timeout expires after a connection-open event.

Confidence Score: 3/5

…gateway.state fields via casts. In this environment the carbon implementation is not available to verify that these markers always appear and that state.sessionId/state.resumeGatewayUrl exist and are intended to be mutated, so there is residual integration risk.

Last reviewed commit: 3cc5974