fix(discord): circuit breaker for consecutive WS drop escalation#34497
fix(discord): circuit breaker for consecutive WS drop escalation#34497sophiearborbot wants to merge 1 commit intoopenclaw:mainfrom
Conversation
Root cause: two bugs in provider.lifecycle.ts allowed infinite resume loops. Bug 1: resetHelloStallCounter called on close when gateway.isConnected=true, preventing stall counter from reaching MAX_CONSECUTIVE_HELLO_STALLS. Bug 2: no independent tracking of consecutive drops, so silent drops looped until the 5-minute stall watchdog fired. Fix: circuit breaker tracks consecutive close events in a 10-minute window. After 5 drops, clears resume state and forces fresh IDENTIFY. Also removes the buggy resetHelloStallCounter call in the close handler. Fixes: openclaw#13688, openclaw#20668. Related: openclaw#18410, openclaw#21099.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 19acefa5fd
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| }), | ||
| ); | ||
|
|
||
| const { lifecycleParams, runtimeLog } = createLifecycleHarness({ gateway }); |
There was a problem hiding this comment.
Move circuit-breaker tests into helper scope
The new describe("circuit breaker — consecutive drop escalation") block is defined after the original runDiscordGatewayLifecycle suite closes, but it calls createLifecycleHarness, which is declared inside that earlier suite and is block-scoped. As written, running this test file will throw a ReferenceError when these new tests execute, so the added regression coverage never runs. Please either move this block inside the existing suite or hoist the helper to module scope.
Useful? React with 👍 / 👎.
Greptile SummarySummaryThis PR fixes two related bugs in the Discord gateway lifecycle that were causing an infinite resume loop (264 reconnects observed in 24 hours in production). Bug 1 — Stall counter reset on close: Removing the Bug 2 — Circuit breaker for consecutive drops: The new circuit breaker implementation correctly:
The tests validate that resume state is properly nullified and that the drop counter resets as expected. The implementation is straightforward: when the threshold is reached, the code clears all resume-related state and logs the escalation. Code quality: Minor: redundant Confidence Score: 4/5
Last reviewed commit: 19acefa |
| // Circuit breaker: track consecutive drops and force fresh IDENTIFY after threshold. | ||
| // NOTE: do NOT reset the hello stall counter here — a brief isConnected=true | ||
| // is unreliable and was masking the stall counter, allowing infinite resume loops. | ||
| const now = Date.now(); |
There was a problem hiding this comment.
Redundant Date.now() call — at is already captured at the top of onGatewayDebug (line 160) and holds the same value. Replace now with at to avoid the extra function call and variable binding.
| const now = Date.now(); | |
| if (at - lastDropAt > DROP_RESET_WINDOW_MS) { | |
| consecutiveDrops = 0; | |
| } | |
| consecutiveDrops += 1; | |
| lastDropAt = at; |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/discord/monitor/provider.lifecycle.ts
Line: 166
Comment:
Redundant `Date.now()` call — `at` is already captured at the top of `onGatewayDebug` (line 160) and holds the same value. Replace `now` with `at` to avoid the extra function call and variable binding.
```suggestion
if (at - lastDropAt > DROP_RESET_WINDOW_MS) {
consecutiveDrops = 0;
}
consecutiveDrops += 1;
lastDropAt = at;
```
How can I resolve this? If you propose a fix, please make it concise.
Problem
The Discord gateway enters an infinite resume loop when connections drop repeatedly. Observed in production: 264 reconnects in 24 hours, all silent drops (no close code logged).
Two bugs in
provider.lifecycle.ts:Bug 1 — Stall counter reset on close (the main culprit):
This is a race condition.
isConnectedcan still betrueat the moment a close event fires. Resetting the stall counter here meansMAX_CONSECUTIVE_HELLO_STALLSis never reached, so the code never escalates to a fresh IDENTIFY.Bug 2 — No independent drop tracking:
Silent drops (code 1006, or no code at all) do not go through the HELLO stall path. They simply reconnect and try to RESUME again indefinitely, until the 5-minute stall watchdog fires and restarts the whole provider.
Fix
resetHelloStallCounter()call in the close handler — let the stall counter accumulate correctly.MAX_CONSECUTIVE_DROPS(5), clear resume state (sessionId,resumeGatewayUrl,sequence) and force a fresh IDENTIFY. Reset the drop counter afterDROP_RESET_WINDOW_MS(10 min) of stability.Evidence
Production log pattern (macOS, OpenClaw 2026.2.26):
Tests
Two new tests in
provider.lifecycle.test.ts:Fixes #13688, #20668. Related: #18410, #21099.