fix(discord): circuit breaker for consecutive WS drop escalation by sophiearborbot · Pull Request #34497 · openclaw/openclaw

sophiearborbot · 2026-03-04T13:51:48Z

Problem

The Discord gateway enters an infinite resume loop when connections drop repeatedly. Observed in production: 264 reconnects in 24 hours, all silent drops (no close code logged).

Two bugs in provider.lifecycle.ts:

Bug 1 — Stall counter reset on close (the main culprit):

if (gateway?.isConnected) {
  resetHelloStallCounter(); // BUG: resets to 0 on every close where isConnected is briefly true
}

This is a race condition. isConnected can still be true at the moment a close event fires. Resetting the stall counter here means MAX_CONSECUTIVE_HELLO_STALLS is never reached, so the code never escalates to a fresh IDENTIFY.

Bug 2 — No independent drop tracking:
Silent drops (code 1006, or no code at all) do not go through the HELLO stall path. They simply reconnect and try to RESUME again indefinitely, until the 5-minute stall watchdog fires and restarts the whole provider.

Fix

Remove the resetHelloStallCounter() call in the close handler — let the stall counter accumulate correctly.
Add a circuit breaker: track consecutive close events in a 10-minute window. After MAX_CONSECUTIVE_DROPS (5), clear resume state (sessionId, resumeGatewayUrl, sequence) and force a fresh IDENTIFY. Reset the drop counter after DROP_RESET_WINDOW_MS (10 min) of stability.

Evidence

Production log pattern (macOS, OpenClaw 2026.2.26):

264 reconnects in 24h (one every ~5.4 min)
Only 8 coded drops (1006); 256 were silent
Watchdog catching them within 5-8 min but messages lost in gap
caffeinate running, not a sleep issue

Tests

Two new tests in provider.lifecycle.test.ts:

Circuit breaker trips after 5 consecutive drops and clears resume state
Drop counter resets after DROP_RESET_WINDOW_MS, preventing false positives

Fixes #13688, #20668. Related: #18410, #21099.

Root cause: two bugs in provider.lifecycle.ts allowed infinite resume loops. Bug 1: resetHelloStallCounter called on close when gateway.isConnected=true, preventing stall counter from reaching MAX_CONSECUTIVE_HELLO_STALLS. Bug 2: no independent tracking of consecutive drops, so silent drops looped until the 5-minute stall watchdog fired. Fix: circuit breaker tracks consecutive close events in a 10-minute window. After 5 drops, clears resume state and forces fresh IDENTIFY. Also removes the buggy resetHelloStallCounter call in the close handler. Fixes: openclaw#13688, openclaw#20668. Related: openclaw#18410, openclaw#21099.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 19acefa5fd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-04T13:54:51Z

src/discord/monitor/provider.lifecycle.test.ts

+          }),
+      );
+
+      const { lifecycleParams, runtimeLog } = createLifecycleHarness({ gateway });


Move circuit-breaker tests into helper scope

The new describe("circuit breaker — consecutive drop escalation") block is defined after the original runDiscordGatewayLifecycle suite closes, but it calls createLifecycleHarness, which is declared inside that earlier suite and is block-scoped. As written, running this test file will throw a ReferenceError when these new tests execute, so the added regression coverage never runs. Please either move this block inside the existing suite or hoist the helper to module scope.

Useful? React with 👍 / 👎.

greptile-apps · 2026-03-04T13:56:23Z

Greptile Summary

Summary

This PR fixes two related bugs in the Discord gateway lifecycle that were causing an infinite resume loop (264 reconnects observed in 24 hours in production).

Bug 1 — Stall counter reset on close: Removing the resetHelloStallCounter() call in the close handler is correct. The original code was resetting the counter whenever gateway?.isConnected was briefly true during close, masking the counter and preventing escalation to a fresh IDENTIFY.

Bug 2 — Circuit breaker for consecutive drops: The new circuit breaker implementation correctly:

Tracks consecutive WebSocket close events within a 10-minute window
Clears resume state (sessionId, resumeGatewayUrl, sequence) after 5 consecutive drops
Resets the drop counter after 10 minutes of stability

The tests validate that resume state is properly nullified and that the drop counter resets as expected. The implementation is straightforward: when the threshold is reached, the code clears all resume-related state and logs the escalation.

Code quality: Minor: redundant Date.now() call in onGatewayDebug (line 166) — now duplicates the earlier at capture and can be consolidated.

Confidence Score: 4/5

Safe to merge. Core bug fix is correct; circuit breaker implementation matches spec and tests validate state clearing.
The removal of the racey resetHelloStallCounter() call directly fixes the primary production issue (infinite resume loop). The circuit breaker implementation clears resume state after 5 consecutive drops as specified and is validated by tests. One minor code quality issue (redundant Date.now call) noted but does not affect correctness.
No files require special attention. Minor style suggestion in provider.lifecycle.ts is non-critical.

_{Last reviewed commit: 19acefa}

greptile-apps · 2026-03-04T13:56:27Z

src/discord/monitor/provider.lifecycle.ts

+      // Circuit breaker: track consecutive drops and force fresh IDENTIFY after threshold.
+      // NOTE: do NOT reset the hello stall counter here — a brief isConnected=true
+      // is unreliable and was masking the stall counter, allowing infinite resume loops.
+      const now = Date.now();


Redundant Date.now() call — at is already captured at the top of onGatewayDebug (line 160) and holds the same value. Replace now with at to avoid the extra function call and variable binding.

Suggested change

const now = Date.now();

if (at - lastDropAt > DROP_RESET_WINDOW_MS) {

consecutiveDrops = 0;

}

consecutiveDrops += 1;

lastDropAt = at;

Prompt To Fix With AI

This is a comment left during a code review. Path: src/discord/monitor/provider.lifecycle.ts Line: 166 Comment: Redundant `Date.now()` call — `at` is already captured at the top of `onGatewayDebug` (line 160) and holds the same value. Replace `now` with `at` to avoid the extra function call and variable binding. ```suggestion if (at - lastDropAt > DROP_RESET_WINDOW_MS) { consecutiveDrops = 0; } consecutiveDrops += 1; lastDropAt = at; ``` How can I resolve this? If you propose a fix, please make it concise.

openclaw-barnacle bot added channel: discord Channel integration: discord size: S labels Mar 4, 2026

chatgpt-codex-connector bot reviewed Mar 4, 2026

View reviewed changes

greptile-apps bot reviewed Mar 4, 2026

View reviewed changes

harrison-ming mentioned this pull request Mar 20, 2026

Discord: startup race condition causes permanent connected: false after health-monitor restart #50836

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(discord): circuit breaker for consecutive WS drop escalation#34497

fix(discord): circuit breaker for consecutive WS drop escalation#34497
sophiearborbot wants to merge 1 commit intoopenclaw:mainfrom
sophiearborbot:fix/discord-ws-circuit-breaker

sophiearborbot commented Mar 4, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 4, 2026

Uh oh!

greptile-apps bot commented Mar 4, 2026

Uh oh!

greptile-apps bot Mar 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

-      const now = Date.now();
+      if (at - lastDropAt > DROP_RESET_WINDOW_MS) {
+        consecutiveDrops = 0;
+      }
+      consecutiveDrops += 1;
+      lastDropAt = at;

Uh oh!

Conversation

sophiearborbot commented Mar 4, 2026

Problem

Fix

Evidence

Tests

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Mar 4, 2026

Greptile Summary

Summary

Confidence Score: 4/5

Uh oh!

greptile-apps bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant