Skip to content

fix(discord): circuit breaker for consecutive WS drop escalation#34497

Open
sophiearborbot wants to merge 1 commit intoopenclaw:mainfrom
sophiearborbot:fix/discord-ws-circuit-breaker
Open

fix(discord): circuit breaker for consecutive WS drop escalation#34497
sophiearborbot wants to merge 1 commit intoopenclaw:mainfrom
sophiearborbot:fix/discord-ws-circuit-breaker

Conversation

@sophiearborbot
Copy link
Copy Markdown

Problem

The Discord gateway enters an infinite resume loop when connections drop repeatedly. Observed in production: 264 reconnects in 24 hours, all silent drops (no close code logged).

Two bugs in provider.lifecycle.ts:

Bug 1 — Stall counter reset on close (the main culprit):

if (gateway?.isConnected) {
  resetHelloStallCounter(); // BUG: resets to 0 on every close where isConnected is briefly true
}

This is a race condition. isConnected can still be true at the moment a close event fires. Resetting the stall counter here means MAX_CONSECUTIVE_HELLO_STALLS is never reached, so the code never escalates to a fresh IDENTIFY.

Bug 2 — No independent drop tracking:
Silent drops (code 1006, or no code at all) do not go through the HELLO stall path. They simply reconnect and try to RESUME again indefinitely, until the 5-minute stall watchdog fires and restarts the whole provider.

Fix

  1. Remove the resetHelloStallCounter() call in the close handler — let the stall counter accumulate correctly.
  2. Add a circuit breaker: track consecutive close events in a 10-minute window. After MAX_CONSECUTIVE_DROPS (5), clear resume state (sessionId, resumeGatewayUrl, sequence) and force a fresh IDENTIFY. Reset the drop counter after DROP_RESET_WINDOW_MS (10 min) of stability.

Evidence

Production log pattern (macOS, OpenClaw 2026.2.26):

  • 264 reconnects in 24h (one every ~5.4 min)
  • Only 8 coded drops (1006); 256 were silent
  • Watchdog catching them within 5-8 min but messages lost in gap
  • caffeinate running, not a sleep issue

Tests

Two new tests in provider.lifecycle.test.ts:

  • Circuit breaker trips after 5 consecutive drops and clears resume state
  • Drop counter resets after DROP_RESET_WINDOW_MS, preventing false positives

Fixes #13688, #20668. Related: #18410, #21099.

Root cause: two bugs in provider.lifecycle.ts allowed infinite resume loops.

Bug 1: resetHelloStallCounter called on close when gateway.isConnected=true,
preventing stall counter from reaching MAX_CONSECUTIVE_HELLO_STALLS.

Bug 2: no independent tracking of consecutive drops, so silent drops
looped until the 5-minute stall watchdog fired.

Fix: circuit breaker tracks consecutive close events in a 10-minute window.
After 5 drops, clears resume state and forces fresh IDENTIFY. Also removes
the buggy resetHelloStallCounter call in the close handler.

Fixes: openclaw#13688, openclaw#20668. Related: openclaw#18410, openclaw#21099.
@openclaw-barnacle openclaw-barnacle bot added channel: discord Channel integration: discord size: S labels Mar 4, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 19acefa5fd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

}),
);

const { lifecycleParams, runtimeLog } = createLifecycleHarness({ gateway });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Move circuit-breaker tests into helper scope

The new describe("circuit breaker — consecutive drop escalation") block is defined after the original runDiscordGatewayLifecycle suite closes, but it calls createLifecycleHarness, which is declared inside that earlier suite and is block-scoped. As written, running this test file will throw a ReferenceError when these new tests execute, so the added regression coverage never runs. Please either move this block inside the existing suite or hoist the helper to module scope.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 4, 2026

Greptile Summary

Summary

This PR fixes two related bugs in the Discord gateway lifecycle that were causing an infinite resume loop (264 reconnects observed in 24 hours in production).

Bug 1 — Stall counter reset on close: Removing the resetHelloStallCounter() call in the close handler is correct. The original code was resetting the counter whenever gateway?.isConnected was briefly true during close, masking the counter and preventing escalation to a fresh IDENTIFY.

Bug 2 — Circuit breaker for consecutive drops: The new circuit breaker implementation correctly:

  • Tracks consecutive WebSocket close events within a 10-minute window
  • Clears resume state (sessionId, resumeGatewayUrl, sequence) after 5 consecutive drops
  • Resets the drop counter after 10 minutes of stability

The tests validate that resume state is properly nullified and that the drop counter resets as expected. The implementation is straightforward: when the threshold is reached, the code clears all resume-related state and logs the escalation.

Code quality: Minor: redundant Date.now() call in onGatewayDebug (line 166) — now duplicates the earlier at capture and can be consolidated.

Confidence Score: 4/5

  • Safe to merge. Core bug fix is correct; circuit breaker implementation matches spec and tests validate state clearing.
  • The removal of the racey resetHelloStallCounter() call directly fixes the primary production issue (infinite resume loop). The circuit breaker implementation clears resume state after 5 consecutive drops as specified and is validated by tests. One minor code quality issue (redundant Date.now call) noted but does not affect correctness.
  • No files require special attention. Minor style suggestion in provider.lifecycle.ts is non-critical.

Last reviewed commit: 19acefa

// Circuit breaker: track consecutive drops and force fresh IDENTIFY after threshold.
// NOTE: do NOT reset the hello stall counter here — a brief isConnected=true
// is unreliable and was masking the stall counter, allowing infinite resume loops.
const now = Date.now();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant Date.now() call — at is already captured at the top of onGatewayDebug (line 160) and holds the same value. Replace now with at to avoid the extra function call and variable binding.

Suggested change
const now = Date.now();
if (at - lastDropAt > DROP_RESET_WINDOW_MS) {
consecutiveDrops = 0;
}
consecutiveDrops += 1;
lastDropAt = at;
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/discord/monitor/provider.lifecycle.ts
Line: 166

Comment:
Redundant `Date.now()` call — `at` is already captured at the top of `onGatewayDebug` (line 160) and holds the same value. Replace `now` with `at` to avoid the extra function call and variable binding.

```suggestion
      if (at - lastDropAt > DROP_RESET_WINDOW_MS) {
        consecutiveDrops = 0;
      }
      consecutiveDrops += 1;
      lastDropAt = at;
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: discord Channel integration: discord size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Discord: WebSocket 1005/1006 disconnects with failing resume logic and unbounded backoff

1 participant