fix(discord): add circuit breaker for WebSocket resume loop #15762

Closed
funmerlin wants to merge 2 commits into openclaw:main from funmerlin:fix/discord-ws-reconnect-circuit-breaker

Conversation

@funmerlin commented Feb 13, 2026

Problem

The Discord WebSocket connection can enter an unrecoverable resume loop where it endlessly retries with a stale session token. Observed in production: 1,400+ reconnect attempts over 12+ hours before manual intervention.

Root cause

When the WS opens but never receives a HELLO (Discord gateway stall), OpenClaw's zombie timeout handler calls gateway.disconnect() + gateway.connect(false) to force a reconnect. However, the underlying library (@buape/carbon) resets reconnectAttempts = 0 on every WebSocket open event, so the library's own circuit breaker (maxAttempts) is never reached.

The zombie timeout effectively creates an infinite loop:

  1. Connect → WS opens → counter resets to 0
  2. No HELLO arrives within 30s → zombie timeout fires
  3. Disconnect + reconnect (resume) with the stale session token → resume fails silently
  4. Go to 1
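The effect of the counter reset can be reproduced with a small simulation. This is a simplified model of the loop described above, not the actual @buape/carbon code: `simulateStallLoop` and its parameters are hypothetical names, and the `maxAttempts` value of 50 is the figure quoted later in this thread.

```typescript
// Simplified model of the stall loop; not taken from @buape/carbon.
interface LoopResult {
  cycles: number;
  breakerTripped: boolean;
}

function simulateStallLoop(
  maxAttempts: number,
  cyclesToRun: number,
  resetOnOpen: boolean,
): LoopResult {
  let reconnectAttempts = 0;
  for (let cycle = 1; cycle <= cyclesToRun; cycle++) {
    // Zombie timeout fired: disconnect and schedule a reconnect.
    reconnectAttempts++;
    if (reconnectAttempts >= maxAttempts) {
      return { cycles: cycle, breakerTripped: true };
    }
    // The WebSocket opens (no HELLO follows) and the counter is reset.
    if (resetOnOpen) {
      reconnectAttempts = 0;
    }
  }
  return { cycles: cyclesToRun, breakerTripped: false };
}

const withReset = simulateStallLoop(50, 1400, true);
const withoutReset = simulateStallLoop(50, 1400, false);
console.log(withReset);    // breaker never trips in 1400 cycles
console.log(withoutReset); // breaker trips at cycle 50
```

With the reset in place the counter never exceeds 1, so no value of `maxAttempts` greater than 1 can stop the loop.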

Timeline from production logs (Feb 13, 2026)

| Time | Event | Duration / attempts |
| --- | --- | --- |
| 07:30 | Gateway boot, Discord login | 25 min stable |
| 07:55 | First stall → resume loop begins | 341 attempts |
| 10:26 | Briefly self-recovers | 6 min stable |
| 10:32 | Loop resumes | 222 attempts |
| 13:54 | Self-recovers | 3 min stable |
| 13:57 | Loop resumes | 73 attempts |
| 15:45 | Self-recovers | 2 min stable |
| 15:48 | Loop resumes | 79 attempts |
| 17:09 | Full gateway restart | 2.5+ hours stable |

Total: 717 resume attempts, 36 connection stalls, 708 WebSocket closes with code 1005.

Fix

Add an application-level circuit breaker to the zombie timeout handler:

  • Track consecutive stalls (WS opens but no HELLO within 30s)
  • After 5 consecutive stalls, invalidate the session state (sessionId, resumeGatewayUrl) and force a fresh IDENTIFY instead of trying to resume with a dead session token
  • Log stall count on each attempt for observability
  • Reset counter on successful HELLO receipt

This breaks the loop because a fresh IDENTIFY creates a new session rather than trying to resume a stale one.
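The breaker described above could look roughly like the sketch below. It is not the code from this PR: `SessionState` and `StallCircuitBreaker` are hypothetical names, the state fields only mirror the names mentioned in the description, and resetting the counter after a trip is an assumption made for the sketch.

```typescript
// Hypothetical session state shape; the real @buape/carbon types may differ.
interface SessionState {
  sessionId: string | null;
  resumeGatewayUrl: string | null;
}

const MAX_STALL_RETRIES = 5;

class StallCircuitBreaker {
  private consecutiveStalls = 0;
  private readonly state: SessionState;
  private readonly log: (message: string) => void;

  constructor(state: SessionState, log: (message: string) => void = console.log) {
    this.state = state;
    this.log = log;
  }

  /** Call when HELLO arrives: the connection is healthy again. */
  onHello(): void {
    this.consecutiveStalls = 0;
  }

  /**
   * Call when the HELLO timeout expires. Returns true if the next
   * reconnect may still resume, false if it must send a fresh IDENTIFY.
   */
  onStall(): boolean {
    this.consecutiveStalls++;
    this.log(`stall ${this.consecutiveStalls}/${MAX_STALL_RETRIES}`);
    if (this.consecutiveStalls < MAX_STALL_RETRIES) {
      return true;
    }
    // Breaker tripped: drop the stale session so a resume is impossible.
    this.state.sessionId = null;
    this.state.resumeGatewayUrl = null;
    this.consecutiveStalls = 0; // assumption: start counting afresh after a trip
    return false;
  }
}

const sessionState: SessionState = {
  sessionId: "stale-session",
  resumeGatewayUrl: "wss://example.invalid",
};
const breaker = new StallCircuitBreaker(sessionState, () => {});
const decisions: boolean[] = [];
for (let i = 0; i < MAX_STALL_RETRIES; i++) {
  decisions.push(breaker.onStall());
}
console.log(decisions, sessionState);
```

The first four stalls still allow a resume. The fifth clears the session fields and returns false, so the caller can force a fresh IDENTIFY.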

Changes

  • src/discord/monitor/provider.ts: Added MAX_STALL_RETRIES (5) and consecutiveStalls counter to the zombie timeout handler. On circuit breaker trip, nullifies gateway.state.sessionId and gateway.state.resumeGatewayUrl before reconnecting.

Fixes #13180

Greptile Overview

Greptile Summary

This change adds an application-level circuit breaker to Discord gateway zombie-connection handling in src/discord/monitor/provider.ts. It tracks consecutive stalls where the WebSocket opens but no HELLO is observed within 30s; after 5 stalls it clears the gateway session identifiers before reconnecting, forcing a fresh IDENTIFY instead of endlessly resuming a stale session.

The logic is implemented by listening to gateway debug messages, resetting the stall counter on HELLO-related markers, and incrementing/reconnecting when the HELLO timeout expires after a connection-open event.
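A rough sketch of the detection side, driven by debug messages, is shown here. The marker strings are placeholders, since the actual debug message text emitted by @buape/carbon is not shown in this PR. `HelloWatchdog` is a hypothetical name, and explicit timestamps stand in for the real timer so the logic can be exercised deterministically.

```typescript
// Placeholder markers: the real debug strings are an assumption.
const OPEN_MARKER = "WebSocket connection opened";
const HELLO_MARKER = "HELLO";
const HELLO_TIMEOUT_MS = 30000;

class HelloWatchdog {
  private openedAt: number | null = null;
  private stalls = 0;

  /** Feed every gateway debug message here, with the time it was seen. */
  onDebug(message: string, nowMs: number): void {
    if (message.indexOf(HELLO_MARKER) !== -1) {
      this.openedAt = null; // HELLO seen: stop waiting
      this.stalls = 0;
    } else if (message.indexOf(OPEN_MARKER) !== -1) {
      this.openedAt = nowMs; // start waiting for HELLO
    }
  }

  /** Poll periodically; returns true when a stall has just been detected. */
  check(nowMs: number): boolean {
    if (this.openedAt === null || nowMs - this.openedAt < HELLO_TIMEOUT_MS) {
      return false;
    }
    this.openedAt = null; // the caller is expected to reconnect
    this.stalls++;
    return true;
  }

  stallCount(): number {
    return this.stalls;
  }
}

const watchdog = new HelloWatchdog();
watchdog.onDebug(OPEN_MARKER, 0);
const early = watchdog.check(29999); // still inside the 30s window
const late = watchdog.check(30000);  // window elapsed without HELLO
const stallsAfterTimeout = watchdog.stallCount();
watchdog.onDebug(OPEN_MARKER, 31000);
watchdog.onDebug(HELLO_MARKER, 31500);
const afterHello = watchdog.check(70000);
console.log(early, late, stallsAfterTimeout, afterHello, watchdog.stallCount());
```

Matching on substrings is what makes this approach fragile: if the library rewords its debug output, the watchdog silently stops detecting HELLO.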

Confidence Score: 3/5

  • This PR is likely safe, but depends on @buape/carbon gateway internals and debug message formats.
  • The change is small and localized, but it relies on parsing gateway debug strings for HELLO detection and directly mutating gateway.state fields via casts. In this environment the carbon implementation is not available to verify that these markers always appear and that state.sessionId/state.resumeGatewayUrl exist and are intended to be mutated, so there is residual integration risk.
  • src/discord/monitor/provider.ts

Last reviewed commit: 3cc5974


…#13180)

When the Discord WebSocket connection enters a stall loop (connects but
never receives HELLO), the existing zombie timeout handler would
disconnect and reconnect indefinitely. The library's reconnectAttempts
counter resets on every WebSocket open event, so the maxAttempts circuit
breaker is never reached.

This adds an application-level circuit breaker:

- Track consecutive stalls (WS opens but no HELLO within 30s)
- After 5 consecutive stalls, invalidate the session state (sessionId,
  resumeGatewayUrl) and force a fresh IDENTIFY instead of resume
- Log stall count on each attempt for observability
- Reset counter on successful HELLO receipt

This breaks the infinite resume loop observed in production where 1400+
reconnect attempts occurred over 12+ hours with a stale session token.

Fixes: openclaw#13180
@openclaw-barnacle

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

openclaw-barnacle bot added the stale label on Feb 21, 2026
@MarcBickel

Confirming this bug — root cause is in @buape/carbon

Hit exactly this issue on v2026.2.19-2 with @buape/carbon. The Discord gateway enters an infinite resume loop with close code 1005 and never self-heals.

Root cause trace

The bug is in Carbon's GatewayPlugin.js, not OpenClaw itself. Two things compound:

  1. Counter reset on WS open (line 129): this.reconnectAttempts = 0 fires on every WebSocket open event — even when the connection immediately fails afterward. This means maxAttempts: 50 is never reached.

  2. Session state never cleared on code 1005: sessionId is only cleared when Discord explicitly sends InvalidSession (opcode 9) or close codes 4007/4009. Code 1005 doesn't trigger any of those paths, so canResume() always returns true and IDENTIFY is never attempted.
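The second point can be modelled in a few lines. This is a simplified illustration of the behaviour described above, not Carbon's source: `clearsSession` and `canResumeAfter` are hypothetical helpers, and the real `canResume()` may check more than the session ID.

```typescript
// Simplified model; not taken from GatewayPlugin.js.
const INVALID_SESSION_OPCODE = 9;
const SESSION_CLEARING_CLOSE_CODES = [4007, 4009];

type GatewayEvent =
  | { kind: "opcode"; opcode: number }
  | { kind: "close"; code: number };

/** True if this event is one of the paths that clears the session. */
function clearsSession(event: GatewayEvent): boolean {
  if (event.kind === "opcode") {
    return event.opcode === INVALID_SESSION_OPCODE;
  }
  return SESSION_CLEARING_CLOSE_CODES.indexOf(event.code) !== -1;
}

/** Whether a resume would still be attempted after the given event. */
function canResumeAfter(sessionId: string | null, event: GatewayEvent): boolean {
  const remaining = clearsSession(event) ? null : sessionId;
  return remaining !== null;
}

const after1005 = canResumeAfter("abc", { kind: "close", code: 1005 });
const after4009 = canResumeAfter("abc", { kind: "close", code: 4009 });
const afterInvalidSession = canResumeAfter("abc", { kind: "opcode", opcode: 9 });
console.log(after1005, after4009, afterInvalidSession); // true false false
```

Close code 1005 falls through every clearing path, so the stale session ID survives and the next attempt is another RESUME.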

The loop:

connect(resume=true) → WS opens → reconnectAttempts = 0
→ HELLO → canResume() = true → sends RESUME (op 6)
→ Discord closes with 1005 (stale session, no InvalidSession sent)
→ handleClose → handleReconnectionAttempt → reconnectAttempts is 0
→ shouldResume = canResume() = true (sessionId never cleared)
→ connect(resume=true) → repeat forever

Backoff stays flat at 1000ms because Math.min(1000 * 2^0, 30000) = 1000 every time (counter was reset to 0).
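The flat backoff follows directly from the expression quoted above. In this sketch `backoffMs` is a hypothetical helper that wraps that expression; it is not Carbon's function name.

```typescript
// Exponential backoff capped at 30s, using the expression quoted above.
function backoffMs(reconnectAttempts: number): number {
  return Math.min(1000 * 2 ** reconnectAttempts, 30000);
}

// Counter allowed to grow: the delay doubles until it hits the cap.
const healthy = [0, 1, 2, 3, 4, 5].map((n) => backoffMs(n));
// Counter reset to 0 on every WS open: the delay never grows.
const stuck = [0, 0, 0, 0, 0, 0].map((n) => backoffMs(n));

console.log(healthy); // [1000, 2000, 4000, 8000, 16000, 30000]
console.log(stuck);   // [1000, 1000, 1000, 1000, 1000, 1000]
```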

Local fix that works

Applied a ~10 line patch to GatewayPlugin.js — adds a consecutiveResumeFailures counter that clears stale session state after 3 failures, forcing fresh IDENTIFY:

// New instance variable
consecutiveResumeFailures = 0;

// Reset on successful connection (READY/RESUMED dispatch)
if (t1 === "READY" || t1 === "RESUMED") {
    this.isConnected = true;
    this.consecutiveResumeFailures = 0;
}

// In handleReconnectionAttempt, before using shouldResume:
let shouldResume = !options.forceNoResume && this.canResume();
if (shouldResume) {
    this.consecutiveResumeFailures++;
    if (this.consecutiveResumeFailures >= 3) {
        this.state.sessionId = null;
        this.state.resumeGatewayUrl = null;
        this.state.sequence = null;
        this.sequence = null;
        this.pings = [];
        this.consecutiveResumeFailures = 0;
        shouldResume = false;
    }
}

This breaks the loop in ~3 seconds instead of requiring a manual gateway restart. The proper long-term fix should be upstream in @buape/carbon (see buape/carbon#353).

openclaw-barnacle bot removed the stale label on Feb 22, 2026
@steipete
Contributor

Closing as AI-assisted stale-fix triage.

Linked issue #13180 ("Discord WebSocket: resume loop needs circuit breaker") is currently closed and was closed on 2026-02-14T02:07:23Z with state reason completed.
Given that issue is closed, this fix PR is no longer needed in the active queue and is being closed as stale.

If this specific implementation is still needed on current main, please reopen #15762 (or open a new focused fix PR) and reference #13180 for fast re-triage.

steipete closed this on Feb 24, 2026
@Stache73

This was closed citing that linked issue #13180 is resolved, but #13180 was closed as a duplicate of #13688, which remains open and is actively reproduced multiple times per week. The root issue is unsolved.

MarcBickel's Feb 21 comment contains a full root cause trace and a working 10-line patch tracing the bug to @buape/carbon's reconnect counter reset — that work shouldn't be lost. Requesting this PR be reopened or the patch carried forward into a new PR.

@RonaldTreur commented Feb 25, 2026

This was closed citing that linked issue #13180 is resolved, but #13180 was closed as a duplicate of #13688, which remains open and is actively reproduced multiple times per week. The root issue is unsolved.

MarcBickel's Feb 21 comment contains a full root cause trace and a working 10-line patch tracing the bug to @buape/carbon's reconnect counter reset — that work shouldn't be lost. Requesting this PR be reopened or the patch carried forward into a new PR.

This can stay closed, or perhaps should close once the fix for buape/carbon#353 has made its way into our codebase.

If @MarcBickel is right, then there is no reason to actually merge this workaround (combating symptoms).

You are right, though, that the reason given for closing was faulty 😅

