fix(discord): add circuit breaker for WebSocket resume loop by funmerlin · Pull Request #15762 · openclaw/openclaw

funmerlin · 2026-02-13T21:11:54Z

Problem

The Discord WebSocket connection can enter an unrecoverable resume loop where it endlessly retries with a stale session token. Observed in production: 1,400+ reconnect attempts over 12+ hours before manual intervention.

Root cause

When the WS opens but never receives a HELLO (Discord gateway stall), OpenClaw's zombie timeout handler calls gateway.disconnect() + gateway.connect(false) to force a reconnect. However, the underlying library (@buape/carbon) resets reconnectAttempts = 0 on every WebSocket open event, so the library's own circuit breaker (maxAttempts) is never reached.

The zombie timeout effectively creates an infinite loop:

1. Connect → WS opens → counter resets to 0
2. No HELLO arrives within 30s → zombie timeout fires
3. Disconnect + reconnect (resume) → go to step 1
4. Session token is stale → resume always fails silently
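
To make the counter-reset problem concrete, here is a minimal model of a reconnect counter guarded by a maximum attempt count. The `ReconnectCounter` class and `countReconnects` helper are hypothetical names invented for this illustration; they are not @buape/carbon's code and only mimic the behavior described above.

```typescript
// Hypothetical model for illustration only; not the library's implementation.
class ReconnectCounter {
  attempts = 0;
  maxAttempts: number;
  resetOnOpen: boolean;

  constructor(maxAttempts: number, resetOnOpen: boolean) {
    this.maxAttempts = maxAttempts;
    this.resetOnOpen = resetOnOpen;
  }

  // Called before each reconnect; returns false once the breaker trips.
  beginAttempt(): boolean {
    if (this.attempts >= this.maxAttempts) return false;
    this.attempts += 1;
    return true;
  }

  // Called when the WebSocket opens.
  onOpen(): void {
    if (this.resetOnOpen) this.attempts = 0;
  }
}

// Simulates stall cycles (socket opens, HELLO never arrives, reconnect) and
// returns how many reconnects happen before the breaker trips or `cap` is hit.
function countReconnects(counter: ReconnectCounter, cap: number): number {
  let reconnects = 0;
  while (reconnects < cap) {
    if (!counter.beginAttempt()) break;
    reconnects += 1;
    counter.onOpen(); // the socket opens, but the connection then stalls
  }
  return reconnects;
}
```

With `resetOnOpen` set to true the loop only stops at the simulation cap, whereas without the reset it stops at `maxAttempts`.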

Timeline from production logs (Feb 13, 2026)

Time	Event	Duration / attempts
07:30	Gateway boot, Discord login	25 min stable
07:55	First stall → resume loop begins	341 attempts
10:26	Briefly self-recovers	6 min stable
10:32	Loop resumes	222 attempts
13:54	Self-recovers	3 min stable
13:57	Loop resumes	73 attempts
15:45	Self-recovers	2 min stable
15:48	Loop resumes	79 attempts
17:09	Full gateway restart	2.5+ hours stable

Total: 717 resume attempts, 36 connection stalls, 708 WS closes with code 1005.

Fix

Add an application-level circuit breaker to the zombie timeout handler:

Track consecutive stalls (WS opens but no HELLO within 30s)
After 5 consecutive stalls, invalidate the session state (sessionId, resumeGatewayUrl) and force a fresh IDENTIFY instead of trying to resume with a dead session token
Log stall count on each attempt for observability
Reset counter on successful HELLO receipt

This breaks the loop because a fresh IDENTIFY creates a new session rather than trying to resume a stale one.
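
The steps above can be sketched roughly as follows. This is not the code from src/discord/monitor/provider.ts: the `SessionState` interface and `StallCircuitBreaker` class are hypothetical stand-ins, and resetting the counter after the breaker trips is an assumption the description does not confirm.

```typescript
// Hypothetical shape; the real state object belongs to @buape/carbon's gateway.
interface SessionState {
  sessionId: string | null;
  resumeGatewayUrl: string | null;
}

const MAX_STALL_RETRIES = 5;

class StallCircuitBreaker {
  private consecutiveStalls = 0;

  // Call when HELLO is received: the connection is healthy again.
  onHello(): void {
    this.consecutiveStalls = 0;
  }

  // Call when the HELLO timeout fires. Returns true if the next reconnect may
  // try to resume, false if the session was invalidated and a fresh IDENTIFY
  // is required.
  onStall(state: SessionState): boolean {
    this.consecutiveStalls += 1;
    if (this.consecutiveStalls < MAX_STALL_RETRIES) return true;

    // Breaker tripped: drop the (presumably stale) session identifiers.
    state.sessionId = null;
    state.resumeGatewayUrl = null;
    this.consecutiveStalls = 0; // assumption: start counting afresh after a trip
    return false;
  }
}
```

A caller would pass the return value of `onStall` into whatever reconnect call it uses, resuming while it is true and identifying from scratch once it is false.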

Changes

src/discord/monitor/provider.ts: Added MAX_STALL_RETRIES (5) and consecutiveStalls counter to the zombie timeout handler. On circuit breaker trip, nullifies gateway.state.sessionId and gateway.state.resumeGatewayUrl before reconnecting.

Fixes #13180

Greptile Overview

Greptile Summary

This change adds an application-level circuit breaker to Discord gateway zombie-connection handling in src/discord/monitor/provider.ts. It tracks consecutive stalls where the WebSocket opens but no HELLO is observed within 30s; after 5 stalls it clears the gateway session identifiers before reconnecting, forcing a fresh IDENTIFY instead of endlessly resuming a stale session.

The logic is implemented by listening to gateway debug messages, resetting the stall counter on HELLO-related markers, and incrementing/reconnecting when the HELLO timeout expires after a connection-open event.

Confidence Score: 3/5

This PR is likely safe, but depends on @buape/carbon gateway internals and debug message formats.
The change is small and localized, but it relies on parsing gateway debug strings for HELLO detection and directly mutating gateway.state fields via casts. In this environment the carbon implementation is not available to verify that these markers always appear and that state.sessionId/state.resumeGatewayUrl exist and are intended to be mutated, so there is residual integration risk.
src/discord/monitor/provider.ts

Last reviewed commit: 3cc5974


…#13180)

When the Discord WebSocket connection enters a stall loop (connects but never receives HELLO), the existing zombie timeout handler would disconnect and reconnect indefinitely. The library's reconnectAttempts counter resets on every WebSocket open event, so the maxAttempts circuit breaker is never reached.

This adds an application-level circuit breaker:

- Track consecutive stalls (WS opens but no HELLO within 30s)
- After 5 consecutive stalls, invalidate the session state (sessionId, resumeGatewayUrl) and force a fresh IDENTIFY instead of resume
- Log stall count on each attempt for observability
- Reset counter on successful HELLO receipt

This breaks the infinite resume loop observed in production where 1400+ reconnect attempts occurred over 12+ hours with a stale session token.

Fixes: openclaw#13180

openclaw-barnacle · 2026-02-21T04:00:07Z

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

MarcBickel · 2026-02-21T10:56:29Z

Confirming this bug — root cause is in `@buape/carbon`

Hit exactly this issue on v2026.2.19-2 with @buape/carbon. The Discord gateway enters an infinite resume loop with code 1005 and never self-heals.

Root cause trace

The bug is in Carbon's GatewayPlugin.js, not OpenClaw itself. Two things compound:

Counter reset on WS open (line 129): this.reconnectAttempts = 0 fires on every WebSocket open event — even when the connection immediately fails afterward. This means maxAttempts: 50 is never reached.
Session state never cleared on code 1005: sessionId is only cleared when Discord explicitly sends InvalidSession (opcode 9) or close codes 4007/4009. Code 1005 doesn't trigger any of those paths, so canResume() always returns true and IDENTIFY is never attempted.

The loop:

connect(resume=true) → WS opens → reconnectAttempts = 0
→ HELLO → canResume() = true → sends RESUME (op 6)
→ Discord closes with 1005 (stale session, no InvalidSession sent)
→ handleClose → handleReconnectionAttempt → reconnectAttempts is 0
→ shouldResume = canResume() = true (sessionId never cleared)
→ connect(resume=true) → repeat forever

Backoff stays flat at 1000ms because Math.min(1000 * 2^0, 30000) = 1000 every time (counter was reset to 0).
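
The flat backoff follows directly from the formula. The sketch below restates it as a standalone function; `backoffMs` is a hypothetical name, and the `2^0` in the comment above is read as exponentiation (in JavaScript the `^` operator is bitwise XOR, so the code uses `Math.pow`).

```typescript
// Exponential backoff capped at 30 seconds, as described in the trace above.
// `backoffMs` is an illustrative helper, not a function from @buape/carbon.
function backoffMs(reconnectAttempts: number): number {
  return Math.min(1000 * Math.pow(2, reconnectAttempts), 30000);
}

// Delays the schedule would produce if the counter were allowed to grow.
const growingDelays: number[] = [0, 1, 2, 3, 4, 5].map(backoffMs);

// Delays actually produced when the counter is reset to 0 before every attempt.
const flatDelays: number[] = [0, 0, 0, 0, 0, 0].map(backoffMs);
```

Here `growingDelays` climbs from 1000 ms to the 30000 ms cap, while `flatDelays` stays at 1000 ms for every attempt.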

Local fix that works

Applied a ~10 line patch to GatewayPlugin.js — adds a consecutiveResumeFailures counter that clears stale session state after 3 failures, forcing fresh IDENTIFY:

// New instance variable
consecutiveResumeFailures = 0;

// Reset on successful connection (READY/RESUMED dispatch)
if (t1 === "READY" || t1 === "RESUMED") {
    this.isConnected = true;
    this.consecutiveResumeFailures = 0;
}

// In handleReconnectionAttempt, before using shouldResume:
let shouldResume = !options.forceNoResume && this.canResume();
if (shouldResume) {
    this.consecutiveResumeFailures++;
    if (this.consecutiveResumeFailures >= 3) {
        this.state.sessionId = null;
        this.state.resumeGatewayUrl = null;
        this.state.sequence = null;
        this.sequence = null;
        this.pings = [];
        this.consecutiveResumeFailures = 0;
        shouldResume = false;
    }
}

This breaks the loop in ~3 seconds instead of requiring a manual gateway restart. The proper long-term fix should be upstream in @buape/carbon (see buape/carbon#353).
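
As a rough check of the "~3 seconds" figure, the patched decision logic can be traced in isolation. The `GatewayState` interface and `traceReconnects` function are hypothetical names for this illustration; the sketch assumes every resume fails and that a fresh IDENTIFY succeeds, which the real gateway does not guarantee.

```typescript
// Hypothetical state shape mirroring the fields the patch clears.
interface GatewayState {
  sessionId: string | null;
  resumeGatewayUrl: string | null;
  sequence: number | null;
}

const MAX_RESUME_FAILURES = 3;

// Returns the handshake chosen on each reconnect attempt, stopping at the
// first IDENTIFY (assumed to succeed) or after `maxAttempts` attempts.
function traceReconnects(state: GatewayState, maxAttempts: number): string[] {
  const trace: string[] = [];
  let consecutiveResumeFailures = 0;

  for (let i = 0; i < maxAttempts; i++) {
    // Stand-in for canResume(): a session id is still on record.
    let shouldResume = state.sessionId !== null;

    if (shouldResume) {
      consecutiveResumeFailures += 1;
      if (consecutiveResumeFailures >= MAX_RESUME_FAILURES) {
        // Too many failed resumes in a row: discard the stale session.
        state.sessionId = null;
        state.resumeGatewayUrl = null;
        state.sequence = null;
        consecutiveResumeFailures = 0;
        shouldResume = false;
      }
    }

    trace.push(shouldResume ? "RESUME" : "IDENTIFY");
    if (!shouldResume) break;
  }
  return trace;
}
```

Starting from a stale session, the trace is two RESUME attempts followed by an IDENTIFY on the third, which at the flat 1000 ms backoff is consistent with the roughly three seconds reported.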


steipete · 2026-02-24T06:13:59Z

Closing as AI-assisted stale-fix triage.

Linked issue #13180 ("Discord WebSocket: resume loop needs circuit breaker") is currently closed and was closed on 2026-02-14T02:07:23Z with state reason completed.
Given that issue is closed, this fix PR is no longer needed in the active queue and is being closed as stale.

If this specific implementation is still needed on current main, please reopen #15762 (or open a new focused fix PR) and reference #13180 for fast re-triage.

Stache73 · 2026-02-24T10:00:44Z

This was closed citing that linked issue #13180 is resolved, but #13180 was closed as a duplicate of #13688, which remains open and is actively reproduced multiple times per week. The root issue is unsolved.

MarcBickel's Feb 21 comment contains a full root cause trace and a working 10-line patch tracing the bug to @buape/carbon's reconnect counter reset — that work shouldn't be lost. Requesting this PR be reopened or the patch carried forward into a new PR.

RonaldTreur · 2026-02-25T13:51:08Z

> This was closed citing that linked issue #13180 is resolved, but #13180 was closed as a duplicate of #13688, which remains open and is actively reproduced multiple times per week. The root issue is unsolved.
>
> MarcBickel's Feb 21 comment contains a full root cause trace and a working 10-line patch tracing the bug to @buape/carbon's reconnect counter reset — that work shouldn't be lost. Requesting this PR be reopened or the patch carried forward into a new PR.

This can stay closed, or perhaps should close once the fix for buape/carbon#353 has made its way into our codebase.

If @MarcBickel is right, then there is no reason to actually merge this workaround (combating symptoms).

You are right that the stated reason for closing was faulty, though 😅

openclaw-barnacle bot added channel: discord Channel integration: discord size: S labels Feb 13, 2026

funmerlin mentioned this pull request Feb 13, 2026

Discord WebSocket: resume loop needs circuit breaker #13180

Closed

thewilloftheshadow force-pushed the main branch from bfc1ccb to f92900f Compare February 15, 2026 18:46

Stache73 mentioned this pull request Feb 20, 2026

Discord: WebSocket 1005/1006 disconnects with failing resume logic and unbounded backoff #13688

Open

openclaw-barnacle bot added the stale Marked as stale due to inactivity label Feb 21, 2026

Merge remote-tracking branch 'origin/main' into fix/discord-ws-reconnect-circuit-breaker (7f459d0)

openclaw-barnacle bot removed the stale Marked as stale due to inactivity label Feb 22, 2026

steipete closed this Feb 24, 2026

A386official mentioned this pull request Feb 28, 2026

fix(discord): add circuit breaker for WebSocket resume loop #29484

Closed

4 tasks

jeanmonet mentioned this pull request Mar 7, 2026

Discord: health monitor restart loop — post-connection zombie sessions evade circuit breakers #38596

Open

funmerlin wants to merge 2 commits into openclaw:main from funmerlin:fix/discord-ws-reconnect-circuit-breaker
