fix(discord): break resume death spiral when session goes stale#25974

Closed
mr-sk wants to merge 5 commits into openclaw:main from mr-sk:fix/discord-resume-death-spiral

@mr-sk
@mr-sk mr-sk commented Feb 25, 2026

Summary

  • When Discord sessions expire, the gateway gets stuck in an infinite resume loop (code 1005) because stale session state is never cleared
  • Adds ResilientGatewayPlugin subclass with resetSession() to invalidate stale sessions when the HELLO timeout fires (30s with no HELLO)
  • Reduces maxAttempts from 50 to 10 — 50 was ~25 minutes of pointless retrying
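
The loop described above can be illustrated with a small, self-contained model. This is only a sketch: `GatewayModel`, `ResilientGatewayModel`, `establish`, `nextHandshake`, and `simulate` are hypothetical names invented for illustration, and the real gateway plugin's internals are not reproduced here. Only `canResume()`, `resetSession()`, and the three state fields come from the PR text.

```typescript
// Hypothetical model of the resume decision. Not the real plugin code.
interface SessionState {
  sessionId: string | null;
  resumeGatewayUrl: string | null;
  sequence: number | null;
}

const EMPTY_STATE: SessionState = { sessionId: null, resumeGatewayUrl: null, sequence: null };

class GatewayModel {
  protected state: SessionState = { ...EMPTY_STATE };

  // Record what a successful login would have stored.
  establish(sessionId: string, resumeGatewayUrl: string, sequence: number): void {
    this.state = { sessionId, resumeGatewayUrl, sequence };
  }

  // Stale state still looks resumable, which is the root of the loop.
  canResume(): boolean {
    return this.state.sessionId !== null && this.state.sequence !== null;
  }

  nextHandshake(): "RESUME" | "IDENTIFY" {
    return this.canResume() ? "RESUME" : "IDENTIFY";
  }
}

class ResilientGatewayModel extends GatewayModel {
  // Drop all resume state so the next connect starts a fresh session.
  resetSession(): void {
    this.state = { ...EMPTY_STATE };
  }
}

// Simulate reconnects against a server that rejects every RESUME
// (the session is dead) but accepts a fresh IDENTIFY.
function simulate(gateway: GatewayModel, attempts: number, resetOnStall: boolean): string[] {
  const log: string[] = [];
  for (let i = 0; i < attempts; i++) {
    const handshake = gateway.nextHandshake();
    log.push(handshake);
    if (handshake === "IDENTIFY") break; // fresh session accepted
    if (resetOnStall && gateway instanceof ResilientGatewayModel) {
      gateway.resetSession();
    }
  }
  return log;
}

const stuck = new GatewayModel();
stuck.establish("dead-session", "wss://example.invalid", 42);
console.log(simulate(stuck, 4, false)); // every attempt is a RESUME

const resilient = new ResilientGatewayModel();
resilient.establish("dead-session", "wss://example.invalid", 42);
console.log(simulate(resilient, 4, true)); // one RESUME, then an IDENTIFY
```

Without a reset, every attempt in the model repeats the same doomed RESUME. With a reset after the first stall, the second attempt falls back to IDENTIFY.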

Test plan

  • npm run build passes
  • npx vitest run src/discord/monitor/provider.lifecycle.test.ts passes (3/3)
  • Gateway restart: all Discord bots connect cleanly, no resume death spiral in logs

🤖 Generated with Claude Code

Greptile Summary

Fixes Discord gateway resume death spiral by adding session reset capability when connections stall. The PR introduces ResilientGatewayPlugin that can invalidate stale session state (sessionId, resumeGatewayUrl, sequence) when the HELLO timeout fires (30s), forcing a fresh IDENTIFY instead of repeatedly attempting to resume a dead session. Also reduces reconnection attempts from 50 to 10 to avoid ~25 minutes of pointless retries.

Changes:

  • Created ResilientGatewayPlugin subclass with resetSession() method to clear session state
  • Updated all gateway creation paths to use ResilientGatewayPlugin instead of base GatewayPlugin
  • Added session reset logic in HELLO timeout handler before forcing reconnect
  • Reduced maxAttempts from 50 to 10 for faster failure detection
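
A minimal sketch of the guarded reset in the HELLO timeout handler might look like the following. `BasePlugin`, `ResettablePlugin`, `armHelloTimeout`, and the injectable `Schedule` type are assumptions made up for this example; they stand in for the real `GatewayPlugin` and `ResilientGatewayPlugin` classes, whose actual signatures are not shown in this PR page. The 30-second value is the one quoted in the description.

```typescript
// Hypothetical stand-ins for the plugin classes. Counters replace real I/O.
class BasePlugin {
  reconnects = 0;
  reconnect(): void {
    this.reconnects++;
  }
}

class ResettablePlugin extends BasePlugin {
  resets = 0;
  resetSession(): void {
    this.resets++;
  }
}

const HELLO_TIMEOUT_MS = 30_000; // 30s with no HELLO, per the PR description

interface Timer {
  cancel(): void;
}
type Schedule = (fn: () => void, ms: number) => Timer;

// Default scheduler backed by setTimeout; tests can inject a fake one.
const realSchedule: Schedule = (fn, ms) => {
  const handle = setTimeout(fn, ms);
  return { cancel: () => clearTimeout(handle) };
};

// Arm the watchdog when a connection opens; call cancel() once HELLO arrives.
function armHelloTimeout(plugin: BasePlugin, schedule: Schedule = realSchedule): Timer {
  return schedule(() => {
    // Only plugins that support resetting get their session cleared;
    // a plain plugin still reconnects as before.
    if (plugin instanceof ResettablePlugin) {
      plugin.resetSession();
    }
    plugin.reconnect();
  }, HELLO_TIMEOUT_MS);
}
```

The `instanceof` guard keeps the handler safe if some code path still constructs the base class: the reconnect happens either way, and the reset is skipped rather than throwing.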

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The fix is well-scoped and addresses a clear issue (resume death spiral). The ResilientGatewayPlugin subclass cleanly extends existing functionality without breaking changes. The session reset logic is properly guarded with an instanceof check. Tests pass and the implementation follows TypeScript best practices. Reducing maxAttempts from 50 to 10 is a sensible change that makes failures surface faster rather than hanging for 25+ minutes.
  • No files require special attention

Last reviewed commit: b5566a2

When Discord sessions expire (network blip, server-side timeout), the
gateway gets stuck in an infinite resume loop: it connects, Discord
immediately closes with code 1005, but the client never clears stale
session state so canResume() keeps returning true. The bots go fully
offline and only a manual restart recovers them.

Fix: when the HELLO timeout fires (30s with no HELLO from Discord),
invalidate the session state before reconnecting so the next connect
performs a fresh IDENTIFY instead of retrying a dead resume.

Also reduces maxAttempts from 50 to 10 — 50 attempts at exponential
backoff meant ~25 minutes of retrying before giving up.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
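
The retry-time figure can be sanity-checked with a quick calculation. The backoff parameters are not stated anywhere in this PR, so the sketch below assumes a 1-second base delay that doubles each attempt and is capped at 30 seconds. `totalBackoffSeconds` and its defaults are hypothetical and chosen only to show the order of magnitude.

```typescript
// Assumed parameters (NOT from the PR): 1s base, doubling, capped at 30s.
function totalBackoffSeconds(maxAttempts: number, baseSeconds = 1, capSeconds = 30): number {
  let total = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Delay before each attempt: base * 2^attempt, never above the cap.
    total += Math.min(capSeconds, baseSeconds * 2 ** attempt);
  }
  return total;
}

console.log(totalBackoffSeconds(50)); // 1381 seconds, about 23 minutes
console.log(totalBackoffSeconds(10)); // 181 seconds, about 3 minutes
```

Under these assumed values, 50 attempts land in the same range as the ~25 minutes quoted in the commit message, while 10 attempts give up after roughly 3 minutes. Connection and timeout durations are not counted, so real wall-clock time would be longer.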

openclaw-barnacle bot added the channel: discord (Channel integration: discord) and size: XS labels Feb 25, 2026

The fallback now returns ResilientGatewayPlugin instead of plain
GatewayPlugin, so the prototype check needs to match.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@mr-sk

mr-sk commented Feb 25, 2026

CI note: The check job failure is unrelated to this PR — it's a pre-existing TS2742 error on main in src/telegram/bot.media.test-utils.ts. Tracked in #26011.

This PR only modifies files in src/discord/monitor/. All discord monitor tests pass locally (186/186).

@mr-sk

mr-sk commented Mar 2, 2026

CI update: All checks pass except macos, which has timed out on the last 3 runs (~3h each) waiting for a runner. This is a runner availability issue — the job never actually executes our tests. All other platforms (Linux node, Linux bun, Windows node x2, Android) are green.

…death-spiral

# Conflicts:
#	src/discord/monitor/provider.lifecycle.ts
@mr-sk

mr-sk commented Mar 2, 2026

Closing — upstream has since implemented a more comprehensive version of this fix directly on main (consecutive stall counter, reconnect watchdog, status reporting, clearResumeState()). Our ResilientGatewayPlugin subclass approach is no longer needed.

@mr-sk mr-sk closed this Mar 2, 2026