Discord health-monitor restart storm: 'disconnected' classified as 'stuck', no reconnect config parity with WhatsApp #36404

@ToneLoke

Summary

Multi-agent setup (6 Discord bots) experienced 90+ health-monitor restarts overnight from a single Discord WebSocket 1006 disconnect event. Two compounding issues:

Issue 1: 'disconnected' classified as 'stuck' in restart reason

In channel-health-policy.ts, evaluateChannelHealth correctly returns { reason: 'disconnected' } when connected === false. But resolveChannelRestartReason only handles stale-socket and not-running explicitly — everything else falls through to return 'stuck'. So a bot that's simply disconnected from Discord shows reason: stuck in the logs, which is misleading.

// channel-health-policy.ts
export function resolveChannelRestartReason(...): ChannelRestartReason {
  if (evaluation.reason === 'stale-socket') return 'stale-socket';
  if (evaluation.reason === 'not-running') {
    return snapshot.reconnectAttempts >= 10 ? 'gave-up' : 'stopped';
  }
  return 'stuck'; // catches 'disconnected', 'stuck' — conflates two very different states
}

Proposed fix: Add explicit handling for disconnected → return 'stopped' or a new 'disconnected' reason.
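A minimal sketch of what the explicit handling could look like, assuming the second option (a new 'disconnected' restart reason). The type names and the shapes of `evaluation` and `snapshot` are assumptions made for illustration; the real definitions in channel-health-policy.ts may differ.

```typescript
// Hypothetical types: only the reason strings come from the issue text.
type ChannelHealthReason = "disconnected" | "stale-socket" | "not-running" | "stuck";
type ChannelRestartReason = "stale-socket" | "gave-up" | "stopped" | "disconnected" | "stuck";

interface ChannelHealthEvaluation {
  reason: ChannelHealthReason;
}

interface ChannelSnapshot {
  reconnectAttempts: number;
}

function resolveChannelRestartReason(
  evaluation: ChannelHealthEvaluation,
  snapshot: ChannelSnapshot,
): ChannelRestartReason {
  switch (evaluation.reason) {
    case "stale-socket":
      return "stale-socket";
    case "not-running":
      // Same threshold as the existing code.
      return snapshot.reconnectAttempts >= 10 ? "gave-up" : "stopped";
    case "disconnected":
      // New: report a dropped connection as its own reason instead of 'stuck'.
      return "disconnected";
    default:
      // Only a genuinely stuck channel reaches this branch now.
      return "stuck";
  }
}
```

With this change the log line for a dropped gateway connection would read reason: disconnected, so 'stuck' stays reserved for channels that are connected but not making progress.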

Issue 2: No reconnect config for Discord (WhatsApp has it)

When the Discord WebSocket drops repeatedly (close codes 1006/1005), the health monitor restarts each bot every 10 minutes (5-minute check interval × 2 cooldown cycles). With 6 bots all hitting the same Discord drop simultaneously, this creates a restart storm that continues for hours.

The WhatsApp channel supports a channels.web.reconnect config (initialMs, maxMs, factor, jitter). Discord has no equivalent. Related: #13688.
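For context, here is a rough sketch of how a delay is typically derived from fields like these (exponential backoff with jitter). Only the four field names come from the WhatsApp config; the function name, the 0-based attempt index, and the reading of jitter as a symmetric fraction of the delay are assumptions, not OpenClaw's actual implementation.

```typescript
// Field names mirror channels.web.reconnect; their exact semantics are assumed here.
interface ReconnectConfig {
  initialMs: number; // delay before the first retry
  maxMs: number;     // upper bound on any delay
  factor: number;    // multiplier applied per attempt
  jitter: number;    // assumed: fraction of the delay to randomize, e.g. 0.25 = ±25%
}

// Hypothetical helper: delay for the given 0-based attempt.
// `random` is injectable so the result can be made deterministic in tests.
function computeBackoffMs(
  attempt: number,
  cfg: ReconnectConfig,
  random: () => number = Math.random,
): number {
  const base = Math.min(cfg.maxMs, cfg.initialMs * Math.pow(cfg.factor, attempt));
  const spread = base * cfg.jitter;
  const offset = (random() * 2 - 1) * spread; // in [-spread, +spread)
  return Math.max(0, Math.min(cfg.maxMs, Math.round(base + offset)));
}
```

The jitter matters most in a multi-bot setup: without it, bots that lose the connection at the same moment also retry at the same moment.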

Proposed config:

{
  "channels": {
    "discord": {
      "gateway": {
        "reconnect": {
          "maxBackoffMs": 30000,
          "maxAttempts": 10,
          "freshIdentifyAfterStalls": 3
        }
      }
    }
  }
}
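One possible reading of these three fields, sketched as a pure decision function. Everything except the field names is an assumption: the function and type names, the 1-second base delay that doubles per attempt, and the idea that "stalls" are counted as consecutive failed resume attempts. It is meant to pin down the intended semantics, not to describe existing behaviour.

```typescript
// Field names come from the proposed config; everything else is hypothetical.
interface DiscordReconnectConfig {
  maxBackoffMs: number;
  maxAttempts: number;
  freshIdentifyAfterStalls: number;
}

type ReconnectAction =
  | { kind: "resume"; delayMs: number }
  | { kind: "fresh-identify"; delayMs: number }
  | { kind: "give-up" };

function nextReconnectAction(
  attempt: number,           // 0-based count of reconnect attempts so far
  consecutiveStalls: number, // assumed: resume attempts in a row that stalled
  cfg: DiscordReconnectConfig,
): ReconnectAction {
  if (attempt >= cfg.maxAttempts) {
    // Stop retrying and leave the channel down rather than restarting forever.
    return { kind: "give-up" };
  }
  // Assumed base of 1s doubling per attempt, capped at maxBackoffMs.
  const delayMs = Math.min(cfg.maxBackoffMs, 1000 * Math.pow(2, attempt));
  if (consecutiveStalls >= cfg.freshIdentifyAfterStalls) {
    // Drop the old session and start a new one instead of resuming.
    return { kind: "fresh-identify", delayMs };
  }
  return { kind: "resume", delayMs };
}
```

Under this reading and the values above, the retry delay is capped at 30 seconds, the third consecutive stall triggers a fresh identify, and the bot stops retrying once 10 attempts have been used.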

Environment

  • OpenClaw: 2026.3.3
  • OS: macOS 26.3 (arm64), Node 22.22.0
  • 6 Discord bot accounts, all using allowBots: 'mentions'
  • Gateway: LaunchAgent, loopback bind

Log pattern (repeated 90+ times overnight)

[discord] gateway: WebSocket connection closed with code 1006
[health-monitor] [discord:bob] health-monitor: restarting (reason: stuck)
[health-monitor] [discord:marshal] health-monitor: restarting (reason: stuck)
[health-monitor] [discord:skipper] health-monitor: restarting (reason: stuck)
[health-monitor] [discord:tai] health-monitor: restarting (reason: stuck)

Workaround

Increased gateway.channelHealthCheckMinutes from 5 → 10 to reduce restart frequency while waiting for an upstream fix.
