Discord health-monitor restart storm: 'disconnected' classified as 'stuck', no reconnect config parity with WhatsApp #36404
Description
Summary
A multi-agent setup (6 Discord bots) experienced 90+ health-monitor restarts overnight, triggered by a single Discord WebSocket 1006 disconnect event. Two issues compound each other:
Issue 1: 'disconnected' classified as 'stuck' in restart reason
In channel-health-policy.ts, evaluateChannelHealth correctly returns { reason: 'disconnected' } when connected === false. But resolveChannelRestartReason only handles stale-socket and not-running explicitly — everything else falls through to return 'stuck'. So a bot that's simply disconnected from Discord shows reason: stuck in the logs, which is misleading.
// channel-health-policy.ts
export function resolveChannelRestartReason(...): ChannelRestartReason {
  if (evaluation.reason === 'stale-socket') return 'stale-socket';
  if (evaluation.reason === 'not-running') {
    return snapshot.reconnectAttempts >= 10 ? 'gave-up' : 'stopped';
  }
  return 'stuck'; // catches both 'disconnected' and 'stuck', conflating two very different states
}
Proposed fix: Add explicit handling for disconnected → return 'stopped' or a new 'disconnected' reason.
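A minimal sketch of the second option (a new 'disconnected' restart reason), to make the proposal concrete. The type definitions and the two-argument signature below are assumptions for illustration, since the real parameter list is elided above and the actual types in channel-health-policy.ts may differ.

```typescript
// Hypothetical types: stand-ins for whatever channel-health-policy.ts really defines.
type ChannelHealthReason = 'disconnected' | 'stale-socket' | 'not-running' | 'stuck';
type ChannelRestartReason = 'stale-socket' | 'gave-up' | 'stopped' | 'disconnected' | 'stuck';

interface ChannelHealthEvaluation {
  reason: ChannelHealthReason;
}

interface ChannelSnapshot {
  reconnectAttempts: number;
}

// Assumed signature; the original only shows `(...)`.
function resolveChannelRestartReason(
  evaluation: ChannelHealthEvaluation,
  snapshot: ChannelSnapshot,
): ChannelRestartReason {
  if (evaluation.reason === 'stale-socket') return 'stale-socket';
  if (evaluation.reason === 'not-running') {
    return snapshot.reconnectAttempts >= 10 ? 'gave-up' : 'stopped';
  }
  // New branch: report a dropped connection as what it is instead of 'stuck'.
  if (evaluation.reason === 'disconnected') return 'disconnected';
  return 'stuck';
}
```

With this branch in place, the log lines below would read `reason: disconnected` for a plain WebSocket drop, leaving `stuck` for channels that are genuinely unresponsive.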
Issue 2: No reconnect config for Discord (WhatsApp has it)
When the Discord WebSocket drops repeatedly (close codes 1006/1005), the health monitor restarts bots every 10 minutes (5-minute check interval × 2 cooldown cycles). With 6 bots all hitting the same Discord drop simultaneously, this creates a restart storm that continues for hours.
The WhatsApp channel supports a channels.web.reconnect config (initialMs, maxMs, factor, jitter). Discord has no equivalent. Related: #13688.
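For context, here is a rough sketch of how the four WhatsApp-style fields could drive a reconnect delay. This is an assumption about the semantics, not OpenClaw's actual implementation: `ReconnectPolicy` and `reconnectDelayMs` are hypothetical names, and `jitter` is treated as a fraction of the delay (0 to 1) applied in both directions.

```typescript
// Hypothetical shape mirroring the field names of channels.web.reconnect.
interface ReconnectPolicy {
  initialMs: number; // delay before the first retry
  maxMs: number;     // upper bound on any single delay
  factor: number;    // multiplier applied per attempt
  jitter: number;    // assumed: fraction of the delay to randomize, 0..1
}

// Exponential backoff with symmetric jitter, capped at maxMs.
// `attempt` is zero-based; `random` is injectable so the result can be tested.
function reconnectDelayMs(
  policy: ReconnectPolicy,
  attempt: number,
  random: () => number = Math.random,
): number {
  const base = Math.min(policy.maxMs, policy.initialMs * Math.pow(policy.factor, attempt));
  const spread = base * policy.jitter;
  // random() is in [0, 1), so the offset falls in [-spread, +spread).
  const offset = (random() * 2 - 1) * spread;
  return Math.max(0, Math.round(Math.min(policy.maxMs, base + offset)));
}
```

With jitter enabled, six bots that lose the gateway at the same moment would retry at slightly different times rather than in lockstep, which is the behavior missing on the Discord side.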
Proposed config:
{
  "channels": {
    "discord": {
      "gateway": {
        "reconnect": {
          "maxBackoffMs": 30000,
          "maxAttempts": 10,
          "freshIdentifyAfterStalls": 3
        }
      }
    }
  }
}
Environment
- OpenClaw: 2026.3.3
- OS: macOS 26.3 (arm64), Node 22.22.0
- 6 Discord bot accounts, all using allowBots: 'mentions'
- Gateway: LaunchAgent, loopback bind
Log pattern (repeated 90+ times overnight)
[discord] gateway: WebSocket connection closed with code 1006
[health-monitor] [discord:bob] health-monitor: restarting (reason: stuck)
[health-monitor] [discord:marshal] health-monitor: restarting (reason: stuck)
[health-monitor] [discord:skipper] health-monitor: restarting (reason: stuck)
[health-monitor] [discord:tai] health-monitor: restarting (reason: stuck)
Workaround
Increased gateway.channelHealthCheckMinutes from 5 → 10 to reduce restart frequency while waiting for an upstream fix.