BUG: Discord channel fetch failure crashes gateway (unhandled rejection in GatewayPlugin.registerClient) #37375
Description
Bug type
Regression (worked before, now fails)
Summary
When the Discord API is unreachable during provider startup, @buape/carbon's GatewayPlugin.registerClient() throws an error that becomes an unhandled promise rejection, crashing the gateway. Combined with systemd auto-restart, this creates an infinite crash loop that persists until the network recovers or Discord is manually disabled. On our system, this caused 76 crashes in a single day (94% of all restarts).
Steps to reproduce
- Configure OpenClaw with Discord channel enabled
- Run on a system with intermittent network connectivity (WSL2 is a reliable reproducer due to known DNS/TCP instability)
- Wait for the health-monitor to detect a stale socket (it checks every 300s) during a network blip
- The health-monitor restarts the Discord provider
- Discord provider initialization calls GatewayPlugin.registerClient(), which fetches https://discord.com/api/v10/gateway/bot
- Fetch fails → unhandled promise rejection → process.exit(1)
- systemd restarts the gateway → Discord provider starts again → same fetch fails → crash loop
Minimal trigger: Any network interruption lasting >10 seconds during Discord provider startup.
Expected behavior
• Discord channel fetch failure should be treated as a transient network error
• Gateway should log a warning, skip or retry Discord connection, and continue running with other channels (Telegram, etc.)
• A single channel's connectivity issues should never crash the entire gateway
Actual behavior
Gateway crashes with:
[openclaw] Unhandled promise rejection: Error: Failed to get gateway information from Discord: fetch failed
at GatewayPlugin.registerClient (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/dist/src/plugins/gateway/GatewayPlugin.js:73:23)
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
Followed by: openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE
Root Cause Analysis
Three bugs interact to create this crash:
Bug 1 (High) — @buape/carbon Client constructor doesn't await async plugin registration
// @buape/carbon Client.js line 122
for (const plugin of plugins) {
plugin.registerClient?.(this); // No await — constructor can't be async
plugin.registerRoutes?.(this);
this.plugins.push({ id: plugin.id, plugin });
}
registerClient is async (it returns a Promise), but a constructor cannot be async, so it calls registerClient without awaiting the result or attaching a .catch() handler. Because this is a fire-and-forget call, any error thrown inside registerClient becomes an unhandled rejection by design.
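The fire-and-forget pattern, and one way to close it, can be sketched in plain JavaScript. FlakyGatewayPlugin, SafeClient, onPluginError, and ready are hypothetical names, not Carbon's API; the sketch only assumes that registerClient returns a Promise, as described above. Since the constructor still cannot await, the idea is to keep each returned Promise and attach a .catch() handler, so a failed registration is reported to the caller instead of escaping to the global handler.

```javascript
// Hypothetical stand-ins for Carbon's Client and GatewayPlugin; only the
// shape of the constructor loop is taken from the issue.
class FlakyGatewayPlugin {
  constructor() {
    this.id = "gateway";
  }

  // Like GatewayPlugin.registerClient, this is async, so a failure surfaces
  // as a rejected Promise rather than a synchronous throw.
  async registerClient(_client) {
    throw new Error("Failed to get gateway information from Discord: fetch failed");
  }
}

class SafeClient {
  constructor(plugins, onPluginError) {
    this.plugins = [];
    const pending = [];
    for (const plugin of plugins) {
      // The constructor still cannot await, but keeping the Promise and
      // attaching .catch() means the rejection is never left unhandled.
      pending.push(
        Promise.resolve(plugin.registerClient?.(this)).catch((error) =>
          onPluginError(plugin, error),
        ),
      );
      plugin.registerRoutes?.(this);
      this.plugins.push({ id: plugin.id, plugin });
    }
    // Callers that need every plugin registered can await this.
    this.ready = Promise.all(pending);
  }
}

// Usage: the failure is collected instead of crashing the process.
const failures = [];
const client = new SafeClient([new FlakyGatewayPlugin()], (plugin, error) => {
  failures.push(`${plugin.id}: ${error.message}`);
});
client.ready.then(() => console.log(failures));
```

The Discord provider could then decide, per plugin, whether to retry or to disable Discord and keep the other channels running.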
Bug 2 (Medium) — @buape/carbon GatewayPlugin strips error cause chain
// @buape/carbon GatewayPlugin.js lines 72-73
catch (error) {
throw new Error(`Failed to get gateway information from Discord: ${error instanceof Error ? error.message : String(error)}`);
// Missing: { cause: error }
}
The original TypeError("fetch failed") is discarded; only its message survives inside the new string. OpenClaw's isTransientNetworkError() walks the error graph via .cause, but with no cause attached there is nothing to follow.
Note: OpenClaw's own ProxyGatewayPlugin (used in proxy mode) correctly includes { cause: error }, so this bug only affects the non-proxy code path.
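A minimal sketch of the difference Fix A makes, assuming Node.js 16.9 or later (where the Error constructor accepts a { cause } option; the issue reports v22.22.0). wrapGatewayError, causeChain, and isRawFetchFailure are hypothetical helpers that approximate the chain-walking described above; they are not OpenClaw's actual code.

```javascript
// wrapGatewayError mirrors the rethrow in GatewayPlugin.js; keepCause toggles Fix A.
function wrapGatewayError(error, keepCause) {
  const message = `Failed to get gateway information from Discord: ${
    error instanceof Error ? error.message : String(error)
  }`;
  // Fix A is only the second constructor argument: { cause: error }.
  return keepCause ? new Error(message, { cause: error }) : new Error(message);
}

// Hypothetical helper: follows .cause links, stopping on cycles.
function causeChain(error) {
  const chain = [];
  const seen = new Set();
  for (let current = error; current != null && !seen.has(current); current = current.cause) {
    seen.add(current);
    chain.push(current);
  }
  return chain;
}

// The check that already exists for the unwrapped error.
const isRawFetchFailure = (e) => e instanceof TypeError && e.message === "fetch failed";

const original = new TypeError("fetch failed");
console.log(causeChain(wrapGatewayError(original, false)).some(isRawFetchFailure)); // false
console.log(causeChain(wrapGatewayError(original, true)).some(isRawFetchFailure)); // true
```

With the cause preserved, the existing TypeError check would match on the second link of the chain without any change to OpenClaw.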
Bug 3 (Medium) — OpenClaw's transient error detection misses wrapped "fetch failed" messages
isTransientNetworkError() checks:
• candidate instanceof TypeError && candidate.message === "fetch failed" → ❌ The wrapped error is Error, not TypeError
• message === "fetch failed" (exact match) → ❌ Actual message is "Failed to get gateway information from Discord: fetch failed"
• TRANSIENT_NETWORK_MESSAGE_SNIPPETS.some(s => message.includes(s)) → ❌ No snippet matches
The error falls through to the default process.exit(1) handler.
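To see why all three checks miss, they can be reconstructed as a hypothetical passesCurrentChecks function. This is an approximation based only on the checks listed above, not OpenClaw's source; the snippet entries are placeholders because the real TRANSIENT_NETWORK_MESSAGE_SNIPPETS list is not shown in this issue. Because the wrapped error carries no .cause, walking the chain visits only this one error, so a single call is representative.

```javascript
// Placeholder entries (assumption). Like the real list, they do not
// include "fetch failed".
const TRANSIENT_NETWORK_MESSAGE_SNIPPETS = ["ECONNRESET", "ETIMEDOUT", "EAI_AGAIN"];

function passesCurrentChecks(candidate) {
  const message = candidate instanceof Error ? candidate.message : String(candidate);
  // Check 1: the raw undici failure.
  if (candidate instanceof TypeError && message === "fetch failed") return true;
  // Check 2: exact message match.
  if (message === "fetch failed") return true;
  // Check 3: snippet substring match.
  return TRANSIENT_NETWORK_MESSAGE_SNIPPETS.some((s) => message.includes(s));
}

const raw = new TypeError("fetch failed");
const wrapped = new Error("Failed to get gateway information from Discord: fetch failed");
console.log(passesCurrentChecks(raw)); // true
console.log(passesCurrentChecks(wrapped)); // false
```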
The Full Kill Chain
Network blip (e.g. WSL2 DNS/TCP instability)
→ health-monitor detects stale-socket (every 300s)
→ restarts Discord provider
→ channel resolve fails (gracefully handled — just a warning)
→ Carbon Client constructor calls registerClient() without await (Bug 1)
→ registerClient fetches discord.com/api/v10/gateway/bot → fails
→ throws Error without { cause } (Bug 2) → unhandled rejection
→ isTransientNetworkError misses wrapped message (Bug 3)
→ process.exit(1) → systemd restart → repeat
Timing-Dependent Behavior
The crash occurs only if the network is still down when registerClient's fetch fires (~15-25s after provider start). Of 99 channel resolve failures observed, 76 led to crashes; in the other 23, the network recovered before registerClient needed it.
During crash loops, systemd restarts the gateway in ~10s, and the Discord provider immediately calls registerClient again. If the outage lasts longer than the restart window, every restart hits the same failure. One cluster produced 7 consecutive crashes in 3 minutes.
Proposed Fixes
Any one of these fixes on its own would break the crash chain:
| Fix | Where | What |
|---|---|---|
| A | @buape/carbon GatewayPlugin.js | Add { cause: error } to the rethrown Error — preserves error chain for isTransientNetworkError |
| B | OpenClaw isTransientNetworkError | Add "fetch failed" to TRANSIENT_NETWORK_MESSAGE_SNIPPETS — catches wrapped messages |
| C | OpenClaw isTransientNetworkError | Change message === "fetch failed" to message.includes("fetch failed") — more targeted variant of B |
| D | OpenClaw Discord provider | Wrap registerClient call in try/catch with retry/backoff, or catch the unhandled rejection from the Client constructor — prevents it from ever reaching the global handler |
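Fixes B and C both come down to substring matching on "fetch failed". A sketch that applies the match at every level of the .cause chain, so it catches the wrapped message today and would also catch the original TypeError if Fix A lands, might look like this. isTransientSketch and TRANSIENT_SNIPPETS are hypothetical names; the real function's other checks and snippet entries are omitted.

```javascript
// Only the entry Fix B would add; the real list's other entries are omitted.
const TRANSIENT_SNIPPETS = ["fetch failed"];

function isTransientSketch(error) {
  const seen = new Set();
  // Walk the .cause chain, guarding against cycles.
  for (let current = error; current != null && !seen.has(current); current = current.cause) {
    seen.add(current);
    const message = current instanceof Error ? current.message : String(current);
    if (TRANSIENT_SNIPPETS.some((s) => message.includes(s))) return true;
  }
  return false;
}

const wrappedNoCause = new Error("Failed to get gateway information from Discord: fetch failed");
const wrappedWithCause = new Error("gateway init failed", { cause: new TypeError("fetch failed") });
console.log(isTransientSketch(wrappedNoCause)); // true
console.log(isTransientSketch(wrappedWithCause)); // true
console.log(isTransientSketch(new Error("Invalid token"))); // false
```

Substring matching is deliberately looser than the current exact match. The trade-off is that any error whose message happens to contain "fetch failed" would also be treated as transient.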
Recommendation: Fix D is the most robust, since it handles the unawaited async call at the source. Fix B or C provides a safety net for any future wrapped transient errors. Bugs 1 and 2 live in @buape/carbon itself and could be reported upstream, with Fix A as the proposed patch for Bug 2.
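Fix D could take the form of a retry-with-backoff wrapper around the gateway registration step. This sketch assumes the provider can re-run that step as a function returning a Promise; retryWithBackoff, its options, and the log text are hypothetical and not part of OpenClaw or Carbon. The trailing .catch() matters most: once retries are exhausted, the error is logged and the gateway keeps serving other channels instead of reaching process.exit(1).

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: retries task() with exponential backoff, then rethrows.
async function retryWithBackoff(task, { attempts = 5, baseMs = 1000, maxMs = 30000, log = console.warn } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt === attempts) break;
      // Delays: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
      const delay = Math.min(maxMs, baseMs * 2 ** (attempt - 1));
      const reason = error instanceof Error ? error.message : String(error);
      log(`[discord] gateway registration failed (attempt ${attempt}/${attempts}); retrying in ${delay}ms: ${reason}`);
      await sleep(delay);
    }
  }
  throw lastError;
}

// Usage: a fake registration that fails twice, then succeeds.
let calls = 0;
const flakyRegister = async () => {
  calls += 1;
  if (calls < 3) throw new Error("Failed to get gateway information from Discord: fetch failed");
  return "connected";
};

retryWithBackoff(flakyRegister, { baseMs: 10, log: () => {} })
  .then((result) => console.log(result, calls)) // prints: connected 3
  // If every attempt fails, log and carry on instead of crashing the gateway.
  .catch((error) => console.warn("[discord] giving up; other channels keep running:", error));
```

Capping the delay with maxMs keeps a long outage from pushing retries out indefinitely, while doubling it avoids the tight ~10s retry cycle that the crash loop currently produces.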
OpenClaw version
2026.3.2
Operating system
Linux (WSL2, x64)
Install method
pnpm (global)
Logs, screenshots, and evidence
Crash Statistics (single day)
• 81 total gateway restarts
• 76 caused by Discord fetch failure (94%)
• 6 caused by TLS/setSession bug (separate issue: #36585)
• 62 triggered by health-monitor stale-socket detection
• Uptime between crashes during loop: ~25-40 seconds
Representative crash sequence (one of 76):
[health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
[discord] [default] starting provider
[discord] channel resolve failed; using config entries. fetch failed
[discord] failed to deploy native commands: fetch failed
[openclaw] Unhandled promise rejection: Error: Failed to get gateway information from Discord: fetch failed
at GatewayPlugin.registerClient (file:///usr/lib/node_modules/openclaw/node_modules/@buape/carbon/dist/src/plugins/gateway/GatewayPlugin.js:73:23)
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE
openclaw-gateway.service: Scheduled restart job, restart counter is at 61.
Note: Both channel resolve and command deploy fail gracefully (caught and logged). Only registerClient is unhandled.
Workaround
Disable the Discord channel by setting both channels.discord.enabled: false and plugins.entries.discord.enabled: false. This eliminated 94% of crashes immediately.
Related Issues
• #36585 — TLS setSession null dereference crash (different bug, same process.exit(1) handler behavior)
• #36588 — PR for TLS fix (adds isTlsSocketNullDeref detection)
Impact and severity
Severity: Critical — causes gateway crash loop (complete service outage)
Impact:
• Any user with Discord enabled on an unstable network (WSL2, VPN, mobile hotspot, flaky ISP) is vulnerable
• A single network blip >10 seconds can trigger an infinite crash loop requiring manual intervention (disable Discord or wait for network recovery)
• All other channels (Telegram, WhatsApp, etc.) go down with Discord — one channel's failure takes out the entire gateway
• 76 crashes in a single day observed; worst cluster was 7 crashes in 3 minutes
• No user-facing warning before the loop starts — it just dies
Frequency: Reliable reproducer on WSL2; likely affects any environment with intermittent connectivity
Additional information
Node.js: v22.22.0
@buape/carbon: 0.0.0-beta-20260216184201 (bundled with OpenClaw)
Gateway service: systemd (user), Restart=always
• The earlyGatewayErrorGuard mechanism in the Discord provider catches "error" events emitted by the gateway emitter, but registerClient throws a regular Error from an async function. That error surfaces as a rejected Promise rather than an emitted event, so the guard never sees it
• Telegram (Grammy-based) handles connection failures internally with retries — the failure never escapes as an unhandled rejection, which is why Telegram never crashes from network blips
• The crash is probabilistic: depends on whether the network is still down when registerClient's fetch fires. Sustained outages (>10s) reliably trigger the loop; brief blips (<15s) may not
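The propagation difference described in the earlyGatewayErrorGuard note above can be reproduced with Node's built-in EventEmitter. gateway and registerClientLike are hypothetical stand-ins for the guarded gateway emitter and GatewayPlugin.registerClient; the sketch only illustrates that an "error" listener never sees a rejection from an unrelated async function.

```javascript
const { EventEmitter } = require("node:events");

// Stand-in for the emitter that the guard listens on.
const gateway = new EventEmitter();
const seenByGuard = [];
gateway.on("error", (error) => seenByGuard.push(error.message));

// Stand-in for GatewayPlugin.registerClient.
async function registerClientLike() {
  throw new Error("Failed to get gateway information from Discord: fetch failed");
}

// Path 1: an emitted "error" event reaches the listener.
gateway.emit("error", new Error("socket closed"));

// Path 2: the async throw rejects a Promise; the emitter is never involved.
// Without this .catch(), the rejection would be unhandled.
const seenByCatch = [];
registerClientLike().catch((error) => seenByCatch.push(error.message));

// setImmediate runs after pending microtasks, so both paths have settled.
setImmediate(() => console.log({ seenByGuard, seenByCatch }));
```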