Health-monitor falsely detects Discord connections as 'stuck' during LLM inference, causing restart thrash and 'Gateway is draining' errors #36017
Description
Health-monitor detects Discord channel connections as "stuck" and repeatedly restarts them, causing Gateway is draining errors for agents attempting to send messages during the restart window. This creates a cycle where connections are restarted every ~10 minutes.
Environment
- OpenClaw version: 2026.3.2 (latest stable)
- OS: macOS (arm64) + Ubuntu (Hetzner VPS, 503GB RAM)
- Setup: Multi-agent gateway with 4 Discord bots (macOS) / 10 Discord bots (Ubuntu)
- Channels: Discord + Telegram per agent
Symptoms
1. Health-monitor restart thrashing
Every ~10 minutes, health-monitor restarts Discord connections with reason: stuck:
[health-monitor] [discord:livius] health-monitor: restarting (reason: stuck)
[health-monitor] [discord:artemis] health-monitor: restarting (reason: stuck)
[health-monitor] [discord:onyx] health-monitor: restarting (reason: stuck)
Telegram connections also restart with reason: stale-socket at similar intervals.
Scale (4-hour window):
| Host | Health-monitor restarts | Slow listeners |
|---|---|---|
| macOS (4 bots) | ~15 | ~4 |
| Ubuntu (10 bots) | 145 | 1,025 |
2. Root cause: DiscordMessageListener blocks event queue
A single MESSAGE_CREATE event blocks the listener for minutes:
[discord/monitor] Slow listener detected: DiscordMessageListener took 483517ms (483.5 seconds) for MESSAGE_CREATE
[discord/monitor] Slow listener detected: DiscordMessageListener took 48153ms (48.2 seconds) for MESSAGE_CREATE
This appears to happen when an agent processes an LLM inference synchronously within the Discord message listener, blocking the entire event queue. While the inference runs, health-monitor sees no activity and marks the connection as "stuck."
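To illustrate the suspected failure mode (all names here are hypothetical, not OpenClaw's actual internals): when the handler awaits the LLM call inline, the event queue stalls for the full inference duration; handing the work to a queue lets the listener return immediately.

```typescript
// Hypothetical sketch of the blocking vs. non-blocking listener patterns.
// `runInference` stands in for the real LLM call; names are illustrative.

type DiscordMessage = { channelId: string; content: string };

// Stand-in for a long-running LLM inference.
async function runInference(msg: DiscordMessage): Promise<string> {
  return `reply to: ${msg.content}`;
}

const pending: DiscordMessage[] = [];

// Blocking pattern (what the slow-listener logs suggest): the event
// queue waits here, potentially for hundreds of seconds, before the
// next MESSAGE_CREATE can be dispatched.
async function onMessageBlocking(msg: DiscordMessage): Promise<void> {
  const reply = await runInference(msg); // queue stalled during inference
  console.log(reply);
}

// Non-blocking pattern: enqueue and return in microseconds; a separate
// worker loop drains the queue without holding up event dispatch.
function onMessageAsync(msg: DiscordMessage): void {
  pending.push(msg);
}

async function worker(): Promise<void> {
  while (pending.length > 0) {
    const msg = pending.shift()!;
    const reply = await runInference(msg);
    console.log(reply);
  }
}
```

With the second pattern, health-monitor would keep seeing listener activity even while inference is in flight.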
3. DiscordReactionListener slow warnings
Additionally, DiscordReactionListener and DiscordReactionRemoveListener regularly take between 1 and 2.4 seconds per event:
[EventQueue] Slow listener detected: DiscordReactionListener took 2398ms for MESSAGE_REACTION_ADD
[EventQueue] Slow listener detected: DiscordReactionRemoveListener took 1580ms for MESSAGE_REACTION_REMOVE
On the Ubuntu host, this produced 1,025 slow-listener warnings in 4 hours (~4 per minute).
4. Agent impact: "Gateway is draining"
When health-monitor restarts a connection, any agent trying to use that Discord connection gets:
Gateway is draining, please try again later
This disrupts agent workflows mid-task.
Expected behavior
- LLM inference within Discord message handling should not block the event queue (async processing)
- Health-monitor should distinguish between "connection dead" and "connection busy processing a long-running request"
- The grace window added in v2026.3.2 (fix(gateway): resolve message tool Unknown channel and health-monitor false-positive stuck detection #32367) helps but doesn't fully solve the issue when inference takes 48-483 seconds
Possible solutions
- Process LLM inference asynchronously - decouple the Discord event listener from the LLM response pipeline so the event queue stays responsive
- Extend health-monitor awareness - if a session is actively processing (LLM inference in progress), don't mark the connection as stuck
- Configurable stuck timeout - allow operators to set a longer threshold for multi-agent setups where long inferences are expected
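The second and third suggestions could be combined into a single check. A minimal sketch, assuming hypothetical names (OpenClaw's real monitor state will differ): the monitor only flags a connection as stuck when it has been idle past an operator-tunable threshold and no long-running request is in flight.

```typescript
// Illustrative sketch only; field and type names are assumptions.

interface ConnectionState {
  lastEventAt: number;        // ms timestamp of last gateway activity
  inferenceInFlight: boolean; // set while an agent's LLM call is running
}

interface MonitorConfig {
  stuckTimeoutMs: number;     // operator-configurable threshold
}

function isStuck(
  conn: ConnectionState,
  cfg: MonitorConfig,
  now: number,
): boolean {
  // A connection busy with a long-running request is never "stuck".
  if (conn.inferenceInFlight) return false;
  // Otherwise, only idle time past the configured threshold counts.
  return now - conn.lastEventAt > cfg.stuckTimeoutMs;
}
```

Under this check, a 483-second inference would not trigger a restart as long as the busy flag is set, while a genuinely dead connection would still be caught once the timeout elapses.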
Additional context
- The v2026.3.2 changelog mentions: "prevent health-monitor restart thrash for channels that just (re)started by adding a per-channel startup-connect grace window" (fix(gateway): resolve message tool Unknown channel and health-monitor false-positive stuck detection #32367) - this helps post-restart but doesn't prevent the initial "stuck" detection during long inferences
- Both hosts run the same version (2026.3.2) with identical behavior
- 56 defunct (zombie) processes have accumulated on the Ubuntu host from repeated restarts (the oldest dating to Feb 27)
- Potentially related: the zombie process accumulation suggests the restart flow doesn't fully clean up child processes