
Health-monitor falsely detects Discord connections as 'stuck' during LLM inference, causing restart thrash and 'Gateway is draining' errors #36017

@Pygmalione

Description


Health-monitor detects Discord channel connections as "stuck" and repeatedly restarts them, causing Gateway is draining errors for agents attempting to send messages during the restart window. This creates a cycle where connections are restarted every ~10 minutes.

Environment

  • OpenClaw version: 2026.3.2 (latest stable)
  • OS: macOS (arm64) + Ubuntu (Hetzner VPS, 503GB RAM)
  • Setup: Multi-agent gateway with 4 Discord bots (macOS) / 10 Discord bots (Ubuntu)
  • Channels: Discord + Telegram per agent

Symptoms

1. Health-monitor restart thrashing

Every ~10 minutes, health-monitor restarts Discord connections with reason: stuck:

[health-monitor] [discord:livius] health-monitor: restarting (reason: stuck)
[health-monitor] [discord:artemis] health-monitor: restarting (reason: stuck)
[health-monitor] [discord:onyx] health-monitor: restarting (reason: stuck)

Telegram connections also restart with reason: stale-socket at similar intervals.

Scale (4-hour window):

Host              Health-monitor restarts   Slow listener warnings
macOS (4 bots)    ~15                       ~4
Ubuntu (10 bots)  145                       1,025

2. Root cause: DiscordMessageListener blocks event queue

A single MESSAGE_CREATE event blocks the listener for minutes:

[discord/monitor] Slow listener detected: DiscordMessageListener took 483517ms (483.5 seconds) for MESSAGE_CREATE
[discord/monitor] Slow listener detected: DiscordMessageListener took 48153ms (48.2 seconds) for MESSAGE_CREATE

This appears to happen when an agent processes an LLM inference synchronously within the Discord message listener, blocking the entire event queue. While the inference runs, health-monitor sees no activity and marks the connection as "stuck."
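A minimal sketch of the decoupling described above, assuming an asyncio-style event loop (the names handle_message and run_inference are illustrative, not OpenClaw's actual API): the listener schedules the inference as a background task and returns immediately, so later MESSAGE_CREATE events are not queued behind a 48-483 second LLM call.

```python
import asyncio
import time

async def run_inference(message: str) -> str:
    # Stand-in for a long-running LLM call (real inferences took 48-483 s).
    await asyncio.sleep(0.2)
    return f"reply to {message!r}"

async def handle_message(message: str, outbox: list, tasks: set) -> None:
    # The listener returns right away; the inference runs as a background
    # task, keeping the event queue (and health-monitor heartbeats) flowing.
    async def respond():
        outbox.append(await run_inference(message))
    task = asyncio.create_task(respond())
    tasks.add(task)  # keep a reference so the task isn't garbage-collected
    task.add_done_callback(tasks.discard)

async def main():
    outbox, tasks = [], set()
    start = time.monotonic()
    await handle_message("hello", outbox, tasks)
    listener_time = time.monotonic() - start  # near-zero: listener didn't block
    await asyncio.sleep(0.3)  # give the background task time to finish
    return listener_time, outbox

listener_time, outbox = asyncio.run(main())
```

With this shape, the listener's observed duration stays in the millisecond range regardless of inference time, so the slow-listener warnings above would not fire.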

3. DiscordReactionListener slow warnings

Additionally, DiscordReactionListener and DiscordReactionRemoveListener regularly take over a second (up to ~2.4 s) per event:

[EventQueue] Slow listener detected: DiscordReactionListener took 2398ms for MESSAGE_REACTION_ADD
[EventQueue] Slow listener detected: DiscordReactionRemoveListener took 1580ms for MESSAGE_REACTION_REMOVE

On the Ubuntu host alone, 1,025 slow-listener warnings were logged in a 4-hour window (~4 per minute).

4. Agent impact: "Gateway is draining"

When health-monitor restarts a connection, any agent trying to use that Discord connection gets:

Gateway is draining, please try again later

This disrupts agent workflows mid-task.
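Until the false-positive detection is fixed, a possible agent-side mitigation is to treat the draining error as transient and retry with backoff. This is a hypothetical sketch; GatewayDrainingError and the send callable are illustrative names, not OpenClaw's actual API.

```python
import time

class GatewayDrainingError(Exception):
    """Stand-in for the transient 'Gateway is draining' failure."""

def send_with_retry(send, payload, retries=5, base_delay=0.01):
    # Retry with exponential backoff while the gateway restarts; re-raise
    # if the connection is still draining after the final attempt.
    for attempt in range(retries):
        try:
            return send(payload)
        except GatewayDrainingError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulated gateway that drains for two attempts, then recovers.
calls = {"n": 0}
def flaky_send(payload):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise GatewayDrainingError("Gateway is draining, please try again later")
    return f"sent:{payload}"

result = send_with_retry(flaky_send, "hello")
```

This only papers over the restart window, though; it does not stop the thrash itself.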

Expected behavior

  1. LLM inference within Discord message handling should not block the event queue (async processing)
  2. Health-monitor should distinguish between "connection dead" and "connection busy processing a long-running request"
  3. The grace window added in v2026.3.2 (fix(gateway): resolve message tool Unknown channel and health-monitor false-positive stuck detection #32367) helps but doesn't fully solve the issue when inference takes 48-483 seconds

Possible solutions

  1. Process LLM inference asynchronously - decouple the Discord event listener from the LLM response pipeline so the event queue stays responsive
  2. Extend health-monitor awareness - if a session is actively processing (LLM inference in progress), don't mark the connection as stuck
  3. Configurable stuck timeout - allow operators to set a longer threshold for multi-agent setups where long inferences are expected
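Solutions 2 and 3 could be sketched roughly as follows, assuming the health-monitor tracks per-channel state (ChannelState, inflight_inferences, and should_restart are illustrative names, not OpenClaw internals): a channel with in-flight work is "busy", never "stuck", and the idle threshold is operator-configurable.

```python
import time

class ChannelState:
    def __init__(self):
        self.last_activity = time.monotonic()
        self.inflight_inferences = 0  # incremented while an LLM call runs

def should_restart(state, stuck_timeout_s=600.0, now=None):
    """Restart only if the channel is idle AND past the configurable timeout."""
    now = time.monotonic() if now is None else now
    if state.inflight_inferences > 0:
        return False  # busy processing, not stuck: suppress the restart
    return (now - state.last_activity) > stuck_timeout_s
```

Under this rule, a connection mid-inference for 483 s would report busy rather than triggering a reason: stuck restart, while a genuinely dead connection would still be restarted once the timeout elapses.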

Additional context

  • The v2026.3.2 changelog mentions: "prevent health-monitor restart thrash for channels that just (re)started by adding a per-channel startup-connect grace window" (fix(gateway): resolve message tool Unknown channel and health-monitor false-positive stuck detection #32367) - this helps post-restart but doesn't prevent the initial "stuck" detection during long inferences
  • Both hosts run the same version (2026.3.2) with identical behavior
  • 56 zombie processes (defunct) accumulate on the Ubuntu host from repeated restarts (oldest from Feb 27)
  • Potentially related: the zombie process accumulation suggests the restart flow doesn't fully clean up child processes
