Health-monitor falsely detects Discord connections as 'stuck' during LLM inference, causing restart thrash and 'Gateway is draining' errors #36017
Description
Health-monitor detects Discord channel connections as "stuck" and repeatedly restarts them, causing Gateway is draining errors for agents attempting to send messages during the restart window. This creates a cycle where connections are restarted every ~10 minutes.
Environment
- OpenClaw version: 2026.3.2 (latest stable)
- OS: macOS (arm64) + Ubuntu (Hetzner VPS, 503GB RAM)
- Setup: Multi-agent gateway with 4 Discord bots (macOS) / 10 Discord bots (Ubuntu)
- Channels: Discord + Telegram per agent
Symptoms
1. Health-monitor restart thrashing
Every ~10 minutes, health-monitor restarts Discord connections with reason: stuck:
[health-monitor] [discord:livius] health-monitor: restarting (reason: stuck)
[health-monitor] [discord:artemis] health-monitor: restarting (reason: stuck)
[health-monitor] [discord:onyx] health-monitor: restarting (reason: stuck)
Telegram connections also restart with reason: stale-socket at similar intervals.
Scale (4-hour window):
| Host | Health-monitor restarts | Slow listeners |
|---|---|---|
| macOS (4 bots) | ~15 | ~4 |
| Ubuntu (10 bots) | 145 | 1,025 |
2. Root cause: DiscordMessageListener blocks event queue
A single MESSAGE_CREATE event blocks the listener for minutes:
[discord/monitor] Slow listener detected: DiscordMessageListener took 483517ms (483.5 seconds) for MESSAGE_CREATE
[discord/monitor] Slow listener detected: DiscordMessageListener took 48153ms (48.2 seconds) for MESSAGE_CREATE
This appears to happen when an agent processes an LLM inference synchronously within the Discord message listener, blocking the entire event queue. While the inference runs, health-monitor sees no activity and marks the connection as "stuck."
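To illustrate the suspected failure mode (all names here are hypothetical, not OpenClaw's actual internals): when the handler awaits the LLM call inline, the event queue stalls for the full inference duration; handing the work to a queue lets the listener return immediately.

```typescript
// Hypothetical sketch of the blocking vs. non-blocking listener patterns.
// `runInference` stands in for the real LLM call; names are illustrative.

type DiscordMessage = { channelId: string; content: string };

// Stand-in for a long-running LLM inference.
async function runInference(msg: DiscordMessage): Promise<string> {
  return `reply to: ${msg.content}`;
}

const pending: DiscordMessage[] = [];

// Blocking pattern (what the slow-listener logs suggest): the event
// queue waits here, potentially for hundreds of seconds, before the
// next MESSAGE_CREATE can be dispatched.
async function onMessageBlocking(msg: DiscordMessage): Promise<void> {
  const reply = await runInference(msg); // queue stalled during inference
  console.log(reply);
}

// Non-blocking pattern: enqueue and return in microseconds; a separate
// worker loop drains the queue without holding up event dispatch.
function onMessageAsync(msg: DiscordMessage): void {
  pending.push(msg);
}

async function worker(): Promise<void> {
  while (pending.length > 0) {
    const msg = pending.shift()!;
    const reply = await runInference(msg);
    console.log(reply);
  }
}
```

With the second pattern, health-monitor would keep seeing listener activity even while inference is in flight.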
3. DiscordReactionListener slow warnings
Additionally, DiscordReactionListener and DiscordReactionRemoveListener regularly take between 1 and 2.4 seconds per event:
[EventQueue] Slow listener detected: DiscordReactionListener took 2398ms for MESSAGE_REACTION_ADD
[EventQueue] Slow listener detected: DiscordReactionRemoveListener took 1580ms for MESSAGE_REACTION_REMOVE
On the Ubuntu host, this produced 1,025 slow-listener warnings in 4 hours (~4 per minute).
4. Agent impact: "Gateway is draining"
When health-monitor restarts a connection, any agent trying to use that Discord connection gets:
Gateway is draining, please try again later
This disrupts agent workflows mid-task.
Expected behavior
- LLM inference within Discord message handling should not block the event queue (async processing)
- Health-monitor should distinguish between "connection dead" and "connection busy processing a long-running request"
- The grace window added in v2026.3.2 (fix(gateway): resolve message tool Unknown channel and health-monitor false-positive stuck detection #32367) helps but doesn't fully solve the issue when inference takes 48-483 seconds
Possible solutions
- Process LLM inference asynchronously - decouple the Discord event listener from the LLM response pipeline so the event queue stays responsive
- Extend health-monitor awareness - if a session is actively processing (LLM inference in progress), don't mark the connection as stuck
- Configurable stuck timeout - allow operators to set a longer threshold for multi-agent setups where long inferences are expected
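The second and third suggestions could be combined into a single check. A minimal sketch, assuming hypothetical names (OpenClaw's real monitor state will differ): the monitor only flags a connection as stuck when it has been idle past an operator-tunable threshold and no long-running request is in flight.

```typescript
// Illustrative sketch only; field and type names are assumptions.

interface ConnectionState {
  lastEventAt: number;        // ms timestamp of last gateway activity
  inferenceInFlight: boolean; // set while an agent's LLM call is running
}

interface MonitorConfig {
  stuckTimeoutMs: number;     // operator-configurable threshold
}

function isStuck(
  conn: ConnectionState,
  cfg: MonitorConfig,
  now: number,
): boolean {
  // A connection busy with a long-running request is never "stuck".
  if (conn.inferenceInFlight) return false;
  // Otherwise, only idle time past the configured threshold counts.
  return now - conn.lastEventAt > cfg.stuckTimeoutMs;
}
```

Under this check, a 483-second inference would not trigger a restart as long as the busy flag is set, while a genuinely dead connection would still be caught once the timeout elapses.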
Additional context
- The v2026.3.2 changelog mentions: "prevent health-monitor restart thrash for channels that just (re)started by adding a per-channel startup-connect grace window" (fix(gateway): resolve message tool Unknown channel and health-monitor false-positive stuck detection #32367) - this helps post-restart but doesn't prevent the initial "stuck" detection during long inferences
- Both hosts run the same version (2026.3.2) with identical behavior
- 56 defunct (zombie) processes have accumulated on the Ubuntu host from repeated restarts (the oldest dating to Feb 27)
- Potentially related: the zombie process accumulation suggests the restart flow doesn't fully clean up child processes