Skip to content

fix(gateway): skip stale-socket detection for Discord channels#58404

Open
fanyangCS wants to merge 3 commits intoopenclaw:mainfrom
fanyangCS:fix/stale-socket-discord-exclusion
Open

fix(gateway): skip stale-socket detection for Discord channels#58404
fanyangCS wants to merge 3 commits intoopenclaw:mainfrom
fanyangCS:fix/stale-socket-discord-exclusion

Conversation

@fanyangCS
Copy link
Copy Markdown

Summary

Exclude Discord from stale-socket detection in the channel health monitor, using the same pattern as the existing Telegram exclusion.

Root Cause

Stale-socket detection (PR #30153) was designed for Slack but applies generically. Discord (Carbon) is affected because Carbon's heartbeat ACK does not update lastEventAt — only WebSocket open/close/reconnect events do. Idle Discord channels exceed the 30-minute threshold and are misclassified as stale. Observed: 23+ false-positive restarts/day at ~35-minute intervals.

Why This Fix Is Safe

  • Discord (Carbon): connected is maintained by transport-level heartbeat ACK. Dead connections set connected=false, caught by the existing disconnected check. No known failure mode where heartbeat succeeds but dispatch stops.
  • Slack: Unaffected — Slack is not excluded and retains stale-socket detection, which it needs because its connected flag is lifecycle-based, not heartbeat-driven.

Changes

  • channel-health-policy.ts: Add policy.channelId !== "discord" to the stale-socket guard condition
  • channel-health-policy.test.ts: Update existing stale-socket test to use Slack, add new test confirming Discord is excluded

Tests

  • channel-health-policy.test.ts: 15/15 pass ✅
  • channel-health-monitor.test.ts: 30/30 pass ✅
  • Pre-commit hooks (lint, type check): pass ✅

Fixes #58339

Discord (Carbon) maintains its connected state via transport-level
heartbeat ACK. When the WebSocket dies, Carbon sets connected=false,
which is already caught by the disconnected check. Idle channels with
no user activity do not indicate a stale socket.

The stale-socket check was designed for Slack (PR openclaw#30153) where the
connected flag is lifecycle-based and can remain true during zombie
state. Discord is excluded using the same pattern as Telegram.

Observed: 23+ false-positive restarts/day at ~35min intervals on idle
Discord channels.

Fixes openclaw#58339
@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime size: XS labels Mar 31, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 31, 2026

Greptile Summary

This PR excludes Discord channels from the stale-socket health check in channel-health-policy.ts, using the same pattern as the existing Telegram exclusion. The fix correctly addresses 23+ false-positive daily restarts caused by Carbon's heartbeat ACK not updating lastEventAt.

  • channel-health-policy.ts: Adds policy.channelId !== "discord" to the stale-socket guard. The fix is sound — Discord dead connections are already caught by the connected === false check above, so skipping the event-age threshold for Discord does not introduce a blindspot.
  • channel-health-policy.test.ts: Correctly migrates the existing stale-socket test to use "slack" and adds a new Discord exclusion test. Minor quality concern: two pre-existing tests (webhook mode and null-lastEventAt) still use evaluateDiscordHealth's default channelId: "discord", so they now pass via the new Discord exclusion rather than the guards they were designed to exercise.

Confidence Score: 5/5

Safe to merge — the production fix is correct and the only finding is a P2 test quality issue.

The implementation change is minimal, well-motivated, and follows an established exclusion pattern. All 15 policy tests and 30 monitor tests pass. The sole comment is P2: two existing tests silently exercise the wrong guard after the Discord exclusion lands, but this does not affect runtime correctness.

Minor test quality issue in src/gateway/channel-health-policy.test.ts — webhook-mode and null-lastEventAt tests should use "slack" instead of the "discord" default to validate their intended guards.

Important Files Changed

Filename Overview
src/gateway/channel-health-policy.ts Adds policy.channelId !== "discord" to the stale-socket guard, mirroring the existing Telegram exclusion. Logic is correct — Discord dead sockets are caught by connected === false.
src/gateway/channel-health-policy.test.ts Correctly updates the existing stale-socket test to use Slack and adds a new Discord exclusion test. Minor issue: two existing tests that use the evaluateDiscordHealth helper now pass via the Discord exclusion rather than the guards they were designed to cover.

Comments Outside Diff (1)

  1. src/gateway/channel-health-policy.test.ts, line 179-202 (link)

    P2 Webhook and null-lastEventAt tests now pass for the wrong reason

    Both of these tests use the evaluateDiscordHealth helper, which defaults to channelId: "discord". After this fix, Discord is excluded from stale-socket detection unconditionally, so these tests succeed because of the new channelId !== "discord" short-circuit — not because of the mode !== "webhook" or lastEventAt != null guards they were written to cover.

    If either of those guards were accidentally deleted or broken, these tests would still pass (they silently stopped validating what their names say). Consider using a non-excluded channel ID (e.g. "slack") for these two cases:

    // "skips stale-socket detection for channels in webhook mode"
    const evaluation = evaluateDiscordHealth(
      { running: true, connected: true, enabled: true, configured: true, lastStartAt: 0, lastEventAt: 0, mode: "webhook" },
      100_000,
      "slack",   // ← use a channel that isn't Discord-excluded
    );
    
    // "does not flag stale sockets for channels without event tracking"
    const evaluation = evaluateDiscordHealth(
      { running: true, connected: true, enabled: true, configured: true, lastStartAt: 0, lastEventAt: null },
      100_000,
      "slack",
    );

    This also applies to any future test added via evaluateDiscordHealth that intends to test a path other than the Discord exclusion itself.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/gateway/channel-health-policy.test.ts
    Line: 179-202
    
    Comment:
    **Webhook and null-`lastEventAt` tests now pass for the wrong reason**
    
    Both of these tests use the `evaluateDiscordHealth` helper, which defaults to `channelId: "discord"`. After this fix, Discord is excluded from stale-socket detection unconditionally, so these tests succeed because of the new `channelId !== "discord"` short-circuit — not because of the `mode !== "webhook"` or `lastEventAt != null` guards they were written to cover.
    
    If either of those guards were accidentally deleted or broken, these tests would still pass (they silently stopped validating what their names say). Consider using a non-excluded channel ID (e.g. `"slack"`) for these two cases:
    
    ```ts
    // "skips stale-socket detection for channels in webhook mode"
    const evaluation = evaluateDiscordHealth(
      { running: true, connected: true, enabled: true, configured: true, lastStartAt: 0, lastEventAt: 0, mode: "webhook" },
      100_000,
      "slack",   // ← use a channel that isn't Discord-excluded
    );
    
    // "does not flag stale sockets for channels without event tracking"
    const evaluation = evaluateDiscordHealth(
      { running: true, connected: true, enabled: true, configured: true, lastStartAt: 0, lastEventAt: null },
      100_000,
      "slack",
    );
    ```
    
    This also applies to any future test added via `evaluateDiscordHealth` that intends to test a path other than the Discord exclusion itself.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/gateway/channel-health-policy.test.ts
Line: 179-202

Comment:
**Webhook and null-`lastEventAt` tests now pass for the wrong reason**

Both of these tests use the `evaluateDiscordHealth` helper, which defaults to `channelId: "discord"`. After this fix, Discord is excluded from stale-socket detection unconditionally, so these tests succeed because of the new `channelId !== "discord"` short-circuit — not because of the `mode !== "webhook"` or `lastEventAt != null` guards they were written to cover.

If either of those guards were accidentally deleted or broken, these tests would still pass (they silently stopped validating what their names say). Consider using a non-excluded channel ID (e.g. `"slack"`) for these two cases:

```ts
// "skips stale-socket detection for channels in webhook mode"
const evaluation = evaluateDiscordHealth(
  { running: true, connected: true, enabled: true, configured: true, lastStartAt: 0, lastEventAt: 0, mode: "webhook" },
  100_000,
  "slack",   // ← use a channel that isn't Discord-excluded
);

// "does not flag stale sockets for channels without event tracking"
const evaluation = evaluateDiscordHealth(
  { running: true, connected: true, enabled: true, configured: true, lastStartAt: 0, lastEventAt: null },
  100_000,
  "slack",
);
```

This also applies to any future test added via `evaluateDiscordHealth` that intends to test a path other than the Discord exclusion itself.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(gateway): skip stale-socket detectio..." | Re-trigger Greptile

@fanyangCS
Copy link
Copy Markdown
Author

cc @steipete @Takhoffman — requesting review. Tak, you authored the connected === true guard in #38643, so you have the most context on this code path.

@fanyangCS fanyangCS force-pushed the fix/stale-socket-discord-exclusion branch 3 times, most recently from 0a8a8cc to 32a0b33 Compare April 1, 2026 00:32
The helper is used by tests that exercise stale-socket guards. Rename
to evaluateSlackHealth with default channelId 'slack' so these tests
reach the stale-socket branch instead of being short-circuited by the
Discord exclusion.

Addresses Greptile review feedback.
@fanyangCS
Copy link
Copy Markdown
Author

@jacobtomlinson would you be able to review this when you get a chance? Small fix — excludes Discord from stale-socket detection (same pattern as the existing Telegram exclusion). Greptile gave it 5/5 confidence.

@SonicBotMan

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Health monitor false-positive stale-socket restarts on idle Discord channels

2 participants