fix(gateway): skip stale-socket detection for Discord channels#58404
fix(gateway): skip stale-socket detection for Discord channels#58404fanyangCS wants to merge 3 commits intoopenclaw:mainfrom
Conversation
Discord (Carbon) maintains its connected state via transport-level heartbeat ACK. When the WebSocket dies, Carbon sets connected=false, which is already caught by the disconnected check. Idle channels with no user activity do not indicate a stale socket. The stale-socket check was designed for Slack (PR openclaw#30153) where the connected flag is lifecycle-based and can remain true during zombie state. Discord is excluded using the same pattern as Telegram. Observed: 23+ false-positive restarts/day at ~35min intervals on idle Discord channels. Fixes openclaw#58339
Greptile SummaryThis PR excludes Discord channels from the stale-socket health check in
Confidence Score: 5/5Safe to merge — the production fix is correct and the only finding is a P2 test quality issue. The implementation change is minimal, well-motivated, and follows an established exclusion pattern. All 15 policy tests and 30 monitor tests pass. The sole comment is P2: two existing tests silently exercise the wrong guard after the Discord exclusion lands, but this does not affect runtime correctness. Minor test quality issue in
|
| Filename | Overview |
|---|---|
| src/gateway/channel-health-policy.ts | Adds policy.channelId !== "discord" to the stale-socket guard, mirroring the existing Telegram exclusion. Logic is correct — Discord dead sockets are caught by connected === false. |
| src/gateway/channel-health-policy.test.ts | Correctly updates the existing stale-socket test to use Slack and adds a new Discord exclusion test. Minor issue: two existing tests that use the evaluateDiscordHealth helper now pass via the Discord exclusion rather than the guards they were designed to cover. |
Comments Outside Diff (1)
-
src/gateway/channel-health-policy.test.ts, line 179-202 (link)Webhook and null-
lastEventAttests now pass for the wrong reasonBoth of these tests use the
evaluateDiscordHealthhelper, which defaults tochannelId: "discord". After this fix, Discord is excluded from stale-socket detection unconditionally, so these tests succeed because of the newchannelId !== "discord"short-circuit — not because of themode !== "webhook"orlastEventAt != nullguards they were written to cover.If either of those guards were accidentally deleted or broken, these tests would still pass (they silently stopped validating what their names say). Consider using a non-excluded channel ID (e.g.
"slack") for these two cases:// "skips stale-socket detection for channels in webhook mode" const evaluation = evaluateDiscordHealth( { running: true, connected: true, enabled: true, configured: true, lastStartAt: 0, lastEventAt: 0, mode: "webhook" }, 100_000, "slack", // ← use a channel that isn't Discord-excluded ); // "does not flag stale sockets for channels without event tracking" const evaluation = evaluateDiscordHealth( { running: true, connected: true, enabled: true, configured: true, lastStartAt: 0, lastEventAt: null }, 100_000, "slack", );
This also applies to any future test added via
evaluateDiscordHealththat intends to test a path other than the Discord exclusion itself.Prompt To Fix With AI
This is a comment left during a code review. Path: src/gateway/channel-health-policy.test.ts Line: 179-202 Comment: **Webhook and null-`lastEventAt` tests now pass for the wrong reason** Both of these tests use the `evaluateDiscordHealth` helper, which defaults to `channelId: "discord"`. After this fix, Discord is excluded from stale-socket detection unconditionally, so these tests succeed because of the new `channelId !== "discord"` short-circuit — not because of the `mode !== "webhook"` or `lastEventAt != null` guards they were written to cover. If either of those guards were accidentally deleted or broken, these tests would still pass (they silently stopped validating what their names say). Consider using a non-excluded channel ID (e.g. `"slack"`) for these two cases: ```ts // "skips stale-socket detection for channels in webhook mode" const evaluation = evaluateDiscordHealth( { running: true, connected: true, enabled: true, configured: true, lastStartAt: 0, lastEventAt: 0, mode: "webhook" }, 100_000, "slack", // ← use a channel that isn't Discord-excluded ); // "does not flag stale sockets for channels without event tracking" const evaluation = evaluateDiscordHealth( { running: true, connected: true, enabled: true, configured: true, lastStartAt: 0, lastEventAt: null }, 100_000, "slack", ); ``` This also applies to any future test added via `evaluateDiscordHealth` that intends to test a path other than the Discord exclusion itself. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/gateway/channel-health-policy.test.ts
Line: 179-202
Comment:
**Webhook and null-`lastEventAt` tests now pass for the wrong reason**
Both of these tests use the `evaluateDiscordHealth` helper, which defaults to `channelId: "discord"`. After this fix, Discord is excluded from stale-socket detection unconditionally, so these tests succeed because of the new `channelId !== "discord"` short-circuit — not because of the `mode !== "webhook"` or `lastEventAt != null` guards they were written to cover.
If either of those guards were accidentally deleted or broken, these tests would still pass (they silently stopped validating what their names say). Consider using a non-excluded channel ID (e.g. `"slack"`) for these two cases:
```ts
// "skips stale-socket detection for channels in webhook mode"
const evaluation = evaluateDiscordHealth(
{ running: true, connected: true, enabled: true, configured: true, lastStartAt: 0, lastEventAt: 0, mode: "webhook" },
100_000,
"slack", // ← use a channel that isn't Discord-excluded
);
// "does not flag stale sockets for channels without event tracking"
const evaluation = evaluateDiscordHealth(
{ running: true, connected: true, enabled: true, configured: true, lastStartAt: 0, lastEventAt: null },
100_000,
"slack",
);
```
This also applies to any future test added via `evaluateDiscordHealth` that intends to test a path other than the Discord exclusion itself.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(gateway): skip stale-socket detectio..." | Re-trigger Greptile
|
cc @steipete @Takhoffman — requesting review. Tak, you authored the |
0a8a8cc to
32a0b33
Compare
The helper is used by tests that exercise stale-socket guards. Rename to evaluateSlackHealth with default channelId 'slack' so these tests reach the stale-socket branch instead of being short-circuited by the Discord exclusion. Addresses Greptile review feedback.
|
@jacobtomlinson would you be able to review this when you get a chance? Small fix — excludes Discord from stale-socket detection (same pattern as the existing Telegram exclusion). Greptile gave it 5/5 confidence. |
Summary
Exclude Discord from stale-socket detection in the channel health monitor, using the same pattern as the existing Telegram exclusion.
Root Cause
Stale-socket detection (PR #30153) was designed for Slack but applies generically. Discord (Carbon) is affected because Carbon's heartbeat ACK does not update
lastEventAt— only WebSocket open/close/reconnect events do. Idle Discord channels exceed the 30-minute threshold and are misclassified as stale. Observed: 23+ false-positive restarts/day at ~35-minute intervals.Why This Fix Is Safe
connectedis maintained by transport-level heartbeat ACK. Dead connections setconnected=false, caught by the existingdisconnectedcheck. No known failure mode where heartbeat succeeds but dispatch stops.connectedflag is lifecycle-based, not heartbeat-driven.Changes
channel-health-policy.ts: Addpolicy.channelId !== "discord"to the stale-socket guard conditionchannel-health-policy.test.ts: Update existing stale-socket test to use Slack, add new test confirming Discord is excludedTests
channel-health-policy.test.ts: 15/15 pass ✅channel-health-monitor.test.ts: 30/30 pass ✅Fixes #58339