
fix(gateway): distinguish disconnected from stuck in health-monitor restart reason #36436

Merged
steipete merged 2 commits into openclaw:main from Sid-Qin:fix/36404-health-monitor-disconnected-reason
Mar 8, 2026

Conversation

Sid-Qin commented Mar 5, 2026

Summary

  • Problem: resolveChannelRestartReason does not handle the "disconnected" evaluation reason explicitly, so it falls through to "stuck". That conflates a dropped WebSocket connection (e.g. Discord close code 1006, an abnormal closure) with a genuinely stalled channel.
  • Why it matters: Operators diagnosing multi-bot restart storms see reason: stuck in logs for bots that simply lost their WebSocket connection. This makes triage harder and prevents future policy logic from treating disconnects differently (e.g. faster reconnect, skip cooldown).
  • What changed: Added "disconnected" to the ChannelRestartReason union type and added an explicit branch in resolveChannelRestartReason that returns "disconnected" when the evaluation reason is "disconnected". Added a corresponding test case.
  • What did NOT change: No changes to evaluateChannelHealth, health-monitor scheduling, restart behavior, or any channel provider code. The restart still happens — only the logged/reported reason is now accurate.
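For readers without the diff open, the change described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/gateway/channel-health-policy.ts: the snapshot and evaluation shapes are inferred from the test case in this PR, and the "stopped" member and the not-running mapping are assumptions made only so the sketch compiles.

```typescript
// Hypothetical, simplified shapes inferred from the PR's test case.
// The real types may carry more fields and more reasons.
type ChannelHealthSnapshot = {
  running: boolean;
  connected: boolean;
  enabled: boolean;
  configured: boolean;
};

type ChannelHealthEvaluation = {
  healthy: boolean;
  // Assumed set of evaluation reasons; only "disconnected" and "stuck"
  // are spelled out as string literals in the PR text.
  reason: "healthy" | "not-running" | "stale-socket" | "disconnected" | "stuck";
};

// "disconnected" is the new union member. "stopped" is an assumed
// placeholder for whatever the not-running path actually returns.
type ChannelRestartReason = "stopped" | "stale-socket" | "disconnected" | "stuck";

function resolveChannelRestartReason(
  _snapshot: ChannelHealthSnapshot,
  evaluation: ChannelHealthEvaluation,
): ChannelRestartReason {
  if (evaluation.reason === "stale-socket") return "stale-socket";
  if (evaluation.reason === "not-running") return "stopped"; // assumed mapping
  // The explicit branch described in this PR: without it, a
  // "disconnected" evaluation would reach the catch-all below.
  if (evaluation.reason === "disconnected") return "disconnected";
  return "stuck"; // catch-all, unchanged
}
```

The ordering matters only in that the new branch has to sit before the catch-all return; the other branches are untouched.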

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Health-monitor log messages now show reason: disconnected instead of reason: stuck when a channel's WebSocket connection drops while the bot is otherwise running normally.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS 26.3 (arm64)
  • Runtime: Node.js 22
  • Integration/channel: Discord (any channel with WebSocket health monitoring)

Steps

  1. Configure a Discord bot account
  2. Simulate a WebSocket 1006 disconnect (or wait for a natural drop)
  3. Observe health-monitor restart logs

Expected

  • Log shows reason: disconnected

Actual

  • Before fix: log shows reason: stuck
  • After fix: log shows reason: disconnected

Evidence

TypeScript compiles cleanly. All 8 tests in channel-health-policy.test.ts pass:

 ✓ src/gateway/channel-health-policy.test.ts (8 tests) 2ms
   Tests  8 passed (8)

New test case:

it("maps disconnected to disconnected instead of stuck", () => {
  const reason = resolveChannelRestartReason(
    { running: true, connected: false, enabled: true, configured: true },
    { healthy: false, reason: "disconnected" },
  );
  expect(reason).toBe("disconnected");
});

Human Verification (required)

  • Verified scenarios: disconnected evaluation now returns "disconnected" restart reason; stuck evaluation still returns "stuck"; not-running and stale-socket paths unchanged.
  • Edge cases checked: All 7 existing tests continue to pass, confirming no regression in the other evaluation→reason mappings.
  • What I did not verify: Live multi-bot Discord restart storm scenario (requires 6+ bots and a network partition).

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable/revert: Revert this commit; the fallthrough to "stuck" is restored.
  • Files/config to restore: src/gateway/channel-health-policy.ts
  • Known bad symptoms: None expected — this only changes a log/reason string, not restart behavior.

Risks and Mitigations

  • Risk: Downstream code may pattern-match on ChannelRestartReason values. Mitigation: Grep confirmed reason is only used in log messages (channel-health-monitor.ts line 153), not in conditional logic.
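If downstream code does start branching on the reason later, one way to make new union members surface at compile time is an exhaustive switch. The sketch below is illustrative only: RestartReasonSketch mirrors the assumed shape of ChannelRestartReason, and describeRestartReason and assertNever are hypothetical helpers, not functions from this repository.

```typescript
// Hypothetical stand-in for ChannelRestartReason; member names other than
// "disconnected" and "stuck" are assumptions.
type RestartReasonSketch = "stopped" | "stale-socket" | "disconnected" | "stuck";

// If a case is missing below, `value` is no longer of type `never`
// and the call site fails to type-check.
function assertNever(value: never): never {
  throw new Error(`Unhandled restart reason: ${String(value)}`);
}

// Hypothetical consumer that branches on the reason.
function describeRestartReason(reason: RestartReasonSketch): string {
  switch (reason) {
    case "stopped":
      return "channel was not running";
    case "stale-socket":
      return "socket went stale";
    case "disconnected":
      return "connection dropped while the channel was still running";
    case "stuck":
      return "channel stalled";
    default:
      return assertNever(reason);
  }
}
```

With this pattern, adding a member such as "disconnected" to the union without adding a matching case turns into a type error rather than a silent fallthrough.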

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR fixes a logging accuracy issue in the health-monitor restart path. The function resolveChannelRestartReason was not explicitly handling the "disconnected" evaluation reason, causing it to fall through to "stuck". This conflated a dropped WebSocket connection (e.g. Discord close code 1006, an abnormal closure) with a genuinely stalled channel.

Changes:

  • Added "disconnected" to the ChannelRestartReason union type
  • Added an explicit 3-line conditional in resolveChannelRestartReason that returns "disconnected" when appropriate
  • Added a test case verifying the fix

Verification:

  • Confirmed reason is used only in a single log message (channel-health-monitor.ts) — no conditional logic depends on this value
  • Verified "disconnected" was already being produced by evaluateChannelHealth but was incorrectly mapped to "stuck"
  • The change is minimal, surgical, and introduces zero behavioral risk — only improves logging accuracy
  • No changes to restart behavior, health evaluation logic, or channel provider code

Confidence Score: 5/5

  • This PR is safe to merge — it only corrects a logged reason string with no change to restart behavior or conditional logic.
  • The change is minimal and surgical: one new union member, one three-line branch in a function, and one focused test case. The reason value is consumed exclusively in a log message and nowhere else in the codebase. Adding a new ChannelRestartReason value carries zero behavioral risk. Existing logic paths remain unchanged. The test directly verifies the fixed behavior.
  • No files require special attention.

Last reviewed commit: be13e1f

@steipete steipete reopened this Mar 8, 2026
@steipete steipete force-pushed the fix/36404-health-monitor-disconnected-reason branch from be13e1f to 8cbb14f Compare March 8, 2026 02:01
@openclaw-barnacle openclaw-barnacle bot added the cli CLI command changes label Mar 8, 2026
steipete added a commit to Sid-Qin/openclaw that referenced this pull request Mar 8, 2026
SidQin-cyber and others added 2 commits March 8, 2026 02:01
…estart reason

resolveChannelRestartReason did not handle the "disconnected" evaluation
reason explicitly, so it fell through to "stuck". This conflates a clean
WebSocket drop (e.g. Discord 1006) with a genuinely stuck channel, making
logs misleading and preventing future policy differentiation.

Add "disconnected" to ChannelRestartReason and handle it before the
catch-all "stuck" return.

Closes openclaw#36404
@steipete steipete force-pushed the fix/36404-health-monitor-disconnected-reason branch from 8cbb14f to c8e859e Compare March 8, 2026 02:02
@steipete steipete merged commit 1e05f14 into openclaw:main Mar 8, 2026
@openclaw-barnacle openclaw-barnacle bot removed the cli CLI command changes label Mar 8, 2026
steipete commented Mar 8, 2026

Landed via temp rebase onto main.

Thanks @Sid-Qin!

openperf pushed a commit to openperf/moltbot that referenced this pull request Mar 8, 2026
mcaxtr pushed a commit to mcaxtr/openclaw that referenced this pull request Mar 8, 2026
Saitop pushed a commit to NomiciAI/openclaw that referenced this pull request Mar 8, 2026
GordonSH-oss pushed a commit to GordonSH-oss/openclaw that referenced this pull request Mar 9, 2026
jenawant pushed a commit to jenawant/openclaw that referenced this pull request Mar 10, 2026
dhoman pushed a commit to dhoman/chrono-claw that referenced this pull request Mar 11, 2026
senw-developers pushed a commit to senw-developers/va-openclaw that referenced this pull request Mar 17, 2026
V-Gutierrez pushed a commit to V-Gutierrez/openclaw-vendor that referenced this pull request Mar 17, 2026
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 22, 2026
alexey-pelykh added a commit to remoteclaw/remoteclaw that referenced this pull request Mar 22, 2026
…1796)

* fix(ci): stabilize detect-secrets baseline

(cherry picked from commit 08597e8)

* fix(gateway): distinguish disconnected from stuck in health-monitor restart reason

resolveChannelRestartReason did not handle the "disconnected" evaluation
reason explicitly, so it fell through to "stuck". This conflates a clean
WebSocket drop (e.g. Discord 1006) with a genuinely stuck channel, making
logs misleading and preventing future policy differentiation.

Add "disconnected" to ChannelRestartReason and handle it before the
catch-all "stuck" return.

Closes openclaw#36404

(cherry picked from commit 066d589)

* fix: land health-monitor disconnected reason label (openclaw#36436) (thanks @Sid-Qin)

(cherry picked from commit 1e05f14)

* fix: restore Telegram webhook-mode health after restarts

Landed from contributor PR openclaw#39313 by @fellanH.

Co-authored-by: Felix Hellström <[email protected]>
(cherry picked from commit 9d7d961)

* fix(chat): preserve sender labels in dashboard history

(cherry picked from commit 930caea)

* refactor(channels): share native command session targets

(cherry picked from commit e381ab6)

---------

Co-authored-by: Peter Steinberger <[email protected]>
Co-authored-by: SidQin-cyber <[email protected]>
Co-authored-by: Felix Hellström <[email protected]>
Co-authored-by: Ayaan Zaidi <[email protected]>

Labels

gateway (Gateway runtime), size: XS

Development

Successfully merging this pull request may close these issues.

Discord health-monitor restart storm: 'disconnected' classified as 'stuck', no reconnect config parity with WhatsApp

3 participants