fix(gateway): health check always times out when lsof is not installed#32613

Closed
riftzen-bit wants to merge 1 commit into openclaw:main from riftzen-bit:fix/restart-health-no-lsof

Conversation

@riftzen-bit riftzen-bit commented Mar 3, 2026

Summary

Severity: HIGH — This is a critical bug that causes openclaw gateway restart to always time out after 60 seconds on any Linux system without lsof installed (common on minimal/container installs, Arch Linux, Alpine, etc.).

Root Cause

In inspectGatewayRestart(), the ownsPort check when runtimePid is known only verifies ownership via portUsage.listeners.some(listenerOwnedByRuntimePid). When lsof is not installed, inspectPortUsage() falls back to checkPortInUse() (which tries to bind and gets EADDRINUSE), returning { status: "busy", listeners: [] }. Since .some() on an empty array always returns false, ownsPort is always false → healthy is always false → the health-check loop polls for 120 attempts × 500ms = 60s and then reports a timeout error, even though the gateway is running and healthy.

The runtimePid == null branch already had the correct fallback (portUsage.status === "busy" && portUsage.listeners.length === 0), but the runtimePid != null branch was missing it.

Fix

Add the same (status === "busy" && listeners.length === 0) fallback to the runtimePid != null branch, so a running service with a known PID is treated as the port owner when listener enumeration is unavailable.
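
For reviewers, the before/after behaviour of the PID-known branch can be sketched roughly as below. This is an illustration under assumptions, not the code in restart-health.ts: the `Listener` and `PortUsage` shapes, the PID comparison standing in for listenerOwnedByRuntimePid, and the helper names `enumerationUnavailable`, `ownsPortBefore` and `ownsPortAfter` are all hypothetical.

```typescript
// Hypothetical shapes, reduced to the fields the PR description mentions.
type Listener = { pid?: number };
type PortUsage = { status: "free" | "busy"; listeners: Listener[] };

// True when the port is occupied but no listeners could be enumerated,
// which is what the description reports when lsof is missing.
function enumerationUnavailable(usage: PortUsage): boolean {
  return usage.status === "busy" && usage.listeners.length === 0;
}

// Before the fix (PID-known branch): ownership requires a matching listener.
// The plain PID comparison is a stand-in for listenerOwnedByRuntimePid.
function ownsPortBefore(usage: PortUsage, runtimePid: number): boolean {
  return usage.listeners.some((l) => l.pid === runtimePid);
}

// After the fix: the same check plus the empty-listeners fallback.
function ownsPortAfter(usage: PortUsage, runtimePid: number): boolean {
  return ownsPortBefore(usage, runtimePid) || enumerationUnavailable(usage);
}

const noLsof: PortUsage = { status: "busy", listeners: [] };
console.log(ownsPortBefore(noLsof, 4242)); // false: .some() on an empty array
console.log(ownsPortAfter(noLsof, 4242)); // true: fallback applies
```

The fallback only fires when the listener list is completely empty, so a listener that belongs to a different PID still yields `false`.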

Impact

  • Every openclaw gateway restart on a system without lsof was broken — always timing out after 60s
  • The gateway was actually running fine; only the health check reporting was wrong
  • Affects Linux distros that don't ship lsof by default (Arch, Alpine, minimal Debian/Ubuntu, containers)

Test plan

  • Added test: "treats port as owned when runtime pid is known but listeners are empty (e.g. lsof missing)"
  • All 7 existing tests pass (no regression)
  • pnpm check passes (lint + format + typecheck)
  • Verified edge cases:
    • lsof installed + listener matches runtime PID → healthy ✓
    • lsof installed + listener does NOT match → unhealthy + stale PID detected ✓
    • lsof missing + runtime running + port busy → healthy ✓ (fixed)
    • lsof missing + runtime NOT running → unhealthy ✓
    • port free → unhealthy ✓
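
The edge cases above can be expressed as a small table-driven check. This is a sketch only: `isHealthy` is a hypothetical helper, it assumes healthy means "runtime running and port owned", and it leaves out stale-PID detection.

```typescript
type PortUsage = { status: "free" | "busy"; listeners: { pid?: number }[] };

// Assumption: healthy = runtime is running AND it owns the port.
function isHealthy(running: boolean, runtimePid: number, usage: PortUsage): boolean {
  const ownsPort =
    usage.listeners.some((l) => l.pid === runtimePid) ||
    (usage.status === "busy" && usage.listeners.length === 0);
  return running && ownsPort;
}

const PID = 4242; // arbitrary runtime PID for illustration

const cases: { name: string; running: boolean; usage: PortUsage; expected: boolean }[] = [
  { name: "lsof installed, listener matches", running: true,
    usage: { status: "busy", listeners: [{ pid: PID }] }, expected: true },
  { name: "lsof installed, listener does not match", running: true,
    usage: { status: "busy", listeners: [{ pid: 1 }] }, expected: false },
  { name: "lsof missing, runtime running, port busy", running: true,
    usage: { status: "busy", listeners: [] }, expected: true },
  { name: "lsof missing, runtime not running", running: false,
    usage: { status: "busy", listeners: [] }, expected: false },
  { name: "port free", running: true,
    usage: { status: "free", listeners: [] }, expected: false },
];

for (const c of cases) {
  const actual = isHealthy(c.running, PID, c.usage);
  console.log(`${actual === c.expected ? "ok  " : "FAIL"} ${c.name}`);
}
```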

The `ownsPort` check in `inspectGatewayRestart` only verified port
ownership via `portUsage.listeners.some(...)` when `runtimePid` was
known.  When `lsof` is not installed, `inspectPortUsage` returns
`{ status: "busy", listeners: [] }` because `checkPortInUse` detects
the port is occupied but cannot enumerate listeners.  `.some()` on an
empty array always returns `false`, so `ownsPort` was always `false`,
causing the health-check loop to spin for the full 60 s timeout on
every restart.

Add the same `(status === "busy" && listeners.length === 0)` fallback
that already existed in the `runtimePid == null` branch so that a
running service with a known PID is treated as the port owner when
listener enumeration is unavailable.
Copilot AI review requested due to automatic review settings March 3, 2026 05:50
@openclaw-barnacle openclaw-barnacle bot added cli CLI command changes size: XS labels Mar 3, 2026
greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR fixes a critical bug in inspectGatewayRestart() where the health check always timed out (60 seconds) on Linux systems without lsof installed. The root cause was an asymmetry in the ownsPort logic: the runtimePid == null branch already included a (status === "busy" && listeners.length === 0) fallback for when listener enumeration is unavailable, but the runtimePid != null branch was missing the same guard — causing .some() on an empty array to always return false.

Changes:

  • restart-health.ts: Adds the (portUsage.status === "busy" && portUsage.listeners.length === 0) fallback to the runtimePid != null branch of ownsPort, making both branches symmetric and correct.
  • restart-health.test.ts: Adds a dedicated regression test that asserts healthy === true when the runtime is running with a known PID but listener enumeration returns an empty array (e.g., lsof not installed).

The fix is minimal, well-scoped, and has no negative impact on platforms where lsof is available — the fallback only activates when listeners is completely empty.

Confidence Score: 5/5

  • This PR is safe to merge — it is a minimal, targeted bug fix with a clear regression test and no side effects on platforms where lsof is available.
  • The change is a one-liner addition that mirrors an already-correct sibling branch, so the risk of introducing new bugs is very low. The new test directly covers the fixed scenario, and all five edge cases described in the PR are verified. No existing tests were changed.
  • No files require special attention.

Last reviewed commit: 45fc88c

Copilot AI left a comment

Pull request overview

Fixes a restart health-check false-negative in the gateway CLI when process listener enumeration is unavailable (notably when lsof isn’t installed on Linux), which previously caused openclaw gateway restart to always time out despite a healthy running gateway.

Changes:

  • Extend the runtimePid != null ownership check to fall back to “port is busy and listeners are empty” when listener enumeration can’t be performed.
  • Add a regression test covering the lsof-missing / empty-listeners scenario with a known runtime PID.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/cli/daemon-cli/restart-health.ts | Aligns port-ownership fallback logic across PID-known and PID-unknown branches to avoid restart health-check timeouts when listeners can’t be enumerated. |
| src/cli/daemon-cli/restart-health.test.ts | Adds a test ensuring healthy === true when runtime PID is known, runtime is running, port is busy, and listener list is empty (e.g., lsof missing). |

Labels

cli CLI command changes size: XS

3 participants