CLI: avoid false update restart failures without listener attribution by vincentkoc · Pull Request #39508 · openclaw/openclaw

vincentkoc · 2026-03-08T05:08:14Z

Summary

Problem: openclaw update could report Gateway did not become healthy after restart on Linux hosts where the gateway was healthy but listener ownership could not be attributed after restart.
Why it matters: install and update flows looked failed even when the systemd-managed gateway came back cleanly, which makes beta/stable rollout verification noisy and misleading.
What changed: the restart health check now treats a running service plus a definitely busy port as healthy when process attribution is unavailable, and adds a regression test for the missing-lsof path.
What did NOT change (scope boundary): this does not suppress real startup crashes or change gateway startup behavior outside the restart-health verification path.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

None

User-visible / Behavior Changes

openclaw update no longer reports a failed restart when the gateway is running and listening but listener PID ownership cannot be attributed during post-restart health checks.

Security Impact (required)

New permissions/capabilities? (Yes/No): No
Secrets/tokens handling changed? (Yes/No): No
New/changed network calls? (Yes/No): No
Command/tool execution surface changed? (Yes/No): No
Data access scope changed? (Yes/No): No
If any Yes, explain risk + mitigation:

Repro + Verification

Environment

OS: Linux
Runtime/container: Node 22 / npm global install path
Model/provider: N/A
Integration/channel (if any): N/A
Relevant config (redacted): gateway managed by systemd, loopback bind, restart health check after openclaw update

Steps

Run the gateway under systemd on Linux without usable listener attribution (lsof unavailable / process details unavailable).
Run openclaw update --channel beta and let the updater restart the gateway.
Observe the updater's post-restart health result.

Expected

The update flow should treat the restart as healthy when the service runtime is running and the gateway port is confirmed busy, even if listener PID attribution is unavailable.

Actual

Before this change, the update flow could warn that the gateway did not become healthy after restart even though a subsequent openclaw gateway status --deep showed a healthy runtime and successful RPC probe.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

Verified scenarios: targeted restart-health, lifecycle, and update CLI Vitest suites pass with the new attribution-gap case.
Edge cases checked: healthy listener child PID ownership, stale PID detection, ambiguous ownership with probe fallback, and missing-listener-details on Linux.
What you did not verify: a live end-to-end package-manager update on a remote Linux host.

Compatibility / Migration

Backward compatible? (Yes/No): Yes
Config/env changes? (Yes/No): No
Migration needed? (Yes/No): No
If yes, exact upgrade steps:

Failure Recovery (if this breaks)

How to disable/revert this change quickly: revert the restart-health commits.
Files/config to restore: src/cli/daemon-cli/restart-health.ts, src/cli/daemon-cli/restart-health.test.ts
Known bad symptoms reviewers should watch for: a truly stale non-gateway listener being misclassified as healthy during updater restart verification.

Risks and Mitigations

Risk: the updater could be slightly more permissive when listener attribution is unavailable.
Mitigation: the relaxed path only applies when the service runtime is already running and the port is definitely busy; stale PID logic and probe fallback remain unchanged.

aisle-research-bot · 2026-03-08T05:09:15Z

🤖 We're reviewing this PR with Aisle

We're running a security check on the changes in this PR now. This usually takes a few minutes. ⌛
We'll post the results here as soon as they're ready.

Progress:

Analysis
Triage
Finalization

Latest run failed. Keeping previous successful results. Trace ID: 019ccbd96f60698a489d23d7bce151fc.

_{Last updated on: 2026-03-08T05:09:27Z}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7691e9593d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-08T05:11:37Z

src/cli/daemon-cli/restart-health.ts

+      ? portUsage.listeners.some((listener) =>
+          listenerOwnedByRuntimePid({ listener, runtimePid }),
+        ) || listenerAttributionGap
+      : gatewayListeners.length > 0 || listenerAttributionGap;


Keep reachability probe when listener attribution is missing

This change marks the restart as healthy whenever runtimePid exists and listener attribution is unavailable, because listenerAttributionGap is OR-ed into ownsPort; in practice inspectPortUsage reports this gap for any busy port with no listener metadata, so a different process occupying the gateway port (while the service process is merely running) is now treated as healthy without calling confirmGatewayReachable. That can hide real restart failures and skip stale-process recovery in update/restart flows on hosts where listener attribution is unavailable.

Useful? React with 👍 / 👎.

greptile-apps · 2026-03-08T05:14:28Z

Greptile Summary

This PR fixes a false failure in openclaw update's post-restart health check on Linux hosts where the gateway comes back healthy but lsof (or equivalent) is unavailable for listener process attribution. The fix adds a new hasListenerAttributionGap helper in restart-health.ts that detects a busy port with empty listener details and uses it to set ownsPort = true without requiring the full probe path.

Strengths:

The gate conditions (status === "busy", listeners.length === 0, service running) are conservative and match the stated scope
The logic correctly places the new check in the fallback path, preserving stale PID detection and probe fallback for other cases
New regression test covers the happy path with both errors and hints present

Issues identified:

The portUsage.errors?.length check at line 35 is overly broad: it returns true for ANY error, not just attribution-specific ones. This could cause false-healthy results if an unrelated inspection error co-occurs with status: "busy" and listeners: []. Recommend narrowing to specific error patterns (ENOENT, "process details are unavailable").
The hints-only code path (line 38) lacks independent test coverage. The current test provides both errors and hints, causing the errors?.length check to short-circuit before the hints branch is reached. A follow-up test with errors: [] and only the hint would ensure this fallback path is exercised.

The fix is well-targeted, but these two refinements would improve robustness and maintainability before production rollout.

Confidence Score: 3/5

The fix is narrow and well-intentioned, but two refinements are recommended before merge: narrow the error check to attribution-specific errors, and add test coverage for the hints-only path.
The PR's core logic—treating a running service plus a confirmed busy port as healthy when attribution details are unavailable—is sound and safely scoped to the restart-health verification path. The new helper function is called only when appropriate, and the fallback probe behavior is preserved for edge cases. However, the portUsage.errors?.length check is too permissive (any error triggers the gap), and the hints-only code path has no independent test coverage. These are precision and maintainability concerns rather than blocking safety issues, but they merit correction to reduce the risk of false positives or silent breakage in the hints branch.
src/cli/daemon-cli/restart-health.ts (line 35: error check too broad) and src/cli/daemon-cli/restart-health.test.ts (hints-only path not independently tested).

_{Last reviewed commit: 7691e95}

greptile-apps · 2026-03-08T05:14:32Z

src/cli/daemon-cli/restart-health.ts

+  if (portUsage.errors?.length) {
+    return true;
+  }


The portUsage.errors?.length check is broader than intended. It will return true for any non-empty error array, even if the error is unrelated to listener attribution. If the port inspection tool encounters an error for a transient reason (e.g. I/O timeout) while still reporting status: "busy" and listeners: [], this would skip listener attribution verification and mark the restart as healthy without probe fallback.

Consider narrowing the check to errors that clearly indicate attribution unavailability:

Suggested change

if (portUsage.errors?.length) {

return true;

}

if (portUsage.errors?.some((e) => e.includes("ENOENT") || e.includes("process details are unavailable"))) {

return true;

}

This ensures the gap is detected only when the error is attribution-specific, maintaining consistency with the hints check below.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/cli/daemon-cli/restart-health.ts Line: 35-37 Comment: The `portUsage.errors?.length` check is broader than intended. It will return `true` for any non-empty error array, even if the error is unrelated to listener attribution. If the port inspection tool encounters an error for a transient reason (e.g. I/O timeout) while still reporting `status: "busy"` and `listeners: []`, this would skip listener attribution verification and mark the restart as healthy without probe fallback. Consider narrowing the check to errors that clearly indicate attribution unavailability: ```suggestion if (portUsage.errors?.some((e) => e.includes("ENOENT") || e.includes("process details are unavailable"))) { return true; } ``` This ensures the gap is detected only when the error is attribution-specific, maintaining consistency with the hints check below. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-03-08T05:14:33Z

src/cli/daemon-cli/restart-health.test.ts

+  it("treats busy ports with unavailable listener details as healthy when runtime is running", async () => {
+    const service = {
+      readRuntime: vi.fn(async () => ({ status: "running", pid: 8000 })),
+    } as unknown as GatewayService;
+
+    inspectPortUsage.mockResolvedValue({
+      port: 18789,
+      status: "busy",
+      listeners: [],
+      hints: [
+        "Port is in use but process details are unavailable (install lsof or run as an admin user).",
+      ],
+      errors: ["Error: spawn lsof ENOENT"],
+    });
+
+    const { inspectGatewayRestart } = await import("./restart-health.js");
+    const snapshot = await inspectGatewayRestart({ service, port: 18789 });
+
+    expect(snapshot.healthy).toBe(true);
+    expect(probeGateway).not.toHaveBeenCalled();
+  });


The hints-only code path in hasListenerAttributionGap lacks independent test coverage. The test at this location provides both errors and hints, but the errors?.length check (line 35 of restart-health.ts) short-circuits before the hints.some(...) branch is ever reached. If the hints check is accidentally modified or removed, this test would still pass.

Add a separate test case that exercises the hints-only path:

Suggested change

it("treats busy ports with unavailable listener details as healthy when runtime is running", async () => {

const service = {

readRuntime: vi.fn(async () => ({ status: "running", pid: 8000 })),

} as unknown as GatewayService;

inspectPortUsage.mockResolvedValue({

port: 18789,

status: "busy",

listeners: [],

hints: [

"Port is in use but process details are unavailable (install lsof or run as an admin user).",

],

errors: ["Error: spawn lsof ENOENT"],

});

const { inspectGatewayRestart } = await import("./restart-health.js");

const snapshot = await inspectGatewayRestart({ service, port: 18789 });

expect(snapshot.healthy).toBe(true);

expect(probeGateway).not.toHaveBeenCalled();

});

it("treats busy ports with only a hints-based attribution gap as healthy when runtime is running", async () => {

const service = {

readRuntime: vi.fn(async () => ({ status: "running", pid: 8000 })),

} as unknown as GatewayService;

inspectPortUsage.mockResolvedValue({

port: 18789,

status: "busy",

listeners: [],

hints: ["Port is in use but process details are unavailable (install lsof or run as an admin user)."],

errors: [], // No errors — only hints

});

const { inspectGatewayRestart } = await import("./restart-health.js");

const snapshot = await inspectGatewayRestart({ service, port: 18789 });

expect(snapshot.healthy).toBe(true);

expect(probeGateway).not.toHaveBeenCalled();

});

Prompt To Fix With AI

This is a comment left during a code review. Path: src/cli/daemon-cli/restart-health.test.ts Line: 202-222 Comment: The hints-only code path in `hasListenerAttributionGap` lacks independent test coverage. The test at this location provides both `errors` and `hints`, but the `errors?.length` check (line 35 of `restart-health.ts`) short-circuits before the `hints.some(...)` branch is ever reached. If the hints check is accidentally modified or removed, this test would still pass. Add a separate test case that exercises the hints-only path: ```suggestion it("treats busy ports with only a hints-based attribution gap as healthy when runtime is running", async () => { const service = { readRuntime: vi.fn(async () => ({ status: "running", pid: 8000 })), } as unknown as GatewayService; inspectPortUsage.mockResolvedValue({ port: 18789, status: "busy", listeners: [], hints: ["Port is in use but process details are unavailable (install lsof or run as an admin user)."], errors: [], // No errors — only hints }); const { inspectGatewayRestart } = await import("./restart-health.js"); const snapshot = await inspectGatewayRestart({ service, port: 18789 }); expect(snapshot.healthy).toBe(true); expect(probeGateway).not.toHaveBeenCalled(); }); ``` How can I resolve this? If you propose a fix, please make it concise.

上游主要更新（5d22bd029 → 92648f9）： - fix(agents): 扩大 402 临时限速错误的检测范围，允许计费冷却探测（openclaw#38533） - fix(ui): 修正控制台设备鉴权 token 签名问题 - CLI: 避免无监听归因时的误判更新重启失败（openclaw#39508） - fix: 合入社区贡献 PR openclaw#39516（@Imhermes1） - build: 版本升级至 2026.3.8，发布 v2026.3.7 tag 本地累积更新（自上次同步）： - feat: OAuth token 自动续期 —— cron 触发前预检 token 并自动刷新 - fix: oauth-refresh-daemon shouldRefresh 修复（已过期 token 不再被跳过） - fix: 守护进程失败3次后重置计数器，避免永久停止刷新 - feat: 智能路由模型账号选择、模型管理全面优化 - feat: 飞书/钉钉频道会话聚合历史视图

…openclaw#39508)

…openclaw#39508) (cherry picked from commit c381034)

vincentkoc added 2 commits March 7, 2026 21:06

CLI: avoid false unhealthy update restart on Linux

75293bf

CLI: cover missing listener attribution restart health

b0f01ee

vincentkoc self-assigned this Mar 8, 2026

openclaw-barnacle bot added cli CLI command changes size: XS maintainer Maintainer-authored PR labels Mar 8, 2026

vincentkoc marked this pull request as ready for review March 8, 2026 05:09

Merge branch 'main' into vincentkoc-code/fix-update-health-no-lsof

7691e95

chatgpt-codex-connector bot reviewed Mar 8, 2026

View reviewed changes

greptile-apps bot reviewed Mar 8, 2026

View reviewed changes

vincentkoc merged commit c381034 into main Mar 8, 2026
27 of 29 checks passed

vincentkoc deleted the vincentkoc-code/fix-update-health-no-lsof branch March 8, 2026 05:42

github-actions bot mentioned this pull request Mar 8, 2026

📡 Upstream Digest — 2026-03-08 06:33 UTC curtismercier/openclaw-mods#208

Open

github-actions bot mentioned this pull request Mar 8, 2026

上游更新: v2026.3.7 — 19 P0 + 122 P1 待合并 jiulingyun/openclaw-cn#486

Open

Saitop pushed a commit to NomiciAI/openclaw that referenced this pull request Mar 8, 2026

CLI: avoid false update restart failures without listener attribution (…

0a09fcb

…openclaw#39508)

This was referenced Mar 9, 2026

Merge upstream v2026.3.7 stable release #40376

Closed

Merge upstream v2026.3.7 stable release Jackson3195/openclaw-with-a-personal-touch#6

Merged

GordonSH-oss pushed a commit to GordonSH-oss/openclaw that referenced this pull request Mar 9, 2026

CLI: avoid false update restart failures without listener attribution (…

21c9421

…openclaw#39508)

jenawant pushed a commit to jenawant/openclaw that referenced this pull request Mar 10, 2026

CLI: avoid false update restart failures without listener attribution (…

1622cba

…openclaw#39508)

alexey-pelykh mentioned this pull request Mar 10, 2026

Cherry-pick: upstream CLI commands updates (18 commits) remoteclaw/remoteclaw#887

Closed

4 tasks

sauerdaniel pushed a commit to sauerdaniel/openclaw that referenced this pull request Mar 11, 2026

CLI: avoid false update restart failures without listener attribution (…

2c93187

…openclaw#39508)

dhoman pushed a commit to dhoman/chrono-claw that referenced this pull request Mar 11, 2026

CLI: avoid false update restart failures without listener attribution (…

93acb20

…openclaw#39508)

senw-developers pushed a commit to senw-developers/va-openclaw that referenced this pull request Mar 17, 2026

CLI: avoid false update restart failures without listener attribution (…

46e373a

…openclaw#39508)

V-Gutierrez pushed a commit to V-Gutierrez/openclaw-vendor that referenced this pull request Mar 17, 2026

CLI: avoid false update restart failures without listener attribution (…

83e71f3

…openclaw#39508)

alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 22, 2026

CLI: avoid false update restart failures without listener attribution (…

579d572

…openclaw#39508) (cherry picked from commit c381034)

alexey-pelykh mentioned this pull request Mar 22, 2026

Cherry-pick issue #887: upstream CLI commands updates remoteclaw/remoteclaw#1791

Merged

alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 22, 2026

CLI: avoid false update restart failures without listener attribution (…

b82ec52

…openclaw#39508) (cherry picked from commit c381034)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

CLI: avoid false update restart failures without listener attribution#39508

CLI: avoid false update restart failures without listener attribution#39508
vincentkoc merged 3 commits intomainfrom
vincentkoc-code/fix-update-health-no-lsof

vincentkoc commented Mar 8, 2026

Uh oh!

aisle-research-bot bot commented Mar 8, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 8, 2026

Uh oh!

greptile-apps bot commented Mar 8, 2026

Uh oh!

greptile-apps bot Mar 8, 2026

Uh oh!

greptile-apps bot Mar 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

vincentkoc commented Mar 8, 2026

Summary

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

User-visible / Behavior Changes

Security Impact (required)

Repro + Verification

Environment

Steps

Expected

Actual

Evidence

Human Verification (required)

Compatibility / Migration

Failure Recovery (if this breaks)

Risks and Mitigations

Uh oh!

aisle-research-bot bot commented Mar 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Mar 8, 2026

Greptile Summary

Confidence Score: 3/5

Uh oh!

greptile-apps bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aisle-research-bot bot commented Mar 8, 2026 •

edited

Loading