Skip to content

CLI: avoid false update restart failures without listener attribution#39508

Merged
vincentkoc merged 3 commits intomainfrom
vincentkoc-code/fix-update-health-no-lsof
Mar 8, 2026
Merged

CLI: avoid false update restart failures without listener attribution#39508
vincentkoc merged 3 commits intomainfrom
vincentkoc-code/fix-update-health-no-lsof

Conversation

@vincentkoc
Copy link
Copy Markdown
Member

Summary

  • Problem: openclaw update could report Gateway did not become healthy after restart on Linux hosts where the gateway was healthy but listener ownership could not be attributed after restart.
  • Why it matters: install and update flows looked failed even when the systemd-managed gateway came back cleanly, which makes beta/stable rollout verification noisy and misleading.
  • What changed: the restart health check now treats a running service plus a definitely busy port as healthy when process attribution is unavailable, and adds a regression test for the missing-lsof path.
  • What did NOT change (scope boundary): this does not suppress real startup crashes or change gateway startup behavior outside the restart-health verification path.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • None

User-visible / Behavior Changes

openclaw update no longer reports a failed restart when the gateway is running and listening but listener PID ownership cannot be attributed during post-restart health checks.

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): No
  • Command/tool execution surface changed? (Yes/No): No
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: Node 22 / npm global install path
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): gateway managed by systemd, loopback bind, restart health check after openclaw update

Steps

  1. Run the gateway under systemd on Linux without usable listener attribution (lsof unavailable / process details unavailable).
  2. Run openclaw update --channel beta and let the updater restart the gateway.
  3. Observe the updater's post-restart health result.

Expected

  • The update flow should treat the restart as healthy when the service runtime is running and the gateway port is confirmed busy, even if listener PID attribution is unavailable.

Actual

  • Before this change, the update flow could warn that the gateway did not become healthy after restart even though a subsequent openclaw gateway status --deep showed a healthy runtime and successful RPC probe.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: targeted restart-health, lifecycle, and update CLI Vitest suites pass with the new attribution-gap case.
  • Edge cases checked: healthy listener child PID ownership, stale PID detection, ambiguous ownership with probe fallback, and missing-listener-details on Linux.
  • What you did not verify: a live end-to-end package-manager update on a remote Linux host.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert the restart-health commits.
  • Files/config to restore: src/cli/daemon-cli/restart-health.ts, src/cli/daemon-cli/restart-health.test.ts
  • Known bad symptoms reviewers should watch for: a truly stale non-gateway listener being misclassified as healthy during updater restart verification.

Risks and Mitigations

  • Risk: the updater could be slightly more permissive when listener attribution is unavailable.
  • Mitigation: the relaxed path only applies when the service runtime is already running and the port is definitely busy; stale PID logic and probe fallback remain unchanged.

@vincentkoc vincentkoc self-assigned this Mar 8, 2026
@openclaw-barnacle openclaw-barnacle bot added cli CLI command changes size: XS maintainer Maintainer-authored PR labels Mar 8, 2026
@vincentkoc vincentkoc marked this pull request as ready for review March 8, 2026 05:09
@aisle-research-bot
Copy link
Copy Markdown

aisle-research-bot bot commented Mar 8, 2026

🤖 We're reviewing this PR with Aisle

We're running a security check on the changes in this PR now. This usually takes a few minutes. ⌛
We'll post the results here as soon as they're ready.

Progress:

  • Analysis
  • Triage
  • Finalization

Latest run failed. Keeping previous successful results. Trace ID: 019ccbd96f60698a489d23d7bce151fc.

Last updated on: 2026-03-08T05:09:27Z

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7691e9593d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +147 to +150
? portUsage.listeners.some((listener) =>
listenerOwnedByRuntimePid({ listener, runtimePid }),
) || listenerAttributionGap
: gatewayListeners.length > 0 || listenerAttributionGap;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep reachability probe when listener attribution is missing

This change marks the restart as healthy whenever runtimePid exists and listener attribution is unavailable, because listenerAttributionGap is OR-ed into ownsPort; in practice inspectPortUsage reports this gap for any busy port with no listener metadata, so a different process occupying the gateway port (while the service process is merely running) is now treated as healthy without calling confirmGatewayReachable. That can hide real restart failures and skip stale-process recovery in update/restart flows on hosts where listener attribution is unavailable.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 8, 2026

Greptile Summary

This PR fixes a false failure in openclaw update's post-restart health check on Linux hosts where the gateway comes back healthy but lsof (or equivalent) is unavailable for listener process attribution. The fix adds a new hasListenerAttributionGap helper in restart-health.ts that detects a busy port with empty listener details and uses it to set ownsPort = true without requiring the full probe path.

Strengths:

  • The gate conditions (status === "busy", listeners.length === 0, service running) are conservative and match the stated scope
  • The logic correctly places the new check in the fallback path, preserving stale PID detection and probe fallback for other cases
  • New regression test covers the happy path with both errors and hints present

Issues identified:

  1. The portUsage.errors?.length check at line 35 is overly broad: it returns true for ANY error, not just attribution-specific ones. This could cause false-healthy results if an unrelated inspection error co-occurs with status: "busy" and listeners: []. Recommend narrowing to specific error patterns (ENOENT, "process details are unavailable").
  2. The hints-only code path (line 38) lacks independent test coverage. The current test provides both errors and hints, causing the errors?.length check to short-circuit before the hints branch is reached. A follow-up test with errors: [] and only the hint would ensure this fallback path is exercised.

The fix is well-targeted, but these two refinements would improve robustness and maintainability before production rollout.

Confidence Score: 3/5

  • The fix is narrow and well-intentioned, but two refinements are recommended before merge: narrow the error check to attribution-specific errors, and add test coverage for the hints-only path.
  • The PR's core logic—treating a running service plus a confirmed busy port as healthy when attribution details are unavailable—is sound and safely scoped to the restart-health verification path. The new helper function is called only when appropriate, and the fallback probe behavior is preserved for edge cases. However, the portUsage.errors?.length check is too permissive (any error triggers the gap), and the hints-only code path has no independent test coverage. These are precision and maintainability concerns rather than blocking safety issues, but they merit correction to reduce the risk of false positives or silent breakage in the hints branch.
  • src/cli/daemon-cli/restart-health.ts (line 35: error check too broad) and src/cli/daemon-cli/restart-health.test.ts (hints-only path not independently tested).

Last reviewed commit: 7691e95

Comment on lines +35 to +37
if (portUsage.errors?.length) {
return true;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The portUsage.errors?.length check is broader than intended. It will return true for any non-empty error array, even if the error is unrelated to listener attribution. If the port inspection tool encounters an error for a transient reason (e.g. I/O timeout) while still reporting status: "busy" and listeners: [], this would skip listener attribution verification and mark the restart as healthy without probe fallback.

Consider narrowing the check to errors that clearly indicate attribution unavailability:

Suggested change
if (portUsage.errors?.length) {
return true;
}
if (portUsage.errors?.some((e) => e.includes("ENOENT") || e.includes("process details are unavailable"))) {
return true;
}

This ensures the gap is detected only when the error is attribution-specific, maintaining consistency with the hints check below.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/restart-health.ts
Line: 35-37

Comment:
The `portUsage.errors?.length` check is broader than intended. It will return `true` for any non-empty error array, even if the error is unrelated to listener attribution. If the port inspection tool encounters an error for a transient reason (e.g. I/O timeout) while still reporting `status: "busy"` and `listeners: []`, this would skip listener attribution verification and mark the restart as healthy without probe fallback.

Consider narrowing the check to errors that clearly indicate attribution unavailability:

```suggestion
if (portUsage.errors?.some((e) => e.includes("ENOENT") || e.includes("process details are unavailable"))) {
  return true;
}
```

This ensures the gap is detected only when the error is attribution-specific, maintaining consistency with the hints check below.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +202 to +222
it("treats busy ports with unavailable listener details as healthy when runtime is running", async () => {
const service = {
readRuntime: vi.fn(async () => ({ status: "running", pid: 8000 })),
} as unknown as GatewayService;

inspectPortUsage.mockResolvedValue({
port: 18789,
status: "busy",
listeners: [],
hints: [
"Port is in use but process details are unavailable (install lsof or run as an admin user).",
],
errors: ["Error: spawn lsof ENOENT"],
});

const { inspectGatewayRestart } = await import("./restart-health.js");
const snapshot = await inspectGatewayRestart({ service, port: 18789 });

expect(snapshot.healthy).toBe(true);
expect(probeGateway).not.toHaveBeenCalled();
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hints-only code path in hasListenerAttributionGap lacks independent test coverage. The test at this location provides both errors and hints, but the errors?.length check (line 35 of restart-health.ts) short-circuits before the hints.some(...) branch is ever reached. If the hints check is accidentally modified or removed, this test would still pass.

Add a separate test case that exercises the hints-only path:

Suggested change
it("treats busy ports with unavailable listener details as healthy when runtime is running", async () => {
const service = {
readRuntime: vi.fn(async () => ({ status: "running", pid: 8000 })),
} as unknown as GatewayService;
inspectPortUsage.mockResolvedValue({
port: 18789,
status: "busy",
listeners: [],
hints: [
"Port is in use but process details are unavailable (install lsof or run as an admin user).",
],
errors: ["Error: spawn lsof ENOENT"],
});
const { inspectGatewayRestart } = await import("./restart-health.js");
const snapshot = await inspectGatewayRestart({ service, port: 18789 });
expect(snapshot.healthy).toBe(true);
expect(probeGateway).not.toHaveBeenCalled();
});
it("treats busy ports with only a hints-based attribution gap as healthy when runtime is running", async () => {
const service = {
readRuntime: vi.fn(async () => ({ status: "running", pid: 8000 })),
} as unknown as GatewayService;
inspectPortUsage.mockResolvedValue({
port: 18789,
status: "busy",
listeners: [],
hints: ["Port is in use but process details are unavailable (install lsof or run as an admin user)."],
errors: [], // No errors — only hints
});
const { inspectGatewayRestart } = await import("./restart-health.js");
const snapshot = await inspectGatewayRestart({ service, port: 18789 });
expect(snapshot.healthy).toBe(true);
expect(probeGateway).not.toHaveBeenCalled();
});
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/restart-health.test.ts
Line: 202-222

Comment:
The hints-only code path in `hasListenerAttributionGap` lacks independent test coverage. The test at this location provides both `errors` and `hints`, but the `errors?.length` check (line 35 of `restart-health.ts`) short-circuits before the `hints.some(...)` branch is ever reached. If the hints check is accidentally modified or removed, this test would still pass.

Add a separate test case that exercises the hints-only path:

```suggestion
it("treats busy ports with only a hints-based attribution gap as healthy when runtime is running", async () => {
  const service = {
    readRuntime: vi.fn(async () => ({ status: "running", pid: 8000 })),
  } as unknown as GatewayService;

  inspectPortUsage.mockResolvedValue({
    port: 18789,
    status: "busy",
    listeners: [],
    hints: ["Port is in use but process details are unavailable (install lsof or run as an admin user)."],
    errors: [],  // No errors — only hints
  });

  const { inspectGatewayRestart } = await import("./restart-health.js");
  const snapshot = await inspectGatewayRestart({ service, port: 18789 });

  expect(snapshot.healthy).toBe(true);
  expect(probeGateway).not.toHaveBeenCalled();
});
```

How can I resolve this? If you propose a fix, please make it concise.

@vincentkoc vincentkoc merged commit c381034 into main Mar 8, 2026
27 of 29 checks passed
@vincentkoc vincentkoc deleted the vincentkoc-code/fix-update-health-no-lsof branch March 8, 2026 05:42
Get-windy pushed a commit to Get-windy/JieZi-ai-PS that referenced this pull request Mar 8, 2026
上游主要更新(5d22bd029 → 92648f9):
- fix(agents): 扩大 402 临时限速错误的检测范围,允许计费冷却探测(openclaw#38533)
- fix(ui): 修正控制台设备鉴权 token 签名问题
- CLI: 避免无监听归因时的误判更新重启失败(openclaw#39508)
- fix: 合入社区贡献 PR openclaw#39516(@Imhermes1)
- build: 版本升级至 2026.3.8,发布 v2026.3.7 tag

本地累积更新(自上次同步):
- feat: OAuth token 自动续期 —— cron 触发前预检 token 并自动刷新
- fix: oauth-refresh-daemon shouldRefresh 修复(已过期 token 不再被跳过)
- fix: 守护进程失败3次后重置计数器,避免永久停止刷新
- feat: 智能路由模型账号选择、模型管理全面优化
- feat: 飞书/钉钉频道会话聚合历史视图
Saitop pushed a commit to NomiciAI/openclaw that referenced this pull request Mar 8, 2026
senw-developers pushed a commit to senw-developers/va-openclaw that referenced this pull request Mar 17, 2026
V-Gutierrez pushed a commit to V-Gutierrez/openclaw-vendor that referenced this pull request Mar 17, 2026
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 22, 2026
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cli CLI command changes maintainer Maintainer-authored PR size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant