CLI: avoid false update restart failures without listener attribution#39508
CLI: avoid false update restart failures without listener attribution#39508vincentkoc merged 3 commits intomainfrom
Conversation
|
🤖 We're reviewing this PR with Aisle We're running a security check on the changes in this PR now. This usually takes a few minutes. ⌛ Progress:
Latest run failed. Keeping previous successful results. Trace ID: Last updated on: 2026-03-08T05:09:27Z |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7691e9593d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ? portUsage.listeners.some((listener) => | ||
| listenerOwnedByRuntimePid({ listener, runtimePid }), | ||
| ) || listenerAttributionGap | ||
| : gatewayListeners.length > 0 || listenerAttributionGap; |
There was a problem hiding this comment.
Keep reachability probe when listener attribution is missing
This change marks the restart as healthy whenever runtimePid exists and listener attribution is unavailable, because listenerAttributionGap is OR-ed into ownsPort; in practice inspectPortUsage reports this gap for any busy port with no listener metadata, so a different process occupying the gateway port (while the service process is merely running) is now treated as healthy without calling confirmGatewayReachable. That can hide real restart failures and skip stale-process recovery in update/restart flows on hosts where listener attribution is unavailable.
Useful? React with 👍 / 👎.
Greptile SummaryThis PR fixes a false failure in Strengths:
Issues identified:
The fix is well-targeted, but these two refinements would improve robustness and maintainability before production rollout. Confidence Score: 3/5
Last reviewed commit: 7691e95 |
| if (portUsage.errors?.length) { | ||
| return true; | ||
| } |
There was a problem hiding this comment.
The portUsage.errors?.length check is broader than intended. It will return true for any non-empty error array, even if the error is unrelated to listener attribution. If the port inspection tool encounters an error for a transient reason (e.g. I/O timeout) while still reporting status: "busy" and listeners: [], this would skip listener attribution verification and mark the restart as healthy without probe fallback.
Consider narrowing the check to errors that clearly indicate attribution unavailability:
| if (portUsage.errors?.length) { | |
| return true; | |
| } | |
| if (portUsage.errors?.some((e) => e.includes("ENOENT") || e.includes("process details are unavailable"))) { | |
| return true; | |
| } |
This ensures the gap is detected only when the error is attribution-specific, maintaining consistency with the hints check below.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/restart-health.ts
Line: 35-37
Comment:
The `portUsage.errors?.length` check is broader than intended. It will return `true` for any non-empty error array, even if the error is unrelated to listener attribution. If the port inspection tool encounters an error for a transient reason (e.g. I/O timeout) while still reporting `status: "busy"` and `listeners: []`, this would skip listener attribution verification and mark the restart as healthy without probe fallback.
Consider narrowing the check to errors that clearly indicate attribution unavailability:
```suggestion
if (portUsage.errors?.some((e) => e.includes("ENOENT") || e.includes("process details are unavailable"))) {
return true;
}
```
This ensures the gap is detected only when the error is attribution-specific, maintaining consistency with the hints check below.
How can I resolve this? If you propose a fix, please make it concise.| it("treats busy ports with unavailable listener details as healthy when runtime is running", async () => { | ||
| const service = { | ||
| readRuntime: vi.fn(async () => ({ status: "running", pid: 8000 })), | ||
| } as unknown as GatewayService; | ||
|
|
||
| inspectPortUsage.mockResolvedValue({ | ||
| port: 18789, | ||
| status: "busy", | ||
| listeners: [], | ||
| hints: [ | ||
| "Port is in use but process details are unavailable (install lsof or run as an admin user).", | ||
| ], | ||
| errors: ["Error: spawn lsof ENOENT"], | ||
| }); | ||
|
|
||
| const { inspectGatewayRestart } = await import("./restart-health.js"); | ||
| const snapshot = await inspectGatewayRestart({ service, port: 18789 }); | ||
|
|
||
| expect(snapshot.healthy).toBe(true); | ||
| expect(probeGateway).not.toHaveBeenCalled(); | ||
| }); |
There was a problem hiding this comment.
The hints-only code path in hasListenerAttributionGap lacks independent test coverage. The test at this location provides both errors and hints, but the errors?.length check (line 35 of restart-health.ts) short-circuits before the hints.some(...) branch is ever reached. If the hints check is accidentally modified or removed, this test would still pass.
Add a separate test case that exercises the hints-only path:
| it("treats busy ports with unavailable listener details as healthy when runtime is running", async () => { | |
| const service = { | |
| readRuntime: vi.fn(async () => ({ status: "running", pid: 8000 })), | |
| } as unknown as GatewayService; | |
| inspectPortUsage.mockResolvedValue({ | |
| port: 18789, | |
| status: "busy", | |
| listeners: [], | |
| hints: [ | |
| "Port is in use but process details are unavailable (install lsof or run as an admin user).", | |
| ], | |
| errors: ["Error: spawn lsof ENOENT"], | |
| }); | |
| const { inspectGatewayRestart } = await import("./restart-health.js"); | |
| const snapshot = await inspectGatewayRestart({ service, port: 18789 }); | |
| expect(snapshot.healthy).toBe(true); | |
| expect(probeGateway).not.toHaveBeenCalled(); | |
| }); | |
| it("treats busy ports with only a hints-based attribution gap as healthy when runtime is running", async () => { | |
| const service = { | |
| readRuntime: vi.fn(async () => ({ status: "running", pid: 8000 })), | |
| } as unknown as GatewayService; | |
| inspectPortUsage.mockResolvedValue({ | |
| port: 18789, | |
| status: "busy", | |
| listeners: [], | |
| hints: ["Port is in use but process details are unavailable (install lsof or run as an admin user)."], | |
| errors: [], // No errors — only hints | |
| }); | |
| const { inspectGatewayRestart } = await import("./restart-health.js"); | |
| const snapshot = await inspectGatewayRestart({ service, port: 18789 }); | |
| expect(snapshot.healthy).toBe(true); | |
| expect(probeGateway).not.toHaveBeenCalled(); | |
| }); |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/restart-health.test.ts
Line: 202-222
Comment:
The hints-only code path in `hasListenerAttributionGap` lacks independent test coverage. The test at this location provides both `errors` and `hints`, but the `errors?.length` check (line 35 of `restart-health.ts`) short-circuits before the `hints.some(...)` branch is ever reached. If the hints check is accidentally modified or removed, this test would still pass.
Add a separate test case that exercises the hints-only path:
```suggestion
it("treats busy ports with only a hints-based attribution gap as healthy when runtime is running", async () => {
const service = {
readRuntime: vi.fn(async () => ({ status: "running", pid: 8000 })),
} as unknown as GatewayService;
inspectPortUsage.mockResolvedValue({
port: 18789,
status: "busy",
listeners: [],
hints: ["Port is in use but process details are unavailable (install lsof or run as an admin user)."],
errors: [], // No errors — only hints
});
const { inspectGatewayRestart } = await import("./restart-health.js");
const snapshot = await inspectGatewayRestart({ service, port: 18789 });
expect(snapshot.healthy).toBe(true);
expect(probeGateway).not.toHaveBeenCalled();
});
```
How can I resolve this? If you propose a fix, please make it concise.上游主要更新(5d22bd029 → 92648f9): - fix(agents): 扩大 402 临时限速错误的检测范围,允许计费冷却探测(openclaw#38533) - fix(ui): 修正控制台设备鉴权 token 签名问题 - CLI: 避免无监听归因时的误判更新重启失败(openclaw#39508) - fix: 合入社区贡献 PR openclaw#39516(@Imhermes1) - build: 版本升级至 2026.3.8,发布 v2026.3.7 tag 本地累积更新(自上次同步): - feat: OAuth token 自动续期 —— cron 触发前预检 token 并自动刷新 - fix: oauth-refresh-daemon shouldRefresh 修复(已过期 token 不再被跳过) - fix: 守护进程失败3次后重置计数器,避免永久停止刷新 - feat: 智能路由模型账号选择、模型管理全面优化 - feat: 飞书/钉钉频道会话聚合历史视图
…openclaw#39508) (cherry picked from commit c381034)
…openclaw#39508) (cherry picked from commit c381034)
Summary
openclaw updatecould reportGateway did not become healthy after restarton Linux hosts where the gateway was healthy but listener ownership could not be attributed after restart.lsofpath.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
openclaw updateno longer reports a failed restart when the gateway is running and listening but listener PID ownership cannot be attributed during post-restart health checks.Security Impact (required)
Yes/No): NoYes/No): NoYes/No): NoYes/No): NoYes/No): NoYes, explain risk + mitigation:Repro + Verification
Environment
openclaw updateSteps
lsofunavailable / process details unavailable).openclaw update --channel betaand let the updater restart the gateway.Expected
Actual
openclaw gateway status --deepshowed a healthy runtime and successful RPC probe.Evidence
Human Verification (required)
What you personally verified (not just CI), and how:
Compatibility / Migration
Yes/No): YesYes/No): NoYes/No): NoFailure Recovery (if this breaks)
src/cli/daemon-cli/restart-health.ts,src/cli/daemon-cli/restart-health.test.tsRisks and Mitigations
runningand the port is definitelybusy; stale PID logic and probe fallback remain unchanged.