fix: detect PID recycling in gateway lock on Windows/macOS + startup progress [AI-assisted]#59843
Conversation
On non-Linux platforms, esolveGatewayOwnerStatus previously returned alive for any live PID without verifying the process was actually a gateway. Windows recycles PIDs aggressively, so a stale lock whose PID was reassigned to an unrelated process (e.g. chrome.exe) would block openclaw gateway run until the 5 s timeout, even when no gateway was actually running. Changes: - Add eadWindowsCmdline (via wmic) and eadDarwinCmdline (via ps) to read the owner process command line on Windows and macOS. - Feed the result through the existing isGatewayArgv check so recycled PIDs are detected as dead and the stale lock is removed immediately. - When the command-line lookup fails (wmic missing, timeout, permission error), return unknown so the existing stale-lock heuristic can still reclaim old locks. - Add parseWindowsCmdline to handle double-quoted Windows paths. - Add eadProcessCmdline option to GatewayLockOptions as a testing seam. Made-with: Cursor
- Handle wmic UTF-16LE BOM: read as Buffer and detect BOM before decoding, so the CommandLine regex matches on real Windows 10/11 systems - Reduce CMDLINE_EXEC_TIMEOUT_MS from 3s to 1s to avoid stalling the lock-acquisition polling loop - Return "alive" (not "unknown") when cmdline lookup fails on win32/darwin to preserve conservative single-instance guarantees - Document that the Darwin ps-based whitespace split is deliberately naive and the caller falls back to "alive" when it misparses Made-with: Cursor
Greptile SummaryThis PR fixes PID-recycling false positives in the gateway lock on Windows and macOS by reading the live process command line ( Confidence Score: 5/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/infra/gateway-lock.test.ts
Line: 359
Comment:
**Misleading test name — "unknown" vs "alive"**
The test description says "falls back to *unknown*" but the actual code path returns `"alive"` for non-Linux platforms when `readProcessCmdline` returns `null` (see `gateway-lock.ts` line 224: `return platform === "linux" ? "unknown" : "alive"`). The correct observable outcome (lock held → `GatewayLockError`) is right, but the name misdescribes the internal status used.
```suggestion
it("falls back to alive on win32 when cmdline reader returns null", async () => {
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/infra/gateway-lock.ts
Line: 104-112
Comment:
**Inaccurate docstring for the space-in-path fallback**
The comment states "when the split does fail the caller falls back to 'alive' (conservative)." This is only true when `readDarwinCmdline` returns `null` (process gone, timeout, etc.). When the process path contains spaces the function returns a *non-null but misparsed* array; `isGatewayArgv` then returns `false` and the caller treats the lock as `"dead"`, not `"alive"`. The description of the fallback should distinguish between these two outcomes to avoid misleading future readers.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix: address review feedback for gateway..." | Re-trigger Greptile |
| connectSpy.mockRestore(); | ||
| }); | ||
|
|
||
| it("falls back to unknown on win32 when cmdline reader returns null", async () => { |
There was a problem hiding this comment.
Misleading test name — "unknown" vs "alive"
The test description says "falls back to unknown" but the actual code path returns "alive" for non-Linux platforms when readProcessCmdline returns null (see gateway-lock.ts line 224: return platform === "linux" ? "unknown" : "alive"). The correct observable outcome (lock held → GatewayLockError) is right, but the name misdescribes the internal status used.
| it("falls back to unknown on win32 when cmdline reader returns null", async () => { | |
| it("falls back to alive on win32 when cmdline reader returns null", async () => { |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/gateway-lock.test.ts
Line: 359
Comment:
**Misleading test name — "unknown" vs "alive"**
The test description says "falls back to *unknown*" but the actual code path returns `"alive"` for non-Linux platforms when `readProcessCmdline` returns `null` (see `gateway-lock.ts` line 224: `return platform === "linux" ? "unknown" : "alive"`). The correct observable outcome (lock held → `GatewayLockError`) is right, but the name misdescribes the internal status used.
```suggestion
it("falls back to alive on win32 when cmdline reader returns null", async () => {
```
How can I resolve this? If you propose a fix, please make it concise.| /** | ||
| * Read the command line of a macOS/BSD process via `ps`. | ||
| * | ||
| * `ps -o command=` outputs an unquoted flat string, so the naive whitespace | ||
| * split will misparse paths containing spaces. This is acceptable because | ||
| * standard macOS install paths do not contain spaces, and when the split | ||
| * does fail the caller falls back to "alive" (conservative). | ||
| */ | ||
| function readDarwinCmdline(pid: number): string[] | null { |
There was a problem hiding this comment.
Inaccurate docstring for the space-in-path fallback
The comment states "when the split does fail the caller falls back to 'alive' (conservative)." This is only true when readDarwinCmdline returns null (process gone, timeout, etc.). When the process path contains spaces the function returns a non-null but misparsed array; isGatewayArgv then returns false and the caller treats the lock as "dead", not "alive". The description of the fallback should distinguish between these two outcomes to avoid misleading future readers.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/gateway-lock.ts
Line: 104-112
Comment:
**Inaccurate docstring for the space-in-path fallback**
The comment states "when the split does fail the caller falls back to 'alive' (conservative)." This is only true when `readDarwinCmdline` returns `null` (process gone, timeout, etc.). When the process path contains spaces the function returns a *non-null but misparsed* array; `isGatewayArgv` then returns `false` and the caller treats the lock as `"dead"`, not `"alive"`. The description of the fallback should distinguish between these two outcomes to avoid misleading future readers.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: adc9ea9a00
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (!line) { | ||
| return null; | ||
| } | ||
| return line.split(/\s+/).filter(Boolean); |
There was a problem hiding this comment.
Treat ambiguous Darwin cmdline parsing as unknown, not dead
readDarwinCmdline tokenizes ps -o command= output with a raw whitespace split, which breaks whenever any argv element contains spaces; the function still returns a non-null argv, so resolveGatewayOwnerStatus can classify a live gateway as dead and remove its lock. In that scenario (for example, gateway launched from a path with spaces), a second gateway run can steal the lock while the first instance is still running, violating the single-instance guard and leading to startup races/bind failures.
Useful? React with 👍 / 👎.
Addressing bot review feedback (3e30121)Greptile P2 — Misleading test name (gateway-lock.test.ts:359) Greptile P2 — Inaccurate docstring for space-in-path fallback (gateway-lock.ts:112)
The docstring now explains this and notes that standard macOS install paths (/usr/local/bin, ~/.npm-global/bin, Homebrew prefixes) do not contain spaces. Codex P2 — Ambiguous Darwin cmdline parsing (gateway-lock.ts:123) All three P2 items have been addressed in commit 3e30121. |
Local verification results (Windows 10, Node 22)All local checks pass:
Verified on Windows 10 22H2, Node v22.x, pnpm 10.x. |
…progress (#59843) Fix stale lock files from crashed gateway processes blocking new invocations on Windows/macOS. Detect PID recycling to avoid false positive lock conflicts, and add startup progress indicator. Thanks @TonyDerek-dot
…progress (openclaw#59843) Fix stale lock files from crashed gateway processes blocking new invocations on Windows/macOS. Detect PID recycling to avoid false positive lock conflicts, and add startup progress indicator. Thanks @TonyDerek-dot
Summary
gateway runfor the full 5s timeout becauseisPidAlivereturnstruefor recycled PIDs of unrelated processes. Additionally, gateway startup shows no output for 15-20s on Windows/NTFS during module loading.wmicon Windows with UTF-16LE BOM handling,pson macOS) to detect PID recycling. Added awithProgressspinner during the heaviest module-loading phase andgatewayLog.infostage markers throughout startup./proc/{pid}/statstart-time comparison as the primary mechanism. No changes to lock file format, timeouts, or polling logic.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Root Cause / Regression History (if applicable)
resolveGatewayOwnerStatusonly checked PID liveness on non-Linux platforms without verifying the process identity. Windows recycles PIDs aggressively, soisPidAlivereturnstruefor unrelated processes that inherited the stale PID./proc/{pid}/cmdlinechecking.isPidAlive, which is insufficient due to PID recycling.Regression Test Plan (if applicable)
src/infra/gateway-lock.test.ts,src/infra/gateway-process-argv.test.tsacquireGatewayLockshould reclaim the lock. When the PID belongs to a real gateway, it should reject.readProcessCmdlinemock isolate the platform-specific verification without requiring actual process spawning.User-visible / Behavior Changes
openclaw gateway runnow shows a spinner ("Loading gateway modules…") during the initial module import phase instead of a blank screen for 15-20s on Windows.loading configuration…,resolving authentication…,acquiring gateway lock...,loading plugins...,starting HTTP server...,starting channels and sidecars...) are logged during startup.Diagram (if applicable)
Security Impact (required)
Yes, explain risk + mitigation: N/ARepro + Verification
Environment
Steps
openclaw gateway run, then force-kill the process (Task Manager /kill -9)openclaw gateway runagain immediatelyExpected
Actual
Evidence
New tests in
gateway-lock.test.tscover: Windows cmdline match, Windows cmdline mismatch (PID recycling), macOS cmdline match, macOS cmdline mismatch, port-busy + alive with mock cmdline.Human Verification (required)
pnpm tsgo)AI Assistance Disclosure
Review Conversations
All feedback from #59799 (Greptile + Codex bots) has been incorporated:
Compatibility / Migration
Risks and Mitigations
wmicis deprecated on newer Windows versions and may eventually be removedreadWindowsCmdlinecatches all errors and returns null; the fallback is "alive" (conservative, same as pre-fix behavior)psoutput parsing may misparse paths with spaces on macOS