fix: detect PID recycling in gateway lock on Windows and macOS#59799
fix: detect PID recycling in gateway lock on Windows and macOS#59799TonyDerek-dot wants to merge 2 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR fixes a real PID-recycling bug on Windows and macOS where a crashed gateway's stale lock file could block
Confidence Score: 2/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/infra/gateway-lock.ts
Line: 83-87
Comment:
**`wmic` outputs UTF-16LE, but `execFileSync` reads as UTF-8**
On the majority of Windows systems (especially Windows 10/11 older builds), `wmic` emits its output encoded as UTF-16LE (with a BOM). When `execFileSync` reads the result with `encoding: "utf8"`, each ASCII character is interspersed with null bytes (`\x00`), so the decoded string looks like `C\x00o\x00m\x00m\x00a\x00n\x00d\x00L\x00i\x00n\x00e\x00=\x00…`. The regex `/CommandLine=(.+)/` will never match this garbled output, causing `readWindowsCmdline` to silently return `null` on many real Windows machines — exactly the platforms this fix is targeting.
To reliably handle both encodings you can detect the BOM and re-decode:
```typescript
function readWindowsCmdline(pid: number): string[] | null {
try {
const buf = execFileSync(
"wmic",
["process", "where", `processid=${pid}`, "get", "CommandLine", "/value"],
{ timeout: CMDLINE_EXEC_TIMEOUT_MS, windowsHide: true, stdio: ["ignore", "pipe", "ignore"] },
) as unknown as Buffer;
// wmic emits UTF-16LE on most Windows builds; newer builds may use UTF-8.
const raw = buf[0] === 0xff && buf[1] === 0xfe
? buf.toString("utf16le")
: buf.toString("utf8");
const match = raw.match(/CommandLine=(.+)/);
if (!match) return null;
return parseWindowsCmdline(match[1].trim());
} catch {
return null;
}
}
```
Without the `encoding` option, `execFileSync` returns a `Buffer` instead of a `string`, allowing manual re-decoding after BOM detection.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/infra/gateway-lock.ts
Line: 81-116
Comment:
**Synchronous `execFileSync` blocks the event loop for up to 3 seconds**
Both `readWindowsCmdline` and `readDarwinCmdline` call `execFileSync` with a 3-second timeout. Since `resolveGatewayOwnerStatus` is awaited inside the `acquireGatewayLock` polling loop (which has a default timeout of only 5 seconds), a single slow `wmic` invocation can stall the entire Node.js event loop and consume nearly the full lock-acquisition window before any other work can proceed.
`execFileSync` is synchronous and will block the event loop for the full duration of the timeout if the child process hangs. The async equivalents (`execFile` from `node:child_process/promises`, or wrapping `child_process.execFile` in a `Promise`) would allow the event loop to remain responsive and integrate naturally with the existing `await` call chain. Consider making `resolveGatewayOwnerStatus` receive the cmdline via a `Promise`-based reader, e.g.:
```typescript
async function readWindowsCmdline(pid: number): Promise<string[] | null> {
const { execFile } = await import("node:child_process");
return new Promise((resolve) => {
const child = execFile(
"wmic",
["process", "where", `processid=${pid}`, "get", "CommandLine", "/value"],
{ timeout: CMDLINE_EXEC_TIMEOUT_MS, windowsHide: true },
(err, stdout) => {
if (err) { resolve(null); return; }
const match = stdout.match(/CommandLine=(.+)/);
resolve(match ? parseWindowsCmdline(match[1].trim()) : null);
},
);
child.stdout?.setEncoding("utf8");
});
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/infra/gateway-lock.ts
Line: 101-115
Comment:
**Naive `\s+` split in `readDarwinCmdline` breaks on paths containing spaces**
`ps -p <pid> -o command=` outputs the raw command string as a flat, unquoted line — arguments are not shell-quoted. Splitting by `/\s+/` misparses any token that contains a space, including common macOS install paths like `/Users/me/My Apps/bin/node`. In that case, `args[0]` would be `/Users/me/My` and `isGatewayArgv` would fail to recognise it as a gateway binary (its `exe.endsWith("/openclaw")` check on the normalised first arg would miss).
The most portable workaround with the `ps` output format is to request NUL-delimited output via `ps -p <pid> -o args=` and parse accordingly, or to read from `/proc/<pid>/cmdline` on Linux/macOS with `procfs` available. A simpler partial fix is to at least match on any token ending with `/openclaw` rather than relying only on `args[0]`, but that would require changes to `isGatewayArgv` too.
At minimum, add a comment documenting that this split is deliberately naive and will fail for install paths with spaces.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix: detect PID recycling in gateway loc..." | Re-trigger Greptile |
src/infra/gateway-lock.ts
Outdated
| const raw = execFileSync( | ||
| "wmic", | ||
| ["process", "where", `processid=${pid}`, "get", "CommandLine", "/value"], | ||
| { encoding: "utf8", timeout: CMDLINE_EXEC_TIMEOUT_MS, windowsHide: true, stdio: ["ignore", "pipe", "ignore"] }, | ||
| ); |
There was a problem hiding this comment.
wmic outputs UTF-16LE, but execFileSync reads as UTF-8
On the majority of Windows systems (especially Windows 10/11 older builds), wmic emits its output encoded as UTF-16LE (with a BOM). When execFileSync reads the result with encoding: "utf8", each ASCII character is interspersed with null bytes (\x00), so the decoded string looks like C\x00o\x00m\x00m\x00a\x00n\x00d\x00L\x00i\x00n\x00e\x00=\x00…. The regex /CommandLine=(.+)/ will never match this garbled output, causing readWindowsCmdline to silently return null on many real Windows machines — exactly the platforms this fix is targeting.
To reliably handle both encodings you can detect the BOM and re-decode:
function readWindowsCmdline(pid: number): string[] | null {
try {
const buf = execFileSync(
"wmic",
["process", "where", `processid=${pid}`, "get", "CommandLine", "/value"],
{ timeout: CMDLINE_EXEC_TIMEOUT_MS, windowsHide: true, stdio: ["ignore", "pipe", "ignore"] },
) as unknown as Buffer;
// wmic emits UTF-16LE on most Windows builds; newer builds may use UTF-8.
const raw = buf[0] === 0xff && buf[1] === 0xfe
? buf.toString("utf16le")
: buf.toString("utf8");
const match = raw.match(/CommandLine=(.+)/);
if (!match) return null;
return parseWindowsCmdline(match[1].trim());
} catch {
return null;
}
}Without the encoding option, execFileSync returns a Buffer instead of a string, allowing manual re-decoding after BOM detection.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/gateway-lock.ts
Line: 83-87
Comment:
**`wmic` outputs UTF-16LE, but `execFileSync` reads as UTF-8**
On the majority of Windows systems (especially Windows 10/11 older builds), `wmic` emits its output encoded as UTF-16LE (with a BOM). When `execFileSync` reads the result with `encoding: "utf8"`, each ASCII character is interspersed with null bytes (`\x00`), so the decoded string looks like `C\x00o\x00m\x00m\x00a\x00n\x00d\x00L\x00i\x00n\x00e\x00=\x00…`. The regex `/CommandLine=(.+)/` will never match this garbled output, causing `readWindowsCmdline` to silently return `null` on many real Windows machines — exactly the platforms this fix is targeting.
To reliably handle both encodings you can detect the BOM and re-decode:
```typescript
function readWindowsCmdline(pid: number): string[] | null {
try {
const buf = execFileSync(
"wmic",
["process", "where", `processid=${pid}`, "get", "CommandLine", "/value"],
{ timeout: CMDLINE_EXEC_TIMEOUT_MS, windowsHide: true, stdio: ["ignore", "pipe", "ignore"] },
) as unknown as Buffer;
// wmic emits UTF-16LE on most Windows builds; newer builds may use UTF-8.
const raw = buf[0] === 0xff && buf[1] === 0xfe
? buf.toString("utf16le")
: buf.toString("utf8");
const match = raw.match(/CommandLine=(.+)/);
if (!match) return null;
return parseWindowsCmdline(match[1].trim());
} catch {
return null;
}
}
```
Without the `encoding` option, `execFileSync` returns a `Buffer` instead of a `string`, allowing manual re-decoding after BOM detection.
How can I resolve this? If you propose a fix, please make it concise.| function readWindowsCmdline(pid: number): string[] | null { | ||
| try { | ||
| const raw = execFileSync( | ||
| "wmic", | ||
| ["process", "where", `processid=${pid}`, "get", "CommandLine", "/value"], | ||
| { encoding: "utf8", timeout: CMDLINE_EXEC_TIMEOUT_MS, windowsHide: true, stdio: ["ignore", "pipe", "ignore"] }, | ||
| ); | ||
| const match = raw.match(/CommandLine=(.+)/); | ||
| if (!match) { | ||
| return null; | ||
| } | ||
| return parseWindowsCmdline(match[1].trim()); | ||
| } catch { | ||
| return null; | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Read the command line of a macOS/BSD process via `ps`. | ||
| */ | ||
| function readDarwinCmdline(pid: number): string[] | null { | ||
| try { | ||
| const raw = execFileSync("ps", ["-p", String(pid), "-o", "command="], { | ||
| encoding: "utf8", | ||
| timeout: CMDLINE_EXEC_TIMEOUT_MS, | ||
| stdio: ["ignore", "pipe", "ignore"], | ||
| }); | ||
| const line = raw.trim(); | ||
| if (!line) { | ||
| return null; | ||
| } | ||
| return line.split(/\s+/).filter(Boolean); | ||
| } catch { | ||
| return null; | ||
| } | ||
| } |
There was a problem hiding this comment.
Synchronous
execFileSync blocks the event loop for up to 3 seconds
Both readWindowsCmdline and readDarwinCmdline call execFileSync with a 3-second timeout. Since resolveGatewayOwnerStatus is awaited inside the acquireGatewayLock polling loop (which has a default timeout of only 5 seconds), a single slow wmic invocation can stall the entire Node.js event loop and consume nearly the full lock-acquisition window before any other work can proceed.
execFileSync is synchronous and will block the event loop for the full duration of the timeout if the child process hangs. The async equivalents (execFile from node:child_process/promises, or wrapping child_process.execFile in a Promise) would allow the event loop to remain responsive and integrate naturally with the existing await call chain. Consider making resolveGatewayOwnerStatus receive the cmdline via a Promise-based reader, e.g.:
async function readWindowsCmdline(pid: number): Promise<string[] | null> {
const { execFile } = await import("node:child_process");
return new Promise((resolve) => {
const child = execFile(
"wmic",
["process", "where", `processid=${pid}`, "get", "CommandLine", "/value"],
{ timeout: CMDLINE_EXEC_TIMEOUT_MS, windowsHide: true },
(err, stdout) => {
if (err) { resolve(null); return; }
const match = stdout.match(/CommandLine=(.+)/);
resolve(match ? parseWindowsCmdline(match[1].trim()) : null);
},
);
child.stdout?.setEncoding("utf8");
});
}Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/gateway-lock.ts
Line: 81-116
Comment:
**Synchronous `execFileSync` blocks the event loop for up to 3 seconds**
Both `readWindowsCmdline` and `readDarwinCmdline` call `execFileSync` with a 3-second timeout. Since `resolveGatewayOwnerStatus` is awaited inside the `acquireGatewayLock` polling loop (which has a default timeout of only 5 seconds), a single slow `wmic` invocation can stall the entire Node.js event loop and consume nearly the full lock-acquisition window before any other work can proceed.
`execFileSync` is synchronous and will block the event loop for the full duration of the timeout if the child process hangs. The async equivalents (`execFile` from `node:child_process/promises`, or wrapping `child_process.execFile` in a `Promise`) would allow the event loop to remain responsive and integrate naturally with the existing `await` call chain. Consider making `resolveGatewayOwnerStatus` receive the cmdline via a `Promise`-based reader, e.g.:
```typescript
async function readWindowsCmdline(pid: number): Promise<string[] | null> {
const { execFile } = await import("node:child_process");
return new Promise((resolve) => {
const child = execFile(
"wmic",
["process", "where", `processid=${pid}`, "get", "CommandLine", "/value"],
{ timeout: CMDLINE_EXEC_TIMEOUT_MS, windowsHide: true },
(err, stdout) => {
if (err) { resolve(null); return; }
const match = stdout.match(/CommandLine=(.+)/);
resolve(match ? parseWindowsCmdline(match[1].trim()) : null);
},
);
child.stdout?.setEncoding("utf8");
});
}
```
How can I resolve this? If you propose a fix, please make it concise.| function readDarwinCmdline(pid: number): string[] | null { | ||
| try { | ||
| const raw = execFileSync("ps", ["-p", String(pid), "-o", "command="], { | ||
| encoding: "utf8", | ||
| timeout: CMDLINE_EXEC_TIMEOUT_MS, | ||
| stdio: ["ignore", "pipe", "ignore"], | ||
| }); | ||
| const line = raw.trim(); | ||
| if (!line) { | ||
| return null; | ||
| } | ||
| return line.split(/\s+/).filter(Boolean); | ||
| } catch { | ||
| return null; | ||
| } |
There was a problem hiding this comment.
Naive
\s+ split in readDarwinCmdline breaks on paths containing spaces
ps -p <pid> -o command= outputs the raw command string as a flat, unquoted line — arguments are not shell-quoted. Splitting by /\s+/ misparses any token that contains a space, including common macOS install paths like /Users/me/My Apps/bin/node. In that case, args[0] would be /Users/me/My and isGatewayArgv would fail to recognise it as a gateway binary (its exe.endsWith("/openclaw") check on the normalised first arg would miss).
The most portable workaround with the ps output format is to request NUL-delimited output via ps -p <pid> -o args= and parse accordingly, or to read from /proc/<pid>/cmdline on Linux/macOS with procfs available. A simpler partial fix is to at least match on any token ending with /openclaw rather than relying only on args[0], but that would require changes to isGatewayArgv too.
At minimum, add a comment documenting that this split is deliberately naive and will fail for install paths with spaces.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/gateway-lock.ts
Line: 101-115
Comment:
**Naive `\s+` split in `readDarwinCmdline` breaks on paths containing spaces**
`ps -p <pid> -o command=` outputs the raw command string as a flat, unquoted line — arguments are not shell-quoted. Splitting by `/\s+/` misparses any token that contains a space, including common macOS install paths like `/Users/me/My Apps/bin/node`. In that case, `args[0]` would be `/Users/me/My` and `isGatewayArgv` would fail to recognise it as a gateway binary (its `exe.endsWith("/openclaw")` check on the normalised first arg would miss).
The most portable workaround with the `ps` output format is to request NUL-delimited output via `ps -p <pid> -o args=` and parse accordingly, or to read from `/proc/<pid>/cmdline` on Linux/macOS with `procfs` available. A simpler partial fix is to at least match on any token ending with `/openclaw` rather than relying only on `args[0]`, but that would require changes to `isGatewayArgv` too.
At minimum, add a comment documenting that this split is deliberately naive and will fail for install paths with spaces.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 218d94a897
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/infra/gateway-lock.ts
Outdated
| return platform === "linux" || platform === "win32" || platform === "darwin" | ||
| ? "unknown" | ||
| : "alive"; |
There was a problem hiding this comment.
Preserve live lock when cmdline lookup is unavailable
Returning "unknown" for win32/darwin when command-line probing fails sends healthy owners into the stale-lock branch in acquireGatewayLock. Once the lock age exceeds staleMs, that branch deletes the lock file, so a second gateway run can remove an active gateway's lock even when isPidAlive(pid) is true and the lock port is still busy. This is a regression from the prior non-Linux behavior ("alive"), and it weakens single-instance guarantees on hosts where wmic/ps lookup is unavailable or transiently failing.
Useful? React with 👍 / 👎.
On non-Linux platforms, esolveGatewayOwnerStatus previously returned alive for any live PID without verifying the process was actually a gateway. Windows recycles PIDs aggressively, so a stale lock whose PID was reassigned to an unrelated process (e.g. chrome.exe) would block openclaw gateway run until the 5 s timeout, even when no gateway was actually running. Changes: - Add eadWindowsCmdline (via wmic) and eadDarwinCmdline (via ps) to read the owner process command line on Windows and macOS. - Feed the result through the existing isGatewayArgv check so recycled PIDs are detected as dead and the stale lock is removed immediately. - When the command-line lookup fails (wmic missing, timeout, permission error), return unknown so the existing stale-lock heuristic can still reclaim old locks. - Add parseWindowsCmdline to handle double-quoted Windows paths. - Add eadProcessCmdline option to GatewayLockOptions as a testing seam. Made-with: Cursor
- Handle wmic UTF-16LE BOM: read as Buffer and detect BOM before decoding, so the CommandLine regex matches on real Windows 10/11 systems - Reduce CMDLINE_EXEC_TIMEOUT_MS from 3s to 1s to avoid stalling the lock-acquisition polling loop - Return "alive" (not "unknown") when cmdline lookup fails on win32/darwin to preserve conservative single-instance guarantees - Document that the Darwin ps-based whitespace split is deliberately naive and the caller falls back to "alive" when it misparses Made-with: Cursor
218d94a to
10a49d4
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 10a49d4e52
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (!line) { | ||
| return null; | ||
| } | ||
| return line.split(/\s+/).filter(Boolean); |
There was a problem hiding this comment.
Treat space-split macOS cmdlines as uncertain ownership
Splitting ps -o command= output on raw whitespace can misparse valid gateway processes whose executable path contains spaces (for example custom install paths), and that malformed argv then flows into isGatewayArgv(...) as false. In that case resolveGatewayOwnerStatus marks a live owner as "dead" and deletes its lock, so a second gateway run can remove the active instance’s lock and proceed as if ownership were stale. This is a regression from the previous darwin behavior (alive for any live PID) and should fall back to an uncertain/alive state when argv tokenization is ambiguous.
Useful? React with 👍 / 👎.
|
Closing in favor of a new PR with review feedback addressed (UTF-16LE wmic fix, reduced timeout, conservative alive fallback). |
Summary
On non-Linux platforms (Windows, macOS),
resolveGatewayOwnerStatuspreviously returned"alive"for any live PID without verifying the process was actually a gateway instance. Because Windows recycles PIDs aggressively, a stale lock file whose PID was reassigned to an unrelated process (e.g.chrome.exe,svchost.exe) would blockopenclaw gateway runfor the full 5-second lock timeout, producing a misleading "gateway already running" error even when no gateway was running.Root cause
When the gateway crashes or is killed without releasing its lock file (
%TEMP%\openclaw\gateway.<hash>.lock), the lock persists with the old PID. On Linux,resolveGatewayOwnerStatusreads/proc/<pid>/cmdlineand/proc/<pid>/statto verify the owner process is actually a gateway. On Windows and macOS, there was no equivalent check -- ifprocess.kill(pid, 0)succeeded (i.e. any process held that PID), the lock was treated as valid.Changes
readWindowsCmdline(pid): reads the process command line viawmic process ... get CommandLine(with a 3-second timeout andwindowsHide). Falls back tonullwhenwmicis unavailable or times out.readDarwinCmdline(pid): reads the process command line viaps -p <pid> -o command=. Falls back tonullon failure.parseWindowsCmdline(raw): splits a Windows command line string respecting double-quoted paths (e.g."C:\Program Files\node.exe" gateway run).resolveGatewayOwnerStatusnow delegates to the platform-appropriate reader and feeds the result throughisGatewayArgv(). Recycled PIDs that belong to non-gateway processes are detected as"dead"and the stale lock is removed immediately. When the lookup fails, the status falls back to"unknown"so the existing stale-lock heuristic can still reclaim old locks.readProcessCmdlineoption added toGatewayLockOptionsas a testing seam.Reproduction (Windows)
openclaw gateway runthen kill the process via Task Manager (lock file remains)openclaw gateway runagain -- previously waited 5 s then errored with "gateway already running (pid XXXXX)"Test plan
parseWindowsCmdline(quoted paths, empty input, multiple spaces)