Skip to content

fix: detect PID recycling in gateway lock on Windows/macOS + startup progress [AI-assisted]#59843

Merged
BradGroux merged 3 commits intoopenclaw:mainfrom
TonyDerek-dot:fix/gateway-lock-pid-recycling-v2
Apr 3, 2026
Merged

fix: detect PID recycling in gateway lock on Windows/macOS + startup progress [AI-assisted]#59843
BradGroux merged 3 commits intoopenclaw:mainfrom
TonyDerek-dot:fix/gateway-lock-pid-recycling-v2

Conversation

@TonyDerek-dot
Copy link
Copy Markdown
Contributor

@TonyDerek-dot TonyDerek-dot commented Apr 2, 2026

Summary

  • Problem: On Windows (and macOS), stale gateway lock files from crashed processes block new gateway run for the full 5s timeout because isPidAlive returns true for recycled PIDs of unrelated processes. Additionally, gateway startup shows no output for 15-20s on Windows/NTFS during module loading.
  • Why it matters: Windows users see false "gateway already running" errors or long hangs after unclean shutdowns, requiring manual lock file deletion. The silent startup gives no indication of progress.
  • What changed: Added platform-aware process command-line verification (wmic on Windows with UTF-16LE BOM handling, ps on macOS) to detect PID recycling. Added a withProgress spinner during the heaviest module-loading phase and gatewayLog.info stage markers throughout startup.
  • What did NOT change (scope boundary): Linux behavior is completely unchanged — it continues to use /proc/{pid}/stat start-time comparison as the primary mechanism. No changes to lock file format, timeouts, or polling logic.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause / Regression History (if applicable)

  • Root cause: resolveGatewayOwnerStatus only checked PID liveness on non-Linux platforms without verifying the process identity. Windows recycles PIDs aggressively, so isPidAlive returns true for unrelated processes that inherited the stale PID.
  • Missing detection / guardrail: No command-line verification existed for Windows/macOS — only Linux had /proc/{pid}/cmdline checking.
  • Prior context: The Linux path already had both start-time comparison and cmdline verification. Windows and macOS were added later but only got isPidAlive, which is insufficient due to PID recycling.
  • Why this regressed now: Not a regression — this was a latent bug since the lock mechanism was introduced, but it surfaces frequently on Windows where PID recycling is more aggressive and antivirus slows module loading (making the timeout window feel longer).
  • If unknown, what was ruled out: N/A — root cause is confirmed.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/infra/gateway-lock.test.ts, src/infra/gateway-process-argv.test.ts
  • Scenario the test should lock in: When a lock file exists with a PID that is alive but belongs to a non-gateway process (PID recycling), acquireGatewayLock should reclaim the lock. When the PID belongs to a real gateway, it should reject.
  • Why this is the smallest reliable guardrail: Unit tests with injected readProcessCmdline mock isolate the platform-specific verification without requiring actual process spawning.
  • Existing test that already covers this (if any): Existing tests covered port-busy + PID-alive scenarios but did not verify process identity.

User-visible / Behavior Changes

  • openclaw gateway run now shows a spinner ("Loading gateway modules…") during the initial module import phase instead of a blank screen for 15-20s on Windows.
  • Stage markers (loading configuration…, resolving authentication…, acquiring gateway lock..., loading plugins..., starting HTTP server..., starting channels and sidecars...) are logged during startup.
  • Stale lock files from crashed gateways are reclaimed faster on Windows/macOS when the PID has been recycled by an unrelated process.

Diagram (if applicable)

Before (Windows, stale lock from crashed gateway):
[gateway run] -> isPidAlive(stale PID)=true -> wait 5s timeout -> ERROR "gateway already running"

After:
[gateway run] -> isPidAlive(stale PID)=true -> wmic cmdline check -> not a gateway -> reclaim lock -> start OK

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No (wmic/ps are read-only process queries, not exposed to users)
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Windows 10/11 (primary), macOS (secondary)
  • Runtime/container: Node 22+
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): Default gateway config

Steps

  1. Start openclaw gateway run, then force-kill the process (Task Manager / kill -9)
  2. Run openclaw gateway run again immediately
  3. Observe: old behavior blocks for 5s with "gateway already running"; new behavior reclaims lock and starts

Expected

  • Gateway starts successfully after detecting the stale lock owner is not a gateway process

Actual

  • (Before fix) Blocks for full timeout, then errors with "gateway already running"
  • (After fix) Reclaims lock and starts with spinner + stage log output

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

New tests in gateway-lock.test.ts cover: Windows cmdline match, Windows cmdline mismatch (PID recycling), macOS cmdline match, macOS cmdline mismatch, port-busy + alive with mock cmdline.

Human Verification (required)

  • Verified scenarios: Gateway startup with spinner on Windows, lock reclamation after force-kill, type-check passes (pnpm tsgo)
  • Edge cases checked: wmic UTF-16LE vs UTF-8 output, cmdline read failure fallback (returns "alive"), 1s exec timeout
  • What you did not verify: Real macOS testing (only unit tests), wmic deprecation on future Windows versions

AI Assistance Disclosure

  • This PR was developed with AI assistance (Cursor + Claude)
  • Testing level: Unit tests fully passing, type-check verified, local Windows gateway startup verified
  • I understand what the code does

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

All feedback from #59799 (Greptile + Codex bots) has been incorporated:

  • wmic UTF-16LE BOM detection (P1)
  • Reduced exec timeout from 3s to 1s (P1)
  • Conservative "alive" fallback when cmdline lookup fails (P1)
  • Documented naive whitespace split limitation for Darwin (P2)

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: wmic is deprecated on newer Windows versions and may eventually be removed
    • Mitigation: readWindowsCmdline catches all errors and returns null; the fallback is "alive" (conservative, same as pre-fix behavior)
  • Risk: ps output parsing may misparse paths with spaces on macOS
    • Mitigation: Standard macOS install paths don't contain spaces; failure falls back to "alive" (documented in code comment)

On non-Linux platforms, 
esolveGatewayOwnerStatus previously returned
alive for any live PID without verifying the process was actually a
gateway.  Windows recycles PIDs aggressively, so a stale lock whose PID
was reassigned to an unrelated process (e.g. chrome.exe) would block
openclaw gateway run until the 5 s timeout, even when no gateway was
actually running.

Changes:
- Add 
eadWindowsCmdline (via wmic) and 
eadDarwinCmdline (via
  ps) to read the owner process command line on Windows and macOS.
- Feed the result through the existing isGatewayArgv check so recycled
  PIDs are detected as dead and the stale lock is removed immediately.
- When the command-line lookup fails (wmic missing, timeout, permission
  error), return unknown so the existing stale-lock heuristic can
  still reclaim old locks.
- Add parseWindowsCmdline to handle double-quoted Windows paths.
- Add 
eadProcessCmdline option to GatewayLockOptions as a testing
  seam.

Made-with: Cursor
- Handle wmic UTF-16LE BOM: read as Buffer and detect BOM before decoding,
  so the CommandLine regex matches on real Windows 10/11 systems
- Reduce CMDLINE_EXEC_TIMEOUT_MS from 3s to 1s to avoid stalling the
  lock-acquisition polling loop
- Return "alive" (not "unknown") when cmdline lookup fails on win32/darwin
  to preserve conservative single-instance guarantees
- Document that the Darwin ps-based whitespace split is deliberately naive
  and the caller falls back to "alive" when it misparses

Made-with: Cursor
@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime cli CLI command changes size: M labels Apr 2, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 2, 2026

Greptile Summary

This PR fixes PID-recycling false positives in the gateway lock on Windows and macOS by reading the live process command line (wmic on Windows, ps on macOS) and validating it against isGatewayArgv before treating a lock owner as alive. It also adds a withProgress spinner and structured gatewayLog.info checkpoints to surface startup progress during the previously silent module-loading phase. The implementation is well-structured with proper conservative fallbacks (returning "alive" when the cmdline lookup fails on non-Linux platforms) and a clean DI seam for unit testing via readProcessCmdline.

Confidence Score: 5/5

  • Safe to merge; only minor documentation/naming P2 findings remain.
  • The core PID-recycling logic is correct: the conservative fallback to "alive" on win32/darwin when cmdline reads fail preserves single-instance guarantees, and Linux behavior is unchanged. The wmic Buffer-decoding, UTF-16LE BOM detection, and parseWindowsCmdline quoting logic are all sound for real-world gateway paths. The two remaining findings are a misleading test name and an inaccurate docstring — neither affects runtime behavior.
  • No files require special attention.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/infra/gateway-lock.test.ts
Line: 359

Comment:
**Misleading test name — "unknown" vs "alive"**

The test description says "falls back to *unknown*" but the actual code path returns `"alive"` for non-Linux platforms when `readProcessCmdline` returns `null` (see `gateway-lock.ts` line 224: `return platform === "linux" ? "unknown" : "alive"`). The correct observable outcome (lock held → `GatewayLockError`) is right, but the name misdescribes the internal status used.

```suggestion
  it("falls back to alive on win32 when cmdline reader returns null", async () => {
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/infra/gateway-lock.ts
Line: 104-112

Comment:
**Inaccurate docstring for the space-in-path fallback**

The comment states "when the split does fail the caller falls back to 'alive' (conservative)." This is only true when `readDarwinCmdline` returns `null` (process gone, timeout, etc.). When the process path contains spaces the function returns a *non-null but misparsed* array; `isGatewayArgv` then returns `false` and the caller treats the lock as `"dead"`, not `"alive"`. The description of the fallback should distinguish between these two outcomes to avoid misleading future readers.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: address review feedback for gateway..." | Re-trigger Greptile

connectSpy.mockRestore();
});

it("falls back to unknown on win32 when cmdline reader returns null", async () => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Misleading test name — "unknown" vs "alive"

The test description says "falls back to unknown" but the actual code path returns "alive" for non-Linux platforms when readProcessCmdline returns null (see gateway-lock.ts line 224: return platform === "linux" ? "unknown" : "alive"). The correct observable outcome (lock held → GatewayLockError) is right, but the name misdescribes the internal status used.

Suggested change
it("falls back to unknown on win32 when cmdline reader returns null", async () => {
it("falls back to alive on win32 when cmdline reader returns null", async () => {
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/gateway-lock.test.ts
Line: 359

Comment:
**Misleading test name — "unknown" vs "alive"**

The test description says "falls back to *unknown*" but the actual code path returns `"alive"` for non-Linux platforms when `readProcessCmdline` returns `null` (see `gateway-lock.ts` line 224: `return platform === "linux" ? "unknown" : "alive"`). The correct observable outcome (lock held → `GatewayLockError`) is right, but the name misdescribes the internal status used.

```suggestion
  it("falls back to alive on win32 when cmdline reader returns null", async () => {
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +104 to +112
/**
* Read the command line of a macOS/BSD process via `ps`.
*
* `ps -o command=` outputs an unquoted flat string, so the naive whitespace
* split will misparse paths containing spaces. This is acceptable because
* standard macOS install paths do not contain spaces, and when the split
* does fail the caller falls back to "alive" (conservative).
*/
function readDarwinCmdline(pid: number): string[] | null {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Inaccurate docstring for the space-in-path fallback

The comment states "when the split does fail the caller falls back to 'alive' (conservative)." This is only true when readDarwinCmdline returns null (process gone, timeout, etc.). When the process path contains spaces the function returns a non-null but misparsed array; isGatewayArgv then returns false and the caller treats the lock as "dead", not "alive". The description of the fallback should distinguish between these two outcomes to avoid misleading future readers.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/gateway-lock.ts
Line: 104-112

Comment:
**Inaccurate docstring for the space-in-path fallback**

The comment states "when the split does fail the caller falls back to 'alive' (conservative)." This is only true when `readDarwinCmdline` returns `null` (process gone, timeout, etc.). When the process path contains spaces the function returns a *non-null but misparsed* array; `isGatewayArgv` then returns `false` and the caller treats the lock as `"dead"`, not `"alive"`. The description of the fallback should distinguish between these two outcomes to avoid misleading future readers.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: adc9ea9a00

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

if (!line) {
return null;
}
return line.split(/\s+/).filter(Boolean);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat ambiguous Darwin cmdline parsing as unknown, not dead

readDarwinCmdline tokenizes ps -o command= output with a raw whitespace split, which breaks whenever any argv element contains spaces; the function still returns a non-null argv, so resolveGatewayOwnerStatus can classify a live gateway as dead and remove its lock. In that scenario (for example, gateway launched from a path with spaces), a second gateway run can steal the lock while the first instance is still running, violating the single-instance guard and leading to startup races/bind failures.

Useful? React with 👍 / 👎.

@BradGroux BradGroux merged commit 5f6e349 into openclaw:main Apr 3, 2026
41 checks passed
@TonyDerek-dot TonyDerek-dot changed the title fix: detect PID recycling in gateway lock on Windows/macOS + startup progress fix: detect PID recycling in gateway lock on Windows/macOS + startup progress [AI-assisted] Apr 3, 2026
@TonyDerek-dot
Copy link
Copy Markdown
Contributor Author

Addressing bot review feedback (3e30121)

Greptile P2 — Misleading test name (gateway-lock.test.ts:359)
Fixed: renamed to \ alls back to alive on win32 when cmdline reader returns null\ to match the actual non-Linux behavior (line 224:
eturn platform === 'linux' ? 'unknown' : 'alive').

Greptile P2 — Inaccurate docstring for space-in-path fallback (gateway-lock.ts:112)
Fixed: updated the docstring to accurately describe both failure modes:

  1. When
    eadDarwinCmdline\ returns
    ull\ (process gone, timeout) -> caller falls back to \�live\
  2. When the split produces a misparsed non-null array from a space-in-path -> \isGatewayArgv\ returns \ alse\ -> owner classified as \dead\ (false negative, not \�live)

The docstring now explains this and notes that standard macOS install paths (/usr/local/bin, ~/.npm-global/bin, Homebrew prefixes) do not contain spaces.

Codex P2 — Ambiguous Darwin cmdline parsing (gateway-lock.ts:123)
Acknowledged. The updated docstring explicitly documents this edge case. In practice, the risk is very low because standard macOS install paths don't contain spaces. A future improvement could use \sysctl kern.procargs2\ for NUL-delimited parsing on macOS.


All three P2 items have been addressed in commit 3e30121.

@TonyDerek-dot
Copy link
Copy Markdown
Contributor Author

Local verification results (Windows 10, Node 22)

All local checks pass:

  • pnpm check (tsgo + oxlint + all lint rules): PASS (exit 0)
  • Scoped tests (gateway-lock.test.ts + gateway-process-argv.test.ts): 24/24 passed
  • CI: All 28 checks pass (including windows-node-test-1 through 6)

Verified on Windows 10 22H2, Node v22.x, pnpm 10.x.

ngutman pushed a commit that referenced this pull request Apr 3, 2026
…progress (#59843)

Fix stale lock files from crashed gateway processes blocking new invocations on Windows/macOS. Detect PID recycling to avoid false positive lock conflicts, and add startup progress indicator.

Thanks @TonyDerek-dot
steipete pushed a commit to duncanita/openclaw that referenced this pull request Apr 4, 2026
…progress (openclaw#59843)

Fix stale lock files from crashed gateway processes blocking new invocations on Windows/macOS. Detect PID recycling to avoid false positive lock conflicts, and add startup progress indicator.

Thanks @TonyDerek-dot
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cli CLI command changes gateway Gateway runtime size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants