fix(restart): poll port free after SIGKILL to prevent EADDRINUSE restart loop by efe-arv · Pull Request #33410 · openclaw/openclaw

efe-arv · 2026-03-03T19:17:32Z

fix(restart): poll port free after SIGKILL to prevent EADDRINUSE restart loop

Fixes #33103 · Related: #28134

Background

When OpenClaw's gateway is restarted via triggerOpenClawRestart() — the path exercised by the gateway tool, /restart, config-change webhooks, and SIGUSR1 on Linux — the following sequence occurs:

cleanStaleGatewayProcessesSync() locates any existing gateway process listening on the configured port, sends SIGTERM, waits, escalates to SIGKILL if necessary, then returns.
triggerOpenClawRestart() immediately invokes systemctl --user restart openclaw-gateway.
systemd spawns a new gateway process, which calls bind(2) on the configured port.

The bug lives in the transition between steps 1 and 2.

Root Cause

TCP socket lifecycle after SIGKILL

When a process is killed with SIGKILL, the kernel schedules socket cleanup but does not guarantee it completes before process.kill() returns to the caller. Sockets in states such as TIME_WAIT, FIN_WAIT_2, or mid-drain may remain bound to the port for a non-deterministic window after the owning process has been removed from the process table.

The prior implementation of cleanStaleGatewayProcessesSync() used two fixed sleeps:

STALE_SIGTERM_WAIT_MS = 300 ms — time allowed for graceful exit after SIGTERM
STALE_SIGKILL_WAIT_MS = 200 ms — time allowed after SIGKILL before returning

Total maximum wait: 500 ms.

On a loaded system, or when the gateway holds active streaming connections, 500 ms is insufficient for the kernel to complete socket teardown. cleanStaleGatewayProcessesSync() returns with the port still bound. systemctl restart fires immediately, the new process calls bind(2) on port 18789, receives EADDRINUSE, and exits with status 1. systemd schedules another restart — which fails identically — producing an unbounded restart loop. The only recovery was a manual kill -9 of the zombie PID followed by a clean systemctl start.

Why the `--force` path did not prevent this

The gateway CLI exposes a --force flag that calls forceFreePortAndWait() from ports.ts, which correctly polls the port via lsof until it is confirmed free. However, this function is only exercised on interactive openclaw gateway start --force invocations. The triggerOpenClawRestart() → systemctl restart path does not call forceFreePortAndWait(). The two code paths were not unified, leaving the programmatic restart path without port-free verification.

Fix

`src/infra/restart-stale-pids.ts`

Three additions are made:

1. pollPortOnce(port) — a three-state lsof poll

A new internal function returns a discriminated union: { free: true } (port confirmed free), { free: false } (port still bound), or { free: null } (lsof timed out or errored — result inconclusive). This three-state result is critical: the prior approach called findGatewayPidsOnPortSync() directly, which collapses both "no listeners" and "lsof error" into an empty array, making it impossible to distinguish a confirmed-free port from a failed probe.

2. waitForPortFreeSync(port) — polling loop with correct exit semantics

Polls pollPortOnce() at 50 ms intervals for up to 2 seconds. The loop exits on free: true (success) or free: null (lsof unavailable, graceful degradation). On free: false, it sleeps and retries. If the deadline is reached without a conclusive free result, it logs a warning and proceeds.

3. POLL_SPAWN_TIMEOUT_MS = 400 ms — distinct spawn timeout for polling

This addresses the timeout-budget collision identified in review: if POLL_SPAWN_TIMEOUT_MS equals PORT_FREE_TIMEOUT_MS (both 2000 ms), a single hung lsof call consumes the entire polling budget, causing waitForPortFreeSync() to exit on the free: null path while the port remains bound — silently disabling the fix under precisely the heavy-load conditions it targets. Setting POLL_SPAWN_TIMEOUT_MS = 400 ms ensures up to five independent lsof invocations fit within the 2-second polling budget, each with a definitive success or error outcome.

Design rationale — synchronous polling: cleanStaleGatewayProcessesSync() is called from triggerOpenClawRestart() immediately before a blocking spawnSync("systemctl", [...]). Converting the call chain to async is out of scope; the synchronous Atomics.wait sleep primitive already present in this file makes a synchronous poll straightforward and correct.

Design rationale — lsof over kill(pid, 0): After SIGKILL, the target process is removed from the process table almost immediately, but the bound socket can persist. Polling process liveness via kill(pid, 0) would return "dead" while the port remains bound. lsof queries the actual kernel socket state, which is the correct predicate.

Changes

File	Change
`src/infra/restart-stale-pids.ts`	Added `pollPortOnce()`, `waitForPortFreeSync()`, `POLL_SPAWN_TIMEOUT_MS`; increased SIGTERM/SIGKILL wait budgets; optional `spawnTimeoutMs` param on `findGatewayPidsOnPortSync`; wired `waitForPortFreeSync()` into `cleanStaleGatewayProcessesSync()`
`src/infra/restart-stale-pids.test.ts`	New test file — 12 unit tests covering `findGatewayPidsOnPortSync`, `cleanStaleGatewayProcessesSync`, the #33103 regression, and the lsof-timeout edge case identified in review

Test Suite

All tests execute with a synchronous sleep shim (__testing.setSleepSyncOverride) to eliminate real-time waits, and with mocked spawnSync and resolveGatewayPort to remove dependencies on root privileges, lsof availability, or a live gateway process. 12/12 tests pass on vitest v4.0.18.

`findGatewayPidsOnPortSync`

Test	Rationale
Returns `[]` when lsof exits non-zero	lsof returns status 1 when no listeners exist; must not throw
Returns `[]` when lsof returns an error object	Covers `ENOENT` (lsof not installed) and permission-denied scenarios
Parses openclaw-gateway pids and excludes the current process	Validates the `includes("openclaw")` command filter and the `pid !== process.pid` self-exclusion guard
Excludes pids whose command does not include 'openclaw'	Ensures unrelated processes sharing the port are not killed
Forwards `spawnTimeoutMs` to `spawnSync`	Verifies the polling path uses `POLL_SPAWN_TIMEOUT_MS` (400 ms) rather than the default 2000 ms

`cleanStaleGatewayProcessesSync`

Test	Rationale
Returns `[]` and does not call `kill` when port has no listeners	No-op path must issue no signals
Sends `SIGTERM` to stale pids and returns them	Validates the happy-path kill sequence and return value
Escalates to `SIGKILL` when process survives the SIGTERM window	Validates that SIGTERM budget expiry drives `SIGKILL` escalation
Polls until port is confirmed free before returning — regression for #33103	Verifies `cleanStaleGatewayProcessesSync` continues polling after kill until the port is confirmed free, not a fixed sleep
Does not treat a timed-out lsof poll as port-free — review edge case	Verifies that when every poll call returns an error (simulating lsof timeout), the loop bails via the inconclusive path after one attempt rather than spinning the full budget and falsely reporting the port as free
Returns gracefully when `resolveGatewayPort` throws	Config read failure must not propagate to the caller
Returns gracefully when lsof is unavailable from the start	`ENOENT` on lsof must degrade silently

Manual Verification

Confirmed on Ubuntu 24.04.2 LTS, systemd 255, OpenClaw 2026.3.2, port 18789:

Before patch: Triggering a gateway restart via the gateway tool with active streaming connections reproducibly caused the restart loop (systemd restart counter reaching 10+ within 30 seconds, port never released).
After patch: Clean single-attempt restart observed across 20 consecutive trials under load (concurrent Matrix polling, Telegram streaming, and Telegram webhook delivery active).

greptile-apps · 2026-03-03T19:23:44Z

Greptile Summary

This PR fixes a post-SIGKILL EADDRINUSE restart loop by introducing waitForPortFreeSync(), which polls lsof until the gateway port is confirmed free before triggerOpenClawRestart hands off to systemd. SIGTERM/SIGKILL fixed-wait budgets are also doubled (600 ms / 400 ms), and a thorough test suite with a synchronous-sleep shim is added.

Key Finding:

PORT_FREE_TIMEOUT_MS == SPAWN_TIMEOUT_MS (both 2000 ms): If lsof is slow and hits the spawnSync timeout during polling, findGatewayPidsOnPortSync returns [] (error path). waitForPortFreeSync then exits immediately on line 122-123, treating the inconclusive timeout as "port is free". This silently disables the fix under precisely the heavy-load conditions it targets. Using a shorter per-call spawn timeout for poll invocations would allow multiple retries within the polling budget.

Confidence Score: 3/5

The fix prevents the EADDRINUSE restart loop but has an edge case where slow/timed-out lsof calls can disable polling on loaded systems.
The core approach is sound—polling lsof after SIGKILL to verify port release is the correct fix. Tests are comprehensive and well-isolated. However, PORT_FREE_TIMEOUT_MS == SPAWN_TIMEOUT_MS (both 2000 ms) creates a logic gap: a single timed-out lsof call can consume the entire polling budget and cause the function to exit prematurely, returning false "all clear" without confirming the port is actually free. This risk is most acute under the heavy-load scenarios the PR targets, where lsof slowness is most likely. The fix mitigates the restart loop but remains vulnerable in its intended use case.
src/infra/restart-stale-pids.ts — Review timeout constants and polling budget allocation.

_{Last reviewed commit: 4343ce0}

greptile-apps

_{2 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-03-03T19:23:48Z

src/infra/restart-stale-pids.ts

 const SPAWN_TIMEOUT_MS = 2000;
-const STALE_SIGTERM_WAIT_MS = 300;
-const STALE_SIGKILL_WAIT_MS = 200;
+const STALE_SIGTERM_WAIT_MS = 600;
+const STALE_SIGKILL_WAIT_MS = 400;
+/**
+ * After SIGKILL, the kernel may not release the TCP port immediately.
+ * Poll until the port is confirmed free (or until the budget expires) before
+ * returning control to the caller (typically `triggerOpenClawRestart` →
+ * `systemctl restart`). Without this wait the new process races the dying
+ * process for the port and systemd enters an EADDRINUSE restart loop.
+ */
+const PORT_FREE_POLL_INTERVAL_MS = 50;
+const PORT_FREE_TIMEOUT_MS = 2000;


SPAWN_TIMEOUT_MS (2000 ms) equals PORT_FREE_TIMEOUT_MS (2000 ms). If a single lsof call times out during polling on a slow/loaded system, spawnSync returns an error, findGatewayPidsOnPortSync returns [], and waitForPortFreeSync immediately returns on line 122-123—interpreting a timed-out query as "port is free".

This defeats the fix's purpose on the exact systems it's designed for. On a 2-second-loaded system, the first lsof call could consume ~2000 ms, timeout, and exhaust the entire polling budget in a single iteration.

Consider raising PORT_FREE_TIMEOUT_MS significantly above SPAWN_TIMEOUT_MS, or use a shorter per-call timeout for poll invocations:

const PORT_FREE_SPAWN_TIMEOUT_MS = 500; // per-lsof-call timeout during polling const PORT_FREE_POLL_INTERVAL_MS = 50; const PORT_FREE_TIMEOUT_MS = 2000; // overall polling budget

Alternatively, findGatewayPidsOnPortSync could distinguish "[] due to timeout" from "[] due to port free" so waitForPortFreeSync can skip the early-return path on error and continue polling.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/infra/restart-stale-pids.ts Line: 6-17 Comment: `SPAWN_TIMEOUT_MS` (2000 ms) equals `PORT_FREE_TIMEOUT_MS` (2000 ms). If a single `lsof` call times out during polling on a slow/loaded system, `spawnSync` returns an error, `findGatewayPidsOnPortSync` returns `[]`, and `waitForPortFreeSync` immediately returns on line 122-123—interpreting a timed-out query as "port is free". This defeats the fix's purpose on the exact systems it's designed for. On a 2-second-loaded system, the first `lsof` call could consume ~2000 ms, timeout, and exhaust the entire polling budget in a single iteration. Consider raising `PORT_FREE_TIMEOUT_MS` significantly above `SPAWN_TIMEOUT_MS`, or use a shorter per-call timeout for poll invocations: ```typescript const PORT_FREE_SPAWN_TIMEOUT_MS = 500; // per-lsof-call timeout during polling const PORT_FREE_POLL_INTERVAL_MS = 50; const PORT_FREE_TIMEOUT_MS = 2000; // overall polling budget ``` Alternatively, `findGatewayPidsOnPortSync` could distinguish "[] due to timeout" from "[] due to port free" so `waitForPortFreeSync` can skip the early-return path on error and continue polling. How can I resolve this? If you propose a fix, please make it concise.

@greptile-apps Thanks for the sharp catch — this was a real bug in v1 and the PR has been updated to address it directly.

What was wrong in v1:
SPAWN_TIMEOUT_MS == PORT_FREE_TIMEOUT_MS == 2000 ms. A single hung lsof call consumed the entire polling budget, findGatewayPidsOnPortSync collapsed the timeout into [], and waitForPortFreeSync exited on the empty-array path — incorrectly treating an inconclusive result as "port is free".

How v2 fixes it:

1. POLL_SPAWN_TIMEOUT_MS = 400 ms — a dedicated per-call spawn timeout used exclusively during polling, intentionally much shorter than SPAWN_TIMEOUT_MS = 2000 ms. At 400 ms per call, up to five independent lsof invocations fit within the 2-second polling budget, each producing a definitive result.

2. pollPortOnce() — three-state discriminated union:

function pollPortOnce(port: number): { free: true } | { free: false } | { free: null }

{ free: true } — lsof succeeded, no listeners found (port confirmed free)

{ free: false } — lsof succeeded, listeners present (port still bound)

{ free: null } — lsof timed out or errored (result inconclusive)

This directly addresses your alternative suggestion: findGatewayPidsOnPortSync no longer needs to distinguish timeout from empty — pollPortOnce does it at the call site by inspecting res.error before parsing output.

3. waitForPortFreeSync exit semantics:

free: true → return normally (port confirmed free)

free: null → bail immediately (lsof unavailable, graceful degradation — same worst-case as before the fix)

free: false → sleep 50 ms and retry

A timed-out lsof call now produces { free: null } and never reaches the free: true return path.

4. Regression test explicitly covering this case:
The test suite includes "does not treat a timed-out lsof poll as port-free" which simulates every poll returning a spawnSync error and verifies the loop bails via the inconclusive path rather than falsely reporting the port as free. 12/12 tests pass.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24a7065720

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-03T19:30:43Z

src/infra/restart-stale-pids.ts

+    const pids = findGatewayPidsOnPortSync(port, POLL_SPAWN_TIMEOUT_MS);
+    return pids.length === 0 ? { free: true } : { free: false };


Don't treat failed second lsof probe as a free port

pollPortOnce already has a successful lsof result (status === 0 means something is listening), but it then calls findGatewayPidsOnPortSync and converts that second probe to a boolean free/busy check. Because findGatewayPidsOnPortSync returns [] for any spawn error/timeout, a transient failure on the second call is misclassified as free: true, so waitForPortFreeSync can exit early and triggerOpenClawRestart() can still race the old socket and hit EADDRINUSE under load. This should preserve an inconclusive state instead of mapping second-probe errors to “port free”.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector Good catch — this was a real bug and it is now fixed in v3 (commit 815b408).

What was wrong:
pollPortOnce called findGatewayPidsOnPortSync(port, POLL_SPAWN_TIMEOUT_MS) as a second lsof probe after the first spawnSync returned status === 0. Since findGatewayPidsOnPortSync collapses any spawn error or timeout into [], a transient failure on that second call produced { free: true } — incorrectly reporting the port as free without a verified probe result. This was the same class of bug Greptile flagged (inconclusive result misclassified as confirmed-free), surfaced through a different code path.

The fix:

The pid parsing logic has been extracted into a pure, zero-I/O helper parsePidsFromLsofOutput(stdout: string). pollPortOnce now calls parsePidsFromLsofOutput(res.stdout) directly on the output it already holds from the single spawnSync invocation — no second lsof spawn, no second failure surface.

// Before (v2) — second lsof spawn, error collapses to free:true const pids = findGatewayPidsOnPortSync(port, POLL_SPAWN_TIMEOUT_MS); return pids.length === 0 ? { free: true } : { free: false }; // After (v3) — parse inline from existing res.stdout, zero second spawn const pids = parsePidsFromLsofOutput(res.stdout); return pids.length === 0 ? { free: true } : { free: false };

parsePidsFromLsofOutput is a pure function with no I/O — it cannot time out, error, or return a false empty result. The three-state semantics of pollPortOnce are now structurally enforced: the only path to { free: true } from a status === 0 lsof result is through parsing the stdout we already verified succeeded.

New regression test added:
"does not make a second lsof call when the first returns status 0" — asserts that across an initial find + one busy poll + one free poll, exactly 3 spawnSync calls are made total. If pollPortOnce were still making a second probe, the count would be 5. 13/13 tests pass.

To use Codex here, create a Codex account and connect to github.

Fixed in commit f94d2b90a (v4) — both issues addressed:

P1 (bail on first null) — fixed:
waitForPortFreeSync now distinguishes permanent from transient inconclusive results. pollPortOnce returns a 4-state discriminated union: { free: true }, { free: false }, { free: null, permanent: false } (transient — keep retrying), or { free: null, permanent: true } (ENOENT/EACCES/EPERM — lsof permanently unavailable, bail). Only permanent: true exits the loop early. A single slow or errored lsof call no longer aborts the polling budget.

P2 (status !== 0 → free: true) — fixed:
pollPortOnce now maps lsof exit codes precisely:

status === 0 → lsof found listeners, parse pids

status === 1 → lsof canonical "no matching processes" exit → { free: true }

status > 1 → runtime/permission/flag error → { free: null, permanent: false } (inconclusive, keep polling)

Mapping status > 1 to free: true was the exact bug: a permission denied or flag error would have let restart race a still-bound socket.

Tests added (26 total, 92.6% stmt / 90.7% branch / 100% fn coverage):

status === 1 → port-free

status > 1 → inconclusive, not free (P2 regression)

transient error (ETIMEDOUT) → loop continues (P1 regression)

ENOENT/EACCES/EPERM → permanent bail

dateNowOverride in __testing replaces the 2s real-clock budget test with a fake-time test (435ms total suite)

Atomics.wait throw → busy-wait fallback

win32 platform guard

pollPortOnce outer catch (resolveLsofCommandSync throws)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 815b40827f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-03T20:18:02Z

src/infra/restart-stale-pids.ts

+    if (result.free === null) {
+      // lsof is unavailable or errored — cannot determine port state, bail out
+      return;


Keep polling after inconclusive lsof probe results

This branch exits the wait loop on the first free: null, but pollPortOnce() returns free: null for any spawnSync error (including a timeout). That means one transient slow lsof call under load immediately skips the remaining 2s polling budget and allows restart to proceed while the old listener may still hold the port, recreating the EADDRINUSE race this patch is meant to prevent.

Useful? React with 👍 / 👎.

Fixed in f94d2b90a (v4).

pollPortOnce now returns a 4-state discriminated union:

type PollResult = | { free: true } | { free: false } | { free: null; permanent: false } // transient — keep retrying | { free: null; permanent: true } // ENOENT/EACCES/EPERM — bail

waitForPortFreeSync only bails on permanent: true. A transient spawnSync error (timeout, signal, anything without ENOENT/EACCES/EPERM) returns { free: null, permanent: false } and the loop sleeps and retries until the wall-clock budget expires. A single slow lsof call no longer aborts the polling window.

Regression test added: "continues polling on transient lsof errors (not ENOENT) — Codex P1 fix" — asserts the loop reaches a subsequent free poll after a transient error.

chatgpt-codex-connector · 2026-03-03T20:18:02Z

src/infra/restart-stale-pids.ts

+    if (res.status !== 0) {
+      // lsof exited non-zero with no error object: no listeners found (status 1 is normal)
+      return { free: true };


Treat non-zero lsof exits as inconclusive, not free

This maps every non-zero lsof exit status to free: true, but spawnSync only populates error for spawn failures; command-level failures still come through as non-zero status values. If lsof exits non-zero due runtime/permission/flag issues, this path incorrectly reports the port as free and can let restart race a still-bound socket.

Useful? React with 👍 / 👎.

Fixed in f94d2b90a (v4).

pollPortOnce now maps lsof exit codes precisely rather than treating all non-zero exits as port-free:

if (res.status === 1) { // lsof canonical "no matching processes" exit — port is genuinely free return { free: true }; } if (res.status !== 0) { // status > 1: runtime/permission/flag error — inconclusive, keep polling return { free: null, permanent: false }; }

status === 1 is lsof's documented "no matching processes found" exit — the only non-zero exit that reliably means port-free. Any other non-zero exit (permission denied, bad flag, runtime error) returns { free: null, permanent: false } and the loop retries. Mapping those to free: true was the exact race this patch is supposed to prevent.

Regression test added: "treats lsof exit status >1 as inconclusive, not port-free — Codex P2 regression".

efe-arv · 2026-03-03T22:32:41Z

CI update — commit b0de0acc1 (v5)

Windows test failures (our bug, now fixed): The process.platform win32 mock in the edge-case coverage test was leaking across the test suite. Object.getOwnPropertyDescriptor(process, "platform") returns undefined on Windows CI because platform is a prototype getter — so the if (origDescriptor) restore was skipped and process.platform stayed shadowed as an own property with value "win32" for all subsequent tests. Every downstream call to findGatewayPidsOnPortSync hit the early return and never reached spawnSync, causing 13/26 tests to fail. Fixed by delete process["platform"] in the else branch to drop the shadow and restore prototype lookup.

Bun unit test failure (src/plugins/commands.test.ts) — pre-existing, unrelated to this PR: The failing assertion (acceptsArgs: false field unexpected) is present on main as well (run 22643809542 on a8dd9ffe → CI: failure). This PR does not touch src/plugins/ at all.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0de0acc1c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-03T22:36:06Z

src/infra/restart-stale-pids.ts

+    const pids = parsePidsFromLsofOutput(res.stdout);
+    return pids.length === 0 ? { free: true } : { free: false };


Treat any lsof listener as busy during port-free polling

pollPortOnce maps status === 0 to { free: true } when parsePidsFromLsofOutput yields no OpenClaw PIDs, but that parser intentionally filters out non-openclaw commands. If another process is listening on the port during restart (or the listener command name does not contain openclaw), the poll loop exits as “port free” immediately and cleanStaleGatewayProcessesSync can still hand off to restart while the socket is occupied, causing the same EADDRINUSE failure path this change is trying to prevent.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector This is an intentional design choice — pollPortOnce is not checking whether the port is free in absolute terms; it is checking whether the openclaw gateway process we just killed has released it.

The only callers of cleanStaleGatewayProcessesSync are paths that already identified and killed openclaw PIDs on this port. If lsof exits 0 and no openclaw commands appear, one of two things is true:

The stale process has fully exited — its socket was removed by the kernel when the process died. No openclaw PIDs means our target is gone → { free: true } is the correct signal.

An unrelated process holds port 18789 — we cannot kill it, waiting is pointless, and we should proceed so systemd can report the conflict explicitly via a clear EADDRINUSE log.

In both cases { free: true } is the right action. Returning { free: false } for case 2 would only burn the polling budget and then proceed anyway (the budget-exhausted warning path). This is documented on findGatewayPidsOnPortSync: "Returns only PIDs that belong to openclaw gateway processes."

Unlike isPortBusy in ports.ts (which answers "can a new process bind here?"), waitForPortFreeSync answers "is the process we killed still holding the socket?" Those are different questions with different correct semantics.

To use Codex here, create a Codex account and connect to github.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e1738207f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-03T23:09:06Z

src/infra/restart-stale-pids.ts

+    if (res.status === 1) {
+      // lsof canonical "no matching processes" exit — port is genuinely free.
+      return { free: true };


Treat status-1 lsof results with stderr as inconclusive

This branch always maps res.status === 1 to free: true, but elsewhere in this repo we already handle lsof exit 1 with non-empty stderr (for example permission-denied output) as an error rather than a clean “no listeners” result (src/cli/ports.ts and src/infra/ports-inspect.ts). In restricted environments, this can cause waitForPortFreeSync to stop polling immediately and proceed to restart while port state is still unknown, reintroducing the EADDRINUSE race the patch is trying to prevent.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector The ports.ts path you reference throws EACCES to surface an explicit error to the interactive CLI user. pollPortOnce is a background polling loop — throwing would be wrong; we need to keep retrying or bail permanently.

The cases where lsof exits 1 with stderr are:

lsof: WARNING: can't stat() fuse... — filesystem-type detection warning, unrelated to TCP socket reporting

AppArmor/seccomp process-level permission warnings — affect /proc/PID/fd traversal, not /proc/net/tcp[6]

TCP LISTEN socket visibility (what -sTCP:LISTEN reads) comes from /proc/net/tcp which is world-readable in virtually all Linux container environments. lsof permission errors in stderr do not indicate missed TCP listener entries.

The actually dangerous container edge case — lsof exits 1 AND still has openclaw pids in stdout (partial AppArmor output) — is already handled: we parse stdout when non-empty and return { free: false } if openclaw pids appear. That test is in the suite (lsof status 1 with non-empty openclaw stdout is treated as busy, not free (Linux container edge case)).

Adding a stderr regex check to match ports.ts would add latency (polling the full 2s budget on environments where lsof always emits "can't stat() fuse" warnings for NFS/FUSE mounts — common in containers) with no real benefit since TCP socket data is unaffected by those warnings.

To use Codex here, create a Codex account and connect to github.

…art loop When cleanStaleGatewayProcessesSync() kills a stale gateway process, the kernel may not immediately release the TCP port. Previously the function returned after a fixed 500ms sleep (300ms SIGTERM + 200ms SIGKILL), allowing triggerOpenClawRestart() to hand off to systemd before the port was actually free. The new systemd process then raced the dying socket for port 18789, hit EADDRINUSE, and exited with status 1, causing systemd to retry indefinitely — the zombie restart loop reported in openclaw#33103. Fix: add waitForPortFreeSync() that polls lsof at 50ms intervals for up to 2 seconds after SIGKILL. cleanStaleGatewayProcessesSync() now blocks until the port is confirmed free (or the budget expires with a warning) before returning. The increased SIGTERM/SIGKILL wait budgets (600ms / 400ms) also give slow processes more time to exit cleanly. Fixes openclaw#33103 Related: openclaw#28134

efe-arv · 2026-03-04T00:11:31Z

v7 (9c2ebe778) — 3 additional fixes

1. src/plugins/commands.test.ts — updated snapshot to include acceptsArgs: false
This test was failing on both bun and Windows CI (pre-existing on main): getPluginCommandSpecs() was updated to include acceptsArgs in its return type, but the test expectation was not updated. Fixed.

2. parsePidsFromLsofOutput — deduplicate PIDs from dual-stack listeners
Dual-stack listeners (IPv4 + IPv6) cause lsof to emit the same PID twice in -Fpc output. Without dedup, terminateStaleProcessesSync sends SIGTERM twice to the same process and the returned killed array contains duplicates. Fixed with [...new Set(pids)] before the process.pid exclusion filter.

3. pollPortOnce — handle lsof status 1 with non-empty stdout
On Linux containers with restricted /proc (AppArmor, seccomp, user namespaces), lsof can exit 1 AND still emit output for processes it could read. Previously status === 1 unconditionally returned { free: true }. Now parses stdout when non-empty and returns { free: false } if openclaw pids are present, preventing a false-free in container environments.

28 tests passing (2 new: dedup regression, Linux container status-1 edge case). 434ms suite.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c2ebe7785

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-04T00:15:36Z

src/infra/restart-stale-pids.ts

+    // status === 0: lsof found listeners. Parse pids from the stdout we
+    // already hold — no second lsof spawn, no new failure surface.
+    const pids = parsePidsFromLsofOutput(res.stdout);
+    return pids.length === 0 ? { free: true } : { free: false };


Treat all lsof listeners as busy while polling

This status === 0 branch decides free/busy from parsePidsFromLsofOutput, but that parser intentionally filters to commands containing openclaw. If the port is held by another process (or lsof can’t expose the command name), pids.length becomes 0 and this returns { free: true }, so waitForPortFreeSync can stop early while the socket is still occupied and the subsequent restart can still hit EADDRINUSE.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector Same answer as the P2 above — by design. pollPortOnce answers "is the openclaw process we just killed still holding the socket?" not "is the port clear for a new bind?" Those have different correct semantics; see the reply on the sibling comment for the full rationale.

To use Codex here, create a Codex account and connect to github.

chatgpt-codex-connector · 2026-03-04T00:15:36Z

src/infra/restart-stale-pids.ts

+      if (res.stdout) {
+        const pids = parsePidsFromLsofOutput(res.stdout);
+        return pids.length === 0 ? { free: true } : { free: false };
+      }
+      return { free: true };


Treat lsof status 1 with stderr as inconclusive

When res.status === 1, this path returns { free: true } whenever stdout is empty, but it never checks whether lsof also emitted permission/runtime errors on stderr. In restricted environments that can produce status 1 with error text, this maps an inconclusive probe to “port free,” causing waitForPortFreeSync to exit immediately and allowing the restart path to race a still-bound port.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector Same rationale as the P2 above on 2880954508 — see that reply for full explanation. In short: TCP socket visibility is not affected by the file-level permission warnings that appear in lsof stderr; the openclaw-stdout container edge case (status 1 + openclaw pids in stdout) is already handled.

To use Codex here, create a Codex account and connect to github.

Audit findings fixed: - remove dead guard: terminateStaleProcessesSync pids.length===0 check was unreachable (only caller cleanStaleGatewayProcessesSync already guards) - expose __testing.callSleepSyncRaw so sleepSync's real Atomics.wait path can be unit-tested directly without going through the override - fix broken sleepSync Atomics.wait test: previous test set override=null but cleanStaleGatewayProcessesSync returned before calling sleepSync — replaced with direct callSleepSyncRaw calls that actually exercise L36/L42-47 - fix pid collision: two tests used process.pid+304 (EPERM + dead-at-SIGTERM); EPERM test changed to process.pid+305 - fix misindented tests: 'deduplicates pids' and 'lsof status 1 container edge case' were outside their intended describe blocks; moved to correct scopes (findGatewayPidsOnPortSync and pollPortOnce respectively) - add missing branch tests: - status 1 + non-empty stdout with zero openclaw pids → free:true (L145) - mid-loop non-openclaw cmd in &&-chain (L67) - consecutive p-lines without c-line between them (L67) - invalid PID in p-line (p0 / pNaN) — ternary false branch (L67) - unknown lsof output line (else-if false branch L69) Coverage: 100% stmts / 100% branch / 100% funcs / 100% lines (36 tests)

Takhoffman · 2026-03-04T03:34:04Z

Thanks again for this work.

This was landed indirectly via the synthesized PR #33831, and your contribution is credited in the changelog and as a co-author on the merge commit.

This PR merge/triage workflow is AI-assisted.

Closing this PR as superseded by #33831. If anything here looks incorrect or incomplete, reply to reopen and we can reassess.

greptile-apps bot reviewed Mar 3, 2026

View reviewed changes

efe-arv force-pushed the fix/gateway-zombie-restart-loop branch from 4343ce0 to 24a7065 Compare March 3, 2026 19:26

chatgpt-codex-connector bot reviewed Mar 3, 2026

View reviewed changes

efe-arv force-pushed the fix/gateway-zombie-restart-loop branch from 24a7065 to 815b408 Compare March 3, 2026 20:13

chatgpt-codex-connector bot reviewed Mar 3, 2026

View reviewed changes

openclaw-barnacle bot added the size: M label Mar 3, 2026

efe-arv force-pushed the fix/gateway-zombie-restart-loop branch from 815b408 to f94d2b9 Compare March 3, 2026 21:14

openclaw-barnacle bot added size: L and removed size: M labels Mar 3, 2026

efe-arv force-pushed the fix/gateway-zombie-restart-loop branch from f94d2b9 to b0de0ac Compare March 3, 2026 22:32

chatgpt-codex-connector bot reviewed Mar 3, 2026

View reviewed changes

efe-arv force-pushed the fix/gateway-zombie-restart-loop branch from b0de0ac to 9e17382 Compare March 3, 2026 23:04

chatgpt-codex-connector bot reviewed Mar 3, 2026

View reviewed changes

efe-arv force-pushed the fix/gateway-zombie-restart-loop branch from 9e17382 to 9c2ebe7 Compare March 4, 2026 00:11

chatgpt-codex-connector bot reviewed Mar 4, 2026

View reviewed changes

Takhoffman mentioned this pull request Mar 4, 2026

fix(gateway): synthesize lifecycle robustness for restart and startup probes #33831

Merged

Takhoffman closed this Mar 4, 2026

		const pids = findGatewayPidsOnPortSync(port, POLL_SPAWN_TIMEOUT_MS);
		return pids.length === 0 ? { free: true } : { free: false };

		const pids = parsePidsFromLsofOutput(res.stdout);
		return pids.length === 0 ? { free: true } : { free: false };

Uh oh!

Conversation

efe-arv commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!