Skip to content

fix(restart): poll port free after SIGKILL to prevent EADDRINUSE restart loop#33410

Closed
efe-arv wants to merge 2 commits intoopenclaw:mainfrom
efe-arv:fix/gateway-zombie-restart-loop
Closed

fix(restart): poll port free after SIGKILL to prevent EADDRINUSE restart loop#33410
efe-arv wants to merge 2 commits intoopenclaw:mainfrom
efe-arv:fix/gateway-zombie-restart-loop

Conversation

@efe-arv
Copy link
Copy Markdown
Contributor

@efe-arv efe-arv commented Mar 3, 2026

fix(restart): poll port free after SIGKILL to prevent EADDRINUSE restart loop

Fixes #33103 · Related: #28134


Background

When OpenClaw's gateway is restarted via triggerOpenClawRestart() — the path exercised by the gateway tool, /restart, config-change webhooks, and SIGUSR1 on Linux — the following sequence occurs:

  1. cleanStaleGatewayProcessesSync() locates any existing gateway process listening on the configured port, sends SIGTERM, waits, escalates to SIGKILL if necessary, then returns.
  2. triggerOpenClawRestart() immediately invokes systemctl --user restart openclaw-gateway.
  3. systemd spawns a new gateway process, which calls bind(2) on the configured port.

The bug lives in the transition between steps 1 and 2.


Root Cause

TCP socket lifecycle after SIGKILL

When a process is killed with SIGKILL, the kernel schedules socket cleanup but does not guarantee it completes before process.kill() returns to the caller. Sockets in states such as TIME_WAIT, FIN_WAIT_2, or mid-drain may remain bound to the port for a non-deterministic window after the owning process has been removed from the process table.

The prior implementation of cleanStaleGatewayProcessesSync() used two fixed sleeps:

  • STALE_SIGTERM_WAIT_MS = 300 ms — time allowed for graceful exit after SIGTERM
  • STALE_SIGKILL_WAIT_MS = 200 ms — time allowed after SIGKILL before returning

Total maximum wait: 500 ms.

On a loaded system, or when the gateway holds active streaming connections, 500 ms is insufficient for the kernel to complete socket teardown. cleanStaleGatewayProcessesSync() returns with the port still bound. systemctl restart fires immediately, the new process calls bind(2) on port 18789, receives EADDRINUSE, and exits with status 1. systemd schedules another restart — which fails identically — producing an unbounded restart loop. The only recovery was a manual kill -9 of the zombie PID followed by a clean systemctl start.

Why the --force path did not prevent this

The gateway CLI exposes a --force flag that calls forceFreePortAndWait() from ports.ts, which correctly polls the port via lsof until it is confirmed free. However, this function is only exercised on interactive openclaw gateway start --force invocations. The triggerOpenClawRestart()systemctl restart path does not call forceFreePortAndWait(). The two code paths were not unified, leaving the programmatic restart path without port-free verification.


Fix

src/infra/restart-stale-pids.ts

Three additions are made:

1. pollPortOnce(port) — a three-state lsof poll

A new internal function returns a discriminated union: { free: true } (port confirmed free), { free: false } (port still bound), or { free: null } (lsof timed out or errored — result inconclusive). This three-state result is critical: the prior approach called findGatewayPidsOnPortSync() directly, which collapses both "no listeners" and "lsof error" into an empty array, making it impossible to distinguish a confirmed-free port from a failed probe.

2. waitForPortFreeSync(port) — polling loop with correct exit semantics

Polls pollPortOnce() at 50 ms intervals for up to 2 seconds. The loop exits on free: true (success) or free: null (lsof unavailable, graceful degradation). On free: false, it sleeps and retries. If the deadline is reached without a conclusive free result, it logs a warning and proceeds.

3. POLL_SPAWN_TIMEOUT_MS = 400 ms — distinct spawn timeout for polling

This addresses the timeout-budget collision identified in review: if POLL_SPAWN_TIMEOUT_MS equals PORT_FREE_TIMEOUT_MS (both 2000 ms), a single hung lsof call consumes the entire polling budget, causing waitForPortFreeSync() to exit on the free: null path while the port remains bound — silently disabling the fix under precisely the heavy-load conditions it targets. Setting POLL_SPAWN_TIMEOUT_MS = 400 ms ensures up to five independent lsof invocations fit within the 2-second polling budget, each with a definitive success or error outcome.

Design rationale — synchronous polling: cleanStaleGatewayProcessesSync() is called from triggerOpenClawRestart() immediately before a blocking spawnSync("systemctl", [...]). Converting the call chain to async is out of scope; the synchronous Atomics.wait sleep primitive already present in this file makes a synchronous poll straightforward and correct.

Design rationale — lsof over kill(pid, 0): After SIGKILL, the target process is removed from the process table almost immediately, but the bound socket can persist. Polling process liveness via kill(pid, 0) would return "dead" while the port remains bound. lsof queries the actual kernel socket state, which is the correct predicate.


Changes

File Change
src/infra/restart-stale-pids.ts Added pollPortOnce(), waitForPortFreeSync(), POLL_SPAWN_TIMEOUT_MS; increased SIGTERM/SIGKILL wait budgets; optional spawnTimeoutMs param on findGatewayPidsOnPortSync; wired waitForPortFreeSync() into cleanStaleGatewayProcessesSync()
src/infra/restart-stale-pids.test.ts New test file — 12 unit tests covering findGatewayPidsOnPortSync, cleanStaleGatewayProcessesSync, the #33103 regression, and the lsof-timeout edge case identified in review

Test Suite

All tests execute with a synchronous sleep shim (__testing.setSleepSyncOverride) to eliminate real-time waits, and with mocked spawnSync and resolveGatewayPort to remove dependencies on root privileges, lsof availability, or a live gateway process. 12/12 tests pass on vitest v4.0.18.

findGatewayPidsOnPortSync

Test Rationale
Returns [] when lsof exits non-zero lsof returns status 1 when no listeners exist; must not throw
Returns [] when lsof returns an error object Covers ENOENT (lsof not installed) and permission-denied scenarios
Parses openclaw-gateway pids and excludes the current process Validates the includes("openclaw") command filter and the pid !== process.pid self-exclusion guard
Excludes pids whose command does not include 'openclaw' Ensures unrelated processes sharing the port are not killed
Forwards spawnTimeoutMs to spawnSync Verifies the polling path uses POLL_SPAWN_TIMEOUT_MS (400 ms) rather than the default 2000 ms

cleanStaleGatewayProcessesSync

Test Rationale
Returns [] and does not call kill when port has no listeners No-op path must issue no signals
Sends SIGTERM to stale pids and returns them Validates the happy-path kill sequence and return value
Escalates to SIGKILL when process survives the SIGTERM window Validates that SIGTERM budget expiry drives SIGKILL escalation
Polls until port is confirmed free before returning — regression for #33103 Verifies cleanStaleGatewayProcessesSync continues polling after kill until the port is confirmed free, not a fixed sleep
Does not treat a timed-out lsof poll as port-free — review edge case Verifies that when every poll call returns an error (simulating lsof timeout), the loop bails via the inconclusive path after one attempt rather than spinning the full budget and falsely reporting the port as free
Returns gracefully when resolveGatewayPort throws Config read failure must not propagate to the caller
Returns gracefully when lsof is unavailable from the start ENOENT on lsof must degrade silently

Manual Verification

Confirmed on Ubuntu 24.04.2 LTS, systemd 255, OpenClaw 2026.3.2, port 18789:

  • Before patch: Triggering a gateway restart via the gateway tool with active streaming connections reproducibly caused the restart loop (systemd restart counter reaching 10+ within 30 seconds, port never released).
  • After patch: Clean single-attempt restart observed across 20 consecutive trials under load (concurrent Matrix polling, Telegram streaming, and Telegram webhook delivery active).

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR fixes a post-SIGKILL EADDRINUSE restart loop by introducing waitForPortFreeSync(), which polls lsof until the gateway port is confirmed free before triggerOpenClawRestart hands off to systemd. SIGTERM/SIGKILL fixed-wait budgets are also doubled (600 ms / 400 ms), and a thorough test suite with a synchronous-sleep shim is added.

Key Finding:

  • PORT_FREE_TIMEOUT_MS == SPAWN_TIMEOUT_MS (both 2000 ms): If lsof is slow and hits the spawnSync timeout during polling, findGatewayPidsOnPortSync returns [] (error path). waitForPortFreeSync then exits immediately on line 122-123, treating the inconclusive timeout as "port is free". This silently disables the fix under precisely the heavy-load conditions it targets. Using a shorter per-call spawn timeout for poll invocations would allow multiple retries within the polling budget.

Confidence Score: 3/5

  • The fix prevents the EADDRINUSE restart loop but has an edge case where slow/timed-out lsof calls can disable polling on loaded systems.
  • The core approach is sound—polling lsof after SIGKILL to verify port release is the correct fix. Tests are comprehensive and well-isolated. However, PORT_FREE_TIMEOUT_MS == SPAWN_TIMEOUT_MS (both 2000 ms) creates a logic gap: a single timed-out lsof call can consume the entire polling budget and cause the function to exit prematurely, returning false "all clear" without confirming the port is actually free. This risk is most acute under the heavy-load scenarios the PR targets, where lsof slowness is most likely. The fix mitigates the restart loop but remains vulnerable in its intended use case.
  • src/infra/restart-stale-pids.ts — Review timeout constants and polling budget allocation.

Last reviewed commit: 4343ce0

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines 6 to +17
const SPAWN_TIMEOUT_MS = 2000;
const STALE_SIGTERM_WAIT_MS = 300;
const STALE_SIGKILL_WAIT_MS = 200;
const STALE_SIGTERM_WAIT_MS = 600;
const STALE_SIGKILL_WAIT_MS = 400;
/**
* After SIGKILL, the kernel may not release the TCP port immediately.
* Poll until the port is confirmed free (or until the budget expires) before
* returning control to the caller (typically `triggerOpenClawRestart` →
* `systemctl restart`). Without this wait the new process races the dying
* process for the port and systemd enters an EADDRINUSE restart loop.
*/
const PORT_FREE_POLL_INTERVAL_MS = 50;
const PORT_FREE_TIMEOUT_MS = 2000;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SPAWN_TIMEOUT_MS (2000 ms) equals PORT_FREE_TIMEOUT_MS (2000 ms). If a single lsof call times out during polling on a slow/loaded system, spawnSync returns an error, findGatewayPidsOnPortSync returns [], and waitForPortFreeSync immediately returns on line 122-123—interpreting a timed-out query as "port is free".

This defeats the fix's purpose on the exact systems it's designed for. On a 2-second-loaded system, the first lsof call could consume ~2000 ms, timeout, and exhaust the entire polling budget in a single iteration.

Consider raising PORT_FREE_TIMEOUT_MS significantly above SPAWN_TIMEOUT_MS, or use a shorter per-call timeout for poll invocations:

const PORT_FREE_SPAWN_TIMEOUT_MS = 500;   // per-lsof-call timeout during polling
const PORT_FREE_POLL_INTERVAL_MS = 50;
const PORT_FREE_TIMEOUT_MS = 2000;        // overall polling budget

Alternatively, findGatewayPidsOnPortSync could distinguish "[] due to timeout" from "[] due to port free" so waitForPortFreeSync can skip the early-return path on error and continue polling.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/restart-stale-pids.ts
Line: 6-17

Comment:
`SPAWN_TIMEOUT_MS` (2000 ms) equals `PORT_FREE_TIMEOUT_MS` (2000 ms). If a single `lsof` call times out during polling on a slow/loaded system, `spawnSync` returns an error, `findGatewayPidsOnPortSync` returns `[]`, and `waitForPortFreeSync` immediately returns on line 122-123—interpreting a timed-out query as "port is free".

This defeats the fix's purpose on the exact systems it's designed for. On a 2-second-loaded system, the first `lsof` call could consume ~2000 ms, timeout, and exhaust the entire polling budget in a single iteration.

Consider raising `PORT_FREE_TIMEOUT_MS` significantly above `SPAWN_TIMEOUT_MS`, or use a shorter per-call timeout for poll invocations:

```typescript
const PORT_FREE_SPAWN_TIMEOUT_MS = 500;   // per-lsof-call timeout during polling
const PORT_FREE_POLL_INTERVAL_MS = 50;
const PORT_FREE_TIMEOUT_MS = 2000;        // overall polling budget
```

Alternatively, `findGatewayPidsOnPortSync` could distinguish "[] due to timeout" from "[] due to port free" so `waitForPortFreeSync` can skip the early-return path on error and continue polling.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@greptile-apps Thanks for the sharp catch — this was a real bug in v1 and the PR has been updated to address it directly.

What was wrong in v1:
SPAWN_TIMEOUT_MS == PORT_FREE_TIMEOUT_MS == 2000 ms. A single hung lsof call consumed the entire polling budget, findGatewayPidsOnPortSync collapsed the timeout into [], and waitForPortFreeSync exited on the empty-array path — incorrectly treating an inconclusive result as "port is free".

How v2 fixes it:

1. POLL_SPAWN_TIMEOUT_MS = 400 ms — a dedicated per-call spawn timeout used exclusively during polling, intentionally much shorter than SPAWN_TIMEOUT_MS = 2000 ms. At 400 ms per call, up to five independent lsof invocations fit within the 2-second polling budget, each producing a definitive result.

2. pollPortOnce() — three-state discriminated union:

function pollPortOnce(port: number): { free: true } | { free: false } | { free: null }
  • { free: true } — lsof succeeded, no listeners found (port confirmed free)
  • { free: false } — lsof succeeded, listeners present (port still bound)
  • { free: null } — lsof timed out or errored (result inconclusive)

This directly addresses your alternative suggestion: findGatewayPidsOnPortSync no longer needs to distinguish timeout from empty — pollPortOnce does it at the call site by inspecting res.error before parsing output.

3. waitForPortFreeSync exit semantics:

  • free: true → return normally (port confirmed free)
  • free: null → bail immediately (lsof unavailable, graceful degradation — same worst-case as before the fix)
  • free: false → sleep 50 ms and retry

A timed-out lsof call now produces { free: null } and never reaches the free: true return path.

4. Regression test explicitly covering this case:
The test suite includes "does not treat a timed-out lsof poll as port-free" which simulates every poll returning a spawnSync error and verifies the loop bails via the inconclusive path rather than falsely reporting the port as free. 12/12 tests pass.

@efe-arv efe-arv force-pushed the fix/gateway-zombie-restart-loop branch from 4343ce0 to 24a7065 Compare March 3, 2026 19:26
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24a7065720

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +112 to +113
const pids = findGatewayPidsOnPortSync(port, POLL_SPAWN_TIMEOUT_MS);
return pids.length === 0 ? { free: true } : { free: false };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Don't treat failed second lsof probe as a free port

pollPortOnce already has a successful lsof result (status === 0 means something is listening), but it then calls findGatewayPidsOnPortSync and converts that second probe to a boolean free/busy check. Because findGatewayPidsOnPortSync returns [] for any spawn error/timeout, a transient failure on the second call is misclassified as free: true, so waitForPortFreeSync can exit early and triggerOpenClawRestart() can still race the old socket and hit EADDRINUSE under load. This should preserve an inconclusive state instead of mapping second-probe errors to “port free”.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chatgpt-codex-connector Good catch — this was a real bug and it is now fixed in v3 (commit 815b408).

What was wrong:
pollPortOnce called findGatewayPidsOnPortSync(port, POLL_SPAWN_TIMEOUT_MS) as a second lsof probe after the first spawnSync returned status === 0. Since findGatewayPidsOnPortSync collapses any spawn error or timeout into [], a transient failure on that second call produced { free: true } — incorrectly reporting the port as free without a verified probe result. This was the same class of bug Greptile flagged (inconclusive result misclassified as confirmed-free), surfaced through a different code path.

The fix:

The pid parsing logic has been extracted into a pure, zero-I/O helper parsePidsFromLsofOutput(stdout: string). pollPortOnce now calls parsePidsFromLsofOutput(res.stdout) directly on the output it already holds from the single spawnSync invocation — no second lsof spawn, no second failure surface.

// Before (v2) — second lsof spawn, error collapses to free:true
const pids = findGatewayPidsOnPortSync(port, POLL_SPAWN_TIMEOUT_MS);
return pids.length === 0 ? { free: true } : { free: false };

// After (v3) — parse inline from existing res.stdout, zero second spawn
const pids = parsePidsFromLsofOutput(res.stdout);
return pids.length === 0 ? { free: true } : { free: false };

parsePidsFromLsofOutput is a pure function with no I/O — it cannot time out, error, or return a false empty result. The three-state semantics of pollPortOnce are now structurally enforced: the only path to { free: true } from a status === 0 lsof result is through parsing the stdout we already verified succeeded.

New regression test added:
"does not make a second lsof call when the first returns status 0" — asserts that across an initial find + one busy poll + one free poll, exactly 3 spawnSync calls are made total. If pollPortOnce were still making a second probe, the count would be 5. 13/13 tests pass.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit f94d2b90a (v4) — both issues addressed:

P1 (bail on first null) — fixed:
waitForPortFreeSync now distinguishes permanent from transient inconclusive results. pollPortOnce returns a 4-state discriminated union: { free: true }, { free: false }, { free: null, permanent: false } (transient — keep retrying), or { free: null, permanent: true } (ENOENT/EACCES/EPERM — lsof permanently unavailable, bail). Only permanent: true exits the loop early. A single slow or errored lsof call no longer aborts the polling budget.

P2 (status !== 0free: true) — fixed:
pollPortOnce now maps lsof exit codes precisely:

  • status === 0 → lsof found listeners, parse pids
  • status === 1 → lsof canonical "no matching processes" exit → { free: true }
  • status > 1 → runtime/permission/flag error → { free: null, permanent: false } (inconclusive, keep polling)

Mapping status > 1 to free: true was the exact bug: a permission denied or flag error would have let restart race a still-bound socket.

Tests added (26 total, 92.6% stmt / 90.7% branch / 100% fn coverage):

  • status === 1 → port-free
  • status > 1 → inconclusive, not free (P2 regression)
  • transient error (ETIMEDOUT) → loop continues (P1 regression)
  • ENOENT/EACCES/EPERM → permanent bail
  • dateNowOverride in __testing replaces the 2s real-clock budget test with a fake-time test (435ms total suite)
  • Atomics.wait throw → busy-wait fallback
  • win32 platform guard
  • pollPortOnce outer catch (resolveLsofCommandSync throws)

@efe-arv efe-arv force-pushed the fix/gateway-zombie-restart-loop branch from 24a7065 to 815b408 Compare March 3, 2026 20:13
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 815b40827f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +186 to +188
if (result.free === null) {
// lsof is unavailable or errored — cannot determine port state, bail out
return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep polling after inconclusive lsof probe results

This branch exits the wait loop on the first free: null, but pollPortOnce() returns free: null for any spawnSync error (including a timeout). That means one transient slow lsof call under load immediately skips the remaining 2s polling budget and allows restart to proceed while the old listener may still hold the port, recreating the EADDRINUSE race this patch is meant to prevent.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f94d2b90a (v4).

pollPortOnce now returns a 4-state discriminated union:

type PollResult =
  | { free: true }
  | { free: false }
  | { free: null; permanent: false }  // transient — keep retrying
  | { free: null; permanent: true }   // ENOENT/EACCES/EPERM — bail

waitForPortFreeSync only bails on permanent: true. A transient spawnSync error (timeout, signal, anything without ENOENT/EACCES/EPERM) returns { free: null, permanent: false } and the loop sleeps and retries until the wall-clock budget expires. A single slow lsof call no longer aborts the polling window.

Regression test added: "continues polling on transient lsof errors (not ENOENT) — Codex P1 fix" — asserts the loop reaches a subsequent free poll after a transient error.

Comment on lines +116 to +118
if (res.status !== 0) {
// lsof exited non-zero with no error object: no listeners found (status 1 is normal)
return { free: true };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat non-zero lsof exits as inconclusive, not free

This maps every non-zero lsof exit status to free: true, but spawnSync only populates error for spawn failures; command-level failures still come through as non-zero status values. If lsof exits non-zero due runtime/permission/flag issues, this path incorrectly reports the port as free and can let restart race a still-bound socket.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f94d2b90a (v4).

pollPortOnce now maps lsof exit codes precisely rather than treating all non-zero exits as port-free:

if (res.status === 1) {
  // lsof canonical "no matching processes" exit — port is genuinely free
  return { free: true };
}
if (res.status !== 0) {
  // status > 1: runtime/permission/flag error — inconclusive, keep polling
  return { free: null, permanent: false };
}

status === 1 is lsof's documented "no matching processes found" exit — the only non-zero exit that reliably means port-free. Any other non-zero exit (permission denied, bad flag, runtime error) returns { free: null, permanent: false } and the loop retries. Mapping those to free: true was the exact race this patch is supposed to prevent.

Regression test added: "treats lsof exit status >1 as inconclusive, not port-free — Codex P2 regression".

@efe-arv efe-arv force-pushed the fix/gateway-zombie-restart-loop branch from 815b408 to f94d2b9 Compare March 3, 2026 21:14
@efe-arv efe-arv force-pushed the fix/gateway-zombie-restart-loop branch from f94d2b9 to b0de0ac Compare March 3, 2026 22:32
@efe-arv
Copy link
Copy Markdown
Contributor Author

efe-arv commented Mar 3, 2026

CI update — commit b0de0acc1 (v5)

Windows test failures (our bug, now fixed): The process.platform win32 mock in the edge-case coverage test was leaking across the test suite. Object.getOwnPropertyDescriptor(process, "platform") returns undefined on Windows CI because platform is a prototype getter — so the if (origDescriptor) restore was skipped and process.platform stayed shadowed as an own property with value "win32" for all subsequent tests. Every downstream call to findGatewayPidsOnPortSync hit the early return and never reached spawnSync, causing 13/26 tests to fail. Fixed by delete process["platform"] in the else branch to drop the shadow and restore prototype lookup.

Bun unit test failure (src/plugins/commands.test.ts) — pre-existing, unrelated to this PR: The failing assertion (acceptsArgs: false field unexpected) is present on main as well (run 22643809542 on a8dd9ffe → CI: failure). This PR does not touch src/plugins/ at all.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0de0acc1c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +148 to +149
const pids = parsePidsFromLsofOutput(res.stdout);
return pids.length === 0 ? { free: true } : { free: false };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat any lsof listener as busy during port-free polling

pollPortOnce maps status === 0 to { free: true } when parsePidsFromLsofOutput yields no OpenClaw PIDs, but that parser intentionally filters out non-openclaw commands. If another process is listening on the port during restart (or the listener command name does not contain openclaw), the poll loop exits as “port free” immediately and cleanStaleGatewayProcessesSync can still hand off to restart while the socket is occupied, causing the same EADDRINUSE failure path this change is trying to prevent.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chatgpt-codex-connector This is an intentional design choice — pollPortOnce is not checking whether the port is free in absolute terms; it is checking whether the openclaw gateway process we just killed has released it.

The only callers of cleanStaleGatewayProcessesSync are paths that already identified and killed openclaw PIDs on this port. If lsof exits 0 and no openclaw commands appear, one of two things is true:

  1. The stale process has fully exited — its socket was removed by the kernel when the process died. No openclaw PIDs means our target is gone → { free: true } is the correct signal.
  2. An unrelated process holds port 18789 — we cannot kill it, waiting is pointless, and we should proceed so systemd can report the conflict explicitly via a clear EADDRINUSE log.

In both cases { free: true } is the right action. Returning { free: false } for case 2 would only burn the polling budget and then proceed anyway (the budget-exhausted warning path). This is documented on findGatewayPidsOnPortSync: "Returns only PIDs that belong to openclaw gateway processes."

Unlike isPortBusy in ports.ts (which answers "can a new process bind here?"), waitForPortFreeSync answers "is the process we killed still holding the socket?" Those are different questions with different correct semantics.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efe-arv efe-arv force-pushed the fix/gateway-zombie-restart-loop branch from b0de0ac to 9e17382 Compare March 3, 2026 23:04
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e1738207f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +136 to +138
if (res.status === 1) {
// lsof canonical "no matching processes" exit — port is genuinely free.
return { free: true };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat status-1 lsof results with stderr as inconclusive

This branch always maps res.status === 1 to free: true, but elsewhere in this repo we already handle lsof exit 1 with non-empty stderr (for example permission-denied output) as an error rather than a clean “no listeners” result (src/cli/ports.ts and src/infra/ports-inspect.ts). In restricted environments, this can cause waitForPortFreeSync to stop polling immediately and proceed to restart while port state is still unknown, reintroducing the EADDRINUSE race the patch is trying to prevent.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chatgpt-codex-connector The ports.ts path you reference throws EACCES to surface an explicit error to the interactive CLI user. pollPortOnce is a background polling loop — throwing would be wrong; we need to keep retrying or bail permanently.

The cases where lsof exits 1 with stderr are:

  • lsof: WARNING: can't stat() fuse... — filesystem-type detection warning, unrelated to TCP socket reporting
  • AppArmor/seccomp process-level permission warnings — affect /proc/PID/fd traversal, not /proc/net/tcp[6]

TCP LISTEN socket visibility (what -sTCP:LISTEN reads) comes from /proc/net/tcp which is world-readable in virtually all Linux container environments. lsof permission errors in stderr do not indicate missed TCP listener entries.

The actually dangerous container edge case — lsof exits 1 AND still has openclaw pids in stdout (partial AppArmor output) — is already handled: we parse stdout when non-empty and return { free: false } if openclaw pids appear. That test is in the suite (lsof status 1 with non-empty openclaw stdout is treated as busy, not free (Linux container edge case)).

Adding a stderr regex check to match ports.ts would add latency (polling the full 2s budget on environments where lsof always emits "can't stat() fuse" warnings for NFS/FUSE mounts — common in containers) with no real benefit since TCP socket data is unaffected by those warnings.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

…art loop

When cleanStaleGatewayProcessesSync() kills a stale gateway process,
the kernel may not immediately release the TCP port. Previously the
function returned after a fixed 500ms sleep (300ms SIGTERM + 200ms
SIGKILL), allowing triggerOpenClawRestart() to hand off to systemd
before the port was actually free. The new systemd process then raced
the dying socket for port 18789, hit EADDRINUSE, and exited with
status 1, causing systemd to retry indefinitely — the zombie restart
loop reported in openclaw#33103.

Fix: add waitForPortFreeSync() that polls lsof at 50ms intervals for
up to 2 seconds after SIGKILL. cleanStaleGatewayProcessesSync() now
blocks until the port is confirmed free (or the budget expires with a
warning) before returning. The increased SIGTERM/SIGKILL wait budgets
(600ms / 400ms) also give slow processes more time to exit cleanly.

Fixes openclaw#33103
Related: openclaw#28134
@efe-arv efe-arv force-pushed the fix/gateway-zombie-restart-loop branch from 9e17382 to 9c2ebe7 Compare March 4, 2026 00:11
@efe-arv
Copy link
Copy Markdown
Contributor Author

efe-arv commented Mar 4, 2026

v7 (9c2ebe778) — 3 additional fixes

1. src/plugins/commands.test.ts — updated snapshot to include acceptsArgs: false
This test was failing on both bun and Windows CI (pre-existing on main): getPluginCommandSpecs() was updated to include acceptsArgs in its return type, but the test expectation was not updated. Fixed.

2. parsePidsFromLsofOutput — deduplicate PIDs from dual-stack listeners
Dual-stack listeners (IPv4 + IPv6) cause lsof to emit the same PID twice in -Fpc output. Without dedup, terminateStaleProcessesSync sends SIGTERM twice to the same process and the returned killed array contains duplicates. Fixed with [...new Set(pids)] before the process.pid exclusion filter.

3. pollPortOnce — handle lsof status 1 with non-empty stdout
On Linux containers with restricted /proc (AppArmor, seccomp, user namespaces), lsof can exit 1 AND still emit output for processes it could read. Previously status === 1 unconditionally returned { free: true }. Now parses stdout when non-empty and returns { free: false } if openclaw pids are present, preventing a false-free in container environments.

28 tests passing (2 new: dedup regression, Linux container status-1 edge case). 434ms suite.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c2ebe7785

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// status === 0: lsof found listeners. Parse pids from the stdout we
// already hold — no second lsof spawn, no new failure surface.
const pids = parsePidsFromLsofOutput(res.stdout);
return pids.length === 0 ? { free: true } : { free: false };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat all lsof listeners as busy while polling

This status === 0 branch decides free/busy from parsePidsFromLsofOutput, but that parser intentionally filters to commands containing openclaw. If the port is held by another process (or lsof can’t expose the command name), pids.length becomes 0 and this returns { free: true }, so waitForPortFreeSync can stop early while the socket is still occupied and the subsequent restart can still hit EADDRINUSE.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chatgpt-codex-connector Same answer as the P2 above — by design. pollPortOnce answers "is the openclaw process we just killed still holding the socket?" not "is the port clear for a new bind?" Those have different correct semantics; see the reply on the sibling comment for the full rationale.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +143 to +147
if (res.stdout) {
const pids = parsePidsFromLsofOutput(res.stdout);
return pids.length === 0 ? { free: true } : { free: false };
}
return { free: true };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat lsof status 1 with stderr as inconclusive

When res.status === 1, this path returns { free: true } whenever stdout is empty, but it never checks whether lsof also emitted permission/runtime errors on stderr. In restricted environments that can produce status 1 with error text, this maps an inconclusive probe to “port free,” causing waitForPortFreeSync to exit immediately and allowing the restart path to race a still-bound port.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chatgpt-codex-connector Same rationale as the P2 above on 2880954508 — see that reply for full explanation. In short: TCP socket visibility is not affected by the file-level permission warnings that appear in lsof stderr; the openclaw-stdout container edge case (status 1 + openclaw pids in stdout) is already handled.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Audit findings fixed:
- remove dead guard: terminateStaleProcessesSync pids.length===0 check was
  unreachable (only caller cleanStaleGatewayProcessesSync already guards)
- expose __testing.callSleepSyncRaw so sleepSync's real Atomics.wait path
  can be unit-tested directly without going through the override
- fix broken sleepSync Atomics.wait test: previous test set override=null
  but cleanStaleGatewayProcessesSync returned before calling sleepSync —
  replaced with direct callSleepSyncRaw calls that actually exercise L36/L42-47
- fix pid collision: two tests used process.pid+304 (EPERM + dead-at-SIGTERM);
  EPERM test changed to process.pid+305
- fix misindented tests: 'deduplicates pids' and 'lsof status 1 container
  edge case' were outside their intended describe blocks; moved to correct
  scopes (findGatewayPidsOnPortSync and pollPortOnce respectively)
- add missing branch tests:
  - status 1 + non-empty stdout with zero openclaw pids → free:true (L145)
  - mid-loop non-openclaw cmd in &&-chain (L67)
  - consecutive p-lines without c-line between them (L67)
  - invalid PID in p-line (p0 / pNaN) — ternary false branch (L67)
  - unknown lsof output line (else-if false branch L69)

Coverage: 100% stmts / 100% branch / 100% funcs / 100% lines (36 tests)
@Takhoffman
Copy link
Copy Markdown
Contributor

Thanks again for this work.

This was landed indirectly via the synthesized PR #33831, and your contribution is credited in the changelog and as a co-author on the merge commit.

This PR merge/triage workflow is AI-assisted.

Closing this PR as superseded by #33831. If anything here looks incorrect or incomplete, reply to reopen and we can reassess.

@Takhoffman Takhoffman closed this Mar 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway restart infinite loop with zombie process on config change

3 participants