fix(restart): poll port free after SIGKILL to prevent EADDRINUSE restart loop #33410
efe-arv wants to merge 2 commits into openclaw:main
Greptile Summary
This PR fixes a post-SIGKILL EADDRINUSE restart loop by introducing Key Finding:
Confidence Score: 3/5
Last reviewed commit: 4343ce0
      const SPAWN_TIMEOUT_MS = 2000;
    - const STALE_SIGTERM_WAIT_MS = 300;
    - const STALE_SIGKILL_WAIT_MS = 200;
    + const STALE_SIGTERM_WAIT_MS = 600;
    + const STALE_SIGKILL_WAIT_MS = 400;
    + /**
    +  * After SIGKILL, the kernel may not release the TCP port immediately.
    +  * Poll until the port is confirmed free (or until the budget expires) before
    +  * returning control to the caller (typically `triggerOpenClawRestart` →
    +  * `systemctl restart`). Without this wait the new process races the dying
    +  * process for the port and systemd enters an EADDRINUSE restart loop.
    +  */
    + const PORT_FREE_POLL_INTERVAL_MS = 50;
    + const PORT_FREE_TIMEOUT_MS = 2000;
Path: src/infra/restart-stale-pids.ts, lines 6-17
`SPAWN_TIMEOUT_MS` (2000 ms) equals `PORT_FREE_TIMEOUT_MS` (2000 ms). If a single `lsof` call times out during polling on a slow/loaded system, `spawnSync` returns an error, `findGatewayPidsOnPortSync` returns `[]`, and `waitForPortFreeSync` immediately returns on lines 122-123, interpreting a timed-out query as "port is free".
This defeats the fix's purpose on the exact systems it's designed for. On a heavily loaded system, the first `lsof` call could consume ~2000 ms, time out, and exhaust the entire polling budget in a single iteration.
Consider raising `PORT_FREE_TIMEOUT_MS` significantly above `SPAWN_TIMEOUT_MS`, or use a shorter per-call timeout for poll invocations:
```typescript
const PORT_FREE_SPAWN_TIMEOUT_MS = 500; // per-lsof-call timeout during polling
const PORT_FREE_POLL_INTERVAL_MS = 50;
const PORT_FREE_TIMEOUT_MS = 2000; // overall polling budget
```
Alternatively, `findGatewayPidsOnPortSync` could distinguish "[] due to timeout" from "[] due to port free" so `waitForPortFreeSync` can skip the early-return path on error and continue polling.
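The budget collision in this comment comes down to simple arithmetic. As a rough sketch (the helper name `maxProbeStarts` is hypothetical, and it assumes the loop checks the deadline before each probe and that every probe runs into its timeout), the number of probes that can start inside the budget is:

```typescript
// Hypothetical helper: counts how many probe calls can *start* before the
// overall budget expires, assuming every call hits its per-call timeout and
// the loop sleeps for a fixed interval between calls.
function maxProbeStarts(
  budgetMs: number,
  perCallTimeoutMs: number,
  intervalMs: number,
): number {
  const cycleMs = perCallTimeoutMs + intervalMs;
  // Probe k starts at k * cycleMs; it is allowed while that is < budgetMs.
  return Math.ceil(budgetMs / cycleMs);
}

// Constants as reviewed: one slow call swallows the whole 2000 ms budget.
const reviewedStarts = maxProbeStarts(2000, 2000, 50);

// Constants suggested in the comment: a 500 ms per-call timeout leaves
// room for several independent probes.
const suggestedStarts = maxProbeStarts(2000, 500, 50);

console.log(reviewedStarts, suggestedStarts);
```

Under these assumptions the reviewed constants allow a single probe, which is why one timed-out call decides the outcome of the whole wait.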
@greptile-apps Thanks for the sharp catch — this was a real bug in v1 and the PR has been updated to address it directly.
What was wrong in v1:
SPAWN_TIMEOUT_MS == PORT_FREE_TIMEOUT_MS == 2000 ms. A single hung lsof call consumed the entire polling budget, findGatewayPidsOnPortSync collapsed the timeout into [], and waitForPortFreeSync exited on the empty-array path — incorrectly treating an inconclusive result as "port is free".
How v2 fixes it:
1. POLL_SPAWN_TIMEOUT_MS = 400 ms: a dedicated per-call spawn timeout used exclusively during polling, intentionally much shorter than SPAWN_TIMEOUT_MS = 2000 ms. At 400 ms per call, up to five independent lsof invocations fit within the 2-second polling budget, each producing a definitive result.
2. pollPortOnce(): three-state discriminated union:
    function pollPortOnce(port: number): { free: true } | { free: false } | { free: null }
- { free: true }: lsof succeeded, no listeners found (port confirmed free)
- { free: false }: lsof succeeded, listeners present (port still bound)
- { free: null }: lsof timed out or errored (result inconclusive)
This directly addresses your alternative suggestion: findGatewayPidsOnPortSync no longer needs to distinguish timeout from empty; pollPortOnce does it at the call site by inspecting res.error before parsing output.
3. waitForPortFreeSync exit semantics:
- free: true → return normally (port confirmed free)
- free: null → bail immediately (lsof unavailable, graceful degradation; same worst case as before the fix)
- free: false → sleep 50 ms and retry
A timed-out lsof call now produces { free: null } and never reaches the free: true return path.
4. Regression test explicitly covering this case:
The test suite includes "does not treat a timed-out lsof poll as port-free" which simulates every poll returning a spawnSync error and verifies the loop bails via the inconclusive path rather than falsely reporting the port as free. 12/12 tests pass.
Force-pushed from 4343ce0 to 24a7065.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 24a7065720
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/infra/restart-stale-pids.ts (outdated)
    const pids = findGatewayPidsOnPortSync(port, POLL_SPAWN_TIMEOUT_MS);
    return pids.length === 0 ? { free: true } : { free: false };
Don't treat failed second lsof probe as a free port
pollPortOnce already has a successful lsof result (status === 0 means something is listening), but it then calls findGatewayPidsOnPortSync and converts that second probe to a boolean free/busy check. Because findGatewayPidsOnPortSync returns [] for any spawn error/timeout, a transient failure on the second call is misclassified as free: true, so waitForPortFreeSync can exit early and triggerOpenClawRestart() can still race the old socket and hit EADDRINUSE under load. This should preserve an inconclusive state instead of mapping second-probe errors to “port free”.
@chatgpt-codex-connector Good catch — this was a real bug and it is now fixed in v3 (commit 815b408).
What was wrong:
pollPortOnce called findGatewayPidsOnPortSync(port, POLL_SPAWN_TIMEOUT_MS) as a second lsof probe after the first spawnSync returned status === 0. Since findGatewayPidsOnPortSync collapses any spawn error or timeout into [], a transient failure on that second call produced { free: true } — incorrectly reporting the port as free without a verified probe result. This was the same class of bug Greptile flagged (inconclusive result misclassified as confirmed-free), surfaced through a different code path.
The fix:
The pid parsing logic has been extracted into a pure, zero-I/O helper parsePidsFromLsofOutput(stdout: string). pollPortOnce now calls parsePidsFromLsofOutput(res.stdout) directly on the output it already holds from the single spawnSync invocation — no second lsof spawn, no second failure surface.
    // Before (v2): second lsof spawn, error collapses to free:true
    const pids = findGatewayPidsOnPortSync(port, POLL_SPAWN_TIMEOUT_MS);
    return pids.length === 0 ? { free: true } : { free: false };

    // After (v3): parse inline from existing res.stdout, zero second spawn
    const pids = parsePidsFromLsofOutput(res.stdout);
    return pids.length === 0 ? { free: true } : { free: false };

parsePidsFromLsofOutput is a pure function with no I/O: it cannot time out, error, or return a false empty result. The three-state semantics of pollPortOnce are now structurally enforced: the only path to { free: true } from a status === 0 lsof result is through parsing the stdout we already verified succeeded.
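To illustrate what a pure, zero-I/O parser of this kind could look like, here is a minimal sketch. It assumes lsof field output in which each process appears as a `p<pid>` line followed by a `c<command>` line; the function name `parseGatewayPidsSketch`, the `selfPid` parameter, and the exact filtering rules are assumptions for illustration, not the PR's actual helper.

```typescript
// Sketch of a pure parser for lsof field output ("p<pid>" then "c<command>").
// No process is spawned here, so the function cannot time out or fail on I/O.
function parseGatewayPidsSketch(stdout: string, selfPid: number): number[] {
  const pids = new Set<number>();
  let currentPid: number | undefined;
  for (const line of stdout.split("\n")) {
    if (line.startsWith("p")) {
      // Invalid or zero pids (for example "p0" or "pNaN") are discarded.
      const pid = Number.parseInt(line.slice(1), 10);
      currentPid = pid > 0 ? pid : undefined;
    } else if (line.startsWith("c")) {
      const command = line.slice(1);
      const isGateway = command.includes("openclaw");
      if (currentPid !== undefined && currentPid !== selfPid && isGateway) {
        pids.add(currentPid); // the Set deduplicates repeated entries
      }
      currentPid = undefined;
    }
    // Any other field line is ignored.
  }
  return Array.from(pids);
}

const sampleOutput = [
  "p101", "copenclaw-gateway",
  "p202", "cnginx",
  "p101", "copenclaw-gateway",
  "p303", "copenclaw",
].join("\n");

// 303 plays the role of the current process and is excluded.
const gatewayPids = parseGatewayPidsSketch(sampleOutput, 303);
console.log(gatewayPids);
```

Because the input is a plain string, a parser like this can be unit-tested with literal fixtures instead of mocked process spawns.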
New regression test added:
"does not make a second lsof call when the first returns status 0" — asserts that across an initial find + one busy poll + one free poll, exactly 3 spawnSync calls are made total. If pollPortOnce were still making a second probe, the count would be 5. 13/13 tests pass.
To use Codex here, create a Codex account and connect to github.
Fixed in commit f94d2b90a (v4) — both issues addressed:
P1 (bail on first null) — fixed:
waitForPortFreeSync now distinguishes permanent from transient inconclusive results. pollPortOnce returns a 4-state discriminated union: { free: true }, { free: false }, { free: null, permanent: false } (transient — keep retrying), or { free: null, permanent: true } (ENOENT/EACCES/EPERM — lsof permanently unavailable, bail). Only permanent: true exits the loop early. A single slow or errored lsof call no longer aborts the polling budget.
P2 (status !== 0 → free: true) — fixed:
pollPortOnce now maps lsof exit codes precisely:
- status === 0 → lsof found listeners, parse pids
- status === 1 → lsof canonical "no matching processes" exit → { free: true }
- status > 1 → runtime/permission/flag error → { free: null, permanent: false } (inconclusive, keep polling)
Mapping status > 1 to free: true was the exact bug: a permission denied or flag error would have let restart race a still-bound socket.
Tests added (26 total, 92.6% stmt / 90.7% branch / 100% fn coverage):
- status === 1 → port-free
- status > 1 → inconclusive, not free (P2 regression)
- transient error (ETIMEDOUT) → loop continues (P1 regression)
- ENOENT/EACCES/EPERM → permanent bail
- dateNowOverride in __testing replaces the 2s real-clock budget test with a fake-time test (435ms total suite)
- Atomics.wait throw → busy-wait fallback
- win32 platform guard
- pollPortOnce outer catch (resolveLsofCommandSync throws)
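For readers unfamiliar with the Atomics.wait sleep and its busy-wait fallback mentioned in the test list, a minimal sketch follows. The name `sleepSyncSketch` is hypothetical and the structure is an assumption; the PR's own sleepSync may differ.

```typescript
// Hypothetical synchronous sleep: blocks the calling thread for at least `ms`.
function sleepSyncSketch(ms: number): void {
  const deadline = Date.now() + ms;
  try {
    // Nothing ever notifies this private buffer, so Atomics.wait simply
    // blocks until the timeout elapses.
    const cell = new Int32Array(new SharedArrayBuffer(4));
    Atomics.wait(cell, 0, 0, ms);
  } catch {
    // Atomics.wait can throw where blocking is not permitted;
    // the loop below then covers the whole interval.
  }
  // Busy-wait for whatever time remains (normally none).
  while (Date.now() < deadline) {
    // spin
  }
}

const sleepStartedAt = Date.now();
sleepSyncSketch(20);
const sleptMs = Date.now() - sleepStartedAt;
console.log(sleptMs);
```

The trailing loop doubles as the fallback and as a guard against an early return, so callers can rely on the minimum delay in either case.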
Force-pushed from 24a7065 to 815b408.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 815b40827f
src/infra/restart-stale-pids.ts (outdated)
    if (result.free === null) {
      // lsof is unavailable or errored; cannot determine port state, bail out
      return;
Keep polling after inconclusive lsof probe results
This branch exits the wait loop on the first free: null, but pollPortOnce() returns free: null for any spawnSync error (including a timeout). That means one transient slow lsof call under load immediately skips the remaining 2s polling budget and allows restart to proceed while the old listener may still hold the port, recreating the EADDRINUSE race this patch is meant to prevent.
Fixed in f94d2b90a (v4).
pollPortOnce now returns a 4-state discriminated union:
    type PollResult =
      | { free: true }
      | { free: false }
      | { free: null; permanent: false } // transient: keep retrying
      | { free: null; permanent: true } // ENOENT/EACCES/EPERM: bail

waitForPortFreeSync only bails on permanent: true. A transient spawnSync error (timeout, signal, anything without ENOENT/EACCES/EPERM) returns { free: null, permanent: false } and the loop sleeps and retries until the wall-clock budget expires. A single slow lsof call no longer aborts the polling window.
Regression test added: "continues polling on transient lsof errors (not ENOENT) — Codex P1 fix" — asserts the loop reaches a subsequent free poll after a transient error.
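The loop semantics described in this reply can be sketched with injected dependencies so the behaviour is checkable without lsof or a real clock. Everything here is illustrative: `waitForPortFreeSketch`, `WaitOutcome`, and the injected `poll`/`sleep`/`now` parameters are assumptions, and the two `permanent` variants are collapsed into one boolean field for brevity.

```typescript
type PollResult =
  | { free: true }
  | { free: false }
  | { free: null; permanent: boolean };

type WaitOutcome = "free" | "unavailable" | "timed-out";

function waitForPortFreeSketch(
  poll: () => PollResult,
  sleep: (ms: number) => void,
  now: () => number,
  budgetMs: number = 2000,
  intervalMs: number = 50,
): WaitOutcome {
  const deadline = now() + budgetMs;
  while (now() < deadline) {
    const result = poll();
    if (result.free === true) return "free";
    // Only a permanent failure (probe tool unusable) ends the wait early.
    if ("permanent" in result && result.permanent) return "unavailable";
    // Busy or transiently inconclusive: wait, then probe again.
    sleep(intervalMs);
  }
  return "timed-out";
}

// Demo with a fake clock: busy, then a transient error, then free.
let fakeNow = 0;
const scripted: PollResult[] = [
  { free: false },
  { free: null, permanent: false },
  { free: true },
];
let pollCount = 0;
const demoOutcome = waitForPortFreeSketch(
  () => scripted[pollCount++] ?? { free: true },
  (ms) => { fakeNow += ms; },
  () => fakeNow,
);
console.log(demoOutcome, pollCount, fakeNow);
```

Injecting the clock and sleep mirrors the fake-time testing approach mentioned elsewhere in the thread and keeps the sketch deterministic.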
src/infra/restart-stale-pids.ts (outdated)
    if (res.status !== 0) {
      // lsof exited non-zero with no error object: no listeners found (status 1 is normal)
      return { free: true };
Treat non-zero lsof exits as inconclusive, not free
This maps every non-zero lsof exit status to free: true, but spawnSync only populates error for spawn failures; command-level failures still come through as non-zero status values. If lsof exits non-zero due to runtime/permission/flag issues, this path incorrectly reports the port as free and can let restart race a still-bound socket.
Fixed in f94d2b90a (v4).
pollPortOnce now maps lsof exit codes precisely rather than treating all non-zero exits as port-free:
    if (res.status === 1) {
      // lsof canonical "no matching processes" exit: port is genuinely free
      return { free: true };
    }
    if (res.status !== 0) {
      // status > 1: runtime/permission/flag error, inconclusive, keep polling
      return { free: null, permanent: false };
    }

status === 1 is lsof's documented "no matching processes found" exit, the only non-zero exit that reliably means port-free. Any other non-zero exit (permission denied, bad flag, runtime error) returns { free: null, permanent: false } and the loop retries. Mapping those to free: true was the exact race this patch is supposed to prevent.
Regression test added: "treats lsof exit status >1 as inconclusive, not port-free — Codex P2 regression".
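Putting the spawn-error and exit-status rules from this thread side by side, a self-contained classification sketch might look like the following. The names `classifyProbe`, `ProbeOutput`, and `PERMANENT_CODES` are hypothetical, the pid parser is injected rather than implemented, and the later status-1-with-stdout refinement is deliberately left out.

```typescript
type PollResult =
  | { free: true }
  | { free: false }
  | { free: null; permanent: boolean };

// Minimal shape of the fields this sketch reads from a spawnSync-style result.
interface ProbeOutput {
  error?: { code?: string };
  status: number | null;
  stdout: string;
}

// Error codes the thread treats as "probe permanently unavailable".
const PERMANENT_CODES: readonly string[] = ["ENOENT", "EACCES", "EPERM"];

function classifyProbe(
  res: ProbeOutput,
  parsePids: (stdout: string) => number[],
): PollResult {
  if (res.error) {
    // Spawn-level failure: permanent only for the listed codes.
    const code = res.error.code ?? "";
    return { free: null, permanent: PERMANENT_CODES.includes(code) };
  }
  if (res.status === 1) return { free: true }; // no matching processes
  if (res.status !== 0) return { free: null, permanent: false }; // inconclusive
  // status 0: decide from the output already in hand, no second probe.
  return parsePids(res.stdout).length === 0 ? { free: true } : { free: false };
}

// Stand-in parser for the demo: any non-empty output counts as one pid.
const demoParse = (stdout: string): number[] =>
  stdout.trim() === "" ? [] : [1234];

const classified = [
  classifyProbe({ status: 1, stdout: "" }, demoParse),
  classifyProbe({ status: 2, stdout: "" }, demoParse),
  classifyProbe({ status: null, stdout: "", error: { code: "ETIMEDOUT" } }, demoParse),
  classifyProbe({ status: null, stdout: "", error: { code: "ENOENT" } }, demoParse),
  classifyProbe({ status: 0, stdout: "p1234" }, demoParse),
].map((r) => JSON.stringify(r));
console.log(classified.join("\n"));
```

Keeping the classification separate from the process spawn makes each branch testable with plain object fixtures.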
Force-pushed from 815b408 to f94d2b9.
Force-pushed from f94d2b9 to b0de0ac.
|
CI update — commit Windows test failures (our bug, now fixed): The Bun unit test failure ( |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b0de0acc1c
    const pids = parsePidsFromLsofOutput(res.stdout);
    return pids.length === 0 ? { free: true } : { free: false };
Treat any lsof listener as busy during port-free polling
pollPortOnce maps status === 0 to { free: true } when parsePidsFromLsofOutput yields no OpenClaw PIDs, but that parser intentionally filters out non-openclaw commands. If another process is listening on the port during restart (or the listener command name does not contain openclaw), the poll loop exits as “port free” immediately and cleanStaleGatewayProcessesSync can still hand off to restart while the socket is occupied, causing the same EADDRINUSE failure path this change is trying to prevent.
@chatgpt-codex-connector This is an intentional design choice — pollPortOnce is not checking whether the port is free in absolute terms; it is checking whether the openclaw gateway process we just killed has released it.
The only callers of cleanStaleGatewayProcessesSync are paths that already identified and killed openclaw PIDs on this port. If lsof exits 0 and no openclaw commands appear, one of two things is true:
- The stale process has fully exited: its socket was removed by the kernel when the process died. No openclaw PIDs means our target is gone → { free: true } is the correct signal.
- An unrelated process holds port 18789: we cannot kill it, waiting is pointless, and we should proceed so systemd can report the conflict explicitly via a clear EADDRINUSE log.
In both cases { free: true } is the right action. Returning { free: false } for case 2 would only burn the polling budget and then proceed anyway (the budget-exhausted warning path). This is documented on findGatewayPidsOnPortSync: "Returns only PIDs that belong to openclaw gateway processes."
Unlike isPortBusy in ports.ts (which answers "can a new process bind here?"), waitForPortFreeSync answers "is the process we killed still holding the socket?" Those are different questions with different correct semantics.
Force-pushed from b0de0ac to 9e17382.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e1738207f
    if (res.status === 1) {
      // lsof canonical "no matching processes" exit: port is genuinely free.
      return { free: true };
Treat status-1 lsof results with stderr as inconclusive
This branch always maps res.status === 1 to free: true, but elsewhere in this repo we already handle lsof exit 1 with non-empty stderr (for example permission-denied output) as an error rather than a clean “no listeners” result (src/cli/ports.ts and src/infra/ports-inspect.ts). In restricted environments, this can cause waitForPortFreeSync to stop polling immediately and proceed to restart while port state is still unknown, reintroducing the EADDRINUSE race the patch is trying to prevent.
@chatgpt-codex-connector The ports.ts path you reference throws EACCES to surface an explicit error to the interactive CLI user. pollPortOnce is a background polling loop — throwing would be wrong; we need to keep retrying or bail permanently.
The cases where lsof exits 1 with stderr are:
- lsof: WARNING: can't stat() fuse... (filesystem-type detection warning, unrelated to TCP socket reporting)
- AppArmor/seccomp process-level permission warnings (these affect /proc/PID/fd traversal, not /proc/net/tcp[6])
TCP LISTEN socket visibility (what -sTCP:LISTEN reads) comes from /proc/net/tcp which is world-readable in virtually all Linux container environments. lsof permission errors in stderr do not indicate missed TCP listener entries.
The actually dangerous container edge case — lsof exits 1 AND still has openclaw pids in stdout (partial AppArmor output) — is already handled: we parse stdout when non-empty and return { free: false } if openclaw pids appear. That test is in the suite (lsof status 1 with non-empty openclaw stdout is treated as busy, not free (Linux container edge case)).
Adding a stderr regex check to match ports.ts would add latency (polling the full 2s budget on environments where lsof always emits "can't stat() fuse" warnings for NFS/FUSE mounts — common in containers) with no real benefit since TCP socket data is unaffected by those warnings.
…art loop

When cleanStaleGatewayProcessesSync() kills a stale gateway process, the kernel may not immediately release the TCP port. Previously the function returned after a fixed 500ms sleep (300ms SIGTERM + 200ms SIGKILL), allowing triggerOpenClawRestart() to hand off to systemd before the port was actually free. The new systemd process then raced the dying socket for port 18789, hit EADDRINUSE, and exited with status 1, causing systemd to retry indefinitely: the zombie restart loop reported in openclaw#33103.

Fix: add waitForPortFreeSync() that polls lsof at 50ms intervals for up to 2 seconds after SIGKILL. cleanStaleGatewayProcessesSync() now blocks until the port is confirmed free (or the budget expires with a warning) before returning. The increased SIGTERM/SIGKILL wait budgets (600ms / 400ms) also give slow processes more time to exit cleanly.

Fixes openclaw#33103
Related: openclaw#28134
Force-pushed from 9e17382 to 9c2ebe7.
|
v7 ( 1. 2. 3. 28 tests passing (2 new: dedup regression, Linux container status-1 edge case). 434ms suite. |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9c2ebe7785
    // status === 0: lsof found listeners. Parse pids from the stdout we
    // already hold: no second lsof spawn, no new failure surface.
    const pids = parsePidsFromLsofOutput(res.stdout);
    return pids.length === 0 ? { free: true } : { free: false };
Treat all lsof listeners as busy while polling
This status === 0 branch decides free/busy from parsePidsFromLsofOutput, but that parser intentionally filters to commands containing openclaw. If the port is held by another process (or lsof can’t expose the command name), pids.length becomes 0 and this returns { free: true }, so waitForPortFreeSync can stop early while the socket is still occupied and the subsequent restart can still hit EADDRINUSE.
@chatgpt-codex-connector Same answer as the P2 above — by design. pollPortOnce answers "is the openclaw process we just killed still holding the socket?" not "is the port clear for a new bind?" Those have different correct semantics; see the reply on the sibling comment for the full rationale.
    if (res.stdout) {
      const pids = parsePidsFromLsofOutput(res.stdout);
      return pids.length === 0 ? { free: true } : { free: false };
    }
    return { free: true };
Treat lsof status 1 with stderr as inconclusive
When res.status === 1, this path returns { free: true } whenever stdout is empty, but it never checks whether lsof also emitted permission/runtime errors on stderr. In restricted environments that can produce status 1 with error text, this maps an inconclusive probe to “port free,” causing waitForPortFreeSync to exit immediately and allowing the restart path to race a still-bound port.
@chatgpt-codex-connector Same rationale as the P2 above on 2880954508 — see that reply for full explanation. In short: TCP socket visibility is not affected by the file-level permission warnings that appear in lsof stderr; the openclaw-stdout container edge case (status 1 + openclaw pids in stdout) is already handled.
Audit findings fixed:
- remove dead guard: terminateStaleProcessesSync pids.length===0 check was unreachable (only caller cleanStaleGatewayProcessesSync already guards)
- expose __testing.callSleepSyncRaw so sleepSync's real Atomics.wait path can be unit-tested directly without going through the override
- fix broken sleepSync Atomics.wait test: previous test set override=null but cleanStaleGatewayProcessesSync returned before calling sleepSync; replaced with direct callSleepSyncRaw calls that actually exercise L36/L42-47
- fix pid collision: two tests used process.pid+304 (EPERM + dead-at-SIGTERM); EPERM test changed to process.pid+305
- fix misindented tests: 'deduplicates pids' and 'lsof status 1 container edge case' were outside their intended describe blocks; moved to correct scopes (findGatewayPidsOnPortSync and pollPortOnce respectively)
- add missing branch tests:
  - status 1 + non-empty stdout with zero openclaw pids → free:true (L145)
  - mid-loop non-openclaw cmd in &&-chain (L67)
  - consecutive p-lines without c-line between them (L67)
  - invalid PID in p-line (p0 / pNaN): ternary false branch (L67)
  - unknown lsof output line (else-if false branch L69)

Coverage: 100% stmts / 100% branch / 100% funcs / 100% lines (36 tests)
Thanks again for this work. This was landed indirectly via the synthesized PR #33831, and your contribution is credited in the changelog and as a co-author on the merge commit. This PR merge/triage workflow is AI-assisted. Closing this PR as superseded by #33831. If anything here looks incorrect or incomplete, reply to reopen and we can reassess.
fix(restart): poll port free after SIGKILL to prevent EADDRINUSE restart loop
Fixes #33103 · Related: #28134
Background
When OpenClaw's gateway is restarted via triggerOpenClawRestart() (the path exercised by the gateway tool, /restart, config-change webhooks, and SIGUSR1 on Linux), the following sequence occurs:
1. cleanStaleGatewayProcessesSync() locates any existing gateway process listening on the configured port, sends SIGTERM, waits, escalates to SIGKILL if necessary, then returns.
2. triggerOpenClawRestart() immediately invokes systemctl --user restart openclaw-gateway.
3. The new process calls bind(2) on the configured port.
The bug lives in the transition between steps 1 and 2.
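Step 1 of the sequence above can be outlined as follows. This is an illustrative sketch, not the project's implementation: `terminateStaleSketch` and `TerminateDeps` are hypothetical names, and the kill, liveness, and sleep operations are injected so the flow can be followed without touching real processes.

```typescript
type StopSignal = "SIGTERM" | "SIGKILL";

interface TerminateDeps {
  kill: (pid: number, signal: StopSignal) => void;
  isAlive: (pid: number) => boolean;
  sleep: (ms: number) => void;
}

// Sends SIGTERM to every pid, waits a fixed time, then escalates to SIGKILL
// for any pid that is still alive. Returns the pids that needed SIGKILL.
function terminateStaleSketch(
  pids: number[],
  deps: TerminateDeps,
  termWaitMs: number,
  killWaitMs: number,
): number[] {
  for (const pid of pids) deps.kill(pid, "SIGTERM");
  deps.sleep(termWaitMs);
  const survivors = pids.filter((pid) => deps.isAlive(pid));
  for (const pid of survivors) deps.kill(pid, "SIGKILL");
  if (survivors.length > 0) deps.sleep(killWaitMs);
  // Note: nothing here checks whether the port has actually been released.
  return survivors;
}

// Demo: pid 22 ignores SIGTERM and has to be killed.
const signalLog: string[] = [];
const sleepLog: number[] = [];
const escalated = terminateStaleSketch(
  [11, 22],
  {
    kill: (pid, signal) => { signalLog.push(pid + ":" + signal); },
    isAlive: (pid) => pid === 22,
    sleep: (ms) => { sleepLog.push(ms); },
  },
  300,
  200,
);
console.log(signalLog, sleepLog, escalated);
```

In Node.js, a liveness check like `isAlive` is commonly backed by `process.kill(pid, 0)`, which throws if the process does not exist. The sketch returns after fixed waits only, which is the gap the PR is concerned with.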
Root Cause
TCP socket lifecycle after SIGKILL
When a process is killed with SIGKILL, the kernel schedules socket cleanup but does not guarantee it completes before process.kill() returns to the caller. Sockets in states such as TIME_WAIT, FIN_WAIT_2, or mid-drain may remain bound to the port for a non-deterministic window after the owning process has been removed from the process table.
The prior implementation of cleanStaleGatewayProcessesSync() used two fixed sleeps:
- STALE_SIGTERM_WAIT_MS = 300 ms: time allowed for graceful exit after SIGTERM
- STALE_SIGKILL_WAIT_MS = 200 ms: time allowed after SIGKILL before returning
Total maximum wait: 500 ms.
On a loaded system, or when the gateway holds active streaming connections, 500 ms is insufficient for the kernel to complete socket teardown. cleanStaleGatewayProcessesSync() returns with the port still bound. systemctl restart fires immediately, the new process calls bind(2) on port 18789, receives EADDRINUSE, and exits with status 1. systemd schedules another restart, which fails identically, producing an unbounded restart loop. The only recovery was a manual kill -9 of the zombie PID followed by a clean systemctl start.
Why the --force path did not prevent this
The gateway CLI exposes a --force flag that calls forceFreePortAndWait() from ports.ts, which correctly polls the port via lsof until it is confirmed free. However, this function is only exercised on interactive openclaw gateway start --force invocations. The triggerOpenClawRestart() → systemctl restart path does not call forceFreePortAndWait(). The two code paths were not unified, leaving the programmatic restart path without port-free verification.
Fix
src/infra/restart-stale-pids.ts
Three additions are made:
1. pollPortOnce(port): a three-state lsof poll
A new internal function returns a discriminated union: { free: true } (port confirmed free), { free: false } (port still bound), or { free: null } (lsof timed out or errored, result inconclusive). This three-state result is critical: the prior approach called findGatewayPidsOnPortSync() directly, which collapses both "no listeners" and "lsof error" into an empty array, making it impossible to distinguish a confirmed-free port from a failed probe.
2. waitForPortFreeSync(port): polling loop with correct exit semantics
Polls pollPortOnce() at 50 ms intervals for up to 2 seconds. The loop exits on free: true (success) or free: null (lsof unavailable, graceful degradation). On free: false, it sleeps and retries. If the deadline is reached without a conclusive free result, it logs a warning and proceeds.
3. POLL_SPAWN_TIMEOUT_MS = 400 ms: distinct spawn timeout for polling
This addresses the timeout-budget collision identified in review: if the per-call spawn timeout equals PORT_FREE_TIMEOUT_MS (as the default SPAWN_TIMEOUT_MS does, both 2000 ms), a single hung lsof call consumes the entire polling budget, causing waitForPortFreeSync() to exit on the free: null path while the port remains bound, silently disabling the fix under precisely the heavy-load conditions it targets. Setting POLL_SPAWN_TIMEOUT_MS = 400 ms ensures up to five independent lsof invocations fit within the 2-second polling budget, each with a definitive success or error outcome.
Design rationale (synchronous polling): cleanStaleGatewayProcessesSync() is called from triggerOpenClawRestart() immediately before a blocking spawnSync("systemctl", [...]). Converting the call chain to async is out of scope; the synchronous Atomics.wait sleep primitive already present in this file makes a synchronous poll straightforward and correct.
Design rationale (lsof over kill(pid, 0)): After SIGKILL, the target process is removed from the process table almost immediately, but the bound socket can persist. Polling process liveness via kill(pid, 0) would return "dead" while the port remains bound. lsof queries the actual kernel socket state, which is the correct predicate.
Changes
- src/infra/restart-stale-pids.ts: pollPortOnce(), waitForPortFreeSync(), POLL_SPAWN_TIMEOUT_MS; increased SIGTERM/SIGKILL wait budgets; optional spawnTimeoutMs param on findGatewayPidsOnPortSync; wired waitForPortFreeSync() into cleanStaleGatewayProcessesSync()
- src/infra/restart-stale-pids.test.ts: findGatewayPidsOnPortSync, cleanStaleGatewayProcessesSync, the #33103 regression, and the lsof-timeout edge case identified in review
Test Suite
All tests execute with a synchronous sleep shim (__testing.setSleepSyncOverride) to eliminate real-time waits, and with mocked spawnSync and resolveGatewayPort to remove dependencies on root privileges, lsof availability, or a live gateway process. 12/12 tests pass on vitest v4.0.18.
findGatewayPidsOnPortSync
- [] when lsof exits non-zero
- [] when lsof returns an error object
- ENOENT (lsof not installed) and permission-denied scenarios
- includes("openclaw") command filter and the pid !== process.pid self-exclusion guard
- spawnTimeoutMs to spawnSync
- POLL_SPAWN_TIMEOUT_MS (400 ms) rather than the default 2000 ms
cleanStaleGatewayProcessesSync
- [] and does not call kill when port has no listeners
- SIGTERM to stale pids and returns them
- SIGKILL when process survives the SIGTERM window
- SIGKILL escalation
- cleanStaleGatewayProcessesSync continues polling after kill until the port is confirmed free, not a fixed sleep
- resolveGatewayPort throws
- ENOENT on lsof must degrade silently
Manual Verification
Confirmed on Ubuntu 24.04.2 LTS, systemd 255, OpenClaw 2026.3.2, port 18789:
gateway tool with active streaming connections reproducibly caused the restart loop (systemd restart counter reaching 10+ within 30 seconds, port never released).