fix: add EADDRINUSE retry and TIME_WAIT port-bind checks for gateway startup#33555
fix: add EADDRINUSE retry and TIME_WAIT port-bind checks for gateway startup#33555rmfalco89 wants to merge 2 commits intoopenclaw:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 81cfa53c51
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| * - EACCES: bind to a privileged port as non-root. | ||
| * - EINVAL, etc.: other unrecoverable OS errors. | ||
| */ | ||
| export function probePortFree(port: number, host = "0.0.0.0"): Promise<boolean> { |
There was a problem hiding this comment.
Probe bindability on the actual gateway bind host
probePortFree hardcodes 0.0.0.0 as the default host, and waitForPortBindable/runGatewayCommand never pass the resolved gateway bind address, so gateway run --force can fail early with EADDRINUSE for sockets on other local interfaces even when the configured bind host (for example 127.0.0.1 or a custom IP) is actually free. This is a regression from the new preflight check because startup now exits before reaching listenGatewayHttpServer, which would have succeeded on the intended host.
Useful? React with 👍 / 👎.
Greptile SummaryThis PR introduces a two-layer defense against intermittent Key observations:
Confidence Score: 4/5
Last reviewed commit: 81cfa53 |
| describe("waitForPortBindable", () => { | ||
| it("propagates EACCES rejection immediately without retrying", async () => { | ||
| // Every call to createServer will emit EACCES — so if waitForPortBindable retried, | ||
| // mockCreateServer would be called many times. We assert it's called exactly once. | ||
| mockCreateServer.mockClear(); | ||
| mockCreateServer.mockReturnValue(makeErrServer("EACCES")); | ||
| await expect( | ||
| waitForPortBindable(80, { timeoutMs: 5000, intervalMs: 50 }), | ||
| ).rejects.toMatchObject({ code: "EACCES" }); | ||
| // Only one probe should have been attempted — no spinning through the retry loop. | ||
| expect(mockCreateServer).toHaveBeenCalledTimes(1); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
Missing test coverage for the core retry loop
The waitForPortBindable test suite only exercises the non-retryable EACCES short-circuit path. The primary purpose of the function — retrying when probePortFree returns false (EADDRINUSE / TIME_WAIT) and eventually succeeding after N probes — is not covered by any test. Neither is the timeout/throw scenario (port stays occupied throughout all retries).
Consider adding:
- A test where
mockCreateServerreturns EADDRINUSE for the first N calls and then succeeds, asserting the returnedwaitedvalue is> 0and that the probe was called the expected number of times. - A test where
mockCreateServeralways returns EADDRINUSE andtimeoutMsis small, asserting the function rejects with the expected error message.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/ports.test.ts
Line: 85-97
Comment:
**Missing test coverage for the core retry loop**
The `waitForPortBindable` test suite only exercises the non-retryable EACCES short-circuit path. The primary purpose of the function — retrying when `probePortFree` returns `false` (EADDRINUSE / TIME_WAIT) and eventually succeeding after N probes — is not covered by any test. Neither is the timeout/throw scenario (port stays occupied throughout all retries).
Consider adding:
1. A test where `mockCreateServer` returns EADDRINUSE for the first N calls and then succeeds, asserting the returned `waited` value is `> 0` and that the probe was called the expected number of times.
2. A test where `mockCreateServer` always returns EADDRINUSE and `timeoutMs` is small, asserting the function rejects with the expected error message.
How can I resolve this? If you propose a fix, please make it concise.| export async function waitForPortBindable( | ||
| port: number, | ||
| opts: { timeoutMs?: number; intervalMs?: number } = {}, | ||
| ): Promise<number> { | ||
| const timeoutMs = Math.max(opts.timeoutMs ?? 3000, 0); | ||
| const intervalMs = Math.max(opts.intervalMs ?? 150, 1); | ||
| let waited = 0; | ||
| while (waited < timeoutMs) { | ||
| if (await probePortFree(port)) { | ||
| return waited; | ||
| } | ||
| await sleep(intervalMs); | ||
| waited += intervalMs; | ||
| } | ||
| // Final attempt | ||
| if (await probePortFree(port)) { | ||
| return waited; | ||
| } | ||
| throw new Error(`port ${port} still not bindable after ${waited}ms (TIME_WAIT or kernel hold)`); | ||
| } |
There was a problem hiding this comment.
waitForPortBindable silently locks probe to 0.0.0.0
probePortFree accepts a host parameter, but waitForPortBindable doesn't expose one — every internal call uses the default "0.0.0.0". In run.ts, the resolved bind address (loopback, LAN IP, Tailscale interface, etc.) isn't yet available at the point waitForPortBindable is called, so the host can't be forwarded today without a refactor.
In practice this works correctly for IPv4 because binding 0.0.0.0:PORT overlaps with any specific IPv4 address, so a TIME_WAIT entry on 127.0.0.1:PORT will still cause an EADDRINUSE on 0.0.0.0:PORT. However, if a system uses an IPv6-only gateway bind (e.g., ::1), a TIME_WAIT on that address won't be detected by probing 0.0.0.0.
Adding an optional host parameter to waitForPortBindable (defaulting to "0.0.0.0") would make this future-safe and document the intentional choice explicitly:
export async function waitForPortBindable(
port: number,
opts: { timeoutMs?: number; intervalMs?: number; host?: string } = {},
): Promise<number> {
const host = opts.host ?? "0.0.0.0";
// ...
if (await probePortFree(port, host)) {Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/ports.ts
Line: 367-386
Comment:
**`waitForPortBindable` silently locks probe to `0.0.0.0`**
`probePortFree` accepts a `host` parameter, but `waitForPortBindable` doesn't expose one — every internal call uses the default `"0.0.0.0"`. In `run.ts`, the resolved bind address (loopback, LAN IP, Tailscale interface, etc.) isn't yet available at the point `waitForPortBindable` is called, so the host can't be forwarded today without a refactor.
In practice this works correctly for IPv4 because binding `0.0.0.0:PORT` overlaps with any specific IPv4 address, so a TIME_WAIT entry on `127.0.0.1:PORT` will still cause an EADDRINUSE on `0.0.0.0:PORT`. However, if a system uses an IPv6-only gateway bind (e.g., `::1`), a TIME_WAIT on that address won't be detected by probing `0.0.0.0`.
Adding an optional `host` parameter to `waitForPortBindable` (defaulting to `"0.0.0.0"`) would make this future-safe and document the intentional choice explicitly:
```typescript
export async function waitForPortBindable(
port: number,
opts: { timeoutMs?: number; intervalMs?: number; host?: string } = {},
): Promise<number> {
const host = opts.host ?? "0.0.0.0";
// ...
if (await probePortFree(port, host)) {
```
How can I resolve this? If you propose a fix, please make it concise.|
Thanks again for this work. This was landed indirectly via the synthesized PR #33831, and your contribution is credited in the changelog and as a co-author on the merge commit. This PR merge/triage workflow is AI-assisted. Closing this PR as superseded by #33831. If anything here looks incorrect or incomplete, reply to reopen and we can reassess. |
Problem
When a gateway process is killed and immediately restarted, the TCP port can be in
TIME_WAITstate: the kernel holds the socket in a 2-minute cleanup window even though the listening process is gone.lsof(andfuser) only report actively-listening sockets; they cannot seeTIME_WAITentries, so--forcewould succeed and report "no listeners" yet the subsequenthttpServer.listen()would still fail withEADDRINUSE.This caused intermittent gateway startup failures on restart in CI and on developer machines running hot-reload cycles.
Two-layer fix
Layer 1 — retry loop in
src/gateway/server/http-listen.tslistenGatewayHttpServernow retries up toEADDRINUSE_MAX_RETRIES(4) times atEADDRINUSE_RETRY_INTERVAL_MS(500 ms) intervals before throwingGatewayLockError. This handles the general case where any gateway restart hits a transiently-occupied port. Total retry budget: 2 seconds.Layer 2 — TCP bind probe in
src/cli/ports.ts(only with--force)Two new exports:
probePortFree(port, host)— creates anet.Server, binds it, closes it, and resolvestrueif that succeeded. Unlikelsof, a real bind attempt is rejected by the kernel if the port has aTIME_WAITentry.waitForPortBindable(port, opts)— pollsprobePortFreeup totimeoutMs(default 3 s) withintervalMs(default 150 ms) between probes.runGatewayCommandnow callswaitForPortBindableafterforceFreePortAndWaitwhen--forceis used.Test plan
pnpm build— clean buildpnpm check— lint/format passpnpm test— full test suite passesrun.option-collisions.test.tsupdated to mockwaitForPortBindableopenclaw gateway run --forceimmediately — gateway starts without EADDRINUSE error🤖 Generated with Claude Code