Skip to content

fix: add EADDRINUSE retry and TIME_WAIT port-bind checks for gateway startup#33555

Closed
rmfalco89 wants to merge 2 commits intoopenclaw:mainfrom
rmfalco89:fix/gateway-port-binding-v2
Closed

fix: add EADDRINUSE retry and TIME_WAIT port-bind checks for gateway startup#33555
rmfalco89 wants to merge 2 commits intoopenclaw:mainfrom
rmfalco89:fix/gateway-port-binding-v2

Conversation

@rmfalco89
Copy link
Copy Markdown

Problem

When a gateway process is killed and immediately restarted, the TCP port can be in TIME_WAIT state: the kernel holds the socket in a 2-minute cleanup window even though the listening process is gone. lsof (and fuser) only report actively-listening sockets; they cannot see TIME_WAIT entries, so --force would succeed and report "no listeners" yet the subsequent httpServer.listen() would still fail with EADDRINUSE.

This caused intermittent gateway startup failures on restart in CI and on developer machines running hot-reload cycles.

Two-layer fix

Layer 1 — retry loop in src/gateway/server/http-listen.ts

listenGatewayHttpServer now retries up to EADDRINUSE_MAX_RETRIES (4) times at EADDRINUSE_RETRY_INTERVAL_MS (500 ms) intervals before throwing GatewayLockError. This handles the general case where any gateway restart hits a transiently-occupied port. Total retry budget: 2 seconds.

Layer 2 — TCP bind probe in src/cli/ports.ts (only with --force)

Two new exports:

  • probePortFree(port, host) — creates a net.Server, binds it, closes it, and resolves true if that succeeded. Unlike lsof, a real bind attempt is rejected by the kernel if the port has a TIME_WAIT entry.
  • waitForPortBindable(port, opts) — polls probePortFree up to timeoutMs (default 3 s) with intervalMs (default 150 ms) between probes.

runGatewayCommand now calls waitForPortBindable after forceFreePortAndWait when --force is used.

Test plan

  • pnpm build — clean build
  • pnpm check — lint/format pass
  • pnpm test — full test suite passes
  • run.option-collisions.test.ts updated to mock waitForPortBindable
  • Manual: stop gateway, kill process, run openclaw gateway run --force immediately — gateway starts without EADDRINUSE error

🤖 Generated with Claude Code

@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime cli CLI command changes size: M labels Mar 3, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81cfa53c51

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

* - EACCES: bind to a privileged port as non-root.
* - EINVAL, etc.: other unrecoverable OS errors.
*/
export function probePortFree(port: number, host = "0.0.0.0"): Promise<boolean> {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Probe bindability on the actual gateway bind host

probePortFree hardcodes 0.0.0.0 as the default host, and waitForPortBindable/runGatewayCommand never pass the resolved gateway bind address, so gateway run --force can fail early with EADDRINUSE for sockets on other local interfaces even when the configured bind host (for example 127.0.0.1 or a custom IP) is actually free. This is a regression from the new preflight check because startup now exits before reaching listenGatewayHttpServer, which would have succeeded on the intended host.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR introduces a two-layer defense against intermittent EADDRINUSE failures during gateway restarts caused by TCP TIME_WAIT kernel holds. Layer 1 adds a bounded retry loop (4 × 500 ms) in listenGatewayHttpServer; Layer 2 adds a real TCP bind probe (probePortFree / waitForPortBindable) in the --force path to verify the port is truly bindable before handing off to the server. The approach is technically sound — a real bind is the only reliable way to detect TIME_WAIT sockets that lsof cannot see.

Key observations:

  • The retry logic in http-listen.ts is correct: after EADDRINUSE, Node.js resets the server's internal handle to null, making re-calling listen() on the same http.Server instance safe.
  • waitForPortBindable is called unconditionally inside the --force block (even when no process was killed) — this is the right behavior since TIME_WAIT sockets are invisible to lsof.
  • waitForPortBindable always probes 0.0.0.0, silently ignoring the actual bind address. This works correctly for all IPv4 configurations but would miss TIME_WAIT entries on IPv6-only bind addresses.
  • The waitForPortBindable test suite only covers the non-retryable EACCES short-circuit; the core retry-on-EADDRINUSE behavior and timeout/throw path are not exercised by any test.

Confidence Score: 4/5

  • This PR is safe to merge; the two-layer approach is architecturally sound and the changes are well-scoped.
  • No logic bugs found. The implementation correctly handles TIME_WAIT at both the probe and listen layers. Minor concerns: waitForPortBindable lacks a host parameter (IPv6 edge case) and its core retry loop is not covered by tests, but neither affects correctness for the primary use case.
  • src/cli/ports.test.ts would benefit from tests covering the EADDRINUSE retry path and timeout scenario in waitForPortBindable.

Last reviewed commit: 81cfa53

Comment on lines +85 to +97
describe("waitForPortBindable", () => {
it("propagates EACCES rejection immediately without retrying", async () => {
// Every call to createServer will emit EACCES — so if waitForPortBindable retried,
// mockCreateServer would be called many times. We assert it's called exactly once.
mockCreateServer.mockClear();
mockCreateServer.mockReturnValue(makeErrServer("EACCES"));
await expect(
waitForPortBindable(80, { timeoutMs: 5000, intervalMs: 50 }),
).rejects.toMatchObject({ code: "EACCES" });
// Only one probe should have been attempted — no spinning through the retry loop.
expect(mockCreateServer).toHaveBeenCalledTimes(1);
});
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage for the core retry loop

The waitForPortBindable test suite only exercises the non-retryable EACCES short-circuit path. The primary purpose of the function — retrying when probePortFree returns false (EADDRINUSE / TIME_WAIT) and eventually succeeding after N probes — is not covered by any test. Neither is the timeout/throw scenario (port stays occupied throughout all retries).

Consider adding:

  1. A test where mockCreateServer returns EADDRINUSE for the first N calls and then succeeds, asserting the returned waited value is > 0 and that the probe was called the expected number of times.
  2. A test where mockCreateServer always returns EADDRINUSE and timeoutMs is small, asserting the function rejects with the expected error message.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/ports.test.ts
Line: 85-97

Comment:
**Missing test coverage for the core retry loop**

The `waitForPortBindable` test suite only exercises the non-retryable EACCES short-circuit path. The primary purpose of the function — retrying when `probePortFree` returns `false` (EADDRINUSE / TIME_WAIT) and eventually succeeding after N probes — is not covered by any test. Neither is the timeout/throw scenario (port stays occupied throughout all retries).

Consider adding:
1. A test where `mockCreateServer` returns EADDRINUSE for the first N calls and then succeeds, asserting the returned `waited` value is `> 0` and that the probe was called the expected number of times.
2. A test where `mockCreateServer` always returns EADDRINUSE and `timeoutMs` is small, asserting the function rejects with the expected error message.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +367 to +386
export async function waitForPortBindable(
port: number,
opts: { timeoutMs?: number; intervalMs?: number } = {},
): Promise<number> {
const timeoutMs = Math.max(opts.timeoutMs ?? 3000, 0);
const intervalMs = Math.max(opts.intervalMs ?? 150, 1);
let waited = 0;
while (waited < timeoutMs) {
if (await probePortFree(port)) {
return waited;
}
await sleep(intervalMs);
waited += intervalMs;
}
// Final attempt
if (await probePortFree(port)) {
return waited;
}
throw new Error(`port ${port} still not bindable after ${waited}ms (TIME_WAIT or kernel hold)`);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waitForPortBindable silently locks probe to 0.0.0.0

probePortFree accepts a host parameter, but waitForPortBindable doesn't expose one — every internal call uses the default "0.0.0.0". In run.ts, the resolved bind address (loopback, LAN IP, Tailscale interface, etc.) isn't yet available at the point waitForPortBindable is called, so the host can't be forwarded today without a refactor.

In practice this works correctly for IPv4 because binding 0.0.0.0:PORT overlaps with any specific IPv4 address, so a TIME_WAIT entry on 127.0.0.1:PORT will still cause an EADDRINUSE on 0.0.0.0:PORT. However, if a system uses an IPv6-only gateway bind (e.g., ::1), a TIME_WAIT on that address won't be detected by probing 0.0.0.0.

Adding an optional host parameter to waitForPortBindable (defaulting to "0.0.0.0") would make this future-safe and document the intentional choice explicitly:

export async function waitForPortBindable(
  port: number,
  opts: { timeoutMs?: number; intervalMs?: number; host?: string } = {},
): Promise<number> {
  const host = opts.host ?? "0.0.0.0";
  // ...
  if (await probePortFree(port, host)) {
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/ports.ts
Line: 367-386

Comment:
**`waitForPortBindable` silently locks probe to `0.0.0.0`**

`probePortFree` accepts a `host` parameter, but `waitForPortBindable` doesn't expose one — every internal call uses the default `"0.0.0.0"`. In `run.ts`, the resolved bind address (loopback, LAN IP, Tailscale interface, etc.) isn't yet available at the point `waitForPortBindable` is called, so the host can't be forwarded today without a refactor.

In practice this works correctly for IPv4 because binding `0.0.0.0:PORT` overlaps with any specific IPv4 address, so a TIME_WAIT entry on `127.0.0.1:PORT` will still cause an EADDRINUSE on `0.0.0.0:PORT`. However, if a system uses an IPv6-only gateway bind (e.g., `::1`), a TIME_WAIT on that address won't be detected by probing `0.0.0.0`.

Adding an optional `host` parameter to `waitForPortBindable` (defaulting to `"0.0.0.0"`) would make this future-safe and document the intentional choice explicitly:

```typescript
export async function waitForPortBindable(
  port: number,
  opts: { timeoutMs?: number; intervalMs?: number; host?: string } = {},
): Promise<number> {
  const host = opts.host ?? "0.0.0.0";
  // ...
  if (await probePortFree(port, host)) {
```

How can I resolve this? If you propose a fix, please make it concise.

@Takhoffman
Copy link
Copy Markdown
Contributor

Thanks again for this work.

This was landed indirectly via the synthesized PR #33831, and your contribution is credited in the changelog and as a co-author on the merge commit.

This PR merge/triage workflow is AI-assisted.

Closing this PR as superseded by #33831. If anything here looks incorrect or incomplete, reply to reopen and we can reassess.

@Takhoffman Takhoffman closed this Mar 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cli CLI command changes gateway Gateway runtime size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants