Skip to content

fix(gateway): restore loopback handshake timeout budget (#44714)#44962

Open
best wants to merge 2 commits intoopenclaw:mainfrom
best:fix/loopback-handshake-timeout
Open

fix(gateway): restore loopback handshake timeout budget (#44714)#44962
best wants to merge 2 commits intoopenclaw:mainfrom
best:fix/loopback-handshake-timeout

Conversation

@best
Copy link
Copy Markdown

@best best commented Mar 13, 2026

Summary

  • Problem: PR Hardening: tighten preauth WebSocket handshake limits #44089 (GHSA-jv4g-m82p-2j93) reduced DEFAULT_HANDSHAKE_TIMEOUT_MS from 10 s → 3 s. On loaded machines, event-loop latency pushes the pre-auth WebSocket handshake past 3 s, so every local CLI command (openclaw logs, openclaw cron list, etc.) fails with handshake-timeout while the gateway remains healthy.
  • Why it matters: Every local CLI command that uses WebSocket is broken on busy gateways after upgrading to v2026.3.12.
  • What changed: Added a loopback-aware timeout tier — direct localhost connections get 10 s (pre-hardening budget), remote/proxied clients keep 3 s (preserving the security hardening).
  • What did NOT change (scope boundary): Remote client timeout, pre-auth payload limit, or any other GHSA-jv4g-m82p-2j93 hardening.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway / orchestration
  • Auth / tokens

Linked Issue/PR

User-visible / Behavior Changes

Local CLI commands that were failing with handshake-timeout on v2026.3.12 will work again. No config changes needed.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Debian 13 x64 / macOS (reported by multiple users)
  • Runtime/container: Node v22+
  • Model/provider: N/A
  • Integration/channel: N/A
  • Relevant config: Default local gateway, auth mode token

Steps

  1. Start gateway on v2026.3.12
  2. Ensure some background load (active cron jobs, plugin activity)
  3. Run openclaw logs --follow or openclaw cron list

Expected

  • CLI connects and works normally

Actual

  • gateway closed (1000 normal closure) / handshake-timeout

Evidence

Human Verification (required)

  • Verified scenarios: Local gateway with active cron jobs, openclaw cron list works after patching the dist bundle with the equivalent constant change
  • Edge cases checked: Remote client (3 s preserved), proxied-via-loopback (3 s preserved), direct loopback (10 s restored)
  • What you did not verify: Full CI suite (relying on CI)

AI-Assisted Contribution 🤖

  • This PR was developed with AI assistance (Codex for implementation, Claude Code for independent review)
  • Testing: Lightly tested — pnpm check passed locally, unit tests 3/3 passed, static analysis + dual AI review, CI pending
  • I understand what the code does and can explain every line

Review Conversations

  • I will reply to or resolve every bot review conversation addressed in this PR.
  • I will leave unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert this commit; all connections fall back to 3 s
  • Files/config to restore: None
  • Known bad symptoms: If loopback detection misclassifies a remote connection, it would get a longer pre-auth window (10 s vs 3 s) — still well within typical HTTP server timeouts

Risks and Mitigations

  • Risk: isLocalDirectRequest() misclassifies a proxied remote connection as local, giving it the longer 10 s window
    • Mitigation: Uses the existing proxy-aware isLocalDirectRequest() which resolves the real client IP through trustedProxies + X-Forwarded-For chain; unit test explicitly covers the proxied-via-loopback scenario

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR restores a two-tier pre-auth WebSocket handshake timeout: direct loopback connections (local CLI tools) get 10 s — matching the pre-hardening budget — while all remote/proxied clients keep the 3 s window introduced by the GHSA-jv4g-m82p-2j93 hardening. The fix inserts cleanly into the existing proxy-aware isLocalDirectRequest machinery, avoids touching the remote-client timeout or any other security knob, and is covered by four targeted unit tests.

Key observations:

  • The new resolveHandshakeTimeoutMs helper delegates entirely to the well-tested isLocalDirectRequest, so the proxy-chain safety properties (spoofed X-Forwarded-For without configured trustedProxies does NOT yield a local classification) are inherited automatically.
  • The loadConfig() call added in ws-connection.ts is consistent with message-handler.ts, which already calls loadConfig() per connection for the same gateway.trustedProxies / allowRealIpFallback fields. This means two loadConfig() calls occur per connection rather than one — a minor redundancy, but not a bug and not materially different from the existing pattern.
  • The OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS override in getHandshakeTimeoutMs correctly short-circuits before the new isLocalClient branch, preserving all existing integration tests without modification.
  • IPv6 loopback (::1) test coverage is now included, addressing the previously raised gap.

Confidence Score: 4/5

  • This PR is safe to merge; the security boundary (remote clients stay at 3 s) is preserved and the fix re-uses existing, tested proxy-aware loopback detection.
  • The logic is straightforward, re-uses a battle-tested function (isLocalDirectRequest), and is backed by four unit tests covering all key scenarios including IPv6 loopback and the proxied-via-loopback edge case. The only minor concern is a second loadConfig() call per connection, but that's consistent with the existing codebase pattern in message-handler.ts. No critical logic errors or security regressions found.
  • No files require special attention. src/gateway/server/ws-connection.ts introduces a second loadConfig() per connection but this matches the existing pattern in message-handler.ts and carries no correctness risk.

Last reviewed commit: 915e8f2

Comment on lines +20 to +55
it("keeps the longer default timeout for direct loopback handshakes only", () => {
expect(
resolveHandshakeTimeoutMs({
req: createGatewayRequest({
path: "/",
host: "127.0.0.1:18789",
remoteAddress: "127.0.0.1",
}),
}),
).toBe(DEFAULT_LOCAL_HANDSHAKE_TIMEOUT_MS);

expect(
resolveHandshakeTimeoutMs({
req: createGatewayRequest({
path: "/",
host: "gateway.example:18789",
remoteAddress: "203.0.113.10",
}),
}),
).toBe(DEFAULT_HANDSHAKE_TIMEOUT_MS);

expect(
resolveHandshakeTimeoutMs({
req: createGatewayRequest({
path: "/",
host: "gateway.example:18789",
remoteAddress: "127.0.0.1",
headers: {
"x-forwarded-for": "203.0.113.10",
"x-forwarded-host": "gateway.example:18789",
},
}),
trustedProxies: ["127.0.0.1"],
}),
).toBe(DEFAULT_HANDSHAKE_TIMEOUT_MS);
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing IPv6 loopback test case

The unit test only exercises IPv4 loopback (127.0.0.1). On macOS and many Linux distributions, Node.js WebSocket connections to localhost resolve to ::1 (IPv6). isLocalishHost appears to handle ::1 based on its comment ("localhost/127.x::1 and mapped forms") and the parsing logic in net.ts, but there is no test coverage to confirm that resolveHandshakeTimeoutMs returns DEFAULT_LOCAL_HANDSHAKE_TIMEOUT_MS for an IPv6 direct loopback connection.

Consider adding a third scenario:

expect(
  resolveHandshakeTimeoutMs({
    req: createGatewayRequest({
      path: "/",
      host: "[::1]:18789",
      remoteAddress: "::1",
    }),
  }),
).toBe(DEFAULT_LOCAL_HANDSHAKE_TIMEOUT_MS);

Without this, a user on macOS where Node binds to ::1 could still hit the 3 s timeout after this fix is applied.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server.preauth-hardening.test.ts
Line: 20-55

Comment:
**Missing IPv6 loopback test case**

The unit test only exercises IPv4 loopback (`127.0.0.1`). On macOS and many Linux distributions, Node.js WebSocket connections to `localhost` resolve to `::1` (IPv6). `isLocalishHost` appears to handle `::1` based on its comment ("localhost/127.x::1 and mapped forms") and the parsing logic in `net.ts`, but there is no test coverage to confirm that `resolveHandshakeTimeoutMs` returns `DEFAULT_LOCAL_HANDSHAKE_TIMEOUT_MS` for an IPv6 direct loopback connection.

Consider adding a third scenario:

```ts
expect(
  resolveHandshakeTimeoutMs({
    req: createGatewayRequest({
      path: "/",
      host: "[::1]:18789",
      remoteAddress: "::1",
    }),
  }),
).toBe(DEFAULT_LOCAL_HANDSHAKE_TIMEOUT_MS);
```

Without this, a user on macOS where Node binds to `::1` could still hit the 3 s timeout after this fix is applied.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 98fb6bd99a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

});

const handshakeTimeoutMs = getHandshakeTimeoutMs();
const configSnapshot = loadConfig();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Start handshake timeout before config loading

Loading config synchronously at connection-open delays creation of the handshake timer, so unauthenticated sockets are allowed to stay open for loadConfig() latency plus the configured timeout. This weakens the 3s remote hardening under slow config reads (JSON5 parse/includes/env substitution), and it is compounded because the message handler immediately calls loadConfig() again (src/gateway/server/ws-connection/message-handler.ts), making slow-path latency worse. The timer should be armed before any synchronous config work (or config should be resolved once and reused) so the pre-auth budget is actually enforced from accept time.

Useful? React with 👍 / 👎.

@best best force-pushed the fix/loopback-handshake-timeout branch from 98fb6bd to 915e8f2 Compare March 13, 2026 10:13
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 915e8f2d59

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +10 to +13
const isLocalHandshake = isLocalDirectRequest(
params.req,
params.trustedProxies,
params.allowRealIpFallback === true,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve local timeout when loopback is in trusted proxies

Calling isLocalDirectRequest with trustedProxies misclassifies direct localhost CLI connections as non-local when operators configure gateway.trustedProxies to include 127.0.0.1/::1 (a documented reverse-proxy setup): resolveClientIp treats the loopback peer as a proxy and, without forwarded headers, returns no client IP, so this path falls back to the 3s timeout instead of the intended 10s local budget. In that configuration, busy hosts can still reproduce handshake-timeout for direct local CLI WebSocket connects.

Useful? React with 👍 / 👎.

@best
Copy link
Copy Markdown
Author

best commented Mar 13, 2026

@greptileai review

@pandar00
Copy link
Copy Markdown

Experiencing the same issue

{"0":"{"subsystem":"gateway/ws"}","1":{"cause":"handshake-timeout","handshake":"failed","durationMs":3121,"host":"127.0.0.1:18789","handshakeMs":3002},"2":"closed before connect conn=1bf8fcf9-117e-486a-8aa0-6e16b498102d remote=127.0.0.1 fwd=n/a origin=n/a host=127.0.0.1:18789 ua=n/a code=1000 reason=n/a","_meta":{"runtime":"node","runtimeVersion":"22.22.1","hostname":"unknown","name":"{"subsystem":"gateway/ws"}","parentNames":["openclaw"],"date":"2026-03-14T01:19:51.053Z","logLevelId":4,"logLevelName":"WARN","path":{"fullFilePath":"file:///usr/local/lib/node_modules/openclaw/distsubsystem-DBBM_0ww.js:1057:51","fileName":"subsystem-DBBM_0ww.js","fileNameWithLine":"subsystem-DBBM_0ww.js:1057","fileColumn":"51","fileLine":"1057","filePath":"usr/local/lib/node_modules/openclaw/dist/subsystem-DBBM_0ww.js","filePathWithLine":"usr/local/lib/node_modules/openclaw/dist/subsystem-DBBM_0ww.js:1057","method":"logToFile"}},"time":"2026-03-13T18:19:51.053-07:00"}

Manually updating to 10s no longer have the issue

@timmyz
Copy link
Copy Markdown

timmyz commented Mar 15, 2026

@greptileai review

2026.3.13 still has this issue. It seems the pr has conflicts, will you resolve it? @best

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: S

Projects

None yet

3 participants