fix(gateway): restore loopback handshake timeout budget (#44714)#44962
fix(gateway): restore loopback handshake timeout budget (#44714)#44962best wants to merge 2 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR restores a two-tier pre-auth WebSocket handshake timeout: direct loopback connections (local CLI tools) get 10 s — matching the pre-hardening budget — while all remote/proxied clients keep the 3 s window introduced by the GHSA-jv4g-m82p-2j93 hardening. The fix inserts cleanly into the existing proxy-aware Key observations:
Confidence Score: 4/5
Last reviewed commit: 915e8f2 |
| it("keeps the longer default timeout for direct loopback handshakes only", () => { | ||
| expect( | ||
| resolveHandshakeTimeoutMs({ | ||
| req: createGatewayRequest({ | ||
| path: "/", | ||
| host: "127.0.0.1:18789", | ||
| remoteAddress: "127.0.0.1", | ||
| }), | ||
| }), | ||
| ).toBe(DEFAULT_LOCAL_HANDSHAKE_TIMEOUT_MS); | ||
|
|
||
| expect( | ||
| resolveHandshakeTimeoutMs({ | ||
| req: createGatewayRequest({ | ||
| path: "/", | ||
| host: "gateway.example:18789", | ||
| remoteAddress: "203.0.113.10", | ||
| }), | ||
| }), | ||
| ).toBe(DEFAULT_HANDSHAKE_TIMEOUT_MS); | ||
|
|
||
| expect( | ||
| resolveHandshakeTimeoutMs({ | ||
| req: createGatewayRequest({ | ||
| path: "/", | ||
| host: "gateway.example:18789", | ||
| remoteAddress: "127.0.0.1", | ||
| headers: { | ||
| "x-forwarded-for": "203.0.113.10", | ||
| "x-forwarded-host": "gateway.example:18789", | ||
| }, | ||
| }), | ||
| trustedProxies: ["127.0.0.1"], | ||
| }), | ||
| ).toBe(DEFAULT_HANDSHAKE_TIMEOUT_MS); | ||
| }); |
There was a problem hiding this comment.
Missing IPv6 loopback test case
The unit test only exercises IPv4 loopback (127.0.0.1). On macOS and many Linux distributions, Node.js WebSocket connections to localhost resolve to ::1 (IPv6). isLocalishHost appears to handle ::1 based on its comment ("localhost/127.x::1 and mapped forms") and the parsing logic in net.ts, but there is no test coverage to confirm that resolveHandshakeTimeoutMs returns DEFAULT_LOCAL_HANDSHAKE_TIMEOUT_MS for an IPv6 direct loopback connection.
Consider adding a third scenario:
expect(
resolveHandshakeTimeoutMs({
req: createGatewayRequest({
path: "/",
host: "[::1]:18789",
remoteAddress: "::1",
}),
}),
).toBe(DEFAULT_LOCAL_HANDSHAKE_TIMEOUT_MS);Without this, a user on macOS where Node binds to ::1 could still hit the 3 s timeout after this fix is applied.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server.preauth-hardening.test.ts
Line: 20-55
Comment:
**Missing IPv6 loopback test case**
The unit test only exercises IPv4 loopback (`127.0.0.1`). On macOS and many Linux distributions, Node.js WebSocket connections to `localhost` resolve to `::1` (IPv6). `isLocalishHost` appears to handle `::1` based on its comment ("localhost/127.x::1 and mapped forms") and the parsing logic in `net.ts`, but there is no test coverage to confirm that `resolveHandshakeTimeoutMs` returns `DEFAULT_LOCAL_HANDSHAKE_TIMEOUT_MS` for an IPv6 direct loopback connection.
Consider adding a third scenario:
```ts
expect(
resolveHandshakeTimeoutMs({
req: createGatewayRequest({
path: "/",
host: "[::1]:18789",
remoteAddress: "::1",
}),
}),
).toBe(DEFAULT_LOCAL_HANDSHAKE_TIMEOUT_MS);
```
Without this, a user on macOS where Node binds to `::1` could still hit the 3 s timeout after this fix is applied.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 98fb6bd99a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| }); | ||
|
|
||
| const handshakeTimeoutMs = getHandshakeTimeoutMs(); | ||
| const configSnapshot = loadConfig(); |
There was a problem hiding this comment.
Start handshake timeout before config loading
Loading config synchronously at connection-open delays creation of the handshake timer, so unauthenticated sockets are allowed to stay open for loadConfig() latency plus the configured timeout. This weakens the 3s remote hardening under slow config reads (JSON5 parse/includes/env substitution), and it is compounded because the message handler immediately calls loadConfig() again (src/gateway/server/ws-connection/message-handler.ts), making slow-path latency worse. The timer should be armed before any synchronous config work (or config should be resolved once and reused) so the pre-auth budget is actually enforced from accept time.
Useful? React with 👍 / 👎.
98fb6bd to
915e8f2
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 915e8f2d59
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const isLocalHandshake = isLocalDirectRequest( | ||
| params.req, | ||
| params.trustedProxies, | ||
| params.allowRealIpFallback === true, |
There was a problem hiding this comment.
Preserve local timeout when loopback is in trusted proxies
Calling isLocalDirectRequest with trustedProxies misclassifies direct localhost CLI connections as non-local when operators configure gateway.trustedProxies to include 127.0.0.1/::1 (a documented reverse-proxy setup): resolveClientIp treats the loopback peer as a proxy and, without forwarded headers, returns no client IP, so this path falls back to the 3s timeout instead of the intended 10s local budget. In that configuration, busy hosts can still reproduce handshake-timeout for direct local CLI WebSocket connects.
Useful? React with 👍 / 👎.
|
@greptileai review |
|
Experiencing the same issue {"0":"{"subsystem":"gateway/ws"}","1":{"cause":"handshake-timeout","handshake":"failed","durationMs":3121,"host":"127.0.0.1:18789","handshakeMs":3002},"2":"closed before connect conn=1bf8fcf9-117e-486a-8aa0-6e16b498102d remote=127.0.0.1 fwd=n/a origin=n/a host=127.0.0.1:18789 ua=n/a code=1000 reason=n/a","_meta":{"runtime":"node","runtimeVersion":"22.22.1","hostname":"unknown","name":"{"subsystem":"gateway/ws"}","parentNames":["openclaw"],"date":"2026-03-14T01:19:51.053Z","logLevelId":4,"logLevelName":"WARN","path":{"fullFilePath":"file:///usr/local/lib/node_modules/openclaw/distsubsystem-DBBM_0ww.js:1057:51","fileName":"subsystem-DBBM_0ww.js","fileNameWithLine":"subsystem-DBBM_0ww.js:1057","fileColumn":"51","fileLine":"1057","filePath":"usr/local/lib/node_modules/openclaw/dist/subsystem-DBBM_0ww.js","filePathWithLine":"usr/local/lib/node_modules/openclaw/dist/subsystem-DBBM_0ww.js:1057","method":"logToFile"}},"time":"2026-03-13T18:19:51.053-07:00"} Manually updating to 10s no longer have the issue |
2026.3.13 still has this issue. It seems the pr has conflicts, will you resolve it? @best |
Summary
DEFAULT_HANDSHAKE_TIMEOUT_MSfrom 10 s → 3 s. On loaded machines, event-loop latency pushes the pre-auth WebSocket handshake past 3 s, so every local CLI command (openclaw logs,openclaw cron list, etc.) fails withhandshake-timeoutwhile the gateway remains healthy.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
openclaw logs --followfails after 2026.3.12 upgrade with handshake timeout while gateway remains healthy #44714User-visible / Behavior Changes
Local CLI commands that were failing with
handshake-timeouton v2026.3.12 will work again. No config changes needed.Security Impact (required)
Repro + Verification
Environment
Steps
openclaw logs --followoropenclaw cron listExpected
Actual
gateway closed (1000 normal closure)/handshake-timeoutEvidence
{"cause":"handshake-timeout","handshakeMs":3000}for loopback connectionsopenclaw logs --followfails after 2026.3.12 upgrade with handshake timeout while gateway remains healthy #44714 (@taikai-zz, @timmyz, @NGLSL) and [Bug] CLI命令超时 - WebSocket认证问题 #44901Human Verification (required)
openclaw cron listworks after patching the dist bundle with the equivalent constant changeAI-Assisted Contribution 🤖
pnpm checkpassed locally, unit tests 3/3 passed, static analysis + dual AI review, CI pendingReview Conversations
Compatibility / Migration
Failure Recovery (if this breaks)
Risks and Mitigations
isLocalDirectRequest()misclassifies a proxied remote connection as local, giving it the longer 10 s windowisLocalDirectRequest()which resolves the real client IP throughtrustedProxies+X-Forwarded-Forchain; unit test explicitly covers the proxied-via-loopback scenario