fix: clear WS handshake timer early, increase timeouts#49751
fix: clear WS handshake timer early, increase timeouts#49751sashakhar1 wants to merge 1 commit intoopenclaw:mainfrom
Conversation
The gateway handshake timer fires before connect auth completes on resource-constrained hosts (small VPS, busy event loop), because clearHandshakeTimer() is called only after the full async auth flow finishes. On systems where resolveConnectAuthState() or device signature verification takes >3s, the timer kills a legitimate in-progress connection with code 1000 "normal closure". Move clearHandshakeTimer() to right after the connect request is validated (parsed, method == connect, params OK) — before any async auth work begins. This separates "did the client send a valid connect request?" from "is the auth valid?", preventing the timer from racing against auth resolution. Also raise the default timeouts as a safety margin: - Server handshake timeout: 3s → 10s - Client challenge timeout: 2s → 10s (max 30s) Fixes openclaw#46650 Fixes openclaw#48167 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Greptile SummaryThis PR fixes a real race condition on resource-constrained hosts where the 3-second WS handshake timer would fire and kill a legitimate connection whose async auth flow hadn't finished yet. The fix is semantically correct — the handshake timer's purpose is "did the client send a valid connect request?", which is answered as soon as the frame validates, not after auth resolves. Auth failures continue to be handled with proper close codes by the auth handler itself. Timeouts are also raised (server 3 s → 10 s, client 2 s → 10 s) to further reduce false-positive timer firings. Key observations:
Confidence Score: 4/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/gateway/server/ws-connection/message-handler.ts
Line: 347
Comment:
**No timeout guard for a hanging auth flow**
After this change, clearing the handshake timer as soon as a valid connect frame arrives means there is no longer any watchdog protecting against an indefinitely-stalled auth resolution. If `resolveConnectAuthState` (or the downstream device verification / token-grant logic called later in the handler) ever hangs due to slow I/O, a deadlocked rate-limiter, or an unresponsive upstream service, the WebSocket connection will stay open forever with no cleanup mechanism.
Before this PR the 3-second handshake timer would have eventually killed such a connection (a side-effect, not its designed purpose). That safety net is now gone.
Consider adding a per-auth deadline, for example wrapping the async auth block in a `Promise.race` against an `AbortSignal`/timeout:
```ts
// Clear handshake timer – client sent a valid connect frame
clearHandshakeTimer();
// Guard against indefinitely-stalled auth
const AUTH_TIMEOUT_MS = 30_000;
const authTimeoutId = setTimeout(() => {
close(1011, "auth timeout");
}, AUTH_TIMEOUT_MS);
try {
// ... existing async auth flow ...
} finally {
clearTimeout(authTimeoutId);
}
```
This is low-risk to add and prevents connection/memory leaks in pathological environments where the very slow I/O that motivated this fix could also stall auth indefinitely.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/gateway/client.ts
Line: 701
Comment:
**Undocumented upper-bound change for `connectDelayMs`**
In addition to the default changing from `2_000` to `10_000`, the maximum clamp value for the `connectDelayMs` option has also been silently increased from `10_000` to `30_000`. The PR description only mentions the default change.
This means callers who previously set `connectDelayMs` to a value between 10 001 ms and 30 000 ms (e.g. in tests) will now get a materially different timeout than before, since their value was silently clamped to 10 s previously and is now honored at face value. While the change is likely intentional (aligning the configurable ceiling with the new server-side 10 s default), it's worth explicitly documenting and auditing any call sites that supply a `connectDelayMs` above 10 000 ms to confirm the new behaviour is expected.
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: "fix: clear WS handsh..." |
| // Auth resolution is async and can take longer than the handshake timeout | ||
| // on resource-constrained hosts — clearing early prevents the timer from | ||
| // killing a connection that is legitimately in-progress. | ||
| clearHandshakeTimer(); |
There was a problem hiding this comment.
No timeout guard for a hanging auth flow
After this change, clearing the handshake timer as soon as a valid connect frame arrives means there is no longer any watchdog protecting against an indefinitely-stalled auth resolution. If resolveConnectAuthState (or the downstream device verification / token-grant logic called later in the handler) ever hangs due to slow I/O, a deadlocked rate-limiter, or an unresponsive upstream service, the WebSocket connection will stay open forever with no cleanup mechanism.
Before this PR the 3-second handshake timer would have eventually killed such a connection (a side-effect, not its designed purpose). That safety net is now gone.
Consider adding a per-auth deadline, for example wrapping the async auth block in a Promise.race against an AbortSignal/timeout:
// Clear handshake timer – client sent a valid connect frame
clearHandshakeTimer();
// Guard against indefinitely-stalled auth
const AUTH_TIMEOUT_MS = 30_000;
const authTimeoutId = setTimeout(() => {
close(1011, "auth timeout");
}, AUTH_TIMEOUT_MS);
try {
// ... existing async auth flow ...
} finally {
clearTimeout(authTimeoutId);
}This is low-risk to add and prevents connection/memory leaks in pathological environments where the very slow I/O that motivated this fix could also stall auth indefinitely.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server/ws-connection/message-handler.ts
Line: 347
Comment:
**No timeout guard for a hanging auth flow**
After this change, clearing the handshake timer as soon as a valid connect frame arrives means there is no longer any watchdog protecting against an indefinitely-stalled auth resolution. If `resolveConnectAuthState` (or the downstream device verification / token-grant logic called later in the handler) ever hangs due to slow I/O, a deadlocked rate-limiter, or an unresponsive upstream service, the WebSocket connection will stay open forever with no cleanup mechanism.
Before this PR the 3-second handshake timer would have eventually killed such a connection (a side-effect, not its designed purpose). That safety net is now gone.
Consider adding a per-auth deadline, for example wrapping the async auth block in a `Promise.race` against an `AbortSignal`/timeout:
```ts
// Clear handshake timer – client sent a valid connect frame
clearHandshakeTimer();
// Guard against indefinitely-stalled auth
const AUTH_TIMEOUT_MS = 30_000;
const authTimeoutId = setTimeout(() => {
close(1011, "auth timeout");
}, AUTH_TIMEOUT_MS);
try {
// ... existing async auth flow ...
} finally {
clearTimeout(authTimeoutId);
}
```
This is low-risk to add and prevents connection/memory leaks in pathological environments where the very slow I/O that motivated this fix could also stall auth indefinitely.
How can I resolve this? If you propose a fix, please make it concise.| typeof rawConnectDelayMs === "number" && Number.isFinite(rawConnectDelayMs) | ||
| ? Math.max(250, Math.min(10_000, rawConnectDelayMs)) | ||
| : 2_000; | ||
| ? Math.max(250, Math.min(30_000, rawConnectDelayMs)) |
There was a problem hiding this comment.
Undocumented upper-bound change for
connectDelayMs
In addition to the default changing from 2_000 to 10_000, the maximum clamp value for the connectDelayMs option has also been silently increased from 10_000 to 30_000. The PR description only mentions the default change.
This means callers who previously set connectDelayMs to a value between 10 001 ms and 30 000 ms (e.g. in tests) will now get a materially different timeout than before, since their value was silently clamped to 10 s previously and is now honored at face value. While the change is likely intentional (aligning the configurable ceiling with the new server-side 10 s default), it's worth explicitly documenting and auditing any call sites that supply a connectDelayMs above 10 000 ms to confirm the new behaviour is expected.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/client.ts
Line: 701
Comment:
**Undocumented upper-bound change for `connectDelayMs`**
In addition to the default changing from `2_000` to `10_000`, the maximum clamp value for the `connectDelayMs` option has also been silently increased from `10_000` to `30_000`. The PR description only mentions the default change.
This means callers who previously set `connectDelayMs` to a value between 10 001 ms and 30 000 ms (e.g. in tests) will now get a materially different timeout than before, since their value was silently clamped to 10 s previously and is now honored at face value. While the change is likely intentional (aligning the configurable ceiling with the new server-side 10 s default), it's worth explicitly documenting and auditing any call sites that supply a `connectDelayMs` above 10 000 ms to confirm the new behaviour is expected.
How can I resolve this? If you propose a fix, please make it concise.
Summary
clearHandshakeTimer()earlier in the gateway connect handler — right after validating the connect request, before any async auth work (device verification, token resolution, etc.)Problem
On resource-constrained hosts (small VPS, busy Node.js event loop), the gateway async auth flow (
resolveConnectAuthState, device signature verification, token grant) can take longer than the 3-secondDEFAULT_HANDSHAKE_TIMEOUT_MS. BecauseclearHandshakeTimer()is only called after the entire auth flow completes, the timer fires mid-auth and kills a legitimate in-progress connection with code 1000.The client sends a valid
connectrequest within ~50ms of WS open, the gateway receives it, starts processing auth — then the handshake timer fires 3 seconds after WS opened and closes the connection before auth finishes.Gateway logs show:
Client sees:
Root cause
The handshake timer starts on WS open and races against the full async auth resolution. When auth takes >3s (due to event loop pressure, slow I/O, or heavy GC), the timer wins.
Fix
Separate "did the client send a valid connect request?" from "is the auth valid?". Clear the timer as soon as the connect request parses and validates, before entering async auth:
If auth fails, the connection is closed by the auth handler itself (with a proper error code), not by the handshake timer.
Files changed
src/gateway/server/ws-connection/message-handler.tsclearHandshakeTimer()from post-auth to pre-authsrc/gateway/server-constants.tsDEFAULT_HANDSHAKE_TIMEOUT_MS: 3s -> 10ssrc/gateway/client.tsTest plan
openclaw gateway healthsucceeds on a resource-constrained host (2GB VPS, loopback + token auth)openclaw cron listsucceeds (requires operator.read scope via device identity)Fixes #46650
Fixes #48167