fix: clear WS handshake timer early, increase timeouts #49751

Open

sashakhar1 wants to merge 1 commit into openclaw:main from sashakhar1:fix/ws-handshake-timeout-loopback-token-auth

Conversation

@sashakhar1

Summary

  • Move clearHandshakeTimer() earlier in the gateway connect handler — right after validating the connect request, before any async auth work (device verification, token resolution, etc.)
  • Raise default timeouts: server handshake 3s → 10s, client challenge 2s → 10s

Problem

On resource-constrained hosts (small VPS, busy Node.js event loop), the gateway async auth flow (resolveConnectAuthState, device signature verification, token grant) can take longer than the 3-second DEFAULT_HANDSHAKE_TIMEOUT_MS. Because clearHandshakeTimer() is only called after the entire auth flow completes, the timer fires mid-auth and kills a legitimate in-progress connection with code 1000.

The client sends a valid connect request within ~50ms of WS open, the gateway receives it, starts processing auth — then the handshake timer fires 3 seconds after WS opened and closes the connection before auth finishes.

Gateway logs show:

[ws] handshake timeout conn=... remote=127.0.0.1
[ws] closed before connect conn=... code=1000 reason=n/a

Client sees:

Error: gateway closed (1000 normal closure): no close reason

Root cause

The handshake timer starts on WS open and races against the full async auth resolution. When auth takes >3s (due to event loop pressure, slow I/O, or heavy GC), the timer wins.

Timeline (broken):
  0ms    WS open -> handshake timer starts (3s)
  ~15ms  gateway sends connect.challenge
  ~50ms  client sends connect request -> gateway starts async auth
  3000ms handshake timer fires -> gateway sees !client -> close()
  ???ms  auth would have completed -> too late, connection dead

Fix

Separate "did the client send a valid connect request?" from "is the auth valid?". Clear the timer as soon as the connect request parses and validates, before entering async auth:

Timeline (fixed):
  0ms    WS open -> handshake timer starts (10s)
  ~15ms  gateway sends connect.challenge
  ~50ms  client sends connect request -> clearHandshakeTimer() -> async auth begins
  ???ms  auth completes -> session established

If auth fails, the connection is closed by the auth handler itself (with a proper error code), not by the handshake timer.
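The two timelines above can be sketched as a deterministic simulation. Everything here is illustrative: `simulateHandshake` and its fields are hypothetical names for this sketch, not the gateway's actual API.

```typescript
// Sketch of the race between the handshake timer and async auth.
// "clearEarly" models the fix: clear the timer on connect validation
// instead of after auth completes.
type Outcome = "closed-by-timer" | "session-established";

function simulateHandshake(opts: {
  timeoutMs: number;   // handshake timer duration
  authMs: number;      // how long async auth takes
  clearEarly: boolean; // true = new behavior (clear on connect validation)
}): Outcome {
  const validateAtMs = 50; // client sends a valid connect ~50ms after WS open
  // Old behavior: the timer is only cleared once auth finishes.
  const timerClearedAtMs = opts.clearEarly
    ? validateAtMs
    : validateAtMs + opts.authMs;
  return timerClearedAtMs <= opts.timeoutMs
    ? "session-established"
    : "closed-by-timer";
}
```

With the old ordering (`clearEarly: false`), a 5-second auth on a 3-second timer loses the race; clearing on connect validation makes the same auth succeed.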

Files changed

File                                                 Change
src/gateway/server/ws-connection/message-handler.ts  Move clearHandshakeTimer() from post-auth to pre-auth
src/gateway/server-constants.ts                      DEFAULT_HANDSHAKE_TIMEOUT_MS: 3s → 10s
src/gateway/client.ts                                Client challenge timeout default: 2s → 10s

Test plan

  • Existing gateway WS tests pass (timer moved, not removed — invalid/missing connect still times out)
  • openclaw gateway health succeeds on a resource-constrained host (2GB VPS, loopback + token auth)
  • openclaw cron list succeeds (requires operator.read scope via device identity)
  • Invalid handshake (malformed frame, wrong method) still closes promptly
  • Unauthenticated connect still rejected by auth handler (not the timer)

Fixes #46650
Fixes #48167

The gateway handshake timer fires before connect auth completes on
resource-constrained hosts (small VPS, busy event loop), because
clearHandshakeTimer() is called only after the full async auth flow
finishes. On systems where resolveConnectAuthState() or device
signature verification takes >3s, the timer kills a legitimate
in-progress connection with code 1000 "normal closure".

Move clearHandshakeTimer() to right after the connect request is
validated (parsed, method == connect, params OK) — before any async
auth work begins. This separates "did the client send a valid connect
request?" from "is the auth valid?", preventing the timer from racing
against auth resolution.

Also raise the default timeouts as a safety margin:
- Server handshake timeout: 3s → 10s
- Client challenge timeout: 2s → 10s (max 30s)

Fixes openclaw#46650
Fixes openclaw#48167

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@greptile-apps
Contributor

greptile-apps bot commented Mar 18, 2026

Greptile Summary

This PR fixes a real race condition on resource-constrained hosts where the 3-second WS handshake timer would fire and kill a legitimate connection whose async auth flow hadn't finished yet. The fix is semantically correct — the handshake timer's purpose is "did the client send a valid connect request?", which is answered as soon as the frame validates, not after auth resolves. Auth failures continue to be handled with proper close codes by the auth handler itself. Timeouts are also raised (server 3 s → 10 s, client 2 s → 10 s) to further reduce false-positive timer firings.

Key observations:

  • No watchdog for hanging auth: Moving clearHandshakeTimer() before the async auth block (at message-handler.ts:347) removes the only mechanism that would have eventually reclaimed a connection whose auth stalled indefinitely. If resolveConnectAuthState or downstream verification logic hangs due to slow I/O, the connection will now stay open forever. Consider adding a dedicated per-auth timeout to preserve this safety net without reinstating the original race condition.
  • Undocumented cap change in client.ts: The maximum allowed value for the connectDelayMs option was silently increased from 10 000 ms to 30 000 ms alongside the documented default change. Any call sites supplying connectDelayMs > 10_000 will now behave differently.

Confidence Score: 4/5

  • Safe to merge — the core semantic fix is correct and well-reasoned; minor follow-up work on an auth-hang watchdog is advisable but not blocking.
  • The handshake-timer move is logically sound and directly addresses the described failure mode. Auth failures still produce proper close codes. The main concern is a missing safety net for indefinitely-hanging auth flows (the original timer served this role as a side-effect), which could cause connection/resource leaks in pathological conditions but is unlikely in normal operation. The undocumented connectDelayMs cap change is minor. Existing tests should continue to pass since the timer is moved but not removed.
  • Pay attention to src/gateway/server/ws-connection/message-handler.ts — the absence of any timeout guard around the async auth block after clearHandshakeTimer() is the main gap to address before or after merging.

Last reviewed commit: "fix: clear WS handsh..."

src/gateway/server/ws-connection/message-handler.ts (line 347):

```ts
// Auth resolution is async and can take longer than the handshake timeout
// on resource-constrained hosts — clearing early prevents the timer from
// killing a connection that is legitimately in-progress.
clearHandshakeTimer();
```
Contributor

P2 No timeout guard for a hanging auth flow

After this change, clearing the handshake timer as soon as a valid connect frame arrives means there is no longer any watchdog protecting against an indefinitely-stalled auth resolution. If resolveConnectAuthState (or the downstream device verification / token-grant logic called later in the handler) ever hangs due to slow I/O, a deadlocked rate-limiter, or an unresponsive upstream service, the WebSocket connection will stay open forever with no cleanup mechanism.

Before this PR the 3-second handshake timer would have eventually killed such a connection (a side-effect, not its designed purpose). That safety net is now gone.

Consider adding a per-auth deadline, for example guarding the async auth block with a dedicated timeout (a `Promise.race` against an `AbortSignal` would also work):

```ts
// Clear handshake timer – client sent a valid connect frame
clearHandshakeTimer();

// Guard against indefinitely-stalled auth
const AUTH_TIMEOUT_MS = 30_000;
const authTimeoutId = setTimeout(() => {
  close(1011, "auth timeout");
}, AUTH_TIMEOUT_MS);
try {
  // ... existing async auth flow ...
} finally {
  clearTimeout(authTimeoutId);
}
```

This is low-risk to add and prevents connection/memory leaks in pathological environments where the very slow I/O that motivated this fix could also stall auth indefinitely.


src/gateway/client.ts (line 701):

```diff
 typeof rawConnectDelayMs === "number" && Number.isFinite(rawConnectDelayMs)
-  ? Math.max(250, Math.min(10_000, rawConnectDelayMs))
-  : 2_000;
+  ? Math.max(250, Math.min(30_000, rawConnectDelayMs))
+  : 10_000;
```
Contributor

P2 Undocumented upper-bound change for connectDelayMs

In addition to the default changing from 2_000 to 10_000, the maximum clamp value for the connectDelayMs option has also been silently increased from 10_000 to 30_000. The PR description only mentions the default change.

This means callers who previously set connectDelayMs to a value between 10 001 ms and 30 000 ms (e.g. in tests) will now get a materially different timeout than before, since their value was silently clamped to 10 s previously and is now honored at face value. While the change is likely intentional (aligning the configurable ceiling with the new server-side 10 s default), it's worth explicitly documenting and auditing any call sites that supply a connectDelayMs above 10 000 ms to confirm the new behaviour is expected.
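For illustration, the clamp the comment describes can be written as a small standalone helper (`resolveConnectDelayMs` is a hypothetical name; the real logic is inline in client.ts):

```typescript
// Mirrors the clamp as reviewed: finite numbers are clamped to
// [250, 30_000]; anything else falls back to the new 10s default.
function resolveConnectDelayMs(raw: unknown): number {
  return typeof raw === "number" && Number.isFinite(raw)
    ? Math.max(250, Math.min(30_000, raw))
    : 10_000;
}
```

A caller passing 15 000 ms was previously clamped to 10 s and is now honored at face value, which is exactly the behavioral change the comment flags.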



Labels

gateway (Gateway runtime), size: XS

Projects

None yet

1 participant