Skip to content

fix(gateway/ui): restore handshake timeout and remove connect delay (…#46631

Open
divitsinghall wants to merge 2 commits intoopenclaw:mainfrom
divitsinghall:fix/websocket-disconnect-reconnect-loop-46103
Open

fix(gateway/ui): restore handshake timeout and remove connect delay (…#46631
divitsinghall wants to merge 2 commits intoopenclaw:mainfrom
divitsinghall:fix/websocket-disconnect-reconnect-loop-46103

Conversation

@divitsinghall
Copy link
Copy Markdown

fix(gateway/ui): restore handshake timeout and remove connect delay (#46103)

The WebChat UI entered an infinite disconnect/reconnect loop after PR #44089 tightened the preauth handshake timeout from 10s to 3s. The browser client's queueConnect() added a 750ms debounce delay before sendConnect(), plus async crypto operations (device identity loading + signature generation), leaving insufficient time to complete the handshake before the server closed the connection.

Two changes:

  1. Restore DEFAULT_HANDSHAKE_TIMEOUT_MS to 10_000 (10s). The preauth payload size cap (MAX_PREAUTH_PAYLOAD_BYTES = 64KB) already provides sufficient protection against abuse without needing an aggressive timeout.
  2. Remove the 750ms queueConnect() delay. The server's connect.challenge nonce already ensures proper ordering, so the debounce was unnecessary and wasted time from the handshake budget.

Fixes #46103

Summary

  • Problem: WebChat UI at /chat enters an infinite disconnect/reconnect loop (code 1001) after PR Hardening: tighten preauth WebSocket handshake limits #44089 reduced the handshake timeout from 10s to 3s.
  • Why it matters: Users cannot use WebChat at all — the connection cycles every 1–2 seconds with a "disconnected" warning icon.
  • What changed: Restored DEFAULT_HANDSHAKE_TIMEOUT_MS to 10s and removed the unnecessary 750ms debounce delay in the browser client's queueConnect().
  • What did NOT change (scope boundary): The preauth payload size cap (64KB), preauth frame validation, and all other security hardening from PR Hardening: tighten preauth WebSocket handshake limits #44089 remain intact. No changes to Node.js client, auth flow, or reconnect backoff logic.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • WebChat connections at /chat now complete the handshake reliably instead of timing out in a loop.
  • DEFAULT_HANDSHAKE_TIMEOUT_MS restored from 3s to 10s (the original value before PR Hardening: tighten preauth WebSocket handshake limits #44089).
  • Browser client connects ~750ms faster (removed unnecessary debounce delay in queueConnect()).

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A — The preauth payload size cap (MAX_PREAUTH_PAYLOAD_BYTES = 64KB) and preauth frame validation from PR Hardening: tighten preauth WebSocket handshake limits #44089 remain intact. The handshake timeout is restored to the original 10s value that was stable for months before the regression.

Repro + Verification

Environment

  • OS: macOS (also reproducible on Chrome/Safari per reporter)
  • Runtime/container: Node.js (pnpm workspace)
  • Model/provider: N/A
  • Integration/channel (if any): WebChat UI
  • Relevant config (redacted): Default gateway config with gateway.bind = "loopback"

Steps

  1. Start gateway with pnpm gateway:watch
  2. Open http://127.0.0.1:18789/chat in Chrome
  3. Open DevTools → Network → WS tab
  4. Observe WebSocket connection behavior

Expected

  • Single stable WebSocket connection that stays open indefinitely

Actual

  • Before fix: WebSocket closes with code 1001 every 1–2 seconds and immediately reconnects
  • After fix: WebSocket connection stays open with no repeated close/open events

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

All 28 directly relevant gateway tests pass:

  • server.preauth-hardening.test.ts (2/2 ✅) — confirms handshake timeout still enforced via env override
  • client.watchdog.test.ts (6/6 ✅) — confirms tick watchdog unaffected
  • client.test.ts (20/20 ✅) — confirms client connect/reconnect/auth flows intact

Human Verification (required)

  • Verified scenarios: Ran all 28 directly relevant tests (preauth hardening, client watchdog, client auth). Confirmed the 7 failing tests in the full gateway suite (142 files) are pre-existing and in unrelated areas (agent symlinks, node-invoke approval).
  • Edge cases checked: Verified getHandshakeTimeoutMs() still supports OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS env override for tests. Confirmed queueConnect() still resets connectNonce and connectSent state correctly without the timer.
  • What you did not verify: Live WebChat UI testing (requires running the full gateway server and opening /chat in a browser). Recommend the reviewer verify this manually.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert commit 8af809dbd or set OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS=3000 as env var for quick testing.
  • Files/config to restore: src/gateway/server-constants.ts (change 10_000 back to 3_000), ui/src/ui/gateway.ts (restore 750ms setTimeout in queueConnect()).
  • Known bad symptoms reviewers should watch for: If the handshake timeout is too generous, slow/malicious clients could hold unauthenticated connections longer. Mitigated by the 64KB preauth payload cap.

Risks and Mitigations

  • Risk: Longer handshake timeout (10s vs 3s) allows unauthenticated sockets to linger slightly longer.
    • Mitigation: The MAX_PREAUTH_PAYLOAD_BYTES = 64KB cap limits what unauthenticated clients can send, and 10s was the stable default for months before the regression. The security posture is equivalent to the original pre-hardening state plus the payload cap.

@openclaw-barnacle openclaw-barnacle bot added app: web-ui App: web-ui gateway Gateway runtime size: XS labels Mar 14, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8af809dbde

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

void this.sendConnect();
}, 750);
this.connectTimer = null;
void this.sendConnect();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Wait for nonce before issuing browser connect request

queueConnect() now calls sendConnect() immediately on socket open, before connect.challenge is necessarily processed, so sendConnect() can build the device auth payload with nonce: "" when this.connectNonce is still null. In that case the server rejects the handshake (device nonce required) and closes the socket in src/gateway/server/ws-connection/message-handler.ts (providedNonce check), which can reproduce the same reconnect loop on slower/variable links where the challenge event arrives after open; the previous delay avoided this race by giving the challenge frame time to populate connectNonce.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 14, 2026

Greptile Summary

This PR fixes a WebChat reconnect loop regression introduced in #44089 by making two changes: restoring DEFAULT_HANDSHAKE_TIMEOUT_MS to 10s (correct, the 3s limit was too aggressive for the browser's async crypto path) and removing the 750ms debounce in queueConnect() (functionally works today but introduces a fragile timing dependency described below).

  • Timeout restoration (server-constants.ts): Clearly correct. The browser client must complete loadOrCreateDeviceIdentity() (localStorage + crypto.subtle.digest) and signDevicePayload() (crypto.subtle.sign) before the handshake finishes. 3s was too tight; 10s matches the pre-hardening baseline. The MAX_PREAUTH_PAYLOAD_BYTES = 64KB cap still limits unauthenticated connection abuse.

  • Timer removal (gateway.ts): The PR claims the server's connect.challenge nonce "ensures proper ordering", but the opposite is true: the server sends the challenge immediately on connection (ws-connection.ts), and rejects empty nonces with device-nonce-missing (message-handler.ts lines 606–613). By calling sendConnect() immediately in queueConnect(), connectSent is set to true before the challenge message can arrive as a macrotask. The sendConnect() call inside the connect.challenge handler (line 402) is now always a no-op. The code works because loadOrCreateDeviceIdentity() always awaits crypto.subtle.digest() — a call that currently yields to the macrotask queue in browsers, allowing the challenge to arrive before connectNonce is read. This is an undocumented, implementation-specific timing dependency rather than an explicit ordering guarantee.

  • A minimal timer (e.g., setTimeout(..., 0) or even 50ms) in queueConnect() would eliminate the race entirely while still recovering nearly all of the 750ms savings.

Confidence Score: 3/5

  • Safe to merge for the timeout fix; the timer removal works in practice but relies on browser crypto timing internals rather than explicit ordering.
  • The 10s timeout change correctly and completely fixes the reported regression. The 750ms timer removal works today because crypto.subtle.digest() yields to the macrotask queue in all current major browsers, giving the connect.challenge message time to arrive before connectNonce is read in sendConnect(). However, the connect.challenge handler's sendConnect() call is now always a no-op, and the ordering depends on browser-specific crypto scheduling rather than the explicit timer that was previously in place. If a JS environment resolves crypto.subtle promises as microtasks (or the device-identity fast-path skips the digest entirely in a future refactor), connections will fail with a server-side device-nonce-missing rejection and the reconnect loop will return. Score reflects that the PR solves the immediate bug but leaves the nonce-ordering correctness on uncertain footing.
  • ui/src/ui/gateway.ts — the queueConnect() change introduces an implicit timing dependency on crypto.subtle yielding to macrotasks; the connect.challenge handler's sendConnect() call is now always a no-op and the challenge nonce is only included by coincidence of async scheduling.

Comments Outside Diff (1)

  1. ui/src/ui/gateway.ts, line 482-490 (link)

    Nonce ordering now relies on implicit crypto timing

    The PR description states "The server's connect.challenge nonce already ensures proper ordering", but the server code in ws-connection.ts sends connect.challenge immediately on connection. The server's message-handler.ts (lines 606–613) explicitly rejects empty nonces:

    if (!providedNonce) {
        rejectDeviceAuthInvalid("device-nonce-missing", "device nonce required");
    

    With the timer removed, sendConnect() is invoked synchronously on the open event — before the connect.challenge macrotask can be processed. The nonce is only set in time because loadOrCreateDeviceIdentity() calls crypto.subtle.digest() (inside fingerprintPublicKey), which happens to yield to the macrotask queue in current browser implementations, allowing the challenge message to be processed during that pause.

    This means the correct nonce is included today because of an undocumented timing side-effect of the crypto API, not by design. If crypto.subtle.digest() ever resolves synchronously (e.g., in a future browser engine, a test environment with a synchronous crypto shim, or if identity is loaded from a synchronous cache path that avoids the await), connectNonce will still be null when read, the server will reject with device-nonce-missing, and the client will fall back into a reconnect loop.

    The original 750ms timer guaranteed the challenge macrotask was processed before sendConnect() ran, regardless of crypto timing. A more robust alternative that avoids the unnecessary delay would be to move sendConnect() invocation entirely into the connect.challenge handler (making the timer the sole fallback), rather than calling it eagerly from queueConnect():

    private queueConnect() {
      this.connectNonce = null;
      this.connectSent = false;
      if (this.connectTimer !== null) {
        window.clearTimeout(this.connectTimer);
      }
      // Fallback only: challenge handler is the primary trigger for sendConnect().
      // If the server never sends connect.challenge within a short window, we proceed
      // with an empty nonce and let the server reject/retry as appropriate.
      this.connectTimer = window.setTimeout(() => {
        void this.sendConnect();
      }, 50); // minimal delay; just enough to yield to the challenge macrotask
    }

    This preserves the ~750ms savings while keeping the ordering guarantee explicit.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: ui/src/ui/gateway.ts
Line: 482-490

Comment:
**Nonce ordering now relies on implicit crypto timing**

The PR description states "The server's connect.challenge nonce already ensures proper ordering", but the server code in `ws-connection.ts` sends `connect.challenge` **immediately** on connection. The server's `message-handler.ts` (lines 606–613) explicitly rejects empty nonces:

```
if (!providedNonce) {
    rejectDeviceAuthInvalid("device-nonce-missing", "device nonce required");
```

With the timer removed, `sendConnect()` is invoked synchronously on the `open` event — before the `connect.challenge` macrotask can be processed. The nonce is only set in time because `loadOrCreateDeviceIdentity()` calls `crypto.subtle.digest()` (inside `fingerprintPublicKey`), which happens to yield to the macrotask queue in current browser implementations, allowing the challenge message to be processed during that pause.

This means the correct nonce is included today because of an undocumented timing side-effect of the crypto API, not by design. If `crypto.subtle.digest()` ever resolves synchronously (e.g., in a future browser engine, a test environment with a synchronous crypto shim, or if identity is loaded from a synchronous cache path that avoids the `await`), `connectNonce` will still be `null` when read, the server will reject with `device-nonce-missing`, and the client will fall back into a reconnect loop.

The original 750ms timer guaranteed the challenge macrotask was processed before `sendConnect()` ran, regardless of crypto timing. A more robust alternative that avoids the unnecessary delay would be to move `sendConnect()` invocation entirely into the `connect.challenge` handler (making the timer the sole fallback), rather than calling it eagerly from `queueConnect()`:</p>

```typescript
private queueConnect() {
  this.connectNonce = null;
  this.connectSent = false;
  if (this.connectTimer !== null) {
    window.clearTimeout(this.connectTimer);
  }
  // Fallback only: challenge handler is the primary trigger for sendConnect().
  // If the server never sends connect.challenge within a short window, we proceed
  // with an empty nonce and let the server reject/retry as appropriate.
  this.connectTimer = window.setTimeout(() => {
    void this.sendConnect();
  }, 50); // minimal delay; just enough to yield to the challenge macrotask
}
```

This preserves the ~750ms savings while keeping the ordering guarantee explicit.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 8af809d

…penclaw#46103)

The WebChat UI entered an infinite disconnect/reconnect loop after PR openclaw#44089
tightened the preauth handshake timeout from 10s to 3s. The browser client's
queueConnect() added a 750ms debounce delay before sendConnect(), plus async
crypto operations (device identity loading + signature generation), leaving
insufficient time to complete the handshake before the server closed the
connection.

Two changes:
1. Restore DEFAULT_HANDSHAKE_TIMEOUT_MS to 10_000 (10s). The preauth payload
   size cap (MAX_PREAUTH_PAYLOAD_BYTES = 64KB) already provides sufficient
   protection against abuse without needing an aggressive timeout.
2. Remove the 750ms queueConnect() delay. The server's connect.challenge nonce
   already ensures proper ordering, so the debounce was unnecessary and wasted
   time from the handshake budget.

Fixes openclaw#46103
@divitsinghall divitsinghall force-pushed the fix/websocket-disconnect-reconnect-loop-46103 branch from 8af809d to 3f82e62 Compare March 17, 2026 03:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

app: web-ui App: web-ui gateway Gateway runtime size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: WebSocket continuously disconnects and reconnects

1 participant