fix: improve WS handshake reliability on slow-startup environments #60075

Merged
BradGroux merged 2 commits into openclaw:main from
BradGroux:bgod/fix-48736-windows-ws-handshake-timeout
Apr 3, 2026
Conversation

@BradGroux
Contributor

Fixes #48736

Problem

On Windows with large dist bundles (46MB / 639 files), heavy synchronous module loading during CLI startup blocks the event loop, preventing timely processing of the connect.challenge frame. This causes ~80% handshake timeout failures for CLI commands that use WebSocket connections (e.g. openclaw cron list), while the gateway itself is healthy.

Key evidence from the issue:

  • A manual Node.js script implementing the v2 handshake connects 100% of the time
  • openclaw --version takes 127ms, while openclaw cron list takes 3,775ms to reach first stdout
  • The CLI doesn't send the connect frame within the challenge timeout window

Changes

  1. Event loop yield before WS connection (call.ts): Added setImmediate yield before client.start() so pending I/O from heavy module loading can drain before the handshake begins.

  2. Client-side env var override (handshake-timeouts.ts): Added OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS env var for the client-side connect challenge timeout. The server already supports OPENCLAW_HANDSHAKE_TIMEOUT_MS — this gives users a client-side escape hatch for slow environments.

  3. Diagnostic timing in error messages (client.ts): Challenge timeout errors now include elapsed time and configured limit (e.g. waited 10023ms, limit 10000ms), making it much easier to diagnose timing issues.

  4. Tests: Added coverage for env var parsing, resolution precedence, and fallback behavior.
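
The yield described in change 1 can be sketched roughly as follows. This is a minimal illustration, not the code in call.ts: `StartableClient`, `yieldToEventLoop`, and `startAfterYield` are hypothetical names, and only the use of setImmediate before client.start() comes from the description above.

```typescript
// Hypothetical stand-in for the gateway client; only start() matters here.
interface StartableClient {
  start(): void;
}

// Resolves on the next setImmediate tick, which Node.js schedules after
// pending I/O callbacks, so work queued during synchronous startup can drain.
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => {
    setImmediate(() => resolve());
  });
}

// Yield once, then begin the WebSocket handshake.
async function startAfterYield(client: StartableClient): Promise<void> {
  await yieldToEventLoop();
  client.start();
}
```

A single yield does not shorten module loading itself; it only gives the event loop one turn between the synchronous startup work and the start of the handshake.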

Testing

  • pnpm build — passes
  • pnpm check — passes (tsgo, lint, all checks)
  • pnpm vitest run src/gateway/handshake-timeouts.test.ts — 5/5 tests pass

…aw#48832)

schema.ts and validation.ts imported CHANNEL_IDS from channels/registry.js,
which re-exports from channels/ids.js but also imports plugins/runtime.js.
When the bundler resolves this dependency graph, the re-exported CHANNEL_IDS
can be undefined at the point config/validation.ts evaluates (temporal dead
zone), causing 'CHANNEL_IDS is not iterable' on startup.

Fix: import CHANNEL_IDS directly from channels/ids.js (the leaf module with
zero heavy dependencies) and normalizeChatChannelId from channels/chat-meta.js.

Fixes openclaw#48832
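
The failure mode described in this commit message can be imitated in a single file. The sketch below is only a simulation under stated assumptions: `ModuleExports`, `collectChannelIds`, the two export objects, and the placeholder channel names are hypothetical stand-ins, not code from channels/registry.js or channels/ids.js. It shows why iterating a binding that has not been assigned yet produces a TypeError, and why reading from a fully initialized leaf module avoids it.

```typescript
// Hypothetical shape of a module's exports at some point during bundle startup.
type ModuleExports = { CHANNEL_IDS?: readonly string[] };

// Stand-in for a re-exporting module that is still mid-evaluation:
// the CHANNEL_IDS binding has not been assigned yet.
const registryMidEvaluation: ModuleExports = {};

// Stand-in for a leaf module with no heavy imports (placeholder channel names).
const leafIdsModule: ModuleExports = { CHANNEL_IDS: ["alpha", "beta"] };

// Mimics what a config module does at evaluation time: iterate the channel IDs.
function collectChannelIds(source: ModuleExports): string {
  try {
    // The importer assumes the binding is ready; the cast mirrors that belief.
    const ids = source.CHANNEL_IDS as readonly string[];
    const known: string[] = [];
    for (const id of ids) {
      known.push(id);
    }
    return `ok:${known.length}`;
  } catch (err) {
    // Iterating undefined throws a TypeError.
    return err instanceof TypeError ? "TypeError" : "unexpected";
  }
}
```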
…penclaw#48736)

On Windows with large dist bundles (46MB/639 files), heavy synchronous
module loading blocks the event loop during CLI startup, preventing
timely processing of the connect.challenge frame and causing ~80%
handshake timeout failures.

Changes:
- Yield event loop (setImmediate) before starting WS connection in
  callGateway to let pending I/O drain after heavy module loading
- Add OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS env var override for
  client-side connect challenge timeout (server already has
  OPENCLAW_HANDSHAKE_TIMEOUT_MS)
- Include diagnostic timing in challenge timeout error messages
  (elapsed vs limit) for easier debugging
- Add tests for env var override and resolution logic
Copilot AI review requested due to automatic review settings April 3, 2026 04:36
@openclaw-barnacle (bot) added the labels gateway (Gateway runtime), size: S, and maintainer (Maintainer-authored PR) on Apr 3, 2026
@greptile-apps
Contributor

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR fixes WS handshake timeouts on Windows slow-startup environments via three mechanisms: a setImmediate event-loop yield before the WebSocket connection starts (call.ts), a new OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS client-side env var override (handshake-timeouts.ts), and richer diagnostic timing in the challenge-timeout error message (client.ts). The two import-path cleanups in config/ (ids.js, chat-meta.js) are unrelated boundary improvements bundled into the same PR.

  • The setImmediate yield is the real root-cause fix and is well-placed and documented.
  • The env var escape hatch is silently capped at MAX_CONNECT_CHALLENGE_TIMEOUT_MS = DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS = 10 000 ms, so users cannot raise the client-side timeout above the default — the primary use case for slow environments. See inline comment on handshake-timeouts.ts:31.

Confidence Score: 5/5

Safe to merge; the setImmediate fix correctly addresses the root cause, and the clamping concern is a usability limitation rather than a runtime defect.

All findings are P2. The core fix (setImmediate yield) is minimal and correct. The env var clamping concern doesn't cause incorrect behavior — it only limits the env var's usefulness for raising the timeout above the default — so it doesn't block merge.

src/gateway/handshake-timeouts.ts — MAX_CONNECT_CHALLENGE_TIMEOUT_MS cap


Reviews (1): Last reviewed commit: "fix: improve WS handshake reliability on..."

  }
  const envOverride = getConnectChallengeTimeoutMsFromEnv();
  if (envOverride !== undefined) {
    return clampConnectChallengeTimeoutMs(envOverride);

P2 Env var silently capped at the default timeout

clampConnectChallengeTimeoutMs caps values at MAX_CONNECT_CHALLENGE_TIMEOUT_MS, which equals DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS (10,000 ms). So OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS=30000 is silently truncated to 10,000 ms — the same as the default. The env var can only lower the timeout, not raise it, which is the opposite of what "slow environments" actually need. Either raise MAX_CONNECT_CHALLENGE_TIMEOUT_MS to a sensible upper bound (e.g. 60,000 ms) or document that the range is capped at the default.
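
One way to act on this comment is sketched below. It is an assumption-laden illustration, not the merged implementation: the function names mirror those mentioned in the review, but the signatures (an explicit `env` argument), the 250 ms lower bound, the 60,000 ms cap, and the choice to clamp explicit values as well are hypothetical. Only the 10,000 ms default and the precedence of an explicit value over the env var are taken from the PR.

```typescript
const DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS = 10_000; // default stated in the review
const MIN_CONNECT_CHALLENGE_TIMEOUT_MS = 250; // assumed lower bound
const MAX_CONNECT_CHALLENGE_TIMEOUT_MS = 60_000; // cap raised above the default, as suggested

type Env = Record<string, string | undefined>;

function clampConnectChallengeTimeoutMs(ms: number): number {
  return Math.min(
    MAX_CONNECT_CHALLENGE_TIMEOUT_MS,
    Math.max(MIN_CONNECT_CHALLENGE_TIMEOUT_MS, ms),
  );
}

// Returns undefined for missing, empty, non-numeric, or non-positive values.
function getConnectChallengeTimeoutMsFromEnv(env: Env): number | undefined {
  const raw = env["OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") return undefined;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : undefined;
}

// Precedence: explicit argument, then env override, then the default.
function resolveConnectChallengeTimeoutMs(
  explicitMs: number | undefined,
  env: Env,
): number {
  if (explicitMs !== undefined) return clampConnectChallengeTimeoutMs(explicitMs);
  const envOverride = getConnectChallengeTimeoutMsFromEnv(env);
  if (envOverride !== undefined) return clampConnectChallengeTimeoutMs(envOverride);
  return DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS;
}
```

With a cap above the default, a value such as OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS=30000 resolves to 30,000 ms in this sketch instead of being truncated to 10,000 ms.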


Copilot AI left a comment

Pull request overview

Improves the reliability and diagnosability of the CLI → Gateway WebSocket v2 handshake in slow-startup / event-loop-starved environments (notably Windows with large dist bundles), and reduces some startup coupling to heavier channel-registry modules.

Changes:

  • Yield once to the event loop before starting the WS client to let pending I/O drain prior to handshake initiation.
  • Add client-side OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS env override (with clamping) and tests for parsing/precedence.
  • Improve connect-challenge timeout errors with elapsed/limit timing, and adjust config imports to use leaf channel ID/meta modules.
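
The elapsed/limit timing in the timeout error could be produced along these lines. This is a sketch only: `formatChallengeTimeoutMessage` and the message prefix are hypothetical, and just the "waited Nms, limit Nms" shape is taken from the example in the PR description.

```typescript
// Builds a timeout message that reports how long the client actually waited
// next to the configured limit. The timestamps are plain millisecond values,
// for example from Date.now() or performance.now().
function formatChallengeTimeoutMessage(
  startedAtMs: number,
  nowMs: number,
  limitMs: number,
): string {
  const elapsedMs = Math.max(0, Math.round(nowMs - startedAtMs));
  return `connect challenge timeout (waited ${elapsedMs}ms, limit ${limitMs}ms)`;
}
```

Seeing an elapsed time far above the limit suggests the timer callback itself ran late, which points at a blocked event loop rather than a slow gateway.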

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/gateway/handshake-timeouts.ts | Adds env var parsing for connect-challenge timeout and updates resolution precedence. |
| src/gateway/handshake-timeouts.test.ts | Adds tests for the new env var parsing + override behavior. |
| src/gateway/client.ts | Enhances connect-challenge timeout error message with elapsed vs configured limit. |
| src/gateway/call.ts | Yields via setImmediate before starting WS client to reduce handshake flakiness on slow startups. |
| src/config/validation.ts | Switches channel imports to leaf modules (channels/ids, channels/chat-meta) to avoid heavier registry import. |
| src/config/schema.ts | Switches CHANNEL_IDS import to channels/ids leaf module. |

Comment on lines +49 to +56
test("resolveConnectChallengeTimeoutMs falls back to env override", () => {
  const original = process.env.OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS;
  try {
    process.env.OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS = "5000";
    expect(resolveConnectChallengeTimeoutMs()).toBe(5_000);
    // Explicit value still takes precedence over env
    expect(resolveConnectChallengeTimeoutMs(3_000)).toBe(3_000);
  } finally {
Copilot AI Apr 3, 2026

Earlier tests in this file expect resolveConnectChallengeTimeoutMs() to return the default, but the implementation now reads process.env.OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS. If a developer has that env var set locally, these tests will become order-/machine-dependent. Consider clearing/restoring OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS in a beforeEach/afterEach for the whole describe, or refactoring resolveConnectChallengeTimeoutMs to accept an env parameter so tests can stay hermetic without mutating global process.env.
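
The clear-and-restore approach mentioned in this comment could look roughly like the helper below. `withEnvVar` is a hypothetical name that does not appear in the PR; it wraps the same save, mutate, and restore pattern as the try/finally in the reviewed test so each test controls the variable regardless of the developer's shell.

```typescript
// Runs fn with process.env[name] set to value (or unset when value is
// undefined), then restores whatever was there before, even if fn throws.
// Synchronous only: an async fn would need an awaiting variant.
function withEnvVar<T>(name: string, value: string | undefined, fn: () => T): T {
  const original = process.env[name];
  if (value === undefined) {
    delete process.env[name];
  } else {
    process.env[name] = value;
  }
  try {
    return fn();
  } finally {
    if (original === undefined) {
      delete process.env[name];
    } else {
      process.env[name] = original;
    }
  }
}
```

A test that expects the default could then run its assertion inside `withEnvVar("OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS", undefined, ...)`, so a locally exported value cannot leak in.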

@BradGroux BradGroux self-assigned this Apr 3, 2026
@BradGroux BradGroux merged commit 6e94b04 into openclaw:main Apr 3, 2026
47 of 48 checks passed
ngutman pushed a commit that referenced this pull request Apr 3, 2026
…60075)

* fix: import CHANNEL_IDS from leaf module to avoid TDZ on init (#48832)

schema.ts and validation.ts imported CHANNEL_IDS from channels/registry.js,
which re-exports from channels/ids.js but also imports plugins/runtime.js.
When the bundler resolves this dependency graph, the re-exported CHANNEL_IDS
can be undefined at the point config/validation.ts evaluates (temporal dead
zone), causing 'CHANNEL_IDS is not iterable' on startup.

Fix: import CHANNEL_IDS directly from channels/ids.js (the leaf module with
zero heavy dependencies) and normalizeChatChannelId from channels/chat-meta.js.

Fixes #48832

* fix: improve WS handshake reliability on slow-startup environments (#48736)

On Windows with large dist bundles (46MB/639 files), heavy synchronous
module loading blocks the event loop during CLI startup, preventing
timely processing of the connect.challenge frame and causing ~80%
handshake timeout failures.

Changes:
- Yield event loop (setImmediate) before starting WS connection in
  callGateway to let pending I/O drain after heavy module loading
- Add OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS env var override for
  client-side connect challenge timeout (server already has
  OPENCLAW_HANDSHAKE_TIMEOUT_MS)
- Include diagnostic timing in challenge timeout error messages
  (elapsed vs limit) for easier debugging
- Add tests for env var override and resolution logic

---------

Co-authored-by: Brad Groux <[email protected]>
steipete pushed a commit to duncanita/openclaw that referenced this pull request Apr 4, 2026
…penclaw#60075)

Labels

gateway (Gateway runtime), maintainer (Maintainer-authored PR), size: S

Development

Successfully merging this pull request may close these issues.

CLI WebSocket handshake timeout on Windows (intermittent, ~80% failure rate)

2 participants