fix: improve WS handshake reliability on slow-startup environments #60075
…aw#48832)

schema.ts and validation.ts imported CHANNEL_IDS from channels/registry.js, which re-exports from channels/ids.js but also imports plugins/runtime.js. When the bundler resolves this dependency graph, the re-exported CHANNEL_IDS can be undefined at the point config/validation.ts evaluates (temporal dead zone), causing 'CHANNEL_IDS is not iterable' on startup.

Fix: import CHANNEL_IDS directly from channels/ids.js (the leaf module with zero heavy dependencies) and normalizeChatChannelId from channels/chat-meta.js.

Fixes openclaw#48832
…penclaw#48736)

On Windows with large dist bundles (46MB/639 files), heavy synchronous module loading blocks the event loop during CLI startup, preventing timely processing of the connect.challenge frame and causing ~80% handshake timeout failures.

Changes:
- Yield event loop (setImmediate) before starting WS connection in callGateway to let pending I/O drain after heavy module loading
- Add OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS env var override for client-side connect challenge timeout (server already has OPENCLAW_HANDSHAKE_TIMEOUT_MS)
- Include diagnostic timing in challenge timeout error messages (elapsed vs limit) for easier debugging
- Add tests for env var override and resolution logic
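For readers unfamiliar with the yield technique, here is a minimal sketch of what "yield the event loop before starting the WS connection" could look like. It is not the code from this PR: `GatewayClientLike`, `yieldToEventLoop`, and `startAfterYield` are hypothetical names used only for illustration, and the real `callGateway` may be structured differently.

```typescript
// Hypothetical stand-in for the gateway client; only start() matters for this sketch.
interface GatewayClientLike {
  start(): void;
}

// Resolve from a setImmediate callback. Node runs setImmediate callbacks in the
// "check" phase, after the poll phase has processed pending I/O callbacks.
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => {
    setImmediate(() => resolve());
  });
}

// Give the event loop one turn before the handshake begins.
async function startAfterYield(client: GatewayClientLike): Promise<void> {
  await yieldToEventLoop();
  client.start();
}
```

The design trade-off is small: one extra loop turn of latency on every call, in exchange for not starting the handshake timer while startup work is still queued.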
Greptile Summary
This PR fixes WS handshake timeouts on Windows slow-startup environments via three mechanisms: a
Confidence Score: 5/5
Safe to merge; the setImmediate fix correctly addresses the root cause, and the clamping concern is a usability limitation rather than a runtime defect.
All findings are P2. The core fix (setImmediate yield) is minimal and correct. The env var clamping concern doesn't cause incorrect behavior; it only limits the env var's usefulness for raising the timeout above the default, so it doesn't block merge.
src/gateway/handshake-timeouts.ts: MAX_CONNECT_CHALLENGE_TIMEOUT_MS cap
    }
    const envOverride = getConnectChallengeTimeoutMsFromEnv();
    if (envOverride !== undefined) {
      return clampConnectChallengeTimeoutMs(envOverride);
Env var silently capped at the default timeout
clampConnectChallengeTimeoutMs caps values at MAX_CONNECT_CHALLENGE_TIMEOUT_MS, which equals DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS (10,000 ms). So OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS=30000 is silently truncated to 10,000 ms — the same as the default. The env var can only lower the timeout, not raise it, which is the opposite of what "slow environments" actually need. Either raise MAX_CONNECT_CHALLENGE_TIMEOUT_MS to a sensible upper bound (e.g. 60,000 ms) or document that the range is capped at the default.
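To make the reviewer's point concrete, the sketch below reproduces the arithmetic of the clamp under stated assumptions. The 10,000 ms default and the suggested 60,000 ms bound come from the comment above; the lower bound (`MIN_CONNECT_CHALLENGE_TIMEOUT_MS = 250`) and the helper name `clampTimeoutMs` are assumptions for illustration, not taken from the PR.

```typescript
// Value quoted in the review comment.
const DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS = 10_000;
// Assumed lower bound for illustration only; the PR's actual minimum is not shown here.
const MIN_CONNECT_CHALLENGE_TIMEOUT_MS = 250;

// Hypothetical helper: clamp a requested timeout into [min, maxMs].
function clampTimeoutMs(valueMs: number, maxMs: number): number {
  return Math.max(MIN_CONNECT_CHALLENGE_TIMEOUT_MS, Math.min(maxMs, valueMs));
}

// Behavior described in the review: the cap equals the default,
// so a request for 30,000 ms comes back as 10,000 ms.
const cappedAtDefault = clampTimeoutMs(30_000, DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS);

// Reviewer's suggested alternative: a higher upper bound such as 60,000 ms.
const SUGGESTED_MAX_MS = 60_000;
const cappedAtSuggested = clampTimeoutMs(30_000, SUGGESTED_MAX_MS);

console.log(cappedAtDefault); // 10000
console.log(cappedAtSuggested); // 30000
```

With the cap equal to the default, the override can only shorten the timeout; raising the cap is what lets slow environments extend it.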
Pull request overview
Improves the reliability and diagnosability of the CLI → Gateway WebSocket v2 handshake in slow-startup / event-loop-starved environments (notably Windows with large dist bundles), and reduces some startup coupling to heavier channel-registry modules.
Changes:
- Yield once to the event loop before starting the WS client to let pending I/O drain prior to handshake initiation.
- Add client-side `OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS` env override (with clamping) and tests for parsing/precedence.
- Improve connect-challenge timeout errors with elapsed/limit timing, and adjust config imports to use leaf channel ID/meta modules.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/gateway/handshake-timeouts.ts | Adds env var parsing for connect-challenge timeout and updates resolution precedence. |
| src/gateway/handshake-timeouts.test.ts | Adds tests for the new env var parsing + override behavior. |
| src/gateway/client.ts | Enhances connect-challenge timeout error message with elapsed vs configured limit. |
| src/gateway/call.ts | Yields via setImmediate before starting WS client to reduce handshake flakiness on slow startups. |
| src/config/validation.ts | Switches channel imports to leaf modules (channels/ids, channels/chat-meta) to avoid heavier registry import. |
| src/config/schema.ts | Switches CHANNEL_IDS import to channels/ids leaf module. |
    test("resolveConnectChallengeTimeoutMs falls back to env override", () => {
      const original = process.env.OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS;
      try {
        process.env.OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS = "5000";
        expect(resolveConnectChallengeTimeoutMs()).toBe(5_000);
        // Explicit value still takes precedence over env
        expect(resolveConnectChallengeTimeoutMs(3_000)).toBe(3_000);
      } finally {
Earlier tests in this file expect resolveConnectChallengeTimeoutMs() to return the default, but the implementation now reads process.env.OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS. If a developer has that env var set locally, these tests will become order-/machine-dependent. Consider clearing/restoring OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS in a beforeEach/afterEach for the whole describe, or refactoring resolveConnectChallengeTimeoutMs to accept an env parameter so tests can stay hermetic without mutating global process.env.
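One way to read the second suggestion (passing the environment in as a parameter) is sketched below. This is an illustration rather than the PR's implementation: `EnvLike`, `parseTimeoutFromEnv`, and `resolveTimeoutMs` are hypothetical names, the precedence (explicit value, then env, then default) follows the test shown above, and clamping is left out to keep the sketch short.

```typescript
// Minimal shape of an environment map, so tests can pass a plain object
// instead of mutating the global process.env.
type EnvLike = Record<string, string | undefined>;

const DEFAULT_TIMEOUT_MS = 10_000;
const ENV_KEY = "OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS";

// Hypothetical parser: returns undefined for missing, empty, or non-numeric values.
function parseTimeoutFromEnv(env: EnvLike): number | undefined {
  const raw = env[ENV_KEY];
  if (raw === undefined || raw.trim() === "") {
    return undefined;
  }
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : undefined;
}

// Hypothetical resolver: explicit value wins, then the env override, then the default.
function resolveTimeoutMs(explicitMs: number | undefined, env: EnvLike): number {
  if (explicitMs !== undefined) {
    return explicitMs;
  }
  return parseTimeoutFromEnv(env) ?? DEFAULT_TIMEOUT_MS;
}

// Each call sees only the env object it is given, so results do not depend
// on what a developer has exported in their shell.
console.log(resolveTimeoutMs(undefined, {})); // 10000
console.log(resolveTimeoutMs(undefined, { [ENV_KEY]: "5000" })); // 5000
console.log(resolveTimeoutMs(3_000, { [ENV_KEY]: "5000" })); // 3000
```

Production callers would pass `process.env`, while tests pass a literal object and need no setup or teardown hooks.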
…60075)

* fix: import CHANNEL_IDS from leaf module to avoid TDZ on init (#48832)

schema.ts and validation.ts imported CHANNEL_IDS from channels/registry.js, which re-exports from channels/ids.js but also imports plugins/runtime.js. When the bundler resolves this dependency graph, the re-exported CHANNEL_IDS can be undefined at the point config/validation.ts evaluates (temporal dead zone), causing 'CHANNEL_IDS is not iterable' on startup.

Fix: import CHANNEL_IDS directly from channels/ids.js (the leaf module with zero heavy dependencies) and normalizeChatChannelId from channels/chat-meta.js.

Fixes #48832

* fix: improve WS handshake reliability on slow-startup environments (#48736)

On Windows with large dist bundles (46MB/639 files), heavy synchronous module loading blocks the event loop during CLI startup, preventing timely processing of the connect.challenge frame and causing ~80% handshake timeout failures.

Changes:
- Yield event loop (setImmediate) before starting WS connection in callGateway to let pending I/O drain after heavy module loading
- Add OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS env var override for client-side connect challenge timeout (server already has OPENCLAW_HANDSHAKE_TIMEOUT_MS)
- Include diagnostic timing in challenge timeout error messages (elapsed vs limit) for easier debugging
- Add tests for env var override and resolution logic

---------

Co-authored-by: Brad Groux <[email protected]>
Fixes #48736
Problem
On Windows with large dist bundles (46MB / 639 files), heavy synchronous module loading during CLI startup blocks the event loop, preventing timely processing of the `connect.challenge` frame. This causes ~80% handshake timeout failures for CLI commands that use WebSocket connections (e.g. `openclaw cron list`), while the gateway itself is healthy.

Key evidence from the issue:
- `openclaw --version`: 127ms vs `openclaw cron list` first stdout: 3,775ms

Changes
- Event loop yield before WS connection (`call.ts`): Added `setImmediate` yield before `client.start()` so pending I/O from heavy module loading can drain before the handshake begins.
- Client-side env var override (`handshake-timeouts.ts`): Added `OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS` env var for the client-side connect challenge timeout. The server already supports `OPENCLAW_HANDSHAKE_TIMEOUT_MS`; this gives users a client-side escape hatch for slow environments.
- Diagnostic timing in error messages (`client.ts`): Challenge timeout errors now include elapsed time and configured limit (e.g. `waited 10023ms, limit 10000ms`), making it much easier to diagnose timing issues.
- Tests: Added coverage for env var parsing, resolution precedence, and fallback behavior.
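As an illustration of the diagnostic timing item, the sketch below builds a message in the `waited …ms, limit …ms` shape quoted above. The function name `formatChallengeTimeoutError` and the leading "connect challenge timeout" wording are assumptions; only the parenthesized part mirrors the example in this description.

```typescript
// Hypothetical formatter: reports how long the client actually waited
// next to the configured limit, so event-loop stalls become visible.
function formatChallengeTimeoutError(
  startedAtMs: number,
  nowMs: number,
  limitMs: number,
): string {
  const elapsedMs = Math.round(nowMs - startedAtMs);
  // The message prefix is assumed wording, not the PR's actual text.
  return `connect challenge timeout (waited ${elapsedMs}ms, limit ${limitMs}ms)`;
}

// Fixed timestamps keep the example deterministic.
const timeoutMessage = formatChallengeTimeoutError(1_000, 11_023, 10_000);
console.log(timeoutMessage); // connect challenge timeout (waited 10023ms, limit 10000ms)
```

An elapsed value well above the limit suggests the timer callback itself was delayed, which points at a blocked event loop rather than a slow gateway.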
Testing
- `pnpm build`: passes
- `pnpm check`: passes (tsgo, lint, all checks)
- `pnpm vitest run src/gateway/handshake-timeouts.test.ts`: 5/5 tests pass