CLI WebSocket handshake timeout on Windows (intermittent, ~80% failure rate) #48736
Description
Environment
- OS: Windows 10/11 (x64)
- Node.js: v24.13.0
- OpenClaw: 2026.3.13 (61d171a)
- Gateway bind: lan (0.0.0.0:18789)
- Auth mode: token
Problem
CLI commands that require a WebSocket connection to the gateway fail ~80% of the time with a handshake timeout on Windows. The gateway itself is running and healthy: Telegram, Discord, cron jobs, and the embedded agent all work perfectly.
What fails (~80% of the time):
- openclaw cron list (table mode)
- openclaw cron edit
- openclaw agent
- openclaw cron list --json (intermittent success)
What works:
- openclaw status: works (slow, ~10s)
- openclaw gateway status: works (RPC probe OK)
- openclaw config get/set: works
- openclaw --version: instant (127ms)
Error output:
gateway connect failed: Error: gateway closed (1000):
Error: gateway closed (1000 normal closure): no close reason
Gateway target: ws://127.0.0.1:18789
Source: local loopback
Gateway logs:
{"subsystem":"gateway/ws"} handshake timeout conn=<id> remote=127.0.0.1
{"cause":"handshake-timeout","handshake":"failed","durationMs":5237,"handshakeMs":3005}
Root Cause Analysis
The CLI opens the WebSocket but never sends the connect request frame within the 3-second handshake timeout. The gateway receives the connection but times out waiting for client identification.
Proof the gateway protocol works:
A minimal Node.js script (~5KB) implementing the v2 handshake protocol manually (open WS → receive connect.challenge → sign nonce with Ed25519 → send connect → receive hello-ok) connects 100% of the time (5/5 repeated tests). Same machine, same gateway, same auth token, same device identity.
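For readers who want to reproduce the signing step, here is a minimal sketch of the "sign nonce with Ed25519 → send connect" part using Node's built-in crypto module. The frame shapes, field names, and the freshly generated key pair are assumptions made for illustration; they are not taken from the OpenClaw wire format or from the helper script mentioned above.

```javascript
// Sketch only: frame shapes and field names below are assumptions,
// not OpenClaw's actual wire format.
const crypto = require('node:crypto');

// Hypothetical device identity; a real client would load a persisted key pair.
const { publicKey, privateKey } = crypto.generateKeyPairSync('ed25519');

// Build the reply to a challenge frame by signing its nonce with Ed25519.
function buildConnectFrame(challenge, token) {
  const nonce = Buffer.from(challenge.nonce, 'utf8');
  // For Ed25519 keys, the algorithm argument to crypto.sign must be null.
  const signature = crypto.sign(null, nonce, privateKey);
  return {
    type: 'connect', // assumed frame type
    token, // assumed field carrying the auth token
    nonce: challenge.nonce,
    signature: signature.toString('base64'),
  };
}

// Gateway-side check of the same signature, to show the round trip.
function verifyConnectFrame(frame) {
  return crypto.verify(
    null,
    Buffer.from(frame.nonce, 'utf8'),
    publicKey,
    Buffer.from(frame.signature, 'base64'),
  );
}

const frame = buildConnectFrame(
  { type: 'connect.challenge', nonce: 'example-nonce' },
  '<token>',
);
console.log(verifyConnectFrame(frame)); // true
```

Signing and verifying take well under a millisecond each, which fits the observation that the crypto step itself is not what consumes the handshake window.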
Timing data:
- openclaw --version: 127ms
- First CLI stdout for openclaw cron list: 3,775ms
- Total CLI time for failed cron list: ~8,800ms
- Helper script total time: <1,000ms
Hypothesis:
The dist directory contains 46MB of JavaScript across 639 files. Heavy module initialization may cause event loop delays on Windows, preventing timely processing of the connect.challenge event. The queueConnect() method sets a 2-second challenge timeout, and the gateway has a 3-second server-side handshake timeout. On Windows with this module load overhead, the CLI may not process the incoming challenge frame before one of these timeouts fires.
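The mechanism in this hypothesis can be illustrated in isolation. The sketch below is not OpenClaw code: it schedules a 0 ms timer and then busy-waits, with the busy-wait standing in for heavy synchronous module initialization. The function name measureTimerDelay and the 200 ms figure are made up for the example.

```javascript
// Sketch: show how synchronous work delays a callback that is already due.
const { performance } = require('node:perf_hooks');

// Schedule a 0 ms timer, then block the thread for blockMs (standing in for
// heavy module initialization). Resolves with how late the timer fired.
function measureTimerDelay(blockMs) {
  return new Promise((resolve) => {
    const scheduledAt = performance.now();
    setTimeout(() => resolve(performance.now() - scheduledAt), 0);
    const blockStart = performance.now();
    while (performance.now() - blockStart < blockMs) {
      // Busy-wait: no timers or socket events can be processed here.
    }
  });
}

measureTimerDelay(200).then((delayMs) => {
  console.log(`0 ms timer fired after ${delayMs.toFixed(0)} ms`);
});
```

An incoming WebSocket frame is subject to the same delay as the timer here: if startup blocks the event loop for longer than the 2-second or 3-second window, the challenge would sit unprocessed until a timeout has already fired. Whether that is what actually happens in the CLI still needs to be confirmed by profiling.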
What was tried (none fixed it):
- openclaw gateway install --force
- openclaw doctor --repair
- npm update -g openclaw (already latest)
- Setting OPENCLAW_GATEWAY_TOKEN env var
- Explicit --url ws://127.0.0.1:18789 --token <token> flags
- Device identity verified valid, crypto signing works, raw TCP/WS connects in 2-5ms
Workaround
A standalone helper script implementing the v2 handshake protocol directly works 100% reliably. Happy to share if helpful for debugging.
Suggested fixes:
- Make the challenge timeout and/or handshake timeout configurable
- Investigate event loop blocking during CLI startup on Windows
- Check if CLI opens multiple sequential WS connections (doctor check + command) that compete
- Consider lazy-loading modules so the handshake handler is registered before heavy init completes
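As a rough illustration of the first suggestion, the timeouts could be read from the environment with the current values as defaults. This is only a sketch: the variable names OPENCLAW_CHALLENGE_TIMEOUT_MS and OPENCLAW_HANDSHAKE_TIMEOUT_MS and the helper resolveTimeoutMs are hypothetical and do not exist in OpenClaw today. The 2000 and 3000 defaults are the values described in the hypothesis above.

```javascript
// Sketch: resolve a timeout from an environment variable with a safe fallback.
// The variable names used below are hypothetical, not existing OpenClaw settings.
function resolveTimeoutMs(env, name, defaultMs) {
  const raw = env[name];
  if (raw === undefined || raw.trim() === '') return defaultMs;
  const parsed = Number(raw);
  // Reject NaN, non-integers, and non-positive values instead of failing later.
  return Number.isInteger(parsed) && parsed > 0 ? parsed : defaultMs;
}

const env = { OPENCLAW_CHALLENGE_TIMEOUT_MS: '10000' }; // hypothetical override
const challengeTimeoutMs = resolveTimeoutMs(env, 'OPENCLAW_CHALLENGE_TIMEOUT_MS', 2000);
const handshakeTimeoutMs = resolveTimeoutMs(env, 'OPENCLAW_HANDSHAKE_TIMEOUT_MS', 3000);
console.log(challengeTimeoutMs, handshakeTimeoutMs); // 10000 3000
```

In a real CLI, process.env would be passed instead of the literal object. Falling back to the default on invalid input keeps a typo in the environment from disabling the timeout entirely.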