CLI WebSocket handshake timeout on Windows (intermittent, ~80% failure rate) #48736

@joeywrightphoto

Environment

  • OS: Windows 10/11 (x64)
  • Node.js: v24.13.0
  • OpenClaw: 2026.3.13 (61d171a)
  • Gateway bind: lan (0.0.0.0:18789)
  • Auth mode: token

Problem

CLI commands that require a WebSocket connection to the gateway fail with a handshake timeout roughly 80% of the time on Windows. The gateway itself is running and healthy: Telegram, Discord, cron jobs, and the embedded agent all work without problems.

What fails (~80% of the time):

  • openclaw cron list (table mode)
  • openclaw cron edit
  • openclaw agent
  • openclaw cron list --json (intermittent success)

What works:

  • openclaw status — works (slow, ~10s)
  • openclaw gateway status — works (RPC probe OK)
  • openclaw config get/set — works
  • openclaw --version — instant (127ms)

Error output:

gateway connect failed: Error: gateway closed (1000):
Error: gateway closed (1000 normal closure): no close reason
Gateway target: ws://127.0.0.1:18789
Source: local loopback

Gateway logs:

{"subsystem":"gateway/ws"} handshake timeout conn=<id> remote=127.0.0.1
{"cause":"handshake-timeout","handshake":"failed","durationMs":5237,"handshakeMs":3005}

Root Cause Analysis

The CLI opens the WebSocket but does not send the connect request frame within the gateway's 3-second handshake timeout. The gateway accepts the connection, then times out waiting for the client to identify itself.

Proof the gateway protocol works:

A minimal Node.js script (~5KB) that implements the v2 handshake protocol manually (open WS → receive connect.challenge → sign nonce with Ed25519 → send connect → receive hello-ok) connected in 5 of 5 repeated tests. It ran on the same machine, against the same gateway, with the same auth token and the same device identity.
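The signing step in that sequence can be sketched with Node's built-in crypto module. This is only an illustration, not the helper script itself: the nonce encoding (UTF-8 string), the base64 signature format, and the throwaway key pair are all assumptions, and the real frame layout of connect.challenge and connect is not shown here.

```javascript
const crypto = require("node:crypto");

// Throwaway key pair for illustration; a real client would load its
// existing device identity instead (assumption).
const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519");

// Sign a challenge nonce with Ed25519. The algorithm argument must be
// null for Ed25519 keys in Node's one-shot crypto.sign().
function signNonce(nonce, key) {
  return crypto.sign(null, Buffer.from(nonce, "utf8"), key).toString("base64");
}

// Counterpart check, as a gateway might perform it (assumption).
function verifyNonce(nonce, signatureB64, key) {
  return crypto.verify(
    null,
    Buffer.from(nonce, "utf8"),
    key,
    Buffer.from(signatureB64, "base64")
  );
}

// Hypothetical nonce; in the real protocol it arrives in connect.challenge.
const nonce = crypto.randomBytes(16).toString("hex");
const signature = signNonce(nonce, privateKey);
console.log("signature verifies:", verifyNonce(nonce, signature, publicKey));
```

Signing is a sub-millisecond operation on typical hardware, which is consistent with the report that crypto is not the bottleneck.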

Timing data:

  • openclaw --version: 127ms
  • First CLI stdout for openclaw cron list: 3,775ms
  • Total CLI time for failed cron list: ~8,800ms
  • Helper script total time: <1,000ms
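
Failure rates and wall-clock timings like these can be collected with a small harness that runs a command repeatedly and counts non-zero exits. The sketch below is not part of OpenClaw; the function name and options are hypothetical. On Windows the `shell: true` option is likely needed because globally installed npm CLIs are usually `.cmd` shims.

```javascript
const { spawnSync } = require("node:child_process");

// Run `command args` `runs` times; count non-zero exits and record durations.
// A timeout or spawn error leaves result.status as null, which also counts
// as a failure here.
function measureFailureRate(command, args, runs, options = {}) {
  const { timeoutMs = 30000, shell = false } = options;
  let failures = 0;
  const durationsMs = [];
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint();
    const result = spawnSync(command, args, {
      stdio: "ignore",
      timeout: timeoutMs,
      shell,
    });
    durationsMs.push(Number(process.hrtime.bigint() - start) / 1e6);
    if (result.status !== 0) failures++;
  }
  return { runs, failures, failureRate: failures / runs, durationsMs };
}

// Usage (commented out so the sketch runs without the CLI installed):
// console.log(
//   measureFailureRate("openclaw", ["cron", "list", "--json"], 20, { shell: true })
// );
```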

Hypothesis:

The dist directory contains 46MB of JavaScript across 639 files. Heavy module initialization may block the event loop on Windows long enough that the connect.challenge event is not processed in time. The queueConnect() method sets a 2-second challenge timeout, and the gateway has a 3-second server-side handshake timeout. For comparison, the first CLI stdout for openclaw cron list arrives at 3,775ms, longer than either timeout. With this module load overhead, the CLI may not process the incoming challenge frame before one of the timeouts fires.
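
One way to test this hypothesis is to measure how long the event loop is actually blocked. The sketch below uses Node's `perf_hooks.monitorEventLoopDelay` and a synchronous busy loop as a stand-in for heavy module initialization. The `busyWait` and `measureTimerLag` helpers are hypothetical and not OpenClaw code. In a real investigation the monitor would be enabled at the very top of the CLI entry point.

```javascript
const { monitorEventLoopDelay, performance } = require("node:perf_hooks");

// Synchronously block the event loop for roughly `ms` milliseconds.
// Stand-in for heavy module initialization (assumption).
function busyWait(ms) {
  const start = performance.now();
  while (performance.now() - start < ms) {
    // spin
  }
  return performance.now() - start;
}

// Schedule a 0 ms timer, then block; resolve with how late the timer fired.
// An incoming WebSocket frame would be delayed in the same way.
function measureTimerLag(blockMs) {
  return new Promise((resolve) => {
    const scheduled = performance.now();
    setTimeout(() => resolve(performance.now() - scheduled), 0);
    busyWait(blockMs);
  });
}

const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();

measureTimerLag(200).then((lagMs) => {
  histogram.disable();
  // Histogram values are reported in nanoseconds.
  console.log(`timer lag: ${lagMs.toFixed(0)} ms`);
  console.log(`max loop delay: ${(histogram.max / 1e6).toFixed(0)} ms`);
});
```

If the measured delay during CLI startup approaches 2 to 3 seconds, that would support the explanation above; if it stays small, the cause is more likely elsewhere.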

What was tried (none fixed it):

  • openclaw gateway install --force
  • openclaw doctor --repair
  • npm update -g openclaw (already latest)
  • Setting OPENCLAW_GATEWAY_TOKEN env var
  • Explicit --url ws://127.0.0.1:18789 --token <token> flags
  • Also verified: device identity is valid, crypto signing works, and raw TCP/WS connects in 2-5ms

Workaround

A standalone helper script that implements the v2 handshake protocol directly has connected on every attempt so far (5/5). Happy to share it if that helps with debugging.

Suggested fixes:

  1. Make the challenge timeout and/or handshake timeout configurable
  2. Investigate event loop blocking during CLI startup on Windows
  3. Check whether the CLI opens multiple sequential WS connections (doctor check, then command) that compete with each other
  4. Consider lazy-loading modules so the handshake handler is registered before heavy init completes
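
As a rough illustration of fix 1, a timeout could be read from the environment with a safe fallback. The variable name `OPENCLAW_CHALLENGE_TIMEOUT_MS` and the `resolveTimeoutMs` helper are hypothetical; no such setting exists in OpenClaw as far as this report knows. The 2000 ms default mirrors the challenge timeout described above.

```javascript
// Resolve a timeout in milliseconds from an environment-like object.
// Falls back to `fallbackMs` when the value is missing, empty,
// non-numeric, or not positive.
function resolveTimeoutMs(env, name, fallbackMs) {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallbackMs;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : fallbackMs;
}

// Hypothetical variable name, for illustration only.
const challengeTimeoutMs = resolveTimeoutMs(
  process.env,
  "OPENCLAW_CHALLENGE_TIMEOUT_MS",
  2000
);
console.log(`challenge timeout: ${challengeTimeoutMs} ms`);
```

Raising only the client-side value would not be enough on its own, since the gateway's 3-second handshake timeout would still apply; both sides would need a matching setting.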
