Gateway WS handshake timeout (3s) too aggressive — causes spurious 'gateway closed (1000)' on busy event loops #46892

@fuller-stack-dev

Summary

When the gateway event loop is busy (processing agent turns, compaction, or concurrent sessions), the 3-second WebSocket handshake timeout (DEFAULT_HANDSHAKE_TIMEOUT_MS = 3e3 in gateway-cli-*.js) fires before the connect challenge completes. The gateway closes the connection with code 1000 (normal closure), and the CLI reports:

gateway connect failed: Error: gateway closed (1000): 

This affects all CLI-to-gateway WS calls, including read-only operations like openclaw cron list.

Environment

  • OpenClaw: 2026.3.13 (61d171a)
  • Host: macOS 26.3.1, Apple Silicon Mac mini, Node v24.14.0
  • Gateway config: maxConcurrent: 4, loopback bind

Steps to Reproduce

  1. Run a gateway with multiple concurrent agent sessions (3-4 active)
  2. From a cron job or external script, run openclaw cron list --json while the gateway is processing agent turns
  3. The CLI connects via WS but the gateway's handshake challenge isn't answered within 3 seconds
  4. Gateway closes the WS with code 1000, CLI reports failure

This is intermittent — depends on event loop pressure at the exact moment of connection.
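One way to confirm that failures line up with event loop pressure is to sample loop delay with Node's built-in `perf_hooks` histogram. The following is a minimal sketch, not part of OpenClaw: `startLoopDelayMonitor` is a hypothetical helper, and it would have to run inside the gateway process (for example via a local patch) to be meaningful.

```javascript
const { monitorEventLoopDelay } = require("node:perf_hooks");

// Hypothetical helper: start sampling event loop delay and return a function
// that stops sampling and reports what was observed, in milliseconds.
const startLoopDelayMonitor = (resolutionMs = 20) => {
    const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
    histogram.enable();
    return () => {
        histogram.disable();
        // The histogram reports values in nanoseconds.
        return {
            maxMs: histogram.max / 1e6,
            p99Ms: histogram.percentile(99) / 1e6,
        };
    };
};
```

If the reported maximum approaches the 3-second limit around the time of a failed connect, that would support the event loop explanation; small values would point elsewhere.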

Root Cause

DEFAULT_HANDSHAKE_TIMEOUT_MS is hardcoded to 3e3 (3 seconds) in the gateway:

// gateway-cli-*.js line ~7586
const DEFAULT_HANDSHAKE_TIMEOUT_MS = 3e3;
const getHandshakeTimeoutMs = () => {
    if (process.env.VITEST && process.env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS) {
        const parsed = Number(process.env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS);
        if (Number.isFinite(parsed) && parsed > 0) return parsed;
    }
    return DEFAULT_HANDSHAKE_TIMEOUT_MS;
};

The env var override (OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS) is gated behind process.env.VITEST, making it test-only.

Why This Surfaced in v2026.3.13

The v2026.3.13 fix "Gateway/client requests: reject unanswered gateway RPC calls after a bounded timeout" introduced active rejection of stalled connections. In v2026.3.12, busy handshakes would hang indefinitely (the CLI's own subprocess timeout would handle it). Now the gateway actively closes them, surfacing the 3s limit as a user-visible failure.

Suggested Fix

  1. Increase default from 3s to ~10s — 3s is too tight for a local loopback connection when the event loop is under load
  2. Make it user-configurable via gateway.handshakeTimeoutMs in openclaw.json (or similar config key)
  3. Remove the VITEST gate on OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS so users can override via env var as a stopgap
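The three suggestions could be combined in a single resolver. This is only a sketch under stated assumptions: `gateway.handshakeTimeoutMs` is the config key proposed above and does not exist today, `resolveHandshakeTimeoutMs` and `parseTimeoutMs` are hypothetical names, and the precedence (env var, then config, then default) is one reasonable choice rather than an established convention in the codebase.

```javascript
// Suggested new default (10s instead of the current 3s).
const DEFAULT_HANDSHAKE_TIMEOUT_MS = 10e3;

// Accept only finite, positive millisecond values given as a number or string.
// Anything else yields undefined so the next source in the chain is tried.
const parseTimeoutMs = (value) => {
    if (typeof value !== "number" && typeof value !== "string") return undefined;
    const parsed = Number(value);
    return Number.isFinite(parsed) && parsed > 0 ? parsed : undefined;
};

// Precedence: env var override (no VITEST gate), then the proposed config key,
// then the built-in default.
const resolveHandshakeTimeoutMs = (config = {}, env = process.env) =>
    parseTimeoutMs(env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS) ??
    parseTimeoutMs(config.gateway?.handshakeTimeoutMs) ??
    DEFAULT_HANDSHAKE_TIMEOUT_MS;
```

Invalid values (non-numeric, zero, negative) fall through to the next source instead of disabling the timeout, which matches the validation already used in getHandshakeTimeoutMs.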

Workaround

Monkey-patch the installed package:

# -i.bak works with both BSD sed (macOS) and GNU sed; a bare -i fails on BSD sed
sed -i.bak 's/const DEFAULT_HANDSHAKE_TIMEOUT_MS = 3e3;/const DEFAULT_HANDSHAKE_TIMEOUT_MS = 10e3;/' \
  "$(dirname "$(which openclaw)")"/../lib/node_modules/openclaw/dist/gateway-cli-*.js
# Then restart gateway

Gateway Log Evidence

{
  "cause": "handshake-timeout",
  "handshake": "failed",
  "durationMs": 3908,
  "handshakeMs": 3002,
  "host": "127.0.0.1:18789",
  "code": 1000,
  "reason": "n/a"
}

Observed ~34 failures over 18 hours with the same pattern — always handshakeMs: 3002.
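A count like this can be reproduced with a small script. The sketch below assumes the gateway log contains one JSON object per line with the `cause` and `handshakeMs` fields shown above; that layout is an assumption, and `summarizeHandshakeTimeouts` is a hypothetical helper, not an OpenClaw API.

```javascript
// Count handshake-timeout entries in log text and track the largest handshakeMs.
// Assumes one JSON object per line; other lines are skipped.
const summarizeHandshakeTimeouts = (logText) => {
    let count = 0;
    let maxHandshakeMs = 0;
    for (const line of logText.split("\n")) {
        const trimmed = line.trim();
        if (!trimmed.startsWith("{")) continue;
        let entry;
        try {
            entry = JSON.parse(trimmed);
        } catch {
            continue; // not valid JSON, ignore
        }
        if (entry.cause !== "handshake-timeout") continue;
        count += 1;
        if (Number.isFinite(entry.handshakeMs)) {
            maxHandshakeMs = Math.max(maxHandshakeMs, entry.handshakeMs);
        }
    }
    return { count, maxHandshakeMs };
};
```

Passing the contents of a log file (for example via `fs.readFileSync(path, "utf8")`) returns the number of timeout closures and the largest handshake duration seen.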
