Gateway WS handshake timeout (3s) too aggressive — causes spurious 'gateway closed (1000)' on busy event loops #46892
Description
Summary
When the gateway event loop is busy (processing agent turns, compaction, or concurrent sessions), the 3-second WebSocket handshake timeout (DEFAULT_HANDSHAKE_TIMEOUT_MS = 3e3 in gateway-cli-*.js) fires before the connect challenge completes. The gateway closes the connection with code 1000 (normal closure), and the CLI reports:
gateway connect failed: Error: gateway closed (1000):
This affects all CLI-to-gateway WS calls, including read-only operations like openclaw cron list.
Environment
- OpenClaw: 2026.3.13 (61d171a)
- Host: macOS 26.3.1, Apple Silicon Mac mini, Node v24.14.0
- Gateway config: maxConcurrent: 4, loopback bind
Steps to Reproduce
- Run a gateway with multiple concurrent agent sessions (3-4 active)
- From a cron job or external script, run openclaw cron list --json while the gateway is processing agent turns
- The CLI connects via WS but the gateway's handshake challenge isn't answered within 3 seconds
- Gateway closes the WS with code 1000, CLI reports failure
This is intermittent: it depends on event loop pressure at the exact moment of connection.
Root Cause
DEFAULT_HANDSHAKE_TIMEOUT_MS is hardcoded to 3e3 (3 seconds) in the gateway:
// gateway-cli-*.js line ~7586
const DEFAULT_HANDSHAKE_TIMEOUT_MS = 3e3;
const getHandshakeTimeoutMs = () => {
  if (process.env.VITEST && process.env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS) {
    const parsed = Number(process.env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS);
    if (Number.isFinite(parsed) && parsed > 0) return parsed;
  }
  return DEFAULT_HANDSHAKE_TIMEOUT_MS;
};
The env var override (OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS) is gated behind process.env.VITEST, making it test-only.
Why This Surfaced in v2026.3.13
The v2026.3.13 fix "Gateway/client requests: reject unanswered gateway RPC calls after a bounded timeout" introduced active rejection of stalled connections. In v2026.3.12, busy handshakes would hang indefinitely (the CLI's own subprocess timeout would handle it). Now the gateway actively closes them, surfacing the 3s limit as a user-visible failure.
Suggested Fix
- Increase default from 3s to ~10s: 3s is too tight for a local loopback connection when the event loop is under load
- Make it user-configurable via gateway.handshakeTimeoutMs in openclaw.json (or similar config key)
- Remove the VITEST gate on OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS so users can override via env var as a stopgap
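One way the three suggestions could fit together is sketched below. This is not OpenClaw's actual code: resolveHandshakeTimeoutMs, parsePositiveMs, and the config shape are hypothetical names, and the precedence (env var, then config, then default) is only one reasonable choice.

```javascript
// Sketch only: helper names and config shape are assumptions, not OpenClaw's API.
const PROPOSED_DEFAULT_HANDSHAKE_TIMEOUT_MS = 10e3; // proposed default (currently 3e3)

// Accept only finite, positive numbers (or numeric strings); otherwise undefined.
const parsePositiveMs = (value) => {
  if (typeof value !== "number" && typeof value !== "string") return undefined;
  const parsed = Number(value);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : undefined;
};

// Precedence: env var (stopgap override) > config file value > default.
const resolveHandshakeTimeoutMs = (config = {}, env = process.env) =>
  parsePositiveMs(env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS) ??
  parsePositiveMs(config.gateway?.handshakeTimeoutMs) ??
  PROPOSED_DEFAULT_HANDSHAKE_TIMEOUT_MS;
```

Invalid or non-positive values fall through to the next source rather than throwing, which mirrors how the existing getHandshakeTimeoutMs treats a bad env value.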
Workaround
Monkey-patch the installed package:
# -i.bak works with both BSD (macOS) and GNU sed and leaves a .bak backup
sed -i.bak 's/const DEFAULT_HANDSHAKE_TIMEOUT_MS = 3e3;/const DEFAULT_HANDSHAKE_TIMEOUT_MS = 10e3;/' \
  "$(dirname "$(which openclaw)")"/../lib/node_modules/openclaw/dist/gateway-cli-*.js
# Then restart gateway
Gateway Log Evidence
{
  "cause": "handshake-timeout",
  "handshake": "failed",
  "durationMs": 3908,
  "handshakeMs": 3002,
  "host": "127.0.0.1:18789",
  "code": 1000,
  "reason": "n/a"
}
Observed ~34 failures over 18 hours with the same pattern, always handshakeMs: 3002.
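For anyone wanting to check their own logs for the same pattern, here is a rough sketch of a tally script. It assumes the gateway log is available as one JSON object per line with the cause and handshakeMs fields shown above; that layout is an assumption, and summarizeHandshakeTimeouts is a hypothetical helper, not part of OpenClaw.

```javascript
// Sketch: assumes one JSON object per line, with the fields from the entry above.
const summarizeHandshakeTimeouts = (logText) => {
  const handshakeMs = [];
  for (const line of logText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    let entry;
    try {
      entry = JSON.parse(trimmed);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    if (
      entry &&
      entry.cause === "handshake-timeout" &&
      typeof entry.handshakeMs === "number"
    ) {
      handshakeMs.push(entry.handshakeMs);
    }
  }
  const hasEntries = handshakeMs.length > 0;
  return {
    count: handshakeMs.length,
    minHandshakeMs: hasEntries ? Math.min(...handshakeMs) : null,
    maxHandshakeMs: hasEntries ? Math.max(...handshakeMs) : null,
  };
};
```

If minHandshakeMs and maxHandshakeMs both sit just above 3000, the failures are hitting the fixed deadline rather than failing for unrelated reasons.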