
fix(gateway): raise local loopback probe timeout from 800ms to 8s #50568

Draft
belisarius222 wants to merge 1 commit into openclaw:main from voltropy:fix/gateway-probe-timeout-45560

@belisarius222

AI-assisted PR — authored with Claude Code (Claude Opus 4.6). Fully tested locally (unit tests). Author understands all changes.

Summary

  • Problem: openclaw gateway probe times out on local loopback connections because the per-target probe budget for localLoopback is hardcoded to 800ms, which is too aggressive for machines where the WS handshake takes longer (e.g., token-auth setups, slower hardware, or when the gateway is under load).
  • Why it matters: Users see spurious "timeout" / "gateway closed (1000)" errors even though the gateway is running and reachable via HTTP, making diagnostics misleading and blocking local gateway CLI workflows.
  • What changed: (1) Raised the localLoopback probe budget cap from 800ms to 8000ms. (2) Raised the overall default timeout for gatewayStatusCommand from 3000ms to 10000ms so the new loopback budget is usable without an explicit --timeout flag. (3) Refactored resolveProbeBudgetMs to use a declarative config map instead of if/else chains.
  • What did NOT change: The sshTunnel (2000ms), configRemote (1500ms), and explicit (1500ms) per-target budgets are unchanged. The --timeout flag still works and still caps all per-target budgets. The server-side handshake timeout (DEFAULT_HANDSHAKE_TIMEOUT_MS = 10_000) is untouched.
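A minimal sketch of what the declarative budget map could look like, using the target kinds and millisecond values listed above. The constant name `PROBE_BUDGET_CAP_MS` and the exact signature of `resolveProbeBudgetMs` are assumptions for illustration; the real helper in `src/commands/gateway-status/helpers.ts` may differ.

```typescript
// Target kinds named in this PR description.
type TargetKind = "localLoopback" | "sshTunnel" | "configRemote" | "explicit";

// Per-target budget caps in milliseconds, as listed in the summary.
// Record<TargetKind, number> makes the compiler reject a missing kind.
// The constant name is hypothetical.
const PROBE_BUDGET_CAP_MS: Record<TargetKind, number> = {
  localLoopback: 8000, // raised from 800 by this PR
  sshTunnel: 2000,
  configRemote: 1500,
  explicit: 1500,
};

// Assumed signature: the real helper may take different arguments.
function resolveProbeBudgetMs(overallMs: number, kind: TargetKind): number {
  // The overall timeout (default or --timeout) always caps the per-target budget.
  return Math.min(overallMs, PROBE_BUDGET_CAP_MS[kind]);
}

console.log(resolveProbeBudgetMs(10_000, "localLoopback")); // 8000
console.log(resolveProbeBudgetMs(2_000, "localLoopback")); // 2000
```

Typing the map as `Record<TargetKind, number>` means a newly added target kind fails to compile until it has a budget, which is one way to avoid the fallthrough gaps an if/else chain can hide.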

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • openclaw gateway probe default overall timeout changed from 3s to 10s (still overridable via --timeout).
  • Local loopback probe budget raised from 800ms to 8000ms — local gateway probes are far less likely to time out spuriously.
  • No config file or environment variable changes required.
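For illustration, the default-versus-flag behavior described above can be sketched as follows. Both `DEFAULT_OVERALL_TIMEOUT_MS` and `resolveOverallTimeoutMs` are hypothetical names, and falling back to the default on invalid input is an assumption; the actual CLI may reject such input instead.

```typescript
// Assumption: the value comes from this PR; the constant name is hypothetical.
const DEFAULT_OVERALL_TIMEOUT_MS = 10_000;

// Hypothetical helper: turn an optional --timeout value into a usable number.
function resolveOverallTimeoutMs(raw?: string | number): number {
  if (raw === undefined) return DEFAULT_OVERALL_TIMEOUT_MS;
  const parsed = typeof raw === "number" ? raw : Number(raw);
  // Fall back to the default on non-numeric or non-positive input.
  if (!Number.isFinite(parsed) || parsed <= 0) return DEFAULT_OVERALL_TIMEOUT_MS;
  return Math.floor(parsed);
}

console.log(resolveOverallTimeoutMs()); // 10000
console.log(resolveOverallTimeoutMs("3000")); // 3000
```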

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No (same WebSocket probe, just a longer timeout)
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Ubuntu 24.04 (reported), macOS (dev)
  • Runtime/container: Node.js
  • Model/provider: N/A (not model-dependent)
  • Integration/channel: Local gateway (ws://127.0.0.1:18789)
  • Relevant config: gateway.mode = local, gateway.bind = lan, gateway.auth.mode = token

Steps

  1. Configure local gateway with token auth and LAN bind.
  2. Confirm the gateway is running with openclaw status, and check that curl http://127.0.0.1:18789/ returns 200.
  3. Run openclaw gateway probe --json.

Expected

  • Probe connects successfully and returns "ok": true with health/status data.

Actual

  • Probe reports "error": "timeout" for the localLoopback target because the 800ms budget expires before the WS handshake and token auth complete.
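The failure mode can be reproduced in isolation with a simulated handshake raced against a budget timer. This is a sketch only: `simulateHandshake` and `probeWithBudget` are hypothetical stand-ins rather than the real probe code, and the 1200ms handshake duration is an invented figure chosen to exceed the old 800ms budget.

```typescript
type ProbeResult = { ok: true } | { ok: false; error: "timeout" };

// Stand-in for the WS handshake plus token auth: resolves after `handshakeMs`.
function simulateHandshake(handshakeMs: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, handshakeMs));
}

// Race the handshake against the probe budget; whichever settles first wins.
async function probeWithBudget(handshakeMs: number, budgetMs: number): Promise<ProbeResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<ProbeResult>((resolve) => {
    timer = setTimeout(() => resolve({ ok: false, error: "timeout" }), budgetMs);
  });
  const handshake = simulateHandshake(handshakeMs).then((): ProbeResult => ({ ok: true }));
  try {
    return await Promise.race([handshake, timeout]);
  } finally {
    // Clear the budget timer so it does not keep the process alive.
    clearTimeout(timer);
  }
}

// Old budget: the timer fires first, so the probe reports a timeout.
probeWithBudget(1200, 800).then((r) => console.log(r)); // { ok: false, error: 'timeout' }
// New budget: the same handshake completes well within 8000ms.
probeWithBudget(1200, 8000).then((r) => console.log(r)); // { ok: true }
```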

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

All 23 tests in src/commands/gateway-status/ pass after the change (npx vitest run src/commands/gateway-status).

Human Verification (required)

  • Verified scenarios: All 23 gateway-status unit tests pass (both gateway-status.test.ts and gateway-status/helpers.test.ts). Confirmed resolveProbeBudgetMs returns 8000 for localLoopback and respects overallMs cap.
  • Edge cases checked: overallMs smaller than 8000 still correctly caps the loopback budget (e.g., --timeout 2000 yields a 2000ms loopback budget). All four TargetKind values are covered in the config map with no fallthrough gaps.
  • What I did not verify: Live end-to-end probe against a running local gateway on Ubuntu 24.04 with token auth (no access to the reporter's environment). The reporter or a maintainer should confirm the fix resolves the issue in that specific setup.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes — the --timeout flag still works identically; only defaults changed.
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

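  • How to disable/revert: Revert this commit, or pass --timeout 3000 to restore the previous overall timeout. Note that the flag alone does not restore the old 800ms loopback budget: with --timeout 3000 the loopback budget is capped at 3000ms.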
  • Files/config to restore: src/commands/gateway-status.ts, src/commands/gateway-status/helpers.ts
  • Known bad symptoms: If the new 10s default causes the gateway probe command to feel sluggish when the gateway is truly unreachable, users can pass a shorter --timeout value.

Risks and Mitigations

  • Risk: The higher default timeout (10s) makes openclaw gateway probe take longer to report failure when the gateway is genuinely down.
    • Mitigation: 10s is consistent with the server-side DEFAULT_HANDSHAKE_TIMEOUT_MS (also 10s) and matches timeouts used elsewhere in the codebase (e.g., status-all.ts, status.scan.ts). Users who prefer faster failure can pass --timeout <ms>.

AI Disclosure

  • Marked as AI-assisted (Claude Code, Claude Opus 4.6)
  • Degree of testing: fully tested (all 23 gateway-status unit tests pass)
  • Author confirms understanding of all changes
  • Codex review: not available in this environment

The local loopback probe budget was hardcoded to 800ms, which is too
aggressive and causes spurious timeouts on slower machines or when the
gateway handshake takes longer than expected. Raise the per-target
budget for localLoopback to 8000ms and the overall default from 3s
to 10s so the new budget is usable without an explicit --timeout flag.

Fixes openclaw#45560

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@openclaw-barnacle openclaw-barnacle bot added commands Command implementations size: XS labels Mar 19, 2026


Development

Successfully merging this pull request may close these issues.

[Bug]: local gateway CLI handshake fails (probe timeout / gateway call closed 1000) even though gateway is running and WS challenge is reachable
