fix(gateway): handshake timeout + token-only scope grant (#47103, #48167) #47388

Open
haoruilee wants to merge 8 commits into openclaw:main from haoruilee:fix/devices-list-handshake-timeout

Conversation

@haoruilee
Contributor

@haoruilee haoruilee commented Mar 15, 2026

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: openclaw devices list and openclaw devices approve fail on 2026.3.12+ with gateway closed (1000 normal closure): no close reason. Separately, CLI and Dashboard using static token auth get missing scope: operator.read — token connects but receives zero scopes.
  • Why it matters: Users on slower systems (e.g. Ubuntu 24.04, Raspberry Pi) cannot list or approve devices. Token-only auth users cannot run read RPCs (devices list, status, probe).
  • What changed: (1) Restored DEFAULT_HANDSHAKE_TIMEOUT_MS from 3 seconds to 10 seconds so the CLI has enough time to complete the handshake. (2) Grant operator.read for device-less token/password auth so read RPCs work instead of clearing scopes to zero.
  • What did NOT change (scope boundary): All other preauth hardening from #44089 (Hardening: tighten preauth WebSocket handshake limits) remains in place: MAX_PREAUTH_PAYLOAD_BYTES, preauth payload size checks, and setSocketMaxPayload after connect. Admin methods still require operator.admin.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  1. Handshake timeout increased from 3 seconds to 10 seconds. Unauthenticated connections that do not complete the handshake within 10 seconds are closed.
  2. Token-only operator auth (no device identity) now receives operator.read scope so openclaw devices list, openclaw status, gateway probe, and Dashboard read operations work. No config changes.
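The first behavior change can be pictured as a timer that is armed when a socket is accepted and cancelled once the handshake completes. The sketch below is a minimal illustration under that assumption; `guardHandshake`, `Schedule`, and `realSchedule` are hypothetical names and not taken from the gateway source.

```typescript
// Hypothetical sketch of a preauth handshake guard; not the gateway's real implementation.
type Cancel = () => void;
type Schedule = (fn: () => void, ms: number) => Cancel;

// Default scheduler backed by setTimeout; injectable so tests can use a fake clock.
const realSchedule: Schedule = (fn, ms) => {
  const timer = setTimeout(fn, ms);
  return () => clearTimeout(timer);
};

function guardHandshake(
  timeoutMs: number,
  close: () => void,
  schedule: Schedule = realSchedule,
): { markConnected: () => void } {
  let connected = false;
  // Arm the timer as soon as the unauthenticated socket is accepted.
  const cancel = schedule(() => {
    if (!connected) close();
  }, timeoutMs);
  return {
    // Call once the connect handshake has been validated.
    markConnected() {
      connected = true;
      cancel();
    },
  };
}
```

With a 10-second window, a client that needs 4 seconds to load its identity and sign the connect payload calls `markConnected` before the timer fires; with a 3-second window the same client would be closed first.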

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Ubuntu 24.04 (or any system where handshake can exceed 3s)
  • Runtime/container: Node 22+, npm install
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): Default gateway.bind=loopback, token auth for scope fix

Steps

  1. Install openclaw 2026.3.12 or 2026.3.13
  2. Start gateway (openclaw gateway run or via app)
  3. Run openclaw devices list or openclaw devices list --token DASHBOARD_TOKEN
  4. Run openclaw gateway probe

Expected

Devices list (or empty list) and probe succeed without error.

Actual

  • Handshake timeout: Error: gateway closed (1000 normal closure): no close reason
  • Scope gap: missing scope: operator.read

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

pnpm test src/gateway/server.preauth-hardening.test.ts passes (uses env override for short timeout). pnpm test src/gateway/server.auth.compat-baseline.test.ts and server.auth.default-token.test.ts pass. Manual repro: openclaw devices list succeeds after the fix.

Human Verification (required)

  • Verified scenarios: Preauth hardening tests pass; auth compat and default-token tests pass; pnpm check passes.
  • Edge cases checked: Handshake timeout still enforced (10s instead of 3s); admin methods still require operator.admin.
  • What you did NOT verify: Live repro on a slow system (Raspberry Pi, etc.).

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Failure Recovery (if this breaks)

  • How to disable/revert: Revert this PR or set OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS (test-only; no production override).
  • Files/config to restore: src/gateway/server-constants.ts, src/gateway/server/ws-connection/message-handler.ts
  • Known bad symptoms reviewers should watch for: None expected. The timeout reverts to its pre-#44089 value, and the scope grant aligns with token-only auth expectations.

Risks and Mitigations

  • Risk: Slightly longer window for unauthenticated connections to stay open (10s vs 3s).
  • Mitigation: 10s was the original value; other preauth limits (payload size, etc.) remain. 10s is still a bounded timeout.
  • Risk (scope fix): Token-only auth gains operator.read instead of zero scopes.
  • Mitigation: Token is validated against gateway config; minimal read scope only; admin methods still require operator.admin.

@greptile-apps
Contributor

greptile-apps bot commented Mar 15, 2026

Greptile Summary

This PR restores DEFAULT_HANDSHAKE_TIMEOUT_MS from 3 seconds to 10 seconds in src/gateway/server-constants.ts, reverting an overly aggressive value set by #44089 that caused openclaw devices list and openclaw devices approve to fail on slower systems with a premature "gateway closed (1000 normal closure)" error.

The change is minimal and well-scoped:

  • Only the timeout constant is changed; all other preauth hardening from #44089 (payload size limits, setSocketMaxPayload, etc.) remains intact.
  • A clear explanatory comment has been added to document why the 10s value is needed.
  • The test suite uses an env override (OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS=200) so existing tests remain fast and still correctly validate the timeout enforcement logic.
  • server.auth.default-token.suite.ts calls getHandshakeTimeoutMs() dynamically, so it automatically adapts to the restored value without any test changes.
  • The CHANGELOG entry is accurate and appropriately links back to both issues.

Confidence Score: 5/5

  • Safe to merge — the change is a single-line constant restore with no new attack surface, and all other security hardening remains in place.
  • The fix is a minimal, targeted restoration of the original timeout value. The security trade-off is explicitly acknowledged (slightly longer window for unauthenticated connections), and it is well-mitigated by the unchanged preauth payload size limits. Tests validate the timeout logic via an env override, so they remain fast and correct. No logic errors, security regressions, or API surface changes are introduced.
  • No files require special attention.

Last reviewed commit: 3ec248f

@openclaw-barnacle openclaw-barnacle bot added the channel: whatsapp-web label Mar 15, 2026
xuwei-xy pushed a commit to xuwei-xy/openclaw that referenced this pull request Mar 15, 2026
- openclaw#47391 fix(whatsapp): restore config-driven block streaming for WhatsApp delivery
- openclaw#47388 fix(gateway): restore handshake timeout to 10s to fix devices list on slow systems, solve openclaw#47103
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ccb1a19b42

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@cursor cursor bot force-pushed the fix/devices-list-handshake-timeout branch from 6ca10cc to 01c94c1 Compare March 17, 2026 02:23
@openclaw-barnacle openclaw-barnacle bot removed the channel: whatsapp-web label Mar 17, 2026
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 01c94c1adb

@cursor cursor bot requested a review from a team as a code owner March 17, 2026 06:05
@haoruilee haoruilee changed the title from "fix(gateway): restore handshake timeout to 10s to fix devices list on slow systems, solve #47103" to "fix(gateway): handshake timeout + token-only scope grant (#47103, #48167)" Mar 17, 2026
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c33138d26

@cursor cursor bot force-pushed the fix/devices-list-handshake-timeout branch from a9c7146 to 4d413b7 Compare March 20, 2026 05:22
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d413b78eb

cursoragent and others added 2 commits March 20, 2026 13:04
The previous condition cleared scopes for all device-less non-control-ui
connections because !isControlUi made the OR always true. Split logic:
clear on non-allow; normalize to operator.read only for allow + token/password.
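The split described in this commit message can be sketched as two separate checks instead of one OR-ed condition. This is only an illustration: `DevicelessConnect`, `resolveDevicelessScopes`, and the field names are hypothetical, and the pass-through branch for other allowed connections is an assumption, since the commit message does not say what happens there.

```typescript
// Hypothetical sketch of the split scope logic; names and shapes are assumptions,
// not the real code in src/gateway/server/ws-connection/message-handler.ts.
type AuthMethod = "token" | "password" | "other";
type Decision = "allow" | "deny";

interface DevicelessConnect {
  decision: Decision; // outcome of the device-less policy check (assumed shape)
  authMethod: AuthMethod;
  requestedScopes: string[];
}

function resolveDevicelessScopes(conn: DevicelessConnect): string[] {
  // Step 1: anything that is not explicitly allowed ends up with zero scopes.
  if (conn.decision !== "allow") return [];
  // Step 2: allowed token/password auth is normalized to the minimal read scope,
  // regardless of what the client asked for.
  if (conn.authMethod === "token" || conn.authMethod === "password") {
    return ["operator.read"];
  }
  // Assumption: other allowed connections keep their requested scopes unchanged.
  return conn.requestedScopes;
}

console.log(
  resolveDevicelessScopes({
    decision: "allow",
    authMethod: "token",
    requestedScopes: ["operator.admin"],
  }),
); // [ 'operator.read' ]
```

Keeping the clear step and the normalize step apart avoids the earlier failure mode, where a single OR-ed condition was always true for non-control-ui connections and wiped every scope.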
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 94084a2549

// Allow sufficient time for CLI to load device identity, sign the connect payload,
// and complete the handshake on slow systems (cold start, disk I/O). Too short causes
// premature close with "gateway closed (1000 normal closure)" before connect completes.
export const DEFAULT_HANDSHAKE_TIMEOUT_MS = 10_000;
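For context, the PR text mentions a test-only override (`OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS`) and a `getHandshakeTimeoutMs()` accessor. One way such a lookup might work is sketched below; `resolveHandshakeTimeoutMs` and its validation rules are assumptions for illustration, not the actual accessor.

```typescript
// Hypothetical sketch: how a test-only env override could sit on top of the default.
// The real getHandshakeTimeoutMs() may parse or validate differently.
const DEFAULT_HANDSHAKE_TIMEOUT_MS = 10_000;

type Env = Record<string, string | undefined>;

function resolveHandshakeTimeoutMs(env: Env): number {
  const raw = env["OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS"];
  if (raw === undefined) return DEFAULT_HANDSHAKE_TIMEOUT_MS;
  const parsed = Number(raw);
  // Ignore empty, non-numeric, zero, or negative overrides (assumed policy).
  if (!Number.isFinite(parsed) || parsed <= 0) return DEFAULT_HANDSHAKE_TIMEOUT_MS;
  return parsed;
}

console.log(resolveHandshakeTimeoutMs({})); // 10000
console.log(resolveHandshakeTimeoutMs({ OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS: "200" })); // 200
```

Taking the environment as a parameter keeps the function pure, so tests can exercise the short timeout without mutating process-wide state.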

P2: Raise probe client budgets with the longer handshake timeout

On hosts where the WebSocket handshake takes more than ~800ms–5s, this change still leaves the status/probe flows timing out before the server’s new 10s window is useful. I checked src/commands/gateway-status/helpers.ts:119-127 and src/commands/status.scan.shared.ts:73-78: both cap probeGateway() well below 10s (800ms for local gateway status, 2.5s/5s for status), so the same slow machines that motivated this change will still fail those commands even though callGateway-based paths like devices list now get longer on the server side.
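The arithmetic behind this comment: a client gives up at whichever limit is smaller, its own budget or the server's handshake timeout. A minimal sketch follows; the helper names are hypothetical and the 4-second handshake is an assumed figure for a slow host, while the 800 ms, 2.5 s, and 5 s budgets are the ones cited above.

```typescript
// Hypothetical helpers illustrating the budget mismatch; not code from the repo.
function effectiveHandshakeWindowMs(clientBudgetMs: number, serverTimeoutMs: number): number {
  // Whichever side gives up first bounds the handshake.
  return Math.min(clientBudgetMs, serverTimeoutMs);
}

function handshakeSucceeds(
  handshakeMs: number,
  clientBudgetMs: number,
  serverTimeoutMs: number,
): boolean {
  return handshakeMs <= effectiveHandshakeWindowMs(clientBudgetMs, serverTimeoutMs);
}

const SERVER_TIMEOUT_MS = 10_000; // the restored server-side default
const SLOW_HANDSHAKE_MS = 4_000; // assumed handshake duration on a slow host

console.log(handshakeSucceeds(SLOW_HANDSHAKE_MS, 800, SERVER_TIMEOUT_MS)); // false
console.log(handshakeSucceeds(SLOW_HANDSHAKE_MS, 2_500, SERVER_TIMEOUT_MS)); // false
console.log(handshakeSucceeds(SLOW_HANDSHAKE_MS, 5_000, SERVER_TIMEOUT_MS)); // true
```

Under these assumptions, raising the server timeout alone only helps callers whose own budget already exceeds the handshake duration.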

Labels

gateway, size: S

Projects

None yet

2 participants