Skip to content

fix: increase DEFAULT_HANDSHAKE_TIMEOUT_MS from 3s to 10s#49049

Open
c-amg wants to merge 6 commits intoopenclaw:mainfrom
c-amg:fix/handshake-timeout
Open

fix: increase DEFAULT_HANDSHAKE_TIMEOUT_MS from 3s to 10s#49049
c-amg wants to merge 6 commits intoopenclaw:mainfrom
c-amg:fix/handshake-timeout

Conversation

@c-amg
Copy link
Copy Markdown

@c-amg c-amg commented Mar 17, 2026

Summary

Increase the default WebSocket handshake timeout from 3 seconds to 10 seconds to fix spurious gateway closed 1000 errors when the event loop is busy.

Changes

  • Increase DEFAULT_HANDSHAKE_TIMEOUT_MS from 3,000ms to 10,000ms
  • Remove VITEST gate on OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS env var so users can override

Fixes

- Increase default handshake timeout from 3 seconds to 10 seconds
- Remove VITEST gate on OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS env var
- This fixes spurious 'gateway closed (1000)' errors when event loop is busy

See: openclaw#46892
@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime size: XS labels Mar 17, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 17, 2026

Greptile Summary

This PR bumps DEFAULT_HANDSHAKE_TIMEOUT_MS from 3 s to 10 s and removes the VITEST-only guard on OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS, allowing the override to be used in any environment. Both changes are straightforward and address real-world reliability issues caused by a busy event loop tripping the too-short 3 s timeout.

  • Timeout increase (3 s → 10 s): Reasonable headroom for slow/loaded environments; low risk of resource exhaustion since WS handshakes are expected to be short-lived regardless.
  • VITEST gate removal: Correct — the env-var override is now legitimately useful outside test runs. The only style concern is that OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS still carries a _TEST_ prefix that may confuse production operators looking for a tunable knob; consider a follow-up rename to OPENCLAW_HANDSHAKE_TIMEOUT_MS (keeping the old name as a fallback for backward compatibility).

Confidence Score: 5/5

  • Safe to merge — minimal, targeted change with no logic regressions.
  • The diff touches a single constants file, increases a timeout value, and removes an overly-restrictive env-var guard. No functional regressions are introduced; the only note is a cosmetic naming concern with the env var that can be addressed in a follow-up.
  • No files require special attention.

Comments Outside Diff (1)

  1. src/gateway/server-constants.ts, line 26-31 (link)

    P2 _TEST_ in env var name is misleading for production use

    Now that the VITEST gate is removed and this override is intended for production users too, the name OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS still carries a _TEST_ prefix that strongly implies it is a test-only knob. A user scanning their environment variables or reading docs may skip it for production tuning precisely because of that name.

    Consider introducing a production-facing name (e.g. OPENCLAW_HANDSHAKE_TIMEOUT_MS) in a follow-up. Backward compatibility could be preserved by checking both, falling back to the old name:

    if (process.env.OPENCLAW_HANDSHAKE_TIMEOUT_MS || process.env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS) {
      const parsed = Number(
        process.env.OPENCLAW_HANDSHAKE_TIMEOUT_MS ?? process.env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS
      );
      if (Number.isFinite(parsed) && parsed > 0) {
        return parsed;
      }
    }

    This is non-blocking for the current fix, but worth tracking.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/gateway/server-constants.ts
    Line: 26-31
    
    Comment:
    **`_TEST_` in env var name is misleading for production use**
    
    Now that the VITEST gate is removed and this override is intended for production users too, the name `OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS` still carries a `_TEST_` prefix that strongly implies it is a test-only knob. A user scanning their environment variables or reading docs may skip it for production tuning precisely because of that name.
    
    Consider introducing a production-facing name (e.g. `OPENCLAW_HANDSHAKE_TIMEOUT_MS`) in a follow-up. Backward compatibility could be preserved by checking both, falling back to the old name:
    
    ```ts
    if (process.env.OPENCLAW_HANDSHAKE_TIMEOUT_MS || process.env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS) {
      const parsed = Number(
        process.env.OPENCLAW_HANDSHAKE_TIMEOUT_MS ?? process.env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS
      );
      if (Number.isFinite(parsed) && parsed > 0) {
        return parsed;
      }
    }
    ```
    
    This is non-blocking for the current fix, but worth tracking.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/gateway/server-constants.ts
Line: 26-31

Comment:
**`_TEST_` in env var name is misleading for production use**

Now that the VITEST gate is removed and this override is intended for production users too, the name `OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS` still carries a `_TEST_` prefix that strongly implies it is a test-only knob. A user scanning their environment variables or reading docs may skip it for production tuning precisely because of that name.

Consider introducing a production-facing name (e.g. `OPENCLAW_HANDSHAKE_TIMEOUT_MS`) in a follow-up. Backward compatibility could be preserved by checking both, falling back to the old name:

```ts
if (process.env.OPENCLAW_HANDSHAKE_TIMEOUT_MS || process.env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS) {
  const parsed = Number(
    process.env.OPENCLAW_HANDSHAKE_TIMEOUT_MS ?? process.env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS
  );
  if (Number.isFinite(parsed) && parsed > 0) {
    return parsed;
  }
}
```

This is non-blocking for the current fix, but worth tracking.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 2792f8b

OpenClaw Assistant and others added 3 commits March 17, 2026 16:08
…HAKE_TIMEOUT_MS

Remove _TEST_ prefix from env var name for clarity - this is now
a useful production override, not just for test environments.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7150d9fb22

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +26 to +27
if (process.env.OPENCLAW_HANDSHAKE_TIMEOUT_MS) {
const parsed = Number(process.env.OPENCLAW_HANDSHAKE_TIMEOUT_MS);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS fallback

Switching the override lookup to OPENCLAW_HANDSHAKE_TIMEOUT_MS breaks existing callers that still set OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS; this commit does not update those call sites, so gateway tests now silently use the 10s default. In particular, src/gateway/server.preauth-hardening.test.ts expects idle sockets to close in under 1s and will fail once the override is ignored, and other suites that relied on short overrides become significantly slower. Please preserve backward compatibility for the old env var (or migrate all known callers in the same change).

Useful? React with 👍 / 👎.

c-amg added 2 commits March 18, 2026 08:52
…EST_HANDSHAKE_TIMEOUT_MS

Keep backward compatibility for existing callers. Addresses Codex review P1 suggestion on PR openclaw#49049.
…r backward compat

Added constants for payload limits and handshake timeout.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d156147097

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

}
};
export const DEFAULT_HANDSHAKE_TIMEOUT_MS = 3_000;
export const DEFAULT_HANDSHAKE_TIMEOUT_MS = 10_000;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Update the client challenge timeout alongside the server timeout

Raising DEFAULT_HANDSHAKE_TIMEOUT_MS to 10s only extends the server-side idle-socket timer. The default GatewayClient path still gives up after 2,000ms if connect.challenge has not arrived (src/gateway/client.ts:699-712), so any event-loop stall between 2s and 10s will continue to fail the handshake before the client can send connect. For callers using the default connectDelayMs, this change only fixes the 2-3s window and leaves the reported busy-loop failure mode in place.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway WS handshake timeout (3s) too aggressive — causes spurious 'gateway closed (1000)' on busy event loops

1 participant