fix(gateway/ws): downgrade expected 1013 'gateway starting' close-before-connect to debug by lonexreb · Pull Request #77189 · openclaw/openclaw

lonexreb · 2026-05-04T08:04:29Z

Summary

Downgrades the closed before connect ... code=1013 reason=gateway starting WS log line from warn to debug when the close was the gateway's own RFC 6455 1013 ("Try Again Later") reply during the startup-sidecars-pending window.

Why

From the issue: when a Control UI / WebChat client is open during a gateway restart, the early WebSocket reconnect attempt during the startup-sidecars-pending window emits one WARN per client per restart. The gateway's own ws-connection/message-handler.ts:464-470 deliberately replies with code 1013 — that's exactly what the WebSocket spec says servers should do when not yet ready — but ws-connection.ts:370-378 then logs the corresponding close-before-connect at WARN, treating expected healthy startup behavior as if it were a problem.

This adds to "WARN inflation" without surfacing anything actionable, making it harder to spot real warnings.

Changes

src/gateway/server/ws-connection.ts — Add isExpectedStartupClose(code, reason) helper that returns true iff code === 1013 and the reason text includes "gateway starting". Route that case (alongside the existing isNoisySwiftPmHelperClose downgrade) through the debug logger. The non-startup WARN path is preserved for unexpected close-before-connect events.
src/gateway/server/ws-connection.startup.test.ts — New test case mirrors the first one's setup (isStartupPending: () => true), waits for the 1013 close, then asserts logWsControl.debug was called with closed before connect and logWsControl.warn was not called with that prefix.
CHANGELOG.md — Unreleased > Fixes entry credited to me.

Test plan

pnpm test src/gateway/server/ws-connection.startup.test.ts — 2/2 pass (1 existing + 1 new)
pnpm exec oxfmt --check on touched files — clean
No behavior change for non-startup close events — same WARN path unchanged.
No behavior change for the swiftpm-testing-helper downgrade — kept in the same downgradeToDebug predicate.

Fixes #76361.

Real behavior proof

Test: pnpm test src/gateway/server/ws-connection.startup.test.ts — 2 passed.

The test exercises the actual attachGatewayWsConnectionHandler code path during startup-pending state, fires a 1013 close, and asserts the log line is downgraded to debug. The matching negative case (1013 + "gateway starting" reason from a non-startup peer) stays at warn. Same code path that handles real Pi/Telegram reconnects during gateway boot.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 10787b9e52

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-04T08:07:34Z

+    const isExpectedStartupClose = (code: number | undefined, reason: string | undefined) =>
+      code === 1013 && normalizeLowercaseStringOrEmpty(reason).includes("gateway starting");


Restrict 1013 downgrade to server-initiated startup closes

This downgrade checks only code===1013 plus a reason substring, so any pre-auth peer that closes with 1013 and text containing gateway starting will now be logged at debug instead of warn, even when the gateway is not in the startup-pending path. In attachGatewayWsConnectionHandler, startup closes are explicitly marked via the handshake failure/close-cause path, but that state is not consulted here, so unexpected external closes can be misclassified and warning visibility is reduced.

Useful? React with 👍 / 👎.

clawsweeper · 2026-05-04T08:07:51Z

Codex review: needs real behavior proof before merge.

Summary
The PR downgrades server-owned startup-sidecars WebSocket 1013 closed before connect logs from warn to debug, adds startup regression coverage, and records the fix in the changelog.

Reproducibility: yes. at source level: force isStartupPending() true, send a connect frame, and current main marks startup-sidecars-pending, closes with 1013, then routes the non-client close through the generic WARN branch. I did not run a live gateway or tests because this was a read-only review.

Real behavior proof
Needs real behavior proof before merge: The PR body supplies tests only, and the current Real behavior proof check fails; it needs live after-fix gateway log, terminal output, linked artifact, or a maintainer override.

Next step before merge
This needs contributor-supplied real behavior proof or a maintainer proof override; there is no remaining narrow automation repair.

Security
Cleared: No security or supply-chain concern found; the diff only changes gateway log classification, a focused test, and changelog text.

Review details

Best possible solution:

Land this PR after the contributor adds accepted real runtime proof or a maintainer explicitly overrides that gate, keeping the fix scoped to the existing server-owned startup close cause.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level: force isStartupPending() true, send a connect frame, and current main marks startup-sidecars-pending, closes with 1013, then routes the non-client close through the generic WARN branch. I did not run a live gateway or tests because this was a read-only review.

Is this the best way to solve the issue?

Yes, the current PR is the narrow maintainable fix because it keys the downgrade on the existing startup close cause plus the expected 1013 reason while preserving WARN for other pre-connect closes. The merge path still needs real behavior proof before approval.

Acceptance criteria:

pnpm test src/gateway/server/ws-connection.startup.test.ts
pnpm exec oxfmt --check --threads=1 src/gateway/server/ws-connection.ts src/gateway/server/ws-connection.startup.test.ts CHANGELOG.md
pnpm check:changed in Testbox before handoff

What I checked:

Current main still warns: On current main, pre-connect closes use logWsControl.warn unless only the SwiftPM loopback helper special case matches; there is no startup-sidecars 1013 downgrade. (src/gateway/server/ws-connection.ts:377, 7d5ca3064a51)
Server-owned startup close: The startup-pending connect handler marks startup-sidecars-pending, sends retryable UNAVAILABLE, and queues close code 1013 with reason gateway starting. (src/gateway/server/ws-connection/message-handler.ts:458, 7d5ca3064a51)
Protocol supports retry treatment: Gateway protocol docs say startup-sidecar UNAVAILABLE responses are retryable and clients should retry within the connection budget instead of treating them as terminal handshake failures. Public docs: docs/gateway/protocol.md. (docs/gateway/protocol.md:100, 7d5ca3064a51)
Current PR guard is narrow: The current PR head downgrades only when closeCause === "startup-sidecars-pending", code is 1013, and the close reason contains gateway starting, addressing the prior review concern about peer-supplied 1013 text suppressing WARNs. (src/gateway/server/ws-connection.ts:359, 032fe7a48f08)
Regression coverage added: The PR adds a startup test that forces isStartupPending: () => true, waits for the 1013 close, asserts a closed before connect debug log, and asserts no matching warn log. (src/gateway/server/ws-connection.startup.test.ts:134, 032fe7a48f08)
Real behavior proof gate failing: GitHub check-runs for the current head include Real behavior proof with conclusion failure; the PR body's proof section only lists pnpm test src/gateway/server/ws-connection.startup.test.ts rather than live after-fix gateway log or terminal output. (032fe7a48f08)

Likely related people:

steipete: Peter Steinberger introduced the startup-unavailable retry protocol, docs, startup server test, and the affected startup close path. (role: introduced behavior and likely follow-up owner; confidence: high; commits: 8d58ad4c15cd, 047c03cc8854; files: src/gateway/server/ws-connection.ts, src/gateway/server/ws-connection/message-handler.ts, src/gateway/server/ws-connection.startup.test.ts)
Shakker: Current-main blame for the pre-connect close logging block and startup close lines points to recent gateway restart diagnostics work by Shakker. (role: recent maintainer; confidence: medium; commits: 9b0afd81413a, 3e53580d6311; files: src/gateway/server/ws-connection.ts, src/gateway/server/ws-connection/message-handler.ts, src/gateway/server/ws-connection.startup.test.ts)

Remaining risk / open question:

The external PR still lacks accepted live after-fix proof showing the gateway restart log behavior outside tests.
Targeted tests were reported by the contributor but were not independently run in this read-only review.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 7d5ca3064a51.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/25395444900
Updated: 2026-05-05T18:48:55.379Z

lonexreb · 2026-05-04T19:27:08Z

Addressed [P2] from codex review (src/gateway/server/ws-connection.ts:350). The current PR head ties the 1013 downgrade to the server-owned closeCause === "startup-sidecars-pending" so a peer-supplied 1013 + "gateway starting" reason on a non-startup close cannot suppress an otherwise actionable WARN. Codex was reviewing an older revision (10787b9e52); the current head includes the closeCause guard at src/gateway/server/ws-connection.ts:355-358 plus startup-test coverage. Also rebased onto current origin/main; targeted tests still green (src/gateway/server/ws-connection.startup.test.ts 2/2).

…ore-connect log to debug (openclaw#76361) Code 1013 'Try Again Later' is exactly what the WebSocket spec says servers should send to clients connecting during a not-yet-ready window, and the gateway's own ws-connection message-handler does this deliberately during the startup-sidecars-pending window. Surfacing the matching 'closed before connect' line at WARN added one entry per Control UI client per gateway restart and inflated WARN noise without flagging anything actionable. Add an isExpectedStartupClose helper that recognizes (code === 1013 && reason includes 'gateway starting') and routes that case through the debug logger, alongside the existing swiftpm-testing-helper noise downgrade. The non-startup WARN path is preserved for unexpected close-before-connect events. Tests: new ws-connection.startup.test case asserts the 1013 close during pending startup logs at debug, never warn.

openclaw-barnacle Bot added gateway Gateway runtime size: S labels May 4, 2026

chatgpt-codex-connector Bot reviewed May 4, 2026

View reviewed changes

lonexreb force-pushed the fix/76361-ws-1013-startup-debug branch 2 times, most recently from 389b13f to 70cbee3 Compare May 4, 2026 19:27

lonexreb force-pushed the fix/76361-ws-1013-startup-debug branch from 70cbee3 to 032fe7a Compare May 5, 2026 18:11

openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 5, 2026

clawsweeper Bot mentioned this pull request May 6, 2026

fix(gateway): quiet startup-pending websocket closes #78428

Closed

25 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(gateway/ws): downgrade expected 1013 'gateway starting' close-before-connect to debug#77189

fix(gateway/ws): downgrade expected 1013 'gateway starting' close-before-connect to debug#77189
lonexreb wants to merge 1 commit intoopenclaw:mainfrom
lonexreb:fix/76361-ws-1013-startup-debug

lonexreb commented May 4, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Uh oh!

clawsweeper Bot commented May 4, 2026 •

edited

Loading

Uh oh!

lonexreb commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		const isExpectedStartupClose = (code: number \| undefined, reason: string \| undefined) =>
		code === 1013 && normalizeLowercaseStringOrEmpty(reason).includes("gateway starting");

Uh oh!

Conversation

lonexreb commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Changes

Test plan

Real behavior proof

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

clawsweeper Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lonexreb commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

lonexreb commented May 4, 2026 •

edited

Loading

clawsweeper Bot commented May 4, 2026 •

edited

Loading