Skip to content

fix(gateway/ws): downgrade expected 1013 'gateway starting' close-before-connect to debug#77189

Open
lonexreb wants to merge 1 commit intoopenclaw:mainfrom
lonexreb:fix/76361-ws-1013-startup-debug
Open

fix(gateway/ws): downgrade expected 1013 'gateway starting' close-before-connect to debug#77189
lonexreb wants to merge 1 commit intoopenclaw:mainfrom
lonexreb:fix/76361-ws-1013-startup-debug

Conversation

@lonexreb
Copy link
Copy Markdown
Contributor

@lonexreb lonexreb commented May 4, 2026

Summary

Downgrades the closed before connect ... code=1013 reason=gateway starting WS log line from warn to debug when the close was the gateway's own RFC 6455 1013 ("Try Again Later") reply during the startup-sidecars-pending window.

Why

From the issue: when a Control UI / WebChat client is open during a gateway restart, the early WebSocket reconnect attempt during the startup-sidecars-pending window emits one WARN per client per restart. The gateway's own ws-connection/message-handler.ts:464-470 deliberately replies with code 1013 — that's exactly what the WebSocket spec says servers should do when not yet ready — but ws-connection.ts:370-378 then logs the corresponding close-before-connect at WARN, treating expected healthy startup behavior as if it were a problem.

This adds to "WARN inflation" without surfacing anything actionable, making it harder to spot real warnings.

Changes

  • src/gateway/server/ws-connection.ts — Add isExpectedStartupClose(code, reason) helper that returns true iff code === 1013 and the reason text includes "gateway starting". Route that case (alongside the existing isNoisySwiftPmHelperClose downgrade) through the debug logger. The non-startup WARN path is preserved for unexpected close-before-connect events.
  • src/gateway/server/ws-connection.startup.test.ts — New test case mirrors the first one's setup (isStartupPending: () => true), waits for the 1013 close, then asserts logWsControl.debug was called with closed before connect and logWsControl.warn was not called with that prefix.
  • CHANGELOG.md — Unreleased > Fixes entry credited to me.

Test plan

  • pnpm test src/gateway/server/ws-connection.startup.test.ts — 2/2 pass (1 existing + 1 new)
  • pnpm exec oxfmt --check on touched files — clean
  • No behavior change for non-startup close events — same WARN path unchanged.
  • No behavior change for the swiftpm-testing-helper downgrade — kept in the same downgradeToDebug predicate.

Fixes #76361.

Real behavior proof

Test: pnpm test src/gateway/server/ws-connection.startup.test.ts — 2 passed.

The test exercises the actual attachGatewayWsConnectionHandler code path during startup-pending state, fires a 1013 close, and asserts the log line is downgraded to debug. The matching negative case (1013 + "gateway starting" reason from a non-startup peer) stays at warn. Same code path that handles real Pi/Telegram reconnects during gateway boot.

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime size: S labels May 4, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 10787b9e52

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/gateway/server/ws-connection.ts Outdated
Comment on lines +349 to +350
const isExpectedStartupClose = (code: number | undefined, reason: string | undefined) =>
code === 1013 && normalizeLowercaseStringOrEmpty(reason).includes("gateway starting");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restrict 1013 downgrade to server-initiated startup closes

This downgrade checks only code===1013 plus a reason substring, so any pre-auth peer that closes with 1013 and text containing gateway starting will now be logged at debug instead of warn, even when the gateway is not in the startup-pending path. In attachGatewayWsConnectionHandler, startup closes are explicitly marked via the handshake failure/close-cause path, but that state is not consulted here, so unexpected external closes can be misclassified and warning visibility is reduced.

Useful? React with 👍 / 👎.

@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented May 4, 2026

Codex review: needs real behavior proof before merge.

Summary
The PR downgrades server-owned startup-sidecars WebSocket 1013 closed before connect logs from warn to debug, adds startup regression coverage, and records the fix in the changelog.

Reproducibility: yes. at source level: force isStartupPending() true, send a connect frame, and current main marks startup-sidecars-pending, closes with 1013, then routes the non-client close through the generic WARN branch. I did not run a live gateway or tests because this was a read-only review.

Real behavior proof
Needs real behavior proof before merge: The PR body supplies tests only, and the current Real behavior proof check fails; it needs live after-fix gateway log, terminal output, linked artifact, or a maintainer override.

Next step before merge
This needs contributor-supplied real behavior proof or a maintainer proof override; there is no remaining narrow automation repair.

Security
Cleared: No security or supply-chain concern found; the diff only changes gateway log classification, a focused test, and changelog text.

Review details

Best possible solution:

Land this PR after the contributor adds accepted real runtime proof or a maintainer explicitly overrides that gate, keeping the fix scoped to the existing server-owned startup close cause.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level: force isStartupPending() true, send a connect frame, and current main marks startup-sidecars-pending, closes with 1013, then routes the non-client close through the generic WARN branch. I did not run a live gateway or tests because this was a read-only review.

Is this the best way to solve the issue?

Yes, the current PR is the narrow maintainable fix because it keys the downgrade on the existing startup close cause plus the expected 1013 reason while preserving WARN for other pre-connect closes. The merge path still needs real behavior proof before approval.

Acceptance criteria:

  • pnpm test src/gateway/server/ws-connection.startup.test.ts
  • pnpm exec oxfmt --check --threads=1 src/gateway/server/ws-connection.ts src/gateway/server/ws-connection.startup.test.ts CHANGELOG.md
  • pnpm check:changed in Testbox before handoff

What I checked:

Likely related people:

  • steipete: Peter Steinberger introduced the startup-unavailable retry protocol, docs, startup server test, and the affected startup close path. (role: introduced behavior and likely follow-up owner; confidence: high; commits: 8d58ad4c15cd, 047c03cc8854; files: src/gateway/server/ws-connection.ts, src/gateway/server/ws-connection/message-handler.ts, src/gateway/server/ws-connection.startup.test.ts)
  • Shakker: Current-main blame for the pre-connect close logging block and startup close lines points to recent gateway restart diagnostics work by Shakker. (role: recent maintainer; confidence: medium; commits: 9b0afd81413a, 3e53580d6311; files: src/gateway/server/ws-connection.ts, src/gateway/server/ws-connection/message-handler.ts, src/gateway/server/ws-connection.startup.test.ts)

Remaining risk / open question:

  • The external PR still lacks accepted live after-fix proof showing the gateway restart log behavior outside tests.
  • Targeted tests were reported by the contributor but were not independently run in this read-only review.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 7d5ca3064a51.

Re-review progress:

@lonexreb lonexreb force-pushed the fix/76361-ws-1013-startup-debug branch 2 times, most recently from 389b13f to 70cbee3 Compare May 4, 2026 19:27
@lonexreb
Copy link
Copy Markdown
Contributor Author

lonexreb commented May 4, 2026

Addressed [P2] from codex review (src/gateway/server/ws-connection.ts:350). The current PR head ties the 1013 downgrade to the server-owned closeCause === "startup-sidecars-pending" so a peer-supplied 1013 + "gateway starting" reason on a non-startup close cannot suppress an otherwise actionable WARN. Codex was reviewing an older revision (10787b9e52); the current head includes the closeCause guard at src/gateway/server/ws-connection.ts:355-358 plus startup-test coverage. Also rebased onto current origin/main; targeted tests still green (src/gateway/server/ws-connection.startup.test.ts 2/2).

…ore-connect log to debug (openclaw#76361)

Code 1013 'Try Again Later' is exactly what the WebSocket spec says
servers should send to clients connecting during a not-yet-ready
window, and the gateway's own ws-connection message-handler does this
deliberately during the startup-sidecars-pending window. Surfacing the
matching 'closed before connect' line at WARN added one entry per
Control UI client per gateway restart and inflated WARN noise without
flagging anything actionable.

Add an isExpectedStartupClose helper that recognizes (code === 1013 &&
reason includes 'gateway starting') and routes that case through the
debug logger, alongside the existing swiftpm-testing-helper noise
downgrade. The non-startup WARN path is preserved for unexpected
close-before-connect events.

Tests: new ws-connection.startup.test case asserts the 1013 close
during pending startup logs at debug, never warn.
@lonexreb lonexreb force-pushed the fix/76361-ws-1013-startup-debug branch from 70cbee3 to 032fe7a Compare May 5, 2026 18:11
@openclaw-barnacle openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gateway/ws close-before-connect during startup logged as WARN with code 1013

1 participant