fix(gateway): skip seq-gap broadcast for stale post-lifecycle events by caesargattuso · Pull Request #43751 · openclaw/openclaw

caesargattuso · 2026-03-12T06:24:17Z

fix(gateway): skip seq-gap broadcast for stale post-lifecycle events

After a lifecycle:end event clears agentRunSeq, any remaining
in-flight events for the same runId arrive with last === 0, causing
a spurious seq gap error broadcast (expected: 1, received: N).

Guard the check with last > 0 so only genuinely out-of-order events
(where a prior seq was recorded) trigger the error.

Summary

Problem: after lifecycle:end clears agentRunSeq, residual in-flight events (e.g. a trailing chat or heartbeat) find last === 0 and incorrectly trigger a seq gap error broadcast.
Why it matters: the spurious error is surfaced to clients as an agent error event, causing downstream consumers (e.g. ClawdbotWebSocketClient) to treat a successful run as failed.
What changed: the seq-gap check in createAgentEventHandler is now guarded by last > 0; only events where a prior seq was already recorded can produce a gap error.
What did NOT change: the seq tracking, cleanup, and deletion logic are untouched; real gaps on active runs are still reported.
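The change summarized above can be sketched roughly as follows. This is a simplified model, not the code in src/gateway/server-chat.ts: the event shape, the `createSeqTracker` name, the `guarded` flag, and the `errors` array (standing in for the error broadcast) are assumptions made for illustration. Only `agentRunSeq`, the `?? 0` fallback, and the `last > 0` guard are taken from the PR discussion.

```typescript
// Simplified, hypothetical model of the seq-gap check (not the real handler).
type AgentEvent = { runId: string; seq: number; stream: string; phase?: string };
type GapError = { reason: "seq gap"; expected: number; received: number };

function createSeqTracker(guarded: boolean) {
  const agentRunSeq = new Map<string, number>();
  const errors: GapError[] = []; // stands in for the error broadcast

  function handle(evt: AgentEvent): void {
    const last = agentRunSeq.get(evt.runId) ?? 0;
    const isGap = evt.seq !== last + 1;
    // Guarded: report only when a prior seq was recorded (last > 0).
    if (isGap && (!guarded || last > 0)) {
      errors.push({ reason: "seq gap", expected: last + 1, received: evt.seq });
    }
    agentRunSeq.set(evt.runId, evt.seq);
    if (evt.stream === "lifecycle" && evt.phase === "end") {
      agentRunSeq.delete(evt.runId); // lifecycle:end clears the tracked seq
    }
  }

  return { handle, errors };
}

// A run that ends at seq 3, followed by one trailing in-flight event.
const events: AgentEvent[] = [
  { runId: "run-1", seq: 1, stream: "assistant" },
  { runId: "run-1", seq: 2, stream: "assistant" },
  { runId: "run-1", seq: 3, stream: "lifecycle", phase: "end" },
  { runId: "run-1", seq: 4, stream: "chat" },
];

const withoutGuard = createSeqTracker(false);
const withGuard = createSeqTracker(true);
for (const evt of events) {
  withoutGuard.handle(evt);
  withGuard.handle(evt);
}
console.log(withoutGuard.errors); // one "seq gap" entry: expected 1, received 4
console.log(withGuard.errors); // empty
```

Running both variants over the same event list shows the difference: the trailing event is the only one that behaves differently.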

Change Type (select all)

Bug fix

Scope (select all touched areas)

Gateway / orchestration

Linked Issue/PR

Closes #
Related #

User-visible / Behavior Changes

Spurious seq gap error events are no longer broadcast to clients at the end of a successful agent run.

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No

Repro + Verification

Environment

OS: Linux (production)
Runtime/container: Node 22 / Docker
Model/provider: any
Integration/channel (if any): Clawdbot WebSocket

Steps

Start a long agent run that produces many streaming events (e.g. image generation).
Observe the final log sequence: lifecycle:end (seq N) → trailing chat final / heartbeat events arrive after.
Before fix: agent error event with reason: seq gap, expected: 1, received: N+1 is broadcast.
After fix: no spurious error event is broadcast.

Expected

Run completes cleanly; no seq gap error event after lifecycle:end.

Actual (before fix)

agent error { reason: "seq gap", expected: 1, received: 307 } broadcast to clients, causing WebSocket client to treat the run as errored.

Evidence

Trace/log snippets — provided in issue description (ClawdbotWebSocketClient logs showing seq=695 error after lifecycle:end at seq=692).

Human Verification (required)

Verified scenarios: reviewed full event sequence in production logs; confirmed lifecycle:end clears agentRunSeq before residual events arrive.
Edge cases checked: genuine mid-run seq gaps (last > 0) are unaffected; first event of a new run (last === 0, seq === 1) passes without error.
What you did not verify: live end-to-end re-run in production after deploy.
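The edge cases listed above can be written out as a small truth table. The `shouldReportGap` helper is hypothetical and only mirrors the guarded condition in isolation; `last` is 0 when no seq is recorded for the run.

```typescript
// Hypothetical helper mirroring the guarded condition; not part of the gateway code.
function shouldReportGap(last: number, seq: number): boolean {
  return last > 0 && seq !== last + 1;
}

const cases: Array<{ label: string; last: number; seq: number }> = [
  { label: "first event of a new run", last: 0, seq: 1 },
  { label: "stale event after lifecycle:end", last: 0, seq: 307 },
  { label: "in-order event on an active run", last: 41, seq: 42 },
  { label: "genuine mid-run gap", last: 41, seq: 44 },
];

for (const c of cases) {
  console.log(`${c.label}: report=${shouldReportGap(c.last, c.seq)}`);
}
```

Only the last case is reported, which matches the claim that active-run gap detection is unaffected.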

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No

Failure Recovery (if this breaks)

How to disable/revert this change quickly: revert the last > 0 && guard in src/gateway/server-chat.ts.
Files/config to restore: src/gateway/server-chat.ts only.
Known bad symptoms: genuine seq gaps on active runs silently ignored (not applicable — guard only skips when last === 0).

Risks and Mitigations

Risk: a genuinely missing first event (seq 1 never arrives, seq 2 is first) would not be reported.
- Mitigation: the old logic could not reliably identify this scenario either. With last === 0 and seq === 2 it reported expected: 1, received: 2, but that report is indistinguishable from the stale-event case, so the practical risk is negligible.

greptile-apps · 2026-03-12T06:26:38Z

Greptile Summary

This PR adds a one-line guard (last > 0) to the sequence-gap check in createAgentEventHandler, preventing spurious seq gap error broadcasts when residual in-flight events arrive after lifecycle:end has already cleared agentRunSeq.

Key observations:

The fix is minimal, targeted, and correctly addresses the described production issue where agentRunSeq.delete(evt.runId) at lifecycle-end caused the next stale event to see last === 0, which then incorrectly fired a gap error (expected: 1, received: N).
Active-run gap detection (where last > 0) is completely unaffected.
One subtle nuance worth being aware of: the stale event still executes agentRunSeq.set(evt.runId, evt.seq) (line 572), re-inserting the run's key. This means only the first stale event after lifecycle-end is fully shielded; if multiple non-consecutive stale events arrive, the second and later ones could still trigger a gap error since last would be > 0 at that point. In practice this is unlikely because in-flight events are dispatched sequentially and should arrive consecutively.
The acknowledged risk — a genuinely missing first event (seq 2 arriving before seq 1) going unreported — is real but negligible and practically indistinguishable from the stale-event scenario, as the PR author correctly notes.
The explanatory comment added alongside the guard is clear and accurately describes both the first-event and stale-event cases.
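The re-insertion nuance noted above can be illustrated with a minimal model. The `onEvent` function and the `reported` array are assumptions for illustration; only the `agentRunSeq` map, the `last > 0` guard, and the unconditional `set` come from the review.

```typescript
// Hypothetical model of the guarded handler, reduced to the seq bookkeeping.
const agentRunSeq = new Map<string, number>();
const reported: Array<{ expected: number; received: number }> = [];

function onEvent(runId: string, seq: number): void {
  const last = agentRunSeq.get(runId) ?? 0;
  if (last > 0 && seq !== last + 1) {
    reported.push({ expected: last + 1, received: seq });
  }
  agentRunSeq.set(runId, seq); // also runs for stale events, re-inserting the key
}

// State right after lifecycle:end deleted the entry for "run-1".
agentRunSeq.delete("run-1");

onEvent("run-1", 693); // first stale event: last === 0, check skipped
onEvent("run-1", 695); // second stale event: last === 693, so 694 is expected

console.log(reported); // [{ expected: 694, received: 695 }]
```

If the two stale events were consecutive (693 then 694), nothing would be reported, which is the case the review expects in practice.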

Confidence Score: 4/5

Safe to merge — minimal, well-reasoned one-line fix that eliminates a real production bug without affecting genuine gap detection on active runs.
The change is a single conditional guard on an already-narrow code path. The logic is sound: last === 0 only occurs when the run's seq entry was never set or was deleted by lifecycle-end, so skipping the gap check in that state is correct. Genuine in-run gaps (where last > 0) are fully preserved. The only minor residual risk — multiple non-consecutive stale events re-triggering after the first one re-inserts the key — is an edge case within an edge case and is not introduced by this PR.
No files require special attention.

Last reviewed commit: 1cdb818

byungsker

The root cause analysis is spot-on: after lifecycle:end calls agentRunSeq.delete(evt.runId), the next ?? 0 fallback turns any stale in-flight event into a spurious gap report. The last > 0 guard is the minimal correct fix.

One side-effect worth noting: the stale event still runs agentRunSeq.set(evt.runId, evt.seq) at line 570, so it re-creates a zombie entry for a completed run. It will never be cleaned up by the lifecycle-end path (which already ran). In practice this is harmless — the run is finished and no further events arrive — but it could cause a subtle gap-suppression if a *new* run somehow reused the same runId (extremely unlikely with UUID run IDs). Not a blocker, just worth knowing.

No tests were added; a unit test covering the post-lifecycle stale-event path would be a nice-to-have but is non-trivial to set up given the mock complexity. LGTM for the fix itself.
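A test for the post-lifecycle path might look something like the sketch below. It runs against a stand-in handler, not the real createAgentEventHandler, which takes more dependencies. The `createHandlerSketch` function, the `Evt` shape, the broadcast payload, and the `check` helper are all assumptions. The second check documents the leftover entry described above rather than asserting it is desirable.

```typescript
// Stand-in for the gateway handler; names and shapes here are hypothetical.
type Evt = { runId: string; seq: number; lifecycleEnd?: boolean };
type Broadcast = (event: string, payload: unknown) => void;

function createHandlerSketch(broadcast: Broadcast, agentRunSeq: Map<string, number>) {
  return (evt: Evt): void => {
    const last = agentRunSeq.get(evt.runId) ?? 0;
    if (last > 0 && evt.seq !== last + 1) {
      broadcast("agent", { stream: "error", reason: "seq gap", expected: last + 1, received: evt.seq });
    }
    agentRunSeq.set(evt.runId, evt.seq);
    if (evt.lifecycleEnd) agentRunSeq.delete(evt.runId);
  };
}

function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(message);
}

// Test body: the run ends at seq 2, then one stale event arrives.
const calls: Array<{ event: string; payload: unknown }> = [];
const seqMap = new Map<string, number>();
const handler = createHandlerSketch((event, payload) => calls.push({ event, payload }), seqMap);

handler({ runId: "run-1", seq: 1 });
handler({ runId: "run-1", seq: 2, lifecycleEnd: true });
handler({ runId: "run-1", seq: 3 }); // stale, arrives after lifecycle:end

check(calls.length === 0, "no seq gap error should be broadcast");
check(seqMap.get("run-1") === 3, "stale event leaves an entry behind for the finished run");
```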

caesargattuso · 2026-03-20T07:36:27Z

Hi @obviyus, this is just a one-line bug fix that's already passed botcheck; could you take a look?

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 833bf37612

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

src/gateway/server-chat.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65b7067c3a

src/gateway/server-chat.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13f8d12660

src/gateway/server-chat.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 58d273c3cc

src/gateway/server-chat.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 94b7ffd2a7

src/gateway/server-node-events.ts

obviyus · 2026-03-20T09:27:16Z

Landed on main.

Landed source commit: 92d23fb
Merge commit: 57f1cf6

Thanks @caesargattuso.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 92d23fbdb9

chatgpt-codex-connector · 2026-03-20T09:30:37Z

src/gateway/server-node-events.ts

+      const runId = randomUUID();

      // Ensure chat UI clients refresh when this run completes (even though it wasn't started via chat.send).
-      // This maps agent bus events (keyed by sessionId) to chat events (keyed by clientRunId).
-      ctx.addChatRun(sessionId, {
+      // This maps agent bus events (keyed by per-turn runId) to chat events (keyed by clientRunId).
+      ctx.addChatRun(runId, {


Keep voice transcript streams keyed by sessionId

Fresh evidence from this commit: voice.transcript now generates a per-turn UUID here instead of reusing the session ID. The shared iOS/macOS chat UI drops external agent events unless evt.runId == sessionId (apps/shared/OpenClawKit/Sources/OpenClawChatUI/ChatViewModel.swift:889-892), and its regression tests model external streaming with runId: sessionId (apps/shared/OpenClawKit/Tests/OpenClawKitTests/ChatViewModelTests.swift:523-540). In practice, if a user is watching a session while a node voice transcript runs, live assistant text and tool cards stop appearing until the final chat refresh lands.

Useful? React with 👍 / 👎.
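The client-side behaviour this comment describes can be modelled roughly as below. This is a TypeScript approximation for illustration only: the actual check is Swift code in ChatViewModel.swift and may differ in detail, and the `visibleEvents` function, the event shape, and the sample IDs are assumptions.

```typescript
// Hypothetical TypeScript model of the client-side filter described in the comment.
type ExternalAgentEvent = { runId: string; text: string };

function visibleEvents(sessionId: string, incoming: ExternalAgentEvent[]): ExternalAgentEvent[] {
  // External events are kept only when their runId equals the watched sessionId.
  return incoming.filter((evt) => evt.runId === sessionId);
}

const sessionId = "session-abc";
const sessionScoped = [{ runId: sessionId, text: "partial reply" }];
const perTurn = [{ runId: "3f0c2c1e-0000-4000-8000-000000000000", text: "partial reply" }];

console.log(visibleEvents(sessionId, sessionScoped).length); // 1
console.log(visibleEvents(sessionId, perTurn).length); // 0
```

Under this model, events keyed by a per-turn UUID never match the session being watched, which is why live text would only appear once the final chat refresh lands.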

@caesargattuso

…penclaw#43751) * fix: stop stale gateway seq-gap errors (openclaw#43751) (thanks @caesargattuso) * fix: keep agent.request run ids session-scoped --------- Co-authored-by: Ayaan Zaidi <[email protected]>

@caesargattuso

…penclaw#43751) * fix: stop stale gateway seq-gap errors (openclaw#43751) (thanks @caesargattuso) * fix: keep agent.request run ids session-scoped --------- Co-authored-by: Ayaan Zaidi <[email protected]> (cherry picked from commit 57f1cf6)

@fuller-stack-dev

* fix(gateway): skip device pairing when auth.mode=none Fixes openclaw#42931 When gateway.auth.mode is set to "none", authentication succeeds with method "none" but sharedAuthOk remains false because the auth-context only recognises token/password/trusted-proxy methods. This causes all pairing-skip conditions to fail, so Control UI browser connections get closed with code 1008 "pairing required" despite auth being disabled. Short-circuit the skipPairing check: if the operator explicitly disabled authentication, device pairing (which is itself an auth mechanism) must also be bypassed. Fixes openclaw#42931 (cherry picked from commit 9bffa34) * fix(gateway): strip unbound scopes for shared-auth connects (cherry picked from commit 7dc447f) * fix(gateway): increase WS handshake timeout from 3s to 10s (openclaw#49262) * fix(gateway): increase WS handshake timeout from 3s to 10s The 3-second default is too aggressive when the event loop is under load (concurrent sessions, compaction, agent turns), causing spurious 'gateway closed (1000)' errors on CLI commands like `openclaw cron list`. Changes: - Increase DEFAULT_HANDSHAKE_TIMEOUT_MS from 3_000 to 10_000 - Add OPENCLAW_HANDSHAKE_TIMEOUT_MS env var for user override (no VITEST gate) - Keep OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS as fallback for existing tests Fixes openclaw#46892 * fix: restore VITEST guard on test env var, use || for empty-string fallback, fix formatting * fix: cover gateway handshake timeout env override (openclaw#49262) (thanks @fuller-stack-dev) --------- Co-authored-by: Wilfred <[email protected]> Co-authored-by: Ayaan Zaidi <[email protected]> (cherry picked from commit 36f394c) * fix(gateway): skip Control UI pairing when auth.mode=none (closes openclaw#42931) (openclaw#47148) When auth is completely disabled (mode=none), requiring device pairing for Control UI operator sessions adds friction without security value since any client can already connect without credentials. 
Add authMode parameter to shouldSkipControlUiPairing so the bypass fires only for Control UI + operator role + auth.mode=none. This avoids the openclaw#43478 regression where a top-level OR disabled pairing for ALL websocket clients. (cherry picked from commit 26e0a3e) * fix(gateway): clear trusted-proxy control ui scopes (cherry picked from commit ccf16cd) * fix(gateway): guard interface discovery failures Closes openclaw#44180. Refs openclaw#47590. Co-authored-by: Peter Steinberger <[email protected]> (cherry picked from commit 3faaf89) * fix(gateway): suppress ciao interface assertions Closes openclaw#38628. Refs openclaw#47159, openclaw#52431. Co-authored-by: Peter Steinberger <[email protected]> (cherry picked from commit c0d4abc) * fix(gateway): run before_tool_call for HTTP tools (cherry picked from commit 8cc0c9b) * fix(gateway): skip seq-gap broadcast for stale post-lifecycle events (openclaw#43751) * fix: stop stale gateway seq-gap errors (openclaw#43751) (thanks @caesargattuso) * fix: keep agent.request run ids session-scoped --------- Co-authored-by: Ayaan Zaidi <[email protected]> (cherry picked from commit 57f1cf6) * fix(gateway): honor trusted proxy hook auth rate limits (cherry picked from commit 4da617e) * fix(gateway): enforce browser origin check regardless of proxy headers In trusted-proxy mode, enforceOriginCheckForAnyClient was set to false whenever proxy headers were present. This allowed browser-originated WebSocket connections from untrusted origins to bypass origin validation entirely, as the check only ran for control-ui and webchat client types. An attacker serving a page from an untrusted origin could connect through a trusted reverse proxy, inherit proxy-injected identity, and obtain operator.admin access via the sharedAuthOk / roleCanSkipDeviceIdentity path without any origin restriction. Remove the hasProxyHeaders exemption so origin validation runs for all browser-originated connections regardless of how the request arrived. 
Fixes GHSA-5wcw-8jjv-m286 (cherry picked from commit ebed3bb) * fix(gateway): harden health monitor account gating (openclaw#46749) * gateway: harden health monitor account gating * gateway: tighten health monitor account-id guard (cherry picked from commit 29fec8b) * fix(gateway): bound unanswered client requests (openclaw#45689) * fix(gateway): bound unanswered client requests * fix(gateway): skip default timeout for expectFinal requests * fix(gateway): preserve gateway call timeouts * fix(gateway): localize request timeout policy * fix(gateway): clamp explicit request timeouts * fix(gateway): clamp default request timeout (cherry picked from commit 5fc43ff) * fix(gateway): propagate real gateway client into plugin subagent runtime Plugin subagent dispatch used a hardcoded synthetic client carrying operator.admin, operator.approvals, and operator.pairing for all runtime.subagent.* calls. Plugin HTTP routes with auth:"plugin" require no gateway auth by design, so an unauthenticated external request could drive admin-only gateway methods (sessions.delete, agent.run) through the subagent runtime. Propagate the real gateway client into the plugin runtime request scope when one is available. Plugin HTTP routes now run inside a scoped runtime client: auth:"plugin" routes receive a non-admin synthetic operator.write client; gateway-authenticated routes retain admin-capable scopes. The security boundary is enforced at the HTTP handler level. Fixes GHSA-xw77-45gv-p728 (cherry picked from commit a1520d7) * fix(gateway): enforce caller-scope subsetting in device.token.rotate device.token.rotate accepted attacker-controlled scopes and forwarded them to rotateDeviceToken without verifying the caller held those scopes. A pairing-scoped token could rotate up to operator.admin on any already-paired device whose approvedScopes included admin. 
Add a caller-scope subsetting check before rotateDeviceToken: the requested scopes must be a subset of client.connect.scopes via the existing roleScopesAllow helper. Reject with missing scope: <scope> if not. Also add server.device-token-rotate-authz.test.ts covering both the priv-esc path and the admin-to-node-invoke chain. Fixes GHSA-4jpw-hj22-2xmc (cherry picked from commit dafd61b) * fix(gateway): pin plugin webhook route registry (openclaw#47902) (cherry picked from commit a69f619) * fix(gateway): split conversation reset from admin reset (cherry picked from commit c91d162) * fix(gateway): harden token fallback/reconnect behavior and docs (openclaw#42507) * fix(gateway): harden token fallback and auth reconnect handling * docs(gateway): clarify auth retry and token-drift recovery * fix(gateway): tighten auth reconnect gating across clients * fix: harden gateway token retry (openclaw#42507) (thanks @joshavant) (cherry picked from commit a76e810) * fix: adapt cherry-picks for fork TS strictness - Replace OpenClawConfig with RemoteClawConfig in server-channels and server-runtime-state - Replace loadOpenClawPlugins with loadRemoteClawPlugins in server-plugins and remove unsupported runtimeOptions field and dead subagent runtime code - Export HookClientIpConfig type from server-http and thread it through server/hooks into server-runtime-state and server.impl - Create plugins-http/ submodules (path-context, route-match, route-auth) extracted from the monolithic plugins-http.ts by upstream refactor - Create stub modules for gutted upstream layers: acp/control-plane/manager, agents/bootstrap-cache, agents/pi-embedded, agents/internal-events - Strip thinkingLevel, reasoningLevel, skillsSnapshot from SessionEntry literals in agent.ts and session-reset-service.ts (Pi-specific fields) - Remove internalEvents from agent ingress opts and loadGatewayModelCatalog from sessions patch call (not present in fork types) - Fix connect-policy tests to pass booleans instead of role 
strings for the sharedAuthOk parameter (fork changed the function signature) - Add isHealthMonitorEnabled to ChannelManager mocks in test files - Widen runBeforeToolCallHook mock return type to accept blocked: true - Add explicit string types for msg params in server-plugins logger Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]> * fix: apply fork naming to cherry-picked bonjour files --------- Co-authored-by: Andrew Demczuk <[email protected]> Co-authored-by: Peter Steinberger <[email protected]> Co-authored-by: fuller-stack-dev <[email protected]> Co-authored-by: Wilfred <[email protected]> Co-authored-by: Ayaan Zaidi <[email protected]> Co-authored-by: caesargattuso <[email protected]> Co-authored-by: Robin Waslander <[email protected]> Co-authored-by: Tak Hoffman <[email protected]> Co-authored-by: Peter Steinberger <[email protected]> Co-authored-by: Josh Avant <[email protected]> Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

openclaw-barnacle bot added gateway Gateway runtime size: XS labels Mar 12, 2026

byungsker reviewed Mar 12, 2026

openclaw-barnacle bot added size: S and removed size: XS labels Mar 20, 2026

chatgpt-codex-connector bot reviewed Mar 20, 2026

src/gateway/server-chat.ts (outdated)

obviyus force-pushed the 20260312 branch from ab47a8a to 65b7067 Compare March 20, 2026 08:03

obviyus self-assigned this Mar 20, 2026

chatgpt-codex-connector bot reviewed Mar 20, 2026

src/gateway/server-chat.ts (outdated)

src/gateway/server-chat.ts (outdated)

openclaw-barnacle bot added app: web-ui App: web-ui size: M and removed size: S labels Mar 20, 2026

chatgpt-codex-connector bot reviewed Mar 20, 2026

src/gateway/server-chat.ts (outdated)

chatgpt-codex-connector bot reviewed Mar 20, 2026

src/gateway/server-chat.ts (outdated)

openclaw-barnacle bot added size: S and removed app: web-ui App: web-ui size: M labels Mar 20, 2026

fix: stop stale gateway seq-gap errors (openclaw#43751) (thanks @caesargattuso)

b98e1d5

chatgpt-codex-connector bot reviewed Mar 20, 2026

src/gateway/server-node-events.ts (outdated)

obviyus force-pushed the 20260312 branch from 94b7ffd to b98e1d5 Compare March 20, 2026 09:21

fix: keep agent.request run ids session-scoped

92d23fb

obviyus merged commit 57f1cf6 into openclaw:main Mar 20, 2026
8 checks passed

chatgpt-codex-connector bot reviewed Mar 20, 2026

alexey-pelykh mentioned this pull request Mar 22, 2026

Cherry-pick: Gateway fixes and hardening (20 commits) remoteclaw/remoteclaw#1825

Closed

alexey-pelykh mentioned this pull request Mar 23, 2026

Cherry-pick: Gateway fixes and improvements (1/2) (50 commits) remoteclaw/remoteclaw#1869

Closed
