Skip to content

fix(gateway): stop stale-socket restarts before first event#38643

Merged
Takhoffman merged 5 commits intomainfrom
issue-38464-guided
Mar 7, 2026
Merged

fix(gateway): stop stale-socket restarts before first event#38643
Takhoffman merged 5 commits intomainfrom
issue-38464-guided

Conversation

@Takhoffman
Copy link
Copy Markdown
Contributor

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: gateway stale-socket health checks restarted connected channels that had never published event-liveness timestamps, which caused Telegram reconnect storms and duplicate delivery loops.
  • Why it matters: quiet or freshly reconnected channels could be recycled indefinitely before their first real event, destabilizing sessions and creating duplicate downstream work.
  • What changed: stale-socket recovery now requires explicit event-liveness tracking, and connect-time liveness seeding is centralized in a shared helper used by Slack, Discord, and web channel startup paths.
  • What did NOT change (scope boundary): auth, transport retry policy, message parsing, and non-stale health outcomes are unchanged.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

Telegram-shaped channels no longer get restarted as stale-socket before they have published any event-liveness timestamp. Slack, Discord, and web channels now seed connect-time liveness consistently so stale-socket recovery still works after a real connection is established.

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node.js via pnpm
  • Model/provider: N/A
  • Integration/channel (if any): Gateway health monitor; Telegram/Slack/Discord/Web providers
  • Relevant config (redacted): default local test config; no secret/config changes required

Steps

  1. Start a channel provider that reports connected: true before receiving its first inbound event.
  2. Let the channel health monitor evaluate a long-lived idle connection.
  3. Observe whether the provider is restarted as stale-socket without any recorded event-liveness.

Expected

  • Channels without event-liveness tracking stay healthy, while channels that do track liveness can still be restarted when they go stale.

Actual

  • Before the fix, connected channels with lastEventAt: null could be restarted repeatedly, causing reconnect churn and duplicate delivery loops.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: focused stale-socket policy and monitor regressions, shared connect-time liveness helper coverage, Slack connect-time status seeding, full repo test suite.
  • Edge cases checked: no-event channels remain healthy; Slack/Discord/web connected paths still seed stale-socket liveness; lastError clearing remains intact for Slack connect.
  • What you did not verify: live production channel sessions outside automated tests.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert commits 0d6110e08eac92a42124800b90a9138637f2c367 and e5af30283abf051964a1b5f0eb8a4a734e617d08 from the branch.
  • Files/config to restore: src/gateway/channel-health-policy.ts, provider status publishers, and src/gateway/channel-status-patches.ts.
  • Known bad symptoms reviewers should watch for: stale-socket restarts stop firing for connected channels that should recover, or channels begin reconnecting before their first inbound event again.

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: a provider path that should participate in stale-socket recovery might bypass the shared connect-time liveness helper.
    • Mitigation: the helper is covered by a focused test and adopted in the currently touched Slack, Discord, and web connected paths.

@openclaw-barnacle openclaw-barnacle bot added channel: discord Channel integration: discord channel: slack Channel integration: slack channel: whatsapp-web Channel integration: whatsapp-web gateway Gateway runtime size: S maintainer Maintainer-authored PR labels Mar 7, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 7, 2026

Greptile Summary

This PR fixes a Telegram (and Telegram-shaped channel) restart storm caused by the stale-socket health check mishandling channels that never publish lastEventAt timestamps. The core fix is in evaluateChannelHealth (channel-health-policy.ts): the old logic entered the stale-socket branch whenever lastStartAt != null, then fell back to lastEventAt ?? 0 — epoch-0 — as the "last event", causing any idle connected channel to appear stale. The new logic gates the entire stale-socket path on lastEventAt != null, so channels that opt out of event-liveness tracking are never restarted as stale.

To keep stale-socket recovery working for channels that do track liveness (Slack, Discord, Web/WhatsApp), a shared createConnectedChannelStatusPatch helper is introduced that seeds both lastConnectedAt and lastEventAt to the same connect-time timestamp. All three provider connect paths now use this helper, ensuring the stale timer starts from a known good baseline rather than epoch-0.

Key observations:

  • The removal of the upDuration > staleEventThresholdMs guard is a subtle behavioral change: a channel with an old inherited lastEventAt (patch-merged from a previous lifecycle) that has just exited the startup-connect grace window could now be flagged stale-socket before its first createConnectedChannelStatusPatch call. In practice this is guarded by the connected: false → disconnected fast-path that all three providers set on disconnect, but it is worth being aware of.
  • src/web/auto-reply/monitor.ts assigns the connected status patch field-by-field rather than using object spread like every other callsite; this is fragile if ConnectedChannelStatusPatch ever gains new fields (see inline comment).
  • Test coverage is solid: policy-level unit tests, monitor-level integration tests for the Telegram opt-out case, and a dedicated test for the new helper and the Slack connect-status publisher.

Confidence Score: 4/5

  • Safe to merge; the core fix is correct and well-tested, with one minor style inconsistency in the web monitor.
  • The fix correctly addresses the false-stale-socket root cause, all three provider connect paths are covered by the shared helper, and tests exercise both the opt-out (Telegram) and opt-in (Slack) paths. The one point deduction is for the verbose field-by-field assignment in web/auto-reply/monitor.ts which diverges from every other callsite and could silently miss future fields, and for the undocumented behavioral change around removal of the upDuration guard (unlikely to cause issues in practice given connected: false disconnect signals, but worth tracking).
  • src/web/auto-reply/monitor.ts — verbose field-by-field patch assignment should match other callsites; src/gateway/channel-health-policy.ts — removal of upDuration guard is correct but reviewers should be aware of the edge case with inherited stale lastEventAt across lifecycle restarts.

Last reviewed commit: 0d6110e

@Takhoffman Takhoffman force-pushed the issue-38464-guided branch from 0d6110e to 78f6f98 Compare March 7, 2026 06:39
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0d6110e08e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d5a5dc64e3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@Takhoffman Takhoffman merged commit 8873e13 into main Mar 7, 2026
29 of 30 checks passed
@Takhoffman Takhoffman deleted the issue-38464-guided branch March 7, 2026 06:58
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b47b76116

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +111 to +112
if (lastStartAt != null && snapshot.lastEventAt < lastStartAt) {
return { healthy: true, reason: "healthy" };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Bound inherited-event bypass by lifecycle timeout

The new early return for lastEventAt < lastStartAt makes stale-socket checks permanently skip when a restarted lifecycle inherits old status and then never publishes a fresh event timestamp. This can happen with patch-merged runtime state (src/gateway/server-channels.ts merges patches without clearing connected/lastEventAt) plus startup paths that can block before status publication (for example Slack waits on app.start() before calling publishSlackConnectedStatus in src/slack/monitor/provider.ts), leaving connected: true and an old lastEventAt forever and preventing health monitor recovery for a hung start.

Useful? React with 👍 / 👎.

mrosmarin added a commit to mrosmarin/openclaw that referenced this pull request Mar 7, 2026
* main: (45 commits)
  chore: update dependencies except carbon
  test(memory): make mcporter EINVAL retry test deterministic
  refactor(extensions): reuse shared helper primitives
  refactor(core): extract shared dedup helpers
  fix(ci): re-enable detect-secrets on main
  docs: reorder 2026.3.7 changelog highlights
  chore: bump version to 2026.3.7
  fix(android): align run command with app id
  docs: add changelog entry for Android package rename (openclaw#38712)
  fix(android): rename app package to ai.openclaw.app
  fix(gateway): stop stale-socket restarts before first event (openclaw#38643)
  fix(gateway): skip stale-socket restarts for Telegram polling (openclaw#38405)
  fix(gateway): invalidate bootstrap cache on session rollover (openclaw#38535)
  docs: update changelog for reply media delivery (openclaw#38572)
  fix: contain final reply media normalization failures
  fix: contain block reply media failures
  fix: normalize reply media paths
  Fix owner-only auth and overlapping skill env regressions (openclaw#38548)
  fix(feishu): disable block streaming to prevent silent reply drops (openclaw#38422)
  fix: suppress ACP NO_REPLY fragments in console output (openclaw#38436)
  ...
mcaxtr pushed a commit to mcaxtr/openclaw that referenced this pull request Mar 7, 2026
…#38643)

* fix(gateway): guard stale-socket restarts by event liveness

* fix(gateway): centralize connect-time liveness tracking

* fix(web): apply connected status patch atomically

* fix(gateway): require active socket for stale checks

* fix(gateway): ignore inherited stale event timestamps
joshavant pushed a commit that referenced this pull request Mar 7, 2026
* fix(gateway): guard stale-socket restarts by event liveness

* fix(gateway): centralize connect-time liveness tracking

* fix(web): apply connected status patch atomically

* fix(gateway): require active socket for stale checks

* fix(gateway): ignore inherited stale event timestamps
vincentkoc pushed a commit to BryanTegomoh/openclaw-fork that referenced this pull request Mar 8, 2026
…#38643)

* fix(gateway): guard stale-socket restarts by event liveness

* fix(gateway): centralize connect-time liveness tracking

* fix(web): apply connected status patch atomically

* fix(gateway): require active socket for stale checks

* fix(gateway): ignore inherited stale event timestamps
Saitop pushed a commit to NomiciAI/openclaw that referenced this pull request Mar 8, 2026
…#38643)

* fix(gateway): guard stale-socket restarts by event liveness

* fix(gateway): centralize connect-time liveness tracking

* fix(web): apply connected status patch atomically

* fix(gateway): require active socket for stale checks

* fix(gateway): ignore inherited stale event timestamps
jenawant pushed a commit to jenawant/openclaw that referenced this pull request Mar 10, 2026
…#38643)

* fix(gateway): guard stale-socket restarts by event liveness

* fix(gateway): centralize connect-time liveness tracking

* fix(web): apply connected status patch atomically

* fix(gateway): require active socket for stale checks

* fix(gateway): ignore inherited stale event timestamps
dhoman pushed a commit to dhoman/chrono-claw that referenced this pull request Mar 11, 2026
…#38643)

* fix(gateway): guard stale-socket restarts by event liveness

* fix(gateway): centralize connect-time liveness tracking

* fix(web): apply connected status patch atomically

* fix(gateway): require active socket for stale checks

* fix(gateway): ignore inherited stale event timestamps
yumesha pushed a commit to yumesha/openclaw that referenced this pull request Mar 11, 2026
…#38643)

* fix(gateway): guard stale-socket restarts by event liveness

* fix(gateway): centralize connect-time liveness tracking

* fix(web): apply connected status patch atomically

* fix(gateway): require active socket for stale checks

* fix(gateway): ignore inherited stale event timestamps
senw-developers pushed a commit to senw-developers/va-openclaw that referenced this pull request Mar 17, 2026
…#38643)

* fix(gateway): guard stale-socket restarts by event liveness

* fix(gateway): centralize connect-time liveness tracking

* fix(web): apply connected status patch atomically

* fix(gateway): require active socket for stale checks

* fix(gateway): ignore inherited stale event timestamps
V-Gutierrez pushed a commit to V-Gutierrez/openclaw-vendor that referenced this pull request Mar 17, 2026
…#38643)

* fix(gateway): guard stale-socket restarts by event liveness

* fix(gateway): centralize connect-time liveness tracking

* fix(web): apply connected status patch atomically

* fix(gateway): require active socket for stale checks

* fix(gateway): ignore inherited stale event timestamps
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 20, 2026
…#38643)

* fix(gateway): guard stale-socket restarts by event liveness

* fix(gateway): centralize connect-time liveness tracking

* fix(web): apply connected status patch atomically

* fix(gateway): require active socket for stale checks

* fix(gateway): ignore inherited stale event timestamps

(cherry picked from commit 8873e13)
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 20, 2026
…#38643)

* fix(gateway): guard stale-socket restarts by event liveness

* fix(gateway): centralize connect-time liveness tracking

* fix(web): apply connected status patch atomically

* fix(gateway): require active socket for stale checks

* fix(gateway): ignore inherited stale event timestamps

(cherry picked from commit 8873e13)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: discord Channel integration: discord channel: slack Channel integration: slack channel: whatsapp-web Channel integration: whatsapp-web gateway Gateway runtime maintainer Maintainer-authored PR size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2026.3.2: Health-monitor stale-socket storms cause Telegram message duplication and device re-pair loops

1 participant