fix(gateway): stop stale-socket restarts before first event #38643
Takhoffman merged 5 commits into main
Greptile Summary
This PR fixes a Telegram (and Telegram-shaped channel) restart storm caused by the stale-socket health check mishandling channels that never publish an event-liveness timestamp. To keep stale-socket recovery working for channels that do track liveness (Slack, Discord, Web/WhatsApp), a shared […]
Key observations:
Confidence Score: 4/5
Last reviewed commit: 0d6110e
0d6110e to 78f6f98
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0d6110e08e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d5a5dc64e3
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2b47b76116
    if (lastStartAt != null && snapshot.lastEventAt < lastStartAt) {
      return { healthy: true, reason: "healthy" };
    }
Bound inherited-event bypass by lifecycle timeout
The new early return for lastEventAt < lastStartAt makes stale-socket checks skip permanently when a restarted lifecycle inherits old status and then never publishes a fresh event timestamp. This can happen because runtime state is patch-merged (src/gateway/server-channels.ts merges patches without clearing connected/lastEventAt) and some startup paths can block before status publication (for example, Slack waits on app.start() before calling publishSlackConnectedStatus in src/slack/monitor/provider.ts). The channel is then left with connected: true and an old lastEventAt forever, which prevents the health monitor from recovering a hung start.
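To make the suggestion concrete, here is a minimal sketch of a stale-socket check in which the inherited-timestamp bypass is bounded by a startup grace window. It is not the code in src/gateway/channel-health-policy.ts: the `ChannelSnapshot` shape, the `evaluateStaleSocket` name, and both thresholds are assumptions made for illustration.

```typescript
// Hypothetical snapshot shape; the real gateway types may differ.
interface ChannelSnapshot {
  connected: boolean;
  lastStartAt: number | null; // ms since epoch
  lastEventAt: number | null; // ms since epoch
}

interface HealthResult {
  healthy: boolean;
  reason: "healthy" | "stale-socket";
}

// Assumed values, chosen only for illustration.
const STALE_EVENT_THRESHOLD_MS = 30 * 60 * 1000;
const STARTUP_GRACE_MS = 2 * 60 * 1000;

function evaluateStaleSocket(snapshot: ChannelSnapshot, now: number): HealthResult {
  const ok: HealthResult = { healthy: true, reason: "healthy" };
  const stale: HealthResult = { healthy: false, reason: "stale-socket" };
  const { connected, lastStartAt, lastEventAt } = snapshot;

  // No active socket, or no event liveness ever published:
  // never a stale-socket candidate.
  if (!connected || lastEventAt == null) return ok;

  // Timestamp inherited from a previous lifecycle: skip the check,
  // but only until the startup grace window has elapsed.
  if (lastStartAt != null && lastEventAt < lastStartAt) {
    return now - lastStartAt <= STARTUP_GRACE_MS ? ok : stale;
  }

  // Normal case: compare the age of the last event to the threshold.
  return now - lastEventAt > STALE_EVENT_THRESHOLD_MS ? stale : ok;
}
```

With this shape, a hung start that inherited connected: true and an old lastEventAt becomes eligible for recovery once the grace window passes, while a channel that has never published lastEventAt is still left alone.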
* fix(gateway): guard stale-socket restarts by event liveness
* fix(gateway): centralize connect-time liveness tracking
* fix(web): apply connected status patch atomically
* fix(gateway): require active socket for stale checks
* fix(gateway): ignore inherited stale event timestamps
(cherry picked from commit 8873e13)
Summary
Describe the problem and fix in 2–5 bullets:
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
Telegram-shaped channels no longer get restarted as stale-socket before they have published any event-liveness timestamp. Slack, Discord, and web channels now seed connect-time liveness consistently so stale-socket recovery still works after a real connection is established.
Security Impact (required)
(Yes/No) No
(Yes/No) No
(Yes/No) No
(Yes/No) No
(Yes/No) No
Yes, explain risk + mitigation:
Repro + Verification
Environment
pnpm
Steps
connected: true before receiving its first inbound event.
stale-socket without any recorded event-liveness.
Expected
Actual
lastEventAt: null could be restarted repeatedly, causing reconnect churn and duplicate delivery loops.
Evidence
Attach at least one:
Human Verification (required)
What you personally verified (not just CI), and how:
lastError clearing remains intact for Slack connect.
Compatibility / Migration
(Yes/No) Yes
(Yes/No) No
(Yes/No) No
Failure Recovery (if this breaks)
0d6110e08eac92a42124800b90a9138637f2c367 and e5af30283abf051964a1b5f0eb8a4a734e617d08 from the branch.
src/gateway/channel-health-policy.ts, provider status publishers, and src/gateway/channel-status-patches.ts.
Risks and Mitigations
List only real risks for this PR. Add/remove entries as needed. If none, write None.
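For readers unfamiliar with the connect-time liveness seeding described under User-visible / Behavior Changes, the following is a rough sketch of how a shared "connected" status patch might look. The names `ChannelStatus`, `createConnectedStatusPatch`, and `applyStatusPatch` are hypothetical and are not taken from src/gateway/channel-status-patches.ts; the field set is an assumption based on the fields mentioned in this PR (connected, lastEventAt, lastError).

```typescript
// Hypothetical status shape, limited to fields named in this PR.
interface ChannelStatus {
  connected: boolean;
  lastEventAt: number | null; // ms since epoch
  lastError: string | null;
}

type StatusPatch = Partial<ChannelStatus>;

// Build one patch that marks the channel connected, seeds event
// liveness with the connect time, and clears any previous error.
function createConnectedStatusPatch(now: number): StatusPatch {
  return { connected: true, lastEventAt: now, lastError: null };
}

// Merge the patch in a single step so observers never see
// connected: true paired with a stale or missing lastEventAt.
function applyStatusPatch(status: ChannelStatus, patch: StatusPatch): ChannelStatus {
  return { ...status, ...patch };
}

// Example: a channel reconnecting after an earlier failure.
const before: ChannelStatus = { connected: false, lastEventAt: 100, lastError: "boom" };
const after = applyStatusPatch(before, createConnectedStatusPatch(5000));
```

Publishing all three fields in one patch is the point of the sketch: a provider that sets connected first and lastEventAt later leaves a window in which the health monitor sees an active socket with no fresh liveness.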