fix(mattermost): detect stale websocket after bot disable/enable cycle#53604
Conversation
Greptile SummaryThis PR addresses a subtle but real production issue: after a Mattermost bot account is disabled and re-enabled, the existing WebSocket silently stops receiving server-pushed events while TCP and request/reply still work. The solution polls the bot's Key changes:
One logic issue is present: the initial Confidence Score: 4/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: extensions/mattermost/src/mattermost/monitor-websocket.ts
Line: 173-193
Comment:
**Timer leak when WebSocket closes before initial `getBotUpdateAt` resolves**
The initial `getBotUpdateAt()` call is async and fires without awaiting. If the WebSocket closes (or the promise settles via `rejectOnce`/`resolveOnce`) before this Promise resolves, `clearTimers()` runs when `healthCheckTimer` is still `undefined`. Then, when the Promise resolves later, `healthCheckTimer = setInterval(...)` creates a new interval that is **never cleared** — it will keep firing REST requests and potentially calling `ws.terminate()` on an already-dead socket indefinitely.
The fix is to guard the `.then()` body with the `settled` flag before creating the timer:
```suggestion
if (opts.getBotUpdateAt) {
const getBotUpdateAt = opts.getBotUpdateAt;
// Capture initial update_at, then start polling.
getBotUpdateAt()
.then((ts) => {
if (settled) return;
initialUpdateAt = ts;
healthCheckTimer = setInterval(() => {
getBotUpdateAt()
.then((current) => {
if (initialUpdateAt !== undefined && current !== initialUpdateAt) {
opts.runtime.log?.(
`mattermost: bot account updated (update_at changed: ${initialUpdateAt} → ${current}) — reconnecting`,
);
ws.terminate();
}
})
.catch((err) => {
opts.runtime.error?.(`mattermost: health check error: ${String(err)}`);
});
}, healthCheckIntervalMs);
})
.catch((err) => {
opts.runtime.error?.(`mattermost: failed to get initial update_at: ${String(err)}`);
});
}
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(mattermost): detect stale websocket ..." | Re-trigger Greptile |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b7cd989195
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
b7cd989 to
e9bdde2
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e9bdde25fc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
bf120b8 to
651d3fd
Compare
651d3fd to
7a4abe7
Compare
4dfaf9c to
7a029ff
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7a029ff3dd
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (initialUpdateAt === undefined) { | ||
| initialUpdateAt = current; | ||
| return; |
There was a problem hiding this comment.
Initialize update_at baseline from connect time
The health-check logic sets initialUpdateAt from the first successful /users/me poll, which creates a blind spot when that first poll fails. If the bot is disabled/re-enabled during that failure window, the websocket can already be stale, but the later successful poll stores the post-change update_at as baseline and never triggers reconnect. Since startup already fetched bot identity before opening the socket (monitorMattermostProvider), this race can leave event delivery broken indefinitely under transient API errors right after connect.
Useful? React with 👍 / 👎.
7a029ff to
15fc65a
Compare
0f18204 to
699b0bb
Compare
After a bot is disabled and re-enabled in Mattermost, the existing WebSocket silently stops delivering events even though the connection remains alive. Poll the bot's update_at timestamp via REST API every 30 seconds and reconnect when it changes. Also wrap the initial fetchMattermostMe with runWithReconnect so the monitor survives a 401 during bot re-activation.
699b0bb to
818d437
Compare
|
Merged via squash.
Thanks @Qinsam! |
openclaw#53604) Merged via squash. Prepared head SHA: 818d437 Co-authored-by: Qinsam <[email protected]> Co-authored-by: mukhtharcm <[email protected]> Reviewed-by: @mukhtharcm
openclaw#53604) Merged via squash. Prepared head SHA: 818d437 Co-authored-by: Qinsam <[email protected]> Co-authored-by: mukhtharcm <[email protected]> Reviewed-by: @mukhtharcm
openclaw#53604) Merged via squash. Prepared head SHA: 818d437 Co-authored-by: Qinsam <[email protected]> Co-authored-by: mukhtharcm <[email protected]> Reviewed-by: @mukhtharcm
openclaw#53604) Merged via squash. Prepared head SHA: 818d437 Co-authored-by: Qinsam <[email protected]> Co-authored-by: mukhtharcm <[email protected]> Reviewed-by: @mukhtharcm
Problem
When a Mattermost bot account is disabled and then re-enabled, the existing WebSocket connection silently stops delivering events (posted, reaction, etc.) even though:
get_statuses) still receiveseq_replyresponsesThe only thing that breaks is event delivery — the bot appears online but never receives new messages. This requires a manual gateway restart to recover.
Root Cause
Mattermost invalidates the bot's session when the account is disabled. Re-enabling the account creates a new session, but the old WebSocket connection is not notified and continues to operate in a degraded state where only request/reply works but server-pushed events are no longer delivered.
Solution
Poll the bot's
update_attimestamp via REST API every 30 seconds. When the timestamp changes (indicating the account was modified, e.g. disabled/enabled), terminate the WebSocket connection so the reconnect loop can establish a fresh one that receives events properly.This approach was chosen after testing and ruling out several alternatives:
get_statuses): Server still replies even when event delivery is brokenuser_updatedevent detection: The bot's own WebSocket may not receive this event when it's the one being disabledfetchMattermostMeactive check: Returns success after a quick disable/enable since the bot is already active againComparing
update_atreliably detects any account modification regardless of timing.Additionally, the initial
fetchMattermostMecall is now wrapped withrunWithReconnectso the monitor survives a 401 during bot re-activation without exhausting the framework's auto-restart budget.Changes
client.ts: Addupdate_atfield toMattermostUsertypemonitor-websocket.ts: AddgetBotUpdateAtoption that polls the bot'supdate_attimestamp and terminates the connection when it changesmonitor.ts: PassgetBotUpdateAtcallback usingfetchMattermostMe; wrap initial API call withrunWithReconnectfor resiliencemonitor-websocket.test.ts: Tests for update_at change detection, stable connection, and API error handlingTest plan
pnpm test:extension mattermost)pnpm build && pnpm checkpasses