Skip to content

fix(mattermost): detect stale websocket after bot disable/enable cycle#53604

Merged
mukhtharcm merged 4 commits intoopenclaw:mainfrom
Qinsam:fix/mattermost-ws-health-check-v2
Mar 30, 2026
Merged

fix(mattermost): detect stale websocket after bot disable/enable cycle#53604
mukhtharcm merged 4 commits intoopenclaw:mainfrom
Qinsam:fix/mattermost-ws-health-check-v2

Conversation

@Qinsam
Copy link
Copy Markdown
Contributor

@Qinsam Qinsam commented Mar 24, 2026

Problem

When a Mattermost bot account is disabled and then re-enabled, the existing WebSocket connection silently stops delivering events (posted, reaction, etc.) even though:

  • The TCP connection remains alive
  • WebSocket-level ping/pong still works
  • Application-level action requests (get_statuses) still receive seq_reply responses

The only thing that breaks is event delivery — the bot appears online but never receives new messages. This requires a manual gateway restart to recover.

Root Cause

Mattermost invalidates the bot's session when the account is disabled. Re-enabling the account creates a new session, but the old WebSocket connection is not notified and continues to operate in a degraded state where only request/reply works but server-pushed events are no longer delivered.

Solution

Poll the bot's update_at timestamp via REST API every 30 seconds. When the timestamp changes (indicating the account was modified, e.g. disabled/enabled), terminate the WebSocket connection so the reconnect loop can establish a fresh one that receives events properly.

This approach was chosen after testing and ruling out several alternatives:

  • WebSocket ping/pong: Only detects TCP-level death, not silent event delivery failure
  • Application-level ping (get_statuses): Server still replies even when event delivery is broken
  • user_updated event detection: The bot's own WebSocket may not receive this event when it's the one being disabled
  • REST API fetchMattermostMe active check: Returns success after a quick disable/enable since the bot is already active again

Comparing update_at reliably detects any account modification regardless of timing.

Additionally, the initial fetchMattermostMe call is now wrapped with runWithReconnect so the monitor survives a 401 during bot re-activation without exhausting the framework's auto-restart budget.

Changes

  • client.ts: Add update_at field to MattermostUser type
  • monitor-websocket.ts: Add getBotUpdateAt option that polls the bot's update_at timestamp and terminates the connection when it changes
  • monitor.ts: Pass getBotUpdateAt callback using fetchMattermostMe; wrap initial API call with runWithReconnect for resilience
  • monitor-websocket.test.ts: Tests for update_at change detection, stable connection, and API error handling

Test plan

  • Unit tests pass (pnpm test:extension mattermost)
  • pnpm build && pnpm check passes
  • Manual test: send message via Mattermost → bot replies ✓
  • Manual test: disable bot → enable bot → send message → bot recovers and replies ✓
  • Manual test: API temporarily unreachable → logs error but does not terminate connection ✓

@openclaw-barnacle openclaw-barnacle bot added channel: mattermost Channel integration: mattermost size: S labels Mar 24, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 24, 2026

Greptile Summary

This PR addresses a subtle but real production issue: after a Mattermost bot account is disabled and re-enabled, the existing WebSocket silently stops receiving server-pushed events while TCP and request/reply still work. The solution polls the bot's update_at REST field every 30 seconds and terminates the WebSocket when it changes, triggering the existing reconnect loop to establish a fresh, fully-functional connection.

Key changes:

  • monitor-websocket.ts: New getBotUpdateAt / healthCheckIntervalMs options; setInterval-based polling starts on WebSocket open and is cleared on close.
  • monitor.ts: Initial fetchMattermostMe wrapped in runWithReconnect so a 401 during re-activation doesn't exhaust the restart budget; getBotUpdateAt wired to fetchMattermostMe.
  • client.ts: update_at?: number added to MattermostUser.
  • monitor-websocket.test.ts: Three new unit tests covering the happy path (change detected), stable path (no spurious reconnect), and error-resilience path.

One logic issue is present: the initial getBotUpdateAt() call is async (not awaited), so if the WebSocket closes before it resolves, clearTimers() runs when healthCheckTimer is still undefined. The .then() callback then creates the interval after the connection is already settled, producing a timer leak that fires REST requests indefinitely. Adding a if (settled) return; guard at the start of the .then() callback fully fixes this.

Confidence Score: 4/5

  • PR is close to merge-ready; one targeted fix to prevent a timer leak is needed before landing.
  • The overall approach (polling update_at to detect the disable/enable cycle) is well-reasoned and correctly tested. There is one concrete logic bug: if the WebSocket closes in the narrow window before the async initial getBotUpdateAt() call resolves, the setInterval is created after clearTimers() has already run and is never cleaned up, leaking REST requests indefinitely. Adding a single if (settled) return; guard in the .then() callback fully addresses this. All other changes (type addition, runWithReconnect wrapping, tests) are correct and well-structured.
  • extensions/mattermost/src/mattermost/monitor-websocket.ts — the .then() callback on the initial getBotUpdateAt() call needs a settled guard before creating the setInterval.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/mattermost/src/mattermost/monitor-websocket.ts
Line: 173-193

Comment:
**Timer leak when WebSocket closes before initial `getBotUpdateAt` resolves**

The initial `getBotUpdateAt()` call is async and fires without awaiting. If the WebSocket closes (or the promise settles via `rejectOnce`/`resolveOnce`) before this Promise resolves, `clearTimers()` runs when `healthCheckTimer` is still `undefined`. Then, when the Promise resolves later, `healthCheckTimer = setInterval(...)` creates a new interval that is **never cleared** — it will keep firing REST requests and potentially calling `ws.terminate()` on an already-dead socket indefinitely.

The fix is to guard the `.then()` body with the `settled` flag before creating the timer:

```suggestion
          if (opts.getBotUpdateAt) {
            const getBotUpdateAt = opts.getBotUpdateAt;
            // Capture initial update_at, then start polling.
            getBotUpdateAt()
              .then((ts) => {
                if (settled) return;
                initialUpdateAt = ts;
                healthCheckTimer = setInterval(() => {
                  getBotUpdateAt()
                    .then((current) => {
                      if (initialUpdateAt !== undefined && current !== initialUpdateAt) {
                        opts.runtime.log?.(
                          `mattermost: bot account updated (update_at changed: ${initialUpdateAt} → ${current}) — reconnecting`,
                        );
                        ws.terminate();
                      }
                    })
                    .catch((err) => {
                      opts.runtime.error?.(`mattermost: health check error: ${String(err)}`);
                    });
                }, healthCheckIntervalMs);
              })
              .catch((err) => {
                opts.runtime.error?.(`mattermost: failed to get initial update_at: ${String(err)}`);
              });
          }
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(mattermost): detect stale websocket ..." | Re-trigger Greptile

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b7cd989195

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@Qinsam Qinsam force-pushed the fix/mattermost-ws-health-check-v2 branch from b7cd989 to e9bdde2 Compare March 24, 2026 10:08
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e9bdde25fc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@Qinsam Qinsam force-pushed the fix/mattermost-ws-health-check-v2 branch from bf120b8 to 651d3fd Compare March 24, 2026 10:19
@mukhtharcm mukhtharcm self-assigned this Mar 27, 2026
@mukhtharcm mukhtharcm force-pushed the fix/mattermost-ws-health-check-v2 branch from 651d3fd to 7a4abe7 Compare March 27, 2026 15:53
@mukhtharcm mukhtharcm force-pushed the fix/mattermost-ws-health-check-v2 branch 2 times, most recently from 4dfaf9c to 7a029ff Compare March 27, 2026 16:42
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a029ff3dd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +186 to +188
if (initialUpdateAt === undefined) {
initialUpdateAt = current;
return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Initialize update_at baseline from connect time

The health-check logic sets initialUpdateAt from the first successful /users/me poll, which creates a blind spot when that first poll fails. If the bot is disabled/re-enabled during that failure window, the websocket can already be stale, but the later successful poll stores the post-change update_at as baseline and never triggers reconnect. Since startup already fetched bot identity before opening the socket (monitorMattermostProvider), this race can leave event delivery broken indefinitely under transient API errors right after connect.

Useful? React with 👍 / 👎.

@mukhtharcm mukhtharcm force-pushed the fix/mattermost-ws-health-check-v2 branch from 7a029ff to 15fc65a Compare March 30, 2026 02:14
@mukhtharcm mukhtharcm requested a review from a team as a code owner March 30, 2026 02:14
@openclaw-barnacle openclaw-barnacle bot added commands Command implementations agents Agent runtime and tooling labels Mar 30, 2026
@mukhtharcm mukhtharcm force-pushed the fix/mattermost-ws-health-check-v2 branch 2 times, most recently from 0f18204 to 699b0bb Compare March 30, 2026 02:23
Qinsam and others added 4 commits March 30, 2026 02:24
After a bot is disabled and re-enabled in Mattermost, the existing
WebSocket silently stops delivering events even though the connection
remains alive.  Poll the bot's update_at timestamp via REST API every
30 seconds and reconnect when it changes.

Also wrap the initial fetchMattermostMe with runWithReconnect so the
monitor survives a 401 during bot re-activation.
@mukhtharcm mukhtharcm force-pushed the fix/mattermost-ws-health-check-v2 branch from 699b0bb to 818d437 Compare March 30, 2026 02:24
@mukhtharcm mukhtharcm merged commit 47839d3 into openclaw:main Mar 30, 2026
7 checks passed
@mukhtharcm
Copy link
Copy Markdown
Member

Merged via squash.

Thanks @Qinsam!

MonkeyLeeT pushed a commit to MonkeyLeeT/openclaw that referenced this pull request Mar 30, 2026
openclaw#53604)

Merged via squash.

Prepared head SHA: 818d437
Co-authored-by: Qinsam <[email protected]>
Co-authored-by: mukhtharcm <[email protected]>
Reviewed-by: @mukhtharcm
pritchie pushed a commit to pritchie/openclaw that referenced this pull request Mar 30, 2026
openclaw#53604)

Merged via squash.

Prepared head SHA: 818d437
Co-authored-by: Qinsam <[email protected]>
Co-authored-by: mukhtharcm <[email protected]>
Reviewed-by: @mukhtharcm
alexjiang1 pushed a commit to alexjiang1/openclaw that referenced this pull request Mar 31, 2026
openclaw#53604)

Merged via squash.

Prepared head SHA: 818d437
Co-authored-by: Qinsam <[email protected]>
Co-authored-by: mukhtharcm <[email protected]>
Reviewed-by: @mukhtharcm
pgondhi987 pushed a commit to pgondhi987/openclaw that referenced this pull request Mar 31, 2026
openclaw#53604)

Merged via squash.

Prepared head SHA: 818d437
Co-authored-by: Qinsam <[email protected]>
Co-authored-by: mukhtharcm <[email protected]>
Reviewed-by: @mukhtharcm
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling channel: mattermost Channel integration: mattermost commands Command implementations size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants