fix(feishu): supervisor loop to recover from WSClient retry exhaustion #57978

Closed

zhangxin840 wants to merge 4 commits into openclaw:main from zhangxin840:fix/feishu-ws-retry-exhaustion-watchdog

Conversation

@zhangxin840

Bug

Feishu WebSocket stops responding after a network disruption and never self-heals. Reproducible log pattern from a real production incident:

00:33:46  Last healthy heartbeat
00:36:39  [ws] reconnect started (SDK internal retry)
00:36:48  ECONNRESET, "unable to connect after 1 times"
00:37–00:47  ECONNRESET retries 1→24, then silence
01:43     HTTP messages still arriving (network recovered) — but Kim cannot reply
06:49:51  feishu[kim]: starting WebSocket connection...  ← only because config hot-reload fired

Gateway was dead for 6+ hours after the SDK exhausted its retry budget.

Root Cause

WSClient.start() is fire-and-forget: it kicks off an async reconnect loop inside the SDK and returns immediately (confirmed by reading the SDK source). When the server-configured reconnectCount retries are exhausted the SDK sets isConnecting = false and stops silently — no error thrown, no event emitted, no promise rejected.

The previous monitorWebSocket() parked on abortSignal after calling start(). Since start() returns instantly, the function could never observe the silent death. The only observable signal is that getReconnectInfo().lastConnectTime stops advancing.
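
In shape, the old code reduced to the following (a reconstruction from the description above, not the actual diff; waitForAbort is a hypothetical stand-in for the park-on-abort code):

```ts
import * as Lark from "@larksuiteoapi/node-sdk";

declare const eventDispatcher: Lark.EventDispatcher;
declare function waitForAbort(signal: AbortSignal): Promise<void>;

async function monitorWebSocketOld(wsClient: Lark.WSClient, abortSignal: AbortSignal) {
  // Fire-and-forget: start() kicks off the SDK's internal retry loop and returns immediately.
  wsClient.start({ eventDispatcher });
  // Parks until shutdown; the SDK's silent retry exhaustion never wakes this up.
  await waitForAbort(abortSignal);
}
```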

Channel comparison:

| Channel | Supervisor reconnect | Backoff | Stall detection |
| --- | --- | --- | --- |
| Slack | Explicit while loop | Exponential | No |
| Telegram | grammY runner + explicit loop | Exponential | 90 s watchdog |
| Discord | Full lifecycle controller | Exponential | 5 min watchdog |
| Feishu (before) | None — delegates to SDK | None | None |

Fix

Replaces the park-on-abort pattern with a supervisor loop + stall detector in monitorWebSocket() (a condensed sketch follows the numbered list):

  1. runFeishuWSClientUntilDead() — wraps a single WSClient lifetime. Polls getReconnectInfo().lastConnectTime every 10 s. Stall clock starts only after the first non-zero lastConnectTime (so a healthy connection that never needs to reconnect never trips a false stall). After 90 s of no new connect activity → resolves "stalled".

  2. Supervisor while loop — on each stall or error, closes the dead client ({ force: true }), computes backoff, and creates a fresh WSClient. Uses the existing computeBackoff / sleepWithAbort from openclaw/plugin-sdk/infra-runtime (same helpers as Slack/Telegram).

  3. Backoff counter semantics — supervisorAttempt increments on boot-time failures (SDK died before a confirmed connect). It resets to 0 if the SDK made at least one successful connect in the cycle, so a runtime network disruption after a healthy session retries quickly (5 s) rather than accumulating against the boot-time backoff schedule.

  4. Non-recoverable errors — pattern-matched with narrow word-boundary regexes (not broad substring "app_id") to avoid permanently stopping the supervisor on recoverable errors that happen to mention field names.

  5. Cleanup safety — outer try/finally guarantees wsClients / botOpenIds / botNames are cleared even if an unexpected synchronous throw bypasses the per-cycle finally.
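
For orientation, the loop described above reduces to roughly this sketch. The helper names come from the PR text; the types, signatures, and map bookkeeping are illustrative assumptions, not the actual diff:

```ts
type CycleOutcome = "aborted" | "stalled"; // error outcomes omitted for brevity

// Minimal client surface used by this sketch; close({ force: true }) is the shape named above.
interface FeishuWS {
  close(opts: { force: boolean }): void;
}

// Helpers named in the PR description; the signatures here are assumptions.
declare function createFeishuWSClient(account: { id: string }): FeishuWS;
declare function runFeishuWSClientUntilDead(
  wsClient: FeishuWS,
  opts: { abortSignal: AbortSignal; onFirstConnect: () => void },
): Promise<CycleOutcome>;
declare function computeBackoff(attempt: number): number; // openclaw/plugin-sdk/infra-runtime
declare function sleepWithAbort(ms: number, signal: AbortSignal): Promise<void>;

declare const wsClients: Map<string, FeishuWS>;
declare const botOpenIds: Map<string, string>;
declare const botNames: Map<string, string>;

async function superviseFeishuWS(account: { id: string }, abortSignal: AbortSignal) {
  let supervisorAttempt = 0;
  try {
    while (!abortSignal.aborted) {
      const wsClient = createFeishuWSClient(account);
      wsClients.set(account.id, wsClient);
      let hadSuccessfulConnect = false;
      try {
        const outcome = await runFeishuWSClientUntilDead(wsClient, {
          abortSignal,
          onFirstConnect: () => {
            hadSuccessfulConnect = true;
          },
        });
        if (outcome === "aborted") {
          return;
        }
      } finally {
        wsClient.close({ force: true }); // tear down the dead SDK client
        wsClients.delete(account.id);
      }
      // Boot-time failures escalate the backoff; a cycle that connected retries quickly.
      supervisorAttempt = hadSuccessfulConnect ? 0 : supervisorAttempt + 1;
      await sleepWithAbort(computeBackoff(supervisorAttempt), abortSignal);
    }
  } finally {
    // Safety net (item 5): cleared even if a synchronous throw bypasses the per-cycle finally.
    wsClients.delete(account.id);
    botOpenIds.delete(account.id);
    botNames.delete(account.id);
  }
}
```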

Issues addressed from reviews of prior PRs (#52623, #52666)

| Issue | PR | This fix |
| --- | --- | --- |
| Broken backoff: supervisorAttempts = 0 reset fires before SDK can stall | #52623 P1 | Reset tied to onFirstConnect callback, not to fire-and-forget start() return |
| Overly broad "app_id" substring match stops supervisor on recoverable errors | #52623 P2 | Narrowed to /invalid app_id\b/i and three other exact-phrase patterns |
| cleanup() skipped if an unexpected rejection bypasses the per-cycle finally | #52666 P2 | Outer try/finally covers all exit paths |
| SDK timer leak on close() (larksuite/node-sdk#177 / #40451) | both | Documented with tracking comment; upstream issue, not introduced here |
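
For illustration, the narrowed matching from the #52623 P2 fix might look like the sketch below. Only the /invalid app_id\b/i pattern is quoted in this PR; the remaining patterns are left as a placeholder rather than invented:

```ts
// Narrow word-boundary patterns; only the first is quoted in the PR text.
const NON_RECOVERABLE_PATTERNS: RegExp[] = [
  /invalid app_id\b/i,
  // ...the three other exact-phrase patterns are not quoted in the PR
];

function isNonRecoverableError(err: unknown): boolean {
  const message = String(err);
  return NON_RECOVERABLE_PATTERNS.some((re) => re.test(message));
}
```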

Test plan

  • Gateway starts normally with Feishu WebSocket channel enabled
  • Simulate network drop while WS is connected → supervisor detects stall after ~90 s, logs WebSocket stall detected, recreates client
  • After network recovery, WS reconnects without gateway restart
  • Abort signal (gateway stop / config reload) exits cleanly at any point in the loop
  • Invalid appId fast-fails without entering the retry loop

Fixes #52618

🤖 Generated with Claude Code

@openclaw-barnacle openclaw-barnacle Bot added the channel: feishu and size: M labels on Mar 30, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bce01ac235


Comment on lines 319 to 320
botOpenIds.delete(accountId);
botNames.delete(accountId);

P1: Preserve bot identity across supervisor cycles

Do not clear botOpenIds/botNames in the per-cycle finally block, because this block runs on every stall-driven restart, not just shutdown. After the first WebSocket supervisor recycle, bot identity is never repopulated in monitorWebSocket, but downstream handlers still depend on these maps (for example extensions/feishu/src/monitor.account.ts:423/:439), and checkBotMentioned returns false when botOpenId is missing (extensions/feishu/src/bot-content.ts:231-234). In accounts where group replies require mentions (the default unless group policy is open), the bot will stop responding to group messages after the first reconnect cycle.
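
For context, the dependency the comment describes has roughly this shape (reconstructed from the cited behavior, not the actual bot-content.ts source):

```ts
const botOpenIds = new Map<string, string>(); // accountId -> bot open_id

function checkBotMentioned(mentionedOpenIds: string[], accountId: string): boolean {
  const botOpenId = botOpenIds.get(accountId);
  if (!botOpenId) {
    // Identity wiped by the per-cycle finally: group mentions can never match.
    return false;
  }
  return mentionedOpenIds.includes(botOpenId);
}
```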


@greptile-apps

greptile-apps Bot commented Mar 30, 2026

Greptile Summary

This PR replaces the old fire-and-forget monitorWebSocket with a proper supervisor loop that polls getReconnectInfo().lastConnectTime to detect when the Lark SDK silently exhausts its retry budget, then closes the dead client, waits with exponential backoff, and creates a fresh one — mirroring the Slack/Telegram patterns and fixing the 6-hour production outage where the gateway went dark after a network disruption.

The overall approach is solid and directly addresses the bug. A few items worth verifying before merge:

  • Stall detection / backoff semantics depend on lastConnectTime SDK behaviour — the correctness of both the "no false stall on healthy connections" guarantee and the supervisorAttempt reset hinge on whether getReconnectInfo().lastConnectTime stays 0 for sessions that have never needed a reconnect, and whether it advances on failed attempts or only successful ones. If the initial connect updates it, healthy connections would be evicted after 90 s. If failed attempts advance it, the exponential backoff effectively collapses to always 5 s. A short comment citing the confirmed SDK source location would future-proof this.
  • Dead no-op try/catch around createFeishuWSClient (just rethrows) is cleanup debt.
  • Two consecutive JSDoc blocks describing supervisorAttempt should be merged into one.

Confidence Score: 4/5

  • Safe to merge once the lastConnectTime SDK semantics are confirmed; the supervisor loop itself is well-structured and addresses the production bug.
  • The core fix is correct and well-documented. Two P2 style issues (dead try/catch, duplicate comment) are trivial. The logic concern around getReconnectInfo().lastConnectTime semantics is the only item that could represent a real defect — specifically, if lastConnectTime updates on the initial connect then healthy sessions would be evicted every 90 s, and if it updates on failed attempts then the exponential backoff is effectively disabled. Confirming SDK behaviour (with a comment/source reference) before merge is warranted, dropping the score from 5 to 4.
  • extensions/feishu/src/monitor.transport.ts — confirm getReconnectInfo().lastConnectTime semantics (lines 196–233)

Comment on lines +276 to +282
let wsClient: Lark.WSClient;
try {
  wsClient = createFeishuWSClient(account);
} catch (err) {
  // Non-recoverable config error (missing credentials etc.) — bail out.
  throw err;
}

P2: No-op try/catch is dead code

The try/catch around createFeishuWSClient only re-throws without any special handling, so it has no effect. Removing it keeps the intent clearer and avoids the impression that there is error-specific logic here. Any synchronous throw from createFeishuWSClient (e.g. missing credentials) will still propagate out of the while loop and through the outer try/finally to clean up state correctly.

Suggested change:

```suggestion
wsClient = createFeishuWSClient(account);
```

Or, keeping the let declaration:

```ts
let wsClient: Lark.WSClient;
wsClient = createFeishuWSClient(account);
```

Comment on lines +251 to +267
/**
 * Supervisor loop: each iteration creates a fresh WSClient and runs it until
 * either (a) the abort signal fires or (b) the SDK silently exhausts its
 * internal retry budget (stall detected). On stall we close the dead client,
 * wait with exponential backoff, then start a fresh cycle.
 *
 * Backoff attempt counter is incremented only when the SDK stalled; a
 * successful connection resets it so transient disruptions do not accumulate
 * against the backoff schedule.
 */
/**
 * `supervisorAttempt` counts consecutive boot-time failures (cycle stalled
 * before any successful connect). It resets to 0 whenever the SDK records a
 * successful connection, so a healthy network disruption that causes a stall
 * later does not keep accumulating the backoff schedule.
 */
let supervisorAttempt = 0;

P2: Duplicate JSDoc blocks should be merged

There are two back-to-back JSDoc comment blocks describing the supervisor loop and supervisorAttempt. These should be merged into a single block to avoid confusion about which comment applies to which thing.

Suggested change:

```suggestion
/**
 * Supervisor loop: each iteration creates a fresh WSClient and runs it until
 * either (a) the abort signal fires or (b) the SDK silently exhausts its
 * internal retry budget (stall detected). On stall we close the dead client,
 * wait with exponential backoff, then start a fresh cycle.
 *
 * `supervisorAttempt` counts consecutive boot-time failures (cycle stalled
 * before any successful connect). It resets to 0 whenever the SDK records a
 * successful connection, so a runtime network disruption that causes a stall
 * does not accumulate against the boot-time backoff schedule.
 */
let supervisorAttempt = 0;
```

Comment on lines +196 to +233
const stallPoller = setInterval(() => {
  if (abortSignal?.aborted) {
    clearInterval(stallPoller);
    return;
  }

  const { lastConnectTime } = wsClient.getReconnectInfo();

  if (lastConnectTime !== lastSeenConnectTime) {
    // SDK recorded a new connect attempt — reset stall clock.
    const isFirstConnect = lastSeenConnectTime === 0 && lastConnectTime > 0;
    lastSeenConnectTime = lastConnectTime;
    staleSinceMs = Date.now();
    if (isFirstConnect) {
      onFirstConnect?.();
    }
    return;
  }

  // Still waiting for the SDK to make its first attempt: don't start
  // the stall clock yet to avoid false positives on slow-start scenarios.
  if (staleSinceMs === null) {
    return;
  }

  const idleMs = Date.now() - staleSinceMs;
  if (idleMs >= FEISHU_WS_STALL_DETECT_MS) {
    log(
      `feishu[${accountId}]: WebSocket stall detected ` +
        `(no SDK connect activity for ${Math.round(idleMs / 1000)}s); ` +
        `will restart supervisor cycle`,
    );
    clearInterval(stallPoller);
    abortSignal?.removeEventListener("abort", handleAbort);
    resolve("stalled");
  }
}, FEISHU_WS_STALL_POLL_MS);
stallPoller.unref?.();

P2: Stall detection and backoff reset correctness depend on lastConnectTime semantics

The correctness of two closely related invariants hinges on exactly when the Lark SDK updates getReconnectInfo().lastConnectTime:

1. False-stall risk for healthy connections

The comment on FEISHU_WS_STALL_DETECT_MS says "a healthy long-lived connection that has never needed to reconnect does NOT trigger a false stall." That guarantee only holds if lastConnectTime stays 0 for a connection that has never needed to reconnect. Once lastConnectTime becomes non-zero (staleSinceMs is set), a healthy connection that remains stable for 90+ seconds without further SDK activity will satisfy idleMs >= FEISHU_WS_STALL_DETECT_MS and be falsely evicted.

If the SDK sets lastConnectTime on every connection attempt — including the very first one — every successful connection would begin a countdown of at most 90 s before a spurious restart.

2. Backoff reset may fire too eagerly

hadSuccessfulConnect is set to true the first time lastConnectTime becomes non-zero (line 206: isFirstConnect check calls onFirstConnect). If lastConnectTime advances on any connection attempt (including failed ECONNRESET retries), hadSuccessfulConnect becomes true during the first SDK retry cycle even if the connection was never actually established. When that cycle eventually stalls, supervisorAttempt is reset to 0 and the backoff stays at the minimum 5 s forever — the intended distinction between "boot-time failure" and "runtime disruption" disappears.

Suggested verification

It would be worth adding a comment or an inline source reference (e.g. // larksuite/node-sdk src/ws/client.ts:L…) documenting the confirmed SDK semantics: does lastConnectTime advance only on reconnect attempts (i.e. after the first successful connect), or on every connect attempt including the initial one? If the former, both invariants hold as-is; if the latter, the stall guard and the backoff reset both need adjusting. Can you confirm: does getReconnectInfo().lastConnectTime stay 0 for a session that connects initially and is never disrupted (i.e. it only advances during reconnect loops, not the first connect)? And similarly, does it advance only on successful reconnects, or also on failed attempts such as ECONNRESET retries?


## Changes from review feedback

### P1 — Bot identity cleared on supervisor restart (critical bug fix)
`botOpenIds`/`botNames` were previously cleared in the per-cycle `finally`
block, which runs on every stall-driven restart. After the first reconnect
cycle, `checkBotMentioned()` would return false (missing `botOpenId`) and
all group accounts requiring a bot mention would go silent.

Fix: only clear `wsClients` in the per-cycle `finally`. `botOpenIds`/
`botNames` are now only cleared in the outer `finally` (the shutdown path).

### P2.1 — Remove dead no-op try/catch around createFeishuWSClient
The try/catch only re-threw without any special handling. Removed; any
synchronous throw from `createFeishuWSClient` still propagates correctly
through the outer try/finally.

### P2.2 — Merge duplicate JSDoc blocks for supervisor loop
Two consecutive JSDoc comment blocks describing the same supervisor loop
and `supervisorAttempt` have been merged into one.

### P2.3 — Document and fix lastConnectTime SDK semantics
Confirmed by reading SDK source
(@larksuiteoapi/node-sdk/lib/index.js WSClient.reConnect):
`lastConnectTime` is set inside `tryConnect()` which is called on EVERY
connection attempt, including the very first `start()` call — not only on
reconnects. The original code started the stall clock on the first
`lastConnectTime` observation, which would falsely evict healthy
connections after 90 s.

Fix: the stall clock now only starts after the SECOND `lastConnectTime`
change (i.e. the first genuine reconnect attempt after a WebSocket drop).
Healthy connections that never need to reconnect are never stalled.

The constant comment and `runFeishuWSClientUntilDead` JSDoc have been
updated to document these semantics with a source reference.

### Rebase
Rebased onto upstream/main; resolved conflict with the `safeEqualSecret`
import refactor (both imports are now present).

### Tests
Added regression test for P1: verifies `botOpenIds`/`botNames` survive a
supervisor cycle stall+restart and are only cleared on final shutdown.
@zhangxin840 zhangxin840 force-pushed the fix/feishu-ws-retry-exhaustion-watchdog branch from bce01ac to aea91c8 on April 3, 2026 10:28
@zhangxin840
Author

Thanks for the detailed review. All three items have been addressed in the latest push (aea91c8):


P1 — Bot identity cleared on supervisor restart (now fixed)

Confirmed the bug: botOpenIds.delete(accountId) and botNames.delete(accountId) were in the per-cycle finally block, which runs on every stall-driven restart — not only on shutdown. After the first reconnect cycle, checkBotMentioned() would find no botOpenId and group accounts requiring a mention would go silent.

Fix: the per-cycle finally now only calls wsClients.delete(accountId). Bot identity maps are cleared solely in the outer finally (the true shutdown path). Added a regression test in monitor.cleanup.test.ts that uses fake timers to simulate a full stall+restart cycle and asserts that botOpenIds/botNames survive into the next cycle.


lastConnectTime SDK semantics — confirmed and fixed

Read the SDK source at dist/extensions/feishu/node_modules/@larksuiteoapi/node-sdk/lib/index.js (WSClient.reConnect). Confirmed: lastConnectTime is set inside tryConnect(), which is called on every connection attempt — including the very first start() call via reConnect(isStart=true). It is not limited to reconnect-after-drop scenarios.

The original code started the stall clock on the first non-zero lastConnectTime observation. This would falsely evict any healthy long-lived connection after 90 s, contradicting the guarantee stated in the FEISHU_WS_STALL_DETECT_MS comment.

Fix: introduced a connectChangeCount counter. The stall clock now only starts after the second lastConnectTime change — i.e. the first genuine reconnect attempt (triggered by the WebSocket close handler calling reConnect(isStart=false)). The FEISHU_WS_STALL_DETECT_MS JSDoc and runFeishuWSClientUntilDead JSDoc have been updated to document the confirmed SDK behavior with a source reference.

The backoff reset via hadSuccessfulConnect / onFirstConnect is not affected: it fires on the first connect attempt regardless, which is the correct heuristic for distinguishing boot-time failure from runtime disruption.
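
In sketch form, the second-change gate amounts to the following (variable names follow the discussion above; the surrounding poll loop is the code quoted earlier, and everything else here is illustrative):

```ts
// Poll-tick gate: the first lastConnectTime change is the initial start() attempt;
// only the second and later changes indicate a genuine reconnect loop.
let lastSeenConnectTime = 0;
let connectChangeCount = 0;
let staleSinceMs: number | null = null;

function onStallPollTick(lastConnectTime: number): void {
  if (lastConnectTime !== lastSeenConnectTime) {
    lastSeenConnectTime = lastConnectTime;
    connectChangeCount += 1;
    if (connectChangeCount >= 2) {
      // First genuine reconnect attempt after a drop: start (or reset) the stall clock.
      staleSinceMs = Date.now();
    }
  }
}
```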


Dead no-op try/catch — removed

The try { wsClient = createFeishuWSClient(account) } catch (err) { throw err } block has been removed. Any synchronous throw from createFeishuWSClient (e.g. missing credentials) still propagates through the outer try/finally for correct cleanup.


Duplicate JSDoc blocks — merged

The two back-to-back JSDoc blocks describing the supervisor loop and supervisorAttempt have been merged into one.


Rebase

The branch has been rebased onto current upstream/main. The only conflict was the import line: upstream introduced safeEqualSecret from browser-support to replace the inline timingSafeEqualString helper. Both the new upstream import and the PR's computeBackoff/sleepWithAbort imports are now present.


CI failures

  • checks-fast-extensions: failure in extensions/matrix/src/matrix/monitor/events.test.ts (room ID assertion !fallback:example.org vs !dm:example.org) — pre-existing Matrix extension issue, unrelated to this PR.
  • checks-windows-node-test-3, -4, -6: src/tasks/flow-registry.store.test.ts timeout at 120 s — pre-existing Windows runner flake, unrelated to this PR.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aea91c8f73


Comment on lines +295 to +297
onFirstConnect: () => {
  hadSuccessfulConnect = true;
},

P2: Base backoff on confirmed connection, not first attempt

onFirstConnect is documented as firing when the SDK records its first connect attempt, but this callback sets hadSuccessfulConnect = true. In startup outage scenarios (or any cycle where handshake never succeeds), that still resets supervisorAttempt to 0, so computeBackoff keeps using the minimum delay and the supervisor never escalates retry spacing. This defeats the intended exponential backoff for boot-time failures and can create a persistent 5s reconnect storm.


`onFirstConnect` fired on the first connect *attempt*, not after the
WebSocket handshake was confirmed live. In a startup-outage scenario this
caused `hadSuccessfulConnect` to be set prematurely, which reset
`supervisorAttempt` to 0 on every stall cycle and killed exponential
backoff.

Fix: replace `onFirstConnect` with `onConfirmedConnect`. The poll loop now
accumulates a `connectedAndStableCount` each tick where `lastConnectTime`
is non-zero, unchanged, and the stall clock has not started. Only after
`FEISHU_WS_STABLE_TICKS_REQUIRED` (2) consecutive stable ticks (~20 s of
confirmed liveness) does it fire `onConfirmedConnect` and set
`hadSuccessfulConnect = true`. The existing `connectChangeCount`/stall
clock logic is unchanged.

Add two regression tests to `monitor.cleanup.test.ts`:
- startup-outage: reconnect fires before stable ticks accumulate →
  `supervisorAttempt >= 1` (backoff applies)
- normal recovery: stable ticks accumulate before any reconnect →
  `supervisorAttempt == 0` (fast recovery)

Addresses Codex review PRR_kwDOQb6kR87xtZOm.
@zhangxin840
Author

Re: Codex review PRR_kwDOQb6kR87xtZOm (P2 — base backoff on confirmed connection, not first attempt)

Fixed in ba25f3d.

Replaced onFirstConnect with onConfirmedConnect. The poll loop now tracks connectedAndStableCount: incremented each tick where lastConnectTime is non-zero, stable (unchanged from last tick), and the stall clock has not started. hadSuccessfulConnect is set to true only after FEISHU_WS_STABLE_TICKS_REQUIRED = 2 consecutive stable ticks (~20 s of confirmed liveness).

In the startup-outage scenario you described: the first attempt fires and lastConnectTime advances, but before two stable ticks accumulate the SDK moves to a reconnect attempt — connectedAndStableCount resets to 0, onConfirmedConnect never fires, hadSuccessfulConnect stays false, and supervisorAttempt keeps incrementing → full exponential backoff applies correctly.

The connectChangeCount/stall-clock logic from the previous commit is unchanged. Two new regression tests in monitor.cleanup.test.ts cover both paths:

  • Startup outage: reconnect fires before stable ticks accumulate → supervisorAttempt >= 1
  • Normal recovery: stable ticks accumulate before any reconnect → supervisorAttempt == 0

Re: CI failure checks-node-channels-3 — WhatsApp extension test

The failing test extensions/whatsapp/src/action-runtime.test.ts > applies default account allowFrom when accountId is omitted asserts status: 403 on a ToolAuthorizationError, but that class only sets this.name — it has no status property in the current codebase (extensions/whatsapp/src/runtime-api.ts:50). This test failure is pre-existing on upstream main and is completely unrelated to the Feishu supervisor changes in this PR. No fix attempted here.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ba25f3d0d0


Comment on lines +249 to +252
if (!isFirstChange) {
  // Genuine reconnect attempt: start (or reset) the stall clock.
  staleSinceMs = Date.now();
}

P1: Reset stall clock after reconnect recovery

Starting staleSinceMs on every post-initial lastConnectTime change without ever clearing it means any successful reconnect is treated as a pending stall forever. After one transient reconnect, lastConnectTime normally stays unchanged during healthy operation, so idleMs will cross 90s and force a supervisor restart even though the socket has recovered; this can cause repeated teardown/reconnect loops on healthy sessions after the first network blip.


…te check

The stall detector starts a 90 s clock whenever `lastConnectTime` advances
after the initial connect (indicating the SDK entered a reconnect loop). But
`lastConnectTime` is set on every `tryConnect()` call regardless of success —
so a successful reconnect also sets the clock, causing a false supervisor
restart 90 s later even on a fully recovered connection.

The core problem: `lastConnectTime` alone cannot distinguish "SDK reconnected
successfully" from "SDK exhausted its retry budget." The only reliable signal
is the underlying WebSocket's `readyState`.

Fix: after `FEISHU_WS_STABLE_TICKS_REQUIRED` consecutive stable poll ticks
while a stall clock is running, check `wsClient.wsConfig?.getWSInstance()?.
readyState`. If it is 1 (OPEN) the reconnect succeeded — clear `staleSinceMs`
so the healthy session is not falsely evicted. If the WebSocket is CLOSED /
CONNECTING / unavailable, the clock continues and the stall fires normally.

The private `wsConfig` accessor is wrapped in try/catch so any future SDK
refactor degrades gracefully back to the existing behaviour.

Tests added:
- "clears stall clock after successful reconnect" — readyState OPEN, no
  false stall after 90 s of healthy operation
- "stall fires when reconnect socket is not OPEN" — readyState CLOSED,
  stall detected and second supervisor cycle started

Addresses Codex P1 in openclaw#57978 review 4055460051.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@zhangxin840
Author

Addressing Codex P1 (review #4055460051)

Root cause of the false-stall issue: lastConnectTime is set inside tryConnect() on every connection attempt (confirmed in SDK source at WSClient.reConnect → tryConnect). This means it's indistinguishable whether a stable lastConnectTime reflects "SDK reconnected successfully" or "SDK exhausted retries and stopped." Simple stable-tick counting can't disambiguate the two cases.

Fix (b05a8320f8): After FEISHU_WS_STABLE_TICKS_REQUIRED consecutive stable poll ticks while the stall clock is running, we check wsClient.wsConfig?.getWSInstance()?.readyState. If the underlying WebSocket is OPEN (readyState === 1), the reconnect succeeded — we clear staleSinceMs. If it's CLOSED/CONNECTING or the accessor is unavailable (SDK refactor safeguard), the stall clock continues and fires normally.

Tradeoffs documented:

  • Accesses the non-public wsConfig property — wrapped in try/catch so future SDK changes degrade gracefully to the original behaviour
  • The readyState check is gated on FEISHU_WS_STABLE_TICKS_REQUIRED ticks (~20 s) to avoid clearing prematurely while a handshake is still in progress

Tests added:

  • "clears stall clock after successful reconnect" — readyState OPEN + stable → no false supervisor restart after 90 s
  • "stall fires when reconnect socket is not OPEN" — readyState CLOSED → stall detected and second supervisor cycle started

All 8 cleanup tests pass.
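
A sketch of the probe described above; the wsConfig / getWSInstance accessors are the non-public SDK internals named in the comment, and the rest is illustrative:

```ts
type MaybeWsClient = {
  wsConfig?: { getWSInstance?: () => { readyState: number } | undefined };
};

// Returns the underlying socket's readyState, or null when unobservable
// (accessor missing after an SDK refactor, or a test double with no real socket).
function getWsReadyState(wsClient: unknown): number | null {
  try {
    const ws = (wsClient as MaybeWsClient).wsConfig?.getWSInstance?.();
    return ws?.readyState ?? null;
  } catch {
    return null;
  }
}

// In the poll tick, once enough stable ticks accumulate while a stall clock is
// running (sketch; locals come from the quoted poll loop):
//   if (staleSinceMs !== null && stableTicks >= FEISHU_WS_STABLE_TICKS_REQUIRED) {
//     if (getWsReadyState(wsClient) === 1 /* OPEN */) {
//       staleSinceMs = null; // reconnect succeeded; don't evict a healthy session
//     }
//   }
```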


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b05a8320f8


Comment on lines +263 to +267
if (!confirmedConnectFired && lastConnectTime !== 0 && staleSinceMs === null) {
  connectedAndStableCount += 1;
  if (connectedAndStableCount >= FEISHU_WS_STABLE_TICKS_REQUIRED) {
    confirmedConnectFired = true;
    onConfirmedConnect?.();

P2: Gate confirmed-connect on actual socket open

This marks a cycle as “successfully connected” after two unchanged lastConnectTime polls, but that can happen even when the handshake never succeeds (e.g., startup outage where the SDK is waiting for its next retry). In that case hadSuccessfulConnect becomes true and the supervisor resets supervisorAttempt to 0, so boot-time failures keep retrying at minimum backoff instead of escalating. Fresh evidence: the upstream WS client defaults include a reconnect nonce delay up to 30s, which is longer than this 20s confirmation window, so this false-positive path is reachable without any successful connection.


…ay fix)

The SDK's reconnect nonce delay can be up to 30 s. During this window
`lastConnectTime` is already non-zero (set by the failed initial attempt)
and stable, so FEISHU_WS_STABLE_TICKS_REQUIRED (2 ticks ≈ 20 s) is reached
before the nonce expires. This triggered `onConfirmedConnect` falsely, setting
`hadSuccessfulConnect = true` even though no connection was established —
causing boot-time failures to reset `supervisorAttempt` to 0 and preventing
exponential backoff escalation.

Fix: gate `onConfirmedConnect` on the same `getWsReadyState()` check added in
the previous commit. If `readyState === 1` (OPEN) or is not observable (null,
for tests/future SDK changes — backward-compatible fallback), fire the callback.
If readyState is known but not OPEN (e.g. 0=CONNECTING during nonce wait),
defer confirmation until the socket is actually open.

`getWsReadyState()` is extracted as a shared helper used by both the
confirmed-connect path and the post-reconnect stall-clock clear path.

Regression test added: "does not confirm connection during nonce delay" —
verifies that readyState CONNECTING blocks onConfirmedConnect and allows
supervisorAttempt to increment on subsequent stall.

Addresses Codex P2 in openclaw#57978 (2026-04-06 review comment on
line 267 of monitor.transport.ts).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@zhangxin840
Author

Addressing Codex P2 (2026-04-06, line 267)

Issue: onConfirmedConnect could fire falsely during the SDK's reconnect nonce delay (up to 30 s). After a failed initial connect, lastConnectTime is already non-zero and stable — so FEISHU_WS_STABLE_TICKS_REQUIRED (2 ticks ≈ 20 s) is reached before the nonce expires. This set hadSuccessfulConnect = true with no actual connection, preventing exponential backoff from escalating on boot-time failures.

Fix (99d18b4893): Gate onConfirmedConnect on the same getWsReadyState() helper introduced in the previous commit. Confirmation fires only when readyState === 1 (OPEN) or when the accessor is unavailable (null — backward-compatible fallback to stable-tick behavior for tests/future SDK changes). If readyState is known but not OPEN (e.g., 0=CONNECTING during nonce wait), the callback is deferred until the socket is actually open.

getWsReadyState() is now a shared helper covering both: confirmed-connect detection and post-reconnect stall-clock clearing.

Regression test added: "does not confirm connection during nonce delay" — readyState = 0 (CONNECTING) blocks onConfirmedConnect for 2 stable ticks; after the nonce fires and all retries stall, computeBackoff is called with attempt ≥ 1.

All 9 cleanup tests pass.
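
Combined with the earlier stable-tick counting, the confirmation branch from the quoted poll loop plausibly becomes (sketch only; the locals are the poll-loop variables quoted above):

```ts
// WebSocket readyState: 0 = CONNECTING, 1 = OPEN.
if (!confirmedConnectFired && lastConnectTime !== 0 && staleSinceMs === null) {
  connectedAndStableCount += 1;
  const readyState = getWsReadyState(wsClient); // null => unobservable, fall back to stable ticks
  const socketOpen = readyState === 1 || readyState === null;
  if (connectedAndStableCount >= FEISHU_WS_STABLE_TICKS_REQUIRED && socketOpen) {
    confirmedConnectFired = true;
    onConfirmedConnect?.();
  }
}
```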

@clawsweeper

clawsweeper Bot commented Apr 28, 2026

Thanks for the context here. I swept through the related work, and this PR is now a duplicate or superseded.

Close as superseded by the maintainer-labeled Feishu recovery PR #73739. Current main still lacks post-start SDK retry-exhaustion recovery, but the same #52618 work is now tracked in a current-main replacement path, while this branch is dirty/stale and adds raw supervisor error logging that should not be carried forward.

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Review details

Best possible solution:

Use #52618 and #73739 as the canonical path: land a current-main Feishu WebSocket supervisor that preserves heartbeat/proxy/async client behavior, uses public SDK lifecycle callbacks where possible, keeps redacted logging, and closes older duplicate PRs after the replacement is resolved.

Do we have a high-confidence way to reproduce the issue?

Yes. The PR body and #52618 provide concrete reproduction steps, and current main source confirms the failure mode: after wsClient.start() returns, OpenClaw waits only for abort and has no post-start signal to recreate a dead SDK client.

Is this the best way to solve the issue?

No. The desired product direction is right, but this branch is no longer the best merge path because #73739 tracks the same fix against the newer Feishu transport and preserves current redaction/lifecycle behavior.

Security review:

Security review needs attention: The diff adds raw error logging in Feishu WebSocket supervisor paths, which can expose credentials or bearer tokens in logs.

  • [medium] Raw supervisor errors may leak secrets — extensions/feishu/src/monitor.transport.ts:401
    The PR logs String(err) for non-recoverable and cycle errors. Current main redacts WebSocket error messages, and the replacement PR keeps that redaction; this branch should not be merged without applying the same sanitizer (an illustrative shape is sketched below).
    Confidence: 0.86
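
An illustrative sanitizer shape (hypothetical; the actual redaction helper on main is not quoted in this thread):

```ts
// Hypothetical redaction before logging supervisor errors; the patterns are examples only.
function redactWsError(err: unknown): string {
  return String(err)
    .replace(/(Bearer\s+)[\w.~+/-]+/gi, "$1[REDACTED]")
    .replace(/(app_secret["']?\s*[:=]\s*["']?)[\w-]+/gi, "$1[REDACTED]");
}

// Usage: log(`feishu[${accountId}]: supervisor cycle error: ${redactWsError(err)}`);
```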


Likely related people:

  • vincentkoc: Authored the recent mainline Feishu WebSocket reconnect/heartbeat repair and owns the maintainer-labeled replacement PR for the same retry-exhaustion behavior. (role: recent maintainer and likely follow-up owner; confidence: high; commits: d93e6f61580a, 319aa2f1fe67, 1faf660c3482; files: extensions/feishu/src/monitor.transport.ts, extensions/feishu/src/client.ts, extensions/feishu/src/monitor.cleanup.test.ts)
  • steipete: Recent API history shows repeated Feishu monitor/account hardening and refactors around webhook, monitor startup, runtime seams, and lifecycle handling. (role: recent Feishu monitor maintainer; confidence: medium; commits: f46bd2e0cc17, 8be3a4466cbc, 77d04a39d87a; files: extensions/feishu/src/monitor.transport.ts, extensions/feishu/src/monitor.account.ts, extensions/feishu/src/client.ts)
  • schumilin: Authored the mainline Feishu WebSocket monitor stop/abort cleanup change and cleanup tests that this PR builds on. (role: introduced adjacent cleanup behavior; confidence: medium; commits: bd4237c16c2c; files: extensions/feishu/src/monitor.transport.ts, extensions/feishu/src/monitor.cleanup.test.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 52e2d4e16a29.


Labels

channel: feishu, size: L


Development

Successfully merging this pull request may close these issues.

fix(feishu): WebSocket connection not recovered after network disruption
