fix(feishu): supervisor loop to recover from WSClient retry exhaustion #57978

Closed

zhangxin840 wants to merge 4 commits into openclaw:main from zhangxin840:fix/feishu-ws-retry-exhaustion-watchdog

Conversation

@zhangxin840

Bug

Feishu WebSocket stops responding after a network disruption and never self-heals. Reproducible log pattern from a real production incident:

00:33:46  Last healthy heartbeat
00:36:39  [ws] reconnect started (SDK internal retry)
00:36:48  ECONNRESET, "unable to connect after 1 times"
00:37–00:47  ECONNRESET retries 1→24, then silence
01:43     HTTP messages still arriving (network recovered) — but Kim cannot reply
06:49:51  feishu[kim]: starting WebSocket connection...  ← only because config hot-reload fired

Gateway was dead for 6+ hours after the SDK exhausted its retry budget.

Root Cause

WSClient.start() is fire-and-forget: it kicks off an async reconnect loop inside the SDK and returns immediately (confirmed by reading the SDK source). When the server-configured reconnectCount retries are exhausted the SDK sets isConnecting = false and stops silently — no error thrown, no event emitted, no promise rejected.

The previous monitorWebSocket() parked on abortSignal after calling start(). Since start() returns instantly, the function could never observe the silent death. The only observable signal is that getReconnectInfo().lastConnectTime stops advancing.
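
In shape, the old code reduced to the following (a reconstruction from the description above, not the actual diff; waitForAbort is a hypothetical stand-in for the park-on-abort code):

```ts
import * as Lark from "@larksuiteoapi/node-sdk";

declare const eventDispatcher: Lark.EventDispatcher;
declare function waitForAbort(signal: AbortSignal): Promise<void>;

async function monitorWebSocketOld(wsClient: Lark.WSClient, abortSignal: AbortSignal) {
  // Fire-and-forget: start() kicks off the SDK's internal retry loop and returns immediately.
  wsClient.start({ eventDispatcher });
  // Parks until shutdown; the SDK's silent retry exhaustion never wakes this up.
  await waitForAbort(abortSignal);
}
```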

Channel comparison:

| Channel | Supervisor reconnect | Backoff | Stall detection |
| --- | --- | --- | --- |
| Slack | Explicit while loop | Exponential | No |
| Telegram | grammY runner + explicit loop | Exponential | 90 s watchdog |
| Discord | Full lifecycle controller | Exponential | 5 min watchdog |
| Feishu (before) | None — delegates to SDK | None | None |

Fix

Replaces the park-on-abort pattern with a supervisor loop + stall detector in monitorWebSocket() (a condensed sketch follows the numbered list):

  1. runFeishuWSClientUntilDead() — wraps a single WSClient lifetime. Polls getReconnectInfo().lastConnectTime every 10 s. Stall clock starts only after the first non-zero lastConnectTime (so a healthy connection that never needs to reconnect never trips a false stall). After 90 s of no new connect activity → resolves "stalled".

  2. Supervisor while loop — on each stall or error, closes the dead client ({ force: true }), computes backoff, and creates a fresh WSClient. Uses the existing computeBackoff / sleepWithAbort from openclaw/plugin-sdk/infra-runtime (same helpers as Slack/Telegram).

  3. Backoff counter semantics — supervisorAttempt increments on boot-time failures (SDK died before a confirmed connect). It resets to 0 if the SDK made at least one successful connect in the cycle, so a runtime network disruption after a healthy session retries quickly (5 s) rather than accumulating against the boot-time backoff schedule.

  4. Non-recoverable errors — pattern-matched with narrow word-boundary regexes (not broad substring "app_id") to avoid permanently stopping the supervisor on recoverable errors that happen to mention field names.

  5. Cleanup safety — outer try/finally guarantees wsClients / botOpenIds / botNames are cleared even if an unexpected synchronous throw bypasses the per-cycle finally.
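
For orientation, the loop described above reduces to roughly this sketch. The helper names come from the PR text; the types, signatures, and map bookkeeping are illustrative assumptions, not the actual diff:

```ts
type CycleOutcome = "aborted" | "stalled"; // error outcomes omitted for brevity

// Minimal client surface used by this sketch; close({ force: true }) is the shape named above.
interface FeishuWS {
  close(opts: { force: boolean }): void;
}

// Helpers named in the PR description; the signatures here are assumptions.
declare function createFeishuWSClient(account: { id: string }): FeishuWS;
declare function runFeishuWSClientUntilDead(
  wsClient: FeishuWS,
  opts: { abortSignal: AbortSignal; onFirstConnect: () => void },
): Promise<CycleOutcome>;
declare function computeBackoff(attempt: number): number; // openclaw/plugin-sdk/infra-runtime
declare function sleepWithAbort(ms: number, signal: AbortSignal): Promise<void>;

declare const wsClients: Map<string, FeishuWS>;
declare const botOpenIds: Map<string, string>;
declare const botNames: Map<string, string>;

async function superviseFeishuWS(account: { id: string }, abortSignal: AbortSignal) {
  let supervisorAttempt = 0;
  try {
    while (!abortSignal.aborted) {
      const wsClient = createFeishuWSClient(account);
      wsClients.set(account.id, wsClient);
      let hadSuccessfulConnect = false;
      try {
        const outcome = await runFeishuWSClientUntilDead(wsClient, {
          abortSignal,
          onFirstConnect: () => {
            hadSuccessfulConnect = true;
          },
        });
        if (outcome === "aborted") {
          return;
        }
      } finally {
        wsClient.close({ force: true }); // tear down the dead SDK client
        wsClients.delete(account.id);
      }
      // Boot-time failures escalate the backoff; a cycle that connected retries quickly.
      supervisorAttempt = hadSuccessfulConnect ? 0 : supervisorAttempt + 1;
      await sleepWithAbort(computeBackoff(supervisorAttempt), abortSignal);
    }
  } finally {
    // Safety net (item 5): cleared even if a synchronous throw bypasses the per-cycle finally.
    wsClients.delete(account.id);
    botOpenIds.delete(account.id);
    botNames.delete(account.id);
  }
}
```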

Issues addressed from reviews of prior PRs (#52623, #52666)

| Issue | PR | This fix |
| --- | --- | --- |
| Broken backoff: supervisorAttempts = 0 reset fires before SDK can stall | #52623 P1 | Reset tied to onFirstConnect callback, not to fire-and-forget start() return |
| Overly broad "app_id" substring match stops supervisor on recoverable errors | #52623 P2 | Narrowed to /invalid app_id\b/i and three other exact-phrase patterns |
| cleanup() skipped if an unexpected rejection bypasses the per-cycle finally | #52666 P2 | Outer try/finally covers all exit paths |
| SDK timer leak on close() (larksuite/node-sdk#177 / #40451) | both | Documented with tracking comment; upstream issue, not introduced here |
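
For illustration, the narrowed matching from the #52623 P2 fix might look like the sketch below. Only the /invalid app_id\b/i pattern is quoted in this PR; the remaining patterns are left as a placeholder rather than invented:

```ts
// Narrow word-boundary patterns; only the first is quoted in the PR text.
const NON_RECOVERABLE_PATTERNS: RegExp[] = [
  /invalid app_id\b/i,
  // ...the three other exact-phrase patterns are not quoted in the PR
];

function isNonRecoverableError(err: unknown): boolean {
  const message = String(err);
  return NON_RECOVERABLE_PATTERNS.some((re) => re.test(message));
}
```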

Test plan

  • Gateway starts normally with Feishu WebSocket channel enabled
  • Simulate network drop while WS is connected → supervisor detects stall after ~90 s, logs WebSocket stall detected, recreates client
  • After network recovery, WS reconnects without gateway restart
  • Abort signal (gateway stop / config reload) exits cleanly at any point in the loop
  • Invalid appId fast-fails without entering the retry loop

Fixes #52618

🤖 Generated with Claude Code

@openclaw-barnacle openclaw-barnacle Bot added the channel: feishu and size: M labels on Mar 30, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bce01ac235


Comment on lines 319 to 320
botOpenIds.delete(accountId);
botNames.delete(accountId);

P1: Preserve bot identity across supervisor cycles

Do not clear botOpenIds/botNames in the per-cycle finally block, because this block runs on every stall-driven restart, not just shutdown. After the first WebSocket supervisor recycle, bot identity is never repopulated in monitorWebSocket, but downstream handlers still depend on these maps (for example extensions/feishu/src/monitor.account.ts:423/:439), and checkBotMentioned returns false when botOpenId is missing (extensions/feishu/src/bot-content.ts:231-234). In accounts where group replies require mentions (the default unless group policy is open), the bot will stop responding to group messages after the first reconnect cycle.
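
For context, the dependency the comment describes has roughly this shape (reconstructed from the cited behavior, not the actual bot-content.ts source):

```ts
const botOpenIds = new Map<string, string>(); // accountId -> bot open_id

function checkBotMentioned(mentionedOpenIds: string[], accountId: string): boolean {
  const botOpenId = botOpenIds.get(accountId);
  if (!botOpenId) {
    // Identity wiped by the per-cycle finally: group mentions can never match.
    return false;
  }
  return mentionedOpenIds.includes(botOpenId);
}
```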


@greptile-apps

greptile-apps Bot commented Mar 30, 2026

Greptile Summary

This PR replaces the old fire-and-forget monitorWebSocket with a proper supervisor loop that polls getReconnectInfo().lastConnectTime to detect when the Lark SDK silently exhausts its retry budget, then closes the dead client, waits with exponential backoff, and creates a fresh one — mirroring the Slack/Telegram patterns and fixing the 6-hour production outage where the gateway went dark after a network disruption.

The overall approach is solid and directly addresses the bug. A few items worth verifying before merge:

  • Stall detection / backoff semantics depend on lastConnectTime SDK behaviour — the correctness of both the "no false stall on healthy connections" guarantee and the supervisorAttempt reset hinge on whether getReconnectInfo().lastConnectTime stays 0 for sessions that have never needed a reconnect, and whether it advances on failed attempts or only successful ones. If the initial connect updates it, healthy connections would be evicted after 90 s. If failed attempts advance it, the exponential backoff effectively collapses to always 5 s. A short comment citing the confirmed SDK source location would future-proof this.
  • Dead no-op try/catch around createFeishuWSClient (just rethrows) is cleanup debt.
  • Two consecutive JSDoc blocks describing supervisorAttempt should be merged into one.

Confidence Score: 4/5

  • Safe to merge once the lastConnectTime SDK semantics are confirmed; the supervisor loop itself is well-structured and addresses the production bug.
  • The core fix is correct and well-documented. Two P2 style issues (dead try/catch, duplicate comment) are trivial. The logic concern around getReconnectInfo().lastConnectTime semantics is the only item that could represent a real defect — specifically, if lastConnectTime updates on the initial connect then healthy sessions would be evicted every 90 s, and if it updates on failed attempts then the exponential backoff is effectively disabled. Confirming SDK behaviour (with a comment/source reference) before merge is warranted, dropping the score from 5 to 4.
  • extensions/feishu/src/monitor.transport.ts — confirm getReconnectInfo().lastConnectTime semantics (lines 196–233)

Comment on lines +276 to +282
let wsClient: Lark.WSClient;
try {
  wsClient = createFeishuWSClient(account);
} catch (err) {
  // Non-recoverable config error (missing credentials etc.) — bail out.
  throw err;
}

P2: No-op try/catch is dead code

The try/catch around createFeishuWSClient only re-throws without any special handling, so it has no effect. Removing it keeps the intent clearer and avoids the impression that there is error-specific logic here. Any synchronous throw from createFeishuWSClient (e.g. missing credentials) will still propagate out of the while loop and through the outer try/finally to clean up state correctly.

Suggested change:

```suggestion
wsClient = createFeishuWSClient(account);
```

Or, keeping the let declaration:

```ts
let wsClient: Lark.WSClient;
wsClient = createFeishuWSClient(account);
```

Comment on lines +251 to +267
/**
 * Supervisor loop: each iteration creates a fresh WSClient and runs it until
 * either (a) the abort signal fires or (b) the SDK silently exhausts its
 * internal retry budget (stall detected). On stall we close the dead client,
 * wait with exponential backoff, then start a fresh cycle.
 *
 * Backoff attempt counter is incremented only when the SDK stalled; a
 * successful connection resets it so transient disruptions do not accumulate
 * against the backoff schedule.
 */
/**
 * `supervisorAttempt` counts consecutive boot-time failures (cycle stalled
 * before any successful connect). It resets to 0 whenever the SDK records a
 * successful connection, so a healthy network disruption that causes a stall
 * later does not keep accumulating the backoff schedule.
 */
let supervisorAttempt = 0;

P2: Duplicate JSDoc blocks should be merged

There are two back-to-back JSDoc comment blocks describing the supervisor loop and supervisorAttempt. These should be merged into a single block to avoid confusion about which comment applies to which thing.

Suggested change:

```suggestion
/**
 * Supervisor loop: each iteration creates a fresh WSClient and runs it until
 * either (a) the abort signal fires or (b) the SDK silently exhausts its
 * internal retry budget (stall detected). On stall we close the dead client,
 * wait with exponential backoff, then start a fresh cycle.
 *
 * `supervisorAttempt` counts consecutive boot-time failures (cycle stalled
 * before any successful connect). It resets to 0 whenever the SDK records a
 * successful connection, so a runtime network disruption that causes a stall
 * does not accumulate against the boot-time backoff schedule.
 */
let supervisorAttempt = 0;
```

Comment on lines +196 to +233
const stallPoller = setInterval(() => {
  if (abortSignal?.aborted) {
    clearInterval(stallPoller);
    return;
  }

  const { lastConnectTime } = wsClient.getReconnectInfo();

  if (lastConnectTime !== lastSeenConnectTime) {
    // SDK recorded a new connect attempt — reset stall clock.
    const isFirstConnect = lastSeenConnectTime === 0 && lastConnectTime > 0;
    lastSeenConnectTime = lastConnectTime;
    staleSinceMs = Date.now();
    if (isFirstConnect) {
      onFirstConnect?.();
    }
    return;
  }

  // Still waiting for the SDK to make its first attempt: don't start
  // the stall clock yet to avoid false positives on slow-start scenarios.
  if (staleSinceMs === null) {
    return;
  }

  const idleMs = Date.now() - staleSinceMs;
  if (idleMs >= FEISHU_WS_STALL_DETECT_MS) {
    log(
      `feishu[${accountId}]: WebSocket stall detected ` +
        `(no SDK connect activity for ${Math.round(idleMs / 1000)}s); ` +
        `will restart supervisor cycle`,
    );
    clearInterval(stallPoller);
    abortSignal?.removeEventListener("abort", handleAbort);
    resolve("stalled");
  }
}, FEISHU_WS_STALL_POLL_MS);
stallPoller.unref?.();

P2: Stall detection and backoff reset correctness depend on lastConnectTime semantics

The correctness of two closely related invariants hinges on exactly when the Lark SDK updates getReconnectInfo().lastConnectTime:

1. False-stall risk for healthy connections

The comment on FEISHU_WS_STALL_DETECT_MS says "a healthy long-lived connection that has never needed to reconnect does NOT trigger a false stall." That guarantee only holds if lastConnectTime stays 0 for a connection that has never needed to reconnect. Once lastConnectTime becomes non-zero (staleSinceMs is set), a healthy connection that remains stable for 90+ seconds without further SDK activity will satisfy idleMs >= FEISHU_WS_STALL_DETECT_MS and be falsely evicted.

If the SDK sets lastConnectTime on every connection attempt — including the very first one — every successful connection would begin a countdown of at most 90 s before a spurious restart.

2. Backoff reset may fire too eagerly

hadSuccessfulConnect is set to true the first time lastConnectTime becomes non-zero (line 206: isFirstConnect check calls onFirstConnect). If lastConnectTime advances on any connection attempt (including failed ECONNRESET retries), hadSuccessfulConnect becomes true during the first SDK retry cycle even if the connection was never actually established. When that cycle eventually stalls, supervisorAttempt is reset to 0 and the backoff stays at the minimum 5 s forever — the intended distinction between "boot-time failure" and "runtime disruption" disappears.

Suggested verification

It would be worth adding a comment or an inline source reference (e.g. // larksuite/node-sdk src/ws/client.ts:L…) documenting the confirmed SDK semantics: does lastConnectTime advance only on reconnect attempts (i.e. after the first successful connect), or on every connect attempt including the initial one? If the former, both invariants hold as-is; if the latter, the stall guard and the backoff reset both need adjusting. Can you confirm: does getReconnectInfo().lastConnectTime stay 0 for a session that connects initially and is never disrupted (i.e. it only advances during reconnect loops, not the first connect)? And similarly, does it advance only on successful reconnects, or also on failed attempts such as ECONNRESET retries?


## Changes from review feedback

### P1 — Bot identity cleared on supervisor restart (critical bug fix)
`botOpenIds`/`botNames` were previously cleared in the per-cycle `finally`
block, which runs on every stall-driven restart. After the first reconnect
cycle, `checkBotMentioned()` would return false (missing `botOpenId`) and
all group accounts requiring a bot mention would go silent.

Fix: only clear `wsClients` in the per-cycle `finally`. `botOpenIds`/
`botNames` are now only cleared in the outer `finally` (the shutdown path).

### P2.1 — Remove dead no-op try/catch around createFeishuWSClient
The try/catch only re-threw without any special handling. Removed; any
synchronous throw from `createFeishuWSClient` still propagates correctly
through the outer try/finally.

### P2.2 — Merge duplicate JSDoc blocks for supervisor loop
Two consecutive JSDoc comment blocks describing the same supervisor loop
and `supervisorAttempt` have been merged into one.

### P2.3 — Document and fix lastConnectTime SDK semantics
Confirmed by reading SDK source
(@larksuiteoapi/node-sdk/lib/index.js WSClient.reConnect):
`lastConnectTime` is set inside `tryConnect()` which is called on EVERY
connection attempt, including the very first `start()` call — not only on
reconnects. The original code started the stall clock on the first
`lastConnectTime` observation, which would falsely evict healthy
connections after 90 s.

Fix: the stall clock now only starts after the SECOND `lastConnectTime`
change (i.e. the first genuine reconnect attempt after a WebSocket drop).
Healthy connections that never need to reconnect are never stalled.

The constant comment and `runFeishuWSClientUntilDead` JSDoc have been
updated to document these semantics with a source reference.

### Rebase
Rebased onto upstream/main; resolved conflict with the `safeEqualSecret`
import refactor (both imports are now present).

### Tests
Added regression test for P1: verifies `botOpenIds`/`botNames` survive a
supervisor cycle stall+restart and are only cleared on final shutdown.
@zhangxin840 zhangxin840 force-pushed the fix/feishu-ws-retry-exhaustion-watchdog branch from bce01ac to aea91c8 on April 3, 2026 10:28
@zhangxin840
Author

Thanks for the detailed review. All three items have been addressed in the latest push (aea91c8):


P1 — Bot identity cleared on supervisor restart (now fixed)

Confirmed the bug: botOpenIds.delete(accountId) and botNames.delete(accountId) were in the per-cycle finally block, which runs on every stall-driven restart — not only on shutdown. After the first reconnect cycle, checkBotMentioned() would find no botOpenId and group accounts requiring a mention would go silent.

Fix: the per-cycle finally now only calls wsClients.delete(accountId). Bot identity maps are cleared solely in the outer finally (the true shutdown path). Added a regression test in monitor.cleanup.test.ts that uses fake timers to simulate a full stall+restart cycle and asserts that botOpenIds/botNames survive into the next cycle.


lastConnectTime SDK semantics — confirmed and fixed

Read the SDK source at dist/extensions/feishu/node_modules/@larksuiteoapi/node-sdk/lib/index.js (WSClient.reConnect). Confirmed: lastConnectTime is set inside tryConnect(), which is called on every connection attempt — including the very first start() call via reConnect(isStart=true). It is not limited to reconnect-after-drop scenarios.

The original code started the stall clock on the first non-zero lastConnectTime observation. This would falsely evict any healthy long-lived connection after 90 s, contradicting the guarantee stated in the FEISHU_WS_STALL_DETECT_MS comment.

Fix: introduced a connectChangeCount counter. The stall clock now only starts after the second lastConnectTime change — i.e. the first genuine reconnect attempt (triggered by the WebSocket close handler calling reConnect(isStart=false)). The FEISHU_WS_STALL_DETECT_MS JSDoc and runFeishuWSClientUntilDead JSDoc have been updated to document the confirmed SDK behavior with a source reference.

The backoff reset via hadSuccessfulConnect / onFirstConnect is not affected: it fires on the first connect attempt regardless, which is the correct heuristic for distinguishing boot-time failure from runtime disruption.
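
In sketch form, the second-change gate amounts to the following (variable names follow the discussion above; the surrounding poll loop is the code quoted earlier, and everything else here is illustrative):

```ts
// Poll-tick gate: the first lastConnectTime change is the initial start() attempt;
// only the second and later changes indicate a genuine reconnect loop.
let lastSeenConnectTime = 0;
let connectChangeCount = 0;
let staleSinceMs: number | null = null;

function onStallPollTick(lastConnectTime: number): void {
  if (lastConnectTime !== lastSeenConnectTime) {
    lastSeenConnectTime = lastConnectTime;
    connectChangeCount += 1;
    if (connectChangeCount >= 2) {
      // First genuine reconnect attempt after a drop: start (or reset) the stall clock.
      staleSinceMs = Date.now();
    }
  }
}
```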


Dead no-op try/catch — removed

The try { wsClient = createFeishuWSClient(account) } catch (err) { throw err } block has been removed. Any synchronous throw from createFeishuWSClient (e.g. missing credentials) still propagates through the outer try/finally for correct cleanup.


Duplicate JSDoc blocks — merged

The two back-to-back JSDoc blocks describing the supervisor loop and supervisorAttempt have been merged into one.


Rebase

The branch has been rebased onto current upstream/main. The only conflict was the import line: upstream introduced safeEqualSecret from browser-support to replace the inline timingSafeEqualString helper. Both the new upstream import and the PR's computeBackoff/sleepWithAbort imports are now present.


CI failures

  • checks-fast-extensions: failure in extensions/matrix/src/matrix/monitor/events.test.ts (room ID assertion !fallback:example.org vs !dm:example.org) — pre-existing Matrix extension issue, unrelated to this PR.
  • checks-windows-node-test-3, -4, -6: src/tasks/flow-registry.store.test.ts timeout at 120 s — pre-existing Windows runner flake, unrelated to this PR.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aea91c8f73


Comment on lines +295 to +297
onFirstConnect: () => {
  hadSuccessfulConnect = true;
},

P2: Base backoff on confirmed connection, not first attempt

onFirstConnect is documented as firing when the SDK records its first connect attempt, but this callback sets hadSuccessfulConnect = true. In startup outage scenarios (or any cycle where handshake never succeeds), that still resets supervisorAttempt to 0, so computeBackoff keeps using the minimum delay and the supervisor never escalates retry spacing. This defeats the intended exponential backoff for boot-time failures and can create a persistent 5s reconnect storm.


`onFirstConnect` fired on the first connect *attempt*, not after the
WebSocket handshake was confirmed live. In a startup-outage scenario this
caused `hadSuccessfulConnect` to be set prematurely, which reset
`supervisorAttempt` to 0 on every stall cycle and killed exponential
backoff.

Fix: replace `onFirstConnect` with `onConfirmedConnect`. The poll loop now
accumulates a `connectedAndStableCount` each tick where `lastConnectTime`
is non-zero, unchanged, and the stall clock has not started. Only after
`FEISHU_WS_STABLE_TICKS_REQUIRED` (2) consecutive stable ticks (~20 s of
confirmed liveness) does it fire `onConfirmedConnect` and set
`hadSuccessfulConnect = true`. The existing `connectChangeCount`/stall
clock logic is unchanged.

Add two regression tests to `monitor.cleanup.test.ts`:
- startup-outage: reconnect fires before stable ticks accumulate →
  `supervisorAttempt >= 1` (backoff applies)
- normal recovery: stable ticks accumulate before any reconnect →
  `supervisorAttempt == 0` (fast recovery)

Addresses Codex review PRR_kwDOQb6kR87xtZOm.
@zhangxin840
Author

Re: Codex review PRR_kwDOQb6kR87xtZOm (P2 — base backoff on confirmed connection, not first attempt)

Fixed in ba25f3d.

Replaced onFirstConnect with onConfirmedConnect. The poll loop now tracks connectedAndStableCount: incremented each tick where lastConnectTime is non-zero, stable (unchanged from last tick), and the stall clock has not started. hadSuccessfulConnect is set to true only after FEISHU_WS_STABLE_TICKS_REQUIRED = 2 consecutive stable ticks (~20 s of confirmed liveness).

In the startup-outage scenario you described: the first attempt fires and lastConnectTime advances, but before two stable ticks accumulate the SDK moves to a reconnect attempt — connectedAndStableCount resets to 0, onConfirmedConnect never fires, hadSuccessfulConnect stays false, and supervisorAttempt keeps incrementing → full exponential backoff applies correctly.

The connectChangeCount/stall-clock logic from the previous commit is unchanged. Two new regression tests in monitor.cleanup.test.ts cover both paths:

  • Startup outage: reconnect fires before stable ticks accumulate → supervisorAttempt >= 1
  • Normal recovery: stable ticks accumulate before any reconnect → supervisorAttempt == 0

Re: CI failure checks-node-channels-3 — WhatsApp extension test

The failing test extensions/whatsapp/src/action-runtime.test.ts > applies default account allowFrom when accountId is omitted asserts status: 403 on a ToolAuthorizationError, but that class only sets this.name — it has no status property in the current codebase (extensions/whatsapp/src/runtime-api.ts:50). This test failure is pre-existing on upstream main and is completely unrelated to the Feishu supervisor changes in this PR. No fix attempted here.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ba25f3d0d0


Comment on lines +249 to +252
if (!isFirstChange) {
  // Genuine reconnect attempt: start (or reset) the stall clock.
  staleSinceMs = Date.now();
}

P1: Reset stall clock after reconnect recovery

Starting staleSinceMs on every post-initial lastConnectTime change without ever clearing it means any successful reconnect is treated as a pending stall forever. After one transient reconnect, lastConnectTime normally stays unchanged during healthy operation, so idleMs will cross 90s and force a supervisor restart even though the socket has recovered; this can cause repeated teardown/reconnect loops on healthy sessions after the first network blip.


…te check

The stall detector starts a 90 s clock whenever `lastConnectTime` advances
after the initial connect (indicating the SDK entered a reconnect loop). But
`lastConnectTime` is set on every `tryConnect()` call regardless of success —
so a successful reconnect also sets the clock, causing a false supervisor
restart 90 s later even on a fully recovered connection.

The core problem: `lastConnectTime` alone cannot distinguish "SDK reconnected
successfully" from "SDK exhausted its retry budget." The only reliable signal
is the underlying WebSocket's `readyState`.

Fix: after `FEISHU_WS_STABLE_TICKS_REQUIRED` consecutive stable poll ticks
while a stall clock is running, check `wsClient.wsConfig?.getWSInstance()?.
readyState`. If it is 1 (OPEN) the reconnect succeeded — clear `staleSinceMs`
so the healthy session is not falsely evicted. If the WebSocket is CLOSED /
CONNECTING / unavailable, the clock continues and the stall fires normally.

The private `wsConfig` accessor is wrapped in try/catch so any future SDK
refactor degrades gracefully back to the existing behaviour.

Tests added:
- "clears stall clock after successful reconnect" — readyState OPEN, no
  false stall after 90 s of healthy operation
- "stall fires when reconnect socket is not OPEN" — readyState CLOSED,
  stall detected and second supervisor cycle started

Addresses Codex P1 in openclaw#57978 review 4055460051.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@zhangxin840
Author

Addressing Codex P1 (review #4055460051)

Root cause of the false-stall issue: lastConnectTime is set inside tryConnect() on every connection attempt (confirmed in SDK source at WSClient.reConnect → tryConnect). This means it's indistinguishable whether a stable lastConnectTime reflects "SDK reconnected successfully" or "SDK exhausted retries and stopped." Simple stable-tick counting can't disambiguate the two cases.

Fix (b05a8320f8): After FEISHU_WS_STABLE_TICKS_REQUIRED consecutive stable poll ticks while the stall clock is running, we check wsClient.wsConfig?.getWSInstance()?.readyState. If the underlying WebSocket is OPEN (readyState === 1), the reconnect succeeded — we clear staleSinceMs. If it's CLOSED/CONNECTING or the accessor is unavailable (SDK refactor safeguard), the stall clock continues and fires normally.

Tradeoffs documented:

  • Accesses the non-public wsConfig property — wrapped in try/catch so future SDK changes degrade gracefully to the original behaviour
  • The readyState check is gated on FEISHU_WS_STABLE_TICKS_REQUIRED ticks (~20 s) to avoid clearing prematurely while a handshake is still in progress

Tests added:

  • "clears stall clock after successful reconnect" — readyState OPEN + stable → no false supervisor restart after 90 s
  • "stall fires when reconnect socket is not OPEN" — readyState CLOSED → stall detected and second supervisor cycle started

All 8 cleanup tests pass.
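
A sketch of the probe described above; the wsConfig / getWSInstance accessors are the non-public SDK internals named in the comment, and the rest is illustrative:

```ts
type MaybeWsClient = {
  wsConfig?: { getWSInstance?: () => { readyState: number } | undefined };
};

// Returns the underlying socket's readyState, or null when unobservable
// (accessor missing after an SDK refactor, or a test double with no real socket).
function getWsReadyState(wsClient: unknown): number | null {
  try {
    const ws = (wsClient as MaybeWsClient).wsConfig?.getWSInstance?.();
    return ws?.readyState ?? null;
  } catch {
    return null;
  }
}

// In the poll tick, once enough stable ticks accumulate while a stall clock is
// running (sketch; locals come from the quoted poll loop):
//   if (staleSinceMs !== null && stableTicks >= FEISHU_WS_STABLE_TICKS_REQUIRED) {
//     if (getWsReadyState(wsClient) === 1 /* OPEN */) {
//       staleSinceMs = null; // reconnect succeeded; don't evict a healthy session
//     }
//   }
```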


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b05a8320f8


Comment on lines +263 to +267
if (!confirmedConnectFired && lastConnectTime !== 0 && staleSinceMs === null) {
  connectedAndStableCount += 1;
  if (connectedAndStableCount >= FEISHU_WS_STABLE_TICKS_REQUIRED) {
    confirmedConnectFired = true;
    onConfirmedConnect?.();

P2: Gate confirmed-connect on actual socket open

This marks a cycle as “successfully connected” after two unchanged lastConnectTime polls, but that can happen even when the handshake never succeeds (e.g., startup outage where the SDK is waiting for its next retry). In that case hadSuccessfulConnect becomes true and the supervisor resets supervisorAttempt to 0, so boot-time failures keep retrying at minimum backoff instead of escalating. Fresh evidence: the upstream WS client defaults include a reconnect nonce delay up to 30s, which is longer than this 20s confirmation window, so this false-positive path is reachable without any successful connection.


…ay fix)

The SDK's reconnect nonce delay can be up to 30 s. During this window
`lastConnectTime` is already non-zero (set by the failed initial attempt)
and stable, so FEISHU_WS_STABLE_TICKS_REQUIRED (2 ticks ≈ 20 s) is reached
before the nonce expires. This triggered `onConfirmedConnect` falsely, setting
`hadSuccessfulConnect = true` even though no connection was established —
causing boot-time failures to reset `supervisorAttempt` to 0 and preventing
exponential backoff escalation.

Fix: gate `onConfirmedConnect` on the same `getWsReadyState()` check added in
the previous commit. If `readyState === 1` (OPEN) or is not observable (null,
for tests/future SDK changes — backward-compatible fallback), fire the callback.
If readyState is known but not OPEN (e.g. 0=CONNECTING during nonce wait),
defer confirmation until the socket is actually open.

`getWsReadyState()` is extracted as a shared helper used by both the
confirmed-connect path and the post-reconnect stall-clock clear path.

Regression test added: "does not confirm connection during nonce delay" —
verifies that readyState CONNECTING blocks onConfirmedConnect and allows
supervisorAttempt to increment on subsequent stall.

Addresses Codex P2 in openclaw#57978 (2026-04-06 review comment on
line 267 of monitor.transport.ts).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@zhangxin840
Author

Addressing Codex P2 (2026-04-06, line 267)

Issue: onConfirmedConnect could fire falsely during the SDK's reconnect nonce delay (up to 30 s). After a failed initial connect, lastConnectTime is already non-zero and stable — so FEISHU_WS_STABLE_TICKS_REQUIRED (2 ticks ≈ 20 s) is reached before the nonce expires. This set hadSuccessfulConnect = true with no actual connection, preventing exponential backoff from escalating on boot-time failures.

Fix (99d18b4893): Gate onConfirmedConnect on the same getWsReadyState() helper introduced in the previous commit. Confirmation fires only when readyState === 1 (OPEN) or when the accessor is unavailable (null — backward-compatible fallback to stable-tick behavior for tests/future SDK changes). If readyState is known but not OPEN (e.g., 0=CONNECTING during nonce wait), the callback is deferred until the socket is actually open.

getWsReadyState() is now a shared helper covering both: confirmed-connect detection and post-reconnect stall-clock clearing.

Regression test added: "does not confirm connection during nonce delay" — readyState = 0 (CONNECTING) blocks onConfirmedConnect for 2 stable ticks; after the nonce fires and all retries stall, computeBackoff is called with attempt ≥ 1.

All 9 cleanup tests pass.
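
Combined with the earlier stable-tick counting, the confirmation branch from the quoted poll loop plausibly becomes (sketch only; the locals are the poll-loop variables quoted above):

```ts
// WebSocket readyState: 0 = CONNECTING, 1 = OPEN.
if (!confirmedConnectFired && lastConnectTime !== 0 && staleSinceMs === null) {
  connectedAndStableCount += 1;
  const readyState = getWsReadyState(wsClient); // null => unobservable, fall back to stable ticks
  const socketOpen = readyState === 1 || readyState === null;
  if (connectedAndStableCount >= FEISHU_WS_STABLE_TICKS_REQUIRED && socketOpen) {
    confirmedConnectFired = true;
    onConfirmedConnect?.();
  }
}
```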

@clawsweeper

clawsweeper Bot commented Apr 28, 2026

Thanks for the context here. I swept through the related work, and this PR is now a duplicate or superseded.

Close as superseded by the maintainer-labeled Feishu recovery PR #73739. Current main still lacks post-start SDK retry-exhaustion recovery, but the same #52618 work is now tracked in a current-main replacement path, while this branch is dirty/stale and adds raw supervisor error logging that should not be carried forward.

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Review details

Best possible solution:

Use #52618 and #73739 as the canonical path: land a current-main Feishu WebSocket supervisor that preserves heartbeat/proxy/async client behavior, uses public SDK lifecycle callbacks where possible, keeps redacted logging, and closes older duplicate PRs after the replacement is resolved.

Do we have a high-confidence way to reproduce the issue?

Yes. The PR body and #52618 provide concrete reproduction steps, and current main source confirms the failure mode: after wsClient.start() returns, OpenClaw waits only for abort and has no post-start signal to recreate a dead SDK client.

Is this the best way to solve the issue?

No. The desired product direction is right, but this branch is no longer the best merge path because #73739 tracks the same fix against the newer Feishu transport and preserves current redaction/lifecycle behavior.

Security review:

Security review needs attention: The diff adds raw error logging in Feishu WebSocket supervisor paths, which can expose credentials or bearer tokens in logs.

  • [medium] Raw supervisor errors may leak secrets — extensions/feishu/src/monitor.transport.ts:401
    The PR logs String(err) for non-recoverable and cycle errors. Current main redacts WebSocket error messages, and the replacement PR keeps that redaction; this branch should not be merged without applying the same sanitizer (an illustrative shape is sketched below).
    Confidence: 0.86
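
An illustrative sanitizer shape (hypothetical; the actual redaction helper on main is not quoted in this thread):

```ts
// Hypothetical redaction before logging supervisor errors; the patterns are examples only.
function redactWsError(err: unknown): string {
  return String(err)
    .replace(/(Bearer\s+)[\w.~+/-]+/gi, "$1[REDACTED]")
    .replace(/(app_secret["']?\s*[:=]\s*["']?)[\w-]+/gi, "$1[REDACTED]");
}

// Usage: log(`feishu[${accountId}]: supervisor cycle error: ${redactWsError(err)}`);
```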


Likely related people:

  • vincentkoc: Authored the recent mainline Feishu WebSocket reconnect/heartbeat repair and owns the maintainer-labeled replacement PR for the same retry-exhaustion behavior. (role: recent maintainer and likely follow-up owner; confidence: high; commits: d93e6f61580a, 319aa2f1fe67, 1faf660c3482; files: extensions/feishu/src/monitor.transport.ts, extensions/feishu/src/client.ts, extensions/feishu/src/monitor.cleanup.test.ts)
  • steipete: Recent API history shows repeated Feishu monitor/account hardening and refactors around webhook, monitor startup, runtime seams, and lifecycle handling. (role: recent Feishu monitor maintainer; confidence: medium; commits: f46bd2e0cc17, 8be3a4466cbc, 77d04a39d87a; files: extensions/feishu/src/monitor.transport.ts, extensions/feishu/src/monitor.account.ts, extensions/feishu/src/client.ts)
  • schumilin: Authored the mainline Feishu WebSocket monitor stop/abort cleanup change and cleanup tests that this PR builds on. (role: introduced adjacent cleanup behavior; confidence: medium; commits: bd4237c16c2c; files: extensions/feishu/src/monitor.transport.ts, extensions/feishu/src/monitor.cleanup.test.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 52e2d4e16a29.


Labels

channel: feishu, size: L


Development

Successfully merging this pull request may close these issues.

fix(feishu): WebSocket connection not recovered after network disruption
