fix(feishu): supervisor loop to recover from WSClient retry exhaustion #57978
zhangxin840 wants to merge 4 commits into openclaw:main from
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bce01ac235
```ts
botOpenIds.delete(accountId);
botNames.delete(accountId);
```
Preserve bot identity across supervisor cycles
Do not clear botOpenIds/botNames in the per-cycle finally block, because this block runs on every stall-driven restart, not just shutdown. After the first WebSocket supervisor recycle, bot identity is never repopulated in monitorWebSocket, but downstream handlers still depend on these maps (for example extensions/feishu/src/monitor.account.ts:423/:439), and checkBotMentioned returns false when botOpenId is missing (extensions/feishu/src/bot-content.ts:231-234). In accounts where group replies require mentions (the default unless group policy is open), the bot will stop responding to group messages after the first reconnect cycle.
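For reference, a minimal sketch of the cleanup split this comment is asking for. The map names (`wsClients`, `botOpenIds`, `botNames`) follow the PR discussion, but `superviseAccount` and `runOneWebSocketCycle` are hypothetical stand-ins, not the actual monitor code:

```ts
// Sketch only: module-level maps as described in the PR; the real monitor
// code lives in extensions/feishu/src/monitor.*.ts.
const wsClients = new Map<string, unknown>();
const botOpenIds = new Map<string, string>();
const botNames = new Map<string, string>();

// Stand-in for a single WSClient lifetime (resolves on stall or abort).
declare function runOneWebSocketCycle(accountId: string, signal: AbortSignal): Promise<void>;

async function superviseAccount(accountId: string, signal: AbortSignal): Promise<void> {
  try {
    while (!signal.aborted) {
      try {
        await runOneWebSocketCycle(accountId, signal);
      } finally {
        // Per-cycle cleanup: drop only the dead transport client.
        // Bot identity must survive so checkBotMentioned() keeps working
        // after a stall-driven restart.
        wsClients.delete(accountId);
      }
    }
  } finally {
    // Shutdown path: only now is it safe to drop identity.
    botOpenIds.delete(accountId);
    botNames.delete(accountId);
  }
}
```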
Greptile Summary

This PR replaces the old fire-and-forget `start()` pattern with a supervisor loop and stall detector. The overall approach is solid and directly addresses the bug. A few items worth verifying before merge are raised in the inline comments below.
Confidence Score: 4/5
```ts
let wsClient: Lark.WSClient;
try {
  wsClient = createFeishuWSClient(account);
} catch (err) {
  // Non-recoverable config error (missing credentials etc.) — bail out.
  throw err;
}
```
The try/catch around createFeishuWSClient only re-throws without any special handling, so it has no effect. Removing it keeps the intent clearer and avoids the impression that there is error-specific logic here. Any synchronous throw from createFeishuWSClient (e.g. missing credentials) will still propagate out of the while loop and through the outer try/finally to clean up state correctly.
```suggestion
wsClient = createFeishuWSClient(account);
```
Or, keeping the `let` declaration:
```ts
let wsClient: Lark.WSClient;
wsClient = createFeishuWSClient(account);
```
```ts
/**
 * Supervisor loop: each iteration creates a fresh WSClient and runs it until
 * either (a) the abort signal fires or (b) the SDK silently exhausts its
 * internal retry budget (stall detected). On stall we close the dead client,
 * wait with exponential backoff, then start a fresh cycle.
 *
 * Backoff attempt counter is incremented only when the SDK stalled; a
 * successful connection resets it so transient disruptions do not accumulate
 * against the backoff schedule.
 */
/**
 * `supervisorAttempt` counts consecutive boot-time failures (cycle stalled
 * before any successful connect). It resets to 0 whenever the SDK records a
 * successful connection, so a healthy network disruption that causes a stall
 * later does not keep accumulating the backoff schedule.
 */
let supervisorAttempt = 0;
```
Duplicate JSDoc blocks should be merged
There are two back-to-back JSDoc comment blocks describing the supervisor loop and supervisorAttempt. These should be merged into a single block to avoid confusion about which comment applies to which thing.
```suggestion
/**
 * Supervisor loop: each iteration creates a fresh WSClient and runs it until
 * either (a) the abort signal fires or (b) the SDK silently exhausts its
 * internal retry budget (stall detected). On stall we close the dead client,
 * wait with exponential backoff, then start a fresh cycle.
 *
 * `supervisorAttempt` counts consecutive boot-time failures (cycle stalled
 * before any successful connect). It resets to 0 whenever the SDK records a
 * successful connection, so a runtime network disruption that causes a stall
 * does not accumulate against the boot-time backoff schedule.
 */
let supervisorAttempt = 0;
```
```ts
const stallPoller = setInterval(() => {
  if (abortSignal?.aborted) {
    clearInterval(stallPoller);
    return;
  }

  const { lastConnectTime } = wsClient.getReconnectInfo();

  if (lastConnectTime !== lastSeenConnectTime) {
    // SDK recorded a new connect attempt — reset stall clock.
    const isFirstConnect = lastSeenConnectTime === 0 && lastConnectTime > 0;
    lastSeenConnectTime = lastConnectTime;
    staleSinceMs = Date.now();
    if (isFirstConnect) {
      onFirstConnect?.();
    }
    return;
  }

  // Still waiting for the SDK to make its first attempt: don't start
  // the stall clock yet to avoid false positives on slow-start scenarios.
  if (staleSinceMs === null) {
    return;
  }

  const idleMs = Date.now() - staleSinceMs;
  if (idleMs >= FEISHU_WS_STALL_DETECT_MS) {
    log(
      `feishu[${accountId}]: WebSocket stall detected ` +
        `(no SDK connect activity for ${Math.round(idleMs / 1000)}s); ` +
        `will restart supervisor cycle`,
    );
    clearInterval(stallPoller);
    abortSignal?.removeEventListener("abort", handleAbort);
    resolve("stalled");
  }
}, FEISHU_WS_STALL_POLL_MS);
stallPoller.unref?.();
```
Stall detection and backoff reset correctness depend on `lastConnectTime` semantics
The correctness of two closely related invariants hinges on exactly when the Lark SDK updates getReconnectInfo().lastConnectTime:
1. False-stall risk for healthy connections
The comment on FEISHU_WS_STALL_DETECT_MS says "a healthy long-lived connection that has never needed to reconnect does NOT trigger a false stall." That guarantee only holds if lastConnectTime stays 0 for a connection that has never needed to reconnect. Once lastConnectTime becomes non-zero (staleSinceMs is set), a healthy connection that remains stable for 90+ seconds without further SDK activity will satisfy idleMs >= FEISHU_WS_STALL_DETECT_MS and be falsely evicted.
If the SDK sets lastConnectTime on every connection attempt — including the very first one — every successful connection would begin a countdown of at most 90 s before a spurious restart.
2. Backoff reset may fire too eagerly
hadSuccessfulConnect is set to true the first time lastConnectTime becomes non-zero (line 206: isFirstConnect check calls onFirstConnect). If lastConnectTime advances on any connection attempt (including failed ECONNRESET retries), hadSuccessfulConnect becomes true during the first SDK retry cycle even if the connection was never actually established. When that cycle eventually stalls, supervisorAttempt is reset to 0 and the backoff stays at the minimum 5 s forever — the intended distinction between "boot-time failure" and "runtime disruption" disappears.
Suggested verification
It would be worth adding a comment or an inline source reference (e.g. // larksuite/node-sdk src/ws/client.ts:L…) documenting the confirmed SDK semantics: does lastConnectTime advance only on reconnect attempts (i.e. after the first successful connect), or on every connect attempt including the initial one? If the former, both invariants hold as-is; if the latter, the stall guard and the backoff reset both need adjusting. Can you confirm: does getReconnectInfo().lastConnectTime stay 0 for a session that connects initially and is never disrupted (i.e. it only advances during reconnect loops, not the first connect)? And similarly, does it advance only on successful reconnects, or also on failed attempts such as ECONNRESET retries?
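For illustration, a sketch of the adjustment the "if the latter" branch would need: a connect-change counter that arms the stall clock only from the second `lastConnectTime` change onward. The constant names follow the PR; `makeStallDetector` and its wiring are hypothetical simplifications, not the actual code:

```ts
const FEISHU_WS_STALL_POLL_MS = 10_000;
const FEISHU_WS_STALL_DETECT_MS = 90_000;

function makeStallDetector(getLastConnectTime: () => number, onStall: () => void) {
  let lastSeen = 0;
  let changeCount = 0;
  let staleSinceMs: number | null = null;

  return setInterval(() => {
    const lastConnectTime = getLastConnectTime();
    if (lastConnectTime !== lastSeen) {
      lastSeen = lastConnectTime;
      changeCount += 1;
      // First change = initial connect attempt: do not arm the clock.
      // Second change onward = the SDK re-entered its reconnect loop.
      staleSinceMs = changeCount >= 2 ? Date.now() : null;
      return;
    }
    if (staleSinceMs !== null && Date.now() - staleSinceMs >= FEISHU_WS_STALL_DETECT_MS) {
      onStall();
    }
  }, FEISHU_WS_STALL_POLL_MS);
}
```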
## Changes from review feedback

### P1 — Bot identity cleared on supervisor restart (critical bug fix)

`botOpenIds`/`botNames` were previously cleared in the per-cycle `finally` block, which runs on every stall-driven restart. After the first reconnect cycle, `checkBotMentioned()` would return false (missing `botOpenId`) and all group accounts requiring a bot mention would go silent.

Fix: only clear `wsClients` in the per-cycle `finally`. `botOpenIds`/`botNames` are now only cleared in the outer `finally` (the shutdown path).

### P2.1 — Remove dead no-op try/catch around createFeishuWSClient

The try/catch only re-threw without any special handling. Removed; any synchronous throw from `createFeishuWSClient` still propagates correctly through the outer try/finally.

### P2.2 — Merge duplicate JSDoc blocks for supervisor loop

Two consecutive JSDoc comment blocks describing the same supervisor loop and `supervisorAttempt` have been merged into one.

### P2.3 — Document and fix lastConnectTime SDK semantics

Confirmed by reading SDK source (@larksuiteoapi/node-sdk/lib/index.js WSClient.reConnect): `lastConnectTime` is set inside `tryConnect()`, which is called on EVERY connection attempt, including the very first `start()` call — not only on reconnects.

The original code started the stall clock on the first `lastConnectTime` observation, which would falsely evict healthy connections after 90 s.

Fix: the stall clock now only starts after the SECOND `lastConnectTime` change (i.e. the first genuine reconnect attempt after a WebSocket drop). Healthy connections that never need to reconnect are never stalled. The constant comment and `runFeishuWSClientUntilDead` JSDoc have been updated to document these semantics with a source reference.

### Rebase

Rebased onto upstream/main; resolved conflict with the `safeEqualSecret` import refactor (both imports are now present).

### Tests

Added regression test for P1: verifies `botOpenIds`/`botNames` survive a supervisor cycle stall+restart and are only cleared on final shutdown.
Force-pushed from bce01ac to aea91c8
Thanks for the detailed review. All three items have been addressed in the latest push (aea91c8):

**P1 — Bot identity cleared on supervisor restart (now fixed)**

Confirmed the bug: the per-cycle `finally` block cleared `botOpenIds`/`botNames` on every stall-driven restart, so mention detection broke after the first reconnect cycle. Fix: the per-cycle `finally` now only clears `wsClients`; bot identity is cleared only in the outer shutdown `finally`, and a regression test covers it.

**lastConnectTime semantics — verified and fixed**

Read the SDK source at @larksuiteoapi/node-sdk: `lastConnectTime` is set on every connection attempt, including the very first `start()` call. The original code started the stall clock on the first non-zero `lastConnectTime`, which could falsely evict a healthy connection after 90 s. Fix: introduced a connect-change counter so the stall clock only starts on the second `lastConnectTime` change (the first genuine reconnect attempt). The backoff reset (via `hadSuccessfulConnect`) follows the same corrected semantics.

**Dead no-op try/catch — removed**

The `try/catch` around `createFeishuWSClient` only re-threw, so it is gone.

**Duplicate JSDoc blocks — merged**

The two back-to-back JSDoc blocks describing the supervisor loop and `supervisorAttempt` are now a single block.

**Rebase**

The branch has been rebased onto current upstream/main. The only conflict was the import line: upstream introduced `safeEqualSecret`; both imports are now present.

**CI failures**
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aea91c8f73
```ts
onFirstConnect: () => {
  hadSuccessfulConnect = true;
},
```
Base backoff on confirmed connection, not first attempt
onFirstConnect is documented as firing when the SDK records its first connect attempt, but this callback sets hadSuccessfulConnect = true. In startup outage scenarios (or any cycle where handshake never succeeds), that still resets supervisorAttempt to 0, so computeBackoff keeps using the minimum delay and the supervisor never escalates retry spacing. This defeats the intended exponential backoff for boot-time failures and can create a persistent 5s reconnect storm.
`onFirstConnect` fired on the first connect *attempt*, not after the WebSocket handshake was confirmed live. In a startup-outage scenario this caused `hadSuccessfulConnect` to be set prematurely, which reset `supervisorAttempt` to 0 on every stall cycle and killed exponential backoff.

Fix: replace `onFirstConnect` with `onConfirmedConnect`. The poll loop now accumulates a `connectedAndStableCount` each tick where `lastConnectTime` is non-zero, unchanged, and the stall clock has not started. Only after `FEISHU_WS_STABLE_TICKS_REQUIRED` (2) consecutive stable ticks (~20 s of confirmed liveness) does it fire `onConfirmedConnect` and set `hadSuccessfulConnect = true`. The existing `connectChangeCount`/stall clock logic is unchanged.

Add two regression tests to `monitor.cleanup.test.ts`:
- startup-outage: reconnect fires before stable ticks accumulate → `supervisorAttempt >= 1` (backoff applies)
- normal recovery: stable ticks accumulate before any reconnect → `supervisorAttempt == 0` (fast recovery)

Addresses Codex review PRR_kwDOQb6kR87xtZOm.
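A minimal sketch of the stable-tick confirmation described in this commit, assuming a poll loop that already knows whether `lastConnectTime` changed this tick; `trackStableTick` and `ConfirmTracker` are illustrative names, not the actual implementation:

```ts
const FEISHU_WS_STABLE_TICKS_REQUIRED = 2;

interface ConfirmTracker {
  connectedAndStableCount: number;
  confirmedConnectFired: boolean;
}

function trackStableTick(
  tracker: ConfirmTracker,
  lastConnectTime: number,
  lastConnectTimeChangedThisTick: boolean,
  stallClockArmed: boolean,
  onConfirmedConnect?: () => void,
): void {
  // A tick counts as "stable" only when the SDK has attempted a connect,
  // nothing changed this tick, and no reconnect loop is in progress.
  if (
    !tracker.confirmedConnectFired &&
    lastConnectTime !== 0 &&
    !lastConnectTimeChangedThisTick &&
    !stallClockArmed
  ) {
    tracker.connectedAndStableCount += 1;
    if (tracker.connectedAndStableCount >= FEISHU_WS_STABLE_TICKS_REQUIRED) {
      tracker.confirmedConnectFired = true;
      onConfirmedConnect?.();
    }
  } else if (lastConnectTimeChangedThisTick) {
    // Any new connect attempt resets the streak.
    tracker.connectedAndStableCount = 0;
  }
}
```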
Re: Codex review PRR_kwDOQb6kR87xtZOm (P2 — base backoff on confirmed connection, not first attempt)

Fixed in ba25f3d. Replaced `onFirstConnect` with `onConfirmedConnect`, which only fires after `FEISHU_WS_STABLE_TICKS_REQUIRED` consecutive stable poll ticks. In the startup-outage scenario you described: the first attempt fires and `lastConnectTime` becomes non-zero, but the SDK's next retry changes it again before two stable ticks accumulate, so `onConfirmedConnect` never fires, `hadSuccessfulConnect` stays false, and the backoff escalates as intended. The two regression tests in `monitor.cleanup.test.ts` cover the startup-outage and normal-recovery paths.

Re: CI failure — the failing test
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ba25f3d0d0
```ts
if (!isFirstChange) {
  // Genuine reconnect attempt: start (or reset) the stall clock.
  staleSinceMs = Date.now();
}
```
Reset stall clock after reconnect recovery
Starting staleSinceMs on every post-initial lastConnectTime change without ever clearing it means any successful reconnect is treated as a pending stall forever. After one transient reconnect, lastConnectTime normally stays unchanged during healthy operation, so idleMs will cross 90s and force a supervisor restart even though the socket has recovered; this can cause repeated teardown/reconnect loops on healthy sessions after the first network blip.
…te check

The stall detector starts a 90 s clock whenever `lastConnectTime` advances after the initial connect (indicating the SDK entered a reconnect loop). But `lastConnectTime` is set on every `tryConnect()` call regardless of success — so a successful reconnect also sets the clock, causing a false supervisor restart 90 s later even on a fully recovered connection.

The core problem: `lastConnectTime` alone cannot distinguish "SDK reconnected successfully" from "SDK exhausted its retry budget." The only reliable signal is the underlying WebSocket's `readyState`.

Fix: after `FEISHU_WS_STABLE_TICKS_REQUIRED` consecutive stable poll ticks while a stall clock is running, check `wsClient.wsConfig?.getWSInstance()?.readyState`. If it is 1 (OPEN) the reconnect succeeded — clear `staleSinceMs` so the healthy session is not falsely evicted. If the WebSocket is CLOSED / CONNECTING / unavailable, the clock continues and the stall fires normally. The private `wsConfig` accessor is wrapped in try/catch so any future SDK refactor degrades gracefully back to the existing behaviour.

Tests added:
- "clears stall clock after successful reconnect" — readyState OPEN, no false stall after 90 s of healthy operation
- "stall fires when reconnect socket is not OPEN" — readyState CLOSED, stall detected and second supervisor cycle started

Addresses Codex P1 in openclaw#57978 review 4055460051.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
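A sketch of the readyState probe this commit describes. The `wsConfig`/`getWSInstance()` access mirrors the commit text (private SDK internals, hence the defensive try/catch); `getWsReadyState` follows the PR's naming, while `maybeClearStallClock` is a simplified stand-in:

```ts
function getWsReadyState(wsClient: unknown): number | null {
  try {
    // Private SDK internals per the PR text — access defensively.
    const ws = (wsClient as { wsConfig?: { getWSInstance?: () => { readyState?: number } } })
      ?.wsConfig?.getWSInstance?.();
    return typeof ws?.readyState === "number" ? ws.readyState : null;
  } catch {
    // A future SDK refactor removing these internals degrades gracefully.
    return null;
  }
}

// After enough stable ticks with a stall clock running:
//   readyState === 1 (OPEN)  -> reconnect succeeded, clear staleSinceMs
//   anything else / unknown  -> keep the clock, let the stall fire normally
function maybeClearStallClock(readyState: number | null, clearClock: () => void): void {
  if (readyState === 1 /* WebSocket.OPEN */) {
    clearClock();
  }
}
```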
Addressing Codex P1 (review #4055460051)

Root cause of the false-stall issue: `lastConnectTime` is set on every `tryConnect()` call regardless of success, so a successful reconnect also started the 90 s stall clock and the recovered session was falsely restarted later.

Fix: after the stable-tick threshold is reached while the stall clock is running, the poller checks the underlying WebSocket's `readyState`; if it is OPEN the stall clock is cleared, otherwise the stall fires as before.

Tradeoffs documented: the check reads the SDK's private `wsConfig`/`getWSInstance()` internals, so it is wrapped in try/catch and degrades to the previous behaviour if those internals change.

Tests added:
- "clears stall clock after successful reconnect" (readyState OPEN)
- "stall fires when reconnect socket is not OPEN" (readyState CLOSED)

All 8 cleanup tests pass.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b05a8320f8
```ts
if (!confirmedConnectFired && lastConnectTime !== 0 && staleSinceMs === null) {
  connectedAndStableCount += 1;
  if (connectedAndStableCount >= FEISHU_WS_STABLE_TICKS_REQUIRED) {
    confirmedConnectFired = true;
    onConfirmedConnect?.();
```
Gate confirmed-connect on actual socket open
This marks a cycle as “successfully connected” after two unchanged lastConnectTime polls, but that can happen even when the handshake never succeeds (e.g., startup outage where the SDK is waiting for its next retry). In that case hadSuccessfulConnect becomes true and the supervisor resets supervisorAttempt to 0, so boot-time failures keep retrying at minimum backoff instead of escalating. Fresh evidence: the upstream WS client defaults include a reconnect nonce delay up to 30s, which is longer than this 20s confirmation window, so this false-positive path is reachable without any successful connection.
…ay fix)

The SDK's reconnect nonce delay can be up to 30 s. During this window `lastConnectTime` is already non-zero (set by the failed initial attempt) and stable, so FEISHU_WS_STABLE_TICKS_REQUIRED (2 ticks ≈ 20 s) is reached before the nonce expires. This triggered `onConfirmedConnect` falsely, setting `hadSuccessfulConnect = true` even though no connection was established — causing boot-time failures to reset `supervisorAttempt` to 0 and preventing exponential backoff escalation.

Fix: gate `onConfirmedConnect` on the same `getWsReadyState()` check added in the previous commit. If `readyState === 1` (OPEN) or is not observable (null, for tests/future SDK changes — backward-compatible fallback), fire the callback. If readyState is known but not OPEN (e.g. 0=CONNECTING during nonce wait), defer confirmation until the socket is actually open.

`getWsReadyState()` is extracted as a shared helper used by both the confirmed-connect path and the post-reconnect stall-clock clear path.

Regression test added: "does not confirm connection during nonce delay" — verifies that readyState CONNECTING blocks onConfirmedConnect and allows supervisorAttempt to increment on subsequent stall.

Addresses Codex P2 in openclaw#57978 (2026-04-06 review comment on line 267 of monitor.transport.ts).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
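A small sketch of the confirmation gate, under the readyState semantics stated in the commit (1 = OPEN, 0 = CONNECTING, null = not observable); `shouldConfirmConnect` is an illustrative helper name, not the real API:

```ts
function shouldConfirmConnect(readyState: number | null): boolean {
  if (readyState === null) {
    // Not observable (tests / future SDK changes): keep the old behaviour.
    return true;
  }
  // 1 = OPEN confirms; 0 = CONNECTING (nonce wait) defers confirmation.
  return readyState === 1;
}
```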
Addressing Codex P2 (2026-04-06, line 267)

Issue: during the SDK's reconnect nonce delay (up to 30 s), `lastConnectTime` is already non-zero and stable, so the stable-tick counter could confirm a connection that never actually opened, resetting the boot-time backoff.

Fix: `onConfirmedConnect` is now gated on `getWsReadyState()` — confirmation fires only when the socket is OPEN (or when readyState is not observable, as a backward-compatible fallback) and is deferred while the socket is CONNECTING.

Regression test added: "does not confirm connection during nonce delay" — verifies that readyState CONNECTING blocks `onConfirmedConnect`. All 9 cleanup tests pass.
Thanks for the context here. I swept through the related work, and this is now duplicate or superseded. Closing as superseded by the maintainer-labeled Feishu recovery PR #73739. Current main still lacks post-start SDK retry-exhaustion recovery, but the same #52618 work is now tracked in a current-main replacement path, while this branch is dirty/stale and adds raw supervisor error logging that should not be carried forward. So I'm closing this here and keeping the remaining discussion on the canonical linked item.

Review details

Best possible solution: use #52618 and #73739 as the canonical path — land a current-main Feishu WebSocket supervisor that preserves heartbeat/proxy/async client behavior, uses public SDK lifecycle callbacks where possible, keeps redacted logging, and closes older duplicate PRs after the replacement lands.

Do we have a high-confidence way to reproduce the issue? Yes. The PR body and #52618 provide concrete reproduction steps, and current main source confirms the failure mode: after the SDK exhausts its retry budget, nothing observes or restarts the connection.

Is this the best way to solve the issue? No. The desired product direction is right, but this branch is no longer the best merge path because #73739 tracks the same fix against the newer Feishu transport and preserves current redaction/lifecycle behavior.

Security review: needs attention — the diff adds raw error logging in Feishu WebSocket supervisor paths, which can expose credentials or bearer tokens in logs.
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 52e2d4e16a29.
Bug
Feishu WebSocket stops responding after a network disruption and never self-heals. Reproducible log pattern from a real production incident:
Gateway was dead for 6+ hours after the SDK exhausted its retry budget.
Root Cause
`WSClient.start()` is fire-and-forget: it kicks off an async reconnect loop inside the SDK and returns immediately (confirmed by reading the SDK source). When the server-configured `reconnectCount` retries are exhausted the SDK sets `isConnecting = false` and stops silently — no error thrown, no event emitted, no promise rejected.

The previous `monitorWebSocket()` parked on `abortSignal` after calling `start()`. Since `start()` returns instantly, the function could never observe the silent death. The only observable signal is that `getReconnectInfo().lastConnectTime` stops advancing.

Channel comparison: other channel monitors (Slack/Telegram) already run inside a supervising `while` loop; Feishu's did not (comparison table omitted).

Fix
Replaces the park-on-abort pattern with a supervisor loop + stall detector in `monitorWebSocket()`:

- `runFeishuWSClientUntilDead()` — wraps a single WSClient lifetime. Polls `getReconnectInfo().lastConnectTime` every 10 s. Stall clock starts only after the first non-zero `lastConnectTime` (so a healthy connection that never needs to reconnect never trips a false stall). After 90 s of no new connect activity → resolves `"stalled"`.
- Supervisor `while` loop — on each stall or error, closes the dead client (`{ force: true }`), computes backoff, and creates a fresh WSClient. Uses the existing `computeBackoff`/`sleepWithAbort` from `openclaw/plugin-sdk/infra-runtime` (same helpers as Slack/Telegram). A simplified sketch follows this list.
- Backoff counter semantics — `supervisorAttempt` increments on boot-time failures (SDK died before a confirmed connect). It resets to 0 if the SDK made at least one successful connect in the cycle, so a runtime network disruption after a healthy session retries quickly (5 s) rather than accumulating against the boot-time backoff schedule.
- Non-recoverable errors — pattern-matched with narrow word-boundary regexes (not a broad `"app_id"` substring) to avoid permanently stopping the supervisor on recoverable errors that happen to mention field names.
- Cleanup safety — outer `try/finally` guarantees `wsClients`/`botOpenIds`/`botNames` are cleared even if an unexpected synchronous throw bypasses the per-cycle `finally`.
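A simplified sketch of the supervisor loop described above, assuming `computeBackoff`/`sleepWithAbort` behave as attempt-in/delay-out and abortable sleep; the `declare`d helpers are stand-ins, not the real `monitor.transport.ts` API:

```ts
declare function computeBackoff(attempt: number): number;
declare function sleepWithAbort(ms: number, signal?: AbortSignal): Promise<void>;
declare function createClient(): { close(opts: { force: boolean }): Promise<void> };
declare function runClientUntilDead(
  client: ReturnType<typeof createClient>,
  onConfirmedConnect: () => void,
  signal?: AbortSignal,
): Promise<"stalled" | "aborted">;

async function superviseWebSocket(signal?: AbortSignal): Promise<void> {
  let supervisorAttempt = 0;
  while (!signal?.aborted) {
    const client = createClient();
    let hadSuccessfulConnect = false;
    const outcome = await runClientUntilDead(
      client,
      () => {
        hadSuccessfulConnect = true;
      },
      signal,
    );
    // Drop the dead client before retrying.
    await client.close({ force: true });
    if (outcome === "aborted") return;

    // Boot-time failures escalate; a cycle that reached a confirmed connect
    // restarts on the minimum delay.
    supervisorAttempt = hadSuccessfulConnect ? 0 : supervisorAttempt + 1;
    await sleepWithAbort(computeBackoff(supervisorAttempt), signal);
  }
}
```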
Issues addressed from reviews of prior PRs (#52623, #52666)

- `supervisorAttempts = 0` reset fires before SDK can stall → tied to the `onFirstConnect` callback, not to the fire-and-forget `start()` return
- `"app_id"` substring match stops supervisor on recoverable errors → `/invalid app_id\b/i` and three other exact-phrase patterns (see the matcher sketch after this list)
- `cleanup()` skip if unexpected rejection bypasses per-cycle `finally` → outer `try/finally` covers all exit paths
- `close()` (larksuite/node-sdk#177 / #40451)
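A sketch of the word-boundary matching approach; only `/invalid app_id\b/i` is quoted in this PR, so the rest of the pattern list is a placeholder, not the actual set:

```ts
const NON_RECOVERABLE_PATTERNS: RegExp[] = [
  /invalid app_id\b/i,
  // ...three more exact-phrase patterns in the real code (not reproduced here)
];

function isNonRecoverableError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  // Exact phrases with word boundaries, so a recoverable error that merely
  // mentions a field name (e.g. "retrying request for app_id ...") does not
  // permanently stop the supervisor.
  return NON_RECOVERABLE_PATTERNS.some((re) => re.test(message));
}
```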
Test plan

- `WebSocket stall detected` is logged and the client is recreated
- `appId` fast-fails without entering the retry loop

Fixes #52618
🤖 Generated with Claude Code