fix(feishu): add application-level WebSocket reconnection with backoff #68865
tianhaocui wants to merge 4 commits into openclaw:main
Conversation
The Feishu WebSocket transport relied solely on the Lark SDK's built-in autoReconnect, which silently gives up after exhausting its internal retry budget. When this happens the bot goes permanently offline with no recovery path. Wrap the WebSocket lifecycle in an application-level reconnection loop with exponential backoff (2s initial, 60s max), matching the pattern used by the Mattermost channel. The loop:

- Retries on both client creation failures and runtime disconnects
- Resets backoff on successful connections (normal close)
- Respects the abort signal for clean shutdown
- Cleans up `wsClient`/`botOpenIds`/`botNames` state on each cycle

Fixes openclaw#68766
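For orientation, here is a minimal sketch of that pattern. `createWsClient`, `runUntilDisconnect`, and the ambient maps are illustrative placeholders, not the actual code in this PR:

```ts
// Ambient placeholders standing in for the real module state and helpers.
declare function createWsClient(accountId: string): { close(): void };
declare function runUntilDisconnect(
  client: { close(): void },
  signal?: AbortSignal,
): Promise<void>;
declare const wsClients: Map<string, unknown>;
declare const botOpenIds: Map<string, string>;
declare const botNames: Map<string, string>;

const INITIAL_BACKOFF_MS = 2_000;
const MAX_BACKOFF_MS = 60_000;

async function monitorWebSocket(
  accountId: string,
  abortSignal?: AbortSignal,
): Promise<void> {
  let backoffMs = INITIAL_BACKOFF_MS;

  while (!abortSignal?.aborted) {
    try {
      const wsClient = createWsClient(accountId); // creation failures are retried too
      try {
        await runUntilDisconnect(wsClient, abortSignal); // settles on disconnect or abort
        backoffMs = INITIAL_BACKOFF_MS; // normal close: reset the backoff
      } finally {
        try {
          wsClient.close();
        } catch {}
      }
    } catch (err) {
      console.error(`feishu[${accountId}]: WebSocket error`, err);
    } finally {
      // drop per-account state on every cycle
      wsClients.delete(accountId);
      botOpenIds.delete(accountId);
      botNames.delete(accountId);
    }
    if (abortSignal?.aborted) break; // clean shutdown: no retry
    await new Promise((r) => setTimeout(r, backoffMs));
    backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS); // exponential, capped at 60s
  }
}
```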
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 20f7900759
```ts
void wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);
```
Settle the reconnect loop when WS client stops
The new outer `while` loop never gets a chance to reconnect after runtime disconnects because the inner promise only resolves via `handleAbort`, and `wsClient.start` is fire-and-forget. In the failure mode this patch is targeting (SDK auto-reconnect budget exhausted), no abort is triggered, so this `await` never completes and the `reconnecting in ...` log line is never reached; the transport still stays permanently offline instead of recreating the client.
Greptile Summary

This PR wraps the Feishu WebSocket transport in an application-level reconnection loop with exponential backoff (2s–60s) to recover when the Lark SDK exhausts its internal retry budget. The overall structure is sound — abort signal handling, state cleanup in `finally` — but one P1 finding blocks the merge.
Confidence Score: 4/5

Hold for the P1 fix — the runtime-disconnect reconnection path doesn't fire due to `void wsClient.start(...)` discarding the promise. One P1 finding: extensions/feishu/src/monitor.transport.ts, line 140, where the discarded settlement leaves the inner promise pending forever.
```ts
abortSignal?.addEventListener("abort", handleAbort, { once: true });

try {
  void wsClient.start({ eventDispatcher });
```
**`void` discards the disconnect signal — runtime reconnection loop never iterates**

The inner `Promise<void>` can only resolve via `handleAbort()` (abort signal) or reject from a synchronous throw of `wsClient.start()`. When the Lark SDK exhausts its internal retry budget (logging "unable to connect after N times"), `start()` eventually resolves or rejects asynchronously. Because `void` discards that promise, neither outcome reaches the inner promise — it stays pending indefinitely. The `while` loop is stuck, and the app-level reconnection the PR is meant to add never triggers for runtime disconnects.

An additional concern: if `start()` rejects asynchronously, `void` silently swallows it, producing an unhandled promise rejection.

The fix is to feed `start()`'s settlement into the inner promise:

```diff
- void wsClient.start({ eventDispatcher });
+ wsClient.start({ eventDispatcher }).then(resolve, reject);
```

With this change, when the SDK gives up and resolves `start()`, `resolve()` is called, the `finally` cleanup runs, and the `while` loop retries with backoff as intended.
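Wired into the surrounding promise, the suggested fix would look roughly like this — a sketch assuming the structure quoted above, not the PR's exact code:

```ts
// Both paths can now settle the inner promise: the abort signal (clean
// shutdown) and start()'s own resolution/rejection (runtime disconnect).
await new Promise<void>((resolve, reject) => {
  const handleAbort = () => {
    try {
      wsClient.close();
    } catch {}
    resolve();
  };
  abortSignal?.addEventListener("abort", handleAbort, { once: true });
  // Forward start()'s settlement instead of discarding it with `void`,
  // so the outer while loop regains control and can back off and retry.
  wsClient.start({ eventDispatcher }).then(resolve, reject);
});
```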
```ts
function handleAbort() {
  try {
    wsClient.close();
  } catch {}
  wsClients.delete(accountId);
  botOpenIds.delete(accountId);
  botNames.delete(accountId);
  resolve();
}
```
**Redundant cleanup in `handleAbort` — `finally` already covers it**

`handleAbort` calls `wsClient.close()`, deletes from all three maps, then `resolve()`s. The `finally` block (lines 152–158) immediately re-runs the same close + delete sequence unconditionally. The deletions are idempotent and `close()` is guarded by try/catch, so there's no correctness issue — but the three deletions inside `handleAbort` are unnecessary since `finally` always runs after the `await`.

Simplifying `handleAbort` to just `wsClient.close(); resolve();` (or even just `resolve()` if `close()` moves solely to `finally`) reduces the duplication.
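Applied, the simplified handler would read something like this sketch:

```ts
// Sketch: handleAbort only unblocks the wait; the finally block (which runs
// on every exit path) keeps sole ownership of the map cleanup.
function handleAbort() {
  try {
    wsClient.close();
  } catch {}
  resolve();
}
```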
The abort handler already calls `wsClient.close()` and cleans up state. The `finally` block was calling `close()` again unconditionally, causing the cleanup test to see two `close()` calls instead of one. Add a `closedByAbort` guard so the `finally` block skips cleanup when the abort handler already handled it.
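A sketch of that guard, assuming the structure discussed in the review:

```ts
// closedByAbort lets the finally block skip cleanup the abort handler
// already performed, so tests observe exactly one close() per lifecycle.
let closedByAbort = false;

function handleAbort() {
  closedByAbort = true;
  try {
    wsClient.close();
  } catch {}
  wsClients.delete(accountId);
  botOpenIds.delete(accountId);
  botNames.delete(accountId);
  resolve();
}

try {
  // ... start the client and await disconnect or abort ...
} finally {
  if (!closedByAbort) {
    try {
      wsClient.close();
    } catch {}
    wsClients.delete(accountId);
    botOpenIds.delete(accountId);
    botNames.delete(accountId);
  }
}
```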
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aca2a86303
```ts
void wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);
```
Settle reconnect loop on WS runtime termination
The outer reconnection loop cannot progress after runtime disconnects because `monitorWebSocket` blocks on an inner `await new Promise(...)` that only resolves in `handleAbort`, while `wsClient.start` is launched fire-and-forget. In the failure mode this change is trying to fix (SDK auto-reconnect budget exhausted without process abort), no abort event is emitted, so the promise never settles and the code that sleeps/recreates the client is never reached, leaving the transport permanently offline.
Closing this as duplicate or superseded after Codex automated review. This PR should close as superseded by #72411. Current main still has the Feishu one-shot WebSocket monitor, and this PR is not merge-ready because its retry loop still launches `wsClient.start()` fire-and-forget.

Best possible solution: close this PR as superseded, keep the credit to @tianhaocui, and finish the Feishu WebSocket reconnect repair in #72411 or a maintainer-approved successor that handles the Lark SDK `start()` settlement correctly.
So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against f0b327cf686e.
Fixes #68766
Summary
The Feishu WebSocket transport relied solely on the Lark SDK's built-in autoReconnect, which silently gives up after exhausting its internal retry budget. When this happens the bot goes permanently offline with no recovery path.

Root Cause

`cleanup()` calls `wsClient.close()`, which permanently kills the SDK's reconnection loop.

Fix
Wrap the WebSocket lifecycle in an application-level reconnection loop with exponential backoff (2s initial, 60s max), following the same pattern used by the Mattermost channel (`runWithReconnect`).

The loop:

- Retries on both client creation failures and runtime disconnects
- Resets backoff on successful connections (normal close)
- Respects the abort signal for clean shutdown
- Cleans up `wsClient`/`botOpenIds`/`botNames` state on each cycle

Test Plan