
fix(feishu): add application-level WebSocket reconnection with backoff #68865

Closed

tianhaocui wants to merge 4 commits into openclaw:main from tianhaocui:fix-feishu-ws-reconnect


Conversation

@tianhaocui
Contributor

Fixes #68766

Summary

The Feishu WebSocket transport relied solely on the Lark SDK's built-in autoReconnect, which silently gives up after exhausting its internal retry budget. When this happens the bot goes permanently offline with no recovery path.

Root Cause

  • The Lark SDK's internal reconnection has a limited retry budget
  • cleanup() calls wsClient.close() which permanently kills the SDK's reconnection loop
  • No application-level recovery exists — once the SDK gives up, the bot stays offline

Fix

Wrap the WebSocket lifecycle in an application-level reconnection loop with exponential backoff (2s initial, 60s max), following the same pattern used by the Mattermost channel (runWithReconnect).

The loop (sketched in code after this list):

  • Retries on both client creation failures (e.g. token refresh timeout) and runtime disconnects
  • Resets backoff on successful connections (normal close)
  • Respects the abort signal for clean shutdown
  • Cleans up wsClient/botOpenIds/botNames state on each cycle
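For reference, a minimal sketch of the shape of this loop. The names createFeishuWSClient, sleepAbortable, log, and the wsClients/botOpenIds/botNames maps come from this PR's discussion; their signatures here are assumptions, not the verbatim patch, and the sketch already wires start() per the review feedback below:

```ts
async function monitorWebSocket(accountId: string, abortSignal?: AbortSignal): Promise<void> {
  const INITIAL_BACKOFF_MS = 2_000;
  const MAX_BACKOFF_MS = 60_000;
  let backoffMs = INITIAL_BACKOFF_MS;

  while (!abortSignal?.aborted) {
    try {
      // Covers client-creation failures (e.g. token refresh timeout)...
      const { wsClient, eventDispatcher } = await createFeishuWSClient(accountId);
      await new Promise<void>((resolve, reject) => {
        abortSignal?.addEventListener("abort", () => resolve(), { once: true });
        // ...and runtime disconnects: start()'s settlement drives this promise.
        wsClient.start({ eventDispatcher }).then(resolve, reject);
      });
      backoffMs = INITIAL_BACKOFF_MS; // normal close or abort: reset backoff
    } catch (err) {
      log(`feishu[${accountId}]: WebSocket down (${String(err)}); retrying in ${backoffMs}ms`);
    } finally {
      // Clean up per-account state on each cycle.
      wsClients.delete(accountId);
      botOpenIds.delete(accountId);
      botNames.delete(accountId);
    }
    if (abortSignal?.aborted) break;
    await sleepAbortable(backoffMs, abortSignal); // abort-aware sleep for clean shutdown
    backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS);
  }
}
```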

Test Plan

  • Start Feishu WebSocket transport
  • Simulate token refresh failure (e.g. network interruption during refresh)
  • Verify bot reconnects with exponential backoff instead of going permanently offline
  • Verify abort signal still cleanly stops the transport

@openclaw-barnacle Bot added the channel: feishu (Channel integration: feishu) and size: S labels on Apr 19, 2026

@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20f7900759


Comment on lines +140 to +141
void wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);


P1: Settle the reconnect loop when WS client stops

The new outer while loop never gets a chance to reconnect after runtime disconnects because the inner promise only resolves via handleAbort, and wsClient.start is fire-and-forget. In the failure mode this patch is targeting (SDK auto-reconnect budget exhausted), no abort is triggered, so this await never completes and the code at reconnecting in ... is never reached; the transport still stays permanently offline instead of recreating the client.
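The failure mode reduces to a few lines (a hypothetical distillation of the reviewed code, not a verbatim quote):

```ts
// The awaited promise settles only via the abort handler; `void` discards
// start()'s own settlement, so the SDK giving up never surfaces here.
await new Promise<void>((resolve) => {
  abortSignal?.addEventListener("abort", () => resolve(), { once: true });
  void wsClient.start({ eventDispatcher }); // fire-and-forget
});
// On a runtime disconnect with no abort, execution never reaches the
// backoff/reconnect code that follows this await.
```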


@greptile-apps
Contributor

greptile-apps Bot commented Apr 19, 2026

Greptile Summary

This PR wraps the Feishu WebSocket transport in an application-level reconnection loop with exponential backoff (2s–60s) to recover when the Lark SDK exhausts its internal retry budget. The overall structure is sound — abort signal handling, state cleanup in finally, and backoff reset on clean close all work correctly.

  • P1: void wsClient.start({ eventDispatcher }) discards the promise returned by start(). When the Lark SDK gives up reconnecting and settles that promise, the inner await new Promise<void> never learns about it and hangs indefinitely — meaning the app-level reconnection loop cannot iterate for runtime WS disconnects, which is the primary scenario this PR addresses. Changing to wsClient.start({ eventDispatcher }).then(resolve, reject) closes this gap.

Confidence Score: 4/5

Hold for the P1 fix — the runtime-disconnect reconnection path doesn't fire due to void discarding the start() promise.

One P1 finding: void wsClient.start() means the inner Promise only ever resolves on abort, so the while-loop reconnection never triggers for the exact scenario (SDK retry exhaustion) this PR aims to fix. The token-refresh-failure path (createFeishuWSClient throws) does work correctly. All other aspects — abort handling, backoff math, state cleanup, sleepAbortable — are correct.

extensions/feishu/src/monitor.transport.ts — specifically line 140 where void wsClient.start() needs to become wsClient.start({ eventDispatcher }).then(resolve, reject).


abortSignal?.addEventListener("abort", handleAbort, { once: true });

try {
void wsClient.start({ eventDispatcher });
Contributor


P1 void discards the disconnect signal — runtime reconnection loop never iterates

The inner Promise<void> can only resolve via handleAbort() (abort signal) or reject from a synchronous throw of wsClient.start(). When the Lark SDK exhausts its internal retry budget (logging "unable to connect after N times"), start() eventually resolves or rejects asynchronously. Because void discards that promise, neither outcome reaches the inner promise — it stays pending indefinitely. The while loop is stuck, and the app-level reconnection the PR is meant to add never triggers for runtime disconnects.

An additional concern: if start() rejects asynchronously, void silently swallows it, producing an unhandled promise rejection.

The fix is to feed start()'s settlement into the inner promise:

Suggested change:
-          void wsClient.start({ eventDispatcher });
+          wsClient.start({ eventDispatcher }).then(resolve, reject);

With this change, when the SDK gives up and resolves start(), resolve() is called, the finally cleanup runs, and the while loop retries with backoff as intended.
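In context, the corrected inner promise is wired roughly like this (a sketch using the handleAbort and eventDispatcher names from the diff; the surrounding loop is omitted):

```ts
await new Promise<void>((resolve, reject) => {
  abortSignal?.addEventListener("abort", handleAbort, { once: true });
  // start()'s settlement now drives the promise: an SDK give-up resolves it,
  // an async failure rejects it; either way the finally cleanup and the
  // outer loop's backoff retry run.
  wsClient.start({ eventDispatcher }).then(resolve, reject);
});
```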


Comment on lines +122 to +129
function handleAbort() {
try {
wsClient.close();
} catch {}
wsClients.delete(accountId);
botOpenIds.delete(accountId);
botNames.delete(accountId);
resolve();
Contributor


P2: Redundant cleanup in handleAbort; finally already covers it

handleAbort calls wsClient.close(), deletes from all three maps, then resolve()s. The finally block (lines 152–158) immediately re-runs the same close + delete sequence unconditionally. The deletions are idempotent and close() is guarded by try/catch, so there's no correctness issue — but the three deletions inside handleAbort are unnecessary since finally always runs after the await.

Simplifying handleAbort to just wsClient.close(); resolve(); (or even just resolve() if close() moves solely to finally) reduces the duplication.
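The suggested shape, as a sketch (it assumes close() and the map deletions move solely to the finally block; resolve, wsClient, and the maps belong to the surrounding promise executor):

```ts
function handleAbort() {
  // Only unblock the awaited promise; the finally block below owns cleanup.
  resolve();
}

try {
  // ... await the connection promise ...
} finally {
  try { wsClient.close(); } catch {}
  wsClients.delete(accountId);
  botOpenIds.delete(accountId);
  botNames.delete(accountId);
}
```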


The abort handler already calls wsClient.close() and cleans up state.
The finally block was calling close() again unconditionally, causing
the cleanup test to see 2 close() calls instead of 1.

Add a closedByAbort guard so the finally block skips cleanup when the
abort handler already handled it.
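A minimal sketch of that guard, consistent with the commit message above (the surrounding promise executor and state maps are assumed):

```ts
let closedByAbort = false;

function handleAbort() {
  closedByAbort = true; // mark that cleanup has already run
  try { wsClient.close(); } catch {}
  wsClients.delete(accountId);
  botOpenIds.delete(accountId);
  botNames.delete(accountId);
  resolve();
}

try {
  // ... await the connection promise ...
} finally {
  if (!closedByAbort) {
    try { wsClient.close(); } catch {}
    wsClients.delete(accountId);
    botOpenIds.delete(accountId);
    botNames.delete(accountId);
  }
}
```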

@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aca2a86303


Comment on lines +142 to +143
void wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);


P1: Settle reconnect loop on WS runtime termination

The outer reconnection loop cannot progress after runtime disconnects because monitorWebSocket blocks on an inner await new Promise(...) that only resolves in handleAbort, while wsClient.start is launched fire-and-forget. In the failure mode this change is trying to fix (SDK auto-reconnect budget exhausted without process abort), no abort event is emitted, so the promise never settles and the code that sleeps/recreates the client is never reached, leaving the transport permanently offline.


@clawsweeper
Contributor

clawsweeper Bot commented Apr 27, 2026

Closing this as duplicate or superseded after Codex automated review.

This PR should close as superseded by #72411. Current main still has the Feishu one-shot WebSocket monitor, and this PR is not merge-ready because its retry loop still launches wsClient.start() fire-and-forget. The newer maintainer PR explicitly carries this work forward with credit, targets the same #68766 bug, and is the better canonical place to finish the Feishu reconnect repair.

Best possible solution:

Close this PR as superseded, keep the useful credit to @tianhaocui, and finish the Feishu WebSocket reconnect repair in #72411 or a maintainer-approved successor that handles the Lark SDK WSClient.start() contract and adds targeted regression coverage.

What I checked:

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against f0b327cf686e.


Labels

channel: feishu (Channel integration: feishu), size: S


Development

Successfully merging this pull request may close these issues.

[Bug]: Feishu WebSocket connection does not recover after transient token refresh failure
