
fix(feishu): add application-level WebSocket reconnection with backoff #68865

Closed

tianhaocui wants to merge 4 commits into openclaw:main from tianhaocui:fix-feishu-ws-reconnect


Conversation

@tianhaocui
Contributor

Fixes #68766

Summary

The Feishu WebSocket transport relied solely on the Lark SDK's built-in autoReconnect, which silently gives up after exhausting its internal retry budget. When this happens the bot goes permanently offline with no recovery path.

Root Cause

  • The Lark SDK's internal reconnection has a limited retry budget
  • cleanup() calls wsClient.close() which permanently kills the SDK's reconnection loop
  • No application-level recovery exists — once the SDK gives up, the bot stays offline

Fix

Wrap the WebSocket lifecycle in an application-level reconnection loop with exponential backoff (2s initial, 60s max), following the same pattern used by the Mattermost channel (runWithReconnect).

The loop (sketched in code after this list):

  • Retries on both client creation failures (e.g. token refresh timeout) and runtime disconnects
  • Resets backoff on successful connections (normal close)
  • Respects the abort signal for clean shutdown
  • Cleans up wsClient/botOpenIds/botNames state on each cycle
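For reference, a minimal sketch of the shape of this loop. The names createFeishuWSClient, sleepAbortable, log, and the wsClients/botOpenIds/botNames maps come from this PR's discussion; their signatures here are assumptions, not the verbatim patch, and the sketch already wires start() per the review feedback below:

```ts
async function monitorWebSocket(accountId: string, abortSignal?: AbortSignal): Promise<void> {
  const INITIAL_BACKOFF_MS = 2_000;
  const MAX_BACKOFF_MS = 60_000;
  let backoffMs = INITIAL_BACKOFF_MS;

  while (!abortSignal?.aborted) {
    try {
      // Covers client-creation failures (e.g. token refresh timeout)...
      const { wsClient, eventDispatcher } = await createFeishuWSClient(accountId);
      await new Promise<void>((resolve, reject) => {
        abortSignal?.addEventListener("abort", () => resolve(), { once: true });
        // ...and runtime disconnects: start()'s settlement drives this promise.
        wsClient.start({ eventDispatcher }).then(resolve, reject);
      });
      backoffMs = INITIAL_BACKOFF_MS; // normal close or abort: reset backoff
    } catch (err) {
      log(`feishu[${accountId}]: WebSocket down (${String(err)}); retrying in ${backoffMs}ms`);
    } finally {
      // Clean up per-account state on each cycle.
      wsClients.delete(accountId);
      botOpenIds.delete(accountId);
      botNames.delete(accountId);
    }
    if (abortSignal?.aborted) break;
    await sleepAbortable(backoffMs, abortSignal); // abort-aware sleep for clean shutdown
    backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS);
  }
}
```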

Test Plan

  • Start Feishu WebSocket transport
  • Simulate token refresh failure (e.g. network interruption during refresh)
  • Verify bot reconnects with exponential backoff instead of going permanently offline
  • Verify abort signal still cleanly stops the transport

@openclaw-barnacle Bot added the channel: feishu (Channel integration: feishu) and size: S labels on Apr 19, 2026

@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20f7900759


Comment on lines +140 to +141
void wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);


P1: Settle the reconnect loop when WS client stops

The new outer while loop never gets a chance to reconnect after runtime disconnects because the inner promise only resolves via handleAbort, and wsClient.start is fire-and-forget. In the failure mode this patch is targeting (SDK auto-reconnect budget exhausted), no abort is triggered, so this await never completes and the code at reconnecting in ... is never reached; the transport still stays permanently offline instead of recreating the client.
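The failure mode reduces to a few lines (a hypothetical distillation of the reviewed code, not a verbatim quote):

```ts
// The awaited promise settles only via the abort handler; `void` discards
// start()'s own settlement, so the SDK giving up never surfaces here.
await new Promise<void>((resolve) => {
  abortSignal?.addEventListener("abort", () => resolve(), { once: true });
  void wsClient.start({ eventDispatcher }); // fire-and-forget
});
// On a runtime disconnect with no abort, execution never reaches the
// backoff/reconnect code that follows this await.
```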


@greptile-apps
Contributor

greptile-apps Bot commented Apr 19, 2026

Greptile Summary

This PR wraps the Feishu WebSocket transport in an application-level reconnection loop with exponential backoff (2s–60s) to recover when the Lark SDK exhausts its internal retry budget. The overall structure is sound — abort signal handling, state cleanup in finally, and backoff reset on clean close all work correctly.

  • P1: void wsClient.start({ eventDispatcher }) discards the promise returned by start(). When the Lark SDK gives up reconnecting and settles that promise, the inner await new Promise<void> never learns about it and hangs indefinitely — meaning the app-level reconnection loop cannot iterate for runtime WS disconnects, which is the primary scenario this PR addresses. Changing to wsClient.start({ eventDispatcher }).then(resolve, reject) closes this gap.

Confidence Score: 4/5

Hold for the P1 fix — the runtime-disconnect reconnection path doesn't fire due to void discarding the start() promise.

One P1 finding: void wsClient.start() means the inner Promise only ever resolves on abort, so the while-loop reconnection never triggers for the exact scenario (SDK retry exhaustion) this PR aims to fix. The token-refresh-failure path (createFeishuWSClient throws) does work correctly. All other aspects — abort handling, backoff math, state cleanup, sleepAbortable — are correct.

extensions/feishu/src/monitor.transport.ts — specifically line 140 where void wsClient.start() needs to become wsClient.start({ eventDispatcher }).then(resolve, reject).


abortSignal?.addEventListener("abort", handleAbort, { once: true });

try {
void wsClient.start({ eventDispatcher });
Contributor


P1 void discards the disconnect signal — runtime reconnection loop never iterates

The inner Promise<void> can only resolve via handleAbort() (abort signal) or reject from a synchronous throw of wsClient.start(). When the Lark SDK exhausts its internal retry budget (logging "unable to connect after N times"), start() eventually resolves or rejects asynchronously. Because void discards that promise, neither outcome reaches the inner promise — it stays pending indefinitely. The while loop is stuck, and the app-level reconnection the PR is meant to add never triggers for runtime disconnects.

An additional concern: if start() rejects asynchronously, void silently swallows it, producing an unhandled promise rejection.

The fix is to feed start()'s settlement into the inner promise:

Suggested change:
-          void wsClient.start({ eventDispatcher });
+          wsClient.start({ eventDispatcher }).then(resolve, reject);

With this change, when the SDK gives up and resolves start(), resolve() is called, the finally cleanup runs, and the while loop retries with backoff as intended.
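In context, the corrected inner promise is wired roughly like this (a sketch using the handleAbort and eventDispatcher names from the diff; the surrounding loop is omitted):

```ts
await new Promise<void>((resolve, reject) => {
  abortSignal?.addEventListener("abort", handleAbort, { once: true });
  // start()'s settlement now drives the promise: an SDK give-up resolves it,
  // an async failure rejects it; either way the finally cleanup and the
  // outer loop's backoff retry run.
  wsClient.start({ eventDispatcher }).then(resolve, reject);
});
```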


Comment on lines +122 to +129
function handleAbort() {
try {
wsClient.close();
} catch {}
wsClients.delete(accountId);
botOpenIds.delete(accountId);
botNames.delete(accountId);
resolve();
Contributor


P2: Redundant cleanup in handleAbort; finally already covers it

handleAbort calls wsClient.close(), deletes from all three maps, then resolve()s. The finally block (lines 152–158) immediately re-runs the same close + delete sequence unconditionally. The deletions are idempotent and close() is guarded by try/catch, so there's no correctness issue — but the three deletions inside handleAbort are unnecessary since finally always runs after the await.

Simplifying handleAbort to just wsClient.close(); resolve(); (or even just resolve() if close() moves solely to finally) reduces the duplication.
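The suggested shape, as a sketch (it assumes close() and the map deletions move solely to the finally block; resolve, wsClient, and the maps belong to the surrounding promise executor):

```ts
function handleAbort() {
  // Only unblock the awaited promise; the finally block below owns cleanup.
  resolve();
}

try {
  // ... await the connection promise ...
} finally {
  try { wsClient.close(); } catch {}
  wsClients.delete(accountId);
  botOpenIds.delete(accountId);
  botNames.delete(accountId);
}
```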


The abort handler already calls wsClient.close() and cleans up state.
The finally block was calling close() again unconditionally, causing
the cleanup test to see 2 close() calls instead of 1.

Add a closedByAbort guard so the finally block skips cleanup when the
abort handler already handled it.
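A minimal sketch of that guard, consistent with the commit message above (the surrounding promise executor and state maps are assumed):

```ts
let closedByAbort = false;

function handleAbort() {
  closedByAbort = true; // mark that cleanup has already run
  try { wsClient.close(); } catch {}
  wsClients.delete(accountId);
  botOpenIds.delete(accountId);
  botNames.delete(accountId);
  resolve();
}

try {
  // ... await the connection promise ...
} finally {
  if (!closedByAbort) {
    try { wsClient.close(); } catch {}
    wsClients.delete(accountId);
    botOpenIds.delete(accountId);
    botNames.delete(accountId);
  }
}
```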

@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aca2a86303


Comment on lines +142 to +143
void wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);


P1: Settle reconnect loop on WS runtime termination

The outer reconnection loop cannot progress after runtime disconnects because monitorWebSocket blocks on an inner await new Promise(...) that only resolves in handleAbort, while wsClient.start is launched fire-and-forget. In the failure mode this change is trying to fix (SDK auto-reconnect budget exhausted without process abort), no abort event is emitted, so the promise never settles and the code that sleeps/recreates the client is never reached, leaving the transport permanently offline.


@clawsweeper
Contributor

clawsweeper Bot commented Apr 27, 2026

Closing this as duplicate or superseded after Codex automated review.

This PR should close as superseded by #72411. Current main still has the Feishu one-shot WebSocket monitor, and this PR is not merge-ready because its retry loop still launches wsClient.start() fire-and-forget. The newer maintainer PR explicitly carries this work forward with credit, targets the same #68766 bug, and is the better canonical place to finish the Feishu reconnect repair.

Best possible solution:

Close this PR as superseded, keep the useful credit to @tianhaocui, and finish the Feishu WebSocket reconnect repair in #72411 or a maintainer-approved successor that handles the Lark SDK WSClient.start() contract and adds targeted regression coverage.

What I checked:

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against f0b327cf686e.


Labels

channel: feishu (Channel integration: feishu), size: S


Development

Successfully merging this pull request may close these issues.

[Bug]: Feishu WebSocket connection does not recover after transient token refresh failure
