fix(feishu): exponential backoff + PingInterval guard for WS reconnect#55619
fix(feishu): exponential backoff + PingInterval guard for WS reconnect#55619sirfengyu wants to merge 8 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR monkey-patches Confidence Score: 5/5Safe to merge — both bug fixes are well-scoped, the error filter is appropriately narrow, and the backoff math is correct. Both fixes address real production crashes/loops with targeted, low-risk prototype patches. The PingInterval guard re-throws everything else, and the backoff resets properly on fresh starts. The only actionable gap (no jitter) is a best-practice improvement that does not affect correctness. Prior P1 concerns from an earlier review were already addressed in this revision. No files require special attention.
|
| Filename | Overview |
|---|---|
| extensions/feishu/src/client.ts | Adds a module-load-time monkey-patch to Lark.WSClient.prototype that (1) wraps handleControlData in a try/catch to silently swallow the PingInterval crash on system_busy responses, and (2) replaces reConnect with a version that applies exponential backoff (2→4→8→15 min) with counter reset on isStart=true. Logic is sound; only suggestion is adding jitter to avoid synchronized retries across multiple clients. |
Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/feishu/src/client.ts
Line: 85-90
Comment:
**No jitter in exponential backoff**
The backoff is deterministic — every reconnecting `WSClient` instance waits the exact same duration at each attempt. During a platform-wide Feishu rate-limit event, all connected bots (or multiple accounts) will reach their retry deadline simultaneously and fire reconnect requests in lock-step, recreating the burst you're trying to avoid.
Adding a small random jitter (e.g. ±20 %) spreads retries across time and is standard practice for distributed backoff:
```suggestion
const jitter = 1 + (Math.random() - 0.5) * 0.4; // ±20%
const backoff = Math.min(reconnectInterval * Math.pow(2, count - 1) * jitter, MAX_RECONNECT_BACKOFF_MS);
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(feishu): narrow PingInterval error f..." | Re-trigger Greptile
| const backoff = Math.min(reconnectInterval * Math.pow(2, count - 1), MAX_RECONNECT_BACKOFF_MS); | ||
| this.logger?.info?.( | ||
| "[feishu-ws-patch]", | ||
| `reconnect backoff: ${Math.round(backoff / 1000)}s (attempt ${count})`, | ||
| ); | ||
| await new Promise<void>((resolve) => setTimeout(resolve, backoff)); |
There was a problem hiding this comment.
No jitter in exponential backoff
The backoff is deterministic — every reconnecting WSClient instance waits the exact same duration at each attempt. During a platform-wide Feishu rate-limit event, all connected bots (or multiple accounts) will reach their retry deadline simultaneously and fire reconnect requests in lock-step, recreating the burst you're trying to avoid.
Adding a small random jitter (e.g. ±20 %) spreads retries across time and is standard practice for distributed backoff:
| const backoff = Math.min(reconnectInterval * Math.pow(2, count - 1), MAX_RECONNECT_BACKOFF_MS); | |
| this.logger?.info?.( | |
| "[feishu-ws-patch]", | |
| `reconnect backoff: ${Math.round(backoff / 1000)}s (attempt ${count})`, | |
| ); | |
| await new Promise<void>((resolve) => setTimeout(resolve, backoff)); | |
| const jitter = 1 + (Math.random() - 0.5) * 0.4; // ±20% | |
| const backoff = Math.min(reconnectInterval * Math.pow(2, count - 1) * jitter, MAX_RECONNECT_BACKOFF_MS); |
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/feishu/src/client.ts
Line: 85-90
Comment:
**No jitter in exponential backoff**
The backoff is deterministic — every reconnecting `WSClient` instance waits the exact same duration at each attempt. During a platform-wide Feishu rate-limit event, all connected bots (or multiple accounts) will reach their retry deadline simultaneously and fire reconnect requests in lock-step, recreating the burst you're trying to avoid.
Adding a small random jitter (e.g. ±20 %) spreads retries across time and is standard practice for distributed backoff:
```suggestion
const jitter = 1 + (Math.random() - 0.5) * 0.4; // ±20%
const backoff = Math.min(reconnectInterval * Math.pow(2, count - 1) * jitter, MAX_RECONNECT_BACKOFF_MS);
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 64c2d087a1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (isStart) { | ||
| // Reset backoff counter on a fresh start | ||
| (this as Record<string, unknown>)._feishuReconnectCount = 0; | ||
| } |
There was a problem hiding this comment.
Reset backoff after any successful reconnect
The reconnect counter is only reset in the isStart branch, but runtime reconnects are triggered via reConnect(false) from the SDK close handler. After one outage recovers, _feishuReconnectCount stays incremented, so later unrelated disconnects start at 4/8/15 minute delays instead of 2 minutes and can remain capped at 15 minutes for the rest of the process lifetime.
Useful? React with 👍 / 👎.
| "[feishu-ws-patch]", | ||
| `reconnect backoff: ${Math.round(backoff / 1000)}s (attempt ${count})`, | ||
| ); | ||
| await new Promise<void>((resolve) => setTimeout(resolve, backoff)); |
There was a problem hiding this comment.
Cancel delayed reconnect sleep when client closes
The new pre-reconnect backoff uses a standalone setTimeout promise that is not tracked by WSClient, so calling client.close() during this wait cannot cancel it. In that case the delayed callback still invokes origReConnect, which can re-open a connection after the monitor has been stopped/disabled, causing unexpected background reconnects.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb7236d161
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| await new Promise<void>((resolve) => setTimeout(resolve, backoff)); | ||
| } | ||
| return origReConnect.call(this, isStart); |
There was a problem hiding this comment.
Bypass SDK retry timer after external backoff
This wrapper adds await setTimeout(backoff) and then calls origReConnect with isStart=false, so the Lark SDK's own non-start reconnect scheduling still runs afterward. In WSClient.reConnect(false), the SDK already waits (reconnectNonce/reconnectInterval) before retrying, so this change stacks two delays and makes reconnect outages much longer than intended (first retry becomes backoff + SDK delay instead of just one backoff window).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1db52de678
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| | ((data: { headers: Array<{ key: string; value: string }>; payload?: Uint8Array }) => Promise<void>) | ||
| | undefined; | ||
| if (origHandleControlData) { | ||
| (proto as unknown as Record<string, unknown>).handleControlData = async function ( |
There was a problem hiding this comment.
Replace global WSClient prototype mutation
Root AGENTS.md explicitly says not to share behavior via prototype mutation without explicit approval, but this commit assigns new functions onto WSClient.prototype. That creates process-wide side effects (all SDK WSClient instances are altered, not just Feishu clients from this module) and makes reconnect/control-flow behavior depend on untyped monkey patches that can silently break on upstream SDK updates. Please move this to an explicit wrapper/composition seam or document the required approval before shipping.
Useful? React with 👍 / 👎.
Fix two bugs in the Feishu WebSocket reconnect logic: 1. PingInterval crash: When Feishu returns 'system busy' (1000040345), the server does not send a Pong frame with PingInterval. The Lark SDK crashes with 'Cannot read properties of undefined (reading PingInterval)' because it assumes PingInterval is always present. Added a guard to catch and swallow this specific error. 2. Fixed reconnect interval: The Lark SDK reconnects at a fixed reconnectInterval (default 120s) with no backoff. During a Feishu rate-limit event, this causes a tight loop of failed reconnects that amplifies the problem. Added exponential backoff (2min → 4min → 8min → 15min cap) on non-start reconnects. Both fixes are applied as a one-time monkey-patch on the WSClient prototype at module load time, so they affect all Feishu accounts.
…er on success P1 fixes per greptile review: - Narrow error filter from 'PingInterval || undefined' to 'PingInterval' only to avoid swallowing unrelated errors - Reset _feishuReconnectCount to 0 on successful (isStart=true) connection so future failures start fresh at 2min instead of jumping to 16min Fixes openclaw#55532
…ript compliance Replace direct property access with index signature (Record<string, unknown>) to bypass TypeScript private member checks without using 'as any'. This fixes CI build errors while maintaining the same runtime behavior. Changes: - proto.handleControlData -> proto["handleControlData"] - proto.reConnect -> proto["reConnect"] - Remove 'as any' casts for prototype patching
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6edc665713
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (isStart) { | ||
| // Reset backoff counter on a fresh start | ||
| self._feishuReconnectCount = 0; | ||
| } |
There was a problem hiding this comment.
Reset backoff counter after any successful reconnect
The counter is only reset in the isStart branch, but runtime recoveries happen through reConnect(false) from the SDK close handler. In the SDK’s non-start reconnect loop, a successful reconnect returns directly without re-entering reConnect(true), so _feishuReconnectCount keeps growing across unrelated outages and later disconnects start at higher delays (eventually pinned near the cap) instead of restarting from the base interval.
Useful? React with 👍 / 👎.
| "[feishu-ws-patch]", | ||
| `reconnect backoff: ${Math.round(backoff / 1000)}s (attempt ${count}, jitter ${Math.round((jitter - 1) * 100)}%)`, | ||
| ); | ||
| await new Promise<void>((resolve) => setTimeout(resolve, backoff)); |
There was a problem hiding this comment.
Cancel pre-reconnect timer when WS client closes
The added setTimeout-based sleep is not tracked on the SDK instance, so client.close() cannot clear it. If the monitor is stopped during this wait, the delayed callback still runs and calls origReConnect, which can re-open a connection after shutdown because only the SDK’s internal timers are canceled by close().
Useful? React with 👍 / 👎.
| await new Promise<void>((resolve) => setTimeout(resolve, backoff)); | ||
| } | ||
| return origReConnect.call(this, isStart); |
There was a problem hiding this comment.
Avoid stacking custom backoff with SDK reconnect delay
This wrapper waits backoff and then calls origReConnect(false), but the SDK’s non-start reconnect path already applies its own nonce/interval timers before attempting reconnection. That means reconnect latency becomes custom backoff + SDK delay, making outages materially longer than intended and deviating from the documented retry schedule.
Useful? React with 👍 / 👎.
|
Codex review: keeping this open for maintainer follow-up; there is still a little grit to resolve. Keep this PR open. It is paired with same-author open bug #55532, and current main does not yet prove the PR’s intended Feishu reconnect/PingInterval fix is implemented. Main has later Feishu reconnect work from #72411, but the locked Lark SDK contract still makes the replacement uncertain: Best possible solution: Keep this PR open for maintainer reconciliation with #55532 and the merged #72411 direction. The best path is to verify the real Lark SDK 1.62.0 What I checked:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 6d323ee73640. |
Summary
Fix two bugs in the Feishu WebSocket reconnect logic that cause extended outages during Feishu rate-limiting events.
Bug 1: PingInterval undefined crash
When Feishu returns
system busy(error code 1000040345), the server does not send a Pong frame withPingInterval. The Lark SDK crashes with:Fix: Wrap
handleControlDatain a try/catch that silently swallows this specific error.Bug 2: Fixed reconnect interval amplifies rate limiting
The Lark SDK reconnects at a fixed interval (default 120s) with no backoff. During a Feishu rate-limit event, this creates a tight loop of failed reconnects.
Fix: Added exponential backoff on non-start reconnects: 2min → 4min → 8min → 15min (capped).
P1 fixes (per greptile review)
msg.includes("PingInterval") || msg.includes("undefined")tomsg.includes("PingInterval")only — avoids swallowing unrelated errors_feishuReconnectCountto 0 on successful (isStart=true) connection — so future failures start fresh at 2min instead of jumping to 16minTesting
[feishu-ws-patch] reconnect backoff: Xs (attempt N)Issue
Fixes #55532