fix(feishu): repair WebSocket reconnect and heartbeat config #72411

vincentkoc merged 1 commit into main
Conversation
Greptile Summary

Fixes two issues in the Feishu WebSocket monitor: replaces the fire-and-forget …

Confidence Score: 4/5. Safe to merge with awareness of two P2 edge cases in the reconnect/abort handling. No P0/P1 defects. Two P2 observations: (1) if …
Reviews (1): Last reviewed commit: "fix(feishu): repair WebSocket reconnect ..."
```ts
attempt = 0;
log(`feishu[${accountId}]: WebSocket client started`);
await waitForFeishuWsAbort(abortSignal);
log(`feishu[${accountId}]: abort signal received, stopping`);
cleanupFeishuWsClient({ accountId, wsClient, error });
return;
```
**Reconnect loop doesn't fire on a clean `start()` resolution**
When `wsClient.start()` resolves without throwing, the code resets `attempt` (dead code: never read again) and then blocks in `waitForFeishuWsAbort` until the `abortSignal` fires. If the SDK's `start()` can resolve after a clean server-side disconnect (rather than only on a fatal error), the monitor will sit idle with no active WebSocket until an explicit abort arrives; it will not reconnect.

This is safe only if the SDK guarantees that `start()` either:

- blocks forever and only settles by throwing on an unrecoverable failure, or
- handles reconnects internally so the connection is still live when `start()` resolves.

A defensive approach would be to fall through to the retry loop (rather than `return`) when `start()` resolves and the signal is not yet aborted, mirroring the SDK-agnostic reconnect semantics already in place for the error path.
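The fall-through shape can be sketched SDK-agnostically. This is an illustrative sketch only, not the monitor's actual code: `runWithReconnect`, `startFn`, and `delayMs` are made-up names standing in for `wsClient.start()` and the existing retry/backoff machinery.

```typescript
// Illustrative sketch: `startFn` stands in for wsClient.start(), and
// the real backoff is reduced to a fixed delay for brevity.
async function runWithReconnect(
  startFn: () => Promise<void>,
  abortSignal: AbortSignal,
  delayMs = 0,
): Promise<void> {
  let attempt = 0;
  while (!abortSignal.aborted) {
    try {
      await startFn();
      attempt = 0; // session ended cleanly: reset backoff
    } catch {
      attempt += 1; // error path: count the failure toward backoff
    }
    if (abortSignal.aborted) {
      return; // explicit abort: stop for real
    }
    // A clean resolution without an abort falls through here and
    // reconnects instead of returning, so a server-side disconnect
    // does not leave the monitor idle.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```

Both settlement paths of `startFn` re-enter the loop; only the abort signal terminates it.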
```ts
function waitForFeishuWsAbort(abortSignal?: AbortSignal): Promise<void> {
  if (abortSignal?.aborted) {
    return Promise.resolve();
  }
  return new Promise((resolve) => {
    abortSignal?.addEventListener("abort", () => resolve(), { once: true });
  });
}
```
**`waitForFeishuWsAbort` never resolves when `abortSignal` is omitted**
If `abortSignal` is `undefined`, `abortSignal?.addEventListener(...)` is a no-op and the returned Promise is never resolved. Because `abortSignal` is typed as optional in `MonitorTransportParams`, a caller that omits it will hang the monitor forever after a successful `wsClient.start()`. The same gap existed in the prior implementation, but the new `while (true)` loop makes the hang less visible. Adding a short-circuit `return Promise.resolve()` when `abortSignal` is undefined would make the function safe for unconstrained callers.
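A minimal version of the suggested guard (a sketch of the proposed fix, not the merged code):

```typescript
// Resolve immediately when no signal is provided, so a caller that
// omits abortSignal cannot hang the monitor after wsClient.start().
function waitForFeishuWsAbort(abortSignal?: AbortSignal): Promise<void> {
  if (!abortSignal || abortSignal.aborted) {
    return Promise.resolve();
  }
  return new Promise((resolve) => {
    abortSignal.addEventListener("abort", () => resolve(), { once: true });
  });
}
```

The explicit `!abortSignal` check also lets the optional-chaining calls be dropped, since the signal is known to exist past the guard.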
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/feishu/src/monitor.transport.ts
Line: 114-121
Comment:
**`waitForFeishuWsAbort` never resolves when `abortSignal` is omitted**
If `abortSignal` is `undefined`, `abortSignal?.addEventListener(...)` is a no-op and the returned Promise is never resolved. Because `abortSignal` is typed as optional in `MonitorTransportParams`, a caller that omits it will hang the monitor forever after a successful `wsClient.start()`. The same gap existed in the prior implementation, but the new `while (true)` loop makes the hang less visible. Adding a short-circuit `return Promise.resolve()` when `abortSignal` is undefined would make the function safe for unconstrained callers.
How can I resolve this? If you propose a fix, please make it concise.7dfabab to
ba03f55
Compare
ba03f55 to
e5e574a
Compare
e5e574a to
217a8fb
Compare
🔒 Aisle Security Analysis

We found 3 potential security issue(s) in this PR:
1. 🟠 Incomplete secret redaction in Feishu WebSocket error logging (app_access_token may leak)
Description

In particular, Feishu commonly uses the field name …

Vulnerable code:

```ts
.replace(
  /\b((?:app[_-]?secret|tenant[_-]?access[_-]?token|access[_-]?token|refresh[_-]?token|token|secret|password)\s*[:=]\s*)[^\s&;,]+/gi,
  "$1[redacted]",
)
```

This does not match …

Recommendation

Expand and harden redaction to cover Feishu/Lark token field names (including …). Example fix (add missing key and broaden matching for token-like keys):

```ts
const redacted = singleLine
  // ...existing replaces...
  .replace(
    /\b((?:app[_-]?access[_-]?token|app[_-]?secret|tenant[_-]?access[_-]?token|access[_-]?token|refresh[_-]?token|token|secret|password)\s*[:=]\s*)[^\s&;,]+/gi,
    "$1[redacted]",
  );
```

Additional hardening ideas: …
2. 🟡 Unbounded error-message processing in WebSocket error logging can cause CPU/memory DoS
Description
An attacker who can trigger failures that include very large messages can force expensive CPU work and large temporary allocations, degrading availability.

Vulnerable code:

```ts
const raw = err instanceof Error ? err.message || err.name : String(err);
const singleLine = Array.from(raw, (char) => { /* ... */ }).join("");
const redacted = singleLine
  .replace(/.../g, ...)
  .replace(/.../g, ...)
  .replace(/.../g, ...)
  .replace(/.../g, ...)
  .replace(/\s+/g, " ")
  .trim();
// truncation happens only here
return `${redacted.slice(0, FEISHU_WS_LOG_ERROR_MAX_LENGTH)}...`;
```

Recommendation

Apply a hard cap to the input string before expensive transformations, and avoid repeated full-string allocations. Example (truncate early, then redact):

```ts
const MAX = FEISHU_WS_LOG_ERROR_MAX_LENGTH;
const OVERSCAN = 200; // small buffer so redaction still works around boundaries

function formatFeishuWsErrorForLog(err: unknown): string {
  const raw0 = err instanceof Error ? (err.message || err.name) : String(err);
  const raw = raw0.length > MAX + OVERSCAN ? raw0.slice(0, MAX + OVERSCAN) : raw0;
  // normalize without Array.from (fewer allocations)
  const singleLine = raw.replace(/[\x00-\x1F\x7F]/g, " ");
  const redacted = singleLine
    .replace(/:\/\/[^:@\/\s]+:[^@\/\s]+@/g, "://[redacted]@")
    .replace(/\b(authorization\s*[:=]\s*Bearer\s+)[^\s,;]+/gi, "$1[redacted]")
    .replace(/\b(Bearer\s+)[A-Za-z0-9._~+\/-]+=*/g, "$1[redacted]")
    .replace(/\b((?:app[_-]?secret|tenant[_-]?access[_-]?token|access[_-]?token|refresh[_-]?token|token|secret|password)\s*[:=]\s*)[^\s&;,]+/gi, "$1[redacted]")
    .replace(/\s+/g, " ")
    .trim();
  if (!redacted) return "unknown error";
  return redacted.length <= MAX ? redacted : `${redacted.slice(0, MAX)}...`;
}
```

This bounds worst-case runtime/memory even if upstream error messages are huge.

3. 🟡 Race condition in Feishu WebSocket monitor cleanup can orphan active connections and corrupt global state
Description

The new reconnect/cleanup logic in … If …

This can orphan a live WebSocket client (it remains connected but is no longer tracked/closeable via …).

Vulnerable code:

```ts
function cleanupFeishuWsClient(...) {
  ...
  wsClients.delete(accountId);
  botOpenIds.delete(accountId);
  botNames.delete(accountId);
}
```

Recommendation

Make cleanup conditional on ownership so one monitor instance cannot delete another instance's state.

Option A (compare-and-delete): only delete if the current map entry matches the client being cleaned up.

```ts
function cleanupFeishuWsClient({ accountId, wsClient, error }: { ... }) {
  if (wsClient) {
    try { wsClient.close(); } catch (err) { error(...); }
    if (wsClients.get(accountId) === wsClient) {
      wsClients.delete(accountId);
    }
  }
  // Only clear bot identity if this monitor instance is the active one,
  // or move bot identity lifecycle management out of WS cleanup.
  if (wsClients.has(accountId) === false) {
    botOpenIds.delete(accountId);
    botNames.delete(accountId);
  }
}
```

Option B (generation token / per-account mutex): store …

Analyzed PR: #72411 at commit … Last updated on: 2026-04-28T00:30:09Z
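Option B is only named above before the text cuts off; a hypothetical sketch of the generation-token idea follows. All names here are illustrative, not the extension's actual state:

```typescript
// Hypothetical Option B sketch: each monitor run takes a per-account
// generation number; cleanup is a no-op unless it still owns the entry.
const generations = new Map<string, number>();
const activeClients = new Map<string, unknown>();

function beginGeneration(accountId: string, wsClient: unknown): number {
  const gen = (generations.get(accountId) ?? 0) + 1;
  generations.set(accountId, gen);
  activeClients.set(accountId, wsClient);
  return gen;
}

function cleanupIfOwner(accountId: string, gen: number): boolean {
  if (generations.get(accountId) !== gen) {
    return false; // a newer monitor instance owns this account now
  }
  activeClients.delete(accountId);
  return true;
}
```

Compared with Option A's compare-and-delete, a generation counter also covers state (like bot identity) that is not keyed by the client object itself.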
Updated and rebased this branch onto current …

Addressed the review findings:
Validation:
Summary
Repair the Feishu WebSocket reconnect path using the narrow app-layer approach from #68865, address the unresolved review finding by wiring `wsClient.start({ eventDispatcher })` settlement into the reconnect loop, and include the minimal `wsConfig` PingInterval/PingTimeout constructor fix from #45674 if it is not already present.

Credit
This carries forward work from @alex-xuweilong in #45674, @120106835 in #46472, @sirfengyu in #55619, and @tianhaocui in #68865. The #55619 prototype-mutation approach is not reused because automated review flagged process-wide WSClient side effects and reconnect timing/cancellation blockers.
Validation
Fixes #68766. Related to #42354 and #55532.
Project: Clownfish replacement details