
fix(feishu): repair WebSocket reconnect and heartbeat config#72411

Merged
vincentkoc merged 1 commit into main from clownfish/ghcrawl-207048-agentic-merge on Apr 28, 2026

Conversation

@vincentkoc
Member

Summary

Repair the Feishu WebSocket reconnect path using the narrow app-layer approach from #68865, address the unresolved review finding by wiring wsClient.start({ eventDispatcher }) settlement into the reconnect loop, and include the minimal wsConfig PingInterval/PingTimeout constructor fix from #45674 if it is not already present.

Credit

This carries forward work from @alex-xuweilong in #45674, @120106835 in #46472, @sirfengyu in #55619, and @tianhaocui in #68865. The #55619 prototype-mutation approach is not reused because automated review flagged process-wide WSClient side effects and reconnect timing/cancellation blockers.

Validation

  • pnpm check:changed

Fixes #68766. Related to #42354 and #55532.

Project Clownfish replacement details:

@greptile-apps
Contributor

greptile-apps Bot commented Apr 26, 2026

Greptile Summary

Fixes two issues in the Feishu WebSocket monitor: replaces the fire-and-forget void wsClient.start() with an awaited call inside an exponential-backoff retry loop (1 s–30 s), and adds explicit PingInterval/PingTimeout heartbeat defaults to the WSClient constructor. Tests cover the retry-with-backoff path and the new heartbeat config assertion.
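The retry behavior described above can be sketched as follows. This is an illustrative shape, not the PR's actual code: `startOnce` stands in for the SDK's `wsClient.start({ eventDispatcher })` call, and the 1 s base / 30 s cap mirror the summary.

```typescript
// Sketch of the awaited-start pattern with exponential backoff (1 s base,
// 30 s cap). `startOnce` is a stand-in for wsClient.start({ eventDispatcher }).
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;

function backoffDelayMs(attempt: number): number {
  // attempt 0 -> 1 s, attempt 1 -> 2 s, ..., capped at 30 s
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

async function startWithRetry(
  startOnce: () => Promise<void>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      // Awaited, not fire-and-forget: a rejection lands in the catch below.
      await startOnce();
      return;
    } catch {
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

The key difference from the old `void wsClient.start()` form is that a rejected start promise is observed and retried rather than silently dropped.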

Confidence Score: 4/5

Safe to merge with awareness of two P2 edge cases in the reconnect/abort handling.

No P0/P1 defects. Two P2 observations: (1) if wsClient.start() resolves normally (clean SDK disconnect rather than throw), the reconnect loop doesn't fire and the monitor hangs on waitForFeishuWsAbort with no active connection; (2) omitting abortSignal causes waitForFeishuWsAbort to never resolve. Both are edge cases and the second is pre-existing, but they represent observable silent-hang scenarios worth addressing.

extensions/feishu/src/monitor.transport.ts — reconnect-on-clean-resolve and absent-signal paths.


Reviews (1): last reviewed commit "fix(feishu): repair WebSocket reconnect ..."

Comment on lines +149 to +154
attempt = 0;
log(`feishu[${accountId}]: WebSocket client started`);
await waitForFeishuWsAbort(abortSignal);
log(`feishu[${accountId}]: abort signal received, stopping`);
cleanupFeishuWsClient({ accountId, wsClient, error });
return;
Contributor


P2 Reconnect loop doesn't fire on a clean start() resolution

When wsClient.start() resolves without throwing, the code resets attempt (dead code — never read again) and then blocks in waitForFeishuWsAbort until the abortSignal fires. If the SDK's start() can resolve after a clean server-side disconnect (rather than only on a fatal error), the monitor will sit idle with no active WebSocket until an explicit abort arrives — it will not reconnect.

This is safe only if the SDK guarantees that start() either:

  1. blocks forever and only settles by throwing on an unrecoverable failure, or
  2. handles reconnects internally so the connection is still live when start() resolves.

A defensive approach would be to fall through to the retry loop (rather than return) when start() resolves and the signal is not yet aborted, mirroring the SDK-agnostic reconnect semantics already in place for the error path.
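A minimal sketch of that fall-through shape, with hypothetical names (`startOnce`, `runUntilAborted`) rather than the PR's monitor code, assuming the SDK's `start()` can resolve cleanly:

```typescript
// Treat a clean resolve the same as an error unless the abort signal fired:
// fall through to the backoff/retry path instead of returning.
async function runUntilAborted(
  startOnce: () => Promise<void>,
  signal: AbortSignal,
  delayMs: (attempt: number) => number,
  sleep: (ms: number) => Promise<void>,
): Promise<void> {
  for (let attempt = 0; !signal.aborted; attempt++) {
    try {
      await startOnce();
      if (signal.aborted) return;
      // Clean resolve without abort: a server-side disconnect should
      // reconnect, so do not return here.
    } catch {
      if (signal.aborted) return;
    }
    await sleep(delayMs(attempt));
  }
}
```

With this shape the only way the loop exits is an aborted signal, which matches the SDK-agnostic reconnect semantics already in place for the error path.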


Comment on lines +114 to +121
function waitForFeishuWsAbort(abortSignal?: AbortSignal): Promise<void> {
  if (abortSignal?.aborted) {
    return Promise.resolve();
  }
  return new Promise((resolve) => {
    abortSignal?.addEventListener("abort", () => resolve(), { once: true });
  });
}
Contributor


P2 waitForFeishuWsAbort never resolves when abortSignal is omitted

If abortSignal is undefined, abortSignal?.addEventListener(...) is a no-op and the returned Promise is never resolved. Because abortSignal is typed as optional in MonitorTransportParams, a caller that omits it will hang the monitor forever after a successful wsClient.start(). The same gap existed in the prior implementation, but the new while (true) loop makes the hang less visible. Adding a short-circuit return Promise.resolve() when abortSignal is undefined would make the function safe for unconstrained callers.
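The suggested short-circuit can be sketched like this (illustrative `waitForAbort` name, mirroring the reviewed function rather than copying the shipped code):

```typescript
// Resolve immediately when no signal is supplied, so a caller that omits
// abortSignal cannot hang the monitor on an unresolvable Promise.
function waitForAbort(abortSignal?: AbortSignal): Promise<void> {
  if (!abortSignal || abortSignal.aborted) {
    return Promise.resolve();
  }
  return new Promise((resolve) => {
    abortSignal.addEventListener("abort", () => resolve(), { once: true });
  });
}
```

Because the undefined case is handled up front, the listener registration below can use a plain `abortSignal.addEventListener` with no optional chaining.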


@vincentkoc force-pushed the clownfish/ghcrawl-207048-agentic-merge branch from e5e574a to 217a8fb on April 28, 2026 at 00:28
@aisle-research-bot

aisle-research-bot Bot commented Apr 28, 2026

🔒 Aisle Security Analysis

We found 3 potential security issues in this PR:

  1. 🟠 High: Incomplete secret redaction in Feishu WebSocket error logging (app_access_token may leak)
  2. 🟡 Medium: Unbounded error-message processing in WebSocket error logging can cause CPU/memory DoS
  3. 🟡 Medium: Race condition in Feishu WebSocket monitor cleanup can orphan active connections and corrupt global state

1. 🟠 Incomplete secret redaction in Feishu WebSocket error logging (app_access_token may leak)

Severity: High
CWE: CWE-532
Location: extensions/feishu/src/monitor.transport.ts:103-110

Description

monitorWebSocket() logs WebSocket start/close errors using formatFeishuWsErrorForLog(err). The sanitizer attempts to redact common secret fields, but the allowlist of keys is incomplete for Feishu/Lark credentials.

In particular, Feishu commonly uses the field name app_access_token, but the redaction regex only matches tenant_access_token, access_token, etc. As a result, an SDK error message that includes app_access_token=<token> (or similar) will be logged in plaintext.

  • Input: err (from wsClient.start() / wsClient.close()), potentially containing HTTP request/response details and tokens
  • Sink: runtime.error(...) at the retry log line
  • Impact: Secret/token disclosure into logs (often shipped to centralized logging/monitoring), enabling impersonation and unauthorized API calls

Vulnerable code:

.replace(/\b((?:app[_-]?secret|tenant[_-]?access[_-]?token|access[_-]?token|refresh[_-]?token|token|secret|password)\s*[:=]\s*)[^\s&;,]+/gi,
  "$1[redacted]",
)

This does not match app_access_token.

Recommendation

Expand and harden redaction to cover Feishu/Lark token field names (including app_access_token) and prefer a more general approach when possible.

Example fix (add missing key and broaden matching for token-like keys):

const redacted = singleLine
  // ...existing replaces...
  .replace(
    /\b((?:app[_-]?access[_-]?token|app[_-]?secret|tenant[_-]?access[_-]?token|access[_-]?token|refresh[_-]?token|token|secret|password)\s*[:=]\s*)[^\s&;,]+/gi,
    "$1[redacted]",
  );

Additional hardening ideas:

  • Redact common header forms beyond Bearer (e.g., Authorization: Basic ..., X-Token: ...).
  • Consider redacting any key that ends with token/secret/password in JSON-like strings.
  • Keep truncation after redaction (already done) and add tests for app_access_token leakage.
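As a rough regression check for that last point, the broadened pattern can be exercised directly. The regex below copies the recommended fix; `redact` is an illustrative wrapper, not the shipped sanitizer:

```typescript
// Broadened key list from the recommendation, including app_access_token.
const REDACT_RE =
  /\b((?:app[_-]?access[_-]?token|app[_-]?secret|tenant[_-]?access[_-]?token|access[_-]?token|refresh[_-]?token|token|secret|password)\s*[:=]\s*)[^\s&;,]+/gi;

// Replace each key=value (or key: value) secret with "[redacted]",
// keeping the key and separator via the $1 capture group.
function redact(message: string): string {
  return message.replace(REDACT_RE, "$1[redacted]");
}
```

Because `app[_-]?access[_-]?token` is listed before the shorter `access[_-]?token` alternative, the full key is consumed in one match rather than leaving an `app_` prefix behind.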
2. 🟡 Unbounded error-message processing in WebSocket error logging can cause CPU/memory DoS
Severity: Medium
CWE: CWE-400
Location: extensions/feishu/src/monitor.transport.ts:97-121

Description

formatFeishuWsErrorForLog() performs multiple full-string passes (character normalization + several global regex replacements) on Error.message/String(err) before applying any length limit.

  • Input: err from wsClient.start()/wsClient.close() failures can contain large, upstream-controlled strings (e.g., embedded HTTP responses, stack traces, or payloads).
  • Processing: Array.from(raw)…join("") and chained .replace(/.../g, ...) each scan/allocate over the full message.
  • Only after all passes is the output truncated to FEISHU_WS_LOG_ERROR_MAX_LENGTH (500).

An attacker who can trigger failures that include very large messages can force expensive CPU work and large temporary allocations, degrading availability.

Vulnerable code:

const raw = err instanceof Error ? err.message || err.name : String(err);
const singleLine = Array.from(raw, (char) => { /* ... */ }).join("");
const redacted = singleLine
  .replace(/.../g, ...)
  .replace(/.../g, ...)
  .replace(/.../g, ...)
  .replace(/.../g, ...)
  .replace(/\s+/g, " ")
  .trim();
// truncation happens only here
return `${redacted.slice(0, FEISHU_WS_LOG_ERROR_MAX_LENGTH)}...`;

Recommendation

Apply a hard cap to the input string before expensive transformations, and avoid repeated full-string allocations.

Example (truncate early, then redact):

const MAX = FEISHU_WS_LOG_ERROR_MAX_LENGTH;
const OVERSCAN = 200; // small buffer so redaction still works around boundaries

function formatFeishuWsErrorForLog(err: unknown): string {
  const raw0 = err instanceof Error ? (err.message || err.name) : String(err);
  const raw = raw0.length > MAX + OVERSCAN ? raw0.slice(0, MAX + OVERSCAN) : raw0;

  // normalize without Array.from (fewer allocations)
  const singleLine = raw.replace(/[\x00-\x1F\x7F]/g, " ");

  const redacted = singleLine
    .replace(/:\/\/[^:@\/\s]+:[^@\/\s]+@/g, "://[redacted]@")
    .replace(/\b(authorization\s*[:=]\s*Bearer\s+)[^\s,;]+/gi, "$1[redacted]")
    .replace(/\b(Bearer\s+)[A-Za-z0-9._~+\/-]+=*/g, "$1[redacted]")
    .replace(/\b((?:app[_-]?secret|tenant[_-]?access[_-]?token|access[_-]?token|refresh[_-]?token|token|secret|password)\s*[:=]\s*)[^\s&;,]+/gi, "$1[redacted]")
    .replace(/\s+/g, " ")
    .trim();

  if (!redacted) return "unknown error";
  return redacted.length <= MAX ? redacted : `${redacted.slice(0, MAX)}...`;
}

This bounds worst-case runtime/memory even if upstream error messages are huge.

3. 🟡 Race condition in Feishu WebSocket monitor cleanup can orphan active connections and corrupt global state
Severity: Medium
CWE: CWE-362
Location: extensions/feishu/src/monitor.transport.ts:123-141

Description

The new reconnect/cleanup logic in monitorWebSocket() uses global Maps (wsClients, botOpenIds, botNames) keyed only by accountId. The helper cleanupFeishuWsClient() unconditionally deletes these entries for an accountId regardless of whether the wsClient being cleaned up is still the one currently registered.

If monitorWebSocket() is ever started concurrently for the same accountId (e.g., duplicate monitor start, overlapping restarts, or a previous loop iteration still unwinding while a new one begins), one instance can:

  • call wsClients.set(accountId, wsClient) for the new client
  • while another instance (or earlier attempt) calls cleanupFeishuWsClient({ accountId, wsClient: oldClient })
  • which then executes wsClients.delete(accountId) and clears bot identity state

This can orphan a live WebSocket client (it remains connected but is no longer tracked/closeable via stopFeishuMonitorState()), and can cause repeated leaked connections and state corruption, leading to resource exhaustion / denial of service over time.

Vulnerable code:

function cleanupFeishuWsClient(...) {
  // ...
  wsClients.delete(accountId);
  botOpenIds.delete(accountId);
  botNames.delete(accountId);
}

Recommendation

Make cleanup conditional on ownership so one monitor instance cannot delete another instance’s state.

Option A (compare-and-delete): only delete if the current map entry matches the client being cleaned up.

function cleanupFeishuWsClient({ accountId, wsClient, error }: { ... }) {
  if (wsClient) {
    try { wsClient.close(); } catch (err) { error(...); }
    if (wsClients.get(accountId) === wsClient) {
      wsClients.delete(accountId);
    }
  }
  // Only clear bot identity if this monitor instance is the active one,
  // or move bot identity lifecycle management out of WS cleanup.
  if (wsClients.has(accountId) === false) {
    botOpenIds.delete(accountId);
    botNames.delete(accountId);
  }
}

Option B (generation token / per-account mutex): store {client, generation} and only cleanup when generations match, or ensure only one monitorWebSocket() runs per account by guarding starts with a per-account lock/singleton promise.
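One possible shape for the generation-token variant of Option B, with all names hypothetical:

```typescript
// Registry keyed by accountId where each registration gets a monotonically
// increasing generation; cleanup is a compare-and-delete on that generation,
// so a stale monitor instance cannot evict a newer client's state.
interface TrackedClient<C> {
  client: C;
  generation: number;
}

class WsRegistry<C> {
  private entries = new Map<string, TrackedClient<C>>();
  private nextGeneration = 0;

  register(accountId: string, client: C): number {
    const generation = this.nextGeneration++;
    this.entries.set(accountId, { client, generation });
    return generation;
  }

  // Returns false (and leaves state intact) when the caller's generation
  // no longer matches the registered entry.
  cleanup(accountId: string, generation: number): boolean {
    const entry = this.entries.get(accountId);
    if (entry?.generation !== generation) return false;
    this.entries.delete(accountId);
    return true;
  }

  get(accountId: string): C | undefined {
    return this.entries.get(accountId)?.client;
  }
}
```

Each monitor instance would hold the generation returned by `register` and pass it back on cleanup, giving the same ownership guarantee as Option A without comparing client object identity.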


Analyzed PR: #72411 at commit 217a8fb

Last updated on: 2026-04-28T00:30:09Z

@vincentkoc
Member Author

Updated and rebased this branch onto current main.

Addressed the review findings:

  • sanitized Feishu WebSocket startup/close error logs so token-shaped values and control characters are not emitted
  • made abortable backoff cleanup race-safe
  • kept the Lark SDK-managed persistent connection alive after WSClient.start() resolves, while still closing cleanly on monitor abort
  • preserved the heartbeat wsConfig defaults and added regression coverage

Validation:

  • Blacksmith Testbox: pnpm test:serial extensions/feishu/src/async.test.ts extensions/feishu/src/monitor.cleanup.test.ts extensions/feishu/src/client.test.ts passed, 22 tests
  • Blacksmith Testbox: pnpm check:changed passed


Labels

  • channel: feishu (Channel integration: feishu)
  • clawsweeper (Tracked by ClawSweeper automation)
  • maintainer (Maintainer-authored PR)
  • size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Feishu WebSocket connection does not recover after transient token refresh failure

1 participant