
fix(feishu): repair WebSocket reconnect and heartbeat config#72411

Merged
vincentkoc merged 1 commit into main from clownfish/ghcrawl-207048-agentic-merge on Apr 28, 2026

Conversation

@vincentkoc
Member

Summary

Repair the Feishu WebSocket reconnect path using the narrow app-layer approach from #68865, address the unresolved review finding by wiring wsClient.start({ eventDispatcher }) settlement into the reconnect loop, and include the minimal wsConfig PingInterval/PingTimeout constructor fix from #45674 if it is not already present.

Credit

This carries forward work from @alex-xuweilong in #45674, @120106835 in #46472, @sirfengyu in #55619, and @tianhaocui in #68865. The #55619 prototype-mutation approach is not reused because automated review flagged process-wide WSClient side effects and reconnect timing/cancellation blockers.

Validation

  • pnpm check:changed

Fixes #68766. Related to #42354 and #55532.

Project Clownfish replacement details:

@greptile-apps
Contributor

greptile-apps Bot commented Apr 26, 2026

Greptile Summary

Fixes two issues in the Feishu WebSocket monitor: replaces the fire-and-forget void wsClient.start() with an awaited call inside an exponential-backoff retry loop (1 s–30 s), and adds explicit PingInterval/PingTimeout heartbeat defaults to the WSClient constructor. Tests cover the retry-with-backoff path and the new heartbeat config assertion.
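The retry behavior described above can be sketched as follows. This is an illustrative shape, not the PR's actual code: `startOnce` stands in for the SDK's `wsClient.start({ eventDispatcher })` call, and the 1 s base / 30 s cap mirror the summary.

```typescript
// Sketch of the awaited-start pattern with exponential backoff (1 s base,
// 30 s cap). `startOnce` is a stand-in for wsClient.start({ eventDispatcher }).
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;

function backoffDelayMs(attempt: number): number {
  // attempt 0 -> 1 s, attempt 1 -> 2 s, ..., capped at 30 s
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

async function startWithRetry(
  startOnce: () => Promise<void>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      // Awaited, not fire-and-forget: a rejection lands in the catch below.
      await startOnce();
      return;
    } catch {
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

The key difference from the old `void wsClient.start()` form is that a rejected start promise is observed and retried rather than silently dropped.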

Confidence Score: 4/5

Safe to merge with awareness of two P2 edge cases in the reconnect/abort handling.

No P0/P1 defects. Two P2 observations: (1) if wsClient.start() resolves normally (clean SDK disconnect rather than throw), the reconnect loop doesn't fire and the monitor hangs on waitForFeishuWsAbort with no active connection; (2) omitting abortSignal causes waitForFeishuWsAbort to never resolve. Both are edge cases and the second is pre-existing, but they represent observable silent-hang scenarios worth addressing.

extensions/feishu/src/monitor.transport.ts — reconnect-on-clean-resolve and absent-signal paths.


Reviews (1): last reviewed commit "fix(feishu): repair WebSocket reconnect ..."

Comment on lines +149 to +154
attempt = 0;
log(`feishu[${accountId}]: WebSocket client started`);
await waitForFeishuWsAbort(abortSignal);
log(`feishu[${accountId}]: abort signal received, stopping`);
cleanupFeishuWsClient({ accountId, wsClient, error });
return;
Contributor


P2 Reconnect loop doesn't fire on a clean start() resolution

When wsClient.start() resolves without throwing, the code resets attempt (dead code — never read again) and then blocks in waitForFeishuWsAbort until the abortSignal fires. If the SDK's start() can resolve after a clean server-side disconnect (rather than only on a fatal error), the monitor will sit idle with no active WebSocket until an explicit abort arrives — it will not reconnect.

This is safe only if the SDK guarantees that start() either:

  1. blocks forever and only settles by throwing on an unrecoverable failure, or
  2. handles reconnects internally so the connection is still live when start() resolves.

A defensive approach would be to fall through to the retry loop (rather than return) when start() resolves and the signal is not yet aborted, mirroring the SDK-agnostic reconnect semantics already in place for the error path.
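A minimal sketch of that fall-through shape, with hypothetical names (`startOnce`, `runUntilAborted`) rather than the PR's monitor code, assuming the SDK's `start()` can resolve cleanly:

```typescript
// Treat a clean resolve the same as an error unless the abort signal fired:
// fall through to the backoff/retry path instead of returning.
async function runUntilAborted(
  startOnce: () => Promise<void>,
  signal: AbortSignal,
  delayMs: (attempt: number) => number,
  sleep: (ms: number) => Promise<void>,
): Promise<void> {
  for (let attempt = 0; !signal.aborted; attempt++) {
    try {
      await startOnce();
      if (signal.aborted) return;
      // Clean resolve without abort: a server-side disconnect should
      // reconnect, so do not return here.
    } catch {
      if (signal.aborted) return;
    }
    await sleep(delayMs(attempt));
  }
}
```

With this shape the only way the loop exits is an aborted signal, which matches the SDK-agnostic reconnect semantics already in place for the error path.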


Comment on lines +114 to +121
function waitForFeishuWsAbort(abortSignal?: AbortSignal): Promise<void> {
  if (abortSignal?.aborted) {
    return Promise.resolve();
  }
  return new Promise((resolve) => {
    abortSignal?.addEventListener("abort", () => resolve(), { once: true });
  });
}
Contributor


P2 waitForFeishuWsAbort never resolves when abortSignal is omitted

If abortSignal is undefined, abortSignal?.addEventListener(...) is a no-op and the returned Promise is never resolved. Because abortSignal is typed as optional in MonitorTransportParams, a caller that omits it will hang the monitor forever after a successful wsClient.start(). The same gap existed in the prior implementation, but the new while (true) loop makes the hang less visible. Adding a short-circuit return Promise.resolve() when abortSignal is undefined would make the function safe for unconstrained callers.
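The suggested short-circuit can be sketched like this (illustrative `waitForAbort` name, mirroring the reviewed function rather than copying the shipped code):

```typescript
// Resolve immediately when no signal is supplied, so a caller that omits
// abortSignal cannot hang the monitor on an unresolvable Promise.
function waitForAbort(abortSignal?: AbortSignal): Promise<void> {
  if (!abortSignal || abortSignal.aborted) {
    return Promise.resolve();
  }
  return new Promise((resolve) => {
    abortSignal.addEventListener("abort", () => resolve(), { once: true });
  });
}
```

Because the undefined case is handled up front, the listener registration below can use a plain `abortSignal.addEventListener` with no optional chaining.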


@vincentkoc force-pushed the clownfish/ghcrawl-207048-agentic-merge branch from e5e574a to 217a8fb on April 28, 2026 at 00:28
@aisle-research-bot

aisle-research-bot Bot commented Apr 28, 2026

🔒 Aisle Security Analysis

We found 3 potential security issues in this PR:

  1. 🟠 High: Incomplete secret redaction in Feishu WebSocket error logging (app_access_token may leak)
  2. 🟡 Medium: Unbounded error-message processing in WebSocket error logging can cause CPU/memory DoS
  3. 🟡 Medium: Race condition in Feishu WebSocket monitor cleanup can orphan active connections and corrupt global state

1. 🟠 Incomplete secret redaction in Feishu WebSocket error logging (app_access_token may leak)

Severity: High
CWE: CWE-532
Location: extensions/feishu/src/monitor.transport.ts:103-110

Description

monitorWebSocket() logs WebSocket start/close errors using formatFeishuWsErrorForLog(err). The sanitizer attempts to redact common secret fields, but the allowlist of keys is incomplete for Feishu/Lark credentials.

In particular, Feishu commonly uses the field name app_access_token, but the redaction regex only matches tenant_access_token, access_token, etc. As a result, an SDK error message that includes app_access_token=<token> (or similar) will be logged in plaintext.

  • Input: err (from wsClient.start() / wsClient.close()), potentially containing HTTP request/response details and tokens
  • Sink: runtime.error(...) at the retry log line
  • Impact: Secret/token disclosure into logs (often shipped to centralized logging/monitoring), enabling impersonation and unauthorized API calls

Vulnerable code:

.replace(/\b((?:app[_-]?secret|tenant[_-]?access[_-]?token|access[_-]?token|refresh[_-]?token|token|secret|password)\s*[:=]\s*)[^\s&;,]+/gi,
  "$1[redacted]",
)

This does not match app_access_token.

Recommendation

Expand and harden redaction to cover Feishu/Lark token field names (including app_access_token) and prefer a more general approach when possible.

Example fix (add missing key and broaden matching for token-like keys):

const redacted = singleLine
  // ...existing replaces...
  .replace(
    /\b((?:app[_-]?access[_-]?token|app[_-]?secret|tenant[_-]?access[_-]?token|access[_-]?token|refresh[_-]?token|token|secret|password)\s*[:=]\s*)[^\s&;,]+/gi,
    "$1[redacted]",
  );

Additional hardening ideas:

  • Redact common header forms beyond Bearer (e.g., Authorization: Basic ..., X-Token: ...).
  • Consider redacting any key that ends with token/secret/password in JSON-like strings.
  • Keep truncation after redaction (already done) and add tests for app_access_token leakage.
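As a rough regression check for that last point, the broadened pattern can be exercised directly. The regex below copies the recommended fix; `redact` is an illustrative wrapper, not the shipped sanitizer:

```typescript
// Broadened key list from the recommendation, including app_access_token.
const REDACT_RE =
  /\b((?:app[_-]?access[_-]?token|app[_-]?secret|tenant[_-]?access[_-]?token|access[_-]?token|refresh[_-]?token|token|secret|password)\s*[:=]\s*)[^\s&;,]+/gi;

// Replace each key=value (or key: value) secret with "[redacted]",
// keeping the key and separator via the $1 capture group.
function redact(message: string): string {
  return message.replace(REDACT_RE, "$1[redacted]");
}
```

Because `app[_-]?access[_-]?token` is listed before the shorter `access[_-]?token` alternative, the full key is consumed in one match rather than leaving an `app_` prefix behind.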
2. 🟡 Unbounded error-message processing in WebSocket error logging can cause CPU/memory DoS
Severity: Medium
CWE: CWE-400
Location: extensions/feishu/src/monitor.transport.ts:97-121

Description

formatFeishuWsErrorForLog() performs multiple full-string passes (character normalization + several global regex replacements) on Error.message/String(err) before applying any length limit.

  • Input: err from wsClient.start()/wsClient.close() failures can contain large, upstream-controlled strings (e.g., embedded HTTP responses, stack traces, or payloads).
  • Processing: Array.from(raw)…join("") and chained .replace(/.../g, ...) each scan/allocate over the full message.
  • Only after all passes is the output truncated to FEISHU_WS_LOG_ERROR_MAX_LENGTH (500).

An attacker who can trigger failures that include very large messages can force expensive CPU work and large temporary allocations, degrading availability.

Vulnerable code:

const raw = err instanceof Error ? err.message || err.name : String(err);
const singleLine = Array.from(raw, (char) => { /* ... */ }).join("");
const redacted = singleLine
  .replace(/.../g, ...)
  .replace(/.../g, ...)
  .replace(/.../g, ...)
  .replace(/.../g, ...)
  .replace(/\s+/g, " ")
  .trim();
// truncation happens only here
return `${redacted.slice(0, FEISHU_WS_LOG_ERROR_MAX_LENGTH)}...`;

Recommendation

Apply a hard cap to the input string before expensive transformations, and avoid repeated full-string allocations.

Example (truncate early, then redact):

const MAX = FEISHU_WS_LOG_ERROR_MAX_LENGTH;
const OVERSCAN = 200; // small buffer so redaction still works around boundaries

function formatFeishuWsErrorForLog(err: unknown): string {
  const raw0 = err instanceof Error ? (err.message || err.name) : String(err);
  const raw = raw0.length > MAX + OVERSCAN ? raw0.slice(0, MAX + OVERSCAN) : raw0;

  // normalize without Array.from (fewer allocations)
  const singleLine = raw.replace(/[\x00-\x1F\x7F]/g, " ");

  const redacted = singleLine
    .replace(/:\/\/[^:@\/\s]+:[^@\/\s]+@/g, "://[redacted]@")
    .replace(/\b(authorization\s*[:=]\s*Bearer\s+)[^\s,;]+/gi, "$1[redacted]")
    .replace(/\b(Bearer\s+)[A-Za-z0-9._~+\/-]+=*/g, "$1[redacted]")
    .replace(/\b((?:app[_-]?secret|tenant[_-]?access[_-]?token|access[_-]?token|refresh[_-]?token|token|secret|password)\s*[:=]\s*)[^\s&;,]+/gi, "$1[redacted]")
    .replace(/\s+/g, " ")
    .trim();

  if (!redacted) return "unknown error";
  return redacted.length <= MAX ? redacted : `${redacted.slice(0, MAX)}...`;
}

This bounds worst-case runtime/memory even if upstream error messages are huge.

3. 🟡 Race condition in Feishu WebSocket monitor cleanup can orphan active connections and corrupt global state
Severity: Medium
CWE: CWE-362
Location: extensions/feishu/src/monitor.transport.ts:123-141

Description

The new reconnect/cleanup logic in monitorWebSocket() uses global Maps (wsClients, botOpenIds, botNames) keyed only by accountId. The helper cleanupFeishuWsClient() unconditionally deletes these entries for an accountId regardless of whether the wsClient being cleaned up is still the one currently registered.

If monitorWebSocket() is ever started concurrently for the same accountId (e.g., duplicate monitor start, overlapping restarts, or a previous loop iteration still unwinding while a new one begins), one instance can:

  • call wsClients.set(accountId, wsClient) for the new client
  • while another instance (or earlier attempt) calls cleanupFeishuWsClient({ accountId, wsClient: oldClient })
  • which then executes wsClients.delete(accountId) and clears bot identity state

This can orphan a live WebSocket client (it remains connected but is no longer tracked/closeable via stopFeishuMonitorState()), and can cause repeated leaked connections and state corruption, leading to resource exhaustion / denial of service over time.

Vulnerable code:

function cleanupFeishuWsClient(...) {
  // ...
  wsClients.delete(accountId);
  botOpenIds.delete(accountId);
  botNames.delete(accountId);
}

Recommendation

Make cleanup conditional on ownership so one monitor instance cannot delete another instance’s state.

Option A (compare-and-delete): only delete if the current map entry matches the client being cleaned up.

function cleanupFeishuWsClient({ accountId, wsClient, error }: { ... }) {
  if (wsClient) {
    try { wsClient.close(); } catch (err) { error(...); }
    if (wsClients.get(accountId) === wsClient) {
      wsClients.delete(accountId);
    }
  }
  // Only clear bot identity if this monitor instance is the active one,
  // or move bot identity lifecycle management out of WS cleanup.
  if (wsClients.has(accountId) === false) {
    botOpenIds.delete(accountId);
    botNames.delete(accountId);
  }
}

Option B (generation token / per-account mutex): store {client, generation} and only cleanup when generations match, or ensure only one monitorWebSocket() runs per account by guarding starts with a per-account lock/singleton promise.
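One possible shape for the generation-token variant of Option B, with all names hypothetical:

```typescript
// Registry keyed by accountId where each registration gets a monotonically
// increasing generation; cleanup is a compare-and-delete on that generation,
// so a stale monitor instance cannot evict a newer client's state.
interface TrackedClient<C> {
  client: C;
  generation: number;
}

class WsRegistry<C> {
  private entries = new Map<string, TrackedClient<C>>();
  private nextGeneration = 0;

  register(accountId: string, client: C): number {
    const generation = this.nextGeneration++;
    this.entries.set(accountId, { client, generation });
    return generation;
  }

  // Returns false (and leaves state intact) when the caller's generation
  // no longer matches the registered entry.
  cleanup(accountId: string, generation: number): boolean {
    const entry = this.entries.get(accountId);
    if (entry?.generation !== generation) return false;
    this.entries.delete(accountId);
    return true;
  }

  get(accountId: string): C | undefined {
    return this.entries.get(accountId)?.client;
  }
}
```

Each monitor instance would hold the generation returned by `register` and pass it back on cleanup, giving the same ownership guarantee as Option A without comparing client object identity.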


Analyzed PR: #72411 at commit 217a8fb

Last updated on: 2026-04-28T00:30:09Z

@vincentkoc
Member Author

Updated and rebased this branch onto current main.

Addressed the review findings:

  • sanitized Feishu WebSocket startup/close error logs so token-shaped values and control characters are not emitted
  • made abortable backoff cleanup race-safe
  • kept the Lark SDK-managed persistent connection alive after WSClient.start() resolves, while still closing cleanly on monitor abort
  • preserved the heartbeat wsConfig defaults and added regression coverage

Validation:

  • Blacksmith Testbox: pnpm test:serial extensions/feishu/src/async.test.ts extensions/feishu/src/monitor.cleanup.test.ts extensions/feishu/src/client.test.ts passed, 22 tests
  • Blacksmith Testbox: pnpm check:changed passed


Labels

  • channel: feishu (Channel integration: feishu)
  • clawsweeper (Tracked by ClawSweeper automation)
  • maintainer (Maintainer-authored PR)
  • size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Feishu WebSocket connection does not recover after transient token refresh failure

1 participant