feat(acp): add process-liveness keepalive to prevent false prompt timeouts by JerryLiu369 · Pull Request #1702 · iOfficeAI/AionUi

JerryLiu369 · 2026-03-25T06:00:47Z

Summary

Add a periodic keepalive (every 60s) during pending session/prompt requests that checks child process liveness and resets the 300s timeout timer
Prevents false LLM request timed out after 300 seconds errors when an ACP agent is executing a long-running silent tool (e.g. multi-minute build, test suite, or heavy computation that produces no streaming output)
Only touches AcpConnection.ts; no external bridge changes needed

Motivation

The existing timeout resets only on session.update notifications from the ACP bridge. But when an agent runs a tool that produces no output for >5 minutes, no updates are emitted and the timeout fires — even though the agent process is alive and working. The child process exit handler already covers actual crashes, so the timeout was acting as a redundant (and overly aggressive) liveness detector.

Design

sendPrompt()
  ├── startPromptKeepalive()    // setInterval every 60s
  │     └── if child alive → resetSessionPromptTimeouts()
  ├── sendRequest('session/prompt', ...)
  └── finally: stopPromptKeepalive()

Also cleaned up in:
  ├── handleProcessExit()
  └── disconnect()

The keepalive reuses the existing resetSessionPromptTimeouts() method — the same one called on every streaming update — so the timeout semantics are unchanged: "300s since last sign of life". The only difference is that a live process now counts as a sign of life.

Test plan

Run an ACP agent (Codex/Claude) with a task that triggers a long-running bash command (>5 min, e.g. sleep 400)
Verify no false timeout — previously would show "LLM request timed out after 300 seconds"
Verify that killing the agent process still triggers the exit handler (not the timeout)
Verify normal short tasks still complete without delay
Verify disconnect() stops the keepalive cleanly (no leaked intervals)

🤖 Generated with Claude Code

sentry · 2026-03-25T06:03:06Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

…eouts The existing 300-second timeout for session/prompt requests resets only when a session.update notification arrives from the ACP bridge. When an agent runs a long, silent tool (e.g. a multi-minute build or test suite that produces no streaming output), no session.update is emitted and the timeout fires even though the agent is still actively working. Add a lightweight keepalive interval (every 60 s) that checks whether the child process is still alive. If it is, the prompt timeout timer is reset — the same operation that a streaming update would perform. The process-exit handler already covers the case where the child truly crashes, so the timeout is no longer needed as a liveness detector; it now serves purely as a stale-request safety net. Lifecycle: - Started in sendPrompt(), stopped in its finally block - Also stopped in disconnect() and handleProcessExit() for clean teardown Co-Authored-By: Claude Opus 4.6 <[email protected]>

Cover the keepalive lifecycle (start/stop/idempotent), verify it resets prompt timeouts only when the child process is alive and only for session/prompt requests, and check cleanup in disconnect/processExit. Co-Authored-By: Claude Opus 4.6 <[email protected]>

The keepalive fires every 60 s and resets pending prompt timers while the child process is alive. The existing test configured a 60 s timeout then advanced timers to 61 s, but the keepalive reset the timer at exactly 60 s — so cancelPrompt was never called. Fix: mark child.killed = true at 59 s so the keepalive skips the reset, letting the original 60 s timeout fire as expected. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

JiwaniZakir

The liveness check in startPromptKeepalive() — this.child && !this.child.killed — has a subtle gap: ChildProcess.killed is only set to true when the process was terminated via Node's .kill() method. If the child exits naturally (e.g., crashes, returns exit code 1), killed remains false until the 'exit' event is processed and handleProcessExit runs. In that narrow window, the keepalive interval could fire, see killed === false, and call resetSessionPromptTimeouts() on a process that has already died — potentially extending timeouts right before the rejection cascade. A more reliable guard would be to also check this.child.exitCode !== null || this.child.signalCode !== null, or to maintain an explicit isChildAlive boolean that is set to false at the top of handleProcessExit (before stopPromptKeepalive is called).

Also worth noting: the setInterval handle is never .unref()'d, so it will hold the Node.js event loop open for the duration of any prompt. This is probably fine in production, but it can cause test runners to hang if a test fails before the finally block clears the interval — the fake timers in the test suite mitigate this currently, but it's fragile for any future tests that don't use vi.useFakeTimers().

…ted child ChildProcess.killed is only set to true when the process is terminated via Node's .kill() method. A child that crashes naturally (exit code 1, OOM, etc.) leaves killed=false until the exit event is processed by the JS event loop. In that narrow window the keepalive interval could fire, see killed===false, and call resetSessionPromptTimeouts() on a process that has already died. Fix: extract a private isChildAlive() helper that additionally checks exitCode and signalCode. These are set by the Node.js runtime as soon as the process exits — before any JS event handlers run — so they reliably detect a dead child even before handleProcessExit is called.

JerryLiu369 · 2026-03-26T00:35:46Z

Thanks @JiwaniZakir — good catch on the killed gap.

I've added a private isChildAlive() helper that checks all three conditions:

private isChildAlive(): boolean {
  return (
    this.child !== null &&
    !this.child.killed &&
    this.child.exitCode === null &&
    this.child.signalCode === null
  );
}

exitCode and signalCode are set by the Node.js runtime as soon as the process exits — before any JS event handlers run — so they reliably detect a dead child even in the window before handleProcessExit is called. The keepalive now uses this.isChildAlive() instead of the raw !this.child.killed check.

…ct doc order

…led=false)

JiwaniZakir · 2026-03-26T00:48:12Z

The approach makes sense — polling liveness every 60s is a reasonable interval given the 300s timeout window, giving you ~4 checks before a false fire. One thing worth considering: if startPromptKeepalive resets the timer unconditionally as long as the process is alive, a hung process (alive but not making progress) would never time out. You may want to pair this with a max wall-clock cap or a separate "no output received in N minutes" heuristic to distinguish "alive and working" from "alive and stuck." Also confirm the interval is cleared on both normal resolution and error paths to avoid leaking timers across sessions.

A process that hangs (alive but making no progress) would cause the keepalive to reset timeouts indefinitely. Fix by tracking promptOriginTime on each PendingRequest (set at creation, never updated), and only resetting the timeout from the keepalive when Date.now() - promptOriginTime is still within the request's timeoutDuration budget. This means a hung process will time out after at most one full timeout window beyond the point where the wall-clock budget runs out, while legitimately long-running tools are still protected from false timeouts.

JerryLiu369 · 2026-03-26T01:24:14Z

Good points — here's the breakdown:

On timer leaks: All exit paths are covered. The keepalive is stopped in a finally block wrapping sendRequest('session/prompt', ...) (so both normal completion and thrown exceptions clean it up), in handleProcessExit at the top before any rejection logic, and in disconnect() before terminating the child.

On hung process / wall-clock cap: Added promptOriginTime to PendingRequest — set at creation and never updated. The keepalive now only resets timeouts when Date.now() - request.promptOriginTime < request.timeoutDuration, reusing the existing 300 s prompt timeout as the budget. Once the wall-clock budget is exhausted the keepalive stops resetting, and the request times out naturally within one additional timeout window. Added a test case covering this scenario (hung process with promptOriginTime exceeding the budget — keepalive does not reset).

JiwaniZakir · 2026-03-26T01:47:37Z

The keepalive interval approach makes sense here, but make sure startPromptKeepalive() always clears the interval on both successful resolution and rejection of the prompt promise — if the interval isn't cleaned up on error paths, you'll leak timers and potentially keep a dead session alive in the process table. Also worth confirming the child liveness check is synchronous/non-throwing, since an exception inside setInterval can silently swallow errors and leave the keepalive running in a broken state.

JerryLiu369 · 2026-03-26T02:21:59Z

Both points confirmed:

On timer cleanup: stopPromptKeepalive() is called in a finally block wrapping sendRequest('session/prompt', ...), so it fires on both normal resolution and any thrown exception. It is also called at the top of handleProcessExit() and at the start of disconnect(), covering all other exit paths.

On liveness check safety: isChildAlive() is a pure synchronous function — it only reads this.child, this.child.killed, this.child.exitCode, and this.child.signalCode. All four are non-throwing property accesses on a ChildProcess object. No exceptions can originate from this call, so setInterval will never silently swallow an error from the liveness check.

piorpua · 2026-03-26T02:47:33Z

Code Review：feat(acp): add process-liveness keepalive to prevent false prompt timeouts (#1702)

变更概述

此 PR 在 AcpConnection.ts 中新增了一个周期性 keepalive 机制（每 60 秒触发），在 session/prompt 请求 pending 期间检查子进程是否存活，并在存活时重置超时计时器。同时添加了 promptOriginTime 字段作为 wall-clock 上限，防止 hung 进程被无限续期。修改仅涉及 AcpConnection.ts 一个源文件，并配套了完整的单元测试。

方案评估

结论：✅ 方案合理

方案正确解决了 PR 描述的问题：现有超时仅在收到 session.update 流式通知时重置，而长时间静默运行的工具（如多分钟 build/测试）不产生任何通知，导致误超时。keepalive 通过复用已有的 resetSessionPromptTimeouts() 方法，以最小代价解决了问题，且与项目现有的超时暂停/恢复机制（permission dialog）兼容。isChildAlive() 额外检查 exitCode/signalCode 填补了 killed 的已知盲区，设计细节到位。finally 块确保 keepalive 在所有路径下都能清理，handleProcessExit() 和 disconnect() 也均已调用 stopPromptKeepalive()，无泄漏风险。

问题清单

🔵 LOW — 新增测试文件中 `vi.mock` 路径可能不生效

文件：tests/unit/acpConnectionKeepalive.test.ts，第 26–45 行

问题代码：

vi.mock('./acpConnectors', () => ({...}));
vi.mock('./utils', () => ({...}));
vi.mock('./modelInfo', () => ({...}));

问题说明：vi.mock 的相对路径相对于调用文件解析（即 tests/unit/），而 AcpConnection.ts 中的对应 import 相对于 src/process/agent/acp/ 解析，两者是不同的模块 ID。这些 mock 实际上可能没有拦截到 AcpConnection.ts 依赖的真实模块。

CI 能通过是因为这些测试用例通过直接操作私有字段调用私有方法，完全绕开了 connectClaude、writeJsonRpcMessage 等函数，所以即使 mock 无效也不影响测试结果。对比 acpTimeout.test.ts：该文件同样测试 AcpConnection 但完全没有这些 mock，同样通过。

修复建议：删除这 5 个 vi.mock 调用，保持与 acpTimeout.test.ts 一致的做法。如果将来有测试用例触发了 connect() 等路径，再按需补充正确路径格式（如 vi.mock('../../src/process/agent/acp/acpConnectors', ...)）。

🔵 LOW — `priv()` 辅助函数触发 lint 警告

文件：tests/unit/acpConnectionKeepalive.test.ts，第 50–52 行

问题代码：

function priv(conn: AcpConnection): Record<string, any> {
  return conn as unknown as Record<string, any>;
}

问题说明：no-explicit-any lint 规则报了 2 个警告。虽然测试文件通常对 any 有更大容忍度，但现有 acpTimeout.test.ts 中也存在类似写法，属于统一的历史问题，本 PR 新增的这两个是新引入的。

修复建议：

function priv(conn: AcpConnection): Record<string, unknown> {
  return conn as unknown as Record<string, unknown>;
}

汇总

#	严重级别	文件	问题
1	🔵 LOW	`tests/unit/acpConnectionKeepalive.test.ts:26-45`	`vi.mock` 路径相对于测试文件解析，未能拦截源文件中的模块
2	🔵 LOW	`tests/unit/acpConnectionKeepalive.test.ts:50-51`	`priv()` 辅助函数引入 2 个 `no-explicit-any` lint 警告

结论

⚠️ 有条件批准 — 无阻塞性问题，两处均为测试文件的低优先级 lint/结构问题，处理后可合并。

本报告由本地 pr-review skill 生成，包含完整项目上下文，无截断限制。

- Remove ineffective vi.mock calls with wrong relative paths that did not intercept source module dependencies - Replace Record<string, any> with Record<string, unknown> in priv() helper to fix no-explicit-any lint warnings Review follow-up for #1702

JerryLiu369 force-pushed the feat/acp-prompt-keepalive branch 2 times, most recently from b327cfc to b20e857 Compare March 25, 2026 11:50

JerryLiu369 and others added 2 commits March 25, 2026 22:06

JerryLiu369 force-pushed the feat/acp-prompt-keepalive branch from b20e857 to b86e30e Compare March 25, 2026 14:07

JiwaniZakir reviewed Mar 25, 2026

View reviewed changes

JerryLiu369 added 2 commits March 26, 2026 08:37

fix(acp): move isChildAlive() before startPromptKeepalive() for corre…

eece509

…ct doc order

test(acp): add coverage for naturally-exited child (exitCode set, kil…

5ca5224

…led=false)

style: apply oxfmt formatting to AcpConnection.ts

a03c12b

piorpua self-assigned this Mar 26, 2026

piorpua approved these changes Mar 26, 2026

View reviewed changes

piorpua merged commit 6eab65b into iOfficeAI:main Mar 26, 2026
14 checks passed

piorpua mentioned this pull request Mar 26, 2026

fix(acp): address review issues from PR #1702 #1725

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(acp): add process-liveness keepalive to prevent false prompt timeouts#1702

feat(acp): add process-liveness keepalive to prevent false prompt timeouts#1702
piorpua merged 8 commits intoiOfficeAI:mainfrom
JerryLiu369:feat/acp-prompt-keepalive

JerryLiu369 commented Mar 25, 2026

Uh oh!

sentry bot commented Mar 25, 2026 •

edited

Loading

Uh oh!

JiwaniZakir left a comment

Uh oh!

JerryLiu369 commented Mar 26, 2026

Uh oh!

JiwaniZakir commented Mar 26, 2026

Uh oh!

JerryLiu369 commented Mar 26, 2026

Uh oh!

JiwaniZakir commented Mar 26, 2026

Uh oh!

JerryLiu369 commented Mar 26, 2026

Uh oh!

piorpua commented Mar 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

JerryLiu369 commented Mar 25, 2026

Summary

Motivation

Design

Test plan

Uh oh!

sentry bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

JiwaniZakir left a comment

Choose a reason for hiding this comment

Uh oh!

JerryLiu369 commented Mar 26, 2026

Uh oh!

JiwaniZakir commented Mar 26, 2026

Uh oh!

JerryLiu369 commented Mar 26, 2026

Uh oh!

JiwaniZakir commented Mar 26, 2026

Uh oh!

JerryLiu369 commented Mar 26, 2026

Uh oh!

piorpua commented Mar 26, 2026

Code Review：feat(acp): add process-liveness keepalive to prevent false prompt timeouts (#1702)

变更概述

方案评估

问题清单

🔵 LOW — 新增测试文件中 vi.mock 路径可能不生效

🔵 LOW — priv() 辅助函数触发 lint 警告

汇总

结论

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sentry bot commented Mar 25, 2026 •

edited

Loading

🔵 LOW — 新增测试文件中 `vi.mock` 路径可能不生效

🔵 LOW — `priv()` 辅助函数触发 lint 警告