Skip to content

feat(acp): add process-liveness keepalive to prevent false prompt timeouts#1702

Merged
piorpua merged 8 commits intoiOfficeAI:mainfrom
JerryLiu369:feat/acp-prompt-keepalive
Mar 26, 2026
Merged

feat(acp): add process-liveness keepalive to prevent false prompt timeouts#1702
piorpua merged 8 commits intoiOfficeAI:mainfrom
JerryLiu369:feat/acp-prompt-keepalive

Conversation

@JerryLiu369
Copy link
Copy Markdown
Member

Summary

  • Add a periodic keepalive (every 60s) during pending session/prompt requests that checks child process liveness and resets the 300s timeout timer
  • Prevents false LLM request timed out after 300 seconds errors when an ACP agent is executing a long-running silent tool (e.g. multi-minute build, test suite, or heavy computation that produces no streaming output)
  • Only touches AcpConnection.ts; no external bridge changes needed

Motivation

The existing timeout resets only on session.update notifications from the ACP bridge. But when an agent runs a tool that produces no output for >5 minutes, no updates are emitted and the timeout fires — even though the agent process is alive and working. The child process exit handler already covers actual crashes, so the timeout was acting as a redundant (and overly aggressive) liveness detector.

Design

sendPrompt()
  ├── startPromptKeepalive()    // setInterval every 60s
  │     └── if child alive → resetSessionPromptTimeouts()
  ├── sendRequest('session/prompt', ...)
  └── finally: stopPromptKeepalive()

Also cleaned up in:
  ├── handleProcessExit()
  └── disconnect()

The keepalive reuses the existing resetSessionPromptTimeouts() method — the same one called on every streaming update — so the timeout semantics are unchanged: "300s since last sign of life". The only difference is that a live process now counts as a sign of life.

Test plan

  • Run an ACP agent (Codex/Claude) with a task that triggers a long-running bash command (>5 min, e.g. sleep 400)
  • Verify no false timeout — previously would show "LLM request timed out after 300 seconds"
  • Verify that killing the agent process still triggers the exit handler (not the timeout)
  • Verify normal short tasks still complete without delay
  • Verify disconnect() stops the keepalive cleanly (no leaked intervals)

🤖 Generated with Claude Code

@sentry
Copy link
Copy Markdown

sentry bot commented Mar 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@JerryLiu369 JerryLiu369 force-pushed the feat/acp-prompt-keepalive branch 2 times, most recently from b327cfc to b20e857 Compare March 25, 2026 11:50
JerryLiu369 and others added 2 commits March 25, 2026 22:06
…eouts

The existing 300-second timeout for session/prompt requests resets only
when a session.update notification arrives from the ACP bridge.  When
an agent runs a long, silent tool (e.g. a multi-minute build or test
suite that produces no streaming output), no session.update is emitted
and the timeout fires even though the agent is still actively working.

Add a lightweight keepalive interval (every 60 s) that checks whether
the child process is still alive.  If it is, the prompt timeout timer
is reset — the same operation that a streaming update would perform.
The process-exit handler already covers the case where the child truly
crashes, so the timeout is no longer needed as a liveness detector;
it now serves purely as a stale-request safety net.

Lifecycle:
- Started in sendPrompt(), stopped in its finally block
- Also stopped in disconnect() and handleProcessExit() for clean teardown

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Cover the keepalive lifecycle (start/stop/idempotent), verify it resets
prompt timeouts only when the child process is alive and only for
session/prompt requests, and check cleanup in disconnect/processExit.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@JerryLiu369 JerryLiu369 force-pushed the feat/acp-prompt-keepalive branch from b20e857 to b86e30e Compare March 25, 2026 14:07
The keepalive fires every 60 s and resets pending prompt timers while
the child process is alive. The existing test configured a 60 s timeout
then advanced timers to 61 s, but the keepalive reset the timer at
exactly 60 s — so cancelPrompt was never called.

Fix: mark child.killed = true at 59 s so the keepalive skips the reset,
letting the original 60 s timeout fire as expected.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Copy link
Copy Markdown

@JiwaniZakir JiwaniZakir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The liveness check in startPromptKeepalive()this.child && !this.child.killed — has a subtle gap: ChildProcess.killed is only set to true when the process was terminated via Node's .kill() method. If the child exits naturally (e.g., crashes, returns exit code 1), killed remains false until the 'exit' event is processed and handleProcessExit runs. In that narrow window, the keepalive interval could fire, see killed === false, and call resetSessionPromptTimeouts() on a process that has already died — potentially extending timeouts right before the rejection cascade. A more reliable guard would be to also check this.child.exitCode !== null || this.child.signalCode !== null, or to maintain an explicit isChildAlive boolean that is set to false at the top of handleProcessExit (before stopPromptKeepalive is called).

Also worth noting: the setInterval handle is never .unref()'d, so it will hold the Node.js event loop open for the duration of any prompt. This is probably fine in production, but it can cause test runners to hang if a test fails before the finally block clears the interval — the fake timers in the test suite mitigate this currently, but it's fragile for any future tests that don't use vi.useFakeTimers().

…ted child

ChildProcess.killed is only set to true when the process is terminated
via Node's .kill() method. A child that crashes naturally (exit code 1,
OOM, etc.) leaves killed=false until the exit event is processed by the
JS event loop.

In that narrow window the keepalive interval could fire, see killed===false,
and call resetSessionPromptTimeouts() on a process that has already died.

Fix: extract a private isChildAlive() helper that additionally checks
exitCode and signalCode. These are set by the Node.js runtime as soon as
the process exits — before any JS event handlers run — so they reliably
detect a dead child even before handleProcessExit is called.
@JerryLiu369
Copy link
Copy Markdown
Member Author

Thanks @JiwaniZakir — good catch on the killed gap.

I've added a private isChildAlive() helper that checks all three conditions:

private isChildAlive(): boolean {
  return (
    this.child !== null &&
    !this.child.killed &&
    this.child.exitCode === null &&
    this.child.signalCode === null
  );
}

exitCode and signalCode are set by the Node.js runtime as soon as the process exits — before any JS event handlers run — so they reliably detect a dead child even in the window before handleProcessExit is called. The keepalive now uses this.isChildAlive() instead of the raw !this.child.killed check.

@JiwaniZakir
Copy link
Copy Markdown

The approach makes sense — polling liveness every 60s is a reasonable interval given the 300s timeout window, giving you ~4 checks before a false fire. One thing worth considering: if startPromptKeepalive resets the timer unconditionally as long as the process is alive, a hung process (alive but not making progress) would never time out. You may want to pair this with a max wall-clock cap or a separate "no output received in N minutes" heuristic to distinguish "alive and working" from "alive and stuck." Also confirm the interval is cleared on both normal resolution and error paths to avoid leaking timers across sessions.

A process that hangs (alive but making no progress) would cause the
keepalive to reset timeouts indefinitely. Fix by tracking promptOriginTime
on each PendingRequest (set at creation, never updated), and only resetting
the timeout from the keepalive when Date.now() - promptOriginTime is still
within the request's timeoutDuration budget.

This means a hung process will time out after at most one full timeout
window beyond the point where the wall-clock budget runs out, while
legitimately long-running tools are still protected from false timeouts.
@JerryLiu369
Copy link
Copy Markdown
Member Author

Good points — here's the breakdown:

On timer leaks: All exit paths are covered. The keepalive is stopped in a finally block wrapping sendRequest('session/prompt', ...) (so both normal completion and thrown exceptions clean it up), in handleProcessExit at the top before any rejection logic, and in disconnect() before terminating the child.

On hung process / wall-clock cap: Added promptOriginTime to PendingRequest — set at creation and never updated. The keepalive now only resets timeouts when Date.now() - request.promptOriginTime < request.timeoutDuration, reusing the existing 300 s prompt timeout as the budget. Once the wall-clock budget is exhausted the keepalive stops resetting, and the request times out naturally within one additional timeout window. Added a test case covering this scenario (hung process with promptOriginTime exceeding the budget — keepalive does not reset).

@JiwaniZakir
Copy link
Copy Markdown

The keepalive interval approach makes sense here, but make sure startPromptKeepalive() always clears the interval on both successful resolution and rejection of the prompt promise — if the interval isn't cleaned up on error paths, you'll leak timers and potentially keep a dead session alive in the process table. Also worth confirming the child liveness check is synchronous/non-throwing, since an exception inside setInterval can silently swallow errors and leave the keepalive running in a broken state.

@JerryLiu369
Copy link
Copy Markdown
Member Author

Both points confirmed:

On timer cleanup: stopPromptKeepalive() is called in a finally block wrapping sendRequest('session/prompt', ...), so it fires on both normal resolution and any thrown exception. It is also called at the top of handleProcessExit() and at the start of disconnect(), covering all other exit paths.

On liveness check safety: isChildAlive() is a pure synchronous function — it only reads this.child, this.child.killed, this.child.exitCode, and this.child.signalCode. All four are non-throwing property accesses on a ChildProcess object. No exceptions can originate from this call, so setInterval will never silently swallow an error from the liveness check.

@piorpua piorpua self-assigned this Mar 26, 2026
@piorpua
Copy link
Copy Markdown
Contributor

piorpua commented Mar 26, 2026

Code Review:feat(acp): add process-liveness keepalive to prevent false prompt timeouts (#1702)

变更概述

此 PR 在 AcpConnection.ts 中新增了一个周期性 keepalive 机制(每 60 秒触发),在 session/prompt 请求 pending 期间检查子进程是否存活,并在存活时重置超时计时器。同时添加了 promptOriginTime 字段作为 wall-clock 上限,防止 hung 进程被无限续期。修改仅涉及 AcpConnection.ts 一个源文件,并配套了完整的单元测试。


方案评估

结论:✅ 方案合理

方案正确解决了 PR 描述的问题:现有超时仅在收到 session.update 流式通知时重置,而长时间静默运行的工具(如多分钟 build/测试)不产生任何通知,导致误超时。keepalive 通过复用已有的 resetSessionPromptTimeouts() 方法,以最小代价解决了问题,且与项目现有的超时暂停/恢复机制(permission dialog)兼容。isChildAlive() 额外检查 exitCode/signalCode 填补了 killed 的已知盲区,设计细节到位。finally 块确保 keepalive 在所有路径下都能清理,handleProcessExit()disconnect() 也均已调用 stopPromptKeepalive(),无泄漏风险。


问题清单

🔵 LOW — 新增测试文件中 vi.mock 路径可能不生效

文件tests/unit/acpConnectionKeepalive.test.ts,第 26–45 行

问题代码

vi.mock('./acpConnectors', () => ({...}));
vi.mock('./utils', () => ({...}));
vi.mock('./modelInfo', () => ({...}));

问题说明vi.mock 的相对路径相对于调用文件解析(即 tests/unit/),而 AcpConnection.ts 中的对应 import 相对于 src/process/agent/acp/ 解析,两者是不同的模块 ID。这些 mock 实际上可能没有拦截到 AcpConnection.ts 依赖的真实模块。

CI 能通过是因为这些测试用例通过直接操作私有字段调用私有方法,完全绕开了 connectClaudewriteJsonRpcMessage 等函数,所以即使 mock 无效也不影响测试结果。对比 acpTimeout.test.ts:该文件同样测试 AcpConnection 但完全没有这些 mock,同样通过。

修复建议:删除这 5 个 vi.mock 调用,保持与 acpTimeout.test.ts 一致的做法。如果将来有测试用例触发了 connect() 等路径,再按需补充正确路径格式(如 vi.mock('../../src/process/agent/acp/acpConnectors', ...))。


🔵 LOW — priv() 辅助函数触发 lint 警告

文件tests/unit/acpConnectionKeepalive.test.ts,第 50–52 行

问题代码

function priv(conn: AcpConnection): Record<string, any> {
  return conn as unknown as Record<string, any>;
}

问题说明no-explicit-any lint 规则报了 2 个警告。虽然测试文件通常对 any 有更大容忍度,但现有 acpTimeout.test.ts 中也存在类似写法,属于统一的历史问题,本 PR 新增的这两个是新引入的。

修复建议

function priv(conn: AcpConnection): Record<string, unknown> {
  return conn as unknown as Record<string, unknown>;
}

汇总

# 严重级别 文件 问题
1 🔵 LOW tests/unit/acpConnectionKeepalive.test.ts:26-45 vi.mock 路径相对于测试文件解析,未能拦截源文件中的模块
2 🔵 LOW tests/unit/acpConnectionKeepalive.test.ts:50-51 priv() 辅助函数引入 2 个 no-explicit-any lint 警告

结论

⚠️ 有条件批准 — 无阻塞性问题,两处均为测试文件的低优先级 lint/结构问题,处理后可合并。


本报告由本地 pr-review skill 生成,包含完整项目上下文,无截断限制。

@piorpua piorpua merged commit 6eab65b into iOfficeAI:main Mar 26, 2026
14 checks passed
piorpua pushed a commit that referenced this pull request Mar 26, 2026
- Remove ineffective vi.mock calls with wrong relative paths that did
  not intercept source module dependencies
- Replace Record<string, any> with Record<string, unknown> in priv()
  helper to fix no-explicit-any lint warnings

Review follow-up for #1702
kaizhou-lab pushed a commit that referenced this pull request Mar 26, 2026
- Remove ineffective vi.mock calls with wrong relative paths that did
  not intercept source module dependencies
- Replace Record<string, any> with Record<string, unknown> in priv()
  helper to fix no-explicit-any lint warnings

Review follow-up for #1702
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants