feat(acp): add process-liveness keepalive to prevent false prompt timeouts#1702
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
b327cfc to
b20e857
Compare
…eouts The existing 300-second timeout for session/prompt requests resets only when a session.update notification arrives from the ACP bridge. When an agent runs a long, silent tool (e.g. a multi-minute build or test suite that produces no streaming output), no session.update is emitted and the timeout fires even though the agent is still actively working. Add a lightweight keepalive interval (every 60 s) that checks whether the child process is still alive. If it is, the prompt timeout timer is reset — the same operation that a streaming update would perform. The process-exit handler already covers the case where the child truly crashes, so the timeout is no longer needed as a liveness detector; it now serves purely as a stale-request safety net. Lifecycle: - Started in sendPrompt(), stopped in its finally block - Also stopped in disconnect() and handleProcessExit() for clean teardown Co-Authored-By: Claude Opus 4.6 <[email protected]>
Cover the keepalive lifecycle (start/stop/idempotent), verify it resets prompt timeouts only when the child process is alive and only for session/prompt requests, and check cleanup in disconnect/processExit. Co-Authored-By: Claude Opus 4.6 <[email protected]>
b20e857 to
b86e30e
Compare
The keepalive fires every 60 s and resets pending prompt timers while the child process is alive. The existing test configured a 60 s timeout then advanced timers to 61 s, but the keepalive reset the timer at exactly 60 s — so cancelPrompt was never called. Fix: mark child.killed = true at 59 s so the keepalive skips the reset, letting the original 60 s timeout fire as expected. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
JiwaniZakir
left a comment
There was a problem hiding this comment.
The liveness check in startPromptKeepalive() — this.child && !this.child.killed — has a subtle gap: ChildProcess.killed is only set to true when the process was terminated via Node's .kill() method. If the child exits naturally (e.g., crashes, returns exit code 1), killed remains false until the 'exit' event is processed and handleProcessExit runs. In that narrow window, the keepalive interval could fire, see killed === false, and call resetSessionPromptTimeouts() on a process that has already died — potentially extending timeouts right before the rejection cascade. A more reliable guard would be to also check this.child.exitCode !== null || this.child.signalCode !== null, or to maintain an explicit isChildAlive boolean that is set to false at the top of handleProcessExit (before stopPromptKeepalive is called).
Also worth noting: the setInterval handle is never .unref()'d, so it will hold the Node.js event loop open for the duration of any prompt. This is probably fine in production, but it can cause test runners to hang if a test fails before the finally block clears the interval — the fake timers in the test suite mitigate this currently, but it's fragile for any future tests that don't use vi.useFakeTimers().
…ted child ChildProcess.killed is only set to true when the process is terminated via Node's .kill() method. A child that crashes naturally (exit code 1, OOM, etc.) leaves killed=false until the exit event is processed by the JS event loop. In that narrow window the keepalive interval could fire, see killed===false, and call resetSessionPromptTimeouts() on a process that has already died. Fix: extract a private isChildAlive() helper that additionally checks exitCode and signalCode. These are set by the Node.js runtime as soon as the process exits — before any JS event handlers run — so they reliably detect a dead child even before handleProcessExit is called.
|
Thanks @JiwaniZakir — good catch on the I've added a private private isChildAlive(): boolean {
return (
this.child !== null &&
!this.child.killed &&
this.child.exitCode === null &&
this.child.signalCode === null
);
}
|
|
The approach makes sense — polling liveness every 60s is a reasonable interval given the 300s timeout window, giving you ~4 checks before a false fire. One thing worth considering: if |
A process that hangs (alive but making no progress) would cause the keepalive to reset timeouts indefinitely. Fix by tracking promptOriginTime on each PendingRequest (set at creation, never updated), and only resetting the timeout from the keepalive when Date.now() - promptOriginTime is still within the request's timeoutDuration budget. This means a hung process will time out after at most one full timeout window beyond the point where the wall-clock budget runs out, while legitimately long-running tools are still protected from false timeouts.
|
Good points — here's the breakdown: On timer leaks: All exit paths are covered. The keepalive is stopped in a On hung process / wall-clock cap: Added |
|
The keepalive interval approach makes sense here, but make sure |
|
Both points confirmed: On timer cleanup: On liveness check safety: |
Code Review:feat(acp): add process-liveness keepalive to prevent false prompt timeouts (#1702)变更概述此 PR 在 方案评估结论:✅ 方案合理 方案正确解决了 PR 描述的问题:现有超时仅在收到 问题清单🔵 LOW — 新增测试文件中
|
| # | 严重级别 | 文件 | 问题 |
|---|---|---|---|
| 1 | 🔵 LOW | tests/unit/acpConnectionKeepalive.test.ts:26-45 |
vi.mock 路径相对于测试文件解析,未能拦截源文件中的模块 |
| 2 | 🔵 LOW | tests/unit/acpConnectionKeepalive.test.ts:50-51 |
priv() 辅助函数引入 2 个 no-explicit-any lint 警告 |
结论
本报告由本地 pr-review skill 生成,包含完整项目上下文,无截断限制。
- Remove ineffective vi.mock calls with wrong relative paths that did not intercept source module dependencies - Replace Record<string, any> with Record<string, unknown> in priv() helper to fix no-explicit-any lint warnings Review follow-up for #1702
- Remove ineffective vi.mock calls with wrong relative paths that did not intercept source module dependencies - Replace Record<string, any> with Record<string, unknown> in priv() helper to fix no-explicit-any lint warnings Review follow-up for #1702
Summary
session/promptrequests that checks child process liveness and resets the 300s timeout timerLLM request timed out after 300 secondserrors when an ACP agent is executing a long-running silent tool (e.g. multi-minute build, test suite, or heavy computation that produces no streaming output)AcpConnection.ts; no external bridge changes neededMotivation
The existing timeout resets only on
session.updatenotifications from the ACP bridge. But when an agent runs a tool that produces no output for >5 minutes, no updates are emitted and the timeout fires — even though the agent process is alive and working. The child processexithandler already covers actual crashes, so the timeout was acting as a redundant (and overly aggressive) liveness detector.Design
The keepalive reuses the existing
resetSessionPromptTimeouts()method — the same one called on every streaming update — so the timeout semantics are unchanged: "300s since last sign of life". The only difference is that a live process now counts as a sign of life.Test plan
sleep 400)disconnect()stops the keepalive cleanly (no leaked intervals)🤖 Generated with Claude Code