Skip to content

fix(acp): clear cronBusyGuard and status on sendMessage failure#1700

Merged
piorpua merged 7 commits intoiOfficeAI:mainfrom
JerryLiu369:fix/acp-timeout-no-finish-signal
Mar 26, 2026
Merged

fix(acp): clear cronBusyGuard and status on sendMessage failure#1700
piorpua merged 7 commits intoiOfficeAI:mainfrom
JerryLiu369:fix/acp-timeout-no-finish-signal

Conversation

@JerryLiu369
Copy link
Copy Markdown
Member

@JerryLiu369 JerryLiu369 commented Mar 25, 2026

Problem

When AcpAgent.sendMessage() fails (e.g. after a prompt timeout), AcpAgentManager never cleared cronBusyGuard or reset this.status. This left the conversation stuck in a perpetually "busy" state, causing:

  • Cron jobs to be skipped (cronBusyGuard thinks the conversation is still processing)
  • Subsequent user messages to not be processed

Fix

After agent.sendMessage() returns {success: false}, immediately clear cronBusyGuard and reset status to 'finished' at both call sites in AcpAgentManager.sendMessage().

Note: The finish-signal-on-timeout fix originally in this PR has been removed — PR #1709 covers that with a more complete session/cancel approach. This PR now focuses solely on the cronBusyGuard cleanup which #1709 does not address.

Test plan

  • After a prompt timeout, verify a new message can be sent without the conversation being stuck
  • Verify cron jobs resume normally after a timeout
  • Unit tests added: tests/unit/acpAgentTimeoutFinish.test.ts

🤖 Generated with Claude Code

@sentry
Copy link
Copy Markdown

sentry bot commented Mar 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@JerryLiu369 JerryLiu369 force-pushed the fix/acp-timeout-no-finish-signal branch from a405920 to 64feea7 Compare March 25, 2026 09:49
@JerryLiu369 JerryLiu369 changed the title fix(acp): emit finish signal and clear busy state on sendMessage failure fix(acp): clear cronBusyGuard and status on sendMessage failure Mar 25, 2026
@JerryLiu369 JerryLiu369 force-pushed the fix/acp-timeout-no-finish-signal branch from 7bffd5c to 456e511 Compare March 25, 2026 11:49
JerryLiu369 and others added 2 commits March 25, 2026 22:06
Remove finish signal emission (covered by PR iOfficeAI#1709) and keep only the
cronBusyGuard/status cleanup on sendMessage failure, which iOfficeAI#1709 does
not address.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@JerryLiu369 JerryLiu369 force-pushed the fix/acp-timeout-no-finish-signal branch from 456e511 to 54356eb Compare March 25, 2026 14:06
JerryLiu369 and others added 2 commits March 26, 2026 01:03
…p branches

Tests instantiate the real AcpAgentManager (with BaseAgentManager mocked to
avoid ForkTask process spawning) and verify that both sendMessage paths
(first-message and subsequent-message) correctly clear cronBusyGuard and set
status='finished' when agent.sendMessage() returns {success:false}.

Covers lines 645-648 and 657-660 added in the preceding commit to fix
the perpetual busy/running state on timeout.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Copy link
Copy Markdown

@JiwaniZakir JiwaniZakir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix covers the !result.success return path, but the diff shows a catch (e) block immediately after the second cleanup — it's not visible whether that handler also calls cronBusyGuard.setProcessing(this.conversation_id, false) and resets this.status. If agent.sendMessage throws rather than returning {success: false}, the guard could still end up stuck, which is the same class of bug this PR is fixing.

The two new blocks are also identical — cronBusyGuard.setProcessing(this.conversation_id, false); this.status = 'finished'; — and a small private helper (e.g. private clearBusyState()) would reduce the duplication and make future changes (like adding telemetry or logging on failure) easier to apply consistently.

On the test side, the "subsequent-message path" comment says it's "triggered when msg_id is absent," but the production code distinction appears to be about isFirstMessage, not msg_id presence — the comment could mislead someone maintaining these tests later. It would also strengthen coverage to add a test asserting that a thrown exception from agent.sendMessage (not just a {success: false} return) still results in setProcessing(false) being called, assuming the catch block is intended to do that.

- Replace three identical cronBusyGuard/status cleanup blocks with a
  private clearBusyState() helper method
- Fix misleading test comment: path is triggered by isFirstMessage,
  not msg_id absence
- Add two tests covering the thrown-exception (catch) path to ensure
  setProcessing(false) and status='finished' are called on throws
@JerryLiu369
Copy link
Copy Markdown
Member Author

Thanks for the detailed review, @JiwaniZakir!

On the catch block: The catch block already had cronBusyGuard.setProcessing(this.conversation_id, false) and this.status = 'finished' before this PR — it just wasn't visible in the diff since those lines weren't changed. The thrown-exception path was always covered.

On duplication: Good call. I've extracted a private clearBusyState() helper that consolidates all three cleanup sites (first-message path, subsequent-message path, and catch block). Much cleaner.

On the test comment: Fixed — updated to reference isFirstMessage instead of msg_id absence.

On throw-path test coverage: Added two new tests asserting that a thrown exception from agent.sendMessage also results in setProcessing(false) being called and status being reset to 'finished'.

@JiwaniZakir
Copy link
Copy Markdown

The fix looks correct — clearing cronBusyGuard and resetting status to 'finished' at both call sites is the right approach to unblock subsequent messages and cron execution after a failure. One thing worth double-checking: ensure the cleanup happens in a finally-style block or is guaranteed to run even if the sendMessage call throws rather than just returning {success: false}, so an unexpected exception path doesn't leave the same stuck state. Also confirm the unit tests cover the case where a second message is dispatched immediately after the failure to verify the guard is truly cleared before the next invocation.

@JerryLiu369
Copy link
Copy Markdown
Member Author

Good questions — here's the breakdown:

On the finally-style guarantee: The entire sendMessage body is wrapped in one try/catch. Any exception thrown anywhere inside — including from agent.sendMessage — falls into the catch (e) block, which calls clearBusyState() as its very first action (line 662), before any error-message construction or emitting. So the cleanup is unconditional on the exception path. The two if (!result.success) blocks handle the non-throwing failure path. All three exit paths call clearBusyState().

On second-message coverage: Added a new test (guard is cleared before second sendMessage so it can set busy again) that sends a first message that returns {success:false}, confirms the guard is cleared, then sends a second message and asserts that setProcessing(conv, true) is called again — proving the guard is fully reset before the next invocation.

@JiwaniZakir
Copy link
Copy Markdown

The cleanup logic looks correct — resetting cronBusyGuard and status on failure is the right place to do it since those two state values are what gate subsequent processing. One thing worth confirming: are both call sites handling the {success: false} path symmetrically, or is there any async timing difference between them that could cause a race if a cron job fires between the sendMessage return and the guard reset? Given that #1709 handles the session/cancel path separately, it's also worth documenting (even in a comment) that this fix assumes the agent itself is no longer processing when success: false is returned — otherwise resetting cronBusyGuard early could let a cron job overlap with a still-running agent.

@piorpua piorpua requested review from piorpua and removed request for piorpua March 26, 2026 02:13
@piorpua piorpua self-assigned this Mar 26, 2026
@piorpua
Copy link
Copy Markdown
Contributor

piorpua commented Mar 26, 2026

Code Review:fix(acp): clear cronBusyGuard and status on sendMessage failure (#1700)

变更概述

本 PR 修复了 AcpAgentManager.sendMessage() 在返回 {success: false}(如超时)时未清理 cronBusyGuardstatus 的问题,导致会话永久卡在忙碌状态。修复方式是提取 clearBusyState() 私有方法,并在两个消息路径的失败分支(以及已有的 catch 块)中调用它。同时新增了两个测试文件:acpAgentManagerCronGuard.test.ts(测试真实类)和 acpAgentTimeoutFinish.test.ts(测试手写 stub)。


方案评估

结论:✅ 方案合理

提取 clearBusyState() 消除了重复代码,两条消息路径(带 msg_id 的首条消息路径和后续消息路径)均已覆盖,catch 块也已统一。success: false 时流式输出尚未开始,因此立即清理不会引入竞态;即使出现双重清理(finish 事件 + 本次修复),setProcessing(false) 也是幂等的,不会有副作用。方案与已有架构一致,没有过度设计。


问题清单

🟡 MEDIUM — acpAgentTimeoutFinish.test.ts 测试的是手写 stub 而非真实类

文件tests/unit/acpAgentTimeoutFinish.test.ts,第 66–97 行

问题代码

// 用相对路径 mock 模块(实际上这些路径在 tests/unit/ 下根本不存在)
vi.mock('./BaseAgentManager', () => ({}));
vi.mock('./IpcAgentEventEmitter', () => ({ IpcAgentEventEmitter: class {} }));
vi.mock('./CronCommandDetector', () => ({ hasCronCommands: vi.fn(() => false) }));
...

// 然后测试一个手写的 makeSendMessageLogic stub,从未 import AcpAgentManager
function makeSendMessageLogic(conversationId: string) {
  // 手工复制了生产逻辑
}

问题说明:该文件的相对路径 mock(./BaseAgentManager 等)解析到 tests/unit/ 目录下,并不指向生产代码的模块(应为 @process/task/BaseAgentManager),所以这些 vi.mock 调用实际是空操作。文件从未导入 AcpAgentManager,而是通过 makeSendMessageLogic 测试一个手写 stub——本质上是在测试 stub 本身,而非生产代码。如果有人修改了 AcpAgentManager 的真实逻辑但忘记更新这里的 stub,所有测试仍然通过,产生错误的信心。acpAgentManagerCronGuard.test.ts 已经测试了真实类,建议移除 acpAgentTimeoutFinish.test.ts 或将其改为测试真实类。

修复建议:直接删除该文件,由 acpAgentManagerCronGuard.test.ts 提供覆盖;或者改写为导入真实 AcpAgentManager 类(参考 acpAgentManagerCronGuard.test.ts 的结构)。


🔵 LOW — 新测试文件中有大量 no-explicit-any lint 警告

文件tests/unit/acpAgentManagerCronGuard.test.ts(12 处)、tests/unit/acpAgentTimeoutFinish.test.ts(3 处)

问题代码示例acpAgentManagerCronGuard.test.ts:72):

constructor(_type: string, data: any, _emitter: any) {

以及(第 107、114、115、118 行):

backend: 'claude' as any,
(manager as any).agent = mockAgent;
(manager as any).bootstrap = Promise.resolve(mockAgent);
(manager as any).isFirstMessage = false;

问题说明:oxlint 在新增的测试文件中报告了 15 处 no-explicit-any WARNING(以及 2 处 no-extraneous-class)。测试文件通常对 any 有更宽松的惯例,但项目的 CI Code Quality 已启用此规则,这些警告是真实的 lint 违规。

修复建议:对于 mock 类型,使用 unknown 或更精确的类型替代 any;对于访问私有属性,可用类型转换辅助函数或使用 as unknown as TargetType;对于空类的 mock,改用 {} 直接赋值而非 class {}


汇总

# 严重级别 文件 问题
1 🟡 MEDIUM tests/unit/acpAgentTimeoutFinish.test.ts 测试手写 stub 而非真实 AcpAgentManager,存在错误的覆盖信心
2 🔵 LOW tests/unit/acpAgentManagerCronGuard.test.ts 15 处 no-explicit-any / no-extraneous-class lint 警告
3 🔵 LOW tests/unit/acpAgentTimeoutFinish.test.ts 3 处 no-explicit-any / no-extraneous-class lint 警告

结论

⚠️ 有条件批准 — 核心 bugfix 逻辑正确,clearBusyState() 的提取干净合理,acpAgentManagerCronGuard.test.ts 对真实类的测试质量良好。建议移除或修正 acpAgentTimeoutFinish.test.ts(MEDIUM 问题),并补充修复 lint 警告,之后可合并。


本报告由本地 pr-review skill 生成,包含完整项目上下文,无截断限制。

@piorpua piorpua merged commit 2f81e12 into iOfficeAI:main Mar 26, 2026
14 checks passed
@JerryLiu369
Copy link
Copy Markdown
Member Author

Good questions — here's the analysis:

On race conditions: There is no async gap. JS's single-threaded event loop means that from the point agent.sendMessage() returns to the point clearBusyState() is called, no other code (including cron schedulers) can execute. Both call sites are identical in this respect — one sync statement immediately follows the await.

On the assumption that {success:false} means the agent is done: Confirmed and now documented in a comment. Tracing the call chain:

  • On timeout, AcpConnection.handlePromptTimeout() sends session/cancel to the backend and rejects the pending promise.
  • AcpAgent.sendMessage() catches that rejection, emits a finish signal to reset the frontend, then returns {success:false} — it does not re-throw.
  • By the time AcpAgentManager sees success:false, the agent has already cancelled and signalled completion.

This means the catch block in AcpAgentManager only fires for lower-level errors that bypass AcpAgent's own error handling. Added comments at both {success:false} sites to document this invariant.

@JiwaniZakir
Copy link
Copy Markdown

The isChildAlive() approach is solid for detecting terminated processes, but the hung-process gap JiwaniZakir flagged is real — if the child is alive but wedged, exitCode === null and killed === false will both pass, so the keepalive timer never fires. Worth pairing this with a response deadline on the child side: if no output is received within some threshold (say, 90s into the 300s window), treat it as hung and force-kill, which will then trigger the clearBusyState() path cleanly. That way liveness and responsiveness are both covered rather than just process existence.

piorpua pushed a commit that referenced this pull request Mar 26, 2026
- Remove acpAgentTimeoutFinish.test.ts: tested a hand-written stub instead
  of the real AcpAgentManager class, providing false coverage confidence
- Fix no-explicit-any lint warnings in acpAgentManagerCronGuard.test.ts:
  replace `any` with `unknown`/precise types, use `as unknown as T` for
  private property access, replace empty `class {}` mock with `vi.fn()`

Review follow-up for #1700
kaizhou-lab pushed a commit that referenced this pull request Mar 26, 2026
- Remove acpAgentTimeoutFinish.test.ts: tested a hand-written stub instead
  of the real AcpAgentManager class, providing false coverage confidence
- Fix no-explicit-any lint warnings in acpAgentManagerCronGuard.test.ts:
  replace `any` with `unknown`/precise types, use `as unknown as T` for
  private property access, replace empty `class {}` mock with `vi.fn()`

Review follow-up for #1700
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants