
fix(agents): add drain deadline and reduce announce timeouts#10355

Closed
pycckuu wants to merge 3 commits into openclaw:main from pycckuu:pycckuu/fix-announce-drain-blocking

Conversation

@pycckuu
Contributor

@pycckuu pycckuu commented Feb 6, 2026

Summary

Prevents the gateway from becoming unresponsive when multiple sub-agents complete and try to announce results back to the main session (#10334). Four targeted changes:

  • Add a 120s deadline to the announce-queue drain loop to prevent unbounded blocking
  • Reduce the per-announce timeout from 60s to 30s for faster failure recovery
  • Discard remaining items on timeout to prevent an infinite reschedule loop
  • Add diagnostic logging when the drain times out

Changes

src/agents/subagent-announce-queue.ts

  • Add DRAIN_DEADLINE_MS = 120_000 constant and Date.now() < deadline check to the while loop in scheduleAnnounceDrain()
  • Log warning via defaultRuntime.error() when drain exits due to timeout
  • Clear remaining items, the dropped count, and summary lines after timeout so the finally block cannot reschedule a fresh drain with a new 120s window (an infinite loop)
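The deadline-and-discard behavior described above can be sketched as follows. This is a minimal stand-in, not the real module: the `AnnounceQueue` shape, the `send` callback, the injectable `now` clock, and the `drainWithDeadline` name are all invented for illustration, and the real `scheduleAnnounceDrain()` carries more state (debouncing, per-key scheduling) than shown here.

```typescript
// Hedged sketch of a drain loop with a hard deadline. Queue shape and
// helper names are illustrative, not the real scheduleAnnounceDrain().
const DRAIN_DEADLINE_MS = 120_000;

interface AnnounceQueue {
  items: string[];
  droppedCount: number;
  summaryLines: string[];
}

async function drainWithDeadline(
  queue: AnnounceQueue,
  send: (item: string) => Promise<void>,
  now: () => number = Date.now,
): Promise<number> {
  const deadline = now() + DRAIN_DEADLINE_MS;
  let sent = 0;
  while ((queue.items.length > 0 || queue.droppedCount > 0) && now() < deadline) {
    const item = queue.items.shift();
    if (item === undefined) {
      // Only dropped-count bookkeeping remains; clear it and loop out.
      queue.droppedCount = 0;
      continue;
    }
    await send(item);
    sent += 1;
  }
  if (queue.items.length > 0 || queue.droppedCount > 0) {
    // Discard leftovers so a finally-block reschedule cannot restart the
    // drain with a fresh 120s window (the infinite-loop hazard flagged
    // in review).
    console.error(
      `announce drain deadline exceeded; discarding ${queue.items.length} item(s)`,
    );
    queue.items.length = 0;
    queue.droppedCount = 0;
    queue.summaryLines.length = 0;
  }
  return sent;
}
```

With five queued items and a send that consumes 50s of (fake) clock per call, this sketch sends exactly three items and discards the remaining two, mirroring the "exactly 3 of 5" test case below.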

src/agents/subagent-announce.ts

  • Reduce timeoutMs from 60_000 to 30_000 in both sendAnnounce() and runSubagentAnnounceFlow() direct send path
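In the real code the 30s value is simply passed as `timeoutMs` to `callGateway`; the generic wrapper below is purely illustrative of what such a per-send cap does (`withTimeout` and `ANNOUNCE_TIMEOUT_MS` are invented names for this sketch).

```typescript
// Illustrative only: the real code forwards timeoutMs to callGateway.
// This wrapper just shows the shape of a per-send cap.
const ANNOUNCE_TIMEOUT_MS = 30_000; // reduced from 60_000

function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`announce timed out after ${ms}ms`)),
      ms,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

A 30s cap inside a 120s drain deadline means a wedged gateway costs at most four failed send attempts before the drain gives up, instead of blocking on a single 60s send for half the budget.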

src/agents/subagent-announce-queue.test.ts (new)

  • Happy path: all items drain when send completes within deadline
  • Deadline exceeded: exactly 3 of 5 items are processed; the rest are discarded
  • Error recovery: send throws, error logged, subsequent enqueue works
  • No reschedule after timeout: verifies items discarded, no further sends
  • Single item: boundary case drains cleanly

Test plan

  • New tests pass (5/5): pnpm test src/agents/subagent-announce-queue.test.ts
  • Existing announce format tests pass (11/11): pnpm test src/agents/subagent-announce.format.test.ts
  • Manual: spawn 3+ sub-agents concurrently, verify gateway remains responsive during announce phase

Closes #10334


@greptile-apps greptile-apps bot left a comment


2 files reviewed, 2 comments


Comment on lines +92 to 95
const DRAIN_DEADLINE_MS = 120_000;
const deadline = Date.now() + DRAIN_DEADLINE_MS;
while ((queue.items.length > 0 || queue.droppedCount > 0) && Date.now() < deadline) {
await waitForQueueDebounce(queue);

Deadline causes busy loop

When the drain hits the deadline, the finally block immediately calls scheduleAnnounceDrain(key) again if there are still queued/dropped items. Since each new drain recomputes deadline = Date.now() + 120_000, this never actually stops draining and can spin a tight loop if waitForQueueDebounce(queue) returns quickly (e.g., debounceMs=0). This defeats the goal of preventing unbounded blocking and can peg the event loop while the underlying send() is slow/failing.


Contributor Author

Addressed in 1df48c671 — after drain deadline, remaining items are now discarded (lines 159-162) instead of rescheduling, which prevents the infinite loop.

Comment on lines 76 to 80
// Advance time in steps so the async drain loop can interleave properly.
// Each step resolves one send (50s) and lets the loop iterate.
for (let step = 0; step < 6; step++) {
await vi.advanceTimersByTimeAsync(25_000);
}

Flaky fake-timer test

This test doesn't reliably allow the drain loop to reach the timeout path: send() awaits a 50s timer, but you only advance 25_000ms per step (total 150s). Depending on how many times the loop is awaiting the 50s timer vs. debounce, sendCount may end up being 0 or 5, making the >0 && <5 assertion flaky. Consider advancing enough time to deterministically complete N sends (e.g., advance 50s per expected send) and then advance past the 120s deadline, or explicitly await drain completion before asserting.
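The reviewer's deterministic-advancement suggestion can be illustrated without vitest using a hand-rolled fake clock. Everything here is invented for the sketch (`FakeClock`, `sleep`, `advance`, `demo`); the idea is the one from the comment: advance exactly 50s per expected send, so a 120s deadline deterministically admits three sends out of five.

```typescript
// Hand-rolled fake clock illustrating the deterministic-timing advice;
// FakeClock is an invented helper, not part of the vitest API.
class FakeClock {
  now = 0;
  private timers: Array<{ at: number; fire: () => void }> = [];

  sleep(ms: number): Promise<void> {
    return new Promise((resolve) =>
      this.timers.push({ at: this.now + ms, fire: resolve }),
    );
  }

  async advance(ms: number): Promise<void> {
    this.now += ms;
    const due = this.timers.filter((t) => t.at <= this.now);
    this.timers = this.timers.filter((t) => t.at > this.now);
    for (const t of due) t.fire();
    await Promise.resolve(); // let awaiting continuations run
  }
}

// A 5-item drain where each send waits a 50s timer under a 120s
// deadline: advancing 50s exactly three times yields exactly 3 sends.
async function demo(): Promise<number> {
  const clock = new FakeClock();
  let sendCount = 0;
  const drain = (async () => {
    for (let i = 0; i < 5; i++) {
      if (clock.now >= 120_000) break; // drain deadline check
      await clock.sleep(50_000);       // send() awaits a 50s timer
      sendCount += 1;
    }
  })();
  for (let step = 0; step < 3; step++) {
    await clock.advance(50_000); // one full send per step, no ambiguity
  }
  await drain;
  return sendCount;
}
```

Because each `advance` completes exactly one pending send before the next deadline check, the assertion can be an exact count rather than the flaky `>0 && <5` range.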


Contributor Author

Addressed in 1df48c671 and 6faafc4a1 — added vi.resetModules() for test isolation, tightened assertion to exact toHaveBeenCalledTimes(3), and added a dedicated "discards remaining items on timeout without rescheduling" test that verifies no reschedule fires after deadline.

@pycckuu
Contributor Author

pycckuu commented Feb 6, 2026

Pushed follow-up commit 1df48c671 addressing review feedback:

  1. Fixed infinite reschedule loop — after drain deadline, remaining items are now discarded instead of rescheduling a fresh drain with a new 120s window
  2. Improved test isolation — added vi.resetModules() in beforeEach to prevent cross-test state leakage
  3. Tightened test assertions — exact toHaveBeenCalledTimes(3) instead of loose range checks
  4. Added timeout budget comment — documents the relationship between 120s drain deadline and 30s per-send timeout
  5. Fixed test comments — accurate description of time advancement steps
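The timeout-budget relationship mentioned in item 4 reduces to simple division; the snippet below just restates that arithmetic (constant names mirror the PR's values, not necessarily the code's identifiers).

```typescript
// Worst case: every announce runs to its per-send timeout. The drain
// deadline still admits this many attempts before discarding the rest.
const DRAIN_DEADLINE_MS = 120_000;  // overall drain budget
const PER_SEND_TIMEOUT_MS = 30_000; // per-announce cap (was 60_000)
const worstCaseAttempts = Math.floor(DRAIN_DEADLINE_MS / PER_SEND_TIMEOUT_MS); // 4
```

So even if the gateway times out on every single send, at least four announces are attempted within the 120s window; with the old 60s timeout, only two would have been.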

When multiple sub-agents complete simultaneously, the announce queue
drain loop could block the main session lane indefinitely — each
announce triggers a full agent turn (deliver: true, expectFinal: true)
and they process sequentially with no overall time limit.

- Add 120s deadline to scheduleAnnounceDrain() while loop
- Reduce per-announce callGateway timeout from 60s to 30s
- Log diagnostic when drain times out with items remaining

Refs openclaw#10334

After drain deadline, clear remaining items instead of rescheduling
a fresh drain with a new 120s window. Also improve test isolation
with vi.resetModules() and tighten assertions to exact send count.

Refs openclaw#10334

- send throws: verifies error logging and queue recovery for subsequent enqueues
- timeout discards: confirms no reschedule loop after deadline (items cleared)
- single item: boundary case for minimal queue

Refs openclaw#10334
@pycckuu pycckuu force-pushed the pycckuu/fix-announce-drain-blocking branch from 6faafc4 to 43e2463 on February 9, 2026 14:09
@pycckuu pycckuu closed this Feb 10, 2026

Labels

agents Agent runtime and tooling


Development

Successfully merging this pull request may close these issues.

Sub-agent announce queue drain blocks main session lane, causing gateway unresponsiveness
