fix(cron): eliminate double-announce and replace delivery polling with push-based flow#39089
🔒 Aisle Security Analysis
We found 2 potential security issue(s) in this PR:
1. 🟡 Tight-loop gateway fanout in descendant subagent wait can amplify failures (DoS)
Description
The new push-based descendant waiting loop in
Key points:
This is an availability risk (internal DoS) that can be triggered by common infra failure modes (gateway unavailable) and becomes worse with higher configured subagent limits or multiple concurrent cron sessions.
Vulnerable code:
    while (pendingRunIds.size > 0 && Date.now() < deadline) {
      const remainingMs = Math.max(1, deadline - Date.now());
      await Promise.allSettled(
        [...pendingRunIds].map((runId) =>
          callGateway({
            method: "agent.wait",
            params: { runId, timeoutMs: remainingMs },
            timeoutMs: remainingMs + 2_000,
          }).catch(() => undefined),
        ),
      );
      pendingRunIds = new Set<string>(getActiveRuns().map((e) => e.runId));
    }
Recommendation
Add bounded retry/backoff and concurrency limiting for the push-wait rounds, and avoid immediate re-tries when the gateway is unhealthy.
Recommended changes:
Example mitigation (sketch):
    import pLimit from "p-limit";
    const limit = pLimit(10); // or config-driven
    let backoffMs = 200;
    let consecutiveGatewayFailures = 0;
    while (pendingRunIds.size > 0 && Date.now() < deadline) {
      const remainingMs = Math.max(1, deadline - Date.now());
      const results = await Promise.allSettled(
        [...pendingRunIds].map((runId) =>
          limit(() =>
            callGateway({
              method: "agent.wait",
              params: { runId, timeoutMs: remainingMs },
              timeoutMs: remainingMs + 2_000,
            })
          )
        )
      );
      const anyRejected = results.some((r) => r.status === "rejected");
      if (anyRejected) {
        consecutiveGatewayFailures += 1;
        await new Promise((r) => setTimeout(r, backoffMs));
        backoffMs = Math.min(backoffMs * 2, 5_000);
        if (consecutiveGatewayFailures >= 3) {
          break; // or switch to slower poll mode
        }
      } else {
        consecutiveGatewayFailures = 0;
        backoffMs = 200;
      }
      pendingRunIds = new Set(getActiveRuns().map((e) => e.runId));
    }
Also consider enforcing a maximum descendant run count to wait on per request (even if spawning is limited elsewhere) to prevent registry corruption/misconfiguration from causing unbounded fanout.
2. 🟡 Cron delivery can be silently suppressed by misusing
| Property | Value |
|---|---|
| Severity | Medium |
| CWE | CWE-440 |
| Location | src/cron/isolated-agent/delivery-dispatch.ts:319-349 |
Description
In dispatchCronDelivery(), two early-return suppression paths now set deliveryAttempted = true even though no outbound delivery was attempted.
Because deliveryAttempted is used by the cron timer’s fallback guard (shouldEnqueueCronMainSummary) to decide whether to enqueue a system-event summary, this change can suppress the only remaining delivery mechanism in some cases.
Impact:
- When activeSubagentRuns > 0 after waiting, dispatch returns without sending any message, but deliveryAttempted=true prevents the timer from enqueueing the enqueueSystemEvent fallback.
- When the "stale interim" suppression triggers (descendants existed but no synthesized update arrived), dispatch returns without delivering, but again marks deliveryAttempted=true and suppresses the timer fallback.
- If a descendant run is buggy/malicious and never ends (or if descendant completion announcements cannot reach the user due to missing/invalid delivery origin), the user may receive no completion signal and no fallback summary.
Vulnerable logic (suppression paths marked as “attempted”):
    if (activeSubagentRuns > 0) {
      deliveryAttempted = true;
      return params.withRunSession({ status: "ok", summary, outputText, deliveryAttempted, ... });
    }
    if (hadDescendants && /* stale interim */) {
      deliveryAttempted = true;
      return params.withRunSession({ status: "ok", summary, outputText, deliveryAttempted, ... });
    }
Why this is availability-relevant:
- The timer fallback only triggers when deliveryAttempted !== true.
- Setting deliveryAttempted=true here changes the meaning from “an outbound send was attempted” to “we intentionally suppressed delivery”, which downstream logic does not distinguish.
This enables a notification DoS where cron runs can end with no user-visible message if descendants keep activeSubagentRuns > 0 (or keep the stale-interim condition true) long enough.
Recommendation
Treat “suppressed” as a distinct state from “attempted delivery”, so the timer can apply an appropriate fallback policy.
Recommended fixes (pick one):
- Introduce a separate flag (e.g. deliverySuppressed / suppressionReason) and keep deliveryAttempted reserved for true outbound attempts.

      // suppression path
      const deliverySuppressed = true;
      return params.withRunSession({
        status: "ok",
        summary,
        outputText,
        deliveryAttempted: false,
        deliverySuppressed,
        ...params.telemetry,
      });

  Then update the timer fallback logic to consider deliverySuppressed (e.g., enqueue a non-user-facing internal event, or enqueue a user-facing “still running” / “timed out waiting for descendants” message after a cap).
- Add a bounded suppression timeout: if descendants remain active beyond a maximum, enqueue a fallback system event (or schedule a follow-up dispatch) so the user receives some completion/timeout signal.
- If deliveryAttempted must remain true to prevent double-announces, add an explicit mechanism that guarantees a later delivery attempt (e.g., on descendant settle) and/or emits an internal alert when suppression happens for too long.
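The first and second options can be combined on the timer side roughly as follows. This is a sketch under assumptions, not the project's code: `deliverySuppressed`, `suppressedSinceMs`, `MAX_SUPPRESSION_MS`, `FallbackAction`, and `decideFallback` are hypothetical names, and the real guard (shouldEnqueueCronMainSummary) may take different inputs.

```typescript
// Hypothetical result shape; only deliveryAttempted appears in the PR itself.
type DeliveryOutcome = {
  deliveryAttempted: boolean; // an outbound send was actually tried
  deliverySuppressed?: boolean; // assumed flag: delivery was intentionally skipped
  suppressedSinceMs?: number; // assumed: timestamp when suppression started
};

type FallbackAction = "none" | "enqueue-summary" | "enqueue-timeout-notice";

// Assumed cap on how long suppression may hide the fallback (10 minutes).
const MAX_SUPPRESSION_MS = 10 * 60_000;

function decideFallback(outcome: DeliveryOutcome, nowMs: number): FallbackAction {
  // A real outbound attempt: never announce a second time.
  if (outcome.deliveryAttempted) return "none";

  if (outcome.deliverySuppressed) {
    // Suppressed: stay quiet until the cap expires, then tell the user something.
    const since = outcome.suppressedSinceMs ?? nowMs;
    return nowMs - since >= MAX_SUPPRESSION_MS ? "enqueue-timeout-notice" : "none";
  }

  // Neither attempted nor suppressed: the ordinary summary fallback applies.
  return "enqueue-summary";
}
```

Keeping the two flags separate lets the double-announce guard stay strict while still bounding how long a stuck descendant can silence the cron run.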
Analyzed PR: #39089 at commit cfe819e
Last updated on: 2026-03-07T21:06:02Z
Greptile Summary
This PR fixes a double-announce bug in cron job delivery where two early-return paths inside
Key changes:
Implementation detail: The fix correctly distinguishes between the
Confidence Score: 5/5
Last reviewed commit: bb765a3
bb765a3 to 5d9bbd1
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5d9bbd1390
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    let pendingRunIds = new Set<string>(initialActiveRuns.map((e) => e.runId));

    while (pendingRunIds.size > 0 && Date.now() < deadline) {
      const remainingMs = Math.max(1, deadline - Date.now());
Keep polling for late-starting descendants
When observedActiveDescendants is true but initialActiveRuns is empty (a race that happens when the parent emits a follow-up hint before the registry marks child runs active), pendingRunIds starts empty and this loop is skipped entirely. The new implementation then never re-checks listDescendantRunsForRequester for descendants that become active a moment later, so dispatchCronDelivery can proceed with stale/interim text instead of waiting for the real subagent result.
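One possible shape for a fix is a short discovery window that re-reads the registry before the push-based rounds begin. The sketch below is illustrative only: `WaitDeps`, `WaitOptions`, `waitForRun` (standing in for the agent.wait gateway call), `discoveryMs`, and `pollMs` are hypothetical names, and backoff after failed waits is deliberately left out.

```typescript
// Sketch only: the types and option names below are assumptions, not the PR's code.
type ActiveRun = { runId: string };

interface WaitDeps {
  getActiveRuns: () => ActiveRun[]; // registry snapshot of active descendant runs
  waitForRun: (runId: string, timeoutMs: number) => Promise<unknown>; // stands in for agent.wait
  sleep: (ms: number) => Promise<void>;
  now: () => number;
}

interface WaitOptions {
  observedActiveDescendants: boolean; // hint from the caller that children were seen
  timeoutMs: number; // overall budget
  discoveryMs: number; // assumed: how long to look for late-starting runs
  pollMs: number; // assumed: delay between discovery snapshots
}

async function waitForDescendants(deps: WaitDeps, opts: WaitOptions): Promise<string[]> {
  const start = deps.now();
  const deadline = start + opts.timeoutMs;
  const discoveryDeadline = Math.min(deadline, start + opts.discoveryMs);
  const snapshot = () => new Set<string>(deps.getActiveRuns().map((e) => e.runId));
  const waitedOn: string[] = [];

  let pending = snapshot();

  // Discovery window: the caller saw descendants but the first snapshot is empty,
  // so keep re-reading the registry for a bounded time instead of giving up.
  while (opts.observedActiveDescendants && pending.size === 0 && deps.now() < discoveryDeadline) {
    await deps.sleep(opts.pollMs);
    pending = snapshot();
  }

  // Push-based rounds: wait on every pending run, then refresh the snapshot.
  while (pending.size > 0 && deps.now() < deadline) {
    const remainingMs = Math.max(1, deadline - deps.now());
    const ids = Array.from(pending);
    await Promise.allSettled(ids.map((runId) => deps.waitForRun(runId, remainingMs)));
    waitedOn.push(...ids);
    pending = snapshot();
  }

  return waitedOn;
}
```

The discovery window only applies when the caller passes the hint, so runs with no descendants still return immediately.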
    callGateway<{ status?: string }>({
      method: "agent.wait",
      params: { runId, timeoutMs: remainingMs },
      timeoutMs: remainingMs + 2_000,
    }).catch(() => undefined),
Add backoff after agent.wait failures
This call path swallows agent.wait errors and immediately continues the outer loop; if the gateway is temporarily unavailable and callGateway fails fast, the same active run IDs are retried in a tight loop until deadline. That can create high CPU usage and rapid reconnect churn during outages, unlike the previous fixed polling delay behavior.
5d9bbd1 to 3ddd0a0
3ddd0a0 to c7a7241
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c7a7241ff6
    let pendingRunIds = new Set<string>(initialActiveRuns.map((e) => e.runId));

    while (pendingRunIds.size > 0 && Date.now() < deadline) {
Keep discovering descendants after an empty initial snapshot
When observedActiveDescendants is true but initialActiveRuns is empty (a common registry timing race), this code initializes pendingRunIds to an empty set and skips the wait loop entirely, so it never refreshes descendant activity before moving to grace-period transcript polling. That can return undefined/interim text even though descendants become active moments later, which regresses the old behavior that re-checked activity each cycle. Fresh evidence in this commit is that pendingRunIds is seeded once from initialActiveRuns and the while (pendingRunIds.size > 0 ...) guard blocks any later getActiveRuns() refresh when the first snapshot is empty.
    callGateway<{ status?: string }>({
      method: "agent.wait",
      params: { runId, timeoutMs: remainingMs },
      timeoutMs: remainingMs + 2_000,
    }).catch(() => undefined),
Back off after fast agent.wait gateway failures
The per-run agent.wait call now catches and suppresses gateway errors, but the outer loop has no delay path when those failures are immediate; if the gateway is temporarily down and callGateway fails fast, the same active run IDs are retried in a tight loop until deadline. This can create avoidable CPU/reconnect churn during outages. Fresh evidence in this commit is the new callGateway(...).catch(() => undefined) branch inside the retry loop without any sleep/backoff before the next iteration.
…h push-based flow
- Set deliveryAttempted=true in announce early-return paths (active-subagent suppression and stale-interim suppression) so the heartbeat timer no longer fires a redundant enqueueSystemEvent fallback (double-announce bug).
- Refactor waitForDescendantSubagentSummary to use event-based agent.wait RPC calls instead of a 500ms busy-poll loop. Each active descendant run is now awaited concurrently via Promise.allSettled, and only a short bounded grace period (5s) remains to capture the cron agent's post-orchestration synthesis. Eliminates O(n*timeoutMs/500ms) gateway calls and wasted wall-clock time.
- Add FAST_TEST_MODE (OPENCLAW_TEST_FAST=1) to subagent-followup.ts to keep the grace-period tests instant in CI.
- Add comprehensive tests for the new waitForDescendantSubagentSummary behaviour (push-based wait, error resilience, NO_REPLY handling, multi-descendant waits).
c7a7241 to cfe819e
Merged via squash. Merge commit: e554c59
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cfe819eaaa
    let pendingRunIds = new Set<string>(initialActiveRuns.map((e) => e.runId));

    while (pendingRunIds.size > 0 && Date.now() < deadline) {
Keep discovering descendants after empty initial snapshot
When observedActiveDescendants is true but initialActiveRuns is empty (a timing race where descendants are observed by the caller before the registry snapshot includes them), pendingRunIds is initialized empty and this loop never runs. That means getActiveRuns() is never refreshed before moving to grace-period transcript polling, so late-starting descendants can be missed and cron delivery can proceed with stale/interim text instead of waiting for the actual descendant completion.
      method: "agent.wait",
      params: { runId, timeoutMs: remainingMs },
      timeoutMs: remainingMs + 2_000,
    }).catch(() => undefined),
Add retry backoff after failed agent.wait calls
Each agent.wait call suppresses gateway errors via .catch(() => undefined), but the outer loop immediately retries the same active run IDs with no sleep. If the gateway is temporarily unavailable and failures are fast, this produces a tight retry loop until deadline, driving unnecessary CPU usage and connection churn in the cron worker.
…h push-based flow (openclaw#39089)
* fix(cron): eliminate double-announce and replace delivery polling with push-based flow
- Set deliveryAttempted=true in announce early-return paths (active-subagent suppression and stale-interim suppression) so the heartbeat timer no longer fires a redundant enqueueSystemEvent fallback (double-announce bug).
- Refactor waitForDescendantSubagentSummary to use event-based agent.wait RPC calls instead of a 500ms busy-poll loop. Each active descendant run is now awaited concurrently via Promise.allSettled, and only a short bounded grace period (5s) remains to capture the cron agent's post-orchestration synthesis. Eliminates O(n*timeoutMs/500ms) gateway calls and wasted wall-clock time.
- Add FAST_TEST_MODE (OPENCLAW_TEST_FAST=1) to subagent-followup.ts to keep the grace-period tests instant in CI.
- Add comprehensive tests for the new waitForDescendantSubagentSummary behaviour (push-based wait, error resilience, NO_REPLY handling, multi-descendant waits).
* fix: prep cron double-announce followup tests (openclaw#39089) (thanks @tyler6204)
…h push-based flow (openclaw#39089)
* fix(cron): eliminate double-announce and replace delivery polling with push-based flow
- Set deliveryAttempted=true in announce early-return paths (active-subagent suppression and stale-interim suppression) so the heartbeat timer no longer fires a redundant enqueueSystemEvent fallback (double-announce bug).
- Refactor waitForDescendantSubagentSummary to use event-based agent.wait RPC calls instead of a 500ms busy-poll loop. Each active descendant run is now awaited concurrently via Promise.allSettled, and only a short bounded grace period (5s) remains to capture the cron agent's post-orchestration synthesis. Eliminates O(n*timeoutMs/500ms) gateway calls and wasted wall-clock time.
- Add FAST_TEST_MODE (OPENCLAW_TEST_FAST=1) to subagent-followup.ts to keep the grace-period tests instant in CI.
- Add comprehensive tests for the new waitForDescendantSubagentSummary behaviour (push-based wait, error resilience, NO_REPLY handling, multi-descendant waits).
* fix: prep cron double-announce followup tests (openclaw#39089) (thanks @tyler6204)
(cherry picked from commit e554c59)
Summary
Two related cron delivery bugs fixed in one commit:
1. Double-announce guard (original fix)
Early-return paths inside deliverViaAnnounce returned without setting deliveryAttempted = true. The heartbeat timer would then see deliveryAttempted = false and fire a redundant enqueueSystemEvent fallback, delivering the same message twice.
Fix: Both early-return paths (active-subagent suppression and stale-interim suppression) now set deliveryAttempted = true before returning.
2. Replace 500ms polling loop with push-based agent.wait
waitForDescendantSubagentSummary in subagent-followup.ts previously polled every 500ms via countActiveDescendantRuns until all descendants finished, wasting tokens and wall-clock time on each cron run that spawns subagents.
Fix: The polling loop is replaced with concurrent agent.wait RPC calls (one per active descendant run) via Promise.allSettled. Iterations continue only if new descendants appear (e.g. spawned by first-level subagents). Only a short bounded grace period (5s, 200ms poll) remains to capture the cron agent's post-orchestration synthesis after descendants settle.
Key improvements:
- allSettled
- FAST_TEST_MODE shrinks the grace period to 50ms in tests
Tests
- waitForDescendantSubagentSummary test suite (8 tests) covering push-based waits, error handling, multi-descendant cases, NO_REPLY, and early-exit paths
Thanks @tyler6204
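The bounded grace period described in the summary might look roughly like the following. It is a sketch under stated assumptions rather than the merged implementation: `GraceDeps`, `readLatestText`, and `pollForSynthesis` are hypothetical names, the 10ms fast-mode poll interval is a guess (the summary only gives the 50ms grace period), and NO_REPLY handling is not modelled.

```typescript
// Sketch only: names and the fast-mode poll interval are assumptions.
interface GraceDeps {
  readLatestText: () => Promise<string | undefined>; // latest cron-agent reply, if any
  sleep: (ms: number) => Promise<void>;
  now: () => number;
}

async function pollForSynthesis(
  deps: GraceDeps,
  interimText: string | undefined,
  fastTestMode: boolean,
): Promise<string | undefined> {
  // 5s grace with a 200ms poll in production; 50ms grace when the fast test mode is on.
  const graceMs = fastTestMode ? 50 : 5_000;
  const pollMs = fastTestMode ? 10 : 200; // 10ms is an assumed value
  const deadline = deps.now() + graceMs;
  const interim = interimText?.trim();

  while (deps.now() < deadline) {
    const text = (await deps.readLatestText())?.trim();
    // Any non-empty text that differs from the interim reply counts as the synthesis.
    if (text && text !== interim) return text;
    await deps.sleep(pollMs);
  }

  // Grace period elapsed without a new reply; the caller decides what to deliver.
  return undefined;
}
```

Injecting `sleep` and `now` keeps the loop testable with a fake clock, which serves the same goal as FAST_TEST_MODE: grace-period tests that do not wait in real time.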