Gateway: track background task lifecycle#52518
Conversation
e50fcee to
b9ccda0
Compare
Greptile SummaryThis PR makes background tasks first-class in the gateway by adding a durable task registry ( Key changes:
Issues found:
Confidence Score: 5/5Safe to merge; all remaining findings are P2 style or minor logic gaps that don't break the primary task lifecycle path. The core registry, delivery, cancel, and CLI paths are all covered by focused tests, live-verified, and structurally sound. The four issues found are all P2: one style inconsistency (var), one silent no-op (progress event ordering), one potential UX leak (typing indicator), and one missing CLI input guard. None affect data integrity, security, or primary task lifecycle outcomes. src/agents/acp-spawn-parent-stream.ts and src/auto-reply/reply/routing-policy.ts are worth revisiting for the Started. event ordering and typing-indicator suppression gap respectively.
|
| Filename | Overview |
|---|---|
| src/tasks/task-registry.ts | Core task registry: well-structured with idempotent create, index-based runId lookup, delivery guard, and persistence. One style issue: var restoreAttempted should be let. |
| src/tasks/task-registry.types.ts | Clean type definitions for TaskRecord, TaskStatus, TaskNotifyPolicy, TaskDeliveryStatus. No issues. |
| src/tasks/task-registry.store.ts | Thin persistence layer: serializes/deserializes TaskRecord map to JSON with a version guard. No issues. |
| src/tasks/task-registry.maintenance.ts | Sweep/reconcile logic: marks orphaned active tasks as lost after grace period, prunes terminal tasks after 7 days. Logic is correct and guarded by hasBackingSession. |
| src/tasks/task-registry-delivery-runtime.ts | Single re-export of sendMessage for dynamic import isolation. No issues. |
| src/acp/session-interaction-mode.ts | New module: classifies ACP sessions as interactive vs parent-owned-background. Clean logic gated on acp.mode === oneshot and presence of a parent session key. |
| src/acp/control-plane/manager.core.ts | Adds background task tracking to ACP turn execution: creates record at turn start, updates progress summary on each text delta, marks done/failed at turn completion. |
| src/agents/acp-spawn.ts | Adds createTaskRecord for session and run mode ACP spawns. Inline delivery disabled for all run-mode spawns. Ordering issue: notifyStarted fires before createTaskRecord causing Started. event to always be a no-op. |
| src/agents/acp-spawn-parent-stream.ts | Adds surfaceUpdates flag to suppress parent-channel event emission for background sessions. emitStartNotice update is always a no-op due to ordering (task not yet created). |
| src/agents/subagent-registry-lifecycle.ts | Hooks subagent lifecycle completion and give-up paths into the task registry. Status mapping (ok→done, timeout→timed_out, error→failed) is correct. |
| src/agents/subagent-registry-run-manager.ts | Creates task records for subagent runs at registration time. not_applicable delivery status when expectsCompletionMessage === false is correct. |
| src/auto-reply/reply/routing-policy.ts | New module extracting routing decision logic. suppressDirectUserDelivery correctly zeroes shouldRouteToOriginating but does not propagate into shouldSuppressTyping, potentially allowing typing indicators to leak for suppressed sessions. |
| src/auto-reply/reply/dispatch-from-config.ts | Refactors inline routing computation into resolveReplyRoutingDecision. Background ACP session suppression via isParentOwnedBackgroundAcpSession is cleanly integrated. |
| src/auto-reply/reply/dispatch-acp-delivery.ts | Adds suppressUserDelivery gate in the deliver function. Text accumulation still works while actual channel delivery is blocked. |
| src/commands/tasks.ts | New CLI commands: list, show, notify, cancel. Formatting is clear and JSON output is well-structured. Missing input validation for the notify policy argument. |
| src/cli/program/register.status-health-sessions.ts | Registers tasks parent command and four subcommands. Parent-opts forwarding is correct. No validation on the notify free argument. |
| src/gateway/server-methods/agent.ts | Creates background_cli task records for async gateway agent runs with deliveryStatus: not_applicable. Best-effort (errors swallowed), appropriate for non-blocking task tracking. |
Comments Outside Diff (2)
-
src/agents/acp-spawn-parent-stream.ts, line 517-523 (link)"Started." task event is always a no-op for spawned tasks
updateTaskStateByRunIdis called here insideemitStartNotice, which is triggered bynotifyStarted(). Inacp-spawn.ts,parentRelay.notifyStarted()is called before eithercreateTaskRecordcall:parentRelay?.notifyStarted(); // ← fires emitStartNotice → updateTaskStateByRunId(runId, "Started.") try { createTaskRecord({ ...streamLogPath... }); // ← task doesn't exist yet ... }Since
updateTaskStateByRunIdonly updates existing tasks (it looks uptaskIdsByRunId.get(runId.trim())which is empty), the"Started."progress event is silently dropped for every spawn-path task. The task lifecycle and terminal events are unaffected, but therecentEventsarray will never include this entry for spawn-path tasks.Consider calling
createTaskRecordbeforeparentRelay.notifyStarted(), or buffering the "Started." event and replaying it once the task is registered.Prompt To Fix With AI
This is a comment left during a code review. Path: src/agents/acp-spawn-parent-stream.ts Line: 517-523 Comment: **"Started." task event is always a no-op for spawned tasks** `updateTaskStateByRunId` is called here inside `emitStartNotice`, which is triggered by `notifyStarted()`. In `acp-spawn.ts`, `parentRelay.notifyStarted()` is called _before_ either `createTaskRecord` call: ``` parentRelay?.notifyStarted(); // ← fires emitStartNotice → updateTaskStateByRunId(runId, "Started.") try { createTaskRecord({ ...streamLogPath... }); // ← task doesn't exist yet ... } ``` Since `updateTaskStateByRunId` only updates _existing_ tasks (it looks up `taskIdsByRunId.get(runId.trim())` which is empty), the `"Started."` progress event is silently dropped for every spawn-path task. The task lifecycle and terminal events are unaffected, but the `recentEvents` array will never include this entry for spawn-path tasks. Consider calling `createTaskRecord` before `parentRelay.notifyStarted()`, or buffering the "Started." event and replaying it once the task is registered. How can I resolve this? If you propose a fix, please make it concise.
-
src/cli/program/register.status-health-sessions.ts, line 1477-1492 (link)No validation on the
notifyargument valueThe
<notify>argument is cast directly toTaskNotifyPolicywithout checking it against the valid set (done_only,state_changes,silent). Commander.js does not validate free-argument values, so passing an arbitrary string (e.g.openclaw tasks notify <id> typo) will silently store an invalid policy value in the registry.Consider validating before calling
tasksNotifyCommand:const VALID_NOTIFY_POLICIES = ["done_only", "state_changes", "silent"] as const; if (!VALID_NOTIFY_POLICIES.includes(notify)) { console.error(`Invalid notify policy: ${notify}. Valid values: ${VALID_NOTIFY_POLICIES.join(", ")}`); process.exit(1); }
Prompt To Fix With AI
This is a comment left during a code review. Path: src/cli/program/register.status-health-sessions.ts Line: 1477-1492 Comment: **No validation on the `notify` argument value** The `<notify>` argument is cast directly to `TaskNotifyPolicy` without checking it against the valid set (`done_only`, `state_changes`, `silent`). Commander.js does not validate free-argument values, so passing an arbitrary string (e.g. `openclaw tasks notify <id> typo`) will silently store an invalid policy value in the registry. Consider validating before calling `tasksNotifyCommand`: ```ts const VALID_NOTIFY_POLICIES = ["done_only", "state_changes", "silent"] as const; if (!VALID_NOTIFY_POLICIES.includes(notify)) { console.error(`Invalid notify policy: ${notify}. Valid values: ${VALID_NOTIFY_POLICIES.join(", ")}`); process.exit(1); } ``` How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/tasks/task-registry.ts
Line: 33
Comment:
**`var` should be `let`**
`restoreAttempted` is declared with `var` while every other module-level mutable is declared with `let`. This is an inconsistency that can be confusing (e.g. `var` hoists, `let` doesn't, even at module scope).
```suggestion
let restoreAttempted = false;
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/agents/acp-spawn-parent-stream.ts
Line: 517-523
Comment:
**"Started." task event is always a no-op for spawned tasks**
`updateTaskStateByRunId` is called here inside `emitStartNotice`, which is triggered by `notifyStarted()`. In `acp-spawn.ts`, `parentRelay.notifyStarted()` is called _before_ either `createTaskRecord` call:
```
parentRelay?.notifyStarted(); // ← fires emitStartNotice → updateTaskStateByRunId(runId, "Started.")
try {
createTaskRecord({ ...streamLogPath... }); // ← task doesn't exist yet
...
}
```
Since `updateTaskStateByRunId` only updates _existing_ tasks (it looks up `taskIdsByRunId.get(runId.trim())` which is empty), the `"Started."` progress event is silently dropped for every spawn-path task. The task lifecycle and terminal events are unaffected, but the `recentEvents` array will never include this entry for spawn-path tasks.
Consider calling `createTaskRecord` before `parentRelay.notifyStarted()`, or buffering the "Started." event and replaying it once the task is registered.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/auto-reply/reply/routing-policy.ts
Line: 15-30
Comment:
**Typing indicator suppression not applied for suppressed-delivery background sessions**
When `suppressDirectUserDelivery` is `true`, `shouldRouteToOriginating` is correctly forced to `false`. However, `shouldSuppressTyping` is derived only from `shouldRouteToOriginating` and the originating channel:
```ts
shouldSuppressTyping: shouldRouteToOriginating || originatingChannel === INTERNAL_MESSAGE_CHANNEL,
```
This means for a parent-owned background ACP session (where `suppressDirectUserDelivery` is `true` but the channel is not `internal`), `shouldSuppressTyping` remains `false`. As confirmed by the routing-policy test:
```ts
// suppresses direct user delivery for parent-owned background ACP children
shouldSuppressTyping: false, // ← typing is still enabled
```
If the typing policy allows indicators when `shouldSuppressTyping = false`, users could see a "typing…" indicator from the child session with no message to follow — since message delivery is suppressed via `suppressUserDelivery` in the ACP delivery coordinator.
Consider including `suppressDirectUserDelivery` in the typing suppression logic:
```ts
shouldSuppressTyping:
params.suppressDirectUserDelivery ||
shouldRouteToOriginating ||
originatingChannel === INTERNAL_MESSAGE_CHANNEL,
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/cli/program/register.status-health-sessions.ts
Line: 1477-1492
Comment:
**No validation on the `notify` argument value**
The `<notify>` argument is cast directly to `TaskNotifyPolicy` without checking it against the valid set (`done_only`, `state_changes`, `silent`). Commander.js does not validate free-argument values, so passing an arbitrary string (e.g. `openclaw tasks notify <id> typo`) will silently store an invalid policy value in the registry.
Consider validating before calling `tasksNotifyCommand`:
```ts
const VALID_NOTIFY_POLICIES = ["done_only", "state_changes", "silent"] as const;
if (!VALID_NOTIFY_POLICIES.includes(notify)) {
console.error(`Invalid notify policy: ${notify}. Valid values: ${VALID_NOTIFY_POLICIES.join(", ")}`);
process.exit(1);
}
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "Gateway: harden background task lifecycl..." | Re-trigger Greptile
|
Addressed the core correctness risks in this PR without broadening it into the SQLite/pluggable-store follow-up. What changed in the latest push:
Validated again on I kept the persistence backend JSON-backed on purpose for this PR so we can prove the lifecycle/delivery contract first. A SQLite/hooks abstraction still makes sense as a follow-up once maintainers are happy with the basic behavior. |
Merged via squash. Prepared head SHA: 7c45542 Co-authored-by: mbelinky <[email protected]> Co-authored-by: mbelinky <[email protected]> Reviewed-by: @mbelinky
Merged via squash. Prepared head SHA: 7c45542 Co-authored-by: mbelinky <[email protected]> Co-authored-by: mbelinky <[email protected]> Reviewed-by: @mbelinky
Merged via squash. Prepared head SHA: 7c45542 Co-authored-by: mbelinky <[email protected]> Co-authored-by: mbelinky <[email protected]> Reviewed-by: @mbelinky
Summary
what happened?let me know when you're donewas not trustworthy. Parent sessions could get noisy, channels could miss the real outcome, and operators had to inspect logs/session state manually.Change Type
Scope
Linked Issue/PR
ping me when you're doneUser-visible / Behavior Changes
openclaw tasks list/shownow gives a durable answer for ACP, subagent, and gateway background runs.done/failed/lostmessage when the task finishes.openclaw tasks notify <id> state_changes.openclaw tasks cancel <id>.how's task X going?from registry state instead of scraping session output.Security Impact
Root Cause / Design
The old behavior mixed three responsibilities:
That made background ACP sessions act partly like interactive child chats and partly like parent-owned tasks.
This PR fixes that structurally:
Repro + Verification
Environment
mb-serverdistjpclawhqprod after deploying the validated runtime filesFocused automated coverage
Ran on
mb-server:src/tasks/task-registry.test.tssrc/commands/tasks.test.tssrc/acp/control-plane/manager.test.tssrc/agents/acp-spawn-parent-stream.test.tssrc/agents/acp-spawn.test.tssrc/auto-reply/reply/dispatch-acp-delivery.test.tssrc/auto-reply/reply/routing-policy.test.tssrc/cli/program/register.status-health-sessions.test.tssrc/cli/program/command-registry.test.tssrc/gateway/server-methods/agent.test.tspnpm buildWhat those tests cover
done_only,state_changes, andsilentnotify policiesLive verification
Personally verified:
done/deliveredEvidence
Human Verification
Verified manually:
Validated in tests:
Compatibility / Migration
Failure Recovery
state/tasks/aside to discard task-registry stateRisks and Mitigations
done_only; state-change updates are opt-in and use concise structured summaries instead of raw deltas.