Skip to content

Gateway: track background task lifecycle#52518

Merged
mbelinky merged 5 commits intomainfrom
mariano/task-lifecycle-review
Mar 29, 2026
Merged

Gateway: track background task lifecycle#52518
mbelinky merged 5 commits intomainfrom
mariano/task-lifecycle-review

Conversation

@mbelinky
Copy link
Copy Markdown
Contributor

@mbelinky mbelinky commented Mar 22, 2026

Summary

  • Problem: background ACP/subagent work could keep state internally, leak raw child chatter into the user thread, or disappear without a durable answer to what happened?
  • Why it matters: let me know when you're done was not trustworthy. Parent sessions could get noisy, channels could miss the real outcome, and operators had to inspect logs/session state manually.
  • What changed: this PR makes background tasks first-class. It adds a durable task registry, explicit notify policy, cancel support, and a structural split between interactive child sessions and parent-owned background ACP runs.
  • Scope boundary: this is about background task lifecycle and delivery behavior. It does not add retries, a new UI, or a generic job queue.

Change Type

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related: RFC / maintainer discussion around reliable task lifecycle and ping me when you're done
  • Follow-up: message wording can still get smarter for successful ACP completions; this PR fixes routing/lifecycle first

User-visible / Behavior Changes

  • openclaw tasks list/show now gives a durable answer for ACP, subagent, and gateway background runs.
  • Background ACP runs are parent-owned by default: the child no longer writes raw setup/progress chatter directly into the user-facing channel.
  • Default delivery is now quiet terminal notify only: one concise done / failed / lost message when the task finishes.
  • Users can opt into concise state-change updates with openclaw tasks notify <id> state_changes.
  • Users can cancel running tasks with openclaw tasks cancel <id>.
  • The main agent can answer how's task X going? from registry state instead of scraping session output.

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? Yes
  • Data access scope changed? Yes
  • Risk + mitigation:
    • This adds operator-facing task inspection/cancel commands and persists task metadata under gateway state. It does not add new outbound network access or privilege escalation, and channel delivery now goes through the existing parent-thread routing path instead of raw child-session chatter.

Root Cause / Design

The old behavior mixed three responsibilities:

  1. child runtime state
  2. parent-session relay/system events
  3. user-channel delivery

That made background ACP sessions act partly like interactive child chats and partly like parent-owned tasks.

This PR fixes that structurally:

  • task state is durable and registry-owned
  • parent-owned background ACP sessions are explicitly marked as such
  • reply routing consults that mode and suppresses child direct-user delivery
  • user-channel delivery is owned by the task notifier, not by the child session

Repro + Verification

Environment

  • Development: isolated lifecycle branch + focused validation on mb-server
  • Runtime: Node 22 / built dist
  • Live channel verification: Quant -> Telegram -> Codex ACP on jpclawhq prod after deploying the validated runtime files

Focused automated coverage

Ran on mb-server:

  • src/tasks/task-registry.test.ts
  • src/commands/tasks.test.ts
  • src/acp/control-plane/manager.test.ts
  • src/agents/acp-spawn-parent-stream.test.ts
  • src/agents/acp-spawn.test.ts
  • src/auto-reply/reply/dispatch-acp-delivery.test.ts
  • src/auto-reply/reply/routing-policy.test.ts
  • src/cli/program/register.status-health-sessions.test.ts
  • src/cli/program/command-registry.test.ts
  • src/gateway/server-methods/agent.test.ts
  • pnpm build

What those tests cover

  • durable task creation/update for ACP, subagent, and gateway background runs
  • restore/reconcile/prune behavior
  • done_only, state_changes, and silent notify policies
  • terminal failure messages without leaking internal ACP chatter
  • background ACP progress staying off the foreground conversation lane
  • cancel support for ACP and subagent tasks
  • routing suppression for parent-owned background ACP children

Live verification

Personally verified:

  • ACP background task in a real Telegram thread now produces one terminal completion message instead of raw child chatter
  • a parallel foreground request in the same thread still gets answered cleanly while the background task runs
  • gateway-side task state for that run ended done / delivered

Evidence

  • Failing behavior reproduced before + passing behavior verified after
  • Focused regression tests
  • Real channel smoke verification
  • Perf numbers (not relevant)

Human Verification

Verified manually:

  • success path: background ACP task completes and notifies once
  • foreground/background separation: a normal foreground weather question remained clean while the background ACP task ran
  • operator inspection: the gateway task registry showed the expected task state and delivery result

Validated in tests:

  • failure path
  • stall/state-change updates
  • cancellation / kill path
  • routing suppression and task notify policy behavior

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery

  • Revert this PR and restart the gateway
  • If needed after rollback, move state/tasks/ aside to discard task-registry state
  • Symptoms to watch for:
    • duplicate terminal notifications
    • raw ACP child chatter leaking into end-user channels
    • missing terminal task notifications for parent-owned background runs

Risks and Mitigations

  • Risk: parent-owned and direct child ACP paths drift apart again.
    • Mitigation: routing is now centralized through the background-session interaction mode and shared routing policy.
  • Risk: task notifications become too noisy.
    • Mitigation: default policy is done_only; state-change updates are opt-in and use concise structured summaries instead of raw deltas.
  • Risk: task registry and backing runtime diverge after restart/partial failure.
    • Mitigation: restore/maintenance/reconcile paths keep the registry as a durable projection rather than a pure timeout guess.

@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime cli CLI command changes commands Command implementations agents Agent runtime and tooling size: XL maintainer Maintainer-authored PR labels Mar 22, 2026
@mbelinky mbelinky force-pushed the mariano/task-lifecycle-review branch from e50fcee to b9ccda0 Compare March 29, 2026 09:14
@mbelinky mbelinky marked this pull request as ready for review March 29, 2026 09:16
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 29, 2026

Greptile Summary

This PR makes background tasks first-class in the gateway by adding a durable task registry (TaskRecord + JSON persistence), explicit notify policies (done_only / state_changes / silent), cancel support, and a structural split between interactive ACP child sessions and parent-owned background ACP runs. It also centralises reply-routing decisions into a new resolveReplyRoutingDecision utility and adds four new openclaw tasks CLI subcommands (list, show, notify, cancel).

Key changes:

  • src/tasks/task-registry.ts — in-memory + on-disk registry; idempotent createTaskRecord, runId-indexed updates, automatic delivery via sendMessage or session event queue.
  • src/acp/session-interaction-mode.ts — classifies ACP sessions as interactive vs parent-owned-background; used to gate delivery suppression.
  • src/acp/control-plane/manager.core.ts — tracks parented direct ACP turns; progress summary accumulated from text_delta events and capped at 240 chars.
  • src/agents/acp-spawn.ts — run-mode spawns no longer use inline delivery; createTaskRecord called for both session and run mode.
  • src/agents/acp-spawn-parent-stream.ts — new surfaceUpdates flag suppresses parent-channel events for background sessions while still updating task state.
  • src/auto-reply/reply/routing-policy.ts — extracted routing helper; suppressDirectUserDelivery blocks content routing but does not yet suppress typing indicators.
  • src/commands/tasks.ts + CLI registration — new operator-facing task inspection and lifecycle management commands.

Issues found:

  • var restoreAttempted in task-registry.ts should be let (style).
  • emitStartNotice in the parent stream relay calls updateTaskStateByRunId("Started.") before createTaskRecord has run; the progress event is always silently dropped for spawn-path tasks.
  • shouldSuppressTyping is not set to true when suppressDirectUserDelivery is true, so background ACP children may emit typing indicators to the user channel even though message delivery is suppressed.
  • The tasks notify <lookup> <notify> CLI command does not validate that <notify> is one of the three accepted policy values.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style or minor logic gaps that don't break the primary task lifecycle path.

The core registry, delivery, cancel, and CLI paths are all covered by focused tests, live-verified, and structurally sound. The four issues found are all P2: one style inconsistency (var), one silent no-op (progress event ordering), one potential UX leak (typing indicator), and one missing CLI input guard. None affect data integrity, security, or primary task lifecycle outcomes.

src/agents/acp-spawn-parent-stream.ts and src/auto-reply/reply/routing-policy.ts are worth revisiting for the Started. event ordering and typing-indicator suppression gap respectively.

Important Files Changed

Filename Overview
src/tasks/task-registry.ts Core task registry: well-structured with idempotent create, index-based runId lookup, delivery guard, and persistence. One style issue: var restoreAttempted should be let.
src/tasks/task-registry.types.ts Clean type definitions for TaskRecord, TaskStatus, TaskNotifyPolicy, TaskDeliveryStatus. No issues.
src/tasks/task-registry.store.ts Thin persistence layer: serializes/deserializes TaskRecord map to JSON with a version guard. No issues.
src/tasks/task-registry.maintenance.ts Sweep/reconcile logic: marks orphaned active tasks as lost after grace period, prunes terminal tasks after 7 days. Logic is correct and guarded by hasBackingSession.
src/tasks/task-registry-delivery-runtime.ts Single re-export of sendMessage for dynamic import isolation. No issues.
src/acp/session-interaction-mode.ts New module: classifies ACP sessions as interactive vs parent-owned-background. Clean logic gated on acp.mode === oneshot and presence of a parent session key.
src/acp/control-plane/manager.core.ts Adds background task tracking to ACP turn execution: creates record at turn start, updates progress summary on each text delta, marks done/failed at turn completion.
src/agents/acp-spawn.ts Adds createTaskRecord for session and run mode ACP spawns. Inline delivery disabled for all run-mode spawns. Ordering issue: notifyStarted fires before createTaskRecord causing Started. event to always be a no-op.
src/agents/acp-spawn-parent-stream.ts Adds surfaceUpdates flag to suppress parent-channel event emission for background sessions. emitStartNotice update is always a no-op due to ordering (task not yet created).
src/agents/subagent-registry-lifecycle.ts Hooks subagent lifecycle completion and give-up paths into the task registry. Status mapping (ok→done, timeout→timed_out, error→failed) is correct.
src/agents/subagent-registry-run-manager.ts Creates task records for subagent runs at registration time. not_applicable delivery status when expectsCompletionMessage === false is correct.
src/auto-reply/reply/routing-policy.ts New module extracting routing decision logic. suppressDirectUserDelivery correctly zeroes shouldRouteToOriginating but does not propagate into shouldSuppressTyping, potentially allowing typing indicators to leak for suppressed sessions.
src/auto-reply/reply/dispatch-from-config.ts Refactors inline routing computation into resolveReplyRoutingDecision. Background ACP session suppression via isParentOwnedBackgroundAcpSession is cleanly integrated.
src/auto-reply/reply/dispatch-acp-delivery.ts Adds suppressUserDelivery gate in the deliver function. Text accumulation still works while actual channel delivery is blocked.
src/commands/tasks.ts New CLI commands: list, show, notify, cancel. Formatting is clear and JSON output is well-structured. Missing input validation for the notify policy argument.
src/cli/program/register.status-health-sessions.ts Registers tasks parent command and four subcommands. Parent-opts forwarding is correct. No validation on the notify free argument.
src/gateway/server-methods/agent.ts Creates background_cli task records for async gateway agent runs with deliveryStatus: not_applicable. Best-effort (errors swallowed), appropriate for non-blocking task tracking.

Comments Outside Diff (2)

  1. src/agents/acp-spawn-parent-stream.ts, line 517-523 (link)

    P2 "Started." task event is always a no-op for spawned tasks

    updateTaskStateByRunId is called here inside emitStartNotice, which is triggered by notifyStarted(). In acp-spawn.ts, parentRelay.notifyStarted() is called before either createTaskRecord call:

    parentRelay?.notifyStarted();        // ← fires emitStartNotice → updateTaskStateByRunId(runId, "Started.")
    try {
      createTaskRecord({ ...streamLogPath... });  // ← task doesn't exist yet
      ...
    }
    

    Since updateTaskStateByRunId only updates existing tasks (it looks up taskIdsByRunId.get(runId.trim()) which is empty), the "Started." progress event is silently dropped for every spawn-path task. The task lifecycle and terminal events are unaffected, but the recentEvents array will never include this entry for spawn-path tasks.

    Consider calling createTaskRecord before parentRelay.notifyStarted(), or buffering the "Started." event and replaying it once the task is registered.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/agents/acp-spawn-parent-stream.ts
    Line: 517-523
    
    Comment:
    **"Started." task event is always a no-op for spawned tasks**
    
    `updateTaskStateByRunId` is called here inside `emitStartNotice`, which is triggered by `notifyStarted()`. In `acp-spawn.ts`, `parentRelay.notifyStarted()` is called _before_ either `createTaskRecord` call:
    
    ```
    parentRelay?.notifyStarted();        // ← fires emitStartNotice → updateTaskStateByRunId(runId, "Started.")
    try {
      createTaskRecord({ ...streamLogPath... });  // ← task doesn't exist yet
      ...
    }
    ```
    
    Since `updateTaskStateByRunId` only updates _existing_ tasks (it looks up `taskIdsByRunId.get(runId.trim())` which is empty), the `"Started."` progress event is silently dropped for every spawn-path task. The task lifecycle and terminal events are unaffected, but the `recentEvents` array will never include this entry for spawn-path tasks.
    
    Consider calling `createTaskRecord` before `parentRelay.notifyStarted()`, or buffering the "Started." event and replaying it once the task is registered.
    
    How can I resolve this? If you propose a fix, please make it concise.
  2. src/cli/program/register.status-health-sessions.ts, line 1477-1492 (link)

    P2 No validation on the notify argument value

    The <notify> argument is cast directly to TaskNotifyPolicy without checking it against the valid set (done_only, state_changes, silent). Commander.js does not validate free-argument values, so passing an arbitrary string (e.g. openclaw tasks notify <id> typo) will silently store an invalid policy value in the registry.

    Consider validating before calling tasksNotifyCommand:

    const VALID_NOTIFY_POLICIES = ["done_only", "state_changes", "silent"] as const;
    if (!VALID_NOTIFY_POLICIES.includes(notify)) {
      console.error(`Invalid notify policy: ${notify}. Valid values: ${VALID_NOTIFY_POLICIES.join(", ")}`);
      process.exit(1);
    }
    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/cli/program/register.status-health-sessions.ts
    Line: 1477-1492
    
    Comment:
    **No validation on the `notify` argument value**
    
    The `<notify>` argument is cast directly to `TaskNotifyPolicy` without checking it against the valid set (`done_only`, `state_changes`, `silent`). Commander.js does not validate free-argument values, so passing an arbitrary string (e.g. `openclaw tasks notify <id> typo`) will silently store an invalid policy value in the registry.
    
    Consider validating before calling `tasksNotifyCommand`:
    
    ```ts
    const VALID_NOTIFY_POLICIES = ["done_only", "state_changes", "silent"] as const;
    if (!VALID_NOTIFY_POLICIES.includes(notify)) {
      console.error(`Invalid notify policy: ${notify}. Valid values: ${VALID_NOTIFY_POLICIES.join(", ")}`);
      process.exit(1);
    }
    ```
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/tasks/task-registry.ts
Line: 33

Comment:
**`var` should be `let`**

`restoreAttempted` is declared with `var` while every other module-level mutable is declared with `let`. This is an inconsistency that can be confusing (e.g. `var` hoists, `let` doesn't, even at module scope).

```suggestion
let restoreAttempted = false;
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/agents/acp-spawn-parent-stream.ts
Line: 517-523

Comment:
**"Started." task event is always a no-op for spawned tasks**

`updateTaskStateByRunId` is called here inside `emitStartNotice`, which is triggered by `notifyStarted()`. In `acp-spawn.ts`, `parentRelay.notifyStarted()` is called _before_ either `createTaskRecord` call:

```
parentRelay?.notifyStarted();        // ← fires emitStartNotice → updateTaskStateByRunId(runId, "Started.")
try {
  createTaskRecord({ ...streamLogPath... });  // ← task doesn't exist yet
  ...
}
```

Since `updateTaskStateByRunId` only updates _existing_ tasks (it looks up `taskIdsByRunId.get(runId.trim())` which is empty), the `"Started."` progress event is silently dropped for every spawn-path task. The task lifecycle and terminal events are unaffected, but the `recentEvents` array will never include this entry for spawn-path tasks.

Consider calling `createTaskRecord` before `parentRelay.notifyStarted()`, or buffering the "Started." event and replaying it once the task is registered.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/auto-reply/reply/routing-policy.ts
Line: 15-30

Comment:
**Typing indicator suppression not applied for suppressed-delivery background sessions**

When `suppressDirectUserDelivery` is `true`, `shouldRouteToOriginating` is correctly forced to `false`. However, `shouldSuppressTyping` is derived only from `shouldRouteToOriginating` and the originating channel:

```ts
shouldSuppressTyping: shouldRouteToOriginating || originatingChannel === INTERNAL_MESSAGE_CHANNEL,
```

This means for a parent-owned background ACP session (where `suppressDirectUserDelivery` is `true` but the channel is not `internal`), `shouldSuppressTyping` remains `false`. As confirmed by the routing-policy test:

```ts
// suppresses direct user delivery for parent-owned background ACP children
shouldSuppressTyping: false,  // ← typing is still enabled
```

If the typing policy allows indicators when `shouldSuppressTyping = false`, users could see a "typing…" indicator from the child session with no message to follow — since message delivery is suppressed via `suppressUserDelivery` in the ACP delivery coordinator.

Consider including `suppressDirectUserDelivery` in the typing suppression logic:

```ts
shouldSuppressTyping:
  params.suppressDirectUserDelivery ||
  shouldRouteToOriginating ||
  originatingChannel === INTERNAL_MESSAGE_CHANNEL,
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/cli/program/register.status-health-sessions.ts
Line: 1477-1492

Comment:
**No validation on the `notify` argument value**

The `<notify>` argument is cast directly to `TaskNotifyPolicy` without checking it against the valid set (`done_only`, `state_changes`, `silent`). Commander.js does not validate free-argument values, so passing an arbitrary string (e.g. `openclaw tasks notify <id> typo`) will silently store an invalid policy value in the registry.

Consider validating before calling `tasksNotifyCommand`:

```ts
const VALID_NOTIFY_POLICIES = ["done_only", "state_changes", "silent"] as const;
if (!VALID_NOTIFY_POLICIES.includes(notify)) {
  console.error(`Invalid notify policy: ${notify}. Valid values: ${VALID_NOTIFY_POLICIES.join(", ")}`);
  process.exit(1);
}
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Gateway: harden background task lifecycl..." | Re-trigger Greptile

@mbelinky
Copy link
Copy Markdown
Contributor Author

Addressed the core correctness risks in this PR without broadening it into the SQLite/pluggable-store follow-up.

What changed in the latest push:

  • fallback internal/session delivery no longer marks tasks as delivered; it now records session_queued
  • read-only inspection no longer mutates task state or emits user-facing notifications
  • operator cancellation now produces a terminal cancelled notification instead of going silent

Validated again on mb-server with src/tasks/task-registry.test.ts and pnpm build.

I kept the persistence backend JSON-backed on purpose for this PR so we can prove the lifecycle/delivery contract first. A SQLite/hooks abstraction still makes sense as a follow-up once maintainers are happy with the basic behavior.

@mbelinky mbelinky merged commit 17c36b5 into main Mar 29, 2026
9 checks passed
@mbelinky mbelinky deleted the mariano/task-lifecycle-review branch March 29, 2026 10:48
alexjiang1 pushed a commit to alexjiang1/openclaw that referenced this pull request Mar 31, 2026
Merged via squash.

Prepared head SHA: 7c45542
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky
livingghost pushed a commit to livingghost/openclaw that referenced this pull request Mar 31, 2026
Merged via squash.

Prepared head SHA: 7c45542
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky
pgondhi987 pushed a commit to pgondhi987/openclaw that referenced this pull request Mar 31, 2026
Merged via squash.

Prepared head SHA: 7c45542
Co-authored-by: mbelinky <[email protected]>
Co-authored-by: mbelinky <[email protected]>
Reviewed-by: @mbelinky
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling cli CLI command changes commands Command implementations gateway Gateway runtime maintainer Maintainer-authored PR size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant