feat(orchestration): Phase 5 — Aggregator + resume/retry (#1240) #1258

Merged: bug-ops merged 2 commits into `main` from `orchestration-aggregator` on Mar 6, 2026.

@bug-ops (Owner) commented Mar 6, 2026

## Summary

  • Add Aggregator trait and LlmAggregator: synthesizes completed task outputs into a single coherent response via LLM call with per-task character budget, ContentSanitizer spotlighting, skipped-task descriptions, and raw-concatenation fallback on LLM failure
  • Add DagScheduler::resume_from(): resumes Paused/Failed graphs, reconstructs running-task map, sets status to Running
  • Add dag::reset_for_retry(): BFS reset of Failed→Ready and Skipped/Canceled→Pending for targeted graph re-runs
  • Extend PlanCommand with Resume(Option<String>) and Retry(Option<String>) variants (with graph-id validation)
  • Wire /plan resume and /plan retry in agent loop; call aggregator on graph completion
  • Add aggregator_max_tokens to OrchestrationConfig (default: 4096)
  • +8 tests: resume_from() coverage (6) and reset_for_retry() with Canceled handling (2)
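The aggregation flow described above can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the crate's code: `TaskOutput`, `SketchAggregator`, and `spotlight` are hypothetical names, the LLM call is modeled as a closure returning `Result<String, String>`, and the `<task-output>` fencing only stands in for what `ContentSanitizer` spotlighting does. The per-task budget follows the PR's formula (`aggregator_max_tokens / num_completed_tasks`) and is applied as a character count.

```rust
/// A completed task's title and raw output (hypothetical shape for this sketch).
struct TaskOutput {
    title: String,
    output: String,
}

/// Stand-in for `ContentSanitizer` spotlighting: fence untrusted text in delimiters.
fn spotlight(text: &str) -> String {
    format!("<task-output>\n{text}\n</task-output>")
}

/// Assumed trait shape; the real `Aggregator` signature may differ (e.g. be async).
trait Aggregator {
    fn aggregate(&self, completed: &[TaskOutput], skipped: &[String]) -> String;
}

/// `llm` models a fallible prompt -> completion call.
struct SketchAggregator<F: Fn(&str) -> Result<String, String>> {
    max_tokens: usize, // mirrors `aggregator_max_tokens` (default 4096)
    llm: F,
}

impl<F: Fn(&str) -> Result<String, String>> SketchAggregator<F> {
    /// Per-task budget = max_tokens / completed tasks (guarding against zero).
    fn per_task_budget(&self, completed: usize) -> usize {
        self.max_tokens / completed.max(1)
    }

    /// Truncate to `budget` characters (not bytes, so UTF-8 stays valid), then spotlight.
    fn prepare(output: &str, budget: usize) -> String {
        let truncated: String = output.chars().take(budget).collect();
        spotlight(&truncated)
    }

    /// Raw concatenation used when the LLM call fails; outputs are still sanitized.
    fn build_fallback(&self, completed: &[TaskOutput]) -> String {
        let budget = self.per_task_budget(completed.len());
        completed
            .iter()
            .map(|t| format!("## {}\n{}", t.title, Self::prepare(&t.output, budget)))
            .collect::<Vec<_>>()
            .join("\n\n")
    }
}

impl<F: Fn(&str) -> Result<String, String>> Aggregator for SketchAggregator<F> {
    fn aggregate(&self, completed: &[TaskOutput], skipped: &[String]) -> String {
        let budget = self.per_task_budget(completed.len());
        let mut prompt = String::from("Synthesize these task results into one answer.\n");
        for t in completed {
            prompt.push_str(&format!("\n## {}\n{}\n", t.title, Self::prepare(&t.output, budget)));
        }
        if !skipped.is_empty() {
            // Describe skipped tasks so the synthesis can mention the gaps.
            prompt.push_str("\nSkipped tasks:\n");
            for s in skipped {
                prompt.push_str(&format!("- {s}\n"));
            }
        }
        match (self.llm)(&prompt) {
            Ok(answer) => answer,
            Err(_) => self.build_fallback(completed),
        }
    }
}

fn main() {
    let tasks = vec![
        TaskOutput { title: "fetch".into(), output: "a".repeat(5000) },
        TaskOutput { title: "summarize".into(), output: "done".into() },
    ];
    let failing = SketchAggregator { max_tokens: 4096, llm: |_: &str| Err::<String, String>("timeout".into()) };
    // LLM failure -> fallback; each output is capped at 4096 / 2 = 2048 characters.
    println!("{}", failing.aggregate(&tasks, &["deploy".to_string()]).len());
}
```

Keeping the fallback behind the same `prepare` step means a failed LLM call never returns unsanitized task output, which is the property SEC-P5-02 in the test plan checks.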

## Test plan

  • cargo +nightly fmt --check — clean
  • cargo clippy --workspace --features full -- -D warnings — clean
  • cargo nextest run --workspace --features full --lib --bins — 4297 pass (1 pre-existing failure in bootstrap::tests::create_skill_matcher_when_semantic_disabled on VersionMissing(23), present on main before this branch)
  • Security: build_fallback() sanitizes via ContentSanitizer (SEC-P5-02)
  • resume_from() reconstructs running map from Running tasks in graph (IC1)
  • reset_for_retry() resets Canceled→Pending in addition to Failed→Ready (IC2)
  • handle_plan_retry counts failed tasks before reset, not after (IC3)
  • Docs updated: task-orchestration.md, configuration.md, README.md, crates/zeph-core/README.md
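The IC1 behavior of `resume_from()` can be illustrated with a small model. The sketch assumes a flat task list, models `resume_from()` as a constructor that takes ownership of the graph, and uses a `HashMap<usize, String>` whose values are placeholders for whatever per-task bookkeeping the real `DagScheduler` keeps; `Scheduler`, `TaskGraph`, and `Task` are hypothetical names.

```rust
use std::collections::HashMap;

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GraphStatus { Running, Paused, Failed, Completed }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskStatus { Pending, Ready, Running, Completed, Failed }

struct Task { id: usize, status: TaskStatus }

struct TaskGraph { status: GraphStatus, tasks: Vec<Task> }

/// Hypothetical stand-in for `DagScheduler`.
struct Scheduler {
    graph: TaskGraph,
    /// task id -> placeholder for per-task bookkeeping (handle, agent name, ...).
    running: HashMap<usize, String>,
}

impl Scheduler {
    /// Resume a `Paused` or `Failed` graph; any other status is rejected.
    fn resume_from(mut graph: TaskGraph) -> Result<Self, String> {
        if !matches!(graph.status, GraphStatus::Paused | GraphStatus::Failed) {
            return Err(format!("cannot resume a graph in {:?} state", graph.status));
        }
        // Rebuild the running map from tasks still marked Running (IC1).
        let running = graph
            .tasks
            .iter()
            .filter(|t| t.status == TaskStatus::Running)
            .map(|t| (t.id, format!("task-{}", t.id)))
            .collect();
        graph.status = GraphStatus::Running;
        Ok(Self { graph, running })
    }
}

fn main() {
    let graph = TaskGraph {
        status: GraphStatus::Paused,
        tasks: vec![
            Task { id: 0, status: TaskStatus::Completed },
            Task { id: 1, status: TaskStatus::Running },
            Task { id: 2, status: TaskStatus::Pending },
        ],
    };
    let sched = Scheduler::resume_from(graph).expect("paused graphs are resumable");
    println!("{:?}, {} running", sched.graph.status, sched.running.len());
}
```

Rebuilding the map from the graph itself, rather than trusting any state kept from before the pause, keeps the scheduler's view consistent with the task statuses it will act on.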

Closes #1240. Part of epic #1235.

Add result aggregation and graph lifecycle management for the task
orchestration system.

### New
- `aggregator.rs`: `Aggregator` trait + `LlmAggregator` implementation.
  Synthesizes completed task outputs via a single LLM call with per-task
  character budget (`aggregator_max_tokens / num_completed_tasks`),
  `ContentSanitizer` spotlighting on all task outputs, skipped-task
  descriptions, and raw-concatenation fallback on LLM failure.
- `dag::reset_for_retry()`: BFS-based reset — `Failed`→`Ready`,
  `Skipped`/`Canceled`→`Pending`. Enables targeted re-runs without
  discarding completed work.
- `DagScheduler::resume_from()`: accepts `Paused`/`Failed` graphs,
  reconstructs `running` HashMap from in-flight tasks, sets
  `graph.status = Running`.
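A BFS reset with the transitions listed above might look like the following. Assumptions: dependencies are stored as a `dependents` adjacency list, and the BFS is seeded from every `Failed` and `Canceled` task so that siblings canceled alongside a failure are also reset (the IC2 case), while downstream `Skipped` tasks are reached through the traversal. `Completed` tasks are never touched. The sketch also returns the number of `Failed` tasks counted before any mutation, to illustrate the IC3 ordering in `handle_plan_retry`; the real function's signature may differ.

```rust
use std::collections::{HashSet, VecDeque};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskStatus { Pending, Ready, Completed, Failed, Skipped, Canceled }

/// Reset a graph for retry; `dependents[i]` lists the tasks that depend on task `i`.
/// Returns how many tasks were `Failed` before the reset.
fn reset_for_retry(statuses: &mut [TaskStatus], dependents: &[Vec<usize>]) -> usize {
    // Count before mutating, otherwise the answer is always zero (the IC3 pitfall).
    let failed = statuses.iter().filter(|s| **s == TaskStatus::Failed).count();

    // Seed the BFS with every Failed and Canceled task.
    let mut queue: VecDeque<usize> = statuses
        .iter()
        .enumerate()
        .filter(|(_, s)| matches!(**s, TaskStatus::Failed | TaskStatus::Canceled))
        .map(|(i, _)| i)
        .collect();
    let mut seen: HashSet<usize> = queue.iter().copied().collect();

    while let Some(i) = queue.pop_front() {
        statuses[i] = match statuses[i] {
            TaskStatus::Failed => TaskStatus::Ready,
            TaskStatus::Skipped | TaskStatus::Canceled => TaskStatus::Pending,
            other => other, // Completed, Pending, and Ready tasks are left unchanged
        };
        // Walk downstream so Skipped dependents are reset too.
        for &d in &dependents[i] {
            if seen.insert(d) {
                queue.push_back(d);
            }
        }
    }
    failed
}

fn main() {
    // Edges: 0 -> 1 -> 2 -> 3 and 0 -> 4; task 5 is independent.
    let mut statuses = vec![
        TaskStatus::Completed, TaskStatus::Failed, TaskStatus::Skipped,
        TaskStatus::Skipped, TaskStatus::Canceled, TaskStatus::Completed,
    ];
    let dependents: Vec<Vec<usize>> = vec![vec![1, 4], vec![2], vec![3], vec![], vec![], vec![]];
    let failed = reset_for_retry(&mut statuses, &dependents);
    // prints: 1 failed before reset; now [Completed, Ready, Pending, Pending, Pending, Completed]
    println!("{failed} failed before reset; now {statuses:?}");
}
```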

### Extended
- `PlanCommand`: add `Resume(Option<String>)` and `Retry(Option<String>)`
  variants with graph-id validation.
- `OrchestrationConfig`: add `aggregator_max_tokens` field (default 4096).
- Agent loop: wire `handle_plan_resume`, `handle_plan_retry`, call
  aggregator on graph completion.
- Dropped-event guard now logs at `error!` level (PERF-SCHED-02).
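Parsing the two new `PlanCommand` variants could look roughly like this. The PR does not describe the actual graph-id validation rule, so `is_valid_graph_id` below uses a hypothetical rule (1 to 64 ASCII hex digits or hyphens); `parse_plan_command` is likewise an illustrative name, and the other `PlanCommand` variants are omitted. An omitted id yields `None`.

```rust
/// Only the two variants added in this PR; the real enum has more.
#[derive(Debug, PartialEq, Eq)]
enum PlanCommand {
    Resume(Option<String>),
    Retry(Option<String>),
}

/// Hypothetical validation rule: 1..=64 ASCII hex digits or hyphens (UUID-like).
fn is_valid_graph_id(id: &str) -> bool {
    !id.is_empty() && id.len() <= 64 && id.chars().all(|c| c.is_ascii_hexdigit() || c == '-')
}

/// Parse `/plan resume [id]` and `/plan retry [id]`.
fn parse_plan_command(input: &str) -> Result<PlanCommand, String> {
    let mut parts = input.split_whitespace();
    if parts.next() != Some("/plan") {
        return Err("not a /plan command".into());
    }
    let sub = parts.next().ok_or("missing subcommand")?;
    // Reject malformed ids before they reach the scheduler.
    let id = match parts.next() {
        None => None,
        Some(id) if is_valid_graph_id(id) => Some(id.to_string()),
        Some(id) => return Err(format!("invalid graph id: {id}")),
    };
    if parts.next().is_some() {
        return Err("too many arguments".into());
    }
    match sub {
        "resume" => Ok(PlanCommand::Resume(id)),
        "retry" => Ok(PlanCommand::Retry(id)),
        other => Err(format!("unknown subcommand: {other}")),
    }
}

fn main() {
    for input in ["/plan resume", "/plan retry 3f2a-9b", "/plan retry ../etc"] {
        println!("{input:?} -> {:?}", parse_plan_command(input));
    }
}
```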

### Tests (+8)
- 6 tests for `DagScheduler::resume_from()` (MT-1/II4 fix)
- 2 tests for `reset_for_retry()` with `Canceled` task handling (IC2 fix)

### Docs
- `docs/src/concepts/task-orchestration.md`: Result Aggregation section,
  `/plan resume`/`/plan retry` documentation.
- `docs/src/reference/configuration.md`: `dependency_context_budget`,
  `confirm_before_execute`, `aggregator_max_tokens` fields.
- `README.md` and `crates/zeph-core/README.md` updated.

- github-actions (bot) added the labels `documentation` (Improvements or additions to documentation), `rust` (Rust code changes), `core` (zeph-core crate), `enhancement` (New feature or request), and `size/XL` (Extra large PR (500+ lines)) on Mar 6, 2026.
- bug-ops enabled auto-merge (squash) on March 6, 2026 02:29.
- bug-ops merged commit c3ff145 into `main` on Mar 6, 2026; 28 checks passed.
- bug-ops deleted the `orchestration-aggregator` branch on March 6, 2026 02:40.