fix(orchestration): defer task to Ready on concurrency-limit spawn failure #1514

Merged
bug-ops merged 1 commit into main from dag-scheduler-concurrency-defer
Mar 10, 2026

@bug-ops bug-ops commented Mar 10, 2026

Summary

  • DagScheduler::record_spawn_failure() now detects transient concurrency-limit rejections and reverts the task to TaskStatus::Ready instead of Failed
  • This prevents spurious graph failure cascades when SubAgentManager refuses a spawn because all concurrency slots are occupied (e.g., max_concurrent=1)
  • Permanent spawn failures (invalid agent definition, config error) continue to mark the task Failed as before
  • 2 unit tests added covering both error string variants produced by SubAgentManager
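The classification described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `is_concurrency_limit_error` and `status_after_spawn_failure` are hypothetical, the `TaskStatus` enum is reduced to three variants, and the substring check is an assumption based on the one message quoted in this PR ("concurrency limit N reached"). The second error string variant is not shown in the description, so it is not reproduced here.

```rust
/// Simplified task states for illustration; the real `TaskStatus`
/// enum may have more variants than the three shown here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskStatus {
    Ready,
    Running,
    Failed,
}

/// Hypothetical helper: treats any spawn error whose message mentions
/// "concurrency limit" as transient. The exact strings matched by the
/// real scheduler are an assumption here.
pub fn is_concurrency_limit_error(message: &str) -> bool {
    message.to_ascii_lowercase().contains("concurrency limit")
}

/// Hypothetical helper: picks the status a task takes after a failed spawn.
/// Transient rejections go back to `Ready`; everything else is `Failed`.
pub fn status_after_spawn_failure(message: &str) -> TaskStatus {
    if is_concurrency_limit_error(message) {
        TaskStatus::Ready
    } else {
        TaskStatus::Failed
    }
}
```

Under these assumptions, `status_after_spawn_failure("concurrency limit 1 reached")` yields `TaskStatus::Ready`, while an unrelated message such as `"invalid agent definition"` yields `TaskStatus::Failed`.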

Root cause

DagScheduler::tick() optimistically marks tasks as Running before calling SubAgentManager::spawn(). When the manager rejected a spawn with "concurrency limit N reached", the old code called record_spawn_failure(), which unconditionally set the task to Failed and cascaded the failure to all dependents, aborting the entire plan.
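To make the cascade concrete, here is a toy model of the tick loop. It is a sketch under stated assumptions, not the real scheduler: `MiniScheduler`, `Task`, `Status`, the `defer_on_limit` flag, and `demo` are all hypothetical names, and the built-in `spawn` stands in for SubAgentManager::spawn() by rejecting only when every slot is taken.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pending, // waiting on dependencies
    Ready,
    Running,
    Failed,
}

pub struct Task {
    pub status: Status,
    pub deps: Vec<usize>, // indices of tasks this one depends on
}

pub struct MiniScheduler {
    pub tasks: Vec<Task>,
    pub max_concurrent: usize,
    pub running: usize,
    pub defer_on_limit: bool, // false = old behavior, true = new behavior
}

impl MiniScheduler {
    /// Stand-in for the manager's spawn call: rejects when all slots are occupied.
    fn spawn(&mut self) -> Result<(), String> {
        if self.running >= self.max_concurrent {
            return Err(format!("concurrency limit {} reached", self.max_concurrent));
        }
        self.running += 1;
        Ok(())
    }

    /// Marks a task Failed and recursively fails everything that depends on it.
    fn fail_with_dependents(&mut self, id: usize) {
        self.tasks[id].status = Status::Failed;
        let dependents: Vec<usize> = (0..self.tasks.len())
            .filter(|&i| self.tasks[i].status != Status::Failed && self.tasks[i].deps.contains(&id))
            .collect();
        for dependent in dependents {
            self.fail_with_dependents(dependent);
        }
    }

    pub fn tick(&mut self) {
        let ready: Vec<usize> = (0..self.tasks.len())
            .filter(|&i| self.tasks[i].status == Status::Ready)
            .collect();
        for id in ready {
            // Optimistic transition, made before the spawn is attempted.
            self.tasks[id].status = Status::Running;
            // In this toy model every rejection is a concurrency-limit rejection.
            if self.spawn().is_err() {
                if self.defer_on_limit {
                    self.tasks[id].status = Status::Ready; // retried on a later tick
                } else {
                    self.fail_with_dependents(id);
                }
            }
        }
    }
}

/// Two independent Ready tasks plus one task that depends on the second,
/// with a single concurrency slot. Returns the statuses after one tick.
pub fn demo(defer_on_limit: bool) -> Vec<Status> {
    let mut scheduler = MiniScheduler {
        tasks: vec![
            Task { status: Status::Ready, deps: vec![] },
            Task { status: Status::Ready, deps: vec![] },
            Task { status: Status::Pending, deps: vec![1] },
        ],
        max_concurrent: 1,
        running: 0,
        defer_on_limit,
    };
    scheduler.tick();
    scheduler.tasks.iter().map(|t| t.status).collect()
}
```

In this model, `demo(false)` leaves the second task and its dependent both Failed after one tick, while `demo(true)` leaves them Ready and Pending so a later tick can pick them up once the slot frees.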

Test plan

  • cargo +nightly fmt --check — pass
  • cargo clippy --workspace --features full -- -D warnings — pass
  • cargo nextest run --workspace --features full --lib --bins — 4986 passed (+2 new tests), 11 skipped

Follow-up

  • Typed SubAgentError::ConcurrencyLimit variant to replace string matching (filed separately)
  • Additional edge-case tests: multi-task deferral, max_concurrent=0, deadlock detection
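One possible shape for the typed variant mentioned in the first follow-up item is sketched below. Only the name `SubAgentError::ConcurrencyLimit` comes from this PR; the `limit` field, the `InvalidDefinition` variant, the `is_transient` helper, and the display strings are assumptions made for illustration.

```rust
use std::fmt;

/// Hypothetical error type; only the `ConcurrencyLimit` variant name
/// is taken from the follow-up note, the rest is assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SubAgentError {
    /// All concurrency slots are occupied; the spawn can be retried later.
    ConcurrencyLimit { limit: usize },
    /// Example of a permanent failure (assumed variant).
    InvalidDefinition(String),
}

impl fmt::Display for SubAgentError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SubAgentError::ConcurrencyLimit { limit } => {
                write!(f, "concurrency limit {limit} reached")
            }
            SubAgentError::InvalidDefinition(reason) => {
                write!(f, "invalid agent definition: {reason}")
            }
        }
    }
}

impl std::error::Error for SubAgentError {}

impl SubAgentError {
    /// True when the scheduler could defer the task instead of failing it.
    pub fn is_transient(&self) -> bool {
        matches!(self, SubAgentError::ConcurrencyLimit { .. })
    }
}
```

With a type like this, the scheduler could branch on `is_transient()` rather than inspecting message text, so rewording an error message would no longer change scheduling behavior.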

Closes #1513

…ilure

When DagScheduler.tick() spawns tasks and SubAgentManager rejects the
spawn because all concurrency slots are occupied, record_spawn_failure()
now reverts the task to TaskStatus::Ready instead of marking it Failed.

This prevents spurious failure cascades in multi-task plans with
max_concurrent=1 (the default), where only one task can run at a time.
Tasks deferred this way are retried automatically on the next tick.

Permanent spawn failures (invalid agent definition, config error, etc.)
continue to mark the task Failed as before.

Closes #1513
github-actions bot added labels on Mar 10, 2026: documentation (Improvements or additions to documentation), rust (Rust code changes), core (zeph-core crate), bug (Something isn't working)
github-actions bot added the size/M (Medium PR, 51-200 lines) label on Mar 10, 2026
bug-ops enabled auto-merge (squash) on March 10, 2026 01:25
bug-ops merged commit 97c4e02 into main on Mar 10, 2026
18 checks passed
bug-ops deleted the dag-scheduler-concurrency-defer branch on March 10, 2026 01:35

Development

Successfully merging this pull request may close these issues.

DagScheduler: concurrency limit hit marks task as Failed instead of deferring