fix(orchestration): defer task to Ready on concurrency-limit spawn failure #1514
Merged
…ilure

When DagScheduler.tick() spawns tasks and SubAgentManager rejects the spawn because all concurrency slots are occupied, record_spawn_failure() now reverts the task to TaskStatus::Ready instead of marking it Failed. This prevents spurious failure cascades in multi-task plans with max_concurrent=1 (the default), where only one task can run at a time. Tasks deferred this way are retried automatically on the next tick.

Permanent spawn failures (invalid agent definition, config error, etc.) continue to mark the task Failed as before.

Closes #1513
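The behavior described above can be illustrated with a minimal sketch. This is not the project's code: the `Scheduler` struct, its fields, the `Pending` variant, and the `is_concurrency_limit` helper are hypothetical stand-ins, and the substring check assumes the rejection message contains the text "concurrency limit".

```rust
use std::collections::HashMap;

/// Hypothetical task states; the PR itself names only `Ready`, `Running`, and `Failed`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskStatus {
    Pending,
    Ready,
    Running,
    Failed,
}

/// Toy stand-in for the scheduler (assumed shape, not the real `DagScheduler`).
struct Scheduler {
    status: HashMap<u32, TaskStatus>,
    /// task id -> ids of tasks that depend on it
    dependents: HashMap<u32, Vec<u32>>,
}

impl Scheduler {
    /// Assumption: a transient rejection is recognised by its message text.
    fn is_concurrency_limit(err: &str) -> bool {
        err.contains("concurrency limit")
    }

    fn record_spawn_failure(&mut self, task: u32, err: &str) {
        if Self::is_concurrency_limit(err) {
            // Transient: undo the optimistic `Running` mark so a later tick retries.
            self.status.insert(task, TaskStatus::Ready);
            return;
        }
        // Permanent: fail the task and cascade the failure to all dependents.
        let mut stack = vec![task];
        while let Some(id) = stack.pop() {
            if self.status.insert(id, TaskStatus::Failed) == Some(TaskStatus::Failed) {
                continue; // already failed, its dependents were handled earlier
            }
            if let Some(children) = self.dependents.get(&id) {
                stack.extend(children.iter().copied());
            }
        }
    }
}
```

In this sketch, a rejection such as "concurrency limit 1 reached" leaves dependents untouched and returns the task to `Ready`, while any other error string fails the task and everything downstream of it.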
Summary
- `DagScheduler::record_spawn_failure()` now detects transient concurrency-limit rejections and reverts the task to `TaskStatus::Ready` instead of `Failed`
- … `SubAgentManager` refuses a spawn because all concurrency slots are occupied (e.g., `max_concurrent=1`)
- Permanent spawn failures (invalid agent definition, config error, etc.) continue to mark the task `Failed` as before
- … `SubAgentManager`

Root cause
`DagScheduler.tick()` optimistically marks tasks as `Running` before calling `SubAgentManager::spawn()`. When the manager rejects the spawn with `"concurrency limit N reached"`, the old code called `record_spawn_failure()`, which unconditionally set the task to `Failed` and cascaded the failure to all dependents, aborting the entire plan.

Test plan
- `cargo +nightly fmt --check`: pass
- `cargo clippy --workspace --features full -- -D warnings`: pass
- `cargo nextest run --workspace --features full --lib --bins`: 4986 passed (+2 new tests), 11 skipped

Follow-up
- `SubAgentError::ConcurrencyLimit` variant to replace string matching (filed separately)
- … `max_concurrent=0`, deadlock detection

Closes #1513
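As a rough illustration of the first follow-up item, a typed variant could replace the string match along these lines. Only the name `SubAgentError::ConcurrencyLimit` comes from this PR; the `limit` field, the `InvalidDefinition` variant, and the `is_transient` helper are hypothetical, and the follow-up may well be implemented differently.

```rust
use std::fmt;

/// Hypothetical error type; only the `ConcurrencyLimit` variant name appears in the PR.
#[derive(Debug, PartialEq, Eq)]
enum SubAgentError {
    /// All concurrency slots are occupied (assumed to carry the configured limit).
    ConcurrencyLimit { limit: usize },
    /// Stand-in for a permanent failure such as an invalid agent definition.
    InvalidDefinition(String),
}

impl fmt::Display for SubAgentError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SubAgentError::ConcurrencyLimit { limit } => {
                write!(f, "concurrency limit {} reached", limit)
            }
            SubAgentError::InvalidDefinition(name) => {
                write!(f, "invalid agent definition: {}", name)
            }
        }
    }
}

impl SubAgentError {
    /// True when the spawn may succeed later, so the task can be deferred
    /// rather than failed. Matching on the variant avoids parsing messages.
    fn is_transient(&self) -> bool {
        matches!(self, SubAgentError::ConcurrencyLimit { .. })
    }
}
```

With a variant like this, the scheduler would branch on `is_transient()` rather than on message text, so rewording the error message could not silently turn a deferral back into a failure.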