DagScheduler: concurrency limit hit marks task as Failed instead of deferring #1513
Description
Bug
When DagScheduler spawns tasks on a tick and hits the sub-agent concurrency limit (max_concurrent=1 default), it calls record_spawn_failure() which unconditionally sets the task to Failed and propagates failure to dependents.
For transient errors like "concurrency limit reached", the task should remain Pending and be retried on the next scheduler tick — not permanently failed.
Reproduction
- Create a plan with 3+ tasks (e.g. LlmPlanner creates 3 independent root tasks)
- Config has the default `max_concurrent = 1`
- `/plan confirm` → DagScheduler tick → task 0 spawned → task 1 hits "concurrency limit 1 reached" → marked Failed → task 2 skipped
Expected
Task 1 should stay Pending and be picked up on the next tick after task 0 completes.
Actual
Task 1 permanently fails with "spawn failed: concurrency limit 1 reached", propagating failure to dependents.
Root cause
record_spawn_failure() at scheduler.rs:484 unconditionally marks the task Failed. It needs to distinguish transient errors (concurrency limit reached, temporary resource unavailability), which should be retried, from permanent errors (invalid agent definition, configuration error), which should fail the task.
Suggested fix
In spawn_for_task() caller (or in record_spawn_failure() itself), check if the error is "concurrency limit reached" and revert the task to Pending instead of Failed. Alternatively, check concurrency availability before attempting spawn.
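A minimal sketch of the first option. The names here (`SpawnError`, `TaskState`, the shape of `record_spawn_failure`) are illustrative assumptions, not the actual scheduler.rs API; the point is the transient/permanent split and the revert to Pending:

```rust
#[derive(Debug, PartialEq)]
enum TaskState {
    Pending,
    Running,
    Failed,
}

/// Spawn errors split into transient (retry on a later tick)
/// and permanent (fail the task and its dependents).
#[derive(Debug)]
enum SpawnError {
    ConcurrencyLimitReached(usize),
    ResourceUnavailable,
    InvalidAgentDefinition(String),
    ConfigError(String),
}

impl SpawnError {
    fn is_transient(&self) -> bool {
        matches!(
            self,
            SpawnError::ConcurrencyLimitReached(_) | SpawnError::ResourceUnavailable
        )
    }
}

/// Instead of unconditionally marking the task Failed, keep transient
/// failures Pending so the next scheduler tick retries the spawn.
fn record_spawn_failure(state: &mut TaskState, err: &SpawnError) {
    if err.is_transient() {
        *state = TaskState::Pending; // picked up again on the next tick
    } else {
        *state = TaskState::Failed; // propagate failure to dependents
    }
}

fn main() {
    let mut state = TaskState::Running;
    record_spawn_failure(&mut state, &SpawnError::ConcurrencyLimitReached(1));
    assert_eq!(state, TaskState::Pending);

    record_spawn_failure(&mut state, &SpawnError::InvalidAgentDefinition("bad".into()));
    assert_eq!(state, TaskState::Failed);
}
```

Using a typed error (rather than string-matching "concurrency limit reached" in the message) keeps the classification robust if error text changes; checking concurrency availability before attempting the spawn would avoid producing the error at all.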
Related
- fix(orchestration): include tool definitions in inline task execution #1509 (inline task execution fix)
- LlmPlanner may also need improvement: sequential goals ("first X, then Y, then Z") should create task dependencies (0→1→2), not 3 independent root tasks
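On the LlmPlanner point, a sketch of chaining sequential goals into a dependency line (0→1→2). The `Task` struct and `deps` field are hypothetical, for illustration only:

```rust
#[derive(Debug)]
struct Task {
    id: usize,
    goal: String,
    deps: Vec<usize>, // task ids this task must wait on
}

/// For "first X, then Y, then Z", make each task after the first
/// depend on its predecessor instead of emitting independent roots.
fn chain_sequential(goals: &[&str]) -> Vec<Task> {
    goals
        .iter()
        .enumerate()
        .map(|(i, g)| Task {
            id: i,
            goal: g.to_string(),
            deps: if i == 0 { vec![] } else { vec![i - 1] },
        })
        .collect()
}

fn main() {
    let tasks = chain_sequential(&["first X", "then Y", "then Z"]);
    assert!(tasks[0].deps.is_empty());
    assert_eq!(tasks[1].deps, vec![0]);
    assert_eq!(tasks[2].deps, vec![1]);
}
```

With the dependency chain in place, the scheduler would only ever have one runnable task per tick, which also sidesteps the concurrency-limit failure for this class of plans.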
Live test evidence
```
spawn_for_task failed error=spawn failed: concurrency limit 1 reached task_id=1
scheduler: spawn failed, marking task failed task_id=1
Plan failed. 1/3 tasks failed:
```