DagScheduler: concurrency limit hit marks task as Failed instead of deferring #1513

@bug-ops

Description

Bug

When DagScheduler spawns tasks on a tick and hits the sub-agent concurrency limit (default max_concurrent = 1), it calls record_spawn_failure(), which unconditionally sets the task to Failed and propagates the failure to its dependents.

For transient errors like "concurrency limit reached", the task should remain Pending and be retried on the next scheduler tick — not permanently failed.
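A minimal sketch of the failure path as described (all type and field names here are assumptions for illustration, not the actual scheduler.rs code):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TaskState {
    Pending,
    Running,
    Failed,
}

struct Task {
    state: TaskState,
    dependents: Vec<usize>, // indices of tasks that depend on this one
}

struct Scheduler {
    tasks: Vec<Task>,
}

impl Scheduler {
    /// Current behavior: ANY spawn error permanently fails the task
    /// and cascades the failure to every dependent, even when the
    /// error is a transient "concurrency limit reached".
    fn record_spawn_failure(&mut self, task_id: usize) {
        self.tasks[task_id].state = TaskState::Failed;
        let deps = self.tasks[task_id].dependents.clone();
        for dep in deps {
            self.tasks[dep].state = TaskState::Failed;
        }
    }
}

fn main() {
    let mut s = Scheduler {
        tasks: vec![
            Task { state: TaskState::Pending, dependents: vec![1] },
            Task { state: TaskState::Pending, dependents: vec![] },
        ],
    };
    // A transient concurrency-limit error hits the same path as a
    // permanent one, so both the task and its dependent fail.
    s.record_spawn_failure(0);
    assert_eq!(s.tasks[0].state, TaskState::Failed);
    assert_eq!(s.tasks[1].state, TaskState::Failed);
}
```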

Reproduction

  1. Create a plan with 3+ tasks (e.g. LlmPlanner creates 3 independent root tasks)
  2. Config has default max_concurrent = 1
  3. /plan confirm → DagScheduler tick → task 0 spawned → task 1 "concurrency limit 1 reached" → Failed → task 2 skipped

Expected

Task 1 should stay Pending and be picked up on the next tick after task 0 completes.

Actual

Task 1 permanently fails with "spawn failed: concurrency limit 1 reached", propagating failure to dependents.

Root cause

record_spawn_failure() at scheduler.rs:484 unconditionally marks the task Failed. It needs to distinguish transient errors (concurrency limit, temporary resource unavailability) from permanent errors (invalid agent definition, configuration error).
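One way to make that distinction explicit is a typed error with a transient/permanent classification. The SpawnError type and its variants below are assumptions, not the real error type:

```rust
#[derive(Debug)]
enum SpawnError {
    // Transient: retry on a later tick.
    ConcurrencyLimitReached { limit: usize },
    ResourceTemporarilyUnavailable,
    // Permanent: fail the task.
    InvalidAgentDefinition(String),
    ConfigurationError(String),
}

impl SpawnError {
    /// Transient errors should leave the task Pending for the next
    /// scheduler tick; permanent errors should mark it Failed.
    fn is_transient(&self) -> bool {
        matches!(
            self,
            SpawnError::ConcurrencyLimitReached { .. }
                | SpawnError::ResourceTemporarilyUnavailable
        )
    }
}

fn main() {
    assert!(SpawnError::ConcurrencyLimitReached { limit: 1 }.is_transient());
    assert!(!SpawnError::InvalidAgentDefinition("bad agent".into()).is_transient());
}
```

With this in place, the caller can branch on `is_transient()` instead of string-matching the error message.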

Suggested fix

In spawn_for_task() caller (or in record_spawn_failure() itself), check if the error is "concurrency limit reached" and revert the task to Pending instead of Failed. Alternatively, check concurrency availability before attempting spawn.
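A caller-side sketch of the first option, reverting to Pending on the concurrency-limit error instead of failing (function and state names are hypothetical; the string check mirrors the log message in the evidence below):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TaskState {
    Pending,
    Running,
    Failed,
}

/// Hypothetical caller-side handler for the spawn result.
fn handle_spawn_result(result: Result<(), String>, state: &mut TaskState) {
    match result {
        Ok(()) => *state = TaskState::Running,
        // Transient: leave the task Pending so the next tick retries it
        // once a concurrency slot frees up.
        Err(e) if e.contains("concurrency limit") => *state = TaskState::Pending,
        // Permanent: keep the existing failure path.
        Err(_) => *state = TaskState::Failed,
    }
}

fn main() {
    let mut st = TaskState::Pending;
    handle_spawn_result(
        Err("spawn failed: concurrency limit 1 reached".into()),
        &mut st,
    );
    assert_eq!(st, TaskState::Pending);

    handle_spawn_result(Err("invalid agent definition".into()), &mut st);
    assert_eq!(st, TaskState::Failed);
}
```

The pre-check alternative (testing concurrency availability before attempting the spawn) avoids the error round-trip entirely, but still needs the transient/permanent split for races where a slot is taken between the check and the spawn.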

Related

Live test evidence

```
spawn_for_task failed error=spawn failed: concurrency limit 1 reached task_id=1
scheduler: spawn failed, marking task failed task_id=1
Plan failed. 1/3 tasks failed:
```

Labels: bug