perf(orchestration): parallel task dispatch in DagScheduler #1718

Merged
bug-ops merged 2 commits into main from async-parallel-dag-dispatch
Mar 14, 2026

bug-ops commented Mar 14, 2026

Resolves #1628.

Removes the slots_available cap from DagScheduler::tick(), dispatching all
tasks that are ready in a single tick rather than at most max_parallel - running.
Concurrency is enforced at the SubAgentManager layer via ConcurrencyLimit
rejection.
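
For illustration, the difference between the old capped dispatch and the new uncapped dispatch can be sketched roughly as follows. This is a minimal standalone sketch, not the actual DagScheduler code: the `Status` enum and the `dispatch_capped` / `dispatch_all` helpers are hypothetical names introduced here, and tasks are reduced to a slice of statuses.

```rust
/// Hypothetical task status, for illustration only.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Ready,
    Running,
    Done,
}

/// Old behaviour (sketch): dispatch at most `max_parallel - running` ready tasks.
fn dispatch_capped(statuses: &[Status], max_parallel: usize) -> Vec<usize> {
    let running = statuses.iter().filter(|s| **s == Status::Running).count();
    let slots_available = max_parallel.saturating_sub(running);
    statuses
        .iter()
        .enumerate()
        .filter(|(_, s)| **s == Status::Ready)
        .map(|(i, _)| i)
        .take(slots_available)
        .collect()
}

/// New behaviour (sketch): dispatch every ready task in the same tick and
/// leave concurrency enforcement to the layer that actually spawns work.
fn dispatch_all(statuses: &[Status]) -> Vec<usize> {
    statuses
        .iter()
        .enumerate()
        .filter(|(_, s)| **s == Status::Ready)
        .map(|(i, _)| i)
        .collect()
}
```

With one task running, three ready, and a `max_parallel` of 2, the capped version returns a single index while the uncapped version returns all three ready indices.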

Changes

  • DagScheduler::tick(): dispatches all ready tasks; removes slots_available
    calculation based on max_parallel.
  • DagScheduler::record_batch_backoff(): new method for batch-aware exponential
    backoff; resets counter on any success, increments only when the whole batch
    is deferred by concurrency limits.
  • wait_event(): buffer guard updated from max_parallel * 2 to
    graph.tasks.len() * 2 to prevent dropped completion events under burst
    dispatch.
  • run_scheduler_loop() (zeph-core): tracks per-tick any_spawn_success and
    any_concurrency_failure flags; calls record_batch_backoff after each action
    batch; moves spawn_counter increment into the success path only.
  • 6 new tests, 4 updated tests.
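
The batch-aware backoff rule described above could look something like the sketch below. It is an assumption-laden illustration rather than the merged implementation: the `BackoffState` struct and the `delay` helper are hypothetical, and the choice to leave the counter unchanged when a batch has neither a success nor a concurrency failure is a guess, since the description does not cover that case.

```rust
use std::time::Duration;

/// Hypothetical holder for the backoff counter (illustration only).
#[derive(Debug, Default)]
struct BackoffState {
    consecutive_spawn_failures: u32,
}

impl BackoffState {
    /// Called once per batch of spawn actions, not once per rejected spawn.
    fn record_batch_backoff(&mut self, any_success: bool, any_concurrency_failure: bool) {
        if any_success {
            // Any successful spawn in the batch resets the counter.
            self.consecutive_spawn_failures = 0;
        } else if any_concurrency_failure {
            // The whole batch was deferred by concurrency limits: one increment.
            self.consecutive_spawn_failures = self.consecutive_spawn_failures.saturating_add(1);
        }
        // Assumption: a batch with no success and no concurrency failure
        // leaves the counter untouched.
    }

    /// Hypothetical exponential delay derived from the counter, capped at `max`.
    fn delay(&self, base: Duration, max: Duration) -> Duration {
        let shift = self.consecutive_spawn_failures.min(16);
        base.saturating_mul(1u32 << shift).min(max)
    }
}
```

Counting once per batch matters because an uncapped tick can produce many ConcurrencyLimit rejections at once; counting each rejection separately would inflate the backoff after a single busy tick.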

Notes

  • max_parallel config field is retained for buffer sizing but no longer caps
    dispatch count per tick.
  • RunInline tasks still block the tick loop for their duration (pre-existing
    limitation, documented in code comment).

github-actions bot added the documentation, performance, rust, core, and size/L labels Mar 14, 2026
bug-ops added 2 commits March 14, 2026 01:20
Remove the slots_available cap from DagScheduler::tick() so all ready
tasks are dispatched in a single tick. Concurrency is enforced by
SubAgentManager::spawn() which returns ConcurrencyLimit when capacity is
exhausted; tasks revert to Ready and are retried on the next tick.

- tick(): remove running_in_graph/slots_available/.take(slots_available)
- wait_event(): buffer guard uses graph.tasks.len()*2 instead of
  max_parallel*2 to prevent dropped completion events during bursts
- record_spawn_failure(): no longer increments consecutive_spawn_failures
- add record_batch_backoff(any_success, any_concurrency_failure) for
  batch-aware backoff: counter increments once per all-failed tick, not
  once per rejected spawn in the same batch
- run_scheduler_loop(): track any_spawn_success/any_concurrency_failure
  across the batch, call record_batch_backoff() after all Spawn actions
- spawn_counter incremented only on successful spawn (Ok path)
- update and add scheduler tests reflecting new semantics (8 tests)
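
A rough sketch of the per-batch bookkeeping listed in this commit message is given below. It is a simplified model under stated assumptions, not the zeph-core loop: `Manager`, `TaskState`, `BatchOutcome`, and `run_batch` are hypothetical stand-ins, `spawn()` is synchronous here, and only the ConcurrencyLimit error path is modelled.

```rust
/// Hypothetical spawn error; only the concurrency rejection is modelled.
#[derive(Debug, PartialEq, Eq)]
enum SpawnError {
    ConcurrencyLimit,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskState {
    Ready,
    Running,
}

/// Hypothetical stand-in for the layer that enforces concurrency.
struct Manager {
    capacity: usize,
    active: usize,
}

impl Manager {
    fn spawn(&mut self) -> Result<(), SpawnError> {
        if self.active >= self.capacity {
            return Err(SpawnError::ConcurrencyLimit);
        }
        self.active += 1;
        Ok(())
    }
}

#[derive(Debug, Default)]
struct BatchOutcome {
    any_spawn_success: bool,
    any_concurrency_failure: bool,
    spawn_counter: u64,
}

/// Try to spawn every ready task; rejected tasks stay Ready for the next tick.
fn run_batch(manager: &mut Manager, states: &mut [TaskState]) -> BatchOutcome {
    let mut outcome = BatchOutcome::default();
    for state in states.iter_mut().filter(|s| **s == TaskState::Ready) {
        match manager.spawn() {
            Ok(()) => {
                *state = TaskState::Running;
                outcome.any_spawn_success = true;
                // The counter moves only on the success path.
                outcome.spawn_counter += 1;
            }
            Err(SpawnError::ConcurrencyLimit) => {
                // Task remains Ready and is retried on a later tick.
                outcome.any_concurrency_failure = true;
            }
        }
    }
    outcome
}
```

In this model the two flags would be passed to a batch-level backoff call after the loop, so a tick in which some spawns succeed and others are rejected does not count as a failure.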
…1628)

- CHANGELOG: rebase restores #1646 ToolCallDag and #1652 Gemini thinking
  entries; reverts #1387 description to "info-level log message"
- record_spawn(): add doc comment explaining intentional reset overlap
  with record_batch_backoff (counter #3)
- tick(): add deadlock detection comment explaining transient vs fatal
  failure handling and ConcurrencyLimit revert-to-Ready path (R1)
- test_buffer_guard_uses_task_count: add structural regression note
  explaining the test guards against reversion to max_parallel*2 (fix #4)
- test_batch_mixed_concurrency_and_fatal_failure: new test covering
  mixed batch where task 0 gets ConcurrencyLimit and task 1 gets a
  non-transient Spawn error with FailureStrategy::Skip (fix #5)
bug-ops force-pushed the async-parallel-dag-dispatch branch from 8fb79a3 to d2c7b50 March 14, 2026 00:20
bug-ops enabled auto-merge (squash) March 14, 2026 00:20
bug-ops merged commit efcbb02 into main Mar 14, 2026
15 checks passed
bug-ops deleted the async-parallel-dag-dispatch branch March 14, 2026 00:27
Linked issue that merging this pull request may close: research(orchestration): async parallel task dispatch in DagScheduler (DynTaskMAS pattern)