perf(orchestration): parallel task dispatch in DagScheduler #1718
Merged
Remove the slots_available cap from DagScheduler::tick() so all ready tasks are dispatched in a single tick. Concurrency is enforced by SubAgentManager::spawn(), which returns ConcurrencyLimit when capacity is exhausted; tasks revert to Ready and are retried on the next tick.

- tick(): remove running_in_graph/slots_available/.take(slots_available)
- wait_event(): buffer guard uses graph.tasks.len()*2 instead of max_parallel*2 to prevent dropped completion events during bursts
- record_spawn_failure(): no longer increments consecutive_spawn_failures
- add record_batch_backoff(any_success, any_concurrency_failure) for batch-aware backoff: the counter increments once per all-failed tick, not once per rejected spawn in the same batch
- run_scheduler_loop(): track any_spawn_success/any_concurrency_failure across the batch, call record_batch_backoff() after all Spawn actions
- spawn_counter incremented only on successful spawn (Ok path)
- update and add scheduler tests reflecting new semantics (8 tests)
…1628)

- CHANGELOG: rebase restores #1646 ToolCallDag and #1652 Gemini thinking entries; reverts #1387 description to "info-level log message"
- record_spawn(): add doc comment explaining intentional reset overlap with record_batch_backoff (counter #3)
- tick(): add deadlock detection comment explaining transient vs fatal failure handling and the ConcurrencyLimit revert-to-Ready path (R1)
- test_buffer_guard_uses_task_count: add structural regression note explaining that the test guards against reversion to max_parallel*2 (fix #4)
- test_batch_mixed_concurrency_and_fatal_failure: new test covering a mixed batch where task 0 gets ConcurrencyLimit and task 1 gets a non-transient Spawn error with FailureStrategy::Skip (fix #5)
8fb79a3 to d2c7b50
Resolves #1628.

Removes the `slots_available` cap from `DagScheduler::tick()`, dispatching all tasks that are ready in a single tick rather than at most `max_parallel - running`. Concurrency is enforced at the `SubAgentManager` layer via `ConcurrencyLimit` rejection.
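
As a rough illustration of the dispatch model described above, the following is a minimal sketch, not the code from this PR. `Manager`, `TaskState`, `SpawnError`, and `dispatch_ready` are hypothetical stand-ins; the real `SubAgentManager::spawn()` signature and the scheduler's task graph are not shown here and are assumed to differ. The point is only that the tick offers every ready task to the manager, and the manager's rejection, not a per-tick slot count, bounds concurrency.

```rust
/// Simplified task states; the real scheduler is assumed to track more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskState {
    Ready,
    Running,
}

/// Hypothetical error type mirroring the ConcurrencyLimit rejection.
#[derive(Debug, PartialEq, Eq)]
enum SpawnError {
    ConcurrencyLimit,
}

/// Hypothetical stand-in for SubAgentManager: it alone owns the cap.
struct Manager {
    capacity: usize,
    running: usize,
}

impl Manager {
    /// Accepts the task if capacity remains, otherwise rejects it.
    fn spawn(&mut self, _task_id: usize) -> Result<(), SpawnError> {
        if self.running >= self.capacity {
            return Err(SpawnError::ConcurrencyLimit);
        }
        self.running += 1;
        Ok(())
    }
}

/// Offers every Ready task to the manager in one pass and returns how
/// many were started. Rejected tasks are left in Ready so that the next
/// call picks them up again.
fn dispatch_ready(states: &mut [TaskState], manager: &mut Manager) -> usize {
    let mut started = 0;
    for (id, state) in states.iter_mut().enumerate() {
        if *state != TaskState::Ready {
            continue;
        }
        match manager.spawn(id) {
            Ok(()) => {
                *state = TaskState::Running;
                started += 1;
            }
            // Capacity exhausted: the task stays Ready for the next tick.
            Err(SpawnError::ConcurrencyLimit) => {}
        }
    }
    started
}
```

With five ready tasks and a manager capacity of two, one call starts two tasks and leaves three in `Ready`; a later call starts more once capacity frees up.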
Changes

- `DagScheduler::tick()`: dispatches all ready tasks; removes the `slots_available` calculation based on `max_parallel`.
- `DagScheduler::record_batch_backoff()`: new method for batch-aware exponential backoff; resets the counter on any success, increments only when the whole batch is deferred by concurrency limits.
- `wait_event()`: buffer guard updated from `max_parallel * 2` to `graph.tasks.len() * 2` to prevent dropped completion events under burst dispatch.
- `run_scheduler_loop()` (zeph-core): tracks per-tick `any_spawn_success` and `any_concurrency_failure` flags; calls `record_batch_backoff` after each action batch; moves the `spawn_counter` increment into the success path only.

Notes

- The `max_parallel` config field is retained for buffer sizing but no longer caps dispatch count per tick.
- `RunInline` tasks still block the tick loop for their duration (pre-existing limitation, documented in code comment).
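
The batch-aware backoff listed under Changes could look roughly like the sketch below. It follows the stated semantics: reset on any success, one increment per tick in which the whole batch was deferred by concurrency limits. The `Backoff` struct and the `delay` helper are assumptions made for illustration; in particular, the doubling formula and the cap are not taken from this PR.

```rust
use std::time::Duration;

/// Hypothetical holder for the failure counter; in the PR the counter
/// lives on DagScheduler.
struct Backoff {
    consecutive_spawn_failures: u32,
}

impl Backoff {
    /// Called once per tick, after every Spawn action in the batch has
    /// been attempted.
    fn record_batch_backoff(&mut self, any_success: bool, any_concurrency_failure: bool) {
        if any_success {
            // At least one task started: the system is making progress.
            self.consecutive_spawn_failures = 0;
        } else if any_concurrency_failure {
            // The whole batch was deferred: count it once, regardless of
            // how many spawns were rejected in this tick.
            self.consecutive_spawn_failures =
                self.consecutive_spawn_failures.saturating_add(1);
        }
    }

    /// Assumed delay policy (not from the PR): no delay while the counter
    /// is zero, otherwise base * 2^failures, capped at `max`.
    fn delay(&self, base: Duration, max: Duration) -> Duration {
        if self.consecutive_spawn_failures == 0 {
            return Duration::ZERO;
        }
        let factor = 2u32.saturating_pow(self.consecutive_spawn_failures);
        base.saturating_mul(factor).min(max)
    }
}
```

Counting once per tick rather than once per rejected spawn keeps a large burst of ready tasks from inflating the delay after a single saturated tick.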