fix(orchestration): graceful drain on channel close, max_parallel drift, toposort 3-4x #2263

Merged
bug-ops merged 2 commits into main from fix/2246-plan-confirm-race
Mar 27, 2026

Conversation

bug-ops (Owner) commented Mar 27, 2026

Summary

Test plan

  • cargo nextest run --config-file .github/nextest.toml -p zeph-orchestration --lib --bins — 251 tests pass
  • cargo nextest run --config-file .github/nextest.toml -p zeph-core --lib --bins — 1157 tests pass (includes 3 new shutdown-path tests replacing stale COV-04)
  • New tests: classify_with_depths_matches_classify_for_all_variants, 6 compute_max_parallel_*, config_max_parallel_initialized_from_config, max_parallel_does_not_drift_across_inject_tick_cycles, scheduler_loop_channel_close_supports_exit_returns_canceled, scheduler_loop_channel_close_no_exit_support_returns_failed, scheduler_loop_channel_close_drain_captures_completion

Closes #2246, #2237, #2236

github-actions bot added the labels documentation, rust, core, bug, size/XL (March 27, 2026)
bug-ops force-pushed the fix/2246-plan-confirm-race branch from 64ec1f9 to 7fe929a (March 27, 2026 12:25)
bug-ops enabled auto-merge (squash) (March 27, 2026 12:25)
bug-ops force-pushed the fix/2246-plan-confirm-race branch 3 times, most recently from 3189bcc to 6b77ad6 (March 27, 2026 12:56)
bug-ops added 2 commits March 27, 2026 14:10
… drift, reduce toposort passes

Fixes #2246, #2237, #2236.

- #2246: DagScheduler now drains buffered task-completion events before
  calling cancel_all() on channel close, preventing in-flight tasks from
  being silently counted as "did not run". Shutdown status is now
  channel-type-aware: supports_exit()=true (CLI/TUI) -> Canceled;
  supports_exit()=false (Telegram/Discord/Slack) -> Failed so users can
  /plan retry after reconnect.

- #2237: Extract compute_max_parallel(topology, base) as a single
  canonical method on TopologyClassifier. DagScheduler stores
  config_max_parallel (immutable config value) and uses it as the base
  in both analyze() and the tick() dirty-reanalysis path. After topology
  assignment in tick(), self.max_parallel is explicitly synced to
  self.topology.max_parallel, closing the drift bug.

- #2236: Add classify_with_depths(graph, longest_path, depths) that
  accepts pre-computed toposort values. analyze() and tick() call it
  with values from a single compute_longest_path_and_depths pass,
  reducing toposort work from 3-4x to 1x per analysis.
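
The #2236 item above can be illustrated with a minimal, std-only sketch. It is not the zeph-orchestration code: `Graph`, `Shape`, `longest_path_and_depths`, and `classify_from_precomputed` are hypothetical names chosen for this example, and the classification rules are simplified assumptions. The point is only that one Kahn-style pass can produce both the longest path and the per-node depths, which a classifier can then consume without sorting again.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the task graph: `edges[u]` lists the tasks
/// that depend on task `u`. All indices are assumed to be in range.
pub struct Graph {
    pub edges: Vec<Vec<usize>>,
}

/// Simplified topology categories, assumed for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum Shape {
    Empty,
    Parallel,
    Chain,
    Mixed,
}

/// One topological pass (Kahn's algorithm) that yields the longest path
/// length in edges and the depth of every node. Returns `None` on a cycle.
pub fn longest_path_and_depths(g: &Graph) -> Option<(usize, Vec<usize>)> {
    let n = g.edges.len();
    let mut indegree = vec![0usize; n];
    for targets in &g.edges {
        for &v in targets {
            indegree[v] += 1;
        }
    }
    let mut queue: VecDeque<usize> = (0..n).filter(|&u| indegree[u] == 0).collect();
    let mut depths = vec![0usize; n];
    let mut visited = 0usize;
    while let Some(u) = queue.pop_front() {
        visited += 1;
        for &v in &g.edges[u] {
            // A node sits one level below its deepest predecessor.
            depths[v] = depths[v].max(depths[u] + 1);
            indegree[v] -= 1;
            if indegree[v] == 0 {
                queue.push_back(v);
            }
        }
    }
    if visited != n {
        return None; // some nodes never reached indegree 0, so there is a cycle
    }
    let longest = depths.iter().copied().max().unwrap_or(0);
    Some((longest, depths))
}

/// Classifies from pre-computed values and performs no graph traversal.
/// `longest_path` and `depths` are assumed to come from the same pass.
pub fn classify_from_precomputed(g: &Graph, longest_path: usize, depths: &[usize]) -> Shape {
    if g.edges.is_empty() {
        return Shape::Empty;
    }
    if longest_path == 0 {
        return Shape::Parallel; // no dependencies at all
    }
    let mut per_level = vec![0usize; longest_path + 1];
    for &d in depths {
        per_level[d] += 1;
    }
    let width = per_level.iter().copied().max().unwrap_or(0);
    if width == 1 { Shape::Chain } else { Shape::Mixed }
}
```

A caller would run `longest_path_and_depths` once per analysis and hand the result to `classify_from_precomputed`, so repeated analyses cost one traversal each rather than several.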

New tests: 8 topology unit tests, 2 scheduler regression tests, 3
agent-loop shutdown-path tests (including stale COV-04 replacement).
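
The #2246 shutdown ordering described in the commit message can be sketched as follows. This is a simplified model using `std::sync::mpsc`, not the actual DagScheduler: `MiniScheduler`, `TaskState`, `PlanStatus`, and `shutdown_on_channel_close` are hypothetical names, and the real scheduler presumably uses an async channel. The sketch shows the two ideas from the message: drain buffered completion events before canceling whatever is still running, and pick the final status from the channel's `supports_exit()` value.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Receiver;

/// Hypothetical per-task state for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Running,
    Completed,
    Canceled,
}

/// Final plan status after the user-facing channel closes.
#[derive(Debug, PartialEq, Eq)]
pub enum PlanStatus {
    Canceled,
    Failed,
}

/// Minimal stand-in for a scheduler: task states plus a queue of
/// completion events carrying the id of the finished task.
pub struct MiniScheduler {
    pub tasks: HashMap<u32, TaskState>,
    pub completions: Receiver<u32>,
}

impl MiniScheduler {
    /// Called when the user-facing channel closes.
    pub fn shutdown_on_channel_close(&mut self, supports_exit: bool) -> PlanStatus {
        // Step 1: drain completions that were already buffered, so finished
        // tasks are recorded instead of being swept up by the cancel step.
        // `try_recv` never blocks; it returns Err once the buffer is empty.
        while let Ok(id) = self.completions.try_recv() {
            if let Some(state) = self.tasks.get_mut(&id) {
                *state = TaskState::Completed;
            }
        }
        // Step 2: only now cancel whatever is still running.
        for state in self.tasks.values_mut() {
            if *state == TaskState::Running {
                *state = TaskState::Canceled;
            }
        }
        // Step 3: channels with an explicit exit (CLI/TUI) report Canceled;
        // channels that merely disconnected report Failed so a retry is possible.
        if supports_exit {
            PlanStatus::Canceled
        } else {
            PlanStatus::Failed
        }
    }
}
```

Reversing steps 1 and 2 would reproduce the reported bug in this model: a task whose completion event was already queued would be marked canceled and counted as never having run.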
bug-ops force-pushed the fix/2246-plan-confirm-race branch from 6b77ad6 to d571264 (March 27, 2026 13:10)
bug-ops merged commit aeeb06f into main (Mar 27, 2026)
25 checks passed
bug-ops deleted the fix/2246-plan-confirm-race branch (March 27, 2026 13:17)