Skip to content

fix: properly fall back to greedy scheduler in case of pulp/cbc errors#3733

Merged
johanneskoester merged 1 commit intomainfrom
fix/scheduler-fallback
Sep 9, 2025
Merged

fix: properly fall back to greedy scheduler in case of pulp/cbc errors#3733
johanneskoester merged 1 commit intomainfrom
fix/scheduler-fallback

Conversation

@johanneskoester
Copy link
Copy Markdown
Contributor

@johanneskoester johanneskoester commented Sep 9, 2025

fixes #3719

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

Summary by CodeRabbit

  • New Features

    • Automatic fallback to a greedy scheduler when the ILP solver fails or times out, ensuring continued progress.
    • More informative logging that clearly indicates when and why a fallback occurs.
  • Bug Fixes

    • Improved robustness by gracefully handling solver errors and timeouts, preventing crashes or stalls during scheduling.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 9, 2025

📝 Walkthrough

Walkthrough

Adds a scheduler reference attribute in JobScheduler. In MILP Scheduler, introduces a technical failure flag, wraps ILP solve in try/except with logging, and adds a fallback path to the greedy scheduler, including early-return behavior indicating no ILP result.

Changes

Cohort / File(s) Summary
JobScheduler internal reference
src/snakemake/scheduling/job_scheduler.py
Adds private attribute self._scheduler assigned from constructor parameter; no other logic changes.
MILP scheduler fallback & robustness
src/snakemake/scheduling/milp.py
Adds _technical_failure_ flag and __post_init__; wraps ILP solve in try/except for PulpSolverError, logs, and triggers fallback; updates select_jobs control flow to support None/early return for fallback and extends logging; signature change noted in summary.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant JS as JobScheduler
  participant MILP as MILP Scheduler
  participant ILP as ILP Solver
  participant GS as Greedy Scheduler

  JS->>MILP: select_jobs(selectable, remaining, resources, sizes)
  alt Technical failure previously
    MILP-->>JS: None (signal fallback)
    JS->>GS: select_jobs(...)
    GS-->>JS: jobs
  else Attempt ILP
    MILP->>ILP: Solve model
    alt Success & optimal
      ILP-->>MILP: Solution
      MILP-->>JS: jobs
    else Error / timeout / non-optimal
      ILP-->>MILP: Failure/Status
      MILP->>MILP: Mark technical failure, log
      MILP-->>JS: None (trigger fallback)
      JS->>GS: select_jobs(...)
      GS-->>JS: jobs
    end
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks (1 passed, 2 warnings)

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The description only contains “fixes #3719” but lacks any summary of the PR’s intended changes or rationale and does not use the required placeholder section to describe implementation details. Please add a prose summary of what this PR changes (e.g., fallback behavior in the MILP scheduler on pulp/cbc errors, logging enhancements, and new private attributes) above the QC checklist as specified by the repository description template.
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title succinctly describes the primary change of adding a fallback to the greedy scheduler when pulp/cbc errors occur, matching the main logic introduced in the MILP scheduler updates. It is clear, specific, and free of extraneous detail.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/scheduler-fallback

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@johanneskoester johanneskoester changed the title fix: properly fall back to greedy scheduler in case of pulp/cbc errors fix: properly fall back to greedy scheduler in case of pulp/cbc errors Sep 9, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
src/snakemake/scheduling/job_scheduler.py (1)

160-160: New _scheduler reference is unused; verify necessity vs. workflow.scheduler.

Storing the scheduler instance here appears redundant since JobScheduler already binds scheduler.select_jobs and workflow exposes workflow.scheduler. If no external code reads JobScheduler._scheduler, consider removing it to avoid confusion, or rename to clarify intent (e.g., _scheduler_impl). Please confirm if anything relies on this attribute.

src/snakemake/scheduling/milp.py (2)

52-53: Make PulpSolverError import robust across pulp versions.

Some pulp versions expose PulpSolverError differently. Fallback to a generic Exception if the symbol isn’t importable to preserve the greedy fallback.

Apply:

-        from pulp import PulpSolverError
+        try:
+            from pulp import PulpSolverError
+        except Exception:
+            PulpSolverError = Exception

Additionally, consider guarding the import pulp itself to trigger fallback when pulp is missing.


171-178: Avoid hard-coding the 10s timeout in logs.

Tie the message to the actual time_limit used to keep logs accurate if this changes later.

Apply:

-                    "Failed to solve scheduling problem with ILP solver in time (10s), "
+                    f"Failed to solve scheduling problem with ILP solver in time ({time_limit}s), "
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4d2bbe5 and aadfeba.

📒 Files selected for processing (2)
  • src/snakemake/scheduling/job_scheduler.py (1 hunks)
  • src/snakemake/scheduling/milp.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • src/snakemake/scheduling/job_scheduler.py
  • src/snakemake/scheduling/milp.py
🧠 Learnings (1)
📚 Learning: 2025-07-29T14:53:04.598Z
Learnt from: johanneskoester
PR: snakemake/snakemake#3676
File: src/snakemake/cli.py:0-0
Timestamp: 2025-07-29T14:53:04.598Z
Learning: In the Snakemake repository, deprecated scheduler interfaces (like scheduler_ilp_solver, --scheduler-solver-path, --scheduler-greediness) should be kept for backward compatibility during the migration to the plugin-based scheduler system, rather than being removed immediately.

Applied to files:

  • src/snakemake/scheduling/job_scheduler.py
  • src/snakemake/scheduling/milp.py
🧬 Code graph analysis (2)
src/snakemake/scheduling/job_scheduler.py (1)
src/snakemake/workflow.py (2)
  • scheduler (481-482)
  • scheduler (485-486)
src/snakemake/scheduling/milp.py (1)
src/snakemake/scheduling/greedy.py (1)
  • select_jobs (42-113)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (55)
  • GitHub Check: tests (10, ubuntu-latest, py313)
  • GitHub Check: tests (9, macos-latest, py312)
  • GitHub Check: tests (10, windows-2022, py312)
  • GitHub Check: tests (9, windows-2022, py313)
  • GitHub Check: tests (10, ubuntu-latest, py311)
  • GitHub Check: tests (10, ubuntu-latest, py312)
  • GitHub Check: tests (10, windows-2022, py313)
  • GitHub Check: tests (3, macos-latest, py312)
  • GitHub Check: tests (9, ubuntu-latest, py311)
  • GitHub Check: tests (8, ubuntu-latest, py312)
  • GitHub Check: tests (9, ubuntu-latest, py312)
  • GitHub Check: tests (9, windows-2022, py312)
  • GitHub Check: tests (9, ubuntu-latest, py313)
  • GitHub Check: tests (8, windows-2022, py313)
  • GitHub Check: tests (8, ubuntu-latest, py313)
  • GitHub Check: tests (7, ubuntu-latest, py311)
  • GitHub Check: tests (8, ubuntu-latest, py311)
  • GitHub Check: tests (5, windows-2022, py312)
  • GitHub Check: tests (7, windows-2022, py313)
  • GitHub Check: tests (8, windows-2022, py312)
  • GitHub Check: tests (6, ubuntu-latest, py313)
  • GitHub Check: tests (7, ubuntu-latest, py312)
  • GitHub Check: tests (7, windows-2022, py312)
  • GitHub Check: tests (5, ubuntu-latest, py311)
  • GitHub Check: tests (7, ubuntu-latest, py313)
  • GitHub Check: tests (6, ubuntu-latest, py312)
  • GitHub Check: tests (6, ubuntu-latest, py311)
  • GitHub Check: tests (5, windows-2022, py313)
  • GitHub Check: tests (6, windows-2022, py312)
  • GitHub Check: tests (4, windows-2022, py312)
  • GitHub Check: tests (5, ubuntu-latest, py312)
  • GitHub Check: tests (5, ubuntu-latest, py313)
  • GitHub Check: tests (6, windows-2022, py313)
  • GitHub Check: tests (4, ubuntu-latest, py312)
  • GitHub Check: tests (4, ubuntu-latest, py311)
  • GitHub Check: tests (3, ubuntu-latest, py312)
  • GitHub Check: tests (4, windows-2022, py313)
  • GitHub Check: tests (4, ubuntu-latest, py313)
  • GitHub Check: tests (3, windows-2022, py313)
  • GitHub Check: tests (3, ubuntu-latest, py313)
  • GitHub Check: tests (3, windows-2022, py312)
  • GitHub Check: tests (2, ubuntu-latest, py313)
  • GitHub Check: tests (2, windows-2022, py313)
  • GitHub Check: tests (3, ubuntu-latest, py311)
  • GitHub Check: tests (2, ubuntu-latest, py311)
  • GitHub Check: tests (2, windows-2022, py312)
  • GitHub Check: tests (1, windows-2022, py313)
  • GitHub Check: tests (1, macos-latest, py312)
  • GitHub Check: tests (1, macos-latest, py313)
  • GitHub Check: tests (1, windows-2022, py312)
  • GitHub Check: tests (2, ubuntu-latest, py312)
  • GitHub Check: tests (1, ubuntu-latest, py313)
  • GitHub Check: tests (1, ubuntu-latest, py311)
  • GitHub Check: tests (1, ubuntu-latest, py312)
  • GitHub Check: apidocs
🔇 Additional comments (1)
src/snakemake/scheduling/milp.py (1)

46-49: Verify base select_jobs return type allows None
Overriding SchedulerBase.select_jobs to return Optional[Sequence[JobSchedulerInterface]] may violate its declared return type. Please confirm the signature of SchedulerBase.select_jobs (in snakemake_interface_scheduler_plugins.base) permits None or adjust the interface/override to stay type-safe.

Comment on lines +37 to +39
def __post_init__(self) -> None:
self._technical_failure = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Potential AttributeError: post_init won’t be called unless Scheduler is a dataclass.

Scheduler isn’t a dataclass, so __post_init__ might never run; accessing self._technical_failure later can crash. Initialize the flag defensively where it’s first used.

Apply this diff near the early-return check:

@@
-    ) -> Optional[Sequence[JobSchedulerInterface]]:
-        if self._technical_failure:
+    ) -> Optional[Sequence[JobSchedulerInterface]]:
+        # Ensure flag exists even if __post_init__ is not invoked by the base class.
+        if not hasattr(self, "_technical_failure"):
+            self._technical_failure = False
+        if self._technical_failure:
             # fallback early since we failed before already
             return None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def __post_init__(self) -> None:
self._technical_failure = False
) -> Optional[Sequence[JobSchedulerInterface]]:
# Ensure flag exists even if __post_init__ is not invoked by the base class.
if not hasattr(self, "_technical_failure"):
self._technical_failure = False
if self._technical_failure:
# fallback early since we failed before already
return None
🤖 Prompt for AI Agents
In src/snakemake/scheduling/milp.py around lines 37 to 39, __post_init__ is
defined but Scheduler is not a dataclass so that method may never run and
self._technical_failure can be missing; initialize the flag defensively by
setting self._technical_failure = False in the Scheduler.__init__ (or
immediately before the early-return check referenced in the review) so the
attribute always exists before any access, ensuring subsequent code never raises
AttributeError.

Comment on lines +158 to +167
try:
status = self._solve_ilp(prob, time_limit=10)
except PulpSolverError as e:
self._technical_failure = True
self.logger.warning(
"Failed to solve scheduling problem with ILP solver, falling back to "
"greedy scheduler. You likely have to fix your ILP solver "
f"installation. Error message: {e}"
)
return None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Broaden error handling to guarantee fallback on all ILP failures (not only PulpSolverError).

Solver invocation can raise other exceptions (e.g., ImportError/OSError when binaries are missing). Broaden the catch to ensure we always fall back to greedy and set the sticky failure flag.

Apply:

-        try:
-            status = self._solve_ilp(prob, time_limit=10)
+        time_limit = 10
+        try:
+            status = self._solve_ilp(prob, time_limit=time_limit)
         except PulpSolverError as e:
             self._technical_failure = True
             self.logger.warning(
                 "Failed to solve scheduling problem with ILP solver, falling back to "
                 "greedy scheduler. You likely have to fix your ILP solver "
                 f"installation. Error message: {e}"
             )
             return None
+        except Exception as e:
+            self._technical_failure = True
+            self.logger.warning(
+                "Unexpected error from ILP solver, falling back to greedy scheduler. "
+                f"Error: {type(e).__name__}: {e}"
+            )
+            return None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
status = self._solve_ilp(prob, time_limit=10)
except PulpSolverError as e:
self._technical_failure = True
self.logger.warning(
"Failed to solve scheduling problem with ILP solver, falling back to "
"greedy scheduler. You likely have to fix your ILP solver "
f"installation. Error message: {e}"
)
return None
time_limit = 10
try:
status = self._solve_ilp(prob, time_limit=time_limit)
except PulpSolverError as e:
self._technical_failure = True
self.logger.warning(
"Failed to solve scheduling problem with ILP solver, falling back to "
"greedy scheduler. You likely have to fix your ILP solver "
f"installation. Error message: {e}"
)
return None
except Exception as e:
self._technical_failure = True
self.logger.warning(
"Unexpected error from ILP solver, falling back to greedy scheduler. "
f"Error: {type(e).__name__}: {e}"
)
return None
🤖 Prompt for AI Agents
In src/snakemake/scheduling/milp.py around lines 158 to 167, broaden the
exception handler around the ILP solver call so all failures (not only
PulpSolverError) trigger the fallback: catch a general Exception (e.g., except
Exception as e), set self._technical_failure = True, log the same warning
message including the caught exception details, and return None; keep the
existing message/context but ensure it runs for ImportError/OSError/other
runtime errors so the greedy scheduler is always used on any solver failure.

@johanneskoester johanneskoester merged commit 62f13b8 into main Sep 9, 2025
81 of 82 checks passed
@johanneskoester johanneskoester deleted the fix/scheduler-fallback branch September 9, 2025 12:02
johanneskoester pushed a commit that referenced this pull request Sep 9, 2025
🤖 I have created a release *beep* *boop*
---


##
[9.11.2](v9.11.1...v9.11.2)
(2025-09-09)


### Bug Fixes

* properly fall back to greedy scheduler in case of pulp/cbc errors
([#3733](#3733))
([62f13b8](62f13b8))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
kjohnsen pushed a commit to kjohnsen/snakemake that referenced this pull request Dec 15, 2025
snakemake#3733)

fixes snakemake#3719 

### QC
<!-- Make sure that you can tick the boxes below. -->

* [x] The PR contains a test case for the changes or the changes are
already covered by an existing test case.
* [x] The documentation (`docs/`) is updated to reflect the changes or
this is not necessary (e.g. if the change does neither modify the
language nor the behavior or functionalities of Snakemake).


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- New Features
- Automatic fallback to a greedy scheduler when the ILP solver fails or
times out, ensuring continued progress.
- More informative logging that clearly indicates when and why a
fallback occurs.

- Bug Fixes
- Improved robustness by gracefully handling solver errors and timeouts,
preventing crashes or stalls during scheduling.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kjohnsen pushed a commit to kjohnsen/snakemake that referenced this pull request Dec 15, 2025
🤖 I have created a release *beep* *boop*
---


##
[9.11.2](snakemake/snakemake@v9.11.1...v9.11.2)
(2025-09-09)


### Bug Fixes

* properly fall back to greedy scheduler in case of pulp/cbc errors
([snakemake#3733](snakemake#3733))
([62f13b8](snakemake@62f13b8))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
@coderabbitai coderabbitai Bot mentioned this pull request Jan 5, 2026
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

New "--scheduler" default value breaks my pipeline

1 participant