Skip to content

perf: retrieve storage inputs immediately before scheduling jobs instead of before running the entire workflow#3850

Merged
johanneskoester merged 8 commits intomainfrom
perf/lazy-storage
Nov 27, 2025
Merged

perf: retrieve storage inputs immediately before scheduling jobs instead of before running the entire workflow#3850
johanneskoester merged 8 commits intomainfrom
perf/lazy-storage

Conversation

@johanneskoester
Copy link
Copy Markdown
Contributor

@johanneskoester johanneskoester commented Nov 27, 2025

This saves local disk space. After each job is finished, Snakemake already ensures that local storage copies are deleted once not needed anymore. Hence, in combination with this PR, the lifetime of local copies on disk will be minimized.

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

Summary by CodeRabbit

  • Refactor

    • Simplified and unified storage-input retrieval flow, changing when preflight retrieval runs and clarifying external-input determination.
  • New Features

    • Asynchronous prefetching of storage inputs for remote jobs when local copies are configured.
  • Style

    • Consistent file I/O formatting and logging output across job reporting.

✏️ Tip: You can customize this high-level summary in your review settings.

@johanneskoester johanneskoester changed the title Retrieve storage inputs immediately before scheduling jobs. perf: retrieve storage inputs immediately before scheduling jobs. Nov 27, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Nov 27, 2025

Warning

Rate limit exceeded

@johanneskoester has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 17 minutes and 14 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between c099f0f and 9975fba.

📒 Files selected for processing (1)
  • tests/common.py (1 hunks)
📝 Walkthrough

Walkthrough

Added a trivial utility func_true; changed when and how storage inputs are prefetched across DAG, scheduler, and workflow; updated is_external_input consideration logic; and replaced the boolean IO-format flag with explicit as_input/as_output parameters in job formatting.

Changes

Cohort / File(s) Summary
Common utility
src/snakemake/common/__init__.py
Added def func_true(job) that always returns True.
DAG logic
src/snakemake/dag.py
Imported func_true and is_local_file. Made jobs required in retrieve_storage_inputs. Simplified output iteration; removed post-update automatic storage retrieval. is_external_input now defaults consider to func_true, switching to self.needrun when not_needrun_is_external is True; dependency filtering uses consider(dep).
Job scheduler
src/snakemake/scheduling/job_scheduler.py
Added SharedFSUsage import. Local-job storage retrieval now occurs when not a dry run and not a subprocess. For remote jobs, prefetch storage inputs asynchronously when not a dry run and SharedFSUsage indicates local copies; guarded by main-process assertion.
Workflow flow
src/snakemake/workflow.py
Removed pre-flight asynchronous call to dag.retrieve_storage_inputs() before deployment/source decisions; logic changed around when storage inputs are retrieved.
Jobs IO formatting
src/snakemake/jobs.py
Changed format_files signature to format_files(io, as_input: bool = False, as_output: bool = False) and updated call sites to use explicit as_input/as_output flags; removed old boolean-flag contract.

Sequence Diagram(s)

sequenceDiagram
  participant Workflow
  participant JobScheduler
  participant DAG
  participant Storage

  Workflow->>JobScheduler: schedule jobs / decide execution mode
  alt Local job & not dry run & not subprocess
    JobScheduler->>DAG: retrieve_storage_inputs(jobs)
    DAG->>Storage: fetch inputs (possibly async)
  else Remote job & not dry run & SharedFSUsage == LOCAL_COPIES
    JobScheduler->>DAG: async retrieve_storage_inputs(jobs)
    DAG->>Storage: prefetch inputs to local copies
  else Other cases
    JobScheduler->>DAG: no prefetch (defer to job execution)
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

  • Focus review on:
    • is_external_input changes: ensure callers handle the new consider predicate semantics.
    • retrieve_storage_inputs signature: verify all call sites pass jobs.
    • Async prefetch path in scheduler: correctness of SharedFSUsage check and main-process assertion.
    • format_files signature change: verify all formatting/logging call sites updated.

Possibly related PRs

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.05% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: moving storage input retrieval from before workflow execution to immediately before job scheduling for performance optimization.
Description check ✅ Passed The description explains the purpose (minimizing disk space usage), confirms QC requirements are met, but lacks detail on specific changes made to files and functions.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@johanneskoester johanneskoester changed the title perf: retrieve storage inputs immediately before scheduling jobs. perf: retrieve storage inputs immediately before scheduling jobs Nov 27, 2025
@johanneskoester johanneskoester changed the title perf: retrieve storage inputs immediately before scheduling jobs perf: retrieve storage inputs immediately before scheduling jobs instead of before running the entire workflow Nov 27, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5caad70 and 09d0a73.

📒 Files selected for processing (4)
  • src/snakemake/common/__init__.py (1 hunks)
  • src/snakemake/dag.py (4 hunks)
  • src/snakemake/scheduling/job_scheduler.py (3 hunks)
  • src/snakemake/workflow.py (0 hunks)
💤 Files with no reviewable changes (1)
  • src/snakemake/workflow.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • src/snakemake/common/__init__.py
  • src/snakemake/scheduling/job_scheduler.py
  • src/snakemake/dag.py
🧠 Learnings (3)
📚 Learning: 2024-10-04T16:12:18.927Z
Learnt from: lczech
Repo: snakemake/snakemake PR: 3113
File: snakemake/scheduler.py:912-914
Timestamp: 2024-10-04T16:12:18.927Z
Learning: In `snakemake/scheduler.py`, avoid suggesting the use of `asyncio.gather` in the `jobs_rewards` method due to overhead concerns and the need for immediate results.

Applied to files:

  • src/snakemake/scheduling/job_scheduler.py
📚 Learning: 2025-07-29T14:53:04.598Z
Learnt from: johanneskoester
Repo: snakemake/snakemake PR: 3676
File: src/snakemake/cli.py:0-0
Timestamp: 2025-07-29T14:53:04.598Z
Learning: In the Snakemake repository, deprecated scheduler interfaces (like scheduler_ilp_solver, --scheduler-solver-path, --scheduler-greediness) should be kept for backward compatibility during the migration to the plugin-based scheduler system, rather than being removed immediately.

Applied to files:

  • src/snakemake/scheduling/job_scheduler.py
📚 Learning: 2024-10-14T09:42:11.571Z
Learnt from: johanneskoester
Repo: snakemake/snakemake PR: 3140
File: snakemake/dag.py:1308-1308
Timestamp: 2024-10-14T09:42:11.571Z
Learning: In `snakemake/dag.py`, avoid flagging missing lines or indentation issues when there is no clear syntax or logical error to prevent false positives.

Applied to files:

  • src/snakemake/dag.py
🧬 Code graph analysis (2)
src/snakemake/scheduling/job_scheduler.py (4)
src/snakemake/settings/types.py (1)
  • MaxJobsPerTimespan (63-87)
src/snakemake/workflow.py (3)
  • dryrun (419-423)
  • is_main_process (273-274)
  • dag (520-521)
src/snakemake/common/__init__.py (1)
  • async_run (96-113)
src/snakemake/dag.py (2)
  • retrieve_storage_inputs (407-449)
  • jobs (605-607)
src/snakemake/dag.py (1)
src/snakemake/common/__init__.py (1)
  • func_true (51-52)
🪛 Ruff (0.14.6)
src/snakemake/common/__init__.py

51-51: Unused function argument: job

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (46)
  • GitHub Check: tests (10, ubuntu-latest, py312)
  • GitHub Check: tests (10, ubuntu-latest, py311)
  • GitHub Check: tests (10, macos-latest, py313)
  • GitHub Check: tests (10, ubuntu-latest, py313)
  • GitHub Check: tests (9, ubuntu-latest, py313)
  • GitHub Check: tests (9, windows-2022, py313)
  • GitHub Check: tests (10, windows-2022, py313)
  • GitHub Check: tests (9, ubuntu-latest, py311)
  • GitHub Check: tests (8, ubuntu-latest, py312)
  • GitHub Check: tests (9, ubuntu-latest, py312)
  • GitHub Check: tests (8, ubuntu-latest, py311)
  • GitHub Check: tests (8, ubuntu-latest, py313)
  • GitHub Check: tests (7, ubuntu-latest, py313)
  • GitHub Check: tests (5, windows-2022, py313)
  • GitHub Check: tests (8, windows-2022, py313)
  • GitHub Check: tests (6, windows-2022, py313)
  • GitHub Check: tests (7, ubuntu-latest, py312)
  • GitHub Check: tests (7, windows-2022, py313)
  • GitHub Check: tests (6, ubuntu-latest, py312)
  • GitHub Check: tests (6, ubuntu-latest, py311)
  • GitHub Check: tests (3, macos-latest, py313)
  • GitHub Check: tests (7, ubuntu-latest, py311)
  • GitHub Check: tests (4, ubuntu-latest, py312)
  • GitHub Check: tests (5, ubuntu-latest, py313)
  • GitHub Check: tests (6, ubuntu-latest, py313)
  • GitHub Check: tests (4, macos-latest, py313)
  • GitHub Check: tests (4, windows-2022, py313)
  • GitHub Check: tests (5, ubuntu-latest, py312)
  • GitHub Check: tests (2, windows-2022, py313)
  • GitHub Check: tests (2, ubuntu-latest, py311)
  • GitHub Check: tests (3, windows-2022, py313)
  • GitHub Check: tests (4, ubuntu-latest, py313)
  • GitHub Check: tests (3, ubuntu-latest, py311)
  • GitHub Check: tests (3, ubuntu-latest, py312)
  • GitHub Check: tests (5, ubuntu-latest, py311)
  • GitHub Check: tests (2, macos-latest, py313)
  • GitHub Check: tests (3, ubuntu-latest, py313)
  • GitHub Check: tests (4, ubuntu-latest, py311)
  • GitHub Check: tests (1, windows-2022, py313)
  • GitHub Check: tests (1, macos-latest, py313)
  • GitHub Check: tests (1, ubuntu-latest, py312)
  • GitHub Check: tests (2, ubuntu-latest, py312)
  • GitHub Check: tests (2, ubuntu-latest, py313)
  • GitHub Check: tests (1, ubuntu-latest, py311)
  • GitHub Check: tests (1, ubuntu-latest, py313)
  • GitHub Check: apidocs
🔇 Additional comments (6)
src/snakemake/scheduling/job_scheduler.py (1)

34-35: Storage-input retrieval for local and remote jobs is consistent with shared-FS semantics

The revised scheduling logic looks sound:

  • Local jobs: when not a dry run, storage inputs are prefetched via dag.retrieve_storage_inputs(..., also_missing_internal=True) before invoking self.run, which keeps behavior simple and predictable.
  • Remote jobs: guarded by assert self.workflow.is_main_process and SharedFSUsage.STORAGE_LOCAL_COPIES in self.workflow.storage_settings.shared_fs_usage, so prefetching happens only when local copies are actually shared; otherwise remote jobs download inputs themselves.

This is consistent with how SharedFSUsage.STORAGE_LOCAL_COPIES is used elsewhere (e.g. in DAG.cleanup_storage_objects) and doesn’t introduce obvious correctness or performance regressions.

Also applies to: 339-378

src/snakemake/dag.py (5)

38-44: Reusing func_true and is_local_file from snakemake.common looks good

Importing func_true for is_external_input and is_local_file for get_sources centralizes these helpers in snakemake.common and keeps behavior consistent across modules. No issues spotted with these imports or their usages.


163-165: Storage-input job bookkeeping (_storage_input_jobs) is coherent with cleanup logic

The combination of:

  • Initializing self._storage_input_jobs = defaultdict(list) in __init__,
  • Populating it in update_storage_inputs() from needrun_jobs(), and
  • Using it in handle_storage() to remove local storage copies only after all jobs that use a given storage input have finished,

is internally consistent and respects the “don’t drop inputs while other jobs still need them” invariant.

The call to update_storage_inputs() from postprocess() ensures this mapping is refreshed after DAG updates, so the cleanup decision in handle_storage() remains accurate throughout the run.

Also applies to: 451-457, 1070-1093


3069-3083: is_external_input’s predicate refactor matches the documented behavior

The refactored implementation:

  • Uses consider = func_true by default, so a file is non-external if any dependency produces it, regardless of its needrun status — matching “may not be created by any other job to be considered external.”
  • Switches to consider = self.needrun when not_needrun_is_external=True, so only producers that are still needrun are treated as internal; files produced solely by non-needrun jobs are treated as external, as described in the docstring.

This aligns both default usage (e.g. in archive()) and the special-case behavior used by retrieve_storage_inputs.


1907-1913: Calling update_storage_inputs() during postprocess keeps storage cleanup decisions up to date

Including self.update_storage_inputs() at the end of postprocess ensures _storage_input_jobs is recomputed after each DAG update (e.g. from checkpoints or queue-input updates) before new jobs start to run. That way, handle_storage() always bases its “all other jobs finished” check on the current needrun set rather than a stale snapshot.

This is the right place in the lifecycle to refresh that mapping; no issues here.


407-424: All call sites of retrieve_storage_inputs() pass the jobs argument explicitly—verification confirms no breaking changes.

Both call sites in src/snakemake/scheduling/job_scheduler.py (lines 352 and 374) already pass jobs=... as required by the updated signature. No legacy calls without arguments exist in the codebase.

Comment on lines +51 to +52
def func_true(job):
return True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Suppress Ruff ARG001 by marking the job argument intentionally unused

func_true needs to accept a job for signature compatibility but does not use it, which triggers Ruff ARG001. Renaming the parameter makes the intent explicit and should silence the warning:

-def func_true(job):
-    return True
+def func_true(_job):
+    return True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def func_true(job):
return True
def func_true(_job):
return True
🧰 Tools
🪛 Ruff (0.14.6)

51-51: Unused function argument: job

(ARG001)

🤖 Prompt for AI Agents
In src/snakemake/common/__init__.py around lines 51 to 52, func_true currently
takes a parameter named job but does not use it, triggering Ruff ARG001; rename
the parameter to _job (or _ ) to signal it is intentionally unused while
preserving the function signature, and update any local references or callers if
they rely on the parameter name so the change is safe.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/snakemake/scheduling/job_scheduler.py (1)

34-35: Scheduler-side storage prefetching wiring looks sound; also_missing_internal usage could be simplified

The integration here is coherent:

  • Local jobs: prefetching storage inputs only when not in a dry run and not in subprocess_exec avoids unnecessary IO in dry-run and subprocess contexts while still ensuring local jobs see their inputs.
  • Remote jobs: guarding retrieval with SharedFSUsage.STORAGE_LOCAL_COPIES in shared_fs_usage and asserting execution in the main process lines up with the intent that only a shared FS should host pre-fetched local copies; otherwise, remote workers continue to fetch from storage themselves.

One minor semantic nit: retrieve_storage_inputs() is always called with also_missing_internal=True, which means it will prefetch all storage-backed inputs for the scheduled jobs, and the new is_external_input(..., not_needrun_is_external=True) branch is effectively unused. If the long‑term intent is to distinguish “true external” inputs from internal ones for prefetching, consider either:

  • Passing also_missing_internal=False where you only want external inputs, or
  • Dropping/simplifying the also_missing_internal parameter if you never plan to call it with False.

This is about API clarity and avoiding dead branching rather than a functional bug.

Also applies to: 349-355, 362-377

src/snakemake/dag.py (1)

38-44: Batch-scoped storage-input retrieval and is_external_input semantics look consistent; note change to is_ondemand_eligible heuristic

  • The added func_true import and its use in is_external_input() give a clear default (“consider all deps”) while allowing the not_needrun_is_external flag to narrow producers to needrun jobs only. The new docstring accurately reflects the behavior.

  • retrieve_storage_inputs(self, jobs, also_missing_internal=False) now:

    • Works on an explicit jobs subset rather than the whole DAG.
    • Uses is_external_input(f, job, not_needrun_is_external=True) only when also_missing_internal is False; otherwise it simply prefetches all storage inputs for the given jobs.
    • Computes file_access_patterns only over that jobs subset and sets f.storage_object.is_ondemand_eligible accordingly.

The last point slightly changes the meaning of the comment about “multiple jobs” accessing the same file: when the same storage file is consumed by jobs that end up in different scheduling rounds, no single call will ever see more than one consumer, so len(file_access_patterns) can remain 1 and the file will often be marked is_ondemand_eligible=True even though it is reused across multiple jobs.

Given is_ondemand_eligible is just a performance hint to storage plugins, this is harmless from a correctness perspective, but it does weaken the global heuristic. If you still want to approximate “multiple jobs across the whole DAG”, you could consider augmenting this hint with information from self._storage_input_jobs or another precomputed aggregate; otherwise, it may be worth updating the comment to reflect that the heuristic is now batch-local.

Overall, the refactor of retrieve_storage_inputs and is_external_input is internally consistent and aligns with the new scheduler-side usage.

Also applies to: 407-449, 3069-3083

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09d0a73 and 1643b14.

📒 Files selected for processing (2)
  • src/snakemake/dag.py (5 hunks)
  • src/snakemake/scheduling/job_scheduler.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • src/snakemake/scheduling/job_scheduler.py
  • src/snakemake/dag.py
🧠 Learnings (3)
📚 Learning: 2024-10-04T16:12:18.927Z
Learnt from: lczech
Repo: snakemake/snakemake PR: 3113
File: snakemake/scheduler.py:912-914
Timestamp: 2024-10-04T16:12:18.927Z
Learning: In `snakemake/scheduler.py`, avoid suggesting the use of `asyncio.gather` in the `jobs_rewards` method due to overhead concerns and the need for immediate results.

Applied to files:

  • src/snakemake/scheduling/job_scheduler.py
📚 Learning: 2025-07-29T14:53:04.598Z
Learnt from: johanneskoester
Repo: snakemake/snakemake PR: 3676
File: src/snakemake/cli.py:0-0
Timestamp: 2025-07-29T14:53:04.598Z
Learning: In the Snakemake repository, deprecated scheduler interfaces (like scheduler_ilp_solver, --scheduler-solver-path, --scheduler-greediness) should be kept for backward compatibility during the migration to the plugin-based scheduler system, rather than being removed immediately.

Applied to files:

  • src/snakemake/scheduling/job_scheduler.py
📚 Learning: 2024-10-14T09:42:11.571Z
Learnt from: johanneskoester
Repo: snakemake/snakemake PR: 3140
File: snakemake/dag.py:1308-1308
Timestamp: 2024-10-14T09:42:11.571Z
Learning: In `snakemake/dag.py`, avoid flagging missing lines or indentation issues when there is no clear syntax or logical error to prevent false positives.

Applied to files:

  • src/snakemake/dag.py
🧬 Code graph analysis (2)
src/snakemake/scheduling/job_scheduler.py (4)
src/snakemake/settings/types.py (1)
  • MaxJobsPerTimespan (63-87)
src/snakemake/workflow.py (4)
  • dryrun (419-423)
  • subprocess_exec (462-463)
  • is_main_process (273-274)
  • dag (520-521)
src/snakemake/common/__init__.py (1)
  • async_run (96-113)
src/snakemake/dag.py (2)
  • retrieve_storage_inputs (407-449)
  • jobs (605-607)
src/snakemake/dag.py (1)
src/snakemake/common/__init__.py (1)
  • func_true (51-52)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (45)
  • GitHub Check: tests (10, ubuntu-latest, py312)
  • GitHub Check: tests (10, windows-2022, py313)
  • GitHub Check: tests (9, ubuntu-latest, py311)
  • GitHub Check: tests (10, ubuntu-latest, py313)
  • GitHub Check: tests (9, ubuntu-latest, py312)
  • GitHub Check: tests (8, windows-2022, py313)
  • GitHub Check: tests (9, ubuntu-latest, py313)
  • GitHub Check: tests (9, windows-2022, py313)
  • GitHub Check: tests (8, ubuntu-latest, py312)
  • GitHub Check: tests (10, ubuntu-latest, py311)
  • GitHub Check: tests (7, ubuntu-latest, py311)
  • GitHub Check: tests (8, ubuntu-latest, py313)
  • GitHub Check: tests (7, ubuntu-latest, py312)
  • GitHub Check: tests (8, ubuntu-latest, py311)
  • GitHub Check: tests (7, windows-2022, py313)
  • GitHub Check: tests (6, ubuntu-latest, py312)
  • GitHub Check: tests (6, ubuntu-latest, py313)
  • GitHub Check: tests (8, macos-latest, py313)
  • GitHub Check: tests (6, ubuntu-latest, py311)
  • GitHub Check: tests (7, ubuntu-latest, py313)
  • GitHub Check: tests (6, macos-latest, py313)
  • GitHub Check: tests (4, ubuntu-latest, py313)
  • GitHub Check: tests (6, windows-2022, py313)
  • GitHub Check: tests (5, ubuntu-latest, py311)
  • GitHub Check: tests (4, windows-2022, py313)
  • GitHub Check: tests (5, windows-2022, py313)
  • GitHub Check: tests (5, ubuntu-latest, py313)
  • GitHub Check: tests (4, ubuntu-latest, py312)
  • GitHub Check: tests (5, ubuntu-latest, py312)
  • GitHub Check: tests (3, macos-latest, py313)
  • GitHub Check: tests (3, windows-2022, py313)
  • GitHub Check: tests (2, macos-latest, py313)
  • GitHub Check: tests (4, ubuntu-latest, py311)
  • GitHub Check: tests (2, ubuntu-latest, py311)
  • GitHub Check: tests (3, ubuntu-latest, py313)
  • GitHub Check: tests (2, windows-2022, py313)
  • GitHub Check: tests (3, ubuntu-latest, py311)
  • GitHub Check: tests (2, ubuntu-latest, py312)
  • GitHub Check: tests (3, ubuntu-latest, py312)
  • GitHub Check: tests (2, ubuntu-latest, py313)
  • GitHub Check: tests (1, windows-2022, py313)
  • GitHub Check: tests (1, macos-latest, py313)
  • GitHub Check: tests (1, ubuntu-latest, py313)
  • GitHub Check: tests (1, ubuntu-latest, py312)
  • GitHub Check: tests (1, ubuntu-latest, py311)
🔇 Additional comments (1)
src/snakemake/dag.py (1)

1075-1080: Avoiding cleanup of target storage outputs is a good safeguard

Extending unneeded_files() in handle_storage() to skip files that are in self.targetfiles before removing local storage copies tightens correctness wrt user expectations: final targets aren’t pruned locally just because no further jobs depend on them. This aligns well with typical “target files should remain available” semantics and doesn’t introduce any obvious edge cases given existing uses of self.targetfiles.

@johanneskoester johanneskoester merged commit 4ac6cda into main Nov 27, 2025
59 checks passed
@johanneskoester johanneskoester deleted the perf/lazy-storage branch November 27, 2025 17:01
johanneskoester pushed a commit that referenced this pull request Nov 27, 2025
🤖 I have created a release *beep* *boop*
---


##
[9.14.0](v9.13.7...v9.14.0)
(2025-11-27)


### Features

* Support Hy in script directive
([#3824](#3824))
([2329c9e](2329c9e))


### Bug Fixes

* Add support for pathlib in `notebook` field
([#3811](#3811))
([7b2180a](7b2180a))
* Addressed race condition in workdir_handler.py
([#3844](#3844))
([8dbfcfb](8dbfcfb))
* cleanup update-marked output files of failed jobs if there was no
backup to restore them
([#3843](#3843))
([41f1ce8](41f1ce8))
* correct Windows callable path handling
([#3832](#3832))
([5caad70](5caad70))
* expand env vars on resources
([#3823](#3823))
([fcfa1bc](fcfa1bc))
* fix backup for output marked by `update`
([#3839](#3839))
([09c64b7](09c64b7))
* Minor fixes/additions to logging module.
([#3802](#3802))
([3b3986d](3b3986d))
* mount local storage prefix into containers
([#3840](#3840))
([f1e8b62](f1e8b62))
* properly format input/output files in case of missing rule to produce
them ([#3849](#3849))
([69d5d24](69d5d24))
* Unpack AnnotatedString in _apply_wildcards
([#3798](#3798))
([7886508](7886508))


### Performance Improvements

* retrieve storage inputs immediately before scheduling jobs instead of
before running the entire workflow
([#3850](#3850))
([4ac6cda](4ac6cda))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
@jsquaredosquared
Copy link
Copy Markdown
Contributor

Hi @johanneskoester , I'm not sure why, but the following doesn't work for me on Snakemake 9.14.0, Slurm plugin 2.0.0:
snakemake --jobs 1 --executor slurm --default-resources slurm_partition=cpu-epyc,cpu-epyc-genoa

I get an AssertionError

See error
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Select jobs to execute...
Execute 1 jobs...
Traceback (most recent call last):

  File "[...]/.pixi/envs/default/lib/python3.13/site-packages/snakemake/cli.py", line 2193, in args_to_api
    dag_api.execute_workflow(
    ~~~~~~~~~~~~~~~~~~~~~~~~^
        executor=args.executor,
        ^^^^^^^^^^^^^^^^^^^^^^^
    ...<46 lines>...
        scheduler_settings=scheduler_settings,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^

  File "[...]/.pixi/envs/default/lib/python3.13/site-packages/snakemake/api.py", line 634, in execute_workflow
    workflow.execute(
    ~~~~~~~~~~~~~~~~^
        executor_plugin=executor_plugin,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
        updated_files=updated_files,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^

  File "[...]/.pixi/envs/default/lib/python3.13/site-packages/snakemake/workflow.py", line 1440, in execute
    raise e

  File "[...]/.pixi/envs/default/lib/python3.13/site-packages/snakemake/workflow.py", line 1436, in execute
    success = self.scheduler.schedule()

  File "[...]/.pixi/envs/default/lib/python3.13/site-packages/snakemake/scheduling/job_scheduler.py", line 400, in schedule
    raise e

  File "[...]/.pixi/envs/default/lib/python3.13/site-packages/snakemake/scheduling/job_scheduler.py", line 365, in schedule
    assert self.workflow.is_main_process
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AssertionError

The changes in this pull request may have something to do with it, so I thought I would ask for help here. Is there any way to fix this problem?

kjohnsen pushed a commit to kjohnsen/snakemake that referenced this pull request Dec 15, 2025
…ead of before running the entire workflow (snakemake#3850)

This saves local disk space. After each job is finished, Snakemake
already ensures that local storage copies are deleted once not needed
anymore. Hence, in combination with this PR, the lifetime of local
copies on disk will be minimized.

### QC
<!-- Make sure that you can tick the boxes below. -->

* [x] The PR contains a test case for the changes or the changes are
already covered by an existing test case.
* [x] The documentation (`docs/`) is updated to reflect the changes or
this is not necessary (e.g. if the change does neither modify the
language nor the behavior or functionalities of Snakemake).


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Simplified and unified storage-input retrieval flow, changing when
preflight retrieval runs and clarifying external-input determination.

* **New Features**
* Asynchronous prefetching of storage inputs for remote jobs when local
copies are configured.

* **Style**
* Consistent file I/O formatting and logging output across job
reporting.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kjohnsen pushed a commit to kjohnsen/snakemake that referenced this pull request Dec 15, 2025
🤖 I have created a release *beep* *boop*
---


##
[9.14.0](snakemake/snakemake@v9.13.7...v9.14.0)
(2025-11-27)


### Features

* Support Hy in script directive
([snakemake#3824](snakemake#3824))
([2329c9e](snakemake@2329c9e))


### Bug Fixes

* Add support for pathlib in `notebook` field
([snakemake#3811](snakemake#3811))
([7b2180a](snakemake@7b2180a))
* Addressed race condition in workdir_handler.py
([snakemake#3844](snakemake#3844))
([8dbfcfb](snakemake@8dbfcfb))
* cleanup update-marked output files of failed jobs if there was no
backup to restore them
([snakemake#3843](snakemake#3843))
([41f1ce8](snakemake@41f1ce8))
* correct Windows callable path handling
([snakemake#3832](snakemake#3832))
([5caad70](snakemake@5caad70))
* expand env vars on resources
([snakemake#3823](snakemake#3823))
([fcfa1bc](snakemake@fcfa1bc))
* fix backup for output marked by `update`
([snakemake#3839](snakemake#3839))
([09c64b7](snakemake@09c64b7))
* Minor fixes/additions to logging module.
([snakemake#3802](snakemake#3802))
([3b3986d](snakemake@3b3986d))
* mount local storage prefix into containers
([snakemake#3840](snakemake#3840))
([f1e8b62](snakemake@f1e8b62))
* properly format input/output files in case of missing rule to produce
them ([snakemake#3849](snakemake#3849))
([69d5d24](snakemake@69d5d24))
* Unpack AnnotatedString in _apply_wildcards
([snakemake#3798](snakemake#3798))
([7886508](snakemake@7886508))


### Performance Improvements

* retrieve storage inputs immediately before scheduling jobs instead of
before running the entire workflow
([snakemake#3850](snakemake#3850))
([4ac6cda](snakemake@4ac6cda))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants