Commit e10feef
feat: subsample jobs to speed-up scheduler (#3112)
I am running a workflow with ~700k jobs and, at any given time, around
230k jobs are ready to run. The initial building of the DAG is quite
slow (~2h, but I'll leave that for another PR 😄), but the main issue
is that the scheduler takes a long time to decide which jobs to submit
next.
In my case, all jobs are quite fast and similar in terms of resources,
so the cluster sits idle most of the time while it waits for the
scheduler. The greedy scheduler is considerably faster than the ILP, but
still too slow.
The ILP should switch to the greedy scheduler after 10s, but it
sometimes ignores the timeout (coin-or/Cbc#487), and it has been
reported to be quite slow when instantiating large problems
(coin-or/pulp#749). In my case, the ILP runs for 60s (the pulp file is
100 MB) before switching to greedy. Apart from that, and especially on
slow file systems, the scheduler can still be quite slow when checking
all temp and input files.
Here, I propose sampling the ready jobs, so that the scheduler evaluates
only a subset of them instead of all ready jobs. In my tests, this
greatly reduces the scheduler time:
| Ready jobs evaluated | ILP | greedy |
|---|---|---|
| Native (all ready jobs) | 15 - 20 min | 30 s - 1 min |
| Sampling 1000 jobs | | 1 - 2 s |
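
The sampling idea can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `subsample_ready_jobs`, its parameters, and the choice of uniform sampling via `random.sample` are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def subsample_ready_jobs(
    ready_jobs: Iterable[T],
    subsample: Optional[int],
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Return at most `subsample` ready jobs, chosen uniformly at random.

    Hypothetical helper: if `subsample` is None, or there are no more
    ready jobs than the limit, all ready jobs are returned unchanged.
    """
    jobs = list(ready_jobs)  # random.sample needs a sequence, not a set
    if subsample is None or len(jobs) <= subsample:
        return jobs
    if rng is None:
        rng = random.Random()
    # Sampling without replacement, so no job is picked twice.
    return rng.sample(jobs, subsample)


# The scheduler (ILP or greedy) would then only see the sampled jobs.
candidates = subsample_ready_jobs((f"job{i}" for i in range(230_000)), 1000)
```

Jobs that are not sampled in one round stay ready and can be picked in a later round, so in this sketch the limit only bounds the work done per scheduling step.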
### QC
* [x] The PR contains a test case for the changes or the changes are
already covered by an existing test case.
* [ ] The documentation (`docs/`) is updated to reflect the changes or
this is not necessary (e.g. if the change does neither modify the
language nor the behavior or functionalities of Snakemake).
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
  - Introduced a new argument `--scheduler-subsample` to optimize job
    scheduling by limiting the number of jobs considered for execution.
  - Added a method for inferring resource requirements, enhancing user
    experience with better error handling.
  - Updated settings to include a new attribute for job subsampling,
    improving scheduling flexibility.
- **Bug Fixes**
  - Improved error handling and logging for resource evaluation and
    parsing, providing clearer guidance for users.
  - Enhanced job selection process with a subsampling mechanism to
    optimize scheduling efficiency.
- **Refactor**
  - Enhanced structure and organization of job scheduling logic for better
    integration with existing functionality.
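
As a rough illustration of how a `--scheduler-subsample` argument could feed a settings attribute, here is a sketch using `argparse` and a dataclass. Only the flag name comes from the summary above; the `SchedulingSettings` class, its `subsample` field, the `positive_int` validator, and `parse_scheduling_args` are hypothetical and do not claim to match Snakemake's actual code.

```python
import argparse
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchedulingSettings:
    """Hypothetical settings container for this example only."""

    subsample: Optional[int] = None  # None means: consider all ready jobs


def positive_int(value: str) -> int:
    """argparse type that accepts only integers >= 1."""
    number = int(value)  # a ValueError here is reported by argparse
    if number < 1:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def parse_scheduling_args(argv: List[str]) -> SchedulingSettings:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--scheduler-subsample",
        type=positive_int,
        default=None,
        metavar="N",
        help="Only let the scheduler consider N randomly chosen ready jobs.",
    )
    args = parser.parse_args(argv)
    # argparse maps --scheduler-subsample to args.scheduler_subsample.
    return SchedulingSettings(subsample=args.scheduler_subsample)


settings = parse_scheduling_args(["--scheduler-subsample", "1000"])
```

Leaving the default at `None` keeps the existing behavior unless the user opts in, which is one plausible way to keep such a change backward compatible.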
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

1 parent 9504bf4 commit e10feef

4 files changed, +33 -6 lines changed