snakemake main job hangs indefinitely; no new jobs submitted on slurm in newer versions #759
Description
Snakemake version
Version ≥5.26.0 (and possibly other newer 5.x versions)
Describe the bug
I am running a large snakemake pipeline with ~90k steps/jobs on a slurm cluster, using the following submission command:
snakemake --snakefile Snakefile -j 50 --use-conda --keep-target-files --keep-going --rerun-incomplete --latency-wait 30 --cluster "sbatch -A keblevin --mem=32768 -n 1 -c 8 -t 3:00:00 -e /scratch/keblevin/11_19_20_megapipe_MTBC_comp_data/slurm_outputs/slurm.%j.err -o /scratch/keblevin/11_19_20_megapipe_MTBC_comp_data/slurm_outputs/slurm.%j.out"
New jobs are submitted more and more slowly: after 30 minutes to an hour, only 30 jobs remain in the queue, then 20, then 10, and so on. After roughly 1-2 hours (800-1500 completed jobs), new jobs are no longer submitted at all and the main job hangs indefinitely. The slurm.err log contains no errors and confirms that the main job is simply hanging without submitting new work. Below are the last 25 lines of one slurm.err, which remained unchanged until the job was manually cancelled; in this run the main job sat overnight (~8 hours) without submitting a single new job before I cancelled it the following morning.
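One common diagnostic for this class of stall (not from the original report) is to let snakemake query slurm directly for job state via its `--cluster-status` option instead of relying on file-based polling. A minimal sketch of such a status script, assuming `sacct` is available on the submit host; the script name and the exact state strings handled are assumptions to adapt to your site:

```python
#!/usr/bin/env python3
# Hypothetical slurm-status.py sketch: snakemake invokes it as
#   slurm-status.py <external-jobid>
# and expects one of "running", "success", or "failed" on stdout.
import subprocess
import sys


def classify(state: str) -> str:
    """Map an sacct State string to what --cluster-status expects."""
    # sacct may append extra words, e.g. "CANCELLED by 1234".
    state = state.split()[0] if state else ""
    if state == "COMPLETED":
        return "success"
    if state in {"FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL"}:
        return "failed"
    # PENDING, RUNNING, COMPLETING, empty output, etc.
    return "running"


def main(jobid: str) -> None:
    # Output format assumed; adjust flags for your site's slurm configuration.
    out = subprocess.run(
        ["sacct", "-j", jobid, "-o", "State", "--noheader", "-X"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(classify(out))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

It would be passed alongside the existing command, e.g. `--cluster-status ./slurm-status.py`.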
Logs
Removing temporary output file output/New_Zealand_BLENHEIM_2000_3990/New_Zealand_BLENHEIM_2000_3990.sam.
[Tue Nov 17 23:32:05 2020]
Finished job 42963.
1080 of 86656 steps (1%) done
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.1.sai.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.2.sai.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.1.trimmed.fq.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.2.trimmed.fq.
[Tue Nov 17 23:35:55 2020]
Finished job 30728.
1081 of 86656 steps (1%) done
[Tue Nov 17 23:36:03 2020]
rule sam_to_bam:
input: output/Germany_2012_2110/Germany_2012_2110.sam
output: output/Germany_2012_2110/Germany_2012_2110.bam
jobid: 30727
wildcards: sample=Germany_2012_2110
Submitted job 30727 with external jobid 'Submitted batch job 6018753'.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110.sam.
[Tue Nov 17 23:41:39 2020]
Finished job 30727.
1082 of 86656 steps (1%) done
slurmstepd: error: *** JOB 6014276 ON cg17-3 CANCELLED AT 2020-11-18T07:56:39 ***
Minimal example
A minimal example to reproduce this would be any workflow that submits thousands of jobs, running over several hours, to a slurm system.
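A sketch of such a Snakefile, with hypothetical rule and file names (the job count and sleep duration are illustrative, not from the original pipeline), submitted with a `--cluster "sbatch ..."` invocation like the one above:

```python
# Hypothetical minimal Snakefile: thousands of short, independent jobs.
N = 5000

rule all:
    input:
        expand("out/{i}.done", i=range(N))

rule touch_one:
    output:
        "out/{i}.done"
    shell:
        "sleep 60 && touch {output}"
```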
I was initially using v5.28.0, then tried downgrading after reading #724. I downgraded incrementally to 5.26 and kept encountering a stalled main job, so I then jumped down to 5.3.0. After that downgrade, snakemake maintained the expected number of jobs in the queue until the pipeline completed.
I have no idea why this snakemake-slurm timeout/miscommunication would be happening. I couldn't find a similar issue out there, and I thought others should be aware.
Thanks!
Kelly