Snakemake version
8.16
Describe the bug
There are many cluster environments where you are not allowed to run snakemake from login nodes, so you have to run snakemake in a SLURM job context, i.e. submit a SLURM job to a compute node, which then runs your snakemake workflow. The NIH HPC is one of those environments (and we are a big user of Snakemake!)
When you have to run snakemake in a SLURM job context, setting remote-job-local-storage-prefix to e.g. '/lscratch/\$SLURM_JOB_ID' or '/lscratch/$SLURM_JOB_ID' doesn't work properly: it incorrectly uses the $SLURM_JOB_ID of the parent snakemake job, not the $SLURM_JOB_ID of the submitted remote job.
I also thought escaping with two backslashes, /lscratch/\\$SLURM_JOB_ID, would work, as that would make intuitive sense, but that doesn't work either.
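For illustration, here is a minimal shell sketch of the expansion-timing problem (this is not Snakemake code; the job IDs are the hypothetical parent/child IDs from the logs below). If the prefix is expanded in the parent job's shell before submission, the parent's $SLURM_JOB_ID leaks into the path; only a prefix kept literal until the child job evaluates it would resolve to the child's own ID:

```shell
#!/usr/bin/env bash
# Simulate the parent snakemake job's environment (hypothetical ID).
SLURM_JOB_ID=34562833

# Double quotes: the variable is expanded NOW, in the parent's shell,
# so the submitted job would inherit the parent's ID baked into the path.
prefix_expanded="/lscratch/$SLURM_JOB_ID"
echo "$prefix_expanded"    # -> /lscratch/34562833 (wrong: parent's ID)

# Single quotes: the variable stays literal, so a child job that later
# evaluates this string would substitute its OWN SLURM_JOB_ID.
prefix_literal='/lscratch/$SLURM_JOB_ID'
echo "$prefix_literal"     # -> /lscratch/$SLURM_JOB_ID (expanded later)
```

The bug report suggests the prefix is being expanded at the parent stage (as in the first case) even when the user escapes the dollar sign to keep it literal.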
Minimal example
From within a SLURM job context run snakemake with the following:
snakemake --remote-job-local-storage-prefix '/lscratch/\$SLURM_JOB_ID'
rule all:
    input:
        "local.out"

rule a:
    output:
        storage.fs("remote.txt"),
    shell:
        "echo '{output}' > {output}"

rule b:
    input:
        storage.fs("remote.txt"),
    output:
        "local.out",
    shell:
        "echo '{input}' > {output}"
Building DAG of jobs...
You are running snakemake in a SLURM job context. This is not recommended, as it may lead to unexpected behavior. Please run Snakemake directly on the login node.
SLURM run ID: 00b8696d-9685-42a9-8b4b-ae6710a54755
Using shell: /usr/bin/bash
Provided remote nodes: 9223372036854775807
Job stats:
job count
----- -------
a 1
all 1
b 1
total 3
Select jobs to execute...
Execute 1 jobs...
[Mon Aug 26 09:22:16 2024]
rule a:
output: remote.txt (send to storage)
jobid: 2
reason: Missing output files: remote.txt (send to storage)
resources: mem_mb=3815, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, slurm_partition=quick, cpus_per_task=2, mem=4G, runtime=5, slurm_extra=--gres=lscratch:100
echo '.snakemake/storage/fs/remote.txt' > .snakemake/storage/fs/remote.txt
No SLURM account given, trying to guess.
Guessed SLURM account: ruppinen
Job 2 has been submitted with SLURM jobid 34563252 (log: /gpfs/gsfs12/users/hermidalc/work/snakemake/.snakemake/slurm_logs/rule_a/34563252.log).
[Mon Aug 26 09:23:11 2024]
Error in rule a:
message: SLURM-job '34563252' failed, SLURM status is: 'FAILED'. For further error details see the cluster/cloud log and the log files of the involved rule(s).
jobid: 2
output: remote.txt (send to storage)
log: /gpfs/gsfs12/users/hermidalc/work/snakemake/.snakemake/slurm_logs/rule_a/34563252.log (check log file(s) for error details)
shell:
echo '.snakemake/storage/fs/remote.txt' > .snakemake/storage/fs/remote.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
external_jobid: 34563252
Shutting down, this might take some time.
/gpfs/gsfs12/users/hermidalc/work/snakemake/.snakemake/slurm_logs/rule_a/34563252.log shows it's using the wrong $SLURM_JOB_ID:
WorkflowError in file /gpfs/gsfs12/users/hermidalc/work/snakemake/Snakefile, line 8:
Failed to create local storage prefix /lscratch/34562833/fs
PermissionError: [Errno 13] Permission denied: '/lscratch/34562833'