$SLURM_JOB_ID not set correctly in remote-job-local-storage-prefix when running snakemake in a SLURM job context (which many cluster environments require) #3049

@hermidalc

Description

Snakemake version

8.16

Describe the bug

There are many cluster environments where you are not allowed to run snakemake from login nodes, and you have to run snakemake in a SLURM job context, i.e. submit a SLURM job to a compute node which then runs your snakemake workflow. The NIH HPC is one of those environments (and we are a big user of Snakemake!).

When you have to run snakemake in a SLURM job context, setting --remote-job-local-storage-prefix to e.g. '/lscratch/\$SLURM_JOB_ID' or '/lscratch/$SLURM_JOB_ID' doesn't work properly: it incorrectly uses the $SLURM_JOB_ID of the parent snakemake job, not the $SLURM_JOB_ID of the submitted remote job.

I also thought escaping with two backslashes, /lscratch/\\$SLURM_JOB_ID, would work, as that would make intuitive sense, but that doesn't work either.
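For reference, a minimal shell sketch (with a hypothetical parent-job id) of where each quoting style gets expanded. Single quotes hand the literal text to snakemake unexpanded, while double quotes would expand immediately in the parent job's shell, so the quoting on the command line is not the problem:

```shell
# Hypothetical parent-job id, as SLURM would set it in the submission job:
export SLURM_JOB_ID=34562833

# Single quotes pass the text through to snakemake unexpanded:
echo '/lscratch/$SLURM_JOB_ID'    # prints /lscratch/$SLURM_JOB_ID
echo '/lscratch/\$SLURM_JOB_ID'   # prints /lscratch/\$SLURM_JOB_ID

# Double quotes would expand immediately, in the parent job's environment:
echo "/lscratch/$SLURM_JOB_ID"    # prints /lscratch/34562833
```

So snakemake receives the literal text in all the attempted variants; the variable is then resolved in the parent job's process instead of being deferred to the remote job.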

Minimal example

From within a SLURM job context run snakemake with the following:

snakemake --remote-job-local-storage-prefix '/lscratch/\$SLURM_JOB_ID'

Snakefile:
rule all:
    input:
        "local.out"

rule a:
    output:
        storage.fs("remote.txt"),
    shell:
        "echo '{output}' > {output}"

rule b:
    input:
        storage.fs("remote.txt"),
    output:
        "local.out",
    shell:
        "echo '{input}' > {output}"
Building DAG of jobs...
You are running snakemake in a SLURM job context. This is not recommended, as it may lead to unexpected behavior. Please run Snakemake directly on the login node.
SLURM run ID: 00b8696d-9685-42a9-8b4b-ae6710a54755
Using shell: /usr/bin/bash
Provided remote nodes: 9223372036854775807
Job stats:
job      count
-----  -------
a            1
all          1
b            1
total        3

Select jobs to execute...
Execute 1 jobs...

[Mon Aug 26 09:22:16 2024]
rule a:
    output: remote.txt (send to storage)
    jobid: 2
    reason: Missing output files: remote.txt (send to storage)
    resources: mem_mb=3815, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, slurm_partition=quick, cpus_per_task=2, mem=4G, runtime=5, slurm_extra=--gres=lscratch:100

echo '.snakemake/storage/fs/remote.txt' > .snakemake/storage/fs/remote.txt
No SLURM account given, trying to guess.
Guessed SLURM account: ruppinen
Job 2 has been submitted with SLURM jobid 34563252 (log: /gpfs/gsfs12/users/hermidalc/work/snakemake/.snakemake/slurm_logs/rule_a/34563252.log).
[Mon Aug 26 09:23:11 2024]
Error in rule a:
    message: SLURM-job '34563252' failed, SLURM status is: 'FAILED'. For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 2
    output: remote.txt (send to storage)
    log: /gpfs/gsfs12/users/hermidalc/work/snakemake/.snakemake/slurm_logs/rule_a/34563252.log (check log file(s) for error details)
    shell:
        echo '.snakemake/storage/fs/remote.txt' > .snakemake/storage/fs/remote.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: 34563252

Shutting down, this might take some time.

/gpfs/gsfs12/users/hermidalc/work/snakemake/.snakemake/slurm_logs/rule_a/34563252.log shows it's using the wrong $SLURM_JOB_ID (the parent job's, not the remote job's):

WorkflowError in file /gpfs/gsfs12/users/hermidalc/work/snakemake/Snakefile, line 8:
Failed to create local storage prefix /lscratch/34562833/fs
PermissionError: [Errno 13] Permission denied: '/lscratch/34562833'
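The behavior the report implies would be correct can be sketched in shell (this is a hypothetical sketch of deferred expansion, not what snakemake actually does): keep the prefix as a literal at submission time and only expand it inside the remote job's own environment, where SLURM has set that job's id.

```shell
# The prefix travels as a literal string, as snakemake received it:
prefix='/lscratch/$SLURM_JOB_ID'

# On the remote node, SLURM sets the remote job's own id
# (hypothetical id taken from the log above):
export SLURM_JOB_ID=34563252

# Expanding *here*, in the remote job, yields the correct per-job path:
eval "resolved=$prefix"
echo "$resolved"    # prints /lscratch/34563252
```

(`eval` is only for illustration; expanding on the remote node by any means would resolve $SLURM_JOB_ID to 34563252 rather than the parent's 34562833, avoiding the PermissionError above.)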
