Description
Background
When submitting a job to TORQUE / PBS using something like:
qsub -l nodes=3:ppn=2 myjob.sh
the scheduler will allocate 3 nodes with 2 cores each (= 6 cores total) for myjob.sh when launched. Exactly which 3 nodes are allocated is only known to myjob.sh at run time. This information is available in a file $PBS_NODEFILE written by TORQUE / PBS, e.g.
$ cat $PBS_NODEFILE
n1
n1
n8
n8
n9
n9
Other HPC job schedulers use other files / environment variables for this.
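As a rough illustration of how such a node file could be consumed from R, the sketch below reads a file in the PBS format shown above and tabulates the cores per node. The helper name `read_nodefile()` is hypothetical, and a temporary file stands in for `$PBS_NODEFILE` so the example runs outside a scheduler.

```r
# Hypothetical helper: read a PBS-style node file (one host name per line,
# each host repeated once per allocated core).
read_nodefile <- function(path) {
  nodes <- trimws(readLines(path, warn = FALSE))
  nodes[nzchar(nodes)]  # drop blank lines
}

# Simulate $PBS_NODEFILE with a temporary file holding the listing above
tmp <- tempfile()
writeLines(c("n1", "n1", "n8", "n8", "n9", "n9"), tmp)

nodes <- read_nodefile(tmp)
cores_per_node <- table(nodes)  # number of cores allocated on each node
print(cores_per_node)
```

Inside a real job, the path would presumably come from `Sys.getenv("PBS_NODEFILE")` instead of the temporary file.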
Actions
Add an availableNodes() function that searches for common environment variables and returns a vector of node names, e.g.
> availableNodes()
[1] "n1" "n1" "n8" "n8" "n9" "n9"
If no known environment variables are found, the default fallback could be to return rep("localhost", times = availableCores()).
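A minimal sketch of the proposed behavior, covering only the PBS case, might look as follows. This is not an actual implementation: the name `available_nodes()` and its arguments are hypothetical, and `parallel::detectCores()` is used as a stand-in for `availableCores()` so that the example depends only on base R.

```r
# Sketch only. `available_nodes()` and its arguments are hypothetical names;
# parallel::detectCores() stands in for availableCores().
available_nodes <- function(nodefile = Sys.getenv("PBS_NODEFILE"),
                            fallback_cores = parallel::detectCores()) {
  # Scheduler case: the environment variable names a readable node file
  if (nzchar(nodefile) && file.exists(nodefile)) {
    nodes <- trimws(readLines(nodefile, warn = FALSE))
    nodes <- nodes[nzchar(nodes)]
    if (length(nodes) > 0L) return(nodes)
  }
  # No known scheduler information: fall back to the local machine.
  # detectCores() can return NA, so guard with a minimum of one worker.
  n <- fallback_cores
  if (is.na(n) || n < 1L) n <- 1L
  rep("localhost", times = n)
}

# Outside a scheduler this returns "localhost" repeated once per core
workers <- available_nodes()
```

Support for other schedulers could be added as further checks before the fallback, each returning the same kind of character vector.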
The above would allow us to make workers = availableNodes() the new default for cluster futures (currently workers = availableCores()).
Identify these settings for the following schedulers:
- PBS (Portable Batch System): Environment variable PBS_NODEFILE (the name of a file containing one node per line, where each node is repeated "ppn" times).
- Oracle Grid Engine (aka Sun Grid Engine, CODINE, GRD): Environment variable PE_HOSTFILE (a file, format unclear), cf. https://www.ace-net.ca/wiki/Sun_Grid_Engine
- Slurm (Simple Linux Utility for Resource Management): Environment variable SLURM_JOB_NODELIST (a list of nodes in a compressed format, e.g. instead of "tux1,tux3,tux4" it is stored as "tux[1,3-4]"). Note that multiple "compressions" may exist, e.g. "compute-[0-6]-[0-15]". The number of nodes can be verified by SLURM_JOB_NUM_NODES. The "ppn" information is stored in SLURM_TASKS_PER_NODE.
- LSF/OpenLava (Platform Load Sharing Facility)
  - LSB_HOSTS
- Spark
- OAR
- HTCondor
- Moab
- PJM (https://staff.cs.manchester.ac.uk/~fumie/internal/Job_Operation_Software_en.pdf)
  - PJM_O_NODEINF: "Path of the allocated node list file. For a job to which virtual nodes are allocated, the IP addresses of the nodes where the virtual nodes are placed are written one per line."
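The compressed Slurm format is the least straightforward of these to parse. The sketch below shows one way the expansion could be done in plain R, assuming the bracket syntax described above (comma-separated ids and ranges, possibly several bracket groups per name). The function names are hypothetical, and the sketch does not claim to cover every form Slurm can emit; on a Slurm system, `scontrol show hostnames` performs the expansion authoritatively.

```r
# Split on commas that are not inside [...]
split_top_level <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  depth <- cumsum(chars == "[") - cumsum(chars == "]")
  cuts <- which(chars == "," & depth == 0L)
  substring(x, c(1L, cuts + 1L), c(cuts - 1L, length(chars)))
}

# Expand the first [...] group of one item, then recurse on the remainder
expand_item <- function(item) {
  m <- regexpr("\\[[^]]*\\]", item)
  start <- as.integer(m)
  if (start == -1L) return(item)  # no bracket group left
  len <- attr(m, "match.length")
  prefix <- substr(item, 1L, start - 1L)
  group  <- substr(item, start + 1L, start + len - 2L)
  suffix <- substr(item, start + len, nchar(item))
  ids <- unlist(lapply(strsplit(group, ",", fixed = TRUE)[[1]], function(r) {
    b <- strsplit(r, "-", fixed = TRUE)[[1]]
    if (length(b) == 1L) return(b)  # single id, e.g. "1"
    # Range, e.g. "3-4"; keep zero padding such as "08-10"
    formatC(seq(as.integer(b[1]), as.integer(b[2])),
            width = nchar(b[1]), flag = "0")
  }))
  tails <- expand_item(suffix)  # handles any further bracket groups
  unlist(lapply(paste0(prefix, ids), function(p) paste0(p, tails)))
}

# Hypothetical entry point: expand a compressed node list into node names
expand_nodelist <- function(x) {
  unlist(lapply(split_top_level(x), expand_item), use.names = FALSE)
}

print(expand_nodelist("tux[1,3-4]"))
```

For the example from the list above, `expand_nodelist("tux[1,3-4]")` yields "tux1", "tux3" and "tux4". Repeating each name according to SLURM_TASKS_PER_NODE would be a separate step that is not shown here.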