splitting engines into smaller groups #243
Description
I'm trying to use ipyparallel to run a Monte Carlo simulation with thousands of runs. If I attempt to start, say, 500 engines (5 to a node) on SLURM, the current architecture tries to allocate 100 nodes at once, leading to long waits. I was looking at how best to modify this to split these engines into chunks that can start independently of one another, and thus get going sooner.
It would be ideal if there were config options for batch systems:
- (int) engines_per_node, default 1
- (bool) separate_engine_jobs, default False (legacy mode)
If separate_engine_jobs is True, queue ceil(N / engines_per_node) jobs, i.e. (N // engines_per_node) + (1 if N % engines_per_node else 0), each with ntasks=engines_per_node (the final job gets the remainder). Of course, this would require tracking a list of job IDs so they can all be killed when stopping the cluster.
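A minimal sketch of the chunking arithmetic above, assuming a hypothetical helper (`plan_engine_jobs` is not part of ipyparallel; it just shows the ntasks value each sbatch submission would get):

```python
import math

def plan_engine_jobs(n_engines, engines_per_node):
    """Hypothetical helper: split n_engines into per-node batch jobs.

    Returns one ntasks value per sbatch submission; the last job
    absorbs any remainder when n_engines isn't evenly divisible.
    """
    n_jobs = math.ceil(n_engines / engines_per_node)
    tasks = [engines_per_node] * (n_engines // engines_per_node)
    remainder = n_engines % engines_per_node
    if remainder:
        tasks.append(remainder)
    assert len(tasks) == n_jobs
    return tasks

print(plan_engine_jobs(500, 5))  # 100 jobs, 5 engines each
print(plan_engine_jobs(12, 5))   # [5, 5, 2]
```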
I tested this by manually calling sbatch repeatedly to start 5 engines/node. The additional engines were seen by the schedulers once they started running, as I'd hoped. The only problem was that stopping the cluster didn't kill the separately launched engines.
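The stop problem above could be handled by remembering every submitted job ID. A rough sketch, assuming a hypothetical tracker class (the sbatch output parsing relies on SLURM's standard "Submitted batch job NNNN" message; this is not existing ipyparallel code):

```python
import subprocess

class EngineJobTracker:
    """Hypothetical sketch: record each sbatch job ID at submission
    time so that stopping the cluster can scancel all of them, not
    just the first job."""

    def __init__(self):
        self.job_ids = []

    def submit(self, batch_script):
        # sbatch normally prints "Submitted batch job <id>"
        out = subprocess.run(
            ["sbatch", batch_script],
            capture_output=True, text=True, check=True,
        ).stdout
        job_id = out.strip().split()[-1]
        self.job_ids.append(job_id)
        return job_id

    def stop_all(self):
        # Cancel every tracked job; ignore failures for jobs that
        # already finished.
        for job_id in self.job_ids:
            subprocess.run(["scancel", job_id], check=False)
        self.job_ids.clear()
```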
The simplest approach for me would be to write this in my app that wraps all the ipyparallel stuff, but I thought it sounded like a good general feature, thus this note.