splitting engines into smaller groups #243
Description
I'm trying to use ipyparallel to run a Monte Carlo simulation with thousands of runs. If I attempt to start, say, 500 engines (5 to a node) on SLURM, the current architecture tries to allocate 100 nodes at once, leading to long waits. I was looking at how best to modify this to split these engines into chunks that can start independently of one another, and thus get going sooner.
It would be ideal if there were config options for batch systems:
- (int) engines_per_node, default 1
- (bool) separate_engine_jobs, default False (legacy mode)
If separate_engine_jobs is True, queue ceil(N / engines_per_node) jobs, i.e. (N // engines_per_node) + (1 if N % engines_per_node else 0), each with ntasks=engines_per_node (the final job gets the remainder). Of course, this would require tracking a list of job IDs so they can all be killed when stopping the cluster.
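A minimal sketch of the chunking arithmetic above, assuming a hypothetical helper (`plan_engine_jobs` is not part of ipyparallel; it just shows the ntasks value each sbatch submission would get):

```python
import math

def plan_engine_jobs(n_engines, engines_per_node):
    """Hypothetical helper: split n_engines into per-node batch jobs.

    Returns one ntasks value per sbatch submission; the last job
    absorbs any remainder when n_engines isn't evenly divisible.
    """
    n_jobs = math.ceil(n_engines / engines_per_node)
    tasks = [engines_per_node] * (n_engines // engines_per_node)
    remainder = n_engines % engines_per_node
    if remainder:
        tasks.append(remainder)
    assert len(tasks) == n_jobs
    return tasks

print(plan_engine_jobs(500, 5))  # 100 jobs, 5 engines each
print(plan_engine_jobs(12, 5))   # [5, 5, 2]
```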
I tested this by manually calling sbatch repeatedly to start 5 engines/node. The additional engines were seen by the schedulers once they started running, as I'd hoped. The only problem was that stopping the cluster didn't kill the separately launched engines.
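The stop problem above could be handled by remembering every submitted job ID. A rough sketch, assuming a hypothetical tracker class (the sbatch output parsing relies on SLURM's standard "Submitted batch job NNNN" message; this is not existing ipyparallel code):

```python
import subprocess

class EngineJobTracker:
    """Hypothetical sketch: record each sbatch job ID at submission
    time so that stopping the cluster can scancel all of them, not
    just the first job."""

    def __init__(self):
        self.job_ids = []

    def submit(self, batch_script):
        # sbatch normally prints "Submitted batch job <id>"
        out = subprocess.run(
            ["sbatch", batch_script],
            capture_output=True, text=True, check=True,
        ).stdout
        job_id = out.strip().split()[-1]
        self.job_ids.append(job_id)
        return job_id

    def stop_all(self):
        # Cancel every tracked job; ignore failures for jobs that
        # already finished.
        for job_id in self.job_ids:
            subprocess.run(["scancel", job_id], check=False)
        self.job_ids.clear()
```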
The simplest approach for me would be to write this in my app that wraps all the ipyparallel stuff, but I thought it sounded like a good general feature, thus this note.