splitting engines into smaller groups #243

@rjplevin

Description

I'm trying to use ipyparallel to run a Monte Carlo simulation with thousands of runs. If I attempt to start, say, 500 engines (5 per node) on SLURM, the current architecture requests all 100 nodes in a single allocation, leading to long queue waits. I was looking at how best to modify this to split these into chunks that can start independently of one another, and thus get going sooner.

It would be ideal if there were config options for batch systems:

  • (int) engines_per_node, default 1
  • (bool) separate_engine_jobs, default False (legacy mode)

If separate_engine_jobs is True, queue ceil(N / engines_per_node) jobs, each with ntasks=engines_per_node. Of course, this would require tracking the list of job IDs so they can be killed when stopping the cluster.
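A minimal sketch of what this could look like, assuming a wrapper around plain `sbatch`/`scancel` calls (the helper names and the batch script filename are hypothetical, not part of ipyparallel):

```python
import math
import subprocess

def submit_engine_jobs(n_engines, engines_per_node, batch_script="engine_batch.sh"):
    """Hypothetical sketch: split n_engines into ceil(n_engines / engines_per_node)
    separate SLURM jobs rather than one large allocation, so each chunk can
    start as soon as a node is free."""
    n_jobs = math.ceil(n_engines / engines_per_node)
    job_ids = []
    for _ in range(n_jobs):
        # --parsable makes sbatch print just the job ID on stdout
        result = subprocess.run(
            ["sbatch", "--parsable", f"--ntasks={engines_per_node}", batch_script],
            capture_output=True, text=True, check=True,
        )
        job_ids.append(result.stdout.strip())
    # the caller must keep this list so stop_engine_jobs can clean up
    return job_ids

def stop_engine_jobs(job_ids):
    """Cancel every separately launched engine job when the cluster stops."""
    for jid in job_ids:
        subprocess.run(["scancel", jid], check=True)
```

For the 500-engine case above, this would submit 100 independent jobs of 5 tasks each instead of a single 100-node allocation, and the returned ID list is exactly the bookkeeping the cluster-stop path would need.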

I tested this by manually calling sbatch repeatedly to start 5 engines/node. The additional engines were picked up by the schedulers as they started running, as I'd hoped. The only problem was that stopping the cluster didn't kill the separately launched engines.

The simplest approach for me would be to write this in my app that wraps all the ipyparallel stuff, but I thought it sounded like a good general feature, thus this note.
