
[Feature] Cluster Manager Python API #460

@sahil1105

Description


It would be good to have a ClusterManager object that starts the engines using MPI, keeps tabs on the mpiexec process, and reports crashes (exit code, etc.) to the controller. This should also remove the need for heartbeats, since mpiexec already monitors the health of the engines. In case of a crash, we can quickly inform the controller, which can then inform the client.
We can also provide Pythonic APIs to start and restart engines. Users could pass additional mpiexec parameters when starting the engines, such as rank placement (round-robin, etc.).
Interrupts can be handled through this object as well (e.g. by sending a signal to mpiexec).
Deleting the object should stop the engines and clean up the memory. Running the code again would start a new set of engines.
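
To make the proposal concrete, here is a rough sketch of what such an object could look like. Nothing here is an existing API: the class layout, the method names (`start`, `stop`, `restart`, `interrupt`), the `on_crash` callback standing in for "inform the controller", and the `launcher` / `extra_args` parameters are all assumptions for illustration. It only relies on `mpiexec -n <N>` and the standard library.

```python
import signal
import subprocess
import threading


class ClusterManager:
    """Hypothetical sketch: launch engines under mpiexec and watch the launcher."""

    def __init__(self, engine_cmd, n_engines=2, launcher=("mpiexec",),
                 extra_args=(), on_crash=None):
        self.engine_cmd = list(engine_cmd)   # command each rank runs (assumed)
        self.n_engines = n_engines
        self.launcher = list(launcher)       # overridable, e.g. for testing
        self.extra_args = list(extra_args)   # passed to the launcher unchanged
        self.on_crash = on_crash             # stand-in for "inform the controller"
        self._proc = None
        self._watcher = None
        self._stopping = False

    def is_running(self):
        return self._proc is not None and self._proc.poll() is None

    def start(self):
        if self.is_running():
            raise RuntimeError("engines are already running")
        cmd = [*self.launcher, "-n", str(self.n_engines),
               *self.extra_args, *self.engine_cmd]
        self._stopping = False
        self._proc = subprocess.Popen(cmd)
        # A background thread blocks on the launcher instead of polling heartbeats.
        self._watcher = threading.Thread(
            target=self._watch, args=(self._proc,), daemon=True)
        self._watcher.start()

    def _watch(self, proc):
        exit_code = proc.wait()
        # Report only unexpected non-zero exits, not a requested stop().
        if exit_code != 0 and not self._stopping and self.on_crash is not None:
            self.on_crash(exit_code)

    def interrupt(self, sig=signal.SIGINT):
        # Forward a signal to the launcher process if it is still alive.
        if self.is_running():
            self._proc.send_signal(sig)

    def stop(self, timeout=5.0):
        if self._proc is None:
            return
        self._stopping = True
        if self._proc.poll() is None:
            self._proc.terminate()
            self._watcher.join(timeout)
            if self._watcher.is_alive():
                self._proc.kill()            # escalate if terminate was ignored
        self._watcher.join()
        self._proc = None

    def restart(self):
        self.stop()
        self.start()

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.stop()
```

Usage might look like `with ClusterManager(["python", "engine.py"], n_engines=4, on_crash=notify_controller) as cm: ...`, where `engine.py` and `notify_controller` are placeholders. The sketch uses a `with` block for cleanup because the timing of `__del__` is not guaranteed in Python, so "deleting the object stops the engines" would probably need an explicit `stop()` or context manager behind it. Whether a signal sent to mpiexec reaches the engines depends on the MPI implementation.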


EDIT: linked issues: #22, #44, #216, #243, #241
