[Feature] Cluster Manager Python API #460
Description
It would be good to have a `ClusterManager` object that starts the engines using MPI, keeps tabs on the mpiexec process, and reports crashes (with exit code, etc.) to the controller. This should also remove the need for heartbeats, since mpiexec already monitors the health of the engines. In case of a crash, we can quickly inform the controller, which can then inform the client.
We can also provide Pythonic APIs to start and restart engines. Users could pass additional parameters to mpiexec when starting the engines, such as rank placement (round-robin, etc.).
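To make the "additional parameters" idea concrete, here is a minimal sketch of how a start call might assemble the mpiexec command line. The helper name `build_mpiexec_command` and its parameters are hypothetical, not an existing API. Only `mpiexec -n` is assumed to be portable; placement flags differ between MPI implementations, so they are passed through untouched.

```python
def build_mpiexec_command(n_engines, engine_cmd, extra_args=()):
    """Assemble an mpiexec command line (hypothetical helper).

    n_engines  -- number of engine processes to launch
    engine_cmd -- the command that starts one engine, as a list of strings
    extra_args -- user-supplied mpiexec options, passed through unchanged
    """
    if n_engines < 1:
        raise ValueError("n_engines must be at least 1")
    # User options go between the process count and the engine command,
    # so mpiexec reads them as its own flags, not as engine arguments.
    return ["mpiexec", "-n", str(n_engines), *extra_args, *engine_cmd]
```

For example, `build_mpiexec_command(4, ["python", "engine.py"], ["--map-by", "node"])` would request round-robin placement across nodes with Open MPI's spelling of the option; other implementations use different flags, and `engine.py` is only a placeholder.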
Interrupts can be handled through this object as well (for example, by sending a signal to mpiexec).
Deleting the object should stop the engines and clean up the memory. Running the code again would start a new set of engines.
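A rough sketch of what such an object could look like, using only the standard library. Everything here is an assumption for discussion: the class layout, the method names, and the `on_exit` callback (standing in for "inform the controller") are hypothetical. The manager takes the full launcher command line as a list, so it works with mpiexec or any other launcher.

```python
import signal
import subprocess
import threading


class ClusterManager:
    """Hypothetical sketch: owns one launcher process such as mpiexec."""

    def __init__(self, command, on_exit=None):
        self.command = list(command)  # full launcher command line
        self.on_exit = on_exit        # called with the exit code on a crash
        self._proc = None
        self._watcher = None
        self._stopping = False

    def start(self):
        if self._proc is not None and self._proc.poll() is None:
            raise RuntimeError("engines are already running")
        self._stopping = False
        self._proc = subprocess.Popen(self.command)
        # A background thread blocks on the launcher, so no polling or
        # heartbeat is needed to notice that it has exited.
        self._watcher = threading.Thread(
            target=self._watch, args=(self._proc,), daemon=True
        )
        self._watcher.start()

    def _watch(self, proc):
        code = proc.wait()
        # Report only exits that were not requested through stop().
        if self.on_exit is not None and not self._stopping:
            self.on_exit(code)

    def wait(self, timeout=None):
        """Wait for the launcher to exit; return its exit code or None."""
        if self._watcher is None:
            return None
        self._watcher.join(timeout)
        return self._proc.poll()

    def interrupt(self, sig=signal.SIGINT):
        """Forward a signal to the launcher process."""
        if self._proc is not None and self._proc.poll() is None:
            self._proc.send_signal(sig)

    def stop(self, timeout=5.0):
        proc = self._proc
        if proc is None or proc.poll() is not None:
            return
        self._stopping = True
        proc.terminate()
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # escalate if the launcher ignores the request
            proc.wait()
        self._watcher.join()

    def restart(self):
        self.stop()
        self.start()

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *exc_info):
        self.stop()

    def __del__(self):
        # Best effort only: Python does not guarantee when __del__ runs.
        try:
            self.stop()
        except Exception:
            pass
```

Since `__del__` timing is not guaranteed, an explicit `stop()` or a `with ClusterManager(...) as mgr:` block is probably the more reliable way to get the "deleting the object stops the engines" behavior. Whether a signal sent to mpiexec actually reaches the engines depends on the MPI implementation.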