Add interrupt/terminate option to Trainers & Evaluators #4554
Description
Is your feature request related to a problem? Please describe.
It should be possible to abort/finalize a running Trainer by calling the API (rather than pressing Ctrl+C). This would be helpful when the Trainer needs to be executed remotely, such as in federated learning (FL) scenarios.
Describe the solution you'd like
Add abort() and finalize() functions to the Trainer class (or potentially its base class). Note that finalize() should terminate training completely, while abort() should allow training to continue later from where it was aborted, by calling run() again.
For example, an ignite-based Trainer supporting abort() and finalize() calls could be implemented as follows (currently used in MONAI-FL's MonaiAlgo class; private repo, contact me if you need access):
from ignite.engine import Events

def abort(self):
    self.trainer.terminate()
    # save current iteration for next round
    setattr(self.trainer.state, "dataloader_iter", self.trainer._dataloader_iter)
    if self.trainer.state.iteration % self.trainer.state.epoch_length == 0:
        # if current iteration is end of 1 epoch, manually trigger epoch completed event
        self.trainer._fire_event(Events.EPOCH_COMPLETED)

def finalize(self):
    self.trainer.terminate()
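To make the requested semantics concrete, here is a minimal, dependency-free sketch of the difference between abort() (pause, then resume on the next run()) and finalize() (stop for good). `ToyTrainer` and all of its attributes are hypothetical names invented for illustration; this is not MONAI's or ignite's API, only one way the behaviour could look under the assumption that the loop checks a stop flag between iterations.

```python
import threading
from typing import Any, Callable, Iterable


class ToyTrainer:
    """Hypothetical stand-in for a Trainer; NOT the MONAI or ignite API."""

    def __init__(self, data: Iterable[Any], max_epochs: int,
                 step: Callable[[Any], None]) -> None:
        self.data = list(data)
        self.max_epochs = max_epochs
        self.step = step
        self.epoch = 0          # number of completed epochs
        self.batch_index = 0    # next batch to process in the current epoch
        self.finalized = False
        # An Event lets another thread (e.g. a remote FL controller) request a stop.
        self._stop = threading.Event()

    def abort(self) -> None:
        """Request a pause; a later run() resumes from the saved position."""
        self._stop.set()

    def finalize(self) -> None:
        """Request a permanent stop; later run() calls return immediately."""
        self.finalized = True
        self._stop.set()

    def run(self) -> None:
        if self.finalized:
            return
        self._stop.clear()
        while self.epoch < self.max_epochs:
            while self.batch_index < len(self.data):
                # The flag is checked between iterations, so the current
                # step always finishes before the loop exits.
                if self._stop.is_set():
                    return
                self.step(self.data[self.batch_index])
                self.batch_index += 1
            # Epoch bookkeeping happens even if the stop request arrived
            # during the last batch of the epoch.
            self.epoch += 1
            self.batch_index = 0
```

In this sketch the position (`epoch`, `batch_index`) lives on the trainer itself, so abort() does not need to stash an iterator as the ignite-based snippet does. Because the epoch counter is advanced before the stop flag is re-checked, an abort that lands on the last batch of an epoch still completes that epoch, which is the case the snippet above handles by firing EPOCH_COMPLETED manually.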
Describe alternatives you've considered
n/a
Additional context
n/a