
AutoRunner: support multi-node, multi-GPU devices #6259

@heyufan1995

Description


Update AutoRunner to support multi-node training and ensembling. This requires some refactoring of AutoRunner.
Currently, AutoRunner uses set_training_params to define a variable called "CUDA_VISIBLE_DEVICES", which is an environment variable only for the train.py subprocess, not a global one for AutoRunner itself. This makes it difficult to store and share device settings across AutoRunner functions.
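A minimal illustration of the scoping problem described above: an environment variable passed only through the subprocess call is visible to the child process (like train.py) but never propagates back to the parent AutoRunner process. The inline `-c` script here is just a stand-in for train.py.

```python
import os
import subprocess
import sys

# Pass CUDA_VISIBLE_DEVICES only to the child, as set_training_params does today.
child_env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,1")
out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=child_env,
    capture_output=True,
    text=True,
)
print(out.stdout.strip())  # the child sees "0,1"
# The parent's own environment is untouched (unless the shell already set it),
# so other AutoRunner functions cannot read the device setting from os.environ.
```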
Plans:

  1. Create an autorunner.set_device() function that sets self.device_setting (a dict) to store "NUM_NODES", "CUDA_VISIBLE_DEVICES", "MULTI_NODE_START_METHOD", etc., separating these values from self.train_params. If the user does not pass any parameters, fall back to environment variables, e.g. os.environ.get("NUM_NODES", 1).

  2. Data analyzer: currently uses torch.cuda.device_count for multi-GPU processing. AutoRunner should instead pass self.device_setting["CUDA_VISIBLE_DEVICES"] to the data analyzer to specify which GPUs to use. Multi-node is not supported here due to the nested-dict gathering problem.

  3. BundleAlgo.train: AutoRunner currently passes self.train_params to the BundleAlgo training subprocess. Change it to pass self.device_setting, update all the logic in subprocess command creation, and launch the multi-node or single-node subprocess accordingly.

  4. Ensemble: currently runs sequentially on a single GPU. Rewrite it to support multi-node, multi-GPU execution using the devices defined in self.device_setting.
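Item 1 could look roughly like the sketch below. The class shape, key names, and fallback defaults are assumptions drawn from the plan above, not the final API; values are stored as strings since they will mostly be forwarded into environment variables and subprocess commands.

```python
import os
from typing import Dict, Optional


class AutoRunner:
    """Hypothetical sketch of the proposed device-setting refactor."""

    def __init__(self):
        # device settings live separately from self.train_params
        self.device_setting: Dict[str, str] = {}

    def set_device(self, params: Optional[Dict] = None) -> "AutoRunner":
        params = params or {}
        # user-supplied params win; otherwise fall back to the environment,
        # then to single-node / single-GPU defaults
        self.device_setting = {
            "NUM_NODES": str(params.get("NUM_NODES", os.environ.get("NUM_NODES", 1))),
            "CUDA_VISIBLE_DEVICES": str(
                params.get("CUDA_VISIBLE_DEVICES", os.environ.get("CUDA_VISIBLE_DEVICES", "0"))
            ),
            "MULTI_NODE_START_METHOD": str(
                params.get(
                    "MULTI_NODE_START_METHOD",
                    os.environ.get("MULTI_NODE_START_METHOD", "torchrun"),
                )
            ),
        }
        return self


runner = AutoRunner().set_device({"NUM_NODES": 2, "CUDA_VISIBLE_DEVICES": "0,1,2,3"})
print(runner.device_setting["NUM_NODES"])  # "2"
```

Because the settings now live on the runner object rather than in a subprocess's environment, the data analyzer, BundleAlgo.train, and the ensembler can all read the same self.device_setting.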
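For item 3, the subprocess command creation could branch on self.device_setting roughly as follows. Using torchrun as the launcher and the helper name build_train_cmd are assumptions for illustration; the actual launcher depends on the chosen multi-node start method.

```python
from typing import Dict, List


def build_train_cmd(device_setting: Dict[str, str], script: str = "train.py") -> List[str]:
    """Hypothetical command builder for the BundleAlgo training subprocess."""
    n_devices = len(device_setting["CUDA_VISIBLE_DEVICES"].split(","))
    num_nodes = int(device_setting["NUM_NODES"])
    if num_nodes > 1:
        # multi-node: hand process creation to a distributed launcher
        return [
            "torchrun",
            f"--nnodes={num_nodes}",
            f"--nproc_per_node={n_devices}",
            script,
        ]
    if n_devices > 1:
        # single node, multiple GPUs
        return ["torchrun", f"--nproc_per_node={n_devices}", script]
    # single node, single GPU: plain interpreter launch
    return ["python", script]


cmd = build_train_cmd({"NUM_NODES": "1", "CUDA_VISIBLE_DEVICES": "0,1"})
print(cmd)  # ['torchrun', '--nproc_per_node=2', 'train.py']
```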
