
AutoRunner: support multi-node, multi-GPU devices #6259

@heyufan1995

Description


Update AutoRunner to support multi-node training and ensembling. This requires some refactoring of AutoRunner.
Currently, AutoRunner uses set_training_params to define a variable called "CUDA_VISIBLE_DEVICES", which is an environment variable only for the train.py subprocess, not a global one for AutoRunner itself. This makes it difficult to store and share device settings across AutoRunner functions.
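A minimal illustration of the scoping problem described above: an environment variable passed only through the subprocess call is visible to the child process (like train.py) but never propagates back to the parent AutoRunner process. The inline `-c` script here is just a stand-in for train.py.

```python
import os
import subprocess
import sys

# Pass CUDA_VISIBLE_DEVICES only to the child, as set_training_params does today.
child_env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,1")
out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=child_env,
    capture_output=True,
    text=True,
)
print(out.stdout.strip())  # the child sees "0,1"
# The parent's own environment is untouched (unless the shell already set it),
# so other AutoRunner functions cannot read the device setting from os.environ.
```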
Plans:

  1. Create an autorunner.set_device() function that sets self.device_setting (a dict) to store "NUM_NODES", "CUDA_VISIBLE_DEVICES", "MULTI_NODE_START_METHOD", etc., separating these values from self.train_params. If the user does not pass any parameters, fall back to environment variables, e.g. os.environ.get("NUM_NODES", 1).

  2. Data analyzer: currently uses torch.cuda.device_count for multi-GPU processing. AutoRunner should instead pass self.device_setting["CUDA_VISIBLE_DEVICES"] to the data analyzer to specify which GPUs to use. Multi-node is not supported here due to the nested-dict gathering problem.

  3. BundleAlgo.train: AutoRunner currently passes self.train_params to the BundleAlgo training subprocess. Change it to pass self.device_setting, update all the logic in subprocess command creation, and launch the multi-node or single-node subprocess accordingly.

  4. Ensemble: currently runs sequentially on a single GPU. Rewrite it to support multi-node, multi-GPU execution using the devices defined in self.device_setting.
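Item 1 could look roughly like the sketch below. The class shape, key names, and fallback defaults are assumptions drawn from the plan above, not the final API; values are stored as strings since they will mostly be forwarded into environment variables and subprocess commands.

```python
import os
from typing import Dict, Optional


class AutoRunner:
    """Hypothetical sketch of the proposed device-setting refactor."""

    def __init__(self):
        # device settings live separately from self.train_params
        self.device_setting: Dict[str, str] = {}

    def set_device(self, params: Optional[Dict] = None) -> "AutoRunner":
        params = params or {}
        # user-supplied params win; otherwise fall back to the environment,
        # then to single-node / single-GPU defaults
        self.device_setting = {
            "NUM_NODES": str(params.get("NUM_NODES", os.environ.get("NUM_NODES", 1))),
            "CUDA_VISIBLE_DEVICES": str(
                params.get("CUDA_VISIBLE_DEVICES", os.environ.get("CUDA_VISIBLE_DEVICES", "0"))
            ),
            "MULTI_NODE_START_METHOD": str(
                params.get(
                    "MULTI_NODE_START_METHOD",
                    os.environ.get("MULTI_NODE_START_METHOD", "torchrun"),
                )
            ),
        }
        return self


runner = AutoRunner().set_device({"NUM_NODES": 2, "CUDA_VISIBLE_DEVICES": "0,1,2,3"})
print(runner.device_setting["NUM_NODES"])  # "2"
```

Because the settings now live on the runner object rather than in a subprocess's environment, the data analyzer, BundleAlgo.train, and the ensembler can all read the same self.device_setting.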
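For item 3, the subprocess command creation could branch on self.device_setting roughly as follows. Using torchrun as the launcher and the helper name build_train_cmd are assumptions for illustration; the actual launcher depends on the chosen multi-node start method.

```python
from typing import Dict, List


def build_train_cmd(device_setting: Dict[str, str], script: str = "train.py") -> List[str]:
    """Hypothetical command builder for the BundleAlgo training subprocess."""
    n_devices = len(device_setting["CUDA_VISIBLE_DEVICES"].split(","))
    num_nodes = int(device_setting["NUM_NODES"])
    if num_nodes > 1:
        # multi-node: hand process creation to a distributed launcher
        return [
            "torchrun",
            f"--nnodes={num_nodes}",
            f"--nproc_per_node={n_devices}",
            script,
        ]
    if n_devices > 1:
        # single node, multiple GPUs
        return ["torchrun", f"--nproc_per_node={n_devices}", script]
    # single node, single GPU: plain interpreter launch
    return ["python", script]


cmd = build_train_cmd({"NUM_NODES": "1", "CUDA_VISIBLE_DEVICES": "0,1"})
print(cmd)  # ['torchrun', '--nproc_per_node=2', 'train.py']
```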
