create_multigpu_supervised_trainer can't work with multiple devices #3765

@Nic-Ma

Description

Is your feature request related to a problem? Please describe.
We got feedback that when distributed=True is passed to create_multigpu_supervised_trainer, the docstring and type hints claim it supports multiple devices and outputs on the first device:
https://github.com/Project-MONAI/MONAI/blob/dev/monai/engines/multi_gpu_supervised_trainer.py#L77
In practice it cannot work with multiple devices, and passing None here violates the PyTorch API:

trainer = create_multigpu_supervised_trainer(net, opt, fake_loss, None, distributed=True)
  File "/workspace/data/medical/MONAI/monai/engines/multi_gpu_supervised_trainer.py", line 90, in create_multigpu_supervised_trainer
    net = DistributedDataParallel(net, device_ids=devices_)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 518, in __init__
    self._log_and_throw(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 642, in _log_and_throw
    raise err_type(err_msg)
ValueError: device_ids can only be None or contain a single element.

So far we don't have a unit test covering distributed=True.
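For context, the ValueError above reflects a hard constraint in DistributedDataParallel: device_ids must be None or contain exactly one element, because DDP expects one process per GPU rather than one process driving several GPUs. The check can be sketched without torch; check_ddp_device_ids below is a hypothetical helper for illustration, not MONAI or PyTorch API:

```python
from typing import Optional, Sequence


def check_ddp_device_ids(device_ids: Optional[Sequence[int]]) -> Optional[Sequence[int]]:
    """Mirror DistributedDataParallel's constraint: device_ids must be
    None or contain a single element (one process per GPU)."""
    if device_ids is not None and len(device_ids) != 1:
        raise ValueError("device_ids can only be None or contain a single element.")
    return device_ids


# Valid: a single-device process, the standard DDP pattern.
check_ddp_device_ids([0])
# Valid: None, letting DDP decide (e.g. CPU modules).
check_ddp_device_ids(None)
# Invalid: multiple devices per process raises, matching the traceback above.
try:
    check_ddp_device_ids([0, 1])
except ValueError as exc:
    print(exc)
```

This suggests the fix is either to restrict the devices argument to a single device per process when distributed=True, or to document that users should launch one process per GPU (e.g. via torch.distributed.launch / torchrun) instead of listing multiple devices.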

Labels: enhancement (New feature or request)