create_multigpu_supervised_trainer can't work with multiple devices #3765
Closed
Labels
enhancement (New feature or request)
Description
Is your feature request related to a problem? Please describe.
Got feedback that when `distributed=True` is set in `create_multigpu_supervised_trainer`, the docstring and type hints suggest it supports multiple devices and outputs on the first device:
https://github.com/Project-MONAI/MONAI/blob/dev/monai/engines/multi_gpu_supervised_trainer.py#L77
But in fact it can't work with multiple devices, and passing `None` here conflicts with the PyTorch API:
```
trainer = create_multigpu_supervised_trainer(net, opt, fake_loss, None, distributed=True)
  File "/workspace/data/medical/MONAI/monai/engines/multi_gpu_supervised_trainer.py", line 90, in create_multigpu_supervised_trainer
    net = DistributedDataParallel(net, device_ids=devices_)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 518, in __init__
    self._log_and_throw(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 642, in _log_and_throw
    raise err_type(err_msg)
ValueError: device_ids can only be None or contain a single element.
```
We don't have a unit test covering `distributed=True` so far.
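For context, `DistributedDataParallel` is designed around one device per process: multi-device replication inside a single process is what the older `torch.nn.DataParallel` does, while DDP expects `device_ids` to be `None` or a single-element list. A minimal sketch reproducing the constraint (using the `gloo` backend and a single-process group so it runs on CPU; the port number is an arbitrary choice):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Single-process group with the gloo backend, so no GPUs are needed.
dist.init_process_group(
    "gloo", init_method="tcp://127.0.0.1:29512", rank=0, world_size=1
)

net = torch.nn.Linear(4, 2)

# Valid: device_ids=None (required for CPU modules; on GPU it would be
# a single-element list such as [local_rank]).
ddp = DistributedDataParallel(net)

# Invalid: more than one device id raises the ValueError from the issue.
try:
    DistributedDataParallel(net, device_ids=[0, 1])
except ValueError as err:
    print(err)  # device_ids can only be None or contain a single element.

dist.destroy_process_group()
```

This is why the `distributed=True` branch in `create_multigpu_supervised_trainer` cannot simply forward a multi-device list to DDP; the intended pattern is one process (and one DDP wrapper) per device.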