Labels: oncall: distributed, triaged
Description
🐛 Bug
- Using DistributedDataParallel
- on a model that has at least one non-floating-point parameter with requires_grad=False
- with a WORLD_SIZE <= nGPUs/2 on the machine (so each process is given more than one GPU and DDP replicates the module)
results in the error "Only Tensors of floating point dtype can require gradients".
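The error message itself comes from nn.Parameter's constructor: requires_grad defaults to True, and torch.Tensor._make_subclass rejects that for non-floating-point dtypes (see the Parameter.__new__ frame in the trace below). A two-line illustration, independent of DDP:

import torch
from torch.nn import Parameter

# requires_grad defaults to True, which is rejected for integer dtypes:
Parameter(torch.zeros(1, dtype=torch.long))
# RuntimeError: Only Tensors of floating point dtype can require gradients

The same parameter is created successfully when requires_grad=False is passed explicitly, which is exactly what the model in the test does.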
To Reproduce
Steps to reproduce the behavior:
- Use a machine with at least 4 GPUs
- Build PyTorch from source for Python 3.6, or use one of the available Docker images.
- Run the following command: "BACKEND=nccl WORLD_SIZE=2 TEMP_DIR=/tmp python3.6 test_distributed.py --verbose TestDistBackend.test_DistributedDataParallel"
The model used in the test has a long (int64) parameter with requires_grad=False: https://github.com/pytorch/pytorch/blob/master/test/test_distributed.py#L59
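For reference, a minimal sketch of the failing setup (hypothetical ToyModel, not the exact test code; assumes a 4-GPU machine and the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables so init_process_group can rendezvous):

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 10)
        # Non-floating-point parameter that should never receive gradients,
        # analogous to the long parameter in the test model.
        self.counter = nn.Parameter(torch.zeros(1, dtype=torch.long),
                                    requires_grad=False)

    def forward(self, x):
        return self.fc(x)

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
model = ToyModel().cuda(2 * rank)
# Two devices per process (WORLD_SIZE=2 on a 4-GPU machine), so DDP replicates
# the module across the device ids and re-wraps every parameter; the crash
# happens inside that replication step during DDP construction.
ddp = DDP(model, device_ids=[2 * rank, 2 * rank + 1])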
On a ROCm build of PyTorch, I get the below stack trace (although this issue isn't ROCm-specific):
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "test_distributed.py", line 2097, in _run
getattr(self, self.id().split(".")[2])()
File "test_distributed.py", line 2023, in wrapper
fn(self)
File "test_distributed.py", line 117, in wrapper
return func(*args, **kwargs)
File "test_distributed.py", line 133, in wrapper
return func(*args, **kwargs)
File "test_distributed.py", line 1849, in test_DistributedDataParallel
self._test_DistributedDataParallel(gpu_subset=gpus, rank=rank)
File "test_distributed.py", line 1784, in _test_DistributedDataParallel
model_DDP, device_ids=gpu_subset
File "/root/.local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 305, in __init__
self._ddp_init_helper()
File "/root/.local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 323, in _ddp_init_helper
self._module_copies = replicate(self.module, self.device_ids, detach=True)
File "/root/.local/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 147, in replicate
setattr(replica, key, Parameter(param))
File "/root/.local/lib/python3.6/site-packages/torch/nn/parameter.py", line 26, in __new__
return torch.Tensor._make_subclass(cls, data, requires_grad)
RuntimeError: Only Tensors of floating point dtype can require gradients
FAIL
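From the last three frames of the trace, replicate() re-wraps each parameter as Parameter(param), which drops the original requires_grad=False flag and falls back to Parameter's default of True; _make_subclass then rejects the int64 tensor. One possible fix (a sketch only, not a submitted patch) would be to forward the flag in torch/nn/parallel/replicate.py:

# Hypothetical change to the line shown in the trace (replicate.py:147),
# preserving the original flag instead of relying on Parameter's default:
setattr(replica, key, Parameter(param, requires_grad=param.requires_grad))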
Expected behavior
Test should pass.
Environment
Collecting environment information...
PyTorch version: 1.4.0a0+b8f50d9
Is debug build: No
CUDA used to build PyTorch: Could not collect
OS: Ubuntu 16.04.5 LTS
GCC version: Could not collect
CMake version: version 3.6.3
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
ROCm version: 2.10
Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] torch==1.4.0a0+b8f50d9
[pip3] torchvision==0.4.2
[conda] Could not collect
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar