
DistributedDataParallel non-floating point dtype parameter with requires_grad=False #32018

@jithunnair-amd

Description

🐛 Bug

  1. Using DistributedDataParallel
  2. on a model that has at least one non-floating point dtype parameter with requires_grad=False
  3. with a WORLD_SIZE <= nGPUs/2 on the machine (so each process is assigned more than one GPU and DDP takes the single-process multi-device replicate() path)

results in an error "Only Tensors of floating point dtype can require gradients".

To Reproduce

Steps to reproduce the behavior:

  1. Use a machine which has at least 4 GPUs
  2. Build PyTorch from source for Python 3.6, or use one of the available Docker images.
  3. Run the following command: "BACKEND=nccl WORLD_SIZE=2 TEMP_DIR=/tmp python3.6 test_distributed.py --verbose TestDistBackend.test_DistributedDataParallel"

The model used in the test has a long (int64) parameter with requires_grad=False: https://github.com/pytorch/pytorch/blob/master/test/test_distributed.py#L59
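
For illustration, here is a condensed, standalone version of the failing pattern (a sketch only, not the actual test code; the module and variable names are illustrative, and it assumes a machine with at least two GPUs and an already-initialized process group):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class ModelWithLongParam(nn.Module):  # illustrative, not the test's actual module
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 4)
        # non-floating-point parameter that never requires gradients
        self.counter = nn.Parameter(torch.zeros(1, dtype=torch.long),
                                    requires_grad=False)

    def forward(self, x):
        return self.fc(x)

# assumes torch.distributed.init_process_group(...) has already been called
model = ModelWithLongParam().cuda(0)
# passing more than one device id per process makes DDP replicate the module
ddp_model = DDP(model, device_ids=[0, 1])  # raises RuntimeError during __init__
```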

On a ROCm build of PyTorch, I get the following stack trace (although the issue is not ROCm-specific):

Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "test_distributed.py", line 2097, in _run
    getattr(self, self.id().split(".")[2])()
  File "test_distributed.py", line 2023, in wrapper
    fn(self)
  File "test_distributed.py", line 117, in wrapper
    return func(*args, **kwargs)
  File "test_distributed.py", line 133, in wrapper
    return func(*args, **kwargs)
  File "test_distributed.py", line 1849, in test_DistributedDataParallel
    self._test_DistributedDataParallel(gpu_subset=gpus, rank=rank)
  File "test_distributed.py", line 1784, in _test_DistributedDataParallel
    model_DDP, device_ids=gpu_subset
  File "/root/.local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 305, in __init__
    self._ddp_init_helper()
  File "/root/.local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 323, in _ddp_init_helper
    self._module_copies = replicate(self.module, self.device_ids, detach=True)
  File "/root/.local/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 147, in replicate
    setattr(replica, key, Parameter(param))
  File "/root/.local/lib/python3.6/site-packages/torch/nn/parameter.py", line 26, in __new__
    return torch.Tensor._make_subclass(cls, data, requires_grad)
RuntimeError: Only Tensors of floating point dtype can require gradients
FAIL
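
Based on the stack trace, the failure comes from replicate() re-wrapping each replicated tensor with Parameter(param), which defaults to requires_grad=True, and that default is rejected for non-floating-point dtypes. The behavior can be reproduced in isolation, independent of DDP:

```python
import torch
from torch.nn import Parameter

# Constructing a non-floating-point Parameter is fine when the flag is explicit
p = Parameter(torch.zeros(2, dtype=torch.long), requires_grad=False)

# Re-wrapping is also fine when the original flag is propagated
Parameter(p, requires_grad=p.requires_grad)

# But the default of requires_grad=True fails for non-floating-point dtypes
Parameter(p)  # RuntimeError: Only Tensors of floating point dtype can require gradients
```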

Expected behavior

Test should pass.
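
One possible direction for a fix (a sketch only, not necessarily how this should be resolved upstream) would be for replicate() to propagate the source parameter's requires_grad flag instead of relying on the Parameter default:

```python
from torch.nn import Parameter

def wrap_replica_param(replica_tensor, original_param):
    # Hypothetical helper (not an existing PyTorch API): wrap a replicated
    # tensor as a Parameter while propagating the original parameter's
    # requires_grad flag, instead of relying on Parameter's default of True.
    return Parameter(replica_tensor, requires_grad=original_param.requires_grad)
```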

Environment

Collecting environment information...
PyTorch version: 1.4.0a0+b8f50d9
Is debug build: No
CUDA used to build PyTorch: Could not collect

OS: Ubuntu 16.04.5 LTS
GCC version: Could not collect
CMake version: version 3.6.3

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
ROCm version: 2.10

Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] torch==1.4.0a0+b8f50d9
[pip3] torchvision==0.4.2
[conda] Could not collect

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar

Labels: oncall: distributed, triaged
