
If a module passed to DistributedDataParallel has no parameters requiring gradients, expect_sparse_gradient[0] in the _ddp_init_helper function raises an error. #25550

@hypercost

Description

🐛 Bug

If a module passed to DistributedDataParallel has no parameters requiring gradients, expect_sparse_gradient[0] in the _ddp_init_helper function raises an error.

All parameters of self.criterionPerceptron are frozen beforehand via for param in self.criterionPerceptron.parameters(): param.requires_grad = False. Wrapping it in DistributedDataParallel then fails:

    self.criterionPerceptron = nn.parallel.DistributedDataParallel(self.criterionPerceptron, device_ids=[opt.local_rank], output_device=opt.local_rank)
  File "/opt/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 300, in __init__
    self._ddp_init_helper()
  File "/opt/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 368, in _ddp_init_helper
    expect_sparse_gradient[0])
RuntimeError: tensors.size() > 0 INTERNAL ASSERT FAILED at pytorch/torch/csrc/distributed/c10d/reducer.cpp:672, please report a bug to PyTorch.  (compute_bucket_assignment_by_size at pytorch/torch/csrc/distributed/c10d/reducer.cpp:672)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7f52bbd5877a in /opt/miniconda2/lib/python2.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::compute_bucket_assignment_by_size(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<bool, std::allocator<bool> > const&) + 0x951 (0x7f52d4e3f8a1 in /opt/miniconda2/lib/python2.7/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x6c2cb1 (0x7f52d4e2fcb1 in /opt/miniconda2/lib/python2.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c2f0e (0x7f52d4e2ff0e in /opt/miniconda2/lib/python2.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x1d24f0 (0x7f52d493f4f0 in /opt/miniconda2/lib/python2.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: __libc_start_main + 0xf0 (0x7f52e2b66830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: <unknown function> + 0x107f (0x55e911d9b07f in /opt/miniconda2/bin/python)
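
The assert fires because no parameter requires a gradient, so the tensor list handed to compute_bucket_assignment_by_size is empty. As a workaround sketch (an assumption on my part, not an official fix), a fully frozen module can simply be left unwrapped, since DDP only exists to all-reduce gradients:

    # Workaround sketch (assumption): only wrap modules that actually have
    # trainable parameters; a fully frozen module produces no gradients
    # for DDP to synchronize.
    if any(p.requires_grad for p in self.criterionPerceptron.parameters()):
        self.criterionPerceptron = nn.parallel.DistributedDataParallel(
            self.criterionPerceptron,
            device_ids=[opt.local_rank],
            output_device=opt.local_rank)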

To Reproduce

Steps to reproduce the behavior (a minimal sketch follows the list):

  1. Create a module whose parameters do not require gradients.
  2. Pass it to DistributedDataParallel.
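
A minimal, self-contained sketch of the repro (the single-process gloo group and the Linear module are assumptions; any module with every parameter frozen should hit the same assert):

    import os
    import torch.nn as nn
    import torch.distributed as dist

    # Single-process process group just so DDP can initialize
    # (assumption: gloo backend, CPU module).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    module = nn.Linear(10, 10)
    for param in module.parameters():
        param.requires_grad = False  # freeze everything

    # Fails in _ddp_init_helper with:
    # RuntimeError: tensors.size() > 0 INTERNAL ASSERT FAILED
    ddp = nn.parallel.DistributedDataParallel(module)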

Environment

  • PyTorch Version: 1.2+

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera


    Labels

    oncall: distributed (add this issue/PR to the distributed oncall triage queue)
    triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
