Closed
Labels: oncall: distributed, triaged
Description
🐛 Bug
If the module passed to DistributedDataParallel has no parameters that require gradients, the bucket-assignment call that uses `expect_sparse_gradient[0]` in `_ddp_init_helper` raises an internal assert error.
In this case, all parameters of `self.criterionPerceptron` are frozen beforehand with `for param in self.criterionPerceptron.parameters(): param.requires_grad = False`, and the module is then wrapped as follows:
self.criterionPerceptron = nn.parallel.DistributedDataParallel(self.criterionPerceptron, device_ids=[opt.local_rank], output_device=opt.local_rank)
File "/opt/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 300, in __init__
self._ddp_init_helper()
File "/opt/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 368, in _ddp_init_helper
expect_sparse_gradient[0])
RuntimeError: tensors.size() > 0 INTERNAL ASSERT FAILED at pytorch/torch/csrc/distributed/c10d/reducer.cpp:672, please report a bug to PyTorch. (compute_bucket_assignment_by_size at pytorch/torch/csrc/distributed/c10d/reducer.cpp:672)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7f52bbd5877a in /opt/miniconda2/lib/python2.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::compute_bucket_assignment_by_size(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<bool, std::allocator<bool> > const&) + 0x951 (0x7f52d4e3f8a1 in /opt/miniconda2/lib/python2.7/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x6c2cb1 (0x7f52d4e2fcb1 in /opt/miniconda2/lib/python2.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c2f0e (0x7f52d4e2ff0e in /opt/miniconda2/lib/python2.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x1d24f0 (0x7f52d493f4f0 in /opt/miniconda2/lib/python2.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: __libc_start_main + 0xf0 (0x7f52e2b66830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: <unknown function> + 0x107f (0x55e911d9b07f in /opt/miniconda2/bin/python)
To Reproduce
Steps to reproduce the behavior:
- create a module none of whose parameters require gradients (e.g. set `requires_grad = False` on all of them)
- pass it to DistributedDataParallel; initialization fails with the internal assert above (see the sketch after this list)
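A minimal sketch of a reproduction, assuming a single-process gloo process group; the `nn.Linear` module and the address/port values are illustrative, not from the original report:

```python
import os
import torch.distributed as dist
import torch.nn as nn

# Single-process process group so DistributedDataParallel can initialize.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# A module whose parameters are all frozen.
module = nn.Linear(10, 1)
for param in module.parameters():
    param.requires_grad = False

# DDP only tracks parameters with requires_grad=True, so its internal
# parameter list is empty and compute_bucket_assignment_by_size asserts
# with "tensors.size() > 0 INTERNAL ASSERT FAILED" during __init__.
ddp_module = nn.parallel.DistributedDataParallel(module)
```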
Environment
- PyTorch Version: 1.2+
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera