Make DistributedDataParallel usable with CPU models #20236
Conversation
Differential Revision: D15245428 Differential Version: 81320054
torch/nn/parallel/distributed.py (outdated diff)
                          self.broadcast_bucket_size)
# Synchronize buffers across processes.
# The process with rank 0 is considered the authoritative copy.
self._broadcast_coalesced(self.modules_buffers[0],
This is where it might get confusing: we have two broadcast_coalesced functions, but one is distributed and the other is local. Is this intentional?
They come from different modules. For clarity, we can proxy them with functions on the class: one for the local broadcast and one for the distributed broadcast.
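A minimal sketch of that proxy idea, assuming the c10d binding from #20234 is exposed as dist._broadcast_coalesced; the method names and attributes here are illustrative, not necessarily what the PR merged:

```python
import torch.distributed as dist
from torch.cuda import comm
from torch.nn import Module

class DistributedDataParallel(Module):
    # Sketch only: method names and attributes (process_group,
    # device_ids) are illustrative, not the merged implementation.

    def _dist_broadcast_coalesced(self, tensors, buffer_size):
        # Cross-process broadcast via c10d; rank 0 holds the
        # authoritative copy.
        dist._broadcast_coalesced(self.process_group, tensors, buffer_size)

    def _local_broadcast_coalesced(self, tensors, buffer_size):
        # Intra-process broadcast across this process's GPUs.
        return comm.broadcast_coalesced(tensors, self.device_ids, buffer_size)
```

Giving each wrapper a distinct name on the class makes the call sites unambiguous even though both underlying functions are called broadcast_coalesced.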
Differential Revision: D15245428 Differential Version: 81326665
Differential Revision: D15245428 Differential Version: 81461820
@pytorchbot retest this please
This pull request has been merged in 558c6c4.
Fixes:
```
/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/cuda/UnaryFractionKernels.cu(125): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead
/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/cuda/UnaryFractionKernels.cu(125): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead
/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu(73): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead
/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu(73): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead
```
cc @ngimel
Pull Request resolved: #87712
Approved by: https://github.com/soumith
Stack:
● #20236 Make DistributedDataParallel usable with CPU models 💚 (this PR)
○ #20235 Refactor core DistributedDataParallel tests 💚
○ #20234 Add c10d::broadcast_coalesced and tests 💚
Use the new version of broadcast_coalesced, which handles both CPU and CUDA models. Add tests that evaluate the correctness of DistributedDataParallel for CPU models.
Closes #17757.
Differential Revision: D15245428
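As a usage illustration (not part of the PR diff), here is a minimal sketch of what this change enables: wrapping a plain CPU model in DistributedDataParallel over the gloo backend, with no device_ids so the CPU path is taken. The address, port, and model are arbitrary choices for the example.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

def run(rank, world_size):
    # Arbitrary rendezvous settings for a single-machine example.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(10, 10)                    # plain CPU model
    ddp_model = DistributedDataParallel(model)   # no device_ids: CPU path

    loss = ddp_model(torch.randn(20, 10)).sum()
    loss.backward()                              # grads allreduced over gloo
    dist.destroy_process_group()

if __name__ == "__main__":
    torch.multiprocessing.spawn(run, args=(2,), nprocs=2)
```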