Conversation

@pietern pietern commented May 7, 2019

Stack:
    ● #20236 Make DistributedDataParallel usable with CPU models (this PR, CI green)
    ○ #20235 Refactor core DistributedDataParallel tests (CI green)
    ○ #20234 Add c10d::broadcast_coalesced and tests (CI green)

Use the new version of broadcast_coalesced, which handles both CPU
and CUDA models. Add tests that evaluate the correctness of
DistributedDataParallel for CPU models.
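
The coalescing idea behind broadcast_coalesced can be illustrated with a small sketch. This is plain Python with lists standing in for tensors, not the real c10d implementation; the helper names, the element-count bucket limit, and the flatten/unflatten round trip are illustrative assumptions. The point is the bucketing: tensors are packed into buckets of bounded size so each bucket needs only a single broadcast.

```python
# Illustrative sketch of the bucketing used by a coalesced broadcast:
# tensors are greedily packed into buckets of at most `bucket_size`
# elements, each bucket is flattened into one buffer (broadcast once),
# then split back into the original shapes. Plain lists stand in for
# tensors; this is NOT the actual c10d implementation.

def coalesce_into_buckets(tensors, bucket_size):
    """Greedily pack tensors (lists of floats) into size-limited buckets."""
    buckets, current, current_numel = [], [], 0
    for t in tensors:
        if current and current_numel + len(t) > bucket_size:
            buckets.append(current)
            current, current_numel = [], 0
        current.append(t)
        current_numel += len(t)
    if current:
        buckets.append(current)
    return buckets

def flatten(bucket):
    """Concatenate a bucket into one flat buffer for a single broadcast."""
    return [x for t in bucket for x in t]

def unflatten(flat, bucket):
    """Split the flat buffer back into the original tensor lengths."""
    out, offset = [], 0
    for t in bucket:
        out.append(flat[offset:offset + len(t)])
        offset += len(t)
    return out

tensors = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0], [7.0]]
buckets = coalesce_into_buckets(tensors, bucket_size=3)
# Each flattened bucket would be broadcast once, then unflattened;
# the round trip must reproduce the original tensors.
roundtrip = [t for b in buckets for t in unflatten(flatten(b), b)]
assert roundtrip == tensors
```

Note that the greedy packer never splits a tensor: a tensor larger than `bucket_size` simply gets a bucket of its own.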

Closes #17757.

Differential Revision: D15245428
Differential Version: 81320054
@pietern pietern requested review from apaszke and mrshenli as code owners May 7, 2019 19:25
@pytorchbot added labels oncall: distributed (add this issue/PR to distributed oncall triage queue) and module: nn (related to torch.nn) May 7, 2019
```python
# Synchronize buffers across processes.
# The process with rank 0 is considered the authoritative copy.
self._broadcast_coalesced(self.modules_buffers[0],
                          self.broadcast_bucket_size)
```
Review comment from a contributor:
This is where it might get confusing: we have two `broadcast_coalesced` functions, but one is distributed and the other is local. Is this intentional?

Reply from the author:
They come from different modules. For clarity, we can proxy them through functions on the class: one for local broadcast and the other for distributed broadcast.
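
The proxying idea can be sketched as follows. The class and method names here are hypothetical, and the injected callables merely stand in for the two real helpers (the local one that replicates across devices within a process, and the distributed c10d one that synchronizes across processes); signatures are illustrative, not the actual DistributedDataParallel API.

```python
# Sketch of disambiguating the two broadcast_coalesced helpers by
# proxying each through a distinctly named method on the class. The
# injected callables stand in for the local (cross-device) and
# distributed (cross-process) broadcast helpers; names and signatures
# are illustrative assumptions.

class Replicator:
    def __init__(self, local_broadcast, distributed_broadcast, bucket_size):
        self._local_broadcast = local_broadcast
        self._distributed_broadcast = distributed_broadcast
        self.broadcast_bucket_size = bucket_size

    def _local_broadcast_coalesced(self, tensors, devices):
        # Local: replicate tensors across devices within one process.
        return self._local_broadcast(tensors, devices)

    def _distributed_broadcast_coalesced(self, tensors):
        # Distributed: synchronize tensors across processes (rank 0 is
        # the authoritative copy), bucketed by broadcast_bucket_size.
        return self._distributed_broadcast(tensors, self.broadcast_bucket_size)

# Usage with stand-in callables that record how they were invoked:
calls = []
r = Replicator(
    local_broadcast=lambda ts, devs: calls.append(("local", len(ts), devs)),
    distributed_broadcast=lambda ts, bs: calls.append(("dist", len(ts), bs)),
    bucket_size=10 * 1024 * 1024,
)
r._local_broadcast_coalesced([[1.0], [2.0]], devices=[0, 1])
r._distributed_broadcast_coalesced([[1.0], [2.0]])
assert calls == [("local", 2, [0, 1]), ("dist", 2, 10 * 1024 * 1024)]
```

With each call site going through a named proxy, a reader can tell at a glance which of the two broadcasts is meant.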

Differential Revision: D15245428 (updated Differential Versions: 81326665, 81461820)
mrshenli commented May 9, 2019

@pytorchbot retest this please

@facebook-github-bot
This pull request has been merged in 558c6c4.

@pietern pietern deleted the export-D15245428 branch May 9, 2019 22:37
facebook-github-bot pushed a commit that referenced this pull request May 10, 2019
Summary:
Pull Request resolved: #20351

This was broken because of a merge race between #20282 and the stack in #20236.

Cleaned up the test and comments a bit as well.

Differential Revision: D15292786

fbshipit-source-id: a4379ea700cad959d3a6921fc5ddf9384fb8f228
pytorchmergebot pushed a commit that referenced this pull request Oct 27, 2022
Fixes:
```
/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/cuda/UnaryFractionKernels.cu(125): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead

/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/cuda/UnaryFractionKernels.cu(125): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead

/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu(73): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead

/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu(73): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead
```

cc @ngimel
Pull Request resolved: #87712
Approved by: https://github.com/soumith
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Nov 5, 2022 (same commit message as above)
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022 (same commit message as above)