Make DistributedDataParallel usable with CPU models #20236
Conversation
Differential Revision: D15245428 Differential Version: 81320054
torch/nn/parallel/distributed.py (outdated diff)
                          self.broadcast_bucket_size)
# Synchronize buffers across processes.
# The process with rank 0 is considered the authoritative copy.
self._broadcast_coalesced(self.modules_buffers[0],
This is where it might get confusing: we have two broadcast_coalesced functions, but one is distributed and the other is local. Is this intentional?
They come from different modules. For clarity, we can proxy them with functions on the class: one for the local broadcast and one for the distributed broadcast.
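A minimal sketch of that proxy idea, assuming the c10d binding from #20234 is exposed as dist._broadcast_coalesced; the method names and attributes here are illustrative, not necessarily what the PR merged:

```python
import torch.distributed as dist
from torch.cuda import comm
from torch.nn import Module

class DistributedDataParallel(Module):
    # Sketch only: method names and attributes (process_group,
    # device_ids) are illustrative, not the merged implementation.

    def _dist_broadcast_coalesced(self, tensors, buffer_size):
        # Cross-process broadcast via c10d; rank 0 holds the
        # authoritative copy.
        dist._broadcast_coalesced(self.process_group, tensors, buffer_size)

    def _local_broadcast_coalesced(self, tensors, buffer_size):
        # Intra-process broadcast across this process's GPUs.
        return comm.broadcast_coalesced(tensors, self.device_ids, buffer_size)
```

Giving each wrapper a distinct name on the class makes the call sites unambiguous even though both underlying functions are called broadcast_coalesced.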
Differential Revision: D15245428 Differential Version: 81326665
Differential Revision: D15245428 Differential Version: 81461820
@pytorchbot retest this please
This pull request has been merged in 558c6c4.
Fixes:
```
/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/cuda/UnaryFractionKernels.cu(125): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead
/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/cuda/UnaryFractionKernels.cu(125): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead
/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu(73): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead
/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu(73): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead
```
cc @ngimel
Pull Request resolved: #87712
Approved by: https://github.com/soumith
Stack:
● #20236 Make DistributedDataParallel usable with CPU models 💚 (this PR)
○ #20235 Refactor core DistributedDataParallel tests 💚
○ #20234 Add c10d::broadcast_coalesced and tests 💚
Use the new version of broadcast_coalesced, which handles both CPU and CUDA models. Add tests that evaluate the correctness of DistributedDataParallel for CPU models.
Closes #17757.
Differential Revision: D15245428
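As a usage illustration (not part of the PR diff), here is a minimal sketch of what this change enables: wrapping a plain CPU model in DistributedDataParallel over the gloo backend, with no device_ids so the CPU path is taken. The address, port, and model are arbitrary choices for the example.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

def run(rank, world_size):
    # Arbitrary rendezvous settings for a single-machine example.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(10, 10)                    # plain CPU model
    ddp_model = DistributedDataParallel(model)   # no device_ids: CPU path

    loss = ddp_model(torch.randn(20, 10)).sum()
    loss.backward()                              # grads allreduced over gloo
    dist.destroy_process_group()

if __name__ == "__main__":
    torch.multiprocessing.spawn(run, args=(2,), nprocs=2)
```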