Add sparse tensor allreduce #22036

pietern · 2019-06-20T19:37:16Z

Stack:
:white_circle: #22037 Support sparse gradients in DistributedDataParallel 💛
:black_circle: #22036 Add sparse tensor allreduce 💛

Implemented only on ProcessGroupGloo, as an allgather of metadata
(sparse_dim, dense_dim, and nnz), followed by an allgather of indices,
followed by an allgather of values. Once these operations have
finished, all ranks locally compute a reduction over these sparse
tensors. Works for both CPU and CUDA tensors.
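
For illustration only, the three allgather phases and the local reduction described above can be simulated in a single process with plain Python. This is a sketch under stated assumptions, not the ProcessGroupGloo code: sparse tensors are modeled as lists of index tuples and values, `allgather` and `sparse_allreduce_sim` are hypothetical helper names, only `nnz` is exchanged as metadata, and the reduction is assumed to be a sum.

```python
from collections import defaultdict


def allgather(per_rank_items):
    """Simulated allgather: every rank ends up with every rank's contribution."""
    return [list(per_rank_items) for _ in per_rank_items]


def sparse_allreduce_sim(per_rank_tensors):
    """Simulate a sparse allreduce over toy sparse tensors.

    Each entry of per_rank_tensors is a dict with 'indices' (a list of
    index tuples) and 'values' (a list of numbers of the same length).
    Returns one reduced {index: value} dict per rank.
    """
    # Phase 1: allgather of metadata (only nnz in this toy model).
    nnz = allgather([len(t["indices"]) for t in per_rank_tensors])
    # Phase 2: allgather of indices.
    indices = allgather([t["indices"] for t in per_rank_tensors])
    # Phase 3: allgather of values.
    values = allgather([t["values"] for t in per_rank_tensors])

    results = []
    for rank in range(len(per_rank_tensors)):
        # Each rank reduces locally over everything it gathered.
        # Assumption: the reduction is a sum over matching indices.
        acc = defaultdict(float)
        for n, idx, val in zip(nnz[rank], indices[rank], values[rank]):
            assert n == len(idx) == len(val)
            for i, v in zip(idx, val):
                acc[i] += v
        results.append(dict(sorted(acc.items())))
    return results


rank0 = {"indices": [(0, 1), (2, 0)], "values": [1.0, 2.0]}
rank1 = {"indices": [(0, 1), (1, 1)], "values": [10.0, 5.0]}
out = sparse_allreduce_sim([rank0, rank1])
```

Both simulated ranks end up with the same reduced tensor, and it has three nonzero entries even though each input had only two.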

This surfaced a problem with the existing assumption of only modifying
tensors that are passed at the call site, because for sparse tensors
we don't know the dimensions of the output tensors before we run the
collective. To deal with this unknown, this commit adds a result
function to the c10d::ProcessGroup::Work class that returns a vector
of tensors.
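
A toy model of why a result accessor is needed is sketched below. The `Work` class here is a hypothetical stand-in written for this example, not c10d::ProcessGroup::Work, and its behavior (for instance raising when the result is requested before `wait()`) is an assumption of the sketch. Sparse tensors are modeled as `{index: value}` dicts and the reduction is assumed to be a sum.

```python
class Work:
    """Toy stand-in for a work handle; not the real c10d class."""

    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._result = None

    def wait(self):
        # Run the (simulated) collective once and cache its output.
        if not self._done:
            self._result = self._compute()
            self._done = True
        return True

    def result(self):
        # Assumption of this sketch: results are only available after wait().
        if not self._done:
            raise RuntimeError("result() called before wait()")
        return self._result


def toy_sparse_allreduce(inputs):
    """Return a Work whose result is one reduced sparse tensor per input."""

    def compute():
        merged = {}
        for tensor in inputs:
            for idx, val in tensor.items():
                merged[idx] = merged.get(idx, 0.0) + val
        # The inputs are left untouched; fresh outputs are returned instead.
        return [dict(merged) for _ in inputs]

    return Work(compute)


inputs = [{(0,): 1.0, (3,): 2.0}, {(3,): 4.0, (7,): 8.0}]
work = toy_sparse_allreduce(inputs)
work.wait()
outputs = work.result()
```

The outputs hold three nonzero entries while each input holds two, so the caller could not have preallocated correctly sized output tensors before the collective ran.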

It's a bit odd to have to retrieve the result through this function
only for operations on sparse tensors. To make this work irrespective
of tensor layout, a follow-up commit can make all in-place operations
expose their results through this function as well. That doesn't
break any existing contracts, but it could make the interface
ambiguous, since the same result would then be reachable in two ways.

This is a resubmission of #19146.

Differential Revision: D15926384
Differential Version: 85311082

pietern · 2019-06-21T08:45:33Z

@pytorchbot retest this please

The Windows failures are likely unrelated and to be fixed by #22029.

pietern · 2019-06-24T06:44:04Z

@pytorchbot retest this please

facebook-github-bot · 2019-06-24T16:04:13Z

This pull request has been merged in a7ec889.

V1: Initial commit

785af4b

Differential Revision: D15926384 Differential Version: 85311082

pietern requested review from apaszke and mrshenli as code owners June 20, 2019 19:37

pytorchbot added the oncall: distributed label Jun 20, 2019

pietern mentioned this pull request Jun 20, 2019

Support sparse gradients in DistributedDataParallel #22037

Closed

pietern added the triaged label Jun 21, 2019

mrshenli approved these changes Jun 24, 2019

facebook-github-bot closed this in a7ec889 Jun 24, 2019

facebook-github-bot added the merged label Jun 24, 2019

ezyang deleted the export-D15926384 branch July 19, 2019 15:54

mruberry added the Merged label Oct 28, 2020
