DDP communication hook: skip dividing grads by world_size if hook registered. #42400

sinannasir · 2020-08-01T02:55:54Z

Stack from ghstack:

#42400 DDP communication hook: skip dividing grads by world_size if hook registered.

@mcarilli spotted that in the original DDP communication hook design described in #39272, the hooks receive grads that are already predivided by world size.

It makes sense to skip the division entirely if a hook is registered. The hook is meant to let the user completely override DDP communication. For example, if the user wants to implement something like GossipGrad, always dividing by world_size would not be a good idea.

We also included a warning in the register_comm_hook API:

GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce.
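To make the consequence of this warning concrete, here is a minimal pure-Python sketch of the behavior. It is a toy model, not the actual reducer: the names `allreduce_sum`, `reduce_bucket`, `averaging_hook`, and `sum_only_hook` are hypothetical, and each rank's gradients are modeled as a plain list of floats.

```python
from typing import Callable, List, Optional

Grads = List[float]
Hook = Callable[[List[Grads]], List[Grads]]


def allreduce_sum(per_rank: List[Grads]) -> List[Grads]:
    """Toy allreduce: every rank ends up with the element-wise sum."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_bucket(per_rank: List[Grads], comm_hook: Optional[Hook] = None) -> List[Grads]:
    """Hypothetical stand-in for the reducer's bucket handling."""
    world_size = len(per_rank)
    if comm_hook is None:
        # Default path: predivide by world_size, then allreduce (sum).
        predivided = [[g / world_size for g in grads] for grads in per_rank]
        return allreduce_sum(predivided)
    # Hook path: the hook receives raw grads and owns any scaling.
    return comm_hook(per_rank)


def averaging_hook(per_rank: List[Grads]) -> List[Grads]:
    """A hook that reproduces the default result by dividing explicitly."""
    world_size = len(per_rank)
    summed = allreduce_sum(per_rank)
    return [[g / world_size for g in grads] for grads in summed]


def sum_only_hook(per_rank: List[Grads]) -> List[Grads]:
    """A hook that forgets to divide: results are world_size times too large."""
    return allreduce_sum(per_rank)


grads = [[2.0, 4.0], [6.0, 8.0]]  # two ranks, two gradient elements each
default_result = reduce_bucket(grads)
averaged_result = reduce_bucket(grads, averaging_hook)
summed_result = reduce_bucket(grads, sum_only_hook)
```

In this sketch `averaged_result` matches `default_result`, while `summed_result` is larger by a factor of the world size. That is the case the warning is about: a hook that performs allreduce has to do the division itself.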

Update: We discovered and fixed a bug in the sparse tensor case. See the new unit test test_ddp_comm_hook_sparse_gradients and the changes in reducer.cpp.

Differential Revision: D22883905

dr-ci · 2020-08-01T03:20:38Z

💊 CI failures summary and remediations

As of commit d8dba7e (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

pritamdamania87

Looks good overall, requesting changes since I just realized that we don't test the communication hook with sparse tensors.

torch/csrc/distributed/c10d/reducer.cpp

torch/nn/parallel/distributed.py

pritamdamania87 · 2020-08-03T20:02:41Z

torch/csrc/distributed/c10d/reducer.cpp

+    if (comm_hook_ == nullptr) {
+      replica.contents.div_(process_group_->getSize());
+    }


We should probably test the communication hook with sparse tensors as well. Using nn.EmbeddingBag with sparse=True will generate sparse gradients for you.

@pritamdamania87 Thanks for this comment! It helped us discover and fix a bug in the sparse tensor case.
I think we should just copy in the case of sparse gradients, since bucket_view is not used. Please see test_ddp_comm_hook_sparse_gradients and the changes in reducer.cpp that make it work.
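The reviewer's suggestion can be checked in isolation, without DDP. The sketch below only shows that nn.EmbeddingBag with sparse=True produces a sparse gradient for its weight. It is not the unit test added in this PR, and the sizes and indices are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

# An embedding table with 10 rows of dimension 3; sparse=True asks autograd
# to produce a sparse gradient for the weight.
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode="sum", sparse=True)

# Two bags: indices[0:4] and indices[4:8].
indices = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 4])

out = bag(indices, offsets)  # shape: (2, 3)
out.sum().backward()

grad = bag.weight.grad
print(grad.is_sparse)  # the weight gradient is a sparse tensor
```

A DDP test for the communication hook could wrap a module like this so that the reducer sees sparse gradients.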

facebook-github-bot · 2020-08-10T22:18:42Z

This pull request has been merged in 752f433.

sinannasir requested review from apaszke, mrshenli, pietern, pritamdamania87 and zhaojuanmao as code owners August 1, 2020 02:55

sinannasir mentioned this pull request Aug 1, 2020: [NCCL] DDP communication hook: getFuture() without cudaStreamAddCallback #42335 (Closed)

sinannasir requested a review from rohan-varma August 1, 2020 02:59

pritamdamania87 suggested changes Aug 3, 2020

sinannasir requested a review from pritamdamania87 August 5, 2020 16:29

pritamdamania87 approved these changes Aug 5, 2020

sinannasir mentioned this pull request Aug 7, 2020: [NCCL] [For Test] In DDP's reducer merge work and future_work #41840 (Closed)

facebook-github-bot closed this in 752f433 Aug 10, 2020

facebook-github-bot added the merged label Aug 10, 2020

sinannasir mentioned this pull request Aug 11, 2020: [NCCL] Changed FutureNCCL's then callback logic for better efficiency. #42869 (Closed)

mruberry added the Merged label Oct 28, 2020
