
Conversation


@sinannasir sinannasir commented Aug 1, 2020

Stack from ghstack:

@mcarilli spotted that in the original DDP communication hook design described in #39272, the hooks receive gradients that have already been predivided by the world size.

It makes sense to skip the divide entirely when a hook is registered: the hook is meant to let the user completely override DDP communication. For example, if the user wants to implement something like GossipGrad, always dividing by world_size would not be a good idea.

We also added a warning to the register_comm_hook API:

> GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce.

**Update:** We discovered and fixed a bug in the sparse tensor case. See the new unit test `test_ddp_comm_hook_sparse_gradients` and the changes in `reducer.cpp`.
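To make the behavior change concrete, here is a minimal, PyTorch-free sketch of the reducer's new predivide decision. All names here (`reduce_bucket`, `averaging_hook`, `gossip_like_hook`) are illustrative only and not part of the actual API:

```python
def reduce_bucket(per_worker_grads, comm_hook=None):
    """Toy model of the reducer: sum gradients across workers, then
    predivide by world_size only when no communication hook is set."""
    world_size = len(per_worker_grads)
    summed = [sum(vals) for vals in zip(*per_worker_grads)]
    if comm_hook is None:
        # Default DDP path: the reducer predivides by world_size.
        return [v / world_size for v in summed]
    # Hook path: hand over the *raw* sum; dividing is now the hook's job.
    return comm_hook(summed, world_size)

def averaging_hook(summed, world_size):
    # A hook that mimics the default allreduce must divide itself.
    return [v / world_size for v in summed]

def gossip_like_hook(summed, world_size):
    # A GossipGrad-style hook may not want an average at all;
    # skipping the predivide makes this possible.
    return summed
```

With per-worker grads `[[2.0, 4.0], [4.0, 8.0]]`, the default path and the averaging hook both yield `[3.0, 6.0]`, while the gossip-style hook keeps the raw sum `[6.0, 12.0]`, which is exactly the flexibility that skipping the predivide enables.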

Differential Revision: D22883905

sinannasir added a commit that referenced this pull request Aug 1, 2020
…istered.

ghstack-source-id: 109007166
Pull Request resolved: #42400
@sinannasir sinannasir requested a review from rohan-varma August 1, 2020 02:59
dr-ci bot commented Aug 1, 2020

💊 CI failures summary and remediations

As of commit d8dba7e (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@pritamdamania87 pritamdamania87 (Contributor) left a comment

Looks good overall, requesting changes since I just realized that we don't test the communication hook with sparse tensors.

Comment on lines +395 to +397

```cpp
if (comm_hook_ == nullptr) {
  replica.contents.div_(process_group_->getSize());
}
```
pritamdamania87 (Contributor) commented:

We should probably test the communication hook with sparse tensors as well. Using nn.EmbeddingBag with sparse=True will generate sparse gradients for you.

sinannasir (Contributor, Author) replied:
@pritamdamania87 Thanks for this comment! It helped us discover and fix a bug in the sparse tensor case.
I think we should just copy the gradient in the sparse case, since `bucket_view` is not used. Please see `test_ddp_comm_hook_sparse_gradients` and the changes in `reducer.cpp` that make it work.
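To illustrate why the sparse path copies instead of using a bucket view, here is a hedged, PyTorch-free sketch (the names `fill_bucket` and `stash_sparse_grad` are hypothetical, for illustration only). Dense gradients are flattened into one contiguous bucket and read back through (offset, length) views; a sparse gradient, stored as index/value pairs, has no fixed slot in that flat buffer, so an explicit copy is kept instead:

```python
def fill_bucket(dense_grads):
    """Dense path: flatten all gradients into one contiguous bucket.
    Each gradient is later accessed through an (offset, length) view."""
    bucket, views = [], []
    for grad in dense_grads:
        views.append((len(bucket), len(grad)))
        bucket.extend(grad)
    return bucket, views

def stash_sparse_grad(indices, values):
    """Sparse path: an index/value gradient has no fixed dense layout,
    so there is no bucket view to write into; keep an explicit copy
    for the communication (hook or allreduce) to consume instead."""
    return list(indices), list(values)
```

Under these assumptions, two dense grads `[1.0, 2.0]` and `[3.0]` share one bucket `[1.0, 2.0, 3.0]` with views `(0, 2)` and `(2, 1)`, while a sparse grad is simply copied as-is.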

sinannasir added a commit that referenced this pull request Aug 5, 2020
…istered.

Pull Request resolved: #42400
ghstack-source-id: 109248106
sinannasir added a commit that referenced this pull request Aug 5, 2020
…istered.

Pull Request resolved: #42400
ghstack-source-id: 109291981
sinannasir added a commit that referenced this pull request Aug 6, 2020
…istered.

Pull Request resolved: #42400
ghstack-source-id: 109397244
sinannasir added a commit that referenced this pull request Aug 10, 2020
…istered.

Pull Request resolved: #42400
ghstack-source-id: 109548696
@facebook-github-bot (Contributor): This pull request has been merged in 752f433.

MauiDesign pushed a commit to MauiDesign/PyTorchPyTorch that referenced this pull request Aug 16, 2020
…istered.

Pull Request resolved: pytorch/pytorch#42400
ghstack-source-id: 109556429