-
Notifications
You must be signed in to change notification settings - Fork 26.3k
DDP communication hook: skip dividing grads by world_size if hook registered. #42400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…istered. @mcarilli spotted that in the original DDP communication hook design described in [39272](#39272), the hooks receive grads that are already predivided by world size. It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea. We also included a warning in the register_comm_hook API as: > GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce. Differential Revision: [D22883905](https://our.internmc.facebook.com/intern/diff/D22883905/) [ghstack-poisoned]
…istered. @mcarilli spotted that in the original DDP communication hook design described in [39272](#39272), the hooks receive grads that are already predivided by world size. It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea. We also included a warning in the register_comm_hook API as: > GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce. Differential Revision: [D22883905](https://our.internmc.facebook.com/intern/diff/D22883905/) ghstack-source-id: 109007166 Pull Request resolved: #42400
💊 CI failures summary and remediationsAs of commit d8dba7e (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group. This comment has been revised 17 times. |
pritamdamania87
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, requesting changes since I just realized that we don't test the communication hook with sparse tensors.
| if (comm_hook_ == nullptr) { | ||
| replica.contents.div_(process_group_->getSize()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably test the communication hook with sparse tensors as well. Using nn.EmbeddingBag with sparse=True will generate sparse gradients for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pritamdamania87 Thanks for this comment! It was quite useful to discover and fix a bug with sparse tensors case.
I think we should just copy in case of sparse gradients, since bucket_view is not used. Please see test_ddp_comm_hook_sparse_gradients and changes in reducer.cpp to make it work.
…if hook registered." @mcarilli spotted that in the original DDP communication hook design described in [39272](#39272), the hooks receive grads that are already predivided by world size. It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea. We also included a warning in the register_comm_hook API as: > GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce. Differential Revision: [D22883905](https://our.internmc.facebook.com/intern/diff/D22883905/) [ghstack-poisoned]
…istered. Pull Request resolved: #42400 @mcarilli spotted that in the original DDP communication hook design described in [39272](#39272), the hooks receive grads that are already predivided by world size. It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea. We also included a warning in the register_comm_hook API as: > GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce. ghstack-source-id: 109248106 Differential Revision: [D22883905](https://our.internmc.facebook.com/intern/diff/D22883905/)
…if hook registered." @mcarilli spotted that in the original DDP communication hook design described in [39272](#39272), the hooks receive grads that are already predivided by world size. It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea. We also included a warning in the register_comm_hook API as: > GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce. **Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`. Differential Revision: [D22883905](https://our.internmc.facebook.com/intern/diff/D22883905/) [ghstack-poisoned]
…istered. Pull Request resolved: #42400 @mcarilli spotted that in the original DDP communication hook design described in [39272](#39272), the hooks receive grads that are already predivided by world size. It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea. We also included a warning in the register_comm_hook API as: > GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce. ghstack-source-id: 109291981 **Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`. Differential Revision: [D22883905](https://our.internmc.facebook.com/intern/diff/D22883905/)
…if hook registered." @mcarilli spotted that in the original DDP communication hook design described in [39272](#39272), the hooks receive grads that are already predivided by world size. It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea. We also included a warning in the register_comm_hook API as: > GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce. **Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`. Differential Revision: [D22883905](https://our.internmc.facebook.com/intern/diff/D22883905/) [ghstack-poisoned]
…istered. Pull Request resolved: #42400 @mcarilli spotted that in the original DDP communication hook design described in [39272](#39272), the hooks receive grads that are already predivided by world size. It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea. We also included a warning in the register_comm_hook API as: > GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce. ghstack-source-id: 109397244 **Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`. Differential Revision: [D22883905](https://our.internmc.facebook.com/intern/diff/D22883905/)
…if hook registered." @mcarilli spotted that in the original DDP communication hook design described in [39272](#39272), the hooks receive grads that are already predivided by world size. It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea. We also included a warning in the register_comm_hook API as: > GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce. **Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`. Differential Revision: [D22883905](https://our.internmc.facebook.com/intern/diff/D22883905/) [ghstack-poisoned]
…istered. Pull Request resolved: #42400 @mcarilli spotted that in the original DDP communication hook design described in [39272](#39272), the hooks receive grads that are already predivided by world size. It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea. We also included a warning in the register_comm_hook API as: > GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce. ghstack-source-id: 109548696 **Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`. Differential Revision: [D22883905](https://our.internmc.facebook.com/intern/diff/D22883905/)
|
This pull request has been merged in 752f433. |
…istered. Pull Request resolved: pytorch/pytorch#42400 @mcarilli spotted that in the original DDP communication hook design described in [39272](pytorch/pytorch#39272), the hooks receive grads that are already predivided by world size. It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea. We also included a warning in the register_comm_hook API as: > GradBucket bucket's tensors will not be predivided by world_size. User is responsible to divide by the world_size in case of operations like allreduce. ghstack-source-id: 109556429 **Update:** We discovered and fixed a bug with the sparse tensors case. See new unit test called `test_ddp_comm_hook_sparse_gradients` and changes in `reducer.cpp`. Differential Revision: [D22883905](https://our.internmc.facebook.com/intern/diff/D22883905/)
Stack from ghstack:
@mcarilli spotted that in the original DDP communication hook design described in 39272, the hooks receive grads that are already predivided by world size.
It makes sense to skip the divide completely if hook registered. The hook is meant for the user to completely override DDP communication. For example, if the user would like to implement something like GossipGrad, always dividing by the world_size would not be a good idea.
We also included a warning in the register_comm_hook API as:
Update: We discovered and fixed a bug with the sparse tensors case. See new unit test called
test_ddp_comm_hook_sparse_gradientsand changes inreducer.cpp.Differential Revision: D22883905