Moving python allgather_coalesced impl from Py to C #28857
Conversation
This pull request was exported from Phabricator. Differential Revision: D18209289
Force-pushed 753f5b0 to 0482133 (Compare)
This pull request was exported from Phabricator. Differential Revision: D18209289
Force-pushed 0482133 to e5c9943 (Compare)
This pull request was exported from Phabricator. Differential Revision: D18209289
Force-pushed e5c9943 to 05c47f2 (Compare)
This pull request was exported from Phabricator. Differential Revision: D18209289
Force-pushed 05c47f2 to 02a8faa (Compare)
This pull request was exported from Phabricator. Differential Revision: D18209289
pietern left a comment
In torch/csrc/distributed/c10d/comm.cpp, we have implemented broadcast_coalesced as a function that takes a process group and a vector of tensors. I think we should do the same for allgather_coalesced, since it can be built on top of a regular allgather, regardless of the process group implementation. In the current approach, we'd need to duplicate the functionality between the process groups.
@pietern just to make sure we understand your comment. Are you suggesting moving the allgather_coalesced implementation in this PR to comm.cpp?
Yes, I think it should live as a separate function. I just checked it out and see there is indeed the ...
Force-pushed 02a8faa to 618c817 (Compare)
This pull request was exported from Phabricator. Differential Revision: D18209289
Agreed. As discussed offline, let's do it in a followup PR?
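For readers following the thread: below is a minimal sketch (not part of this PR, and not the API that eventually landed in comm.cpp) of how an allgather_coalesced free function could be layered on top of a plain ProcessGroup::allgather, in the spirit of broadcast_coalesced. The function name, header paths, and the flatten/unflatten scheme are assumptions for illustration only.

```cpp
// Illustrative sketch only: coalesce the inputs into one buffer, issue a
// single allgather through the generic ProcessGroup interface, then split
// each rank's buffer back into the original shapes.
#include <c10d/ProcessGroup.hpp>
#include <ATen/ATen.h>
#include <vector>

std::vector<std::vector<at::Tensor>> allgather_coalesced_sketch(
    c10d::ProcessGroup& pg,
    const std::vector<at::Tensor>& inputs) {
  // Flatten every input into one contiguous 1-D buffer so that only one
  // collective is issued, regardless of how many tensors were passed in.
  std::vector<at::Tensor> flat;
  flat.reserve(inputs.size());
  for (const auto& t : inputs) {
    flat.push_back(t.reshape({-1}));
  }
  std::vector<at::Tensor> input_buffer{at::cat(flat)};

  // ProcessGroup::allgather expects one output slot per rank for each input.
  std::vector<std::vector<at::Tensor>> output_buffers(1);
  for (int rank = 0; rank < pg.getSize(); ++rank) {
    output_buffers[0].push_back(at::empty_like(input_buffer[0]));
  }

  // A single allgather on the coalesced buffer; works with any backend.
  pg.allgather(output_buffers, input_buffer)->wait();

  // Unflatten: carve each rank's buffer back into the original tensor shapes.
  std::vector<std::vector<at::Tensor>> result(pg.getSize());
  for (int rank = 0; rank < pg.getSize(); ++rank) {
    int64_t offset = 0;
    for (const auto& t : inputs) {
      result[rank].push_back(
          output_buffers[0][rank].narrow(0, offset, t.numel()).view(t.sizes()));
      offset += t.numel();
    }
  }
  return result;
}
```

Coalescing into one buffer means the backends only need a plain allgather, which is the reason for keeping this logic out of the individual process group implementations, as pietern suggests above.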
Force-pushed 618c817 to 6a310ab (Compare)
This pull request was exported from Phabricator. Differential Revision: D18209289
mrshenli left a comment
LGTM. I added some cosmetic comments. Based on the offline discussion, we will first land this PR. Then, @zappa- will work on a followup PR to move this implementation to comm.cpp.
Thanks a lot for working on this! Please get another stamp from @pietern before landing.
torch/lib/c10d/ProcessGroupGloo.cpp (Outdated)
any reason for not using assertNonEmpty?
Oh I see, you are trying to keep consistent with ProcessGroupGloo::allgather.
torch/lib/c10d/ProcessGroupGloo.cpp (Outdated)
These checks are already done in ProcessGroupGloo::allgather_coalesced, any reason for doing it again here?
This is due to the linter complaining about possible out-of-bounds access on element 0 of these vectors.
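To illustrate the point being made (this is not the actual diff), the kind of local guard a linter typically wants before indexing element 0 looks roughly like the following; the function name and error message are hypothetical.

```cpp
// Hypothetical example: even if the caller (here, allgather_coalesced) has
// already validated its arguments, a local emptiness check silences linter
// warnings about a possible out-of-bounds access on element 0.
#include <ATen/ATen.h>
#include <stdexcept>
#include <vector>

void useFirstTensor(const std::vector<at::Tensor>& tensors) {
  if (tensors.empty()) {
    throw std::invalid_argument("requires a non-empty tensor list");
  }
  // Safe: the explicit check above proves tensors[0] exists.
  const auto& first = tensors[0];
  (void)first; // placeholder for the real work on the first tensor
}
```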
Force-pushed 6a310ab to 15b2fc8 (Compare)
This pull request was exported from Phabricator. Differential Revision: D18209289
pietern left a comment
Let's merge this.
@zappa- Can you create issues for the follow-up work: one to move this functionality into comm.cpp, and one to move allreduce_coalesced into comm.cpp?
CircleCI build failures summary (as of commit 15b2fc8):
Here are the reasons each build failed. This comment was automatically generated by Dr. CI. Please report bugs/suggestions on the GitHub issue tracker. This comment has been revised 2 times.
Added issue #29033 to move allgather_coalesced to comm.h. Unfortunately I cannot assign it since I don't have write permissions to the repo.
Thank you for the reviews! Shipping it...
Force-pushed 15b2fc8 to ac3b71d (Compare)
This pull request was exported from Phabricator. Differential Revision: D18209289
Force-pushed ac3b71d to 1091cb5 (Compare)
This pull request was exported from Phabricator. Differential Revision: D18209289
1 similar comment
This pull request was exported from Phabricator. Differential Revision: D18209289
Force-pushed 1091cb5 to 70f7c0f (Compare)
Summary: Pull Request resolved: pytorch#28857
Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo
Reviewed By: mrshenli
Differential Revision: D18209289
fbshipit-source-id: f5f394635e60a9d4ce016e366415d90f1daa4423
Force-pushed 70f7c0f to 733fb07 (Compare)
This pull request was exported from Phabricator. Differential Revision: D18209289
This breaks CUDA builds.
This pull request has been merged in 22a346e.
I'm getting a similar error to @ezyang: https://circleci.com/api/v1.1/project/github/pytorch/pytorch/3443618/output/106/0?file=true. Is this going to be fixed/reverted?
Sorry, spoke too soon. It has been reverted by 9041e29.
Summary: Pull Request resolved: pytorch#29059
This is a resubmit of reverted diff D18209289 (PR pytorch#28857).
Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo
Reviewed By: pietern
Differential Revision: D18277097
fbshipit-source-id: 3e16c4c5f71e5c051ffef280e021bd253caf127c
Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo
Differential Revision: D18209289