Move allgather_coalesced implementation from Python to C++ #29059
Conversation
This pull request was exported from Phabricator. Differential Revision: D18277097

#29059 caused a broken build due to an unimplemented function in the MPI backend. Fixed here.
Force-pushed from 36b882e to efd623b
This pull request was exported from Phabricator. Differential Revision: D18277097
Force-pushed from efd623b to 0542a6c
This pull request was exported from Phabricator. Differential Revision: D18277097
Force-pushed from 0542a6c to b21acc4
pietern left a comment:
I should have checked CI before approving #28857.
It's all green now, so it should be good to go.
Summary:
Pull Request resolved: pytorch#29059
This is a resubmit of reverted diff D18209289 (PR pytorch#28857).

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: pietern

Differential Revision: D18277097

fbshipit-source-id: 3e16c4c5f71e5c051ffef280e021bd253caf127c
Force-pushed from b21acc4 to 557c40b
This pull request was exported from Phabricator. Differential Revision: D18277097
This pull request has been merged in 23695ab.
store = c10d.FileStore(self.file_name, self.world_size)
pg = c10d.ProcessGroupGloo(store, self.rank, self.world_size, self.opts())
dummy_input = [torch.Tensor([1])]
dummy_input = [torch.zeros([1], dtype=torch.float32)]
sorry -- why are these not translated exactly?
torch.Tensor([1]) is torch.ones([1]), not zeros, right?
also same with the line below, why did that change from -1 to 0?
This only tests error handling, so the underlying values here should not be important (all_gather_coalesced never copies anything in this function). I am happy to change it back if you prefer.
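For context, a minimal standalone sketch of the constructor semantics under discussion (not part of the test file itself): torch.Tensor([1]) treats the list as data, so its value matches torch.ones(1) rather than torch.zeros.

import torch

a = torch.Tensor([1])                      # 1-element float32 tensor holding 1.0
b = torch.ones(1)                          # also [1.]
c = torch.zeros([1], dtype=torch.float32)  # [0.]

print(torch.equal(a, b))  # True
print(torch.equal(a, c))  # False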
inline void assertSameDevice(
    std::function<void(const std::string&)> fn,
    const at::ArrayRef<at::Tensor>& tensors) {
don't we have a TensorList for this? (Also I wouldn't expect const reference to it, it's trivial to copy).
Ah, good point. I was trying to be consistent with the other functions in this module, which mostly use a const reference to ArrayRef rather than TensorList (TensorList = ArrayRef<Tensor>).
Actually, I just need to verify tensors held in a vector, so I might just accept a const ref to a vector.
Would you prefer that?
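As a reference point, here is a hedged sketch (assumed names, not the helper in this PR) of how such a check could take at::TensorList by value. at::TensorList is just an alias for at::ArrayRef<at::Tensor>, a small non-owning view, and a std::vector<at::Tensor> converts to it implicitly, so either calling convention would work.

#include <stdexcept>
#include <ATen/ATen.h>

// Sketch only: verify that all tensors live on the same device.
// at::TensorList (= at::ArrayRef<at::Tensor>) is cheap to copy,
// so it is conventionally passed by value rather than by const reference.
inline void assertSameDeviceSketch(at::TensorList tensors) {
  if (tensors.empty()) {
    return;
  }
  const auto device = tensors[0].device();
  for (const auto& t : tensors) {
    if (t.device() != device) {
      throw std::invalid_argument("tensors are expected to be on the same device");
    }
  }
}

// A std::vector<at::Tensor> converts implicitly:
//   std::vector<at::Tensor> v = ...;
//   assertSameDeviceSketch(v);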
Summary:
Pull Request resolved: #29059
Resubmit of reverted PR #28857.
Differential Revision: D18277097