[c10d] Mixed precision DDP hang fix and fine-grained option for DDP perf #13496
Conversation
pietern left a comment
Can you put the fine-grained version in a different PR, for when you actually use it somewhere? I don't see how it is different, by the way. It would be good to add a comment describing it.
Nice find on the std::unordered_map problem!
@pietern I think it's OK to keep the other function here. Please see the comments I added for each function in the header for the differences. I believe that, theoretically, the function I added can make fp16 DDP faster.
@pietern It's not just because of …
@pietern Refactored and added comments.
@pietern Completed: test added.
facebook-github-bot left a comment
@teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
pietern left a comment
LGTM
Two additional comments: 1) there's a TODO on using it in DDP (it defaults to false for now), and 2) the tests can be deduplicated between the two backends. I like the pattern of having a def _test_dist_broadcast_coalesced(self, args) and calling it however many times.
Looks like CircleCI didn't trigger for this PR...
facebook-github-bot left a comment
@teng-li is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
When switching to mixed precision (fp16) training, DDP randomly hangs. Initially I thought this smelled like a similar NCCL bug I filed a while ago, but it turns out it's not. Once again, I was seeing different rank processes end up with different sizes. How could this even happen?
It turns out that take_tensors generates the list of bucketed tensors in a non-deterministic order, because the key to the map is a pointer, and pointer values differ from process to process. An interesting bug to dig into and fix.
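For illustration, here is a minimal, self-contained sketch of the failure mode; the Type struct and bucket layout are stand-ins, not the actual tensor_flatten.cpp code:

```cpp
// Sketch (not the actual PyTorch code) of why pointer-keyed buckets are
// non-deterministic across processes: hashing a pointer depends on its
// address, and ASLR gives every process different addresses, so two ranks
// can visit the same buckets in different orders.
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Type { const char* name; };  // stand-in for a tensor type object

int main() {
  Type float_type{"Float"}, half_type{"Half"};

  // Bucket tensors (here, just their sizes) by a pointer to their type.
  std::unordered_map<Type*, std::vector<int>> buckets;
  buckets[&float_type] = {4, 8};
  buckets[&half_type] = {2, 2};

  // Iteration order follows the pointer hash, i.e. the addresses of
  // float_type and half_type, which vary from process to process.
  // Rank 0 may flatten {Float, Half} while rank 1 flattens {Half, Float},
  // so the coalesced allreduce sees mismatched sizes and hangs.
  for (const auto& kv : buckets) {
    std::printf("bucket %s\n", kv.first->name);
  }
}
```

The fix is to make the traversal deterministic, for example by recording keys in insertion order and iterating over that record, so that every rank emits its buckets in the same sequence.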
fp16 DDP training should now be fully working.
Also added another fine-grained take_tensors helper that aims to improve DDP performance, with a TODO to replace the take_tensors call in DDP with it.
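As a rough illustration of the fine-grained idea, here is a hedged sketch, assuming the variant applies the size limit across all tensor types together rather than per type; the function name, FakeTensor struct, and exact flushing policy are illustrative assumptions, not the PR's actual implementation:

```cpp
// Hedged sketch of fine-grained bucketing: count accumulated bytes across
// *all* types and flush every open per-type bucket once the combined size
// reaches the limit, instead of waiting for each type's bucket to fill.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct FakeTensor { std::string type; size_t bytes; };
using Group = std::vector<FakeTensor>;

std::vector<Group> take_tensors_fine_grained(
    const std::vector<FakeTensor>& tensors, size_t size_limit) {
  std::vector<Group> result;
  std::map<std::string, Group> open;  // ordered keys: deterministic order
  size_t total = 0;                   // bytes accumulated across all types
  for (const auto& t : tensors) {
    open[t.type].push_back(t);
    total += t.bytes;
    if (total >= size_limit) {
      // Flush every open per-type bucket so communication can start early.
      for (auto& kv : open) result.push_back(std::move(kv.second));
      open.clear();
      total = 0;
    }
  }
  for (auto& kv : open) {
    if (!kv.second.empty()) result.push_back(std::move(kv.second));
  }
  return result;
}

int main() {
  // With size_limit = 2, tensors a and b together trigger a flush into
  // per-type groups {a} and {b}; tensor c then forms its own group.
  std::vector<FakeTensor> ts = {{"fp16", 1}, {"fp32", 1}, {"fp16", 1}};
  for (const auto& g : take_tensors_fine_grained(ts, 2)) {
    std::printf("group:");
    for (const auto& t : g) std::printf(" %s(%zu)", t.type.c_str(), t.bytes);
    std::printf("\n");
  }
}
```

The potential win for mixed precision training is that interleaved fp16 and fp32 tensors no longer each have to fill a per-type bucket before any communication starts, so allreduces can be launched earlier and overlap more with the backward pass.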
Fixes #12150