
Conversation

Contributor

@teng-li teng-li commented Nov 2, 2018

When switching to mixed-precision fp16 training, DDP randomly hangs. Initially, I thought this smelled like a similar NCCL bug I filed a while ago, but it turns out it's not. Once again, I was seeing different rank processes end up with different bucket sizes. How could this even happen?

It turns out that take_tensors generates the list of tensor buckets in a non-deterministic order, because the key of the map it builds is a pointer. An interesting bug to dig into and fix.
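To illustrate the failure mode, here is a minimal C++ sketch (hypothetical names, not the actual take_tensors code): std::unordered_map iteration order depends on the hashed key, and when the key is a pointer to a per-process object, its value, and therefore the resulting bucket order, can differ from rank to rank.

```cpp
// Minimal sketch (hypothetical TypeTag type, not PyTorch's code) of why
// keying a std::unordered_map by a pointer makes bucket order
// non-deterministic: iteration order follows the hashed key, and pointer
// values differ from process to process.
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

struct TypeTag { std::string name; };  // stand-in for a per-dtype type object

int main() {
  // Each process allocates its own type objects, so these pointer values
  // (the map keys) are process-specific.
  auto* half = new TypeTag{"Half"};
  auto* fl = new TypeTag{"Float"};

  std::unordered_map<TypeTag*, std::vector<int>> buckets;
  buckets[half] = {0, 1};  // "fp16 gradients"
  buckets[fl] = {2, 3};    // "fp32 gradients"

  // One rank may visit Half before Float while another visits Float first;
  // collectives issued in this per-rank order then mismatch and the job hangs.
  for (const auto& kv : buckets) {
    std::printf("bucket for %s\n", kv.first->name.c_str());
  }

  delete half;
  delete fl;
  return 0;
}
```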

With this fix, fp16 DDP training should be fully working.

Also added another fine-grained take_tensors helper that aims to improve DDP performance; replacing the take_tensors call in DDP with it is left as a TODO.

Fixed: #12150

Contributor

@pietern pietern left a comment


Can you put the finegrained version in a different PR when you use it somewhere?

I don't see how it is different btw. Would be good to add a comment describing it.

Nice find on the std::unordered_map problem!


Contributor Author

teng-li commented Nov 2, 2018

@pietern I think it's OK to keep the other function here. Please see the comments I added for each function in the header for the differences. I believe that, in theory, the function I added can make fp16 DDP faster.

Contributor Author

teng-li commented Nov 2, 2018

@pietern It's not just the unordered_map; the problem is using a pointer as the key to that map (the key is the tensor type: fp16, fp32, etc.). No wonder the hang was random.
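As a sketch of the general remedy (illustrative only, not necessarily the exact change in this PR): keying the buckets by a value with a process-independent ordering, for example a scalar-type enum in an ordered std::map, makes every rank emit its buckets in the same order.

```cpp
// Sketch only: bucket by a stable, process-independent key instead of a
// pointer. std::map iterates in key order, so every rank sees Half, then
// Float, then Double, and the per-bucket collectives line up across ranks.
#include <cstdio>
#include <map>
#include <vector>

enum class ScalarType : int { Half = 0, Float = 1, Double = 2 };

int main() {
  std::map<ScalarType, std::vector<int>> buckets;  // ordered by enum value
  buckets[ScalarType::Float] = {2, 3};
  buckets[ScalarType::Half] = {0, 1};

  // Iteration order is the same on every process: Half, then Float.
  for (const auto& kv : buckets) {
    std::printf("bucket for scalar type %d\n", static_cast<int>(kv.first));
  }
  return 0;
}
```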


@teng-li teng-li force-pushed the ddp_mixed_precision branch 2 times, most recently from 74aeae4 to 5c97156, on November 2, 2018 at 20:16
Contributor Author

teng-li commented Nov 2, 2018

@pietern Refactored and added comments

@teng-li teng-li force-pushed the ddp_mixed_precision branch from 5c97156 to 11b6ca1 on November 2, 2018 at 21:58
@teng-li teng-li changed the title from "[c10d] Mixed precision DDP hang fix and perf improvement helper function" to "[c10d] Mixed precision DDP hang fix and fine-grained option for DDP perf" on Nov 3, 2018
Contributor Author

teng-li commented Nov 3, 2018

@pietern The complete test has been added.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@pietern pietern left a comment


LGTM

Two additional comments: 1) there's a TODO on using it in DDP (defaults to false now), and 2) the tests can be deduplicated between the two backends. I like the pattern of having a def _test_dist_broadcast_coalesced(self, args) and calling it however many times.

Contributor

pietern commented Nov 5, 2018

Looks like CircleCI didn't trigger for this PR...

@teng-li teng-li closed this Nov 5, 2018
@teng-li teng-li reopened this Nov 5, 2018
Contributor

@facebook-github-bot facebook-github-bot left a comment


@teng-li is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
