[c10d] Mixed precision DDP hang fix and fine-grained option for DDP perf #13496
Conversation
pietern left a comment
Can you put the fine-grained version in a different PR, for when you actually use it somewhere? I don't see how it is different, by the way. It would be good to add a comment describing it.
Nice find on the std::unordered_map problem!
@pietern I think it's OK to keep the other function here. Please see the comments I added for each function in the header for the differences. I believe that, theoretically, the function I added can make fp16 DDP faster.
@pietern It's not just because of …
@pietern Refactored and added comments.
@pietern Completed: test added.
facebook-github-bot left a comment
@teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
pietern left a comment
LGTM
Two additional comments: 1) there's a TODO on using it in DDP (it defaults to false for now), and 2) the tests can be deduplicated between the two backends. I like the pattern of having a def _test_dist_broadcast_coalesced(self, args) and calling it however many times.
Looks like CircleCI didn't trigger for this PR...
facebook-github-bot left a comment
@teng-li is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
When switching to mixed precision (fp16) training, DDP randomly hangs. Initially I thought this smelled like a similar NCCL bug I filed a while ago, but it turns out it's not. Once again, I was seeing different rank processes end up with different sizes. How could this even happen?
It turns out that take_tensors generates the list of bucketed tensors in a non-deterministic order, because the key to the map is a pointer, and pointer values differ from process to process. An interesting bug to dig into and fix.
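For illustration, here is a minimal, self-contained sketch of the failure mode; the Type struct and bucket layout are stand-ins, not the actual tensor_flatten.cpp code:

```cpp
// Sketch (not the actual PyTorch code) of why pointer-keyed buckets are
// non-deterministic across processes: hashing a pointer depends on its
// address, and ASLR gives every process different addresses, so two ranks
// can visit the same buckets in different orders.
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Type { const char* name; };  // stand-in for a tensor type object

int main() {
  Type float_type{"Float"}, half_type{"Half"};

  // Bucket tensors (here, just their sizes) by a pointer to their type.
  std::unordered_map<Type*, std::vector<int>> buckets;
  buckets[&float_type] = {4, 8};
  buckets[&half_type] = {2, 2};

  // Iteration order follows the pointer hash, i.e. the addresses of
  // float_type and half_type, which vary from process to process.
  // Rank 0 may flatten {Float, Half} while rank 1 flattens {Half, Float},
  // so the coalesced allreduce sees mismatched sizes and hangs.
  for (const auto& kv : buckets) {
    std::printf("bucket %s\n", kv.first->name);
  }
}
```

The fix is to make the traversal deterministic, for example by recording keys in insertion order and iterating over that record, so that every rank emits its buckets in the same sequence.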
fp16 DDP training should now be fully working.
Also added another fine-grained take_tensors helper that aims to improve DDP performance, with a TODO to replace the take_tensors call in DDP with it.
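As a rough illustration of the fine-grained idea, here is a hedged sketch, assuming the variant applies the size limit across all tensor types together rather than per type; the function name, FakeTensor struct, and exact flushing policy are illustrative assumptions, not the PR's actual implementation:

```cpp
// Hedged sketch of fine-grained bucketing: count accumulated bytes across
// *all* types and flush every open per-type bucket once the combined size
// reaches the limit, instead of waiting for each type's bucket to fill.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct FakeTensor { std::string type; size_t bytes; };
using Group = std::vector<FakeTensor>;

std::vector<Group> take_tensors_fine_grained(
    const std::vector<FakeTensor>& tensors, size_t size_limit) {
  std::vector<Group> result;
  std::map<std::string, Group> open;  // ordered keys: deterministic order
  size_t total = 0;                   // bytes accumulated across all types
  for (const auto& t : tensors) {
    open[t.type].push_back(t);
    total += t.bytes;
    if (total >= size_limit) {
      // Flush every open per-type bucket so communication can start early.
      for (auto& kv : open) result.push_back(std::move(kv.second));
      open.clear();
      total = 0;
    }
  }
  for (auto& kv : open) {
    if (!kv.second.empty()) result.push_back(std::move(kv.second));
  }
  return result;
}

int main() {
  // With size_limit = 2, tensors a and b together trigger a flush into
  // per-type groups {a} and {b}; tensor c then forms its own group.
  std::vector<FakeTensor> ts = {{"fp16", 1}, {"fp32", 1}, {"fp16", 1}};
  for (const auto& g : take_tensors_fine_grained(ts, 2)) {
    std::printf("group:");
    for (const auto& t : g) std::printf(" %s(%zu)", t.type.c_str(), t.bytes);
    std::printf("\n");
  }
}
```

The potential win for mixed precision training is that interleaved fp16 and fp32 tensors no longer each have to fill a per-type bucket before any communication starts, so allreduces can be launched earlier and overlap more with the backward pass.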
Fixes #12150