Normalize gradients before reduction in DistributedDataParallelC10d #11109
Conversation
apaszke left a comment
Normalization should happen on the coalesced buffers instead of individual parameters
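For context, here is a minimal sketch of what normalizing on the coalesced buffer (rather than on each parameter individually) could look like. `all_reduce_coalesced` is a hypothetical helper written for illustration, not the actual DistributedDataParallelC10d code; it assumes the default SUM all-reduce of `torch.distributed`.

```python
# Hypothetical illustration only -- not the DDP implementation from this PR.
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def all_reduce_coalesced(grads, world_size):
    # Coalesce the per-parameter gradients into one contiguous buffer.
    flat = _flatten_dense_tensors(grads)
    # Normalize the single coalesced buffer before the reduction, instead of
    # dividing every individual parameter gradient.
    flat.div_(world_size)
    # The default all_reduce op is SUM, so the result is the average gradient.
    dist.all_reduce(flat)
    # Copy the averaged values back into the original gradient tensors.
    for grad, reduced in zip(grads, _unflatten_dense_tensors(flat, grads)):
        grad.copy_(reduced)
```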
facebook-github-bot left a comment
myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Would be super dope if you added a test for this, so that we don't regress on this in the future.
facebook-github-bot left a comment
myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot retest this please
@myleott agreeing with the above comment, it's super risky to do any DDP change right before our release.
I added the test yesterday :) But also this is a pretty trivial change and without it fp16 distributed training is much much worse, so I definitely think we should get it in before the release.
apaszke left a comment
This might be important for stability and has a test now, so I'd vote to merge it before the release.
facebook-github-bot left a comment
myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot left a comment
myleott is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Normalize gradients before reduction in DistributedDataParallelC10d (pytorch#11109)

Summary: Normalizing by the world size before the reduction is less likely to cause overflow in FP16 training.

Pull Request resolved: pytorch#11109
Differential Revision: D9594708
Pulled By: myleott
fbshipit-source-id: 93ab53cb782ee1cbe1264e529b333490a0940338
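As a hedged, self-contained illustration of the overflow argument (the numbers are made up, and the cross-rank SUM is simulated with a multiplication rather than a real all_reduce): dividing before the reduction keeps every FP16 intermediate small, whereas dividing afterwards lets the sum exceed the FP16 maximum (about 65504) and become inf.

```python
# Illustrative only; values are not taken from the PR.
import torch

world_size = 64
per_rank_grad = torch.full((4,), 2000.0, dtype=torch.float16)

# Post-division: sum first, then divide. The intermediate sum (128000)
# exceeds the FP16 maximum, so it overflows to inf before the division.
summed = per_rank_grad * world_size          # simulates the SUM across ranks
post_divided = summed / world_size           # inf / 64 is still inf

# Pre-division: divide first, then sum. Each addend is small (31.25), so the
# intermediate sum stays well within FP16 range.
pre_divided = (per_rank_grad / world_size) * world_size  # stays finite: 2000.0

print(post_divided)  # tensor([inf, inf, inf, inf], dtype=torch.float16)
print(pre_divided)   # tensor([2000., 2000., 2000., 2000.], dtype=torch.float16)
```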