Normalize gradients before reduction in DistributedDataParallelC10d #11109
Conversation
apaszke left a comment
Normalization should happen on the coalesced buffers instead of individual parameters
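For context, here is a minimal sketch of what normalizing on the coalesced buffer (rather than on each parameter individually) could look like. `all_reduce_coalesced` is a hypothetical helper written for illustration, not the actual DistributedDataParallelC10d code; it assumes the default SUM all-reduce of `torch.distributed`.

```python
# Hypothetical illustration only -- not the DDP implementation from this PR.
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def all_reduce_coalesced(grads, world_size):
    # Coalesce the per-parameter gradients into one contiguous buffer.
    flat = _flatten_dense_tensors(grads)
    # Normalize the single coalesced buffer before the reduction, instead of
    # dividing every individual parameter gradient.
    flat.div_(world_size)
    # The default all_reduce op is SUM, so the result is the average gradient.
    dist.all_reduce(flat)
    # Copy the averaged values back into the original gradient tensors.
    for grad, reduced in zip(grads, _unflatten_dense_tensors(flat, grads)):
        grad.copy_(reduced)
```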
facebook-github-bot left a comment
myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Would be super dope if you added a test for this, so that we don't regress on this in the future.
facebook-github-bot left a comment
myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot retest this please
@myleott agreeing with the above comment, it's super risky to do any DDP change right before our release.
I added the test yesterday :) But also this is a pretty trivial change and without it fp16 distributed training is much much worse, so I definitely think we should get it in before the release.
apaszke left a comment
This might be important for stability and has a test now, so I'd vote to merge it before the release.
facebook-github-bot left a comment
myleott has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot left a comment
myleott is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Normalize gradients before reduction in DistributedDataParallelC10d (pytorch#11109)

Summary: Normalizing by the world size before the reduction is less likely to cause overflow in FP16 training.

Pull Request resolved: pytorch#11109
Differential Revision: D9594708
Pulled By: myleott
fbshipit-source-id: 93ab53cb782ee1cbe1264e529b333490a0940338
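As a hedged, self-contained illustration of the overflow argument (the numbers are made up, and the cross-rank SUM is simulated with a multiplication rather than a real all_reduce): dividing before the reduction keeps every FP16 intermediate small, whereas dividing afterwards lets the sum exceed the FP16 maximum (about 65504) and become inf.

```python
# Illustrative only; values are not taken from the PR.
import torch

world_size = 64
per_rank_grad = torch.full((4,), 2000.0, dtype=torch.float16)

# Post-division: sum first, then divide. The intermediate sum (128000)
# exceeds the FP16 maximum, so it overflows to inf before the division.
summed = per_rank_grad * world_size          # simulates the SUM across ranks
post_divided = summed / world_size           # inf / 64 is still inf

# Pre-division: divide first, then sum. Each addend is small (31.25), so the
# intermediate sum stays well within FP16 range.
pre_divided = (per_rank_grad / world_size) * world_size  # stays finite: 2000.0

print(post_divided)  # tensor([inf, inf, inf, inf], dtype=torch.float16)
print(pre_divided)   # tensor([2000., 2000., 2000., 2000.], dtype=torch.float16)
```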