Back out "[pytorch][PR] Optimization of the Embedding and Embedding-Bag CUDA Kernel" #22377
Conversation
CC @madsbk Looks like #22016 sometimes leads to NaN or INF in embedding grads for FP16 training. @chenyangyu1988 Could you please share some context here?
It's possible that truncating the intermediate gradient accumulation to `float16` is the problem.
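A minimal sketch (editorial, not part of the thread) of why a `float16` accumulation buffer is fragile: partial sums for a frequently repeated index can overflow `float16`'s maximum finite value of 65504, while heavily down-scaled gradients underflow to zero.

```python
import torch

# Overflow: 100 partial sums of 1000.0 total 100000, which exceeds
# float16's max finite value (65504) and becomes INF.
partials = torch.full((100,), 1000.0, dtype=torch.float16)
print(partials.sum(dtype=torch.float16))  # tensor(inf, dtype=torch.float16)
print(partials.sum(dtype=torch.float32))  # tensor(100000.) -- fine in fp32

# Underflow: a tiny scaled gradient is below float16's smallest
# subnormal (~6e-8) and truncates to exactly zero.
tiny = torch.tensor(2.2e-16, dtype=torch.float32)
print(tiny.half())  # tensor(0., dtype=torch.float16)
```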
@ngimel, yes that could be the problem and is easily fixable. I will look at it tomorrow. @chenyangyu1988, do we have example code that triggers the error?
@madsbk I don't have an OSS example here, but if you have an FP16 training example, you could check the loss_scale after 1k iterations; it is correct if the loss_scale is much larger than 1. In our case, we still get NaN or INF in the embedding layer grad even when loss_scale is as small as 2.2e-16.
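For anyone trying to reproduce: a hypothetical check (the model and sizes are placeholders, and a CUDA device is assumed) that scans the embedding gradients for non-finite values after `backward()`:

```python
import torch
import torch.nn as nn

# Placeholder FP16 embedding layer standing in for char_embeddings.
emb = nn.Embedding(1000, 64).cuda().half()
ids = torch.randint(0, 1000, (32, 16), device="cuda")
emb(ids).float().sum().backward()

# Flag NaN/INF in the embedding grads after each iteration.
for name, p in emb.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print(f"non-finite grad in {name}")
```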
Do we have a target time for this? It is currently blocking our training. Or could we revert first?
Yes, it should be reverted; #22016 has other problems that need to be fixed first.
@mrshenli so let's revert it for now? |
Back out "[pytorch][PR] Optimization of the Embedding and Embedding-Bag CUDA Kernel" (pytorch#22377)

Summary:
Pull Request resolved: pytorch#22377
Original commit changeset: 398d5f48826a
D15944339 f123389994, P68573646
During FP16 training, we found char_embeddings.weight gets NaN or INF grads.
Differential Revision: D16067446
fbshipit-source-id: 9ad54f67f73576a2c663cde9c9ee00a9aa669879
Force-pushed from b0881a6 to da0a784.
@chenyangyu1988, can I get you to test #22401? |
Summary: Address the issue raised in #22377. The PR #22016 introduces a temporary tensor of weights `grad_weight_per_segment` of the same dtype as the end result, which can be a problem when using `float16`. In this PR, it now uses a `float32` temporary tensor when the input is `float16`. ngimel, can I get you to review? I think I have fixed the issues you have pointed out.
Pull Request resolved: #22401
Differential Revision: D16077319
Pulled By: mrshenli
fbshipit-source-id: 7cfad7f40b4d41a244052baa2982ab51bbbd7309
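To illustrate the idea behind the fix, here is a simplified sketch (plain PyTorch, not the actual CUDA kernel; the function name is made up): the scatter-accumulation buffer stays in `float32` even when the gradient is `float16`, and is cast back only at the end.

```python
import torch

def embedding_grad_weight(grad_output, indices, num_weights):
    """Hypothetical sketch of dense embedding grad accumulation."""
    # Accumulate in float32 when grads are float16, mirroring the
    # `grad_weight_per_segment` dtype change in #22401.
    acc_dtype = torch.float32 if grad_output.dtype == torch.float16 else grad_output.dtype
    dim = grad_output.size(-1)
    grad_weight = torch.zeros(num_weights, dim, dtype=acc_dtype,
                              device=grad_output.device)
    grad_weight.index_add_(0, indices.reshape(-1),
                           grad_output.reshape(-1, dim).to(acc_dtype))
    # Cast back to the parameter dtype only after all sums are done.
    return grad_weight.to(grad_output.dtype)
```

Summing in `float32` keeps the intermediate magnitudes representable; only the final per-row totals are rounded back to `float16`.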
Hi @chenyangyu1988! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid and will need to be resubmitted.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with `CLA signed`.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Closing as out-of-date; feel free to reopen if still relevant. |