Disable reduced precision reductions for fp16 GEMMs #67578
Conversation
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: It appears that most NVIDIA architectures do not perform reduced precision reductions, i.e., reducing in fp16 given fp16 inputs (at least, there have not been many reports of this issue), but this change attempts to ensure that a reduced precision reduction is never done. The included test case currently fails on Volta but passes on Pascal and Ampere; setting this flag causes the test to pass on all three.

CC @stas00 @ngimel @ptrblck

Pull Request resolved: #67578
Reviewed By: mruberry
Differential Revision: D32046030
Pulled By: ngimel
fbshipit-source-id: ac9aa8489ad6835f34bd0300c5d6f4ea76f333d1
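For context, here is a minimal sketch of the kind of check the included test performs; the shapes are assumptions (not the PR's actual test), chosen so a long inner dimension makes fp16 accumulation error visible:

```python
import torch

# Illustrative shapes (assumptions, not the PR's actual test case): a long
# inner/reduction dimension amplifies fp16 accumulation error.
a = torch.randn(256, 4096, device="cuda", dtype=torch.half)
b = torch.randn(4096, 256, device="cuda", dtype=torch.half)

out_half = a @ b                 # GEMM with fp16 inputs
ref = a.float() @ b.float()      # fp32 reference accumulation

# With fp32 accumulation (what this PR enforces), the max deviation stays
# small; an fp16 reduction produces a noticeably larger error.
print((out_half.float() - ref).abs().max())
```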
Why is commit information hidden from the public? E.g., if I click on https://github.com/pytorch/pytorch-canary/commit/4e6cedb340cdeff97af3baf72888dc4b20b639c7 it gives me a 404 not-found, which usually means it's a private repo. So how can a user see where the PR went? It's already super confusing that pytorch PRs get closed rather than merged (as far as the GitHub PR status goes, it's hard to quickly tell whether a PR was merged or not), but if we can't even see where they go, that's troublesome... Thank you!

update: I found it. e01279c
@stas00 you can also look at the HUD https://hud.pytorch.org/ci/pytorch/pytorch/master for the latest merged commits (although sometimes there's a half-hour or so delay), or at the pytorch repo history.
Doing some belated benchmarking on V100 and A100: [benchmark tables were attached as images and are not reproduced here]
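For readers who want to reproduce this kind of measurement, a hedged sketch using `torch.utils.benchmark`; the GEMM shape here is an assumption meant to resemble an attention projection, not the shapes used in the benchmarks above:

```python
import torch
from torch.utils import benchmark

# Hypothetical attention-like GEMM shape; an assumption for illustration,
# not the shapes benchmarked in this thread.
a = torch.randn(4096, 1024, device="cuda", dtype=torch.half)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.half)

timer = benchmark.Timer(stmt="a @ b", globals={"a": a, "b": b})
print(timer.timeit(100))  # prints a timing summary for the fp16 GEMM
```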
…67946)

Summary: #67578 disabled reduced precision reductions for FP16 GEMMs. After benchmarking, we've found that this has substantial performance impacts for common GEMM shapes (e.g., those found in popular instantiations of multiheaded attention) on architectures such as Volta. As these performance regressions may come as a surprise to current users, this PR adds a toggle, `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction`, rather than making the stricter behavior the default.

CC @ngimel @ptrblck @stas00

Note that the behavior after the previous PR can be replicated with `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False`

Pull Request resolved: #67946
Reviewed By: zou3519
Differential Revision: D32289896
Pulled By: ngimel
fbshipit-source-id: a1ea2918b77e27a7d9b391e030417802a0174abe
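A short usage sketch of the toggle (the flag name comes from this PR; the shapes are illustrative):

```python
import torch

# Default after #67946: allow reduced precision reductions in fp16 GEMMs
# (faster on architectures such as Volta).
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True

# Opt back into the stricter behavior from #67578: force full-precision
# accumulation for fp16 GEMMs.
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

a = torch.randn(128, 512, device="cuda", dtype=torch.half)
b = torch.randn(512, 128, device="cuda", dtype=torch.half)
out = a @ b  # with the flag off, cuBLAS accumulates this GEMM in fp32
```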