Add an option to disable reduced precision reductions for FP16 GEMM #67946
Conversation
💊 CI failures summary: As of commit b852f29 (more details on the Dr. CI page), 1 new failure was recognized by patterns; it does not appear to be due to an upstream breakage.
Some GEMM shapes benchmarked on V100:
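The benchmark numbers themselves are not reproduced in this thread. As a rough illustration only (not the original benchmark script; the shapes below are placeholders), one way to time a few fp16 GEMM shapes with the toggle on and off:

```python
# Hypothetical microbenchmark sketch (not the original script): times fp16 GEMMs of a
# few attention-style shapes with reduced precision reductions allowed vs. disallowed.
import torch
from torch.utils import benchmark

shapes = [(1024, 1024, 4096), (4096, 4096, 4096), (1024, 64, 16384)]  # (m, n, k), illustrative only

for allow in (True, False):
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = allow
    for m, n, k in shapes:
        a = torch.randn(m, k, device="cuda", dtype=torch.half)
        b = torch.randn(k, n, device="cuda", dtype=torch.half)
        t = benchmark.Timer(stmt="a @ b", globals={"a": a, "b": b})
        print(f"allow={allow} shape=({m},{n},{k}): {t.timeit(100).median * 1e6:.1f} us")
```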
Could we please document all these nuances? Perhaps by adding a new dedicated doc that speaks specifically to precision vs. performance control? Or by adding it to the performance doc? It could then include the TF32 enabling option as well, from one of the recent PRs. Thank you!
I found the most relevant doc for this change: https://github.com/pytorch/pytorch/blob/master/docs/source/notes/numerical_accuracy.rst. So maybe it should belong there, with an xref added from cuda.rst?
I agree with @stas00, it makes sense to move the main portion of the docs to numerical_accuracy, expand it to mention that most of the math for GEMMs is done in fp32 precision but, if reduced precision reduction is allowed, some intermediate results can be truncated to low precision, and cross-link it from cuda. Does this apply to bf16 also, btw? It's harder to establish because bf16 will only truncate the mantissa; there won't be glaring overflows there.
Since the original change was only for
docs/source/notes/cuda.rst (Outdated)
Reduced Precision Reduction in FP16 GEMMs
-----------------------------------------

fp16 GEMMs are potentially done with reduced precision reductions (e.g., in fp16 rather than fp32). This reduction in precision can allow for higher performance on certain workloads (particularly those with a large `k` dimension) and GPU architectures at the cost of numerical precision and potential for overflow.
Most of the GEMM accumulation is still done in fp32 precision; there are only a few truncations that are done. So can you please make the wording more accurate, so as not to imply that all the accumulation is done in fp16?
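To make that point concrete, here is a minimal sketch (not from the PR; whether any error difference is visible depends on which cuBLAS kernel is selected for the shape and GPU) comparing a large-`k` fp16 GEMM against a higher-precision reference with the toggle on and off:

```python
# Illustrative only: compares a large-k fp16 GEMM against an fp64 reference with
# reduced precision reductions allowed vs. disallowed. Any difference in error
# depends on the cuBLAS kernel chosen and how many fp16 truncations it performs.
import torch

m, n, k = 256, 256, 65536  # a large k makes the reduction precision matter most
a = torch.randn(m, k, device="cuda", dtype=torch.half)
b = torch.randn(k, n, device="cuda", dtype=torch.half)
ref = a.double() @ b.double()  # high-precision reference

for allow in (True, False):
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = allow
    out = a @ b
    err = (out.double() - ref).abs().max().item()
    print(f"allow_fp16_reduced_precision_reduction={allow}: max abs error {err:.4f}")
```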
fp16 GEMMs are potentially done with reduced precision reductions (e.g., in fp16 rather than fp32). This reduction in precision can allow for higher performance on certain workloads (particularly those with a large `k` dimension) and GPU architectures at the cost of numerical precision and potential for overflow.

Some example benchmark data on V100

.. code::
Fix this please
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…16 GEMM (#89172): Essentially the same change as #67946, except that the default is to disallow reduced precision reductions in `BFloat16` GEMMs (for now). If performance is severely regressed, we can change the default, but this option appears to be necessary to pass some `addmm` `BFloat16` tests on H100. CC @ptrblck @ngimel

Pull Request resolved: #89172
Approved by: https://github.com/ngimel
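For reference, a usage sketch of the `BFloat16` counterpart described above, assuming the flag mirrors the fp16 one and is named `allow_bf16_reduced_precision_reduction` (the name is not spelled out in this thread):

```python
import torch

# Sketch of the BFloat16 counterpart added in #89172; the flag name is assumed to
# mirror the fp16 toggle. Per the commit message, the default at that time was to
# disallow reduced precision reductions for bf16.
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

a = torch.randn(128, 8192, device="cuda", dtype=torch.bfloat16)
b = torch.randn(8192, 128, device="cuda", dtype=torch.bfloat16)
out = a @ b  # accumulation behavior is governed by the flag above
```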

#67578 disabled reduced precision reductions for FP16 GEMMs. After benchmarking, we've found that this has substantial performance impacts for common GEMM shapes (e.g., those found in popular instantiations of multiheaded attention) on architectures such as Volta. As these performance regressions may come as a surprise to current users, this PR adds a toggle to disable reduced precision reductions, `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction`, rather than making that the default behavior.
CC @ngimel @ptrblck
@stas00 Note that the behavior after the previous PR can be replicated with `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False`.
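A minimal usage sketch (not part of the PR itself), assuming the toggle defaults to `True` after this change:

```python
import torch

# Default after this PR: reduced precision reductions are allowed for fp16 GEMMs.
print(torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction)  # True

# Opt back into the strict behavior introduced by #67578 (reductions kept in fp32):
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

x = torch.randn(64, 4096, device="cuda", dtype=torch.half)
w = torch.randn(4096, 64, device="cuda", dtype=torch.half)
y = x @ w  # GEMM now performed without reduced precision intermediate truncations
```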